If you’re working on a software product (consumer-facing or SaaS), you’ve probably witnessed the recent explosion of AI-powered features. Now that AI has become an API, a lot is being built very quickly, and there’s an ambient sense of possibility in the space. (Seriously, try keeping up with ML Twitter!)
You may feel full of a similar optimism, or feel late to the party—if the latter, don’t worry! It’s still early days, and a lot is up for grabs. Though there’s been some consolidation in terms of how to build, what to build remains an open question.
Advances in prompt engineering and vector DBs aside, AI product development is still product development. Whether you’re a product manager, engineer, or designer, you will still need to answer the same fundamental questions.
In this post, we’ll describe:
1. Why LLM-powered, freeform-text interfaces break traditional product analytics
2. How unstructured user inputs can be converted into structured, analyzable metadata
3. How that enriched data plugs back into your existing analytics stack
Moving fast and flying blind don’t usually make for the most effective combo. Too often, companies develop feature after feature without thinking about where they’re going.
True product velocity is a combination of speed (how fast you ship) and direction (whether you’re shipping the right things).
The powerful but tricky part of LLM-integrated apps is twofold: the wide range of inputs they accept, and the nondeterministic, black-box nature of their outputs. This makes it harder to answer:
1. Observability: Is the LLM even doing what you expect it to do?
Think “Sentry for LLMs”. This might allow you to monitor token usage, response consistency, etc.
2. Product analytics: Assuming the former (the “correctness” of your LLM integration) – how is your product actually being used?
Think “Mixpanel or Amplitude for LLMs”. This sort of tooling allows you to tie specific user behaviors to the metrics/KPIs you care about (e.g. signups, conversion rate, ACV, etc).
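To make the distinction concrete, here is a minimal sketch of instrumenting a single LLM call for both concerns. Every helper below is a hypothetical stand-in (for your real LLM client, metrics backend, and analytics SDK), not an actual API:

```python
import time

def call_llm(prompt: str) -> dict:
    """Hypothetical stand-in for your real LLM client."""
    return {"text": "...", "total_tokens": 42}

def log_metric(name: str, value: float) -> None:
    """Hypothetical stand-in for your observability backend."""
    print(f"metric {name}={value}")

def track_event(user_id: str, event: str, properties: dict) -> None:
    """Hypothetical stand-in for your product analytics SDK."""
    print(f"event {event} {properties}")

def answer_query(user_id: str, query: str) -> str:
    start = time.time()
    response = call_llm(query)

    # 1. Observability: is the LLM call itself behaving as expected?
    log_metric("llm.latency_seconds", time.time() - start)
    log_metric("llm.total_tokens", response["total_tokens"])

    # 2. Product analytics: how is the feature actually being used?
    track_event(user_id, "llm_query_submitted", {"query_length": len(query)})

    return response["text"]
```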
At Aquarium we are focusing on the latter — product analytics. Luckily, this is not a new discipline!
The typical stack, according to industry best practice, incorporates analytics products (Mixpanel, Amplitude, etc) with business intelligence (BI) tools (Mode, Looker, etc), all on top of data warehouses (BigQuery, Snowflake, etc). Most of these tools are well established, having been around since the early 2010s — more than a decade, at this point!
Historically, this sort of setup has allowed product teams to trace user behavior like “users are clicking on this button to go to this page, which leads to this action”, so they can determine which features need to be tweaked or built.
LLMs are capable of a broad range of NLP tasks (translation, summarization, etc) — meaning they are extremely effective at interpreting intent. This makes them perfect to serve as the interface between your app and your users.
In other words, your users don’t have to click through various input fields or read docs about your query language — they can simply ask for what they want in plain English (or any other natural language).
In this new world, standard UI elements (dropdowns, buttons, input fields, etc) are replaced with a line of text.
For example, in the case of a task management app like Linear, a filter/search interface built out of assignee, status, and label controls is reduced to a single line of text, something like “show me all open bugs assigned to me”.
Or in the case of Slack, a request becomes conversational, something like “catch me up on what I missed in #engineering this week”.
Say that you have an LLM-integrated product where users are able to type whatever they want into a query bar. You’re receiving thousands of these unique inputs every day — other than doing manual spot checks, you have no real way of understanding what users are requesting.
Why wasn’t this a problem before? Traditional UIs have a limited “action space” where users follow a predetermined flow of button clicks, dropdown selections, etc. Even when there was freeform text input, it was usually scoped to a particular use case (e.g. `name of task` or `keywords in description`). Each of these fixed UI elements provided a place where you could hook in some tracking:
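For instance, a fixed filter UI might emit an event like this on every interaction (`track_event` is a hypothetical stand-in for your analytics SDK, and the property names are illustrative):

```python
def track_event(user_id: str, event: str, properties: dict) -> None:
    """Hypothetical stand-in for your analytics SDK's track() call."""
    print(user_id, event, properties)

# Every fixed UI element maps cleanly onto a structured property:
track_event("user_123", "search_filter_applied", {
    "property_type": "hotel",       # dropdown with a finite set of options
    "amenities": ["wifi", "pool"],  # multi-select checkboxes
    "guests": 2,                    # numeric stepper
})
```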
Most analytics tools expect some sort of structured input (e.g. `{"click": true, "selected_type": "hotel"}`) because user interactions were themselves structured.
However, this changes with LLMs.
Take, for example, Airbnb’s booking interface, which is one such traditionally “structured UI”.
Because the UI provides finite options for what the user can select, it’s easy for existing analytics tools to:
1. Log each selection as a structured event (e.g. `property type` being one of `["house", "apartment", "guesthouse", "hotel"]`, or by selected amenity types)
2. Aggregate and filter on those events (e.g. “how many searches specified `property type = hotel`?”)
In contrast, an LLM-version of Airbnb (or any other booking UI) might be a single input box, receiving freeform input like the following:
- “A bright cozy studio in Azores for 2 people with great views”
- “Lodge in the Alps not too expensive”
- “Family vacation with two young children and in-laws, need space and proximity to beach”
- “treehouse”
How would existing analytics tools break apart all this text into something useful? How would you even categorize all of the different things that users are asking for here?
With LLM-powered UIs, you no longer have dropdown selections or preset input fields to log — just a single piece of freeform text. Even though the full text from a user actually carries more context than a sequence of click events, when you have millions of unstructured inputs it’s hard to aggregate, filter, query, manipulate, or otherwise interpret them — let alone use them to identify cohorts or patterns that you can correlate with your team’s KPIs.
Ideally, you’d have an exploratory tool that would help you answer questions like: What are users actually asking for? How do those requests group into common use cases? And which kinds of requests correlate with the metrics you care about?
With some processing, any natural language text input like `Lodge in the Alps not too expensive` can be turned into something more structured — for example, key-value pairs like `{"type": "lodge", "location": "mountains", "price": "budget"}`.
In this case, attributes are the keys and categories are the possible values for a particular attribute. For instance, the attribute `location` might take categories like `mountains`, `beach`, or `city`.
But how do you know what sort of structure matters? The same input can be tagged with any number of attributes. You could just as easily label this input with `"first letter of input": "L"` — but that hardly does you any good.
To aggregate your user data in a meaningful way and understand its “shape”, you’d need to:
1. Discover which attributes actually matter, i.e. the dimensions along which user inputs meaningfully vary
2. Consistently assign each input to a category along those attributes, at the scale of millions of inputs
Luckily, embeddings and LLMs make this process quite doable.
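As a rough sketch of how this could work, the snippet below first clusters embedded inputs to discover the data’s shape, then asks an LLM to tag each input against a fixed taxonomy. It assumes the OpenAI Python SDK and scikit-learn; the model names, prompt, and taxonomy keys are illustrative assumptions, not a prescription:

```python
import json
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

queries = [
    "A bright cozy studio in Azores for 2 people with great views",
    "Lodge in the Alps not too expensive",
    "Family vacation with two young children and in-laws, need space and proximity to beach",
    "treehouse",
]

# Step 1: embed and cluster the raw inputs to discover which attributes
# actually matter (the dimensions along which requests vary).
embeddings = client.embeddings.create(model="text-embedding-3-small", input=queries)
vectors = [item.embedding for item in embeddings.data]
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)

# Step 2: tag every input against the taxonomy (attribute = key, category = value).
PROMPT = (
    "Extract JSON with keys 'type', 'location', and 'price' from this lodging "
    "request. Use null for attributes that are not mentioned.\n\nRequest: {query}"
)

for query in queries:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(query=query)}],
        response_format={"type": "json_object"},
    )
    attributes = json.loads(completion.choices[0].message.content)
    print(query, "->", attributes)
```

In practice you’d iterate: clusters suggest candidate attributes, and the tagging step turns them into consistent labels across millions of inputs.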
If you can convert your unstructured text inputs into a taxonomy of key-value metadata (attribute = key, category = value) — you can then pass this extracted metadata through existing product analytics tools.
To determine whether users are getting value out of your LLM feature, you’d want to slice your production data along a given attribute, chart the distribution of its categories, and correlate these to key product metrics (e.g. ACV, clickthrough, churn rate).
For instance, in the travel booking example — do queries that specify exact dates tend to correspond to higher booking rates?
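With the extracted metadata in hand, that question reduces to an ordinary groupby. Here’s a sketch with made-up rows (the column names are illustrative):

```python
import pandas as pd

# Each row: one LLM query, its extracted attributes, and a downstream outcome.
df = pd.DataFrame({
    "has_exact_dates": [True, False, True, False, False, True],
    "booked":          [1,    0,     1,    0,     1,     1],
})

# Booking rate, sliced by whether the query specified exact dates.
print(df.groupby("has_exact_dates")["booked"].mean())
```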
If your tool is able to convert unstructured text into metadata labels, you can both:
1. See the single-point-in-time distribution of LLM text input. This in itself would probably be enough to let you understand common use cases and form hypotheses about which redesigns or new features are necessary.
2. Track production data over time (independently or in relation to key metrics), as sketched below. You could gauge the effectiveness of a change to your LLM features (e.g. a better prompt wrapper, a new button, etc) based on how it changes the distribution of incoming user inputs. Or you might notice that users who phrase a query a certain way are more likely to make it through your conversion pipeline.
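For the second point, once the labels exist, tracking the category mix over time takes only a few lines of pandas (again with illustrative, made-up data):

```python
import pandas as pd

df = pd.DataFrame({
    "week":     ["2023-W01", "2023-W01", "2023-W02", "2023-W02", "2023-W02"],
    "category": ["lodge", "studio", "lodge", "treehouse", "lodge"],
})

# Share of each category per week: did a prompt change shift what users ask for?
weekly = df.groupby("week")["category"].value_counts(normalize=True).unstack(fill_value=0)
print(weekly)
```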
You should be able to integrate the new attribute labels into existing product analytics stacks, essentially “enriching” the data you already have. This would include Customer Data Platforms (CDPs) like Segment, and dashboards like Mode, Mixpanel, Amplitude, etc.
There are a few different ways you could hook into the existing data/product analytics stack.
In terms of ingestion, you could pull raw text events from a CDP like Segment, read them directly out of your data warehouse (BigQuery, Snowflake, etc), or send them in at request time via an API or SDK.
You should also have flexibility when exporting your enriched events: writing them back to your warehouse, routing them through your CDP, or sending them directly to dashboards like Mode, Mixpanel, or Amplitude.
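As one example of this path, an enrichment step that forwards labeled events through Segment might look roughly like the following (assuming Segment’s Python SDK; `extract_attributes` stands in for the taxonomy extraction sketched earlier):

```python
import analytics  # Segment's Python SDK (import path may vary by SDK version)

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

def extract_attributes(text: str) -> dict:
    """Placeholder for the LLM-based taxonomy extraction sketched earlier."""
    return {"type": "lodge", "location": "mountains", "price": "budget"}

def enrich_and_forward(user_id: str, raw_query: str) -> None:
    # Enrich the raw freeform event with structured attribute labels...
    properties = {"raw_query": raw_query, **extract_attributes(raw_query)}
    # ...then route it through Segment, so Mixpanel, Amplitude, Mode, etc
    # pick it up with no changes needed on those teams' side.
    analytics.track(user_id, "llm_query_submitted", properties)

enrich_and_forward("user_123", "Lodge in the Alps not too expensive")
```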
This has the additional benefit of encouraging cross-team collaboration and visibility — people across your org already use these tools, and from their perspective, no changes would be needed. Without having to be dialed into the specifics of LLMs, they’d start seeing more facets of data to explore.
Here at Aquarium we strongly believe that data matters, and that virtuous feedback loops are the key to success. If you have the insights to drive your actions, your team can move much more quickly towards that hallowed land of product-market fit.
Interested in what this might look like? We’ve been working on something that can serve as the “LLM product analytics” tool we’ve just described — check out Tidepool!