Is Your LLM App Useful? Answer with Product Analytics

Jessica Yao
Aug 1, 2023

If you’re working on a software product (consumer-facing or SaaS), you’ve probably witnessed the recent explosion of AI-powered features. Now that AI has become an API, a lot is being built very quickly, and there’s an ambient sense of possibility in the space. (Seriously, try keeping up with ML Twitter!)

You may feel full of a similar optimism, or feel late to the party—if the latter, don’t worry! It’s still quite early stages, and a lot is up for grabs. Though there’s been some consolidation in terms of how to build, what to build remains an open question.

Advances in prompt engineering and vector DBs aside, AI product development is still product development. Whether you’re a product manager, engineer, or designer, you will still need to answer the same questions:

  • Who are your power users? Who is churning?
  • What are their common usage patterns?
  • Is your product being used in a way you expect?
  • Is there a previously unidentified use case?

In this post, we’ll describe:

  1. Why product analytics have been historically useful for finding product-market fit
  2. How large language models (LLMs) fundamentally change UIs
  3. Why UI changes mean that a new LLM-specific “product analytics” tool is necessary

The Case for Product Analytics

Moving fast and flying blind don’t usually make for the most effective combo. Too often, companies develop feature after feature without thinking about where they’re going.

True product velocity is a combination of:

  1. Speed: the ability to build quickly — we trust that you have this!
  2. Direction: the ability to figure out what’s worth building (this is where product analytics comes in)

The powerful but tricky part of LLM-integrated apps is both the wide range of inputs they accept and the nondeterministic, black-box nature of their outputs. This makes it harder to answer two kinds of questions:

1. Observability: Is the LLM even doing what you expect it to do?

Think “Sentry for LLMs”. This might allow you to monitor token usage, response consistency, etc.

2. Product analytics: Assuming the former (the “correctness” of your LLM integration) – how is your product actually being used?

Think “Mixpanel or Amplitude for LLMs”. This sort of tooling allows you to tie specific user behaviors to the metrics/KPIs you care about (e.g. signups, conversion rate, ACV, etc).

At Aquarium we are focusing on the latter — product analytics. Luckily, this is not a new discipline!

The typical stack, according to industry best practice, combines product analytics tools (Mixpanel, Amplitude, etc) with business intelligence (BI) tools (Mode, Looker, etc), all on top of data warehouses (BigQuery, Snowflake, etc). Most of these tools are well established, having been around since the early 2010s — more than a decade, at this point!

Take these arrows with a grain of salt; a lot of these products have expanded beyond their original “category” so something that is marked as “product analytics” may also support “reverse ETL”.

Historically, this sort of setup has allowed product teams to analyze user behavior like “users are clicking on this button to go to this page, which leads to this action”, so they can determine which features need to be tweaked or built.

How LLMs Change UIs

Wait, how do I even get started with LLMs?

LLMs are capable of a broad range of NLP tasks (translation, summarization, etc) — meaning they are extremely effective at interpreting intent. This makes them perfect to serve as the interface between your app and your users.

In other words, your users don’t have to click through various input fields or read docs about your query language — they can simply ask for what they want in plain English (or any other natural language).

In this new world, standard UI elements (dropdowns, buttons, input fields, etc) are replaced with a line of text.

For example, in the case of a task management app like Linear, a filter/search interface like:

is reduced to:

Or in the case of Slack, a request becomes conversational:

(For more examples, see our other post Product Applications of LLMs.)

Potential Limitations of Current Product Analytics Tooling

Say that you have an LLM-integrated product where users are able to type whatever they want into a query bar. You’re receiving thousands of these unique inputs every day — other than doing manual spot checks, you have no real way of understanding what users are requesting.

Why wasn’t this a problem before? Traditional UIs have a limited “action space” where users follow a predetermined flow of button clicks, dropdown selections, etc. Even when there was freeform text input, it was usually scoped to a particular use case (e.g. name of task or keywords in description). Each of these fixed UI elements provided a place where you could hook in some tracking:


Most analytics tools expect some sort of structured input (e.g. {"click": true, "selected_type": "hotel"}) because user interactions were themselves structured.
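To make this concrete, here’s a minimal sketch of the kind of structured event a traditional analytics SDK emits. The event name and fields are illustrative, not any vendor’s actual schema:

```python
# A minimal sketch of the structured events traditional analytics tools expect.
# The event name and property fields here are illustrative, not a real schema.
import json

def track(event_name: str, properties: dict) -> str:
    """Serialize an analytics event the way a Mixpanel/Amplitude-style SDK might."""
    event = {"event": event_name, "properties": properties}
    return json.dumps(event, sort_keys=True)

# A click on a fixed UI element maps cleanly onto key-value properties:
payload = track("filter_applied", {"click": True, "selected_type": "hotel"})
print(payload)
```

Because every field comes from a fixed UI element, the payload is trivially aggregable downstream.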

However, this changes with LLMs.

Take, for example, Airbnb’s booking interface, which is one such traditionally “structured UI”:

Because the UI provides finite options for what the user can select, it’s easy for existing analytics tools to:

  • Bucket user requests by the pre-provided categories (e.g. property type being one of ["house", "apartment", "guesthouse", "hotel"], or by selected amenity types)
  • Track/associate the followup actions associated with these queries (e.g. “are users more likely to end up booking this session for property type = hotel?”).

In contrast, an LLM-version of Airbnb (or any other booking UI) might be a single input box, receiving freeform input like the following:

  • A bright cozy studio in Azores for 2 people with great views
  • Lodge in the Alps not too expensive
  • Family vacation with two young children and in-laws, need space and proximity to beach
  • treehouse

How would existing analytics tools break apart all this text into something useful? How would you even categorize all of the different things that users are asking for here?

With LLM-powered UIs, you no longer have dropdown selections or preset input fields to log — just a single piece of freeform text. Even though the full text from a user actually has more context than a sequence of click events…when you have millions of unstructured inputs, it’s hard to aggregate, filter, query, manipulate, or otherwise interpret them — let alone use them to identify cohorts or patterns that you can correlate with your team’s KPIs.

What’s Needed for LLM Product Analytics

Ideally, you’d have an exploratory tool that would help you answer the following questions:

1. What are the patterns in my users’ text events?

With some processing, any natural language text input like Lodge in the Alps not too expensive can be turned into something more structured — for example, key-value pairs like {"type": "lodge", "location": "mountains", "price": "budget"}.
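As a rough sketch of that conversion: in production you’d prompt an LLM to do the extraction, but the toy keyword heuristic below shows the same input/output shape. All attribute names and keyword lists here are made up for illustration:

```python
# Sketch of extracting structured attributes from freeform booking queries.
# A real pipeline would prompt an LLM to do this extraction; this keyword
# heuristic is a stand-in that just demonstrates the output shape.
def extract_attributes(query: str) -> dict:
    q = query.lower()
    attrs = {}
    # Order matters: check "treehouse" before "house" so substrings don't collide.
    for prop_type in ("lodge", "studio", "treehouse", "house", "apartment"):
        if prop_type in q:
            attrs["type"] = prop_type
            break
    if "alps" in q or "mountain" in q:
        attrs["location"] = "mountains"
    if "not too expensive" in q or "cheap" in q or "budget" in q:
        attrs["price"] = "budget"
    return attrs

print(extract_attributes("Lodge in the Alps not too expensive"))
# {'type': 'lodge', 'location': 'mountains', 'price': 'budget'}
```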

In this case, attributes are the keys and categories are the possible values for a particular attribute — for instance, the attribute "price" might take categories like "budget" or "luxury".

But how do you know what sort of structure matters? The same input can be tagged with any number of attributes. You could just as easily label this input with "first letter of input": "L" — but that hardly does you any good.

To aggregate your user data in a meaningful way and understand its “shape”, you’d need to:

  • Know what attributes might be most relevant: Perhaps you know this ahead of time. Or perhaps it would help if something auto-generated this on your behalf, based on its understanding of your entire set of input data.
  • Cluster your data according to each attribute: Consider your data along a single dimension (attribute), and identify the main groupings (categories) that appear.
  • Name each attribute and each of its constituent categories: Intelligently summarize all of this in plain English (or any other natural language), so that someone knows what they’re looking at as they explore the dataset.

Luckily, embeddings and LLMs make this process quite doable.
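For intuition, here’s a toy version of the clustering step. The hand-made 2-D vectors stand in for real embeddings, and the greedy threshold grouping stands in for a proper clustering algorithm — both are assumptions for illustration only:

```python
# Toy sketch of "cluster your data according to each attribute": group queries
# whose vectors are similar. Real pipelines would use an embedding model and a
# real clustering algorithm; the 2-D vectors here are hand-made placeholders.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(items, threshold=0.95):
    """Assign each item to the first cluster whose seed vector it resembles."""
    clusters = []  # list of (seed_vector, [member labels])
    for label, vec in items:
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(label)
                break
        else:
            clusters.append((vec, [label]))
    return [members for _, members in clusters]

queries = [
    ("lodge in the Alps", (0.9, 0.1)),
    ("chalet in the mountains", (0.88, 0.15)),
    ("beach house for family", (0.1, 0.95)),
]
print(greedy_cluster(queries))
# [['lodge in the Alps', 'chalet in the mountains'], ['beach house for family']]
```

Naming each resulting cluster in plain English is exactly the kind of summarization task an LLM handles well.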

If you can convert your unstructured text inputs into a taxonomy of key-value metadata (attribute = key, category = value) — you can then pass this extracted metadata through existing product analytics tools.

2. How do I correlate these attributes to user behavior?

To determine whether users are getting value out of your LLM feature, you’d want to slice your production data along a given attribute, chart the distribution of its categories, and correlate these to key product metrics (e.g. ACV, clickthrough, churn rate).

For instance, in the travel booking example — do queries that specify exact dates tend to correspond to higher booking rates?
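Once attributes are extracted, that question reduces to a plain group-by. A minimal sketch with made-up event data (the field names are illustrative):

```python
# Sketch: with extracted attributes attached to events, "booking rate by
# whether exact dates were specified" is a simple group-by. Data is made up.
from collections import defaultdict

events = [
    {"has_exact_dates": True,  "booked": True},
    {"has_exact_dates": True,  "booked": True},
    {"has_exact_dates": True,  "booked": False},
    {"has_exact_dates": False, "booked": False},
    {"has_exact_dates": False, "booked": True},
    {"has_exact_dates": False, "booked": False},
]

def booking_rate_by(events, attribute):
    totals = defaultdict(lambda: [0, 0])  # attribute value -> [booked, total]
    for e in events:
        counts = totals[e[attribute]]
        counts[0] += e["booked"]
        counts[1] += 1
    return {value: booked / total for value, (booked, total) in totals.items()}

print(booking_rate_by(events, "has_exact_dates"))
```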

If your tool is able to convert unstructured text into metadata labels, you can both:

  1. See the single-point-in-time distribution of LLM text input. This in itself would probably be enough to allow you to understand common use cases, and to form hypotheses about which redesigns or new features are needed.

  2. Track production data over time (independently or in relation to key metrics). You could gauge the effectiveness of a change to your LLM features (e.g. better prompt wrapper, new button, etc) based on how it shifts the distribution of incoming user inputs. Or you might notice that users who phrase a query in a certain way are more likely to make it through your conversion pipeline.

3. How can we enrich our existing analytics pipelines with these extracted attributes?

You should be able to integrate the new attribute labels into existing product analytics stacks, essentially “enriching” the data you already have. This would include Customer Data Platforms (CDPs) like Segment, and dashboards like Mode, Mixpanel, Amplitude, etc.

The following diagram shows some different ways you could hook into the existing data/product analytics stack:

In terms of ingestion, you could:

  1. Directly log LLM input events from your website/app
  2. Feed/post-process data from a CDP
  3. Feed/post-process data from your data warehouse (e.g. using reverse ETL)
  4. Generically ingest events via an API

You should also have flexibility when exporting your enriched events:

  1. Send them downstream to other product analytics tools (Amplitude, Mixpanel, etc) or BI viz tools (Mode, Looker, etc)
  2. Store them in your data warehouse
  3. Download in a generic structured format (e.g. CSV)
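For example, the generic export path (option 3) might look like the following sketch, which flattens enriched events into a CSV any BI tool can ingest. The field names and sample rows are illustrative:

```python
# Sketch of the generic CSV export path for enriched events.
# Field names and the sample events are made up for illustration.
import csv
import io

enriched_events = [
    {"user_id": "u1", "raw_text": "Lodge in the Alps not too expensive",
     "type": "lodge", "location": "mountains", "price": "budget"},
    {"user_id": "u2", "raw_text": "treehouse",
     "type": "treehouse", "location": "", "price": ""},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["user_id", "raw_text", "type", "location", "price"])
writer.writeheader()
writer.writerows(enriched_events)
print(buf.getvalue())
```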

This has the additional benefit of encouraging cross-team collaboration and visibility — existing people in your org already use these tools, and from their perspective, no changes would be needed. Without having to be completely dialed into the specifics of LLMs, they’d start seeing more facets of data to explore.

Conclusion

Here at Aquarium we strongly believe that data matters, and that virtuous feedback loops are the key to success. If you have the insights to drive your actions, your team can move much more quickly towards that hallowed land of product-market fit.

Interested in what this might look like? We’ve been working on something that can serve as the “LLM product analytics” tool we’ve just described — check out Tidepool!