Chain-of-Thought prompting in JSON

Intro 

At Tidepool, one way we help users interpret their text datasets is by surfacing interesting patterns in their data. There are many techniques for doing this, such as clustering or network analysis, but the easiest way to get started is simply asking an LLM to do it. While you lose some explainability and scalability, you can get something working in no time, with easy zero-shot steering if you have specific ways you’d like to interpret the data. Compared to other topic-modeling methods, a generative LLM approach is also better at producing topics that are semantically meaningful for your data and that respect explicit domain constraints.

We’ll use this example to highlight a small prompt writing technique that we’ve found useful, especially if you’re using a JSON mode API. This continues a trend from one of our earlier posts about the impact of content ordering within a prompt.

Setup

We want to write a prompt that will take in a list of short text documents, identify some relevant topic clusters, and give us back a JSON object of the form:

[
    {
        "topic_name": "a specific name for the topic",
        "topic_description": "a description of the topic that cites multiple examples as evidence"
    },
    ...
]

To mimic identifying patterns in real-world data, we’ll use a subset of Twitter exchanges with company-run accounts from the Customer Support on Twitter dataset. To keep it bounded in scope, we arbitrarily filtered it to the American Airlines support account. We’ll then randomly sample a few hundred messages and pass them into a basic prompt that asks the LLM to identify and return a list of topics covering users’ favorite things about American Airlines.

Here are a few tweets from the dataset:

0: @americanair i️ am on the divert flight 679 and it has been absolute hell waiting, but taylor has been an excellent agent dealing with everything and has been incredibly helpful and calm in this nightmare!
1: @americanair not too pleased that i paid to upgrade to mce on aa78 to london and the entertainment would not work. flight crew was useless. thanks for a very boring 9 hour flight. waste of money.
2: @americanair homeboy flight attendant sounds like the announcer from the @2109 👨✈️
3: @americanair had the best experience at the clt airport! thank you to chantelle & chelly! amazing customer service!
4: this @americanair plane is so small i can taste the guy’s hamburger three rows behind me. 🤢
5: @americanair just insisted i show them a passport to get on a flight to east texas.  they're kind of not wrong but ???
6: @americanair i would like to talk to someone about some poor customer service and a refund i was told by your employees that i’m due. i need to hear from someone asap.
7: @americanair  how is flying the day after thanksgiving? is it calm or crazy?
8: @americanair booked a ticket 6 hours ago for my son to come home from college. credit card charged, but still awaiting confirmation email.
9: thank you @americanair for letting me finally watch guardians of the galaxy 2
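
To give a sense of the preprocessing, here is a minimal sketch of the sampling and numbering step. The `format_messages` helper and its arguments are our own illustration; loading and filtering the dataset itself is omitted:

```python
import random

def format_messages(messages, k=200, seed=0):
    """Randomly sample up to k messages and number them for the prompt."""
    random.seed(seed)
    sample = random.sample(messages, min(k, len(messages)))
    # Prefix each message with its index so the LLM can cite it as evidence.
    return "\n".join(f"{i}: {text}" for i, text in enumerate(sample))
```

The numeric prefixes matter later: they let the model cite specific messages (e.g. "message 3") in its topic descriptions.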

What makes a good topic?

We don’t have any ground truth correct answers, and topic quality is a bit subjective, so we’ll be looking for the following qualities:

  • Are all topics sufficiently distinct, without too much overlap? 
  • Are all topics at a reasonable specificity and appropriate to the domain?
  • Are all topics supported by at least N different documents?

In practice, you might have different goals. For example, mutual exclusivity between topics can be beneficial, but it can also be restrictive and not match your actual data. You might also care about topics that are relevant to the taxonomy of your data but that might not be present in the dataset yet.
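
The third criterion is the easiest to spot-check automatically. Since our prompt asks descriptions to cite messages as evidence, a rough sketch looks like the following (these helpers are our own, and assume citations take the form "message N" as in the outputs below):

```python
import re

def cited_messages(topic_description):
    """Extract the set of message indices cited in a topic description."""
    return {int(m) for m in re.findall(r"message (\d+)", topic_description)}

def is_supported(topic, n=2):
    """A topic passes if its description cites at least n distinct messages."""
    return len(cited_messages(topic["topic_description"])) >= n
```

This is only a proxy, of course: it trusts the model's citations rather than verifying that the cited messages actually support the topic.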

Our starting prompt

We’ll start with this prompt template. All prompts will be run against the gpt-4-1106-preview model, with the default system prompt, temperature set to 0.1 for repeatability, and all other settings left at default.

You will be provided with a list of interactions on twitter with the American Airlines twitter account. We want to identify high level topics that cover people's favorite things about American Airlines, in order to do analysis.

Carefully review each message in this list. Afterwards, propose a list of distinct topics.

# Messages Start

0: @americanair i️ am on the divert flight 679 and it has been absolute hell waiting, but taylor has been an excellent agent dealing with everything and has been incredibly helpful and calm in this nightmare!
...
199: @americanair on hold again for the past 45 minutes. @americanair get 👏🏻 it 👏🏻 together 👎🏻👎🏻👎🏻👎🏻👎🏻

# Messages End

A valid topic must match multiple messages in the data. Be as specific as possible with your topics and topic names, and pick topics that do not overlap. Provide 3 to 8 topics of people's favorite things in the following JSON schema:

[
    {
        "topic_name": "a specific name for the topic",
        "topic_description": "a description of the topic that cites multiple examples as evidence"
    },
    ...
]


Descriptions must include a summary of several representative examples. Write just the JSON without a code block please.

Which leads to the following output:

[
    {
        "topic_name": "Customer Service Praise",
        "topic_description": "Positive feedback about American Airlines' customer service, including helpful and calm agents during stressful situations (message 0), amazing service at the airport by specific staff members (message 3), and impressive social media team support (message 11)."
    },
    {
        "topic_name": "Complimentary Upgrades and Perks",
        "topic_description": "Customers express gratitude for unexpected upgrades and perks such as free upgrades to first class (message 46), being accommodated on an earlier flight (message 20), and receiving help with tight connections (message 52)."
    },
    {
        "topic_name": "In-flight Entertainment Satisfaction",
        "topic_description": "Passengers appreciate the in-flight entertainment options provided by American Airlines, including the opportunity to watch movies (message 9) and the presence of entertaining flight attendants (message 2)."
    },
    {
        "topic_name": "Flight Crew Appreciation",
        "topic_description": "Acknowledgment of the flight crew's excellent performance, including friendly and non-grouchy attendants (message 13), outstanding service on specific flights (message 41), and great customer service during the flight (message 173)."
    },
    {
        "topic_name": "Efficient Problem Resolution",
        "topic_description": "Customers are thankful for the efficient resolution of issues, such as the social media team quickly finding solutions (message 11) and the airline's assistance with almost impossible connections (message 52)."
    },
    {
        "topic_name": "Recognition of Individual Employees",
        "topic_description": "Passengers often mention specific employees by name or location who provided exceptional service, such as Taylor (message 0), Chantelle & Chelly (message 3), and Clarissa in MEM (message 34)."
    }
]

Our initial set of topics is:

  • Customer Service Praise
  • Complimentary Upgrades and Perks
  • In-flight Entertainment Satisfaction
  • Flight Crew Appreciation
  • Efficient Problem Resolution
  • Recognition of Individual Employees

While these do match our overall goal of finding positive interactions, they overlap quite a bit, often resembling variations of “Customer Service Praise” (e.g. “Recognition of Individual Employees” and “Efficient Problem Resolution”). They’re fine, but they could probably be better, and they don’t offer much insight into the specific reasons for the positive interactions.
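
For reference, the calls behind these runs can be sketched roughly as below, assuming the v1 OpenAI Python SDK; `build_request` and `parse_topics` are hypothetical helper names of our own:

```python
import json

def build_request(prompt):
    """Request parameters for the chat completion call."""
    return {
        "model": "gpt-4-1106-preview",
        "temperature": 0.1,  # low temperature for repeatability
        "messages": [{"role": "user", "content": prompt}],
    }

def parse_topics(response_text):
    """Parse the model's raw reply into a list of topic dicts."""
    return json.loads(response_text)

# With the v1 SDK, usage would look roughly like:
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(**build_request(prompt))
# topics = parse_topics(resp.choices[0].message.content)
```

Parsing with `json.loads` is why the prompt asks for "just the JSON without a code block" — a ```` ``` ````-fenced reply would need extra stripping first.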

Reordering JSON schema keys

Now let’s try running it, but reordering the keys in our sample JSON schema, so that the description is written before the name:

You will be provided with a list of interactions on twitter with the American Airlines twitter account. We want to identify high level topics that cover people's favorite things about American Airlines, in order to do analysis.

Carefully review each message in this list. Afterwards, propose a list of distinct topics.

# Messages Start

0: @americanair i️ am on the divert flight 679 and it has been absolute hell waiting, but taylor has been an excellent agent dealing with everything and has been incredibly helpful and calm in this nightmare!
...
199: @americanair on hold again for the past 45 minutes. @americanair get 👏🏻 it 👏🏻 together 👎🏻👎🏻👎🏻👎🏻👎🏻

# Messages End

A valid topic must match multiple messages in the data. Be as specific as possible with your topics and topic names, and pick topics that do not overlap. Provide 3 to 8 topics of people's favorite things in the following JSON schema:

[
    {
        "topic_description": "a description of the topic that cites multiple examples as evidence",
        "topic_name": "a specific name for the topic"
    },
    ...
]


Descriptions must include a summary of several representative examples. Write just the JSON without a code block please.

Producing the following output:

[
    {
        "topic_description": "Customers express gratitude for the helpful and friendly staff, both on the ground and in the air. Examples include an excellent agent named Taylor who handled a difficult situation calmly (message 0), amazing customer service by Chantelle & Chelly at CLT airport (message 3), and a shoutout to a favorite AA agent, Clarissa in MEM (message 34).",
        "topic_name": "Exceptional Customer Service"
    },
    {
        "topic_description": "Passengers appreciate the convenience and perks of upgrades, whether it's a free upgrade to first class (message 46), being accommodated on an earlier flight (message 20), or the excitement of a first class upgrade (message 127).",
        "topic_name": "Upgrades and Accommodations"
    },
    {
        "topic_description": "Travelers are thankful for in-flight entertainment options that help pass the time, such as being able to watch movies like Guardians of the Galaxy 2 (message 9) and the availability of magazines with inspiring stories (message 12).",
        "topic_name": "In-Flight Entertainment"
    },
    {
        "topic_description": "Some customers express satisfaction with the airline's responsiveness and problem-solving, particularly through social media channels. This includes the social media team quickly resolving issues (message 11) and helping with almost impossible connections (message 52).",
        "topic_name": "Responsive Social Media Support"
    },
    {
        "topic_description": "Passengers are pleased with the airline's handling of logistical challenges, such as efficiently dealing with connections at the airport (message 52) and restoring direct flights between specific cities (message 103).",
        "topic_name": "Efficient Logistics and Routing"
    },
    {
        "topic_description": "Customers show appreciation for the airline's loyalty program benefits, such as earning miles (message 76), working towards elite status (message 84), and receiving upgrades as a loyalty program member (message 18).",
        "topic_name": "Loyalty Program Benefits"
    }
]

Our new set of topics is:

  • Exceptional Customer Service
  • Upgrades and Accommodations
  • In-Flight Entertainment
  • Responsive Social Media Support
  • Efficient Logistics and Routing
  • Loyalty Program Benefits

In addition to the LLM respecting this ordering in how it writes the JSON, we also start seeing more qualitative differences between the outputs. The topic names are more specific, and there’s significantly less overlap between topics, better following the instructions in our prompt. Where we once had many variations of “Customer Service Praise,” we now have results that emphasize other things people like about American Airlines, like logistics and miles-based loyalty programs.

We suspect this is because letting the LLM write out a long-form description before committing to a single topic name serves a similar purpose to an “inner monologue” section or other chain-of-thought techniques. By starting with the description, we give the LLM space to meander and avoid jumping to the most immediate conclusions.

Faking chain-of-thought in OpenAI’s JSON mode

This is particularly useful if we want to use the newly released OpenAI JSON mode. By setting the following field on an OpenAI chat completion API request, you can force the response to be a valid JSON object:

response_format: { "type": "json_object" } 

Unfortunately, that doesn’t leave room for our friend “think step by step”: the LLM has to begin its response immediately with "{" or "[", leaving no space for chain-of-thought reasoning before committing to an answer.

While you can get chain-of-thought by splitting it into two LLM calls and prompts (one for reasoning, the other for JSON completion with JSON mode), this is a simple technique that gets you some of the benefits without significantly increasing calling complexity or token usage. Similar to hiding an inner monologue in an intermediate LLM call, you can have the LLM write its reasoning as a key in a JSON object. You can even drop that key from the output after parsing, truly treating it like a hidden inner monologue step.
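
A minimal sketch of that pattern follows. The "reasoning" key name and the wrapper object are our own choices, not anything mandated by the API; the idea is simply to put the free-form thinking key before the answer keys, then discard it after parsing:

```python
import json

# A schema hint asking for reasoning *before* the answer, so the model
# gets its chain-of-thought in first, even inside JSON mode.
SCHEMA_HINT = """Reply with a JSON object in this schema:
{
    "reasoning": "think step by step about which topics the messages support",
    "topics": [
        {
            "topic_description": "a description citing multiple examples",
            "topic_name": "a specific name for the topic"
        }
    ]
}"""

def strip_reasoning(raw_response):
    """Parse a JSON-mode response and discard the inner-monologue key."""
    parsed = json.loads(raw_response)
    parsed.pop("reasoning", None)  # hide the monologue from downstream code
    return parsed
```

Downstream code only ever sees the cleaned object, so the reasoning behaves exactly like a hidden intermediate step, at the cost of a few extra output tokens.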