[ChatCharlene] From BERT to GPT-4, an NLP Engineer’s Perspective

Jessica Yao
Aug 24, 2023

Charlene Chambliss is a senior software engineer at Aquarium Learning, where she’s working on tooling to help ML teams improve their model performance by improving their data. In addition to being an incredible engineer with an inspiring backstory, Charlene previously worked on NLP applications at Primer.AI. In this blog post, we interview Charlene about her experiences working with older models like BERT, and the perspective this gives her on the more recent wave of generative, RLHF-based LLMs (e.g. GPT-4 and LLaMA).

Could you talk a bit about what you worked on at Primer?

When I was at Primer, the main product, Analyze, analyzed raw news articles. We took in all the latest news, from news outlets both big and small, and we were trying to cluster them into discrete, structured “events.”

For example, “Tropical Storm Hilary hits southern California” — that’s an event. But identifying events is nontrivial, because every article will talk about this event differently, with different titles, vocabulary, focus, and so on. Also, what we humans think of as “an event” might span multiple days, or weeks, or months, so the event clustering problem is pretty challenging.

Much of my time at Primer was on the applied research side — training models, evaluating models, and setting them up to do model inference on the firehose of incoming documents. I worked on a bunch of different tasks: document-level classification, question-answering, relationship extraction, and named-entity recognition.

What does named entity recognition refer to?

Named-entity recognition (NER) is a classic NLP task that takes in a document and identifies the important “nouns” in the document text — for example, people, places, organizations, and miscellaneous entities (like the name of a product, or a piece of proposed legislation). So the goal was to be able to provide users with a list of entities associated with each event.
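For a concrete picture of what NER output looks like, here’s a minimal sketch using the Hugging Face transformers pipeline with a public BERT-based checkpoint (dslim/bert-base-NER is an illustrative choice, not the model Primer used):

```python
# Minimal NER sketch with a public BERT-based checkpoint.
# "dslim/bert-base-NER" is an illustrative choice, not Primer's model.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces back into whole entities
)

text = "Tropical Storm Hilary hits southern California, FEMA says."
for entity in ner(text):
    # Each entity comes with a type (PER/ORG/LOC/MISC), the matched span, and a score.
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

Per-document entities like these are what would get rolled up into the per-event entity lists described above.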

Who were your users?

The end users were mostly government analysts — they were responsible for monitoring some sort of key area, like “I need to know what’s going on in Pakistan today”, and could use the product as a source of “open-source intelligence” (OSINT). But you might imagine other sorts of users, like finance customers who need to monitor their portfolio companies, or anything tangentially related that might affect those companies’ stock prices. Anyone who might find it helpful to have the “pulse” on something.

What did a pre-GPT-era architecture look like?

Generally speaking, the older models had to be trained to do one specific task and that task only […] so from an architectural perspective, in the pre-GPT days, your LLM application would often be made up of a collection of different, independently trained models.

I think the number one difference between an older model like BERT vs today’s GPT-3.5 is that, generally speaking, the older models had to be trained to do one specific task and that task only. You could fine-tune a BERT model to do classification, or named entity recognition, or summarization, or what-have-you. But you couldn’t, for example, take a classification model and then start using it to do question-answering – fine-tuned weights weren’t transferable to other tasks.

If you wanted to do multiple tasks, you would have to train multiple models, each of which was its own distinct project that would take at least two weeks or so to be production-ready. The new models have been trained in a way that allows them to generalize to multiple different tasks, but at a slightly lower quality than if you had built a fine-tuned, dedicated model. (There are rumors that some of them ARE in fact multiple models, via the Mixture-of-Experts method, but those are still unverified!)

Effective real-world LLM applications really are structured like this, drawing on multiple tasks. With Primer, for example, Analyze didn’t just do named-entity recognition; it also did the initial clustering of the documents, and then within each cluster it was doing extractive summarization, document classification by topic, quote extraction, and relationship extraction between entities, among other things.

So from an architectural perspective, in the pre-GPT days, your LLM application would often be made up of a collection of different, independently trained models. If GPT-3.5 had been available back then, it might have looked more like a series of API calls to a single catch-all model, where the tasks would be differentiated by prompts.
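For a rough idea of what that “single catch-all model” shape looks like in practice, here’s a sketch using the OpenAI Python client, where the only thing distinguishing tasks is the prompt (the model name and prompts are placeholders, not Primer’s actual pipeline):

```python
# Sketch: one general model, with tasks differentiated purely by prompt.
# The model name and prompts are placeholders, not a production pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASK_PROMPTS = {
    "entities": "List the people, places, and organizations mentioned in this article:\n\n{doc}",
    "topic": "Classify this article's topic as finance, politics, weather, or other:\n\n{doc}",
    "summary": "Write a two-sentence summary of this article:\n\n{doc}",
}

def run_task(task: str, doc: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": TASK_PROMPTS[task].format(doc=doc)}],
    )
    return response.choices[0].message.content

article = "Tropical Storm Hilary hit southern California on Sunday..."
print(run_task("topic", article))
```

In the pre-GPT setup, each entry in that prompt dictionary would have been a separately trained and separately deployed model.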

Could you also talk a bit more about the differences in upfront cost, between BERT and today’s off-the-shelf models?

Another big difference is that you don’t need your own inference infrastructure in order to use these new LLMs. When we were working with various fine-tuned BERT models, we’d have to figure out how to host them and run inference. At the time, there were maybe a few places starting to offer cloud hosting for ML inference workloads, but nothing like the ecosystem now.

Today you have HuggingFace, or Replicate, or tons of different places that allow you to just upload a model (or use an existing one) and ping their API for inference, without having to maintain your own infra for GPU-intensive workloads. As a result, today’s startups are able to build genuinely useful products with much smaller and less specialized teams.
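As an example of how lightweight that can be, here’s a sketch using the huggingface_hub inference client against a public sentiment model (the specific model is an arbitrary illustrative choice):

```python
# Sketch: hosted inference instead of a self-managed GPU serving stack.
# The model below is a public example, chosen only for illustration.
from huggingface_hub import InferenceClient

client = InferenceClient()  # optionally pass token="hf_..." for higher rate limits
result = client.text_classification(
    "Markets rallied after the earnings report.",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(result)  # a list of labels with scores, with no GPU infra on your side
```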

You’d mentioned two weeks to spin up a model for production—what was the bulk of this time spent on?

The biggest thing was data labeling. If you wanted a model to determine whether something is a financial document or not, you’d need to get a dataset where you have X documents that are finance-related and Y documents that aren’t. Like the models themselves, datasets weren’t reusable – you’d need an entirely new set of labeled documents for a new class, like whether a document is clickbait.

So you’d wait for a team to label some documents (enough for both training AND evaluation), your ML engineer would train a model on a small cloud VM or similar, and then they’d pull it back down and evaluate it. In my experience, the majority of those two weeks was waiting for turnaround on the data. Writing a script and training and evaluating the model often took as little as an hour or so. Sometimes I would label the data myself because I was impatient.
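For context, that “write a script and train” step for a binary document classifier might look roughly like the sketch below, using the Hugging Face Trainer (the toy dataset and label names are made up for illustration):

```python
# Sketch: fine-tuning a small BERT-style classifier once labeled data exists.
# The toy dataset and label names are made up for illustration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labeled = Dataset.from_dict({
    "text": ["Q3 revenue beat analyst expectations...",
             "The storm made landfall near San Diego..."],
    "label": [1, 0],  # 1 = finance-related, 0 = not
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finance-clf", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=labeled.map(tokenize, batched=True),
)
trainer.train()
```

The real work, as Charlene notes, was assembling and labeling the dataset that feeds it, not the script itself.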

Do you think there’s still a place today for these older models like BERT?

If you’re dealing with a high-volume, narrowly-defined task, then there’s big cost savings to be had by using the older and smaller transformers […] that said, if your application really does need to deal with unpredictable requests […] and immediate answers […] it makes sense to use a larger and more general model.

Absolutely. If you’re dealing with a high-volume, narrowly-defined task, then there’s big cost savings to be had by using the older and smaller transformers. Or if your task is specialized, like if whatever domain you’re working in isn’t well-represented in the LLM training data.

  • High volume: you have a bulk amount of documents that you just need to churn through every day.
  • Narrowly-defined: all you want is named-entity recognition, or to classify the document as X/Y/Z, or to do some other single task reliably and at high accuracy.

For these kinds of tasks, at scale it’s cheaper to fine-tune and deploy an older model than it would be to try to use GPT-3.5 or LLaMA. Older models are much smaller in terms of their memory footprint, and there have also been a lot of transformer-specific optimizations made in the last few years to make inference fast. It ends up costing a fraction of what it would cost to do that same task on the same number of documents using a newer and larger model.

That said, if your application really does need to deal with unpredictable requests, where users can provide their own documents and expect immediate answers and that kind of thing, it makes sense to use a larger and more general model, since you can’t really encode that as a “do-at-scale” task for narrow models.

Wouldn’t sourcing the training data still be a bottleneck, in the case of fine-tuning?

Actually, it’s easier than ever to train the narrower models because you can now use the generative models to label data for you. If your task is straightforward enough, the outputs from GPT-3.5 or LLaMA 2 might be correct on 90% of your examples, so instead of paying people to label 100% of the data, you pay just for inference and then for people to correct that other 10% of the data.
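A sketch of that workflow might look like the following, where a general model pre-labels everything and anything suspicious gets routed to a human (the model name, prompt, and review heuristic are illustrative assumptions):

```python
# Sketch: pre-label documents with a general LLM, then route a subset to humans.
# The model name, prompt, and review heuristic are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
LABELS = {"finance", "not_finance"}

def prelabel(doc: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Answer with exactly 'finance' or 'not_finance'.\n\nDocument:\n{doc}",
        }],
    )
    return response.choices[0].message.content.strip().lower()

rows = []
for doc in unlabeled_documents:  # assumed to be loaded upstream
    label = prelabel(doc)
    # Anything outside the expected label set gets flagged for a human reviewer.
    rows.append({"text": doc, "label": label, "needs_review": label not in LABELS})

with open("prelabeled.jsonl", "w") as f:
    f.write("\n".join(json.dumps(row) for row in rows))
```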

What is your take on fine-tuning in general, as opposed to “waiting until GPT-5”? Is it reasonable to expect significant improvements in subsequent models?

You can always be working on better UX while waiting for LLM improvements […] UX and generation quality will both be bottlenecks at different stages of your product’s development.

I wouldn’t feel comfortable saying yes or no at this point as to whether LLMs will level off in quality significantly. We haven’t seen any quantum leaps since GPT-4, but that came out only earlier this year. There could very well be another quantum leap within the next 12 months.

From the product perspective, I do err on the side of waiting and seeing, because oftentimes the quality of the generation isn’t the limiting factor in terms of whether people will use your product and get value out of it. The limiting factor is usually UX. A tool that’s easy to use but occasionally produces bad outputs will have more users than a tool with pristine outputs that’s awkward to use, every day of the week.

You can always be working on better UX while waiting for LLM improvements. Just listen to what users say about using your product’s interface, or integrations. Read their comments about the generation quality and see if it’s actually a problem that can be solved by adding a simple heuristic on top, like good old string matching or regex. And don’t forget about tools like Microsoft’s guidance, which can help you more easily apply well-researched prompt engineering techniques to get more reliable outputs.
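For instance, a lot of “bad output” complaints can be absorbed by a small post-processing layer like the sketch below (the allowed labels and fallback behavior are assumptions, just to show the idea of regex on top of generation):

```python
# Sketch: a cheap regex heuristic layered on top of LLM output.
# The allowed labels and fallback behavior are illustrative assumptions.
import re

ALLOWED_LABELS = ("finance", "politics", "weather", "other")

def clean_label(raw_output: str) -> str:
    # Reduce chatty output like "Sure! I'd call this one Finance." to a single label.
    match = re.search(r"\b(" + "|".join(ALLOWED_LABELS) + r")\b", raw_output.lower())
    return match.group(1) if match else "other"  # fall back rather than show junk

print(clean_label("Sure! I'd classify this article as: Finance."))  # -> "finance"
```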

UX and generation quality will both be bottlenecks at different stages of your product’s development. No one will notice generation quality if the UI is annoying to use, because they’ll churn and go use something else, but once people are using it and liking it, they’ll put up with some jank here and there if the outputs are good enough.

What are your top three favorite AI-powered interfaces right now?

My favorite is probably the perplexity.ai search engine. You can use it for open-ended questions, learning, debugging, and of course traditional search queries.

And then you have GitHub Copilot, of course. It’s great for churning out boilerplate code. I think the autocomplete on Copilot is currently still the best in the code generation market. It’s not perfect, but it often does know what you’re trying to do, so it’s great as long as you’re willing to be attentive.

The third one is Sourcegraph Cody. Copilot is great for code generation, but Cody is king when it comes to understanding your entire codebase. For example, I’ve had it tell me about entire multipart request lifecycles, like “here’s where the request is made in the frontend, how it’s processed in the server, how it’s processed by this other service, and the artifacts get uploaded to GCS here.” Normally I would have to ask someone or just trace the code all the way through, which could take hours for something sufficiently complex.

Sometimes people dismiss AI coding tools because they don’t want the tools to write code for them. So don’t use them for that! They can still make you faster at other things, like writing tests and documentation. This is true in general: AI often isn’t accurate enough to fully delegate tasks to, but it can still provide a lot of time savings as an assistant.

Thanks to Charlene Chambliss for this interview! You can find her on LinkedIn or check out her personal website.