If you’re a builder, it’s important to know what’s available in your toolbox, and the right time to use a given tool. Depending on what you’re doing, there are probably ones you use more often (hammer, screwdriver), and ones that you use less often (say, a hacksaw).
A lot of very smart people are experimenting with LLMs right now, resulting in a pretty jam-packed toolbox, acronyms and all (fine-tuning, RLHF, RAG, chain-of-thought, etc.). It’s easy to get stuck in the decision paralysis stage of “what technical approach do I use”, even if your ultimate goal is to “build an app for X”.
On their own, people often run into issues with base model LLMs — “the model didn’t return what I wanted” or “the model hallucinated, its answer makes no sense” or “the model doesn’t know anything about Y because it wasn’t trained on it”.
People sometimes turn to a fairly involved technique called fine-tuning, in hopes that it will solve all of the above. In this post, we’ll talk about why fine-tuning is probably not necessary for your app.
Specifically, people often think of “fine-tuning” when they want one or both of the following:

- The model to perform a specific task or answer in a particular format (often achievable with few-shot prompting)
- The model to draw on knowledge it wasn’t trained on, such as your own internal docs (often achievable with retrieval augmented generation)

A combination of these two techniques is actually sufficient for most use cases.
What even is fine-tuning?
Base LLMs have a lot of abilities (question answering, summarization, etc.) that you likely want to leverage in your app, but you may find them too generic for (or unaware of) your particular use case.
You might be drawn to fine-tuning because you believe “more training” can help your LLM application eke out better accuracy on your target task. Intuitively this makes sense — why wouldn’t it be best to adjust the model based on text from your specific domain?
However, the following table explains why it is sufficient, easier, and often preferable to apply other techniques to the existing base model:
Just to reiterate: fine-tuning (except in some rare cases) negates most of the resource-saving benefits of recent LLMs, the very reasons people are flocking to this technology in the first place. The biggest reason NLP was hard to do before late 2022 was that you needed to collect data, label data, train models, and host infrastructure, all of which requires hiring an ML ops and engineering team!
Now, using LLMs out of the box, the startup cost is incredibly low. There are a whole bunch of orgs that never would have done NLP if not for LLMs lowering the bar so far. Is it worth investing your eng time into fine-tuning when the state of the art is advancing so quickly? Sure, you’ll have a slight competitive advantage if your model has better accuracy/quality. But will you still think so a few months later, when other companies get the same boosted functionality with GPT-5, no effort required?
This is why we recommend focusing your attention on lighter-touch approaches like few-shot prompting and retrieval augmented generation (RAG).
Although most popular LLMs have been trained to respond in a Q&A format, you may want them to perform a specific task (e.g., sentiment analysis, among a range of other applications) or to output answers in a particular format (e.g., JSON).
While you can provide these instructions directly in the prompt, the LLM’s response is probabilistic, not deterministic. There isn’t a guarantee that it will answer in the way that you expect (and perhaps this is why people sometimes think fine-tuning is needed).
However, you can make the output far more consistent by showing the model a handful of worked examples directly in the prompt (few-shot prompting).
For example, the following prompt provides examples (taken from the yelp_review_full dataset) so that the LLM (ChatGPT in this case) knows how to classify user reviews:
This place is absolute garbage... Half of the tees are not available, including all the grass tees. It is cash only, and they sell the last bucket at 8, despite having lights. And if you finish even a minute after 8, don't plan on getting a drink. The vending machines are sold out (of course) and they sell drinks inside, but close the drawers at 8 on the dot. There are weeds grown all over the place. I noticed some sort of batting cage, but it looks like those are out of order as well. Someone should buy this place and turn it into what it should be.
{"sentiment": "negative", "business_type": "golf range"}
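In code, that few-shot prompt might look something like the sketch below, using the OpenAI Python client. The model name and the extra labeled example are illustrative assumptions, not prescriptions:

```python
# pip install openai
# Minimal few-shot classification sketch; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# One labeled example teaches both the task and the exact JSON shape we want.
# (A real prompt would typically include a few of these.)
FEW_SHOT_MESSAGES = [
    {
        "role": "system",
        "content": "Classify the user review. Respond only with JSON "
                   'containing "sentiment" and "business_type".',
    },
    {"role": "user", "content": "Best tacos in town, friendly staff, open late!"},
    {"role": "assistant", "content": '{"sentiment": "positive", "business_type": "restaurant"}'},
]

def classify_review(review: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; use whichever chat model you have access to
        messages=FEW_SHOT_MESSAGES + [{"role": "user", "content": review}],
    )
    return response.choices[0].message.content

print(classify_review("Half of the tees are not available, and it is cash only..."))
# expected shape: {"sentiment": "negative", "business_type": "golf range"}
```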
Usually the base model can extrapolate what you want from the examples you provide (“few-shot”). However, if you need the model to perform a task using specialized knowledge that it may not have been trained on, few-shot prompting won’t be enough.
By itself, an LLM can’t answer questions about content it hasn’t been trained on. However, recent models have sizable context windows; GPT-4, for example, supports 32,000 tokens, or roughly 24,000 words of “memory” per request. This memory can include your instructions, any reference material you paste in, and the conversation so far.
Supported context window lengths have been trending bigger and bigger, to the point where you really can provide pages of source material in a real-time query.
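If you want a quick sanity check on whether your material fits, you can count tokens locally with OpenAI’s tiktoken library. A minimal sketch (the encoding name matches the GPT-3.5/GPT-4 family; the file path is hypothetical):

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by the GPT-3.5/GPT-4 family;
# swap in the encoding that matches your model.
enc = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    """Rough size of `text` in context-window terms."""
    return len(enc.encode(text))

doc = open("internal_doc.md").read()  # hypothetical file
print(num_tokens(doc))
```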
Say that you want to use an LLM to do smarter search over your own internal documentation. Your docs total far more than 24,000 words, by a long shot. However, if you have another means of figuring out which doc (and, for longer docs, which “chunk” of a particular doc) likely has what you’re looking for, you can provide that chunk as context for your question to the LLM.
Luckily, vector DBs do just that. Assuming that you’ve already split your docs into LLM-friendly chunks and stored them in the DB in vector form, you can subsequently: (1) embed the user’s question, (2) query the DB for the stored chunks most similar to it, and (3) pass the retrieved chunks to the LLM as context alongside the question.
The following diagram illustrates how you’d store your documents into a vector DB:
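Here’s a rough sketch of that indexing flow in Python. The index name, chunk size, and embedding model are illustrative assumptions, and the naive fixed-size chunking is just a placeholder:

```python
# pip install openai pinecone
# Indexing sketch: chunk each doc, embed each chunk, upsert vectors into Pinecone.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("internal-docs")  # assumes an index created with dimension 1536

def chunk(text: str, max_chars: int = 2000) -> list[str]:
    # Naive fixed-size chunking; real pipelines often split on headings/paragraphs.
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]

def embed(text: str) -> list[float]:
    resp = openai_client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

def index_document(doc_id: str, text: str) -> None:
    index.upsert(
        vectors=[
            (f"{doc_id}-{i}", embed(piece), {"text": piece})
            for i, piece in enumerate(chunk(text))
        ]
    )
```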
This second diagram illustrates how you’d query the vector DB and provide the results as context to the LLM:
Below is some pseudocode for the second part (the query logic), using the vector DB Pinecone in this case. FYI, this won’t run as-is, but it should give you a sense of what’s needed:
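A sketch along those lines, reusing the openai_client, index, and embed helpers from the indexing sketch above (model name and prompt wording are illustrative):

```python
# Query sketch: embed the question, retrieve the nearest chunks, answer with the LLM.

def answer(question: str, top_k: int = 3) -> str:
    # 1. Find the stored chunks most similar to the question.
    results = index.query(vector=embed(question), top_k=top_k, include_metadata=True)
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    # 2. Ask the LLM to answer using only the retrieved context.
    response = openai_client.chat.completions.create(
        model="gpt-4",  # illustrative
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. "
                           "If the answer isn't in the context, say so.",
            },
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```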
Given how far you can get with the alternatives mentioned above, there aren’t many cases where fine-tuning is truly warranted; it’s worth considering only once the lighter-touch approaches above have demonstrably fallen short for your task.
To close this out with a callback to the original toolbox metaphor: you can consider a base LLM to be like a Swiss army knife, sufficient for (and adaptable to) most use cases.
Given the upfront time and computational resources that fine-tuning requires, we’d recommend starting with the base model (along with the supplemental techniques mentioned) and seeing how far you get!