Why You (Probably) Don’t Need to Fine-tune an LLM

This post is aimed at people building LLM (large language model) applications, as opposed to those doing LLM research.

If you’re a builder, it’s important to know what’s available in your toolbox, and the right time to use a given tool. Depending on what you’re doing, there are probably ones you use more often (hammer, screwdriver), and ones that you use less often (say, a hacksaw).

A lot of very smart people are experimenting with LLMs right now — resulting in a pretty jam-packed toolbox, acronyms and all (fine-tuning, RLHF, RAG, chain-of-thought, etc). It’s easy to get stuck in the decision paralysis stage of “what technical approach do I use”, even if your ultimate goal is to “build an app for X”.

Used on their own, base LLMs often cause problems: “the model didn’t return what I wanted”, “the model hallucinated, its answer makes no sense”, or “the model doesn’t know anything about Y because it wasn’t trained on it”.

People sometimes turn to a fairly involved technique called fine-tuning, in hopes that it will solve all of the above. In this post, we’ll talk about why fine-tuning is probably not necessary for your app.

Specifically, people often think of “fine-tuning” when they want one or both of the following:

  • Additional structure/style: They want the LLM to do a more specific task (beyond open-ended question answering) and to provide answers in a desired format.
    • This can be done with few-shot prompting.
  • Additional source knowledge: They want the base LLM to answer questions about material it was not trained on (and is therefore unaware of), typically because that material is not publicly available on the internet and so is unknown even to a model like GPT-4.
    • This can be done with retrieval-augmented generation (RAG).

A combination of these two techniques is actually sufficient for most use cases.

Why People Think Fine-tuning Might Be Helpful

What even is fine-tuning? 🤔

Fine-tuning involves taking a pre-trained LLM (e.g. GPT-3.5, LLaMA 2) and further training it on a smaller, domain-specific dataset to make it more specialized for a particular task or domain.

Out-of-the-box language models have already been pre-trained on large, publicly available datasets (e.g. Common Crawl). The LLM interface you interact with has “fixed weights” within its transformer/neural network architecture, which do not update based on the queries you feed it. In other words, although session history can be provided as context, the model doesn’t “learn” on the fly.

When you fine-tune a model, you are essentially “unlocking” these weights and allowing them to update based on whatever new training data you feed it (e.g. a collection of legal cases, or company earnings reports, or a specific user’s tweets). These new weights should allow the model to better handle tasks related to that new domain.


As of Aug 2023:

  • OpenAI only supports fine-tuning for its GPT-3 models (not the newer GPT-3.5 and GPT-4 models that back ChatGPT) (see how-to guide)
  • LLaMA 2 is reported to perform comparably to ChatGPT on some benchmarks and its weights are openly available, so it is typically the model that people have been fine-tuning (while maintaining chat capabilities)

Base LLMs have a lot of abilities (question answering, summarization, etc) that you likely want to leverage in your app, but you may find them too generic for (or unaware of) your particular use case.

You might be drawn to fine-tuning because you believe “more training” can help your LLM application eke out better accuracy on your target task. Intuitively this makes sense — why wouldn’t it be best to adjust the model based on text from your specific domain?

However, the following table explains why it is sufficient, easier, and often preferable to apply other techniques to the existing base model:

Initial Motivation for Fine-tuning / Why a Base LLM is Sufficient

Motivation: It’s cheaper than training a model from scratch, and leverages a base LLM’s existing training.

Why a base LLM is sufficient: Yes, but comparing to “starting from scratch” fails to consider approaches that don’t require any retraining (and retraining requires access to resources like GPUs).

Also, the context window length (the LLM’s input token limit) keeps getting longer. This length can be allocated not just to your question/task description itself, but also to (1) conversation history, (2) examples of output, and (3) additional info the LLM wasn’t trained on.

Longer context windows and possible version improvements (e.g. GPT-5) mean that a lot of the improvements from fine-tuning will likely become baseline within a few months, so why invest all that time and compute now?

Motivation: The base model doesn’t have access to private knowledge bases.

Why a base LLM is sufficient: There is a technique called retrieval-augmented generation (RAG), where documents are stored as embeddings in a vector database, queried by semantic meaning, and passed into the base model’s prompt via the context window.

Motivation: The base model doesn’t return answers in the desired style or format.

Why a base LLM is sufficient: Actually, this can be done with a more specific prompt, and the odds of consistently formatted answers can be improved with few-shot prompting, an approach that provides the base model with examples (think: SAT analogies) within the context window.

Motivation: Fine-tuning means that you don’t need to provide additional context in each prompt, which will save token usage per query.

Why a base LLM is sufficient: Perhaps in the long run, but token usage is cheap and this is usually not an issue: $0.06 per 1K tokens when using a 32K context, i.e. $1.92 to fill the entire window (~24,000 words).

Retraining/fine-tuning can cost hundreds of dollars, may still require some prompt engineering, and does not guarantee that your LLM will provide accurate answers.

Motivation: Fine-tuning will result in more accurate, domain-specific results.

Why a base LLM is sufficient: Fine-tuning does not prevent an LLM from hallucinating. In fact, it may be more reliable to provide a source to the base model (e.g. from a vector database), ask clear questions about it, and have the model indicate when the answer is not found.

Also, some experiments have shown that an LLM’s cross-functional abilities (summarization, classification, generation, etc) can actually degrade after fine-tuning, because fine-tuning results in overtraining.
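The per-query cost quoted above is easy to sanity-check. The sketch below only reproduces that arithmetic; the price ($0.06 per 1K tokens) and the words-per-token ratio (0.75, implied by 32,000 tokens being roughly 24,000 words) are taken from this post and treated as assumptions, and the function names are made up for illustration.

```python
# Assumed figures, taken from the text above (not looked up from a price list).
PRICE_PER_1K_TOKENS_USD = 0.06
WORDS_PER_TOKEN = 0.75  # 32,000 tokens is roughly 24,000 words


def prompt_cost_usd(num_tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS_USD) -> float:
    """Cost of sending `num_tokens` tokens at a flat per-1K-token price."""
    return num_tokens / 1000 * price_per_1k


def approx_words(num_tokens: int) -> int:
    """Rough word count that fits in `num_tokens` tokens."""
    return round(num_tokens * WORDS_PER_TOKEN)


full_window = 32_000
print(f"{full_window} tokens is about {approx_words(full_window)} words")
print(f"Filling the window once costs about ${prompt_cost_usd(full_window):.2f}")
```

Under the same assumptions, a more typical few-shot prompt of about 1,000 tokens would cost roughly $0.06 per query.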

Just to reiterate, fine-tuning (except in some rare cases) negates most of the resource-saving benefits of recent LLMs, which are the reason people are flocking to this technology in the first place. The biggest reason NLP was hard to do before late 2022 was that you needed to collect data, label data, train models, and host infrastructure, and all of that requires hiring ML ops and engineering teams!

Now, with LLMs usable out of the box, the startup cost is incredibly low. A whole bunch of orgs would never have done NLP if LLMs hadn’t lowered the bar so much. Is it worth investing your eng time in fine-tuning when the state of the art is advancing so quickly? Sure, you’ll have a slight competitive advantage if your model has better accuracy/quality, but will you still think so a few months later when other companies get the same boosted functionality from a newer model (e.g. GPT-5), no effort required?

This is why we recommend that you focus your attention on lighter-touch approaches like few-shot prompting and retrieval augmented generation (RAG).

Alternative 1 (for Structure): Base LLM + Few-Shot Prompting

Although most popular LLMs have been trained to respond in a Q&A format, you may want them to perform a specific task (e.g. sentiment analysis) or to output answers in a particular format (e.g. JSON).

While you can provide these instructions directly in the prompt, the LLM’s response is probabilistic, not deterministic. There isn’t a guarantee that it will answer in the way that you expect (and perhaps this is why people sometimes think fine-tuning is needed). 

However, you can ensure more consistency with the following:

  • Using basic heuristics to check that its response falls within a set of known/desired outputs
  • Providing examples of input/output pairs within the context window (what’s passed into the LLM, in addition to the main query). It’s almost as if you are providing a small amount of training data within the real-time query. This is few-shot prompting.

For example, the following prompt provides examples (taken from the yelp_review_full dataset) so that the LLM (ChatGPT in this case) knows how to classify user reviews:

Context for prompt:

You will be given inputs (reviews for various businesses, e.g. from Yelp) and your job will be to (1) classify the sentiment (one of ["positive", "negative", "neutral", "mixed"]) and (2) try to determine the sort of place the review is for. You will output results in JSON format (with the keys "sentiment" and "business_type").

Input: “This location never disappoints!! Food is always consistently great, and if you come at the right time, (witching hours) you may see the cook singing and dancing along with the music in the back. And it is awesome! ! Love this place!!”

Output: {"sentiment": "positive", "business_type": "restaurant"}

Input: “I'm writing this review to give you a heads up before you see this Doctor. The office staff and administration are very unprofessional. I left a message with multiple people regarding my bill, and no one ever called me back. I had to hound them to get an answer about my bill. \\n\\nSecond, and most important, make sure your insurance is going to cover Dr. Goldberg's visits and blood work. He recommended to me that I get a physical, and he knew I was a student because I told him. I got the physical done. Later, I found out my health insurance doesn't pay for preventative visits. I received an $800.00 bill for the blood work. I can't pay for my bill because I'm a student and don't have any cash flow at this current time. I can't believe the Doctor wouldn't give me a heads up to make sure my insurance would cover work that wasn't necessary and was strictly preventative. The office can't do anything to help me cover the bill. In addition, the office staff said the onus is on me to make sure my insurance covers visits. Frustrating situation!”

Output: {"sentiment": "negative", "business_type": "doctor's office"}

Input: “Good beer selection. Understaffed for a light Monday night crowd, it wasn't her fault she was the only server. But it took about an hour to get our sandwiches. Mine was one of the best reubens I've ever had.”

Output: {"sentiment": "mixed", "business_type": "restaurant"}

Input: “Place was alright for a one-night stay. Nothing special, room was a bit old, but fine as a pitstop on our long roadtrip.”

Output: {"sentiment": "neutral", "business_type": "hotel"}

Input: <INPUT>

Output:
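A prompt like the one above does not have to be written by hand each time. The sketch below assembles it from a list of example pairs; it is only an illustration, the function and constant names are hypothetical, and the task description and reviews are shortened versions of the ones shown above.

```python
import json

# Shortened task description; the full wording is in the prompt above.
TASK_DESCRIPTION = (
    'Classify the sentiment of each review (one of "positive", "negative", '
    '"neutral", "mixed") and guess the type of business. Respond in JSON with '
    'the keys "sentiment" and "business_type".'
)

# Abbreviated (input, output) pairs standing in for the full examples.
FEW_SHOT_EXAMPLES = [
    ("This location never disappoints!! Love this place!!",
     {"sentiment": "positive", "business_type": "restaurant"}),
    ("Good beer selection. But it took about an hour to get our sandwiches.",
     {"sentiment": "mixed", "business_type": "restaurant"}),
]


def build_few_shot_prompt(review: str) -> str:
    """Join the task description, the worked examples, and the new input."""
    parts = [TASK_DESCRIPTION, ""]
    for example_input, example_output in FEW_SHOT_EXAMPLES:
        parts.append(f'Input: "{example_input}"')
        parts.append(f"Output: {json.dumps(example_output)}")
        parts.append("")
    parts.append(f'Input: "{review}"')
    parts.append("Output:")  # left open for the model to complete
    return "\n".join(parts)


print(build_few_shot_prompt("Place was alright for a one-night stay."))
```

Keeping the examples in a list makes it cheap to swap them out or add more if the model's answers drift from the format you want.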

Example of raw input (inserted into the <INPUT> placeholder in the prompt above):

This place is absolute garbage...  Half of the tees are not available, including all the grass tees.  It is cash only, and they sell the last bucket at 8, despite having lights.  And if you finish even a minute after 8, don't plan on getting a drink.  The vending machines are sold out (of course) and they sell drinks inside, but close the drawers at 8 on the dot.  There are weeds grown all over the place.  I noticed some sort of batting cage, but it looks like those are out of order as well.  Someone should buy this place and turn it into what it should be.

ChatGPT (gpt-3.5-turbo) output:

{
"sentiment": "negative",
"business_type": "golf range"
}
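The "basic heuristics" check mentioned earlier could, for this task, be as simple as parsing the response and confirming it has the expected keys and an allowed sentiment. This is a minimal sketch under those assumptions; `parse_review_label` is a hypothetical helper, and the allowed values come from the prompt above.

```python
import json

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral", "mixed"}
REQUIRED_KEYS = {"sentiment", "business_type"}


def parse_review_label(raw_response: str):
    """Return the parsed dict if the response passes basic checks, else None."""
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(parsed, dict) or set(parsed) != REQUIRED_KEYS:
        return None  # missing or unexpected keys
    if parsed["sentiment"] not in ALLOWED_SENTIMENTS:
        return None  # sentiment outside the known label set
    business_type = parsed["business_type"]
    if not isinstance(business_type, str) or not business_type.strip():
        return None  # business type must be a non-empty string
    return parsed


response = '{\n"sentiment": "negative",\n"business_type": "golf range"\n}'
print(parse_review_label(response))
```

When the check returns `None`, one option is to retry the query or route the input to a fallback, rather than passing a malformed answer downstream.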

Usually the base model is sufficient to extrapolate what you want from the examples you provide (“few-shot”). However, if you need the model to perform a task using specialized knowledge that it may not have been trained on, few-shot prompting won’t be enough.

Alternative 2 (for Knowledge): Base LLM + Retrieval Augmented Generation (RAG)

By itself, an LLM can’t answer questions about content it hasn’t been trained on. However, some models offer an extended context window of 32,000 tokens, which is essentially 24,000 words of “memory”. This memory can include:

  • Previous conversation history (if any)
  • Additional info needed that the LLM hasn’t been trained on
  • Your actual query

The supported context window length has been trending bigger and bigger (to the point where you really can provide pages of source material in the real-time query).
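One way to picture this allocation is as a token budget. The sketch below is an assumption-laden illustration, not how any particular API works: `estimate_tokens` approximates tokens from a word count using the 0.75 words-per-token ratio implied above (a real tokenizer would differ), and `fit_history` and `reserved_for_answer` are hypothetical names.

```python
CONTEXT_LIMIT_TOKENS = 32_000  # figure quoted in the text
WORDS_PER_TOKEN = 0.75         # implied by 32,000 tokens being ~24,000 words


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    return round(len(text.split()) / WORDS_PER_TOKEN)


def fit_history(query: str, extra_info: str, history: list[str],
                limit: int = CONTEXT_LIMIT_TOKENS,
                reserved_for_answer: int = 1_000) -> list[str]:
    """Keep the most recent history messages that fit next to the query and extra info."""
    budget = limit - reserved_for_answer
    budget -= estimate_tokens(query) + estimate_tokens(extra_info)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # older messages are dropped once the budget runs out
        kept.append(message)
        budget -= cost
    return list(reversed(kept))  # restore chronological order


history = ["first message " * 10, "second message " * 10, "third message " * 10]
print(fit_history("What did we decide?", "", history, limit=80, reserved_for_answer=10))
```

The query and the extra source material are budgeted first here, on the assumption that they matter more than older conversation turns.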

Say that you want to use an LLM to do smarter search over your own internal documentation. Your docs add up to far more than 24,000 words. However, if you have another means of figuring out which doc (and for longer docs, which “chunk” of a particular doc) likely has what you’re looking for, you can provide that “chunk” as context for your question to the LLM.

Luckily, vector DBs do just that — assuming that you’ve already split your docs into LLM-friendly chunks and stored these in the DB in vector form, you can subsequently:

  1. Encode your search query as a vector
  2. Query the DB using that vector to find the top X chunks most related to (1)
  3. Provide each of these to the LLM along with the original search query, to see if it can determine the answer with that extra context

The following diagram illustrates how you’d store your documents into a vector DB:
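In code, the storage step might look like the sketch below. It is deliberately simplified and rests on loud assumptions: a plain Python list stands in for the vector DB, and `toy_embed` is a hypothetical stand-in that counts words, whereas a real setup would call an embedding model that captures semantic meaning. The chunk size of 50 words is arbitrary.

```python
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split a document into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def index_documents(docs: dict[str, str], max_words: int = 50) -> list[dict]:
    """Chunk every document and store each chunk alongside its vector."""
    index = []
    for doc_id, text in docs.items():
        for chunk_number, chunk in enumerate(split_into_chunks(text, max_words)):
            index.append({
                "id": f"{doc_id}#{chunk_number}",  # which doc and which chunk
                "text": chunk,                     # kept so it can be shown to the LLM later
                "vector": toy_embed(chunk),        # what the similarity search runs on
            })
    return index


docs = {"handbook": "Expense reports are due monthly. " * 30}
index = index_documents(docs)
print(len(index), "chunks stored; first id:", index[0]["id"])
```

Storing the original chunk text next to its vector matters, because the text (not the vector) is what eventually goes into the prompt.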

This second diagram illustrates how you’d query the vector DB and provide the results as context to the LLM:

Below is some pseudocode for the second part (query logic), using the vector DB Pinecone in this case. FYI this won’t actually run, but is meant to give you a sense of what’s needed:
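The sketch below walks through the three query steps listed earlier in runnable form. It is not Pinecone client code: an in-memory list stands in for the vector DB, `toy_embed` (a hypothetical word-count function) stands in for an embedding model, so matching is on shared words rather than semantic meaning, and the final prompt is printed instead of being sent to an LLM. The sample chunks are invented for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(query: str, index: list[dict], top_k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then return the `top_k` closest chunks."""
    query_vector = toy_embed(query)
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item["vector"]),
                    reverse=True)
    return [item["text"] for item in ranked[:top_k]]


def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Step 3: hand the retrieved chunks to the LLM together with the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        'If the answer is not in the context, say "not found".\n\n'
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "Expense reports must be submitted by the fifth business day of each month.",
    "The office kitchen is restocked every Monday morning.",
    "VPN access requires enrolling your laptop in device management first.",
]
index = [{"text": chunk, "vector": toy_embed(chunk)} for chunk in chunks]

question = "When are expense reports due?"
prompt = build_rag_prompt(question, top_chunks(question, index, top_k=1))
print(prompt)  # this string is what you would send to the LLM
```

Asking the model to say "not found" when the context lacks the answer is the same guard against hallucination suggested earlier in the post.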

When You Might Actually Fine-tune

Given how far you can get with the alternatives mentioned above, there aren’t that many truly valid cases for fine-tuning. You might still consider it if:

  • You have super stringent accuracy requirements for a certain task that justify putting in a lot of engineering and ops resources (a heavily resourced company like Bloomberg can do this).
  • You really, really care about fast edge inference (e.g. an LLM running locally on your phone). In that case you’re probably not going to fine-tune an LLM at all; you’ll probably want a BERT-type model, because it’s lighter and more task-generalizable (src).
  • Few-shot prompting and RAG combined do not get you the performance you want (e.g. for a more involved style transfer task) and you really want to improve it. Even then, as we’ve mentioned, the move might be “welp, let’s wait until the next LLM version update”.

Conclusion

To close this out with a callback to the original toolbox metaphor, you can consider a base LLM to be like a Swiss army knife: it is sufficient for, and adaptable to, most use cases.

Given the upfront time and computational resources required for fine-tuning, we’d recommend starting with this base first (along with the supplemental techniques mentioned) and seeing how far you get with it!

Software engineer and occasional technical writer. Former early employee at Aquarium! 🐬
