Finetuning

AI
Published

May 30, 2025

Finetuning is the process of adapting a model to a specific task by further training the whole model or part of it. It is one of the three most common AI engineering techniques for adapting a model to specific needs, alongside prompt engineering and Retrieval-Augmented Generation (RAG). While prompt-based methods like prompt engineering and RAG influence a model’s quality solely through its inputs, without modifying the model itself, finetuning adapts a model by adjusting its weights. Finetuning techniques are generally more involved and require more data than prompt-based methods, but they can significantly improve a model’s quality, latency, and cost. Changing a model’s weights also makes possible things that prompt-based methods can’t achieve, such as adapting the model to a task it wasn’t exposed to during its initial training.

Finetuning is considered part of a model’s training process, specifically an extension of model pre-training. Training that happens after pre-training is referred to as finetuning, and it can take various forms. Chapter 2 discusses two types of finetuning: supervised finetuning and preference finetuning.

The goal of finetuning is to get a base model, which has some but not all of the necessary capabilities, to perform well enough for a specific task. Finetuning improves sample efficiency, meaning a model can learn the desired behavior with fewer examples than training from scratch. For instance, while training a model for legal question answering from scratch might require millions of examples, finetuning a good base model might only require a few hundred. Finetuning can enhance various aspects of a model, including its domain-specific capabilities (like coding or medical question answering) and safety, but it is most often used to improve the model’s instruction-following ability, especially to adhere to specific output styles and formats.
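To make the mechanics concrete: supervised finetuning typically continues the same next-token prediction objective as pre-training, but computes the loss only on the response tokens of each (prompt, response) example. Below is a minimal numpy sketch of that masked loss; the function name, shapes, and masking convention are illustrative assumptions, not any specific library’s API.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sft_loss(logits, target_ids, loss_mask):
    """Next-token cross-entropy, averaged over response tokens only.

    logits:     (seq_len, vocab) model predictions at each position
    target_ids: (seq_len,)       the token each position should predict
    loss_mask:  (seq_len,)       1.0 for response tokens, 0.0 for prompt tokens
    """
    probs = softmax(logits)
    # Negative log-likelihood of the correct token at each position.
    token_nll = -np.log(probs[np.arange(len(target_ids)), target_ids])
    # Prompt tokens are masked out, so only the response contributes.
    return float((token_nll * loss_mask).sum() / loss_mask.sum())
```

In real training frameworks the same effect is usually achieved by setting the prompt positions’ labels to an ignore index rather than carrying an explicit mask, but the objective is the same.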

1. When to Finetune:

  • Enhancing domain-specific capabilities: If a model struggles with a specific domain (e.g., a less common SQL dialect or customer-specific queries), finetuning on relevant data can help.

  • Improving instruction following and structured outputs: Finetuning is the most effective and general approach to get models to generate outputs in a desired format. Prompting alone is often unreliable for this; finetuning on examples that follow the desired format is far more dependable. For certain tasks like classification, modifying the model’s architecture before finetuning, by adding a classifier head, can even guarantee the output format.
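As an illustration of the classifier-head idea: the base model’s language-modeling head can be replaced with a small linear layer that maps the final hidden state to a fixed set of class logits, so that taking the argmax always yields a valid label. A minimal numpy sketch follows; the hidden size, class count, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN_SIZE = 768   # hypothetical hidden size of the base model
NUM_CLASSES = 3     # e.g. negative / neutral / positive

# Hypothetical classifier head: a single linear layer replacing the
# language-modeling head. Its weights would be learned during finetuning.
W = rng.normal(scale=0.02, size=(HIDDEN_SIZE, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def classify(hidden_state: np.ndarray) -> int:
    """Map the base model's final hidden state to a class index.

    Because the head outputs exactly NUM_CLASSES logits and we take the
    argmax, the result is guaranteed to be a valid class label, unlike
    free-form text generation.
    """
    logits = hidden_state @ W + b
    return int(np.argmax(logits))

# Stand-in for a real hidden state produced by the base model.
hidden = rng.normal(size=HIDDEN_SIZE)
label = classify(hidden)
assert 0 <= label < NUM_CLASSES
```

The key property is structural: unlike free-form generation, the head cannot produce an out-of-vocabulary or malformed answer, because its output space is exactly the set of class indices.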

  • Bias mitigation: Finetuning on carefully curated data can counteract biases present in the base model’s training data. For example, finetuning on texts featuring female CEOs, or written by women and African authors, can reduce gender and racial biases.

  • Distillation: Finetuning a smaller model to imitate the behavior of a larger model using data generated by the larger model is a common approach called distillation. This makes the smaller model cheaper and faster to use in production.
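One common way to implement distillation is to train the student to match the teacher’s temperature-softened output distribution, using the cross-entropy between the two as the loss. A minimal numpy sketch of that loss follows; the temperature value and function names are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Divide by the temperature to soften the distribution, then
    # subtract the max for numerical stability.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's softened distribution and the
    student's: the student is trained to reproduce the teacher's soft targets.

    Both logit arrays have shape (batch, vocab); a higher temperature
    exposes more of the teacher's relative preferences between tokens.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())
```

This loss is minimized when the student’s softened distribution equals the teacher’s, which is why distillation can transfer behavior with far less data than training the small model from scratch.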

  • Optimizing token usage (historically): Before prompt caching, finetuning could help optimize token usage by training the model on examples instead of including them in every prompt, resulting in shorter, cheaper, and lower-latency prompts. Although prompt caching has reduced this benefit, finetuning still removes the limitation of context length on the number of examples used.

  • Extending context length: Long-context finetuning requires modifying the model’s architecture and can increase the maximum context length, though it is harder to do and the resulting model might degrade on shorter sequences.

2. Reasons Not to Finetune:

  • Performance degradation on other tasks: Finetuning for a specific task can sometimes degrade performance on other tasks.

  • High up-front investment and continual maintenance: Finetuning requires significant resources, including acquiring high-quality annotated data (which can be slow and expensive) and ML knowledge to evaluate base models, monitor training, and debug.

  • Serving complexity: Once finetuned, serving the model requires figuring out hosting (in-house or API) and inference optimization, which is non-trivial for large models.

  • Pace of base model improvement: New base models are constantly being developed and may improve faster than a finetuned model can be updated.

  • Prompting might be sufficient: Many practitioners report that complaints about prompting’s ineffectiveness disappear once the team adopts a more systematic prompt experimentation process; with rigorous iteration, prompting alone frequently turns out to be sufficient.