Long Context Fine-Tuning: A Technical Deep Dive
The landscape of Large Language Models (LLMs) is rapidly evolving, with context lengths expanding from a few thousand tokens a year ago to millions of tokens now. This increase in context length has very real implications for enterprise applications, particularly in Retrieval Augmented Generation (RAG), document analysis, and summarization systems. While prior models were limited to processing a few pages of text, modern models like Meta's Llama 3.2 series can handle 131K tokens, which is the equivalent of a 200-page novel.
This capability is very useful when working with enterprise data. Traditional RAG systems often require complex chunking and re-ranking strategies to work within context constraints. However, with extended context lengths, organizations can now process entire documents or multiple documents simultaneously, potentially simplifying architectures while improving accuracy. These advancements are particularly valuable when you’re working with:
- Enterprise document RAG systems
- Multi-document question answering
- Code repository understanding and generation
- Financial report processing and summarization
- Complex tool and API interactions for agentic systems
The problem, however, lies in implementing reliable and performant long context systems, which isn't as straightforward as simply using LLMs with a higher theoretical context limit. Recent work shows that most models degrade at context lengths well below their quoted maximums. For example, take a model with a maximum context length of 131K tokens, pass it a random sequence of 90,000 tokens, and ask it to repeat back the last 100 words in the sequence. You'll find that the LLM struggles with even this simple regurgitation task!
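You can try this probe yourself. Here is a minimal sketch using an OpenAI-compatible client; the base URL, API key, and model name are placeholders to swap for your own setup, and the filler length is only a rough stand-in for ~90K tokens:

```python
import random
import string

from openai import OpenAI

# Assumption: any OpenAI-compatible endpoint works here; swap in your own
# base URL, API key, and long-context model of choice.
client = OpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_API_KEY")

def random_word(length: int = 5) -> str:
    return "".join(random.choices(string.ascii_lowercase, k=length))

# Roughly 90K tokens of filler (the exact token count depends on the tokenizer).
words = [random_word() for _ in range(40_000)]
target = " ".join(words[-100:])  # ground truth: the last 100 words

prompt = (
    "Below is a long sequence of words.\n\n"
    + " ".join(words)
    + "\n\nRepeat back the last 100 words of the sequence, exactly as written."
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": prompt}],
)
prediction = response.choices[0].message.content
```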
This degradation is a problem because extended context capabilities are currently concentrated in frontier models that come with significant usage costs. To improve performance on long context tasks, you need to teach the model to use long sequences effectively. With the latest updates, the Together AI platform now supports fine-tuning on context lengths as large as 32K tokens, with longer sequence lengths to follow. By fine-tuning smaller models to handle longer contexts, organizations can achieve comparable performance at a fraction of the cost. This approach is particularly valuable for enterprise applications, where data privacy and ownership are crucial considerations.
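To give a concrete (if simplified) picture of what such training data can look like, here is a minimal sketch that writes chat-style JSONL. The field layout is a common convention, not an exact spec, so defer to your fine-tuning provider's documentation:

```python
import json

# Hypothetical examples: each record pairs a long input with the desired output.
examples = [
    {
        "document": "... up to ~32K tokens of source text ...",
        "summary": "A concise reference summary of the document.",
    },
]

# Write chat-style JSONL; the exact schema is an assumption here and should
# follow the documentation of the platform you fine-tune on.
with open("long_context_train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "user", "content": f"Summarize the following document:\n\n{ex['document']}"},
                {"role": "assistant", "content": ex["summary"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```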
Long context fine-tuning is quite different from regular fine-tuning and presents its own challenges, which we discuss below. In this technical deep dive, we'll explore and demonstrate:
- Problems that LLMs have working with long sequences
- The solution: fine-tuning on long sequences
- Practical problems of long context fine-tuning and how we solved them
- Real-world example + code: improving the summarization capabilities of Llama 3.1 8B
If you would like to dive into code directly, please refer to the notebooks below:
- Notebook 1: We show how LLMs have a problem with simple repetition tasks when it comes to long-context inputs and how we can solve it.
- Notebook 2: We show how you can improve the summarization capabilities of Llama 3.1 8B by fine-tuning.
Demonstrating the Long Context Problem
A recent paper showed diminishing returns when models are prompted with sequences longer than an optimal threshold. For Llama 3.1 405B, for example, performance begins to degrade beyond 32K tokens. The graph below from the paper shows how different models degrade past a certain token length:

They identified the main problems these models face when dealing with long-context sequences:
1. The "Lost in the Middle" Problem
Models struggle to retrieve information located in the middle of the context, and this degradation worsens as the context length grows.
2. Effective Context Length Limitations
Research from the RULER paper shows that the usable context length often falls short of the advertised maximum. Performance begins to decrease well before the stated limit is reached, and the effective length varies by task type. Interestingly, the authors also found that different tasks have different optimal context lengths.
These findings suggest an important consideration for practitioners: you can’t just use the maximum available length and need to experiment to find the optimal length for your task, which may vary from model to model!
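One practical way to do this is a simple sweep: run the same evaluation at several truncated context lengths and note where quality starts to drop. A minimal sketch, where `evaluate_at_length` is a hypothetical callable wrapping whatever task metric you care about:

```python
# Sketch of a context-length sweep. `evaluate_at_length` is a hypothetical
# function that truncates inputs to `max_tokens`, runs your task (QA, retrieval,
# summarization, ...), and returns a quality score.
def sweep_context_lengths(evaluate_at_length, lengths=(8_000, 16_000, 32_000, 64_000, 128_000)):
    scores = {n: evaluate_at_length(max_tokens=n) for n in lengths}
    for n, score in scores.items():
        print(f"{n:>7} tokens -> score {score:.3f}")
    return scores
```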
Simple Repetition Task
As a simple demonstration of this performance degradation, we conduct a toy experiment:
We ask an LLM to repeat back to us the last `k` words in a provided sequence.

To solve this repetition task, an LLM only needs a simple induction head that copies a specific part of the input. An induction head is a key component in Transformer models that identifies repeated sequences and uses previous occurrences to predict what comes next; this capability is fundamental to how Transformers process language, enabling them to learn from repetition and make predictions based on previously seen patterns. For more background, see Anthropic's extensive research on visualizing induction heads.
For the detailed analysis, please refer to the complete notebook here. In short, we demonstrate that an untuned Llama 3.1 70B model performs suboptimally on this task (a Levenshtein Distance ratio of 0.37), and that by fine-tuning on just 2,000 long-context examples, we can get an 8B model to perform the same task far more accurately (a Levenshtein Distance ratio of 0.81).
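The scoring itself is simple. A similarity ratio closely related to normalized Levenshtein distance can be computed with the standard library; this is a minimal sketch, not necessarily the exact metric implementation used in the notebook:

```python
from difflib import SequenceMatcher

def similarity_ratio(prediction: str, reference: str) -> float:
    """Return a 0-1 similarity score between model output and ground truth.

    difflib's ratio is closely related to (1 - normalized edit distance); a
    dedicated Levenshtein package can be swapped in for the exact metric.
    """
    return SequenceMatcher(None, prediction, reference).ratio()

# Example: compare the model's repetition attempt against the true last 100 words.
print(similarity_ratio("alpha bravo charlie", "alpha bravo charlie delta"))
```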
Let's examine the challenges you might run into when trying to perform long context fine-tuning and their practical solutions.
Technical Challenges of Long Context Fine-tuning
During training, LLMs must keep a significant amount of data in GPU memory: model weights, gradients, optimizer states, and intermediate activations. Each token in the input sequence requires multiple activation vectors at every layer of the model, so as the context length increases, this memory requirement grows linearly and quickly consumes available VRAM. A model processing 64K tokens needs 64 times more activation memory than one processing 1K tokens, making it challenging to fit these intermediate results in GPU memory even on high-end hardware. Distributed training can help scale large model training across multiple GPUs, but the most common methods primarily reduce the footprint of model parameters, optimizer states, or other large persistent states. The main memory bottleneck in long-context training, however, comes from storing the activations for a single sample: even with multiple GPUs, you are still limited by the memory required to process your longest individual sequence.
To address these challenges, the research community has proposed several methods to make long-context training feasible and efficient. We use sequence-parallel training with a distributed attention algorithm that is highly scalable and allows for efficient training on any context length, exactly matching the standard attention mechanism in terms of outputs. This works in conjunction with gradient checkpointing, which strategically saves intermediate results at specific layers and recomputes others as needed during backpropagation, trading some computation time for substantial memory savings.
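As a rough illustration of the gradient checkpointing piece when training on your own hardware, here is a minimal sketch with Hugging Face Transformers. The checkpoint name is illustrative, and the sequence-parallel attention itself is part of the training stack rather than something you enable in model code:

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative checkpoint name; substitute the model you are actually training.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
)

# Recompute activations during the backward pass instead of storing all of them,
# trading extra compute for a large reduction in activation memory.
model.gradient_checkpointing_enable()
model.config.use_cache = False  # the KV cache is unnecessary during training
```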
Improving Long Context Summarization with Fine-Tuning
An important use case for LLMs is summarization, where we can pass large documents into the context window of the model and prompt it to generate comprehensive summaries of those documents.
For most LLMs, this works quite well for documents whose total size stays below roughly 32,000 tokens. However, depending on the model and its post-training context length, summarization performance on longer documents can be quite poor.

For this task, we designed a simple synthetic dataset using long documents from the RedPajama dataset, with predictions from a mixture of LLMs as targets.
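A sketch of how such targets can be produced is shown below; the teacher model name, prompt, and endpoint are illustrative assumptions rather than our exact pipeline:

```python
from openai import OpenAI

# Assumption: any OpenAI-compatible endpoint; the teacher model name is illustrative.
client = OpenAI(base_url="https://api.together.xyz/v1", api_key="YOUR_API_KEY")

def make_summary_target(document: str) -> str:
    """Ask a strong teacher model to write a reference summary for a long document."""
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        messages=[{
            "role": "user",
            "content": "Write a comprehensive summary of the following document:\n\n" + document,
        }],
    )
    return response.choices[0].message.content

# Each (document, summary) pair then becomes one long-context training example,
# in the same JSONL layout sketched earlier.
```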
We demonstrate a ~10% improvement in summarization capabilities of Llama 3.1 8B after fine-tuning on our own synthetic dataset along with the GovReport long document summarization dataset, compared to an untuned Llama 3.1 70B model.
Here are the results of our experiments:
Synthetic Data fine-tuning:
GovReport fine-tuning:
For the detailed analysis, please refer to the complete notebook here.
Conclusion
Long context fine-tuning represents a significant advance in LLM capabilities, but it requires careful implementation to ensure correctness and efficiency. Using Together AI, it is now possible to achieve better performance at extended sequence lengths by fine-tuning models on your own long-context data.