Current long-context large language models (LLMs) can process inputs of up to 100,000 tokens, yet they struggle to generate outputs exceeding even a modest length of 2,000 words. Controlled experiments reveal that a model's effective generation length is inherently capped by the examples seen during supervised fine-tuning (SFT). In other words, this output limitation stems from the scarcity of long-output examples in existing SFT datasets.
Recent advancements in long-context LLMs have led to the development of models with significantly expanded memory capacities, capable of processing histories exceeding 100,000 tokens in length. However, despite their ability to handle extensive inputs, current long-context LLMs struggle to generate equally lengthy outputs.
To explore this limitation, LongWriter probes the maximum output length of state-of-the-art long-context models with multiple queries that require responses of varying lengths, such as "Write a 10,000-word article on the history of the Roman Empire." The results show that all models consistently fail to produce outputs beyond 2,000 words in length. Meanwhile, analysis of user interaction logs reveals that over 1% of user prompts explicitly request outputs exceeding this limit, highlighting a pressing need in current research to overcome this limitation.
To address this, LongWriter introduces AgentWrite, an agent-based pipeline that decomposes ultra-long generation tasks into subtasks, enabling off-the-shelf LLMs to generate coherent outputs exceeding 20,000 words. Leveraging AgentWrite, LongWriter constructs LongWriter-6k, a dataset containing 6,000 SFT data samples with output lengths ranging from 2k to 32k words. By incorporating this dataset into model training, LongWriter successfully scales the output length of existing models to over 10,000 words while maintaining output quality.
LongWriter also develops LongBench-Write, a comprehensive benchmark for evaluating ultra-long generation capabilities. The 9B parameter model, further improved through DPO, achieves state-of-the-art performance on this benchmark, surpassing even much larger proprietary models.
In this article, we will discuss the LongWriter framework, explore its architecture, and compare its performance against state-of-the-art long-context large language models. Let's get started.
Recent advancements in long-context large language models (LLMs) have led to the creation of models with significantly increased memory capacities, capable of processing histories that exceed 100,000 tokens. Despite this ability to handle extensive inputs, current long-context LLMs struggle to generate outputs of comparable length. To investigate this limitation, LongWriter examines the maximum output length of state-of-the-art long-context models through various queries that require different response lengths, such as "Write a 10,000-word article on the history of the Roman Empire." Based on the findings, LongWriter observes that all models consistently fail to generate outputs longer than 2,000 words. Furthermore, an analysis of user interaction logs indicates that over 1% of user prompts specifically request outputs beyond this limit, highlighting an urgent need in current research to address this issue.
LongWriter's study reveals a key insight: the constraint on output length is primarily rooted in the characteristics of the supervised fine-tuning (SFT) datasets. Specifically, LongWriter finds that a model's maximum generation length is effectively capped by the upper limit of output lengths present in its SFT dataset, despite its exposure to much longer sequences during the pretraining phase. This finding explains the ubiquitous 2,000-word generation limit across current models, as existing SFT datasets rarely contain examples exceeding this length. Furthermore, since many datasets are distilled from state-of-the-art LLMs, they also inherit the output length limitation from their source models.
To address this limitation, LongWriter introduces AgentWrite, a novel agent-based pipeline designed to leverage off-the-shelf LLMs to automatically construct extended, coherent outputs. AgentWrite operates in two stages: first, it crafts a detailed writing plan outlining the structure and target word count for each paragraph based on the user's input; then, following this plan, it prompts the model to generate content for each paragraph sequentially. LongWriter's experiments validate that AgentWrite can produce high-quality and coherent outputs of up to 20,000 words.
Building upon the AgentWrite pipeline, LongWriter leverages GPT-4o to generate 6,000 long-output SFT examples, collectively named LongWriter-6k, and adds this data to train existing models. Notably, LongWriter-6k successfully unlocks the models' ability to generate well-structured outputs exceeding 10,000 words in length. To rigorously evaluate the effectiveness of this approach, LongWriter develops the LongBench-Write benchmark, which contains a diverse set of user writing instructions with output length specifications ranging from 0-500 words, 500-2,000 words, and 2,000-4,000 words to beyond 4,000 words. Evaluation on LongBench-Write shows that LongWriter's 9B-size model achieves state-of-the-art performance, even compared to larger proprietary models. LongWriter further constructs preference data and uses DPO to help the model better follow long writing instructions and generate higher-quality written content, which has also been proven effective through experiments.
To summarize, LongWriter's work makes the following novel contributions:
- Analysis of Generation Length Limits: LongWriter identifies the primary factor limiting the output length of current long-context LLMs, namely the constraint on output length in the SFT data.
- AgentWrite: To overcome this limitation, LongWriter proposes AgentWrite, which uses a divide-and-conquer approach with off-the-shelf LLMs to automatically construct SFT data with ultra-long outputs. Using this method, LongWriter constructs the LongWriter-6k dataset.
- Scaling the Output Window Size of Current LLMs: LongWriter incorporates the LongWriter-6k dataset into its SFT data, successfully scaling the output window size of existing models to 10,000+ words without compromising output quality. LongWriter shows that DPO further enhances the model's long-text writing capabilities.
AgentWrite: Automatic Data Construction
To utilize off-the-shelf LLMs for automatically generating SFT data with longer outputs, LongWriter designs AgentWrite, a divide-and-conquer-style agent pipeline. AgentWrite first breaks down long writing tasks into multiple subtasks, with each subtask requiring the model to write only one paragraph. The model then executes these subtasks sequentially, and LongWriter concatenates the subtask outputs to obtain the final long output. Such an approach of breaking down a complex task into multiple subtasks using LLM agents has already been applied in various fields, such as problem-solving, software development, and model evaluation. LongWriter's work is the first to explore integrating planning to enable models to complete complex long-form writing tasks. Each step of AgentWrite is introduced in detail below.
Step I: Plan
Inspired by the thought process of human writers, who usually begin by making an overall plan for long writing tasks, LongWriter uses the planning capabilities of LLMs to output such a writing outline given a writing instruction. This plan includes the main content and word count requirements for each paragraph. The prompt used by LongWriter is as follows:
“I need you to help me break down the following long-form writing instruction into multiple subtasks. Each subtask will guide the writing of one paragraph in the essay and should include the main points and word count requirements for that paragraph. The writing instruction is as follows: {User Instruction}. Please break it down in the following format, with each subtask taking up one line:
Paragraph 1 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 400 words]
Paragraph 2 – Main Point: [Describe the main point of the paragraph, in detail] – Word Count: [Word count requirement, e.g., 1000 words]. Make sure that each subtask is clear and specific, and that all subtasks cover the entire content of the writing instruction. Do not split the subtasks too finely; each subtask’s paragraph should be no less than 200 words and no more than 1000 words. Do not output any other content.”
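To make the plan machine-usable, the numbered lines the model returns need to be split into per-paragraph subtasks. Below is a minimal sketch of this parsing step, assuming an OpenAI-style chat client; `call_llm`, the model name, and the regular expression are illustrative choices, not details taken from LongWriter itself.

```python
import re
from openai import OpenAI

# Hypothetical helper: send one user prompt to a chat model and return the
# text completion. The client and model name are illustrative assumptions.
client = OpenAI()

def call_llm(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Abbreviated stand-in for the full Step I prompt quoted above.
PLAN_PROMPT = (
    "I need you to help me break down the following long-form writing "
    "instruction into multiple subtasks. ... The writing instruction is as "
    "follows: {instruction}. ..."
)

def make_plan(instruction: str) -> list[str]:
    """Run Step I and return one subtask line per paragraph."""
    plan_text = call_llm(PLAN_PROMPT.format(instruction=instruction))
    # Per the prompt, each subtask occupies one line starting with "Paragraph <n>".
    return [
        line.strip()
        for line in plan_text.splitlines()
        if re.match(r"^Paragraph\s+\d+", line.strip())
    ]
```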
Step II: Write
After obtaining the writing plan from Step I, LongWriter calls the LLM serially to complete each subtask, generating the writing content section by section. To ensure the coherence of the output, when LongWriter calls the model to generate the n-th section, the previously generated n−1 sections are also provided as input, allowing the model to continue writing the next section based on the existing writing history. Although this serial manner prevents parallel calls to the model to complete multiple subtasks simultaneously, and the input length becomes longer, LongWriter shows in validation that the overall coherence and quality of the writing obtained this way are far superior to output generated in parallel. The prompt used by LongWriter is:
“You are an excellent writing assistant. I will give you an original writing instruction and my planned writing steps. I will also provide you with the text I have already written. Please help me continue writing the next paragraph based on the writing instruction, writing steps, and the already written text.
Writing instruction:
{User Instruction}
Writing steps:
{The writing plan generated in Step I}
Already written text:
{Previous generated (n-1) paragraphs}
Please integrate the original writing instruction, writing steps, and the already written text, and now continue writing {The plan for the n-th paragraph, i.e., the n-th line in the writing plan}.”
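Putting the two steps together, the write phase reduces to a serial loop in which each call sees the instruction, the full plan, and all previously generated paragraphs. The sketch below continues the assumptions from the earlier snippet (`call_llm` and `make_plan` are the hypothetical helpers defined above):

```python
# Abbreviated stand-in for the full Step II prompt quoted above.
WRITE_PROMPT = (
    "You are an excellent writing assistant. ...\n"
    "Writing instruction:\n{instruction}\n"
    "Writing steps:\n{plan}\n"
    "Already written text:\n{written}\n"
    "... now continue writing {current_step}."
)

def agent_write(instruction: str) -> str:
    """Run AgentWrite end to end: plan once, then write serially."""
    subtasks = make_plan(instruction)          # Step I
    paragraphs: list[str] = []
    for step in subtasks:                      # Step II, one call per subtask
        # Conditioning on all n-1 previous paragraphs is what keeps the
        # concatenated output coherent, at the cost of serial calls.
        paragraph = call_llm(WRITE_PROMPT.format(
            instruction=instruction,
            plan="\n".join(subtasks),
            written="\n\n".join(paragraphs),
            current_step=step,
        ))
        paragraphs.append(paragraph)
    # Concatenating the per-subtask outputs yields the final long output.
    return "\n\n".join(paragraphs)
```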
Validation
LongWriter tests the generation length and quality of the proposed AgentWrite method on two long-form writing datasets. The first, LongWrite-Ruler, is used to measure exactly how long an output the method can provide. The second, LongBench-Write, is mainly used to evaluate how well the model-generated content aligns with user instructions in terms of length and writing quality.
LongBench-Write: To evaluate the model's performance on a more diverse range of long-form writing instructions, LongWriter collects 120 varied user writing prompts, with 60 in Chinese and 60 in English. To better assess whether the model's output length meets user requirements, LongWriter ensures that all these instructions include explicit word count requirements. The instructions are divided into four subsets based on the word count requirements: 0-500 words, 500-2,000 words, 2,000-4,000 words, and over 4,000 words. Additionally, the instructions are categorized into seven types based on the output type: Literature and Creative Writing, Academic and Monograph, Popular Science, Functional Writing, News Report, Community Forum, and Education and Training.
During evaluation, LongWriter adopts two metrics: one for scoring the output length and another for scoring the output quality. The model's output length is scored based on how close it is to the requirement specified in the instruction. For output quality, LongWriter uses the LLM-as-a-judge approach, selecting the state-of-the-art GPT-4o model to score the output across six dimensions: Relevance, Accuracy, Coherence, Clarity, Breadth and Depth, and Reading Experience. The final score is computed by averaging the length score and the quality score.
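The article does not spell out the exact scoring function, so the snippet below is only one plausible reading: a piecewise-linear length score that reaches 0 once the output falls below one-third of the requested length (consistent with the observation later in the results), and a final score that averages the length and quality scores. The constants and the over-length branch are assumptions.

```python
def length_score(required: int, actual: int) -> float:
    """Piecewise-linear length score in [0, 100].

    Assumption: the score hits 0 when the output is under one-third of the
    requested length (matching the observation in the results section); the
    slope used to penalize over-long outputs is likewise a guess.
    """
    if actual <= 0:
        return 0.0
    ratio = required / actual if actual < required else actual / required
    return 100.0 * max(0.0, 1.0 - (ratio - 1.0) / 2.0)

def final_score(s_length: float, quality_dims: list[float]) -> float:
    """Average the length score with the mean of the six judged quality
    dimensions (relevance, accuracy, coherence, clarity, breadth and
    depth, reading experience), assuming a shared 0-100 scale."""
    s_quality = sum(quality_dims) / len(quality_dims)
    return (s_length + s_quality) / 2.0
```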
Validation results: LongWriter presents the output length measurements on LongWrite-Ruler and finds that AgentWrite successfully extends the output length of GPT-4o from a maximum of 2k words to roughly 20k words. LongWriter also assesses both the output quality and adherence to the required output length on LongBench-Write, showing that when AgentWrite's performance is evaluated, GPT-4o can successfully complete tasks with outputs under 2,000 words in length.
Supervised Fine-Tuning
LongWriter conducts training based on two of the latest open-source models, namely GLM-4-9B and Llama-3.1-8B. Both of these are base models and support a context window of up to 128k tokens, making them naturally suitable for training on long outputs. To make training more efficient, LongWriter adopts packed training with loss weighting. Training on the two models yields two resulting models: LongWriter-9B (short for GLM-4-9B-LongWriter) and LongWriter-8B (short for Llama-3.1-8B-LongWriter).
At the same time, LongWriter notices that if the loss is averaged by sequence, i.e., by taking the mean of each sequence's average loss within a batch, the contribution of each target token to the loss in long-output data would be significantly less than in data with shorter outputs. In LongWriter's experiments, this is also found to lead to suboptimal model performance on tasks with long outputs. Therefore, LongWriter chooses a loss weighting strategy that averages the loss by token, where the loss is computed as the mean of losses across all target tokens within a batch.
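The difference between the two weighting schemes is easiest to see in code. This PyTorch-style sketch (shapes and masking are illustrative) contrasts the per-sequence average, which shrinks each token's weight in long outputs, with the per-token average that LongWriter adopts:

```python
import torch

def sequence_mean_loss(token_losses: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Average each sequence's loss first, then average across the batch.
    A 30k-token target then counts no more than a 300-token one, so each
    of its tokens carries roughly 100x less weight in the gradient."""
    per_sequence = (token_losses * target_mask).sum(dim=1) / target_mask.sum(dim=1)
    return per_sequence.mean()

def token_mean_loss(token_losses: torch.Tensor, target_mask: torch.Tensor) -> torch.Tensor:
    """Average over all target tokens in the batch, so every target token
    contributes equally regardless of which sequence it belongs to."""
    return (token_losses * target_mask).sum() / target_mask.sum()

# token_losses: (batch, seq_len) per-token cross-entropy values;
# target_mask: 1.0 on target (output) tokens, 0.0 elsewhere.
```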
All models are trained on a node with 8xH800 80G GPUs, using DeepSpeed with ZeRO-3 and CPU offloading. LongWriter uses a batch size of 8, a learning rate of 1e-5, and a packing length of 32k. The models are trained for 4 epochs, which takes roughly 2,500-3,000 steps.
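For concreteness, the reported setup might map onto trainer arguments roughly as follows; the field names follow common Hugging Face/DeepSpeed conventions and are illustrative rather than taken from LongWriter's actual training scripts.

```python
# Illustrative mapping of the reported setup onto common trainer arguments;
# the exact flag names in LongWriter's training scripts may differ.
training_config = {
    "model_name_or_path": "THUDM/glm-4-9b",    # or meta-llama/Meta-Llama-3.1-8B
    "global_batch_size": 8,                    # batch size of 8 across 8xH800 GPUs
    "learning_rate": 1e-5,
    "num_train_epochs": 4,                     # roughly 2,500-3,000 steps in total
    "packing": True,
    "max_seq_length": 32768,                   # packing length of 32k
    "deepspeed": "ds_zero3_cpu_offload.json",  # ZeRO-3 with CPU offloading
    "loss_reduction": "token_mean",            # per-token loss weighting from above
}
```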
Alignment (DPO)
To further improve the model's output quality and enhance its ability to follow length constraints in instructions, LongWriter performs direct preference optimization (DPO) on the supervised fine-tuned LongWriter-9B model. The DPO data comes from GLM-4's chat DPO data (roughly 50k entries). Additionally, LongWriter constructs 4k pairs of data specifically targeting long-form writing instructions. For each writing instruction, LongWriter samples four outputs from LongWriter-9B, scores them following a specific method, and combines this with a length-following score. The highest-scoring output is then selected as the positive sample, and one of the remaining three outputs is randomly chosen as the negative sample.
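The preference-pair construction described above is simple to sketch. In the snippet below, `generate` and `score_output` are hypothetical stand-ins for sampling from LongWriter-9B and for the combined quality-plus-length-following scoring method:

```python
import random

def build_dpo_pair(instruction, generate, score_output, n_samples=4):
    """Sample n candidate outputs for one writing instruction, keep the
    highest-scoring one as the chosen response, and pick one of the rest
    at random as the rejected response."""
    candidates = [generate(instruction) for _ in range(n_samples)]
    ranked = sorted(candidates, key=score_output, reverse=True)
    return {
        "prompt": instruction,
        "chosen": ranked[0],                   # best-scoring output
        "rejected": random.choice(ranked[1:])  # one of the remaining three
    }
```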
The resulting model, LongWriter-9B-DPO, is trained for 250 steps on the above data mixture. LongWriter follows a specific recipe for DPO training.
LongWriter: Experiments and Results
LongWriter evaluates 4 proprietary models and 5 open-source models on LongBench-Write, along with the trained LongWriter models. To the best of LongWriter's knowledge, Suri-IORPO is the only prior model that is also aligned for long-form text generation; it is trained on Mistral-7B-Instruct-v0.2 using LoRA. Consistent with the evaluation setup on LongWrite-Ruler, LongWriter sets the output temperature to 0.5 and configures the model's maximum generation tokens parameter to the maximum allowed by its API call. For open-source models, it is set to 32,768.
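As a rough illustration, the decoding setup for the open-source models might look like the following transformers-style keyword arguments; the exact evaluation harness is not described in the article.

```python
# Illustrative decoding settings for the open-source models under evaluation
# (transformers-style generate() kwargs; the evaluation harness is not shown).
generation_kwargs = {
    "do_sample": True,
    "temperature": 0.5,
    "max_new_tokens": 32768,  # proprietary APIs: whatever maximum the API allows
}
```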
Most previous models are unable to meet the length requirement of over 2,000 words, whereas LongWriter models consistently provide longer and richer responses to such prompts.
Observing the output length score S_l for prompts in each required length range, LongWriter finds that previous models generally perform poorly (scoring below 70) on prompts in the [2k, 4k) range, with only Claude 3.5 Sonnet achieving a decent score. For prompts in the [4k, 20k) range, almost all previous models are completely unable to reach the target output length, even scoring 0 (meaning all output lengths are less than one-third of the required length). By adding training data from LongWriter-6k, LongWriter's trained model can effectively reach the required output length while maintaining good quality, as suggested by the scores in the [2k, 20k) range and the scatter plots.
DPO effectively improves both the model's output quality and its ability to follow length requirements in long generation.
By comparing the scores of LongWriter-9B and LongWriter-9B-DPO, we find that DPO significantly improves both the S_l (+4%) and S_q (+3%) scores, and the improvement is consistent across all ranges. This shows that in long-generation scenarios, DPO still helps improve the model's output quality and can better align the model's output length with the requested length. The latter conclusion has also recently been observed by Yuan et al. (2024) in shorter generations. We also manually annotate pairwise wins and losses between GPT-4o and three LongWriter models on their outputs on LongBench-Write. Humans prefer the DPO-trained model over LongWriter-9B in 58% of cases. Moreover, despite having fewer parameters, LongWriter-9B-DPO achieves a tie with GPT-4o.
The output length limit of the LongWriter models is extended to between 10k and 20k words, while more data with long outputs is required to support even longer outputs.
Following the LongWrite-Ruler test, we also present the LongWrite-Ruler results for the LongWriter models. The results suggest that their maximum generation lengths are between 10k and 20k words. The lack of SFT data with longer outputs is likely the primary reason preventing the models from achieving longer output lengths.
Final Thoughts
In this work, we have discussed LongWriter, which identifies a 2,000-word generation limit for current LLMs and proposes increasing their output window size by adding long-output data during alignment. To automatically construct long-output data, LongWriter develops AgentWrite, an agent-based pipeline that uses off-the-shelf LLMs to decompose ultra-long generation tasks into subtasks and create extended, coherent outputs. LongWriter successfully scales the output window size of current LLMs to over 10,000 words with the constructed LongWriter-6k dataset. Extensive ablation studies on the training data demonstrate the effectiveness of this approach. For future work, LongWriter suggests the following three directions: 1. Expand the AgentWrite framework to construct data with even longer outputs, further extending LLMs' output window size. 2. Refine the AgentWrite framework to achieve higher-quality long-output data. 3. Longer model outputs bring challenges to inference efficiency; several methods have been proposed to improve inference efficiency, and it is worth investigating how they can ensure improved efficiency without compromising generation quality.