Data Quality Is All You Need?
- Author: Shashank Shekhar (@sshkhr16)
Microsoft's Phi-4 is a small (14B parameters) language model that is a massive testament to the importance of data quality in training Large Language Models (LLMs). phi-4 uses the exact same architecture as the phi-3 model, which came out 8 months ago - yet it outperforms phi-3 on benchmarks as well as ✨vibe checks✨.
In fact, when I went through their 36-page long technical report, what astounded me was the fact that only one paragraph is devoted to details of the model architecture, and the rest of the report talks almost exclusively about the data or evaluation pipeline.

Through this post I will walk through the training data collection and curation pipeline used in training LLMs today (after all, phi-4 is less than six months old). I picked phi-4 to review over, say, DeepSeek-V3, since the latter's report focuses heavily on the architecture as well as the infrastructure used. Rightfully so, there are a lot of cool systems-level innovations that drove the success of DeepSeek-V3. But data, architecture and infrastructure innovations are largely independent and symbiotic - parallel innovations in each of these areas are driving the success of LLMs.
LLM Training: A Brief Primer

Before we proceed with discussing the data pipeline for phi-4, it would be helpful to understand the generic training pipeline used to train LLMs. It involves three stages:
- Pre-Training: Unsupervised learning of LLMs for next token prediction on large unlabeled data corpus
- Supervised Fine-Tuning (SFT): Supervised learning of LLMs for things like chat modelling on a smaller labeled data corpus
- Reinforcement Learning From Human Feedback (RLHF): Reinforcement learning of LLMs to align their outputs with human-desired values by training on human preference data
SFT and RLHF are often grouped together under the banner of "Post-Training" - since pre-training comprises the bulk of compute FLOPs involved in training the LLM compared to these two.

If you are unfamiliar with some of these terms and would like a deeper dive, the article Data is the Foundation of Language Models on the Deep (Learning) Focus Substack is a great and accessible read.
Now that we have the big-picture view of the three training methodologies, let's do a deep dive into the data pipeline for each training stage in phi-4. The rest of this blog post discusses the data pipeline used in phi-4, in a chronological order as the model saw it.

As you can see in the pyramid above, phi-4 does not strictly follow the same training hierarchy as mentioned earlier, as there are two key changes:
Instead of reinforcement learning for human-preference optimization, it uses DPO, i.e. Direct Preference Optimization. DPO is a simplified reformulation of the RLHF training objective that sidesteps reinforcement learning (and a separately trained reward model) by optimizing the preference objective directly with a supervised loss. Hence the optimization problem is simpler and tends to be more stable during training (a minimal sketch of the DPO objective follows below). For further details, I found this article on HuggingFace to be a great and accessible read: Simplifying Alignment: From RLHF to Direct Preference Optimization (DPO)
It introduces a new stage "Mid-Training" between pre-training and post-training. Mid-training is new terminology for even active researchers, as when I asked my friends pursuing PhDs in NLP only one of them was aware of the term. I will discuss it in further detail in the corresponding section on Mid-Training, but it is essentially a curriculum step after pre-training that optimizes for objectives beyond next-token prediction.
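Coming back to the first change: here is a minimal, illustrative sketch of the DPO objective in PyTorch - the general technique, not phi-4's actual training code. The per-sequence log-probabilities of the chosen and rejected responses are assumed to be pre-computed under both the policy being trained and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO objective: nudge the policy to prefer the chosen response
    over the rejected one, measured relative to a frozen reference model."""
    # Log-ratios of policy vs. reference for each response in the pair
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry style preference loss on the difference of log-ratios
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```

Everything reduces to a standard supervised loss over preference pairs, which is why DPO tends to be simpler and more stable to train than a full RLHF loop.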
While there were quite a lot of useful details in the phi-4 report, especially on the Synthetic Data and Preference data curation, I was a bit disappointed with the lack of details on the reasoning and long-context data in pre-training, as well as all of the SFT data. It is, of course, by design - since data is the secret sauce of machine learning, frontier model labs are unwilling to give it away for others to build upon as the AI race heats up. I have tried to provide additional context on standard techniques wherever details were missing under the "Author's Notes" callouts (this one is an example).
Pre-Training
The phi-4 training process involved the model being trained on nearly 10 Trillion tokens!

phi-4 uses the tiktoken tokenizer from OpenAI. All the tokens it sees come from a vocabulary of size 100,352.
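For illustration, here is what tokenization with tiktoken looks like in practice. Note that cl100k_base is used purely as a stand-in encoding here; it is not necessarily the exact vocabulary phi-4 ships with.

```python
import tiktoken

# cl100k_base is an illustrative stand-in encoding;
# phi-4's tokenizer is tiktoken-based with a vocabulary of 100,352 tokens.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Data quality is all you need?")
print(tokens)              # a list of integer token ids
print(enc.decode(tokens))  # round-trips back to the original string
print(enc.n_vocab)         # vocabulary size of this particular encoding
```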
The 10T pretraining tokens are made up of a cocktail of data from various sources. This table shows each of these sources in summary, as well as how they add up to the total pre-training dataset size.
| Data Source | Fraction of Training | Unique Token Count | Number of Epochs | Total Token Count |
|---|---|---|---|---|
| Web | 15% | 1.3T | 1.2 | 1.56T |
| Web rewrites | 15% | 290B | 5.2 | 1.508T |
| Synthetic | 40% | 290B | 13.8 | 4.002T |
| Code data | 20% | 820B | 2.4 | 1.968T |
| Acquired sources | 10% | 580B | 1.7 | 0.986T |
| Total | 100% | | | ~10T |
Over half of the pre-training data (40% Synthetic + 15% Web rewrites) was not human-generated data - instead it was generated by other LLMs with a lot of expert curation 🤯 Compare this to the Llama-2 13B model of a similar size, which came out about a year and a half before phi-4 and was trained on a similar order of real-world data (2T tokens) - but with 0% synthetic training data!
Let's do a deeper dive into what each of these pre-training data sources consist of. Since the Synthetic Data is the single largest, and perhaps most promising part of the training mix, we will discuss it last and in detail.
A. Web Data (15%)
As is standard for LLM training, a lot of web based data was utilized for phi-4:
A.1 Q&A Web Datasets
According to the authors, they knew that Q&A datasets are important for model capabilities like mathematical reasoning and academic performance. They collected "tens-of-millions" of "high-quality" Q&A data from public websites, existing datasets, and acquired external datasets.
Standard Q&A datasets likely include MMLU, GSM8K, and collections from BigBench. The "tens of millions" scale suggests scraping from Stack Exchange, Reddit's AskScience, and similar platforms. See The Pile for how Q&A data is typically sourced and filtered. It would be safe to consider that several of the standard Q&A datasets were used (check out List of Question Answering datasets on Huggingface).
Besides using the datasets, there were a bunch of data augmentations performed on the Q&A datasets to increase their size and quality. The authors mention using "several ways to synthetically augment the dataset of organic questions to obtain a larger dataset" and also "replaced the (wrong) answers with synthetically generated ones and used majority-voting to increase accuracy".
Common augmentation methods include paraphrasing via InstructGPT techniques and back-translation. Microsoft's GLAN paper details systematic question generation approaches. See also Self-Instruct for automated instruction generation methods.
This is an active research area within NLP with various approaches - PapersWithCode lists 15 research papers on question rewriting.
The phi-4 authors mention that "organic questions" i.e. those obtained from human data on the internet are more useful for training LLMs than "synthetic questions" i.e. those generated by another LLM. However, they also mention that while "rewritten questions improved the model's capabilities, the gains were not as pronounced", implying that the data augmentation helped, but not by a lot.
While specific answer verification methods aren't detailed, this likely involved techniques similar to those in automated fact-checking (see A Survey on Automated Fact-Checking).
Incorrect answers were replaced by "synthetic ones", i.e. other LLMs were used to generate candidate correct answers, followed by majority voting - several LLM-generated answers are sampled, and the response that recurs across multiple synthetic answers is kept.
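As a rough sketch of what such majority voting might look like (the `generate_answer` wrapper, the normalisation, and the agreement threshold are hypothetical, not from the report):

```python
from collections import Counter

def majority_vote(question, generate_answer, n_samples=8):
    """Sample several answers from an LLM and keep the most frequent one.
    `generate_answer` is a hypothetical wrapper around an LLM call; the
    normalisation and agreement threshold are illustrative choices."""
    answers = [generate_answer(question).strip().lower() for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    # Only trust the consensus if a clear plurality of samples agrees
    return answer if count >= n_samples // 2 else None
```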
A.1.1 Decontamination of Q&A Data
In order to avoid training on questions that are included in benchmark datasets used for performance evaluation, a decontamination algorithm was used on the Q&A data to remove such questions (and answers). The ARC-Easy, MBPP, phibench, CommonsenseQA, WinoGrande, mcphi, MedQA, MATH, AGIEval, PIQA, OpenBookQA, HellaSwag, GPQA, mt-bench, MMLUPro, GSM8k, HumanEval, arena hard, ARC-Challenge, and MMLU benchmarks were checked against. The exact algorithm is not key, but 7-gram and 13-gram features were used (i.e. sequential combinations of 7 and 13 words) and then compared against the features from benchmark sets.
The 7/13-gram approach follows GPT-3's decontamination (Section 4.1). For more robust methods, see Detecting Pretraining Data from LLMs which uses Min-K% Prob for contamination detection.
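A minimal sketch of an n-gram overlap check in this spirit is given below; the real pipeline certainly involves more normalisation, hashing, and tuning than this.

```python
def ngrams(text, n):
    """Set of word-level n-grams in a piece of text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample, benchmark_texts, n=13):
    """Flag a training sample if it shares any n-gram with a benchmark item.
    This only mirrors the spirit of the 7/13-gram check described above."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(b, n) for b in benchmark_texts)
```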
A.2 Multilingual Web Data
Datasets from CommonCrawl and Wikipedia for languages like German, Spanish, French, Portuguese, Italian, Hindi, and Japanese were incorporated. The multilingual data filtering pipeline consisted of first classifying data into 176 languages using a simple text classifier from fasttext and then filtering data for quality using classifiers (discussed in next subsection).
See mT5 and ROOTS for multilingual filtering pipelines. CulturaX details language-specific quality thresholds. For tokenization challenges, see No Language Left Behind.
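The fastText language-identification step mentioned above is easy to reproduce with the public lid.176.bin model (the downstream quality-filtering thresholds, however, are not public):

```python
import fasttext

# lid.176.bin is fastText's public language-identification model covering
# 176 languages; it needs to be downloaded separately before running this.
model = fasttext.load_model("lid.176.bin")

labels, probs = model.predict("Wie viel kostet das Ticket nach Berlin?", k=1)
print(labels[0], probs[0])  # e.g. ('__label__de', 0.99)
```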
A.3 Filtered Web Dumps
Web dumps are just large, collated collections of content from the web; a classic example I am familiar with is CommonCrawl. Think, for example, of all the posts and comments from Reddit, all of StackExchange, every blog post, etc.
Web dumps, as implied by their name, are large uncurated collections which don't separate 'good' training data from 'bad' when it comes to training LLMs. The phi-4 authors did several filtering and cleanup steps to obtain "information-rich web sources".
First, small text classifiers were trained on LLM-generated annotations of data quality. The web dumps were then filtered using these text classifiers to retain only high-quality documents.
The authors found that this pipeline over-indexes on STEM related keywords, so they trained a separate quality classification model for other topics (e.g. arts, history, travel, culture, and entertainment).
From the filtered documents, corrupted text and binary files were removed. Outlier detection based on n-gram statistics and compression ratios was used for this purpose.
Compression ratio is a measure of how information-dense a piece of text is. In The Story Behind Data Compression and LLM Performance, the authors show theoretically and empirically that training on data with a low compression ratio tends to help LLM performance during SFT and RLHF. See also Compression Represents Intelligence Linearly for a theoretical discussion of how the compression efficiency of text data corresponds to LLM capabilities.
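As a toy illustration of the compression-ratio signal used for outlier detection (the thresholds here are made up; the report does not give the ones actually used):

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size: a cheap proxy for how
    repetitive vs. information-dense a piece of text is."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)

def looks_like_outlier(text: str, low: float = 0.15, high: float = 0.95) -> bool:
    """Crude outlier filter: extremely compressible text is often
    boilerplate or spam, while text that barely compresses at all is often
    binary junk. The thresholds here are illustrative."""
    r = compression_ratio(text)
    return r < low or r > high
```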
Finally, a custom HTML-to-text extractor was used on web data, to preserve text that gets corrupted by naive parsers (e.g., TeX/MathML equations, code blocks, tables, and forum thread structure). This parser is based on various heuristics such as HTML tag names, CSS classes, content length, and depth of the DOM tree.
While details are missing in the phi-4 paper, I would like to emphasize that good parsing of web data for LLM training is a very challenging problem in and of itself. There are entire companies (Unstructured.io, Datalab, JINA Reader, among others) focused on solving this problem, and most large companies have data engineers who build ETL pipelines for converting web data into something more amenable to LLM training.
Trafilatura and Readability are standard tools. For math extraction, see LaTeX-OCR and MathPix. The CommonCrawl extraction pipeline shows production-scale HTML processing.
B. Code Data (20%)
Permissively licensed public data from GitHub was used to train the model. Code data tends to be of very high quality for training LLMs: programming language grammars are formally defined and fully specified, the syntax is far more constrained than that of natural languages, and many of the eccentricities and redundancies built into human language (like morphology) are absent from programming languages.
A cool paper from a few years ago, Language Models of Code are Few-Shot Commonsense Learners, showed that "pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all".
C. Acquired Sources (10%)
The authors mention using reasoning-dense documents that are either publicly permissible to use (e.g. arXiv, PubMed Central) or were explicitly licensed by them (e.g. books) to include data that is more comprehensive and clean compared to web sourced data.
So, it seems like research papers and textbooks largely comprised this acquired sources data.
Custom data extraction and cleaning pipelines were built for each of these sources "(e.g., multi-file TeX source, ePub and other XML-like formats, Microsoft Word documents, and PDFs)".
Similar to parsing web data for LLMs, parsing structured documents for LLMs is also a huge area of research and engineering in itself. Every major cloud provider has a service for it e.g. Amazon's Textract, Google's Document AI. There are tons and tons of other open-source and partly open-source tools for this as well: llamaparse, Mistral-OCR to name a few.
D. Synthetic Data (40%)
Synthetic data generated from other LLMs was used to train phi-4 quite extensively. Over 50 types of synthetic datasets were created - corresponding to around 400B unweighted tokens, using different sets of seeds and prompting procedures. The synthetic data generation follows an extensive pipeline, which is described below:
D.1. Seed Curation
First, high-quality "seeds" (i.e. real data that can be used as a starting point to generate synthetic data) were curated. The seeds are used for creating exercises, discussions, and reasoning tasks later in the synthetic data pipeline.
D.1.1 Web and Code-based Seeds
The authors:
- First classify pages/books/code repos with "strong educational potential"
- Then segment selected pages into passages, with associated scores for factual and reasoning content
An example of an excerpt passage annotated with this metadata is given (from Appendix D):
Start Line: 8, End Line: 11
Information Type: Empirical Result Discussion
Brief Description: Discussion of the role of SNPs in AD development and severity.
Complexity Level: B
Factual Obscurity: C
Chain of Reasoning: True
Reasoning Steps:
1. Assumption: Specific SNPs influence immune response.
Conclusion: These SNPs may affect AD severity by modulating IgE levels.
Description: The study links genetic variations to clinical phenotypes.
2. Assumption: SNPs associated with AD are replicated across cohorts.
Conclusion: Replication supports their role as potential genetic markers.
Description: Statistical analysis confirms the robustness of the findings.
Dependencies: Previous GWAS findings, statistical methods (lines 10-15)
The authors do not provide any details on how this classification for "strong educational potential" or tagging of metadata was done. But Textbooks Are All You Need (phi-1) was one of the first to introduce this approach of choosing training data based on its educational content. The FineWeb-Edu paper describes training classifiers on educational annotations. See also QWEN's data pipeline for educational scoring metrics.
Other works that discuss this: Longformer introduced sliding window approaches for passage segmentation. Dense Passage Retrieval uses 100-word passages. For educational segmentation specifically, see MathInstruct.
D.1.2 Question Dataset Seeds
Questions were sourced from websites, forums, and Q&A platforms, then filtered with a plurality-based technique: several answers were generated from the phi-4 model, and questions falling into either of the following cases were discarded:
- All answers agreed (question too easy for phi-4)
- Answers entirely inconsistent (question too hard/ambiguous for phi-4)
This gives a dataset of questions that are just 'challenging enough' for the model (the human analogue would be the Zone of Proximal Development).
The paper mentions "The plurality answers were used in place of the ground truth in our rejection-sampling based generations" but doesn't elaborate on how exactly these plurality answers (from the majority voting process) were integrated into the rejection sampling pipeline. This likely means that instead of using the original "correct" answers from the datasets, they used the consensus answers from multiple model generations as the target for training.
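Here is a toy sketch of the plurality-based filter described above; `generate_answer`, the sample count, and the agreement thresholds are all illustrative assumptions rather than details from the report.

```python
from collections import Counter

def keep_question(question, generate_answer, n_samples=8):
    """Keep only questions of intermediate difficulty for the model.
    `generate_answer` is a hypothetical wrapper around the model being
    probed; the sample count and thresholds are illustrative."""
    answers = [generate_answer(question).strip() for _ in range(n_samples)]
    (plurality_answer, top_count), = Counter(answers).most_common(1)
    if top_count == n_samples:   # all answers agree: too easy, discard
        return None
    if top_count == 1:           # no agreement at all: too hard/ambiguous, discard
        return None
    # The plurality answer later stands in for ground truth in rejection sampling
    return question, plurality_answer
```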
D.1.3 Q&A Pairs Seeds from Acquired Sources
LLMs were used to generate Q&A pairs from the code, books, and scientific data collected (mentioned previously).
No further details are given, but something akin to Generate question/answer training pairs out of raw text was probably done where a prompt like this extracts questions:
You are an expert user extracting information to quiz people on documentation. You will be passed a page extracted from the documentation, write a numbered list of questions that can be answered based *solely* on the given text.
And a follow-up prompt like this one extracts answers:
You are an expert user answering questions. You will be passed a page extracted from a documentation and a question. Generate a comprehensive and informative answer to the question based *solely* on the given text.
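A rough sketch of how such a two-step extraction could be wired up with the OpenAI Python client (the model choice and the prompt plumbing are assumptions; the report does not say which generator model was used):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; the model choice is illustrative

QUESTION_PROMPT = "You are an expert user extracting information ..."  # question prompt quoted above
ANSWER_PROMPT = "You are an expert user answering questions. ..."      # answer prompt quoted above

def qa_pairs_from_page(page: str, model: str = "gpt-4o"):
    """Two-step seed generation: extract questions from a page, then answer
    each question using only that page as context."""
    questions = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": QUESTION_PROMPT},
                  {"role": "user", "content": page}],
    ).choices[0].message.content.splitlines()

    pairs = []
    for q in filter(None, (q.strip() for q in questions)):
        answer = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": ANSWER_PROMPT},
                      {"role": "user", "content": f"Page:\n{page}\n\nQuestion: {q}"}],
        ).choices[0].message.content
        pairs.append((q, answer))
    return pairs
```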
D.2. Rewrite and Augment
Seeds are transformed into synthetic data by using LLMs to rewrite the useful content into "exercises, discussions, or structured reasoning tasks."
D.2.1 Creating Conversations from Snippets
One component of the synthetic data generation pipeline involves converting seed snippets into discussions, following these steps:
First, the model generates an initial foundational response. Optionally, the tone and style are guided through the generation by curating the participant profile.
Next, a sequence of dialogue turns is generated by LLM agents which direct the conversation, summarize past dialogue context, and inject new scenarios/topics consistent with prior messages in the dialogue.
Lastly, self-revision is used after each dialogue turn to evaluate the responses, in order to maximize nuance and complexity. See the next subsection for an example of a self-revision prompt used.
D.2.2 Fill-in-the-middle Exercises
Fill-in-the-middle exercises represent an innovative approach to creating language training data from existing code snippets. The phi-4 authors used a technique where they take functional code and strategically remove portions, creating exercises that require the model to reconstruct the missing segments.
FIM was introduced in Efficient Training of Language Models to Fill in the Middle. CodeLlama (Section 2.3) details infilling implementation for code. StarCoder shows FIM's impact on real-world code completion.
The process works by taking code snippets, removing a portion, and writing a prompt (given below) that describes the code context and asks a model to generate a text description and code solution for the missing part. When the generated response matches the removed code snippet, it is included in the training data (a toy construction sketch follows after the example below).
Fill-in-the-Middle Exercise Generation Prompt
Fill-in-the-Middle Exercise
Context (Excerpt from Code Snippet):
```python
for p in [path + [n] for n in graph[node]]:
    smalls = collections.defaultdict(lambda: 0)
    for n in path:
        if all_c['start'] > 1 or all_c['end'] > 1:
            continue
    for c in smalls.values():
        if c <= 1:
            continue
        <<THIS PART DELETED>>
    if not bad:
        paths.append(p)
```
Solution:
Background: This snippet processes paths in a graph, tracking visits to nodes, particularly "small" nodes identified by lowercase strings. Key conditions:
- Start and end nodes cannot be revisited.
- "Small" nodes may have specific limits on revisits.
Reasoning: The missing part handles violations of small node visit limits using two variables: `cused` indicates if one small node has already been revisited, and `bad` marks the path as invalid if a second such revisit occurs. The loop iterating over `smalls.values()` likely checks whether a small node is revisited for the first time (allowing it by setting `cused` to `True`) or violates the limit (setting `bad` to `True` and breaking the loop).
Final Guess: The missing part is likely:
```python
if c > 1:
    if not cused:
        cused = True
    else:
        bad = True
        break
```
Comparison to Correct Answer:
- The solution captures the general logic but omits the specific condition `elif c == 2 and not cused`, making it broader.
- Correctness Discussion: The logic is close but lacks precision, as the revisits must explicitly check for exactly two visits.
Grading and Feedback:
- Grade: 3 (partial correctness).
- Hint: Consider revisiting the specific condition for revisits. Focus on the exact count of visits that should trigger allowances.
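To make the construction concrete, here is a toy sketch of how such fill-in-the-middle training pairs might be built and filtered. The hole-selection heuristic and the exact-match criterion are illustrative assumptions, and `regenerate` stands in for an LLM call.

```python
import random

def make_fim_example(code: str, regenerate, hole_size=4):
    """Toy construction of a fill-in-the-middle exercise from a code snippet.
    `regenerate` is a hypothetical LLM call that returns a guess for the
    deleted span given the surrounding context; in the spirit of the report,
    only exercises whose guess matches the deleted code are kept."""
    lines = code.splitlines()
    assert len(lines) > hole_size + 2, "snippet too short to punch a hole into"

    start = random.randint(1, len(lines) - hole_size - 1)
    deleted = lines[start:start + hole_size]
    context = lines[:start] + ["<<THIS PART DELETED>>"] + lines[start + hole_size:]

    guess = regenerate("\n".join(context))
    keep = guess.strip() == "\n".join(deleted).strip()
    return {"context": "\n".join(context), "target": "\n".join(deleted), "keep": keep}
```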
D.2.3 Agent task trajectories
AgentKit, a framework for LLM prompting for multifunctional agents, was used to create task trajectories for long-horizon tasks with chain-of-thought reasoning. The phi-4 authors ran AgentKit on various environments, and rewrote the guided raw reasoning outputs into a variety of self-contained thoughts and statements. This led to improvements in planning, reasoning, tool use, math, and error correction on Microsoft's internal benchmarks.
Example of phi-4 interacting with Tower of Hanoi
Analyzing the current situation and reflecting on past actions is crucial for determining the next best step. Here's a breakdown of the thought process:
### Analysis of Current and Past Environment
1. **Current Configuration:**
- Rod A: |bottom, [0], top|
- Rod B: |bottom, [1], top|
- Rod C: |bottom, [2], top|
2. **Past Performance:**
- There have been repeated unsuccessful attempts to move disks, particularly from rod B to C and A to C. These failures were due to rule violations, such as attempting to place a larger disk on a smaller one.
3. **Environmental Patterns and Rules:**
- Only one disk can be moved at a time.
- A larger disk cannot be placed on a smaller disk.
- The goal is to move all disks from rod A to rod C.
### Reflection on Past Performance
- **Mistakes Identified:**
- Repeated attempts to make illegal moves indicate a misunderstanding or oversight of the rules.
- The plan was not effectively followed, as seen by the repeated failed attempts to move disks from rod B to C and A to C.
- **Correct Moves:**
- Successfully moved disk 0 from C to A, which was a necessary step to free rod C for subsequent moves.
### Relevant Considerations
- **Current Objective:**
- The immediate goal is to move disk 1 from rod B to rod C. This move is legal and aligns with the rules since disk 1 is smaller than disk 2 on rod C.
- **Strategic Alignment:**
- Moving disk 1 from B to C is a strategic step that aligns with the overall plan to eventually stack all disks on rod C in the correct order.
### Additional Thoughts
- Ensuring compliance with the rules is paramount to avoid repeated failures.
- Each move should be evaluated for legality before execution to maintain progress towards the goal.
### Updated Plan
1. **Move Disk 1 from B to C:**
- Completion Quantification: Disk 1 is on rod C.
- Certainty: Known. This move is legal and necessary to progress towards the goal configuration.
2. **Move Disk 0 from A to C:**
- Completion Quantification: Disk 0 is on rod C, on top of disk 1.
- Certainty: Known. This move is legal once disk 1 is on rod C.
3. **Move Disk 2 from C to B:**
- Completion Quantification: Disk 2 is on rod B.
- Certainty: Known. This move is necessary to free rod C for the final sequence.
### Chosen Action
Based on the analysis and the updated plan, the next logical and legal action is to move disk 1 from rod B to rod C. This move adheres to the rules and progresses towards the goal configuration.
```xml
<updated plan>
1. **Move Disk 1 from B to C**:
- Completion Quantification: Disk 1 is on rod C.
- Certainty: Known. This move is legal and necessary to progress towards the goal configuration.
2. **Move Disk 0 from A to C**:
- Completion Quantification: Disk 0 is on rod C, on top of disk 1.
- Certainty: Known. This move is legal once disk 1 is on rod C.
3. **Move Disk 2 from C to B**:
- Completion Quantification: Disk 2 is on rod B.
- Certainty: Known. This move is necessary to free rod C for the final sequence.
</updated plan>
<action>Move the top disk of rod B to the top of rod C</action>
```
D.3. Self-Revision
The responses then go through an iterative self-revision process, where a model (not named, but probably GPT-4o) is used to critique and improve its outputs, with rubrics focused on reasoning and factual accuracy.
Self-Revision Prompt Example
Excerpt: "Future studies should replicate our findings for OXTR and BDNF, but also include additional stress-related candidate genes. [...] could reverse aberrant DNA methylation-could become an important goal in the development of new treatment approaches."
Revision 0:
Plan: To create an advanced reading comprehension exercise:
- Incorporate technical terms from genetics and epigenetics.
- Add nested counterfactuals and conditional reasoning.
- Require logical reasoning and background knowledge.
Exercise: Researchers focused on DNA methylation following psychosocial stress. What conclusion can be drawn from studies examining both aversive and positive experiences, diversified cohorts, and refined sampling intervals?
- A monotonic relationship between stress and methylation is confirmed.
- Differential methylation patterns reveal bidirectional sensitivity (correct).
- Early-life adversities remain the primary influence.
- Positive experiences have negligible effects.
Requirements (Flavored):
- Exercises must require both understanding the text and external advanced knowledge.
- Incorrect choices should be plausible enough to mislead without full comprehension.
- The content should challenge advanced undergraduates in genetics or epigenetics.
Critique (Agent):
- External Knowledge: Does not explicitly require concepts beyond the excerpt. Score: 1
- Plausibility of Choices: Incorrect options are plausible but not misleading enough. Score: 2
- Suggestions: Introduce external concepts like epigenetic drift or the diathesis-stress model, and refine incorrect choices to address common misconceptions.
Revision 1:
Plan: Add references to the hypothalamic-pituitary-adrenal (HPA) axis and cortisol's role in stress responses, integrating advanced neuroendocrinology knowledge.
Exercise: Considering DNA methylation and the HPA axis's role, how could cortisol influence classical and non-classical epigenetic changes?
- Cortisol is irrelevant to the modifiers discussed.
- Cortisol effects are linear and align with classical models.
- The dynamic epigenetic role of cortisol enriches research paradigms (correct).
- Cortisol's role is limited to downregulation of methylation.
Critique (Agent):
- Challenge Level: Still insufficiently difficult for advanced undergraduates. Score: 1
- Suggestions: Add nuanced alternatives based on theories like eustress vs. distress or glucocorticoid response elements.
Revision 2:
Plan: Refine incorrect options and add concepts like glucocorticoid response elements to deepen the challenge. Reframe exercise to compare classical and non-classical pathways in epigenetics.
D.4. Instruction Reversal
Instruction Reversal is a technique used to improve the model's ability to generate outputs from instructions.
The instruction reversal technique originates from Self-Alignment with Instruction Backtranslation. It is similar to the older Unnatural Instructions technique, but generates instructions from outputs rather than vice versa. Code Alpaca implements this for code.
- First, code snippets from the code data corpus are used to generate problem descriptions/task prompts with corresponding instructions.
- The resulting data is paired so that the instruction appears before the code during training.
- The code is then regenerated from the instruction using an LLM, and its fidelity to the original code is measured.
- Only data where the fidelity between the original code and the regenerated code is high is retained for pre-training (a rough sketch follows below).
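A minimal sketch of this loop, with `describe` and `regenerate` as hypothetical LLM wrappers and a simple string-similarity score standing in for whatever fidelity measure was actually used:

```python
import difflib

def instruction_reversal(code: str, describe, regenerate, threshold=0.9):
    """Sketch of instruction reversal: derive an instruction from existing
    code, regenerate code from that instruction, and keep the pair only if
    the regenerated code closely matches the original. The similarity metric
    and threshold are illustrative stand-ins."""
    instruction = describe(code)           # code -> natural-language task description
    regenerated = regenerate(instruction)  # task description -> fresh code
    fidelity = difflib.SequenceMatcher(None, code, regenerated).ratio()
    return {"instruction": instruction, "code": code} if fidelity >= threshold else None
```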
D.5. Validation
Further validation of the generated synthetic data is done wherever possible:
- For code data, the code is executed and tests are run to ensure it is reasonable.
- For scientific data, questions are extracted to ensure "high relevance, groundedness, and difficulty balance".
NLP has long-standing research problems, such as entailment and grounding, which investigate verifying model-generated text against real-world information. Microsoft published a recent article called Claimify which showcases an approach to verifying claims made in LLM outputs.
E. Web Rewrites (15%)
According to the phi-4 technical report:
"Web rewrites is a sub-category of synthetic data that is substantially large and contains direct rewrites of web content"
Since no further information is available in the report, I am going to go out on a limb and assume that it is made up of the web data directly passed to an LLM (like GPT-4o) with some simple prompt instruction like "Rewrite this text to preserve all semantic and syntactic information, without introducing any new information or removing any existing information...."
Rewriting text to generate diversity is not the only use case for LLM-based rewriting. In fact, human queries to chat-based LLMs are often rewritten by another lightweight language model to produce a new query that is better suited to extracting the relevant information from the LLM and its knowledge base. See: Query Rewriting for Retrieval-Augmented Large Language Models
Mid-Training
This was the first time I came across the term "mid-training" for LLMs. Granted, I haven't done deep dives into LLM papers in the last 1-1.5 years, but it seems like mid-training is a new term for LLM folks too. The article What's the deal with mid-training? is a great summary. In short, it is a re-branding of training steps that most people would previously have considered part of pre-training, but these steps seem key to implementing a training curriculum which improves the reasoning or long-context performance of LLMs.
For phi-4, the mid-training process specifically involved expanding the context length in training from 4K tokens to 16K tokens. Context length is the number of tokens a model considers as part of its likelihood when calculating the conditional probability for generating the next token in an autoregressive manner. If that sounds like mumbo-jumbo, just think of the context length as the tokens previously generated that influence the generation of the next token, in addition to the model's knowledge of course.
How is this done? All "high-quality" organic data sources (i.e. the Acquired Sources and Code Data) were filtered to obtain the samples above 8K tokens in length. Next, the samples that were 16K tokens or longer were up-weighted (i.e. they would appear more frequently during this training stage).
The mid-training step was carried out for a total of 250B tokens.
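A toy version of the length-based filtering and up-weighting described above might look like this (the up-weighting factor is a guess; the report does not state the actual sampling weights):

```python
def midtraining_mix(documents, tokenizer, min_len=8_000, upweight_len=16_000,
                    upweight_factor=2):
    """Toy selection of long-context mid-training data: keep documents above
    8K tokens and repeat the 16K+ ones more often."""
    mix = []
    for doc in documents:
        n_tokens = len(tokenizer.encode(doc))
        if n_tokens < min_len:
            continue
        repeats = upweight_factor if n_tokens >= upweight_len else 1
        mix.extend([doc] * repeats)
    return mix
```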
DeepSeek-V3 and V2 use a two-phase mid-training stage for increasing context length. Long Context Scaling Laws explains why gradual, curriculum-based context extension works.
Post-Training
As mentioned earlier, post-training is done to turn the LLM into a chat assistant, and to make sure that its responses align with human values (to be more pedantic, the values of the folks training the model).
Supervised Fine-Tuning
The SFT training step was carried out for a total of 8B tokens.
The data is formatted using the chatml format, to train the model to carry out conversations with humans (or, increasingly, other AI users). An example of the data format seen by the model is given below:
```
<|im_start|>system<|im_sep|>system message<|im_end|>
<|im_start|>user<|im_sep|>prompt1<|im_end|><|im_start|>assistant<|im_sep|>response1<|im_end|>
<|im_start|>user<|im_sep|>prompt2<|im_end|><|im_start|>assistant<|im_sep|>
```
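A small helper that renders conversations into this format might look as follows (a sketch for illustration, not the actual phi-4 data tooling):

```python
def to_chatml(system: str, turns: list[tuple[str, str]]) -> str:
    """Render a conversation into the chatml-style format shown above.
    `turns` is a list of (user_prompt, assistant_response) pairs; leave the
    final response empty to produce a generation prompt."""
    parts = [f"<|im_start|>system<|im_sep|>{system}<|im_end|>"]
    for user, assistant in turns:
        parts.append(f"<|im_start|>user<|im_sep|>{user}<|im_end|>")
        if assistant:
            parts.append(f"<|im_start|>assistant<|im_sep|>{assistant}<|im_end|>")
        else:
            parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)
```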
The details of the SFT dataset used for phi-4 are quite sparse. Here is what we know:
- Data was taken from diverse domains like "math, coding, reasoning, conversation, model identity, and safety".
- Multilingual data was added from 40 languages.
- Hallucination mitigation was done using additional synthetic data generated from GPT-4o (see section on SFT Hallucination Mitigation)
Preference Optimization
phi-4 uses DPO (Direct Preference Optimization) as its algorithm for aligning the LLM's outputs with human preferences, and also to apply safety-related controls.
DPO is done over two rounds: the first uses a technique introduced in the paper called Pivotal Token Search (PTS) to generate around 250K preference pairs, followed by judge-guided DPO where approximately 850K pairs of desired versus undesired outputs are used.
Both stages of DPO mention using safety data for preference optimization. See Constitutional AI for systematic safety data generation. Red Teaming Language Models details adversarial prompt collection. Microsoft's Safety-Tuned LLaMAs describes their safety taxonomy.
Pivotal Token DPO
| Dataset Name | Sample Count |
|---|---|
| unknown + safety data | 3,000 |
| generic multiple-choice Q&A | 132,859 |
| math data | 76,552 |
| python data | 16,080 |
| cpp, go, java, js, rust data | 21,806 |
| Total pairs | 250,297 |
Without going into a lot of technical detail, pivotal tokens are those which significantly shift the probability of a response ending up correct. Pivotal Token Search (PTS) identifies such tokens in the model's generations, and creates preference pairs of two responses that diverge starting at a pivotal token.
Example of a pair generated with PTS:

User: Find the product of all real values of r for which [equation] has exactly one real solution.

AI: To solve the problem, we need to determine the values of r such that the equation has exactly one real solution for x.

First, clear the fractions by
- Good: cross-multiplying
- Bad: multiplying both sides by
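Very roughly, the search boils down to estimating how the probability of eventually reaching a correct answer changes token by token. The sketch below captures that idea with a hypothetical `estimate_success` routine; the actual algorithm in the report uses a more efficient search than this naive linear scan.

```python
def find_pivotal_tokens(question, solution_tokens, estimate_success, threshold=0.2):
    """Naive sketch of the idea behind Pivotal Token Search.
    `estimate_success(question, prefix)` is a hypothetical routine that
    samples completions from the model given a partial solution and returns
    the fraction that reach a correct final answer. Tokens whose inclusion
    shifts that probability by more than `threshold` are flagged as pivotal."""
    pivotal = []
    prev_p = estimate_success(question, [])
    for i, token in enumerate(solution_tokens):
        p = estimate_success(question, solution_tokens[: i + 1])
        if abs(p - prev_p) >= threshold:
            pivotal.append((i, token, prev_p, p))
        prev_p = p
    return pivotal
```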
Judge-Guided DPO
| Dataset Name | Sample Count |
|---|---|
| unknown + safety data | 43,842 |
| any vs any overall | 266,000 |
| any vs any accuracy | 532,000 |
| Total pairs | 841,842 |
For judge-guided DPO, prompts are first sourced from publicly available instruction-tuning datasets and from safety and responsible AI datasets. Next, GPT-4o, GPT-4t, and phi-4 are used to generate responses to these prompts. Various combinations of answer pairs are formed from these responses, and GPT-4o is used as the judge to label the positive/negative sample in each pair. Each response is scored on accuracy, style, and detail - and either the higher accuracy or the higher overall score determines the positive sample.
Here is the GPT-4o prompt used to generate the accuracy, style and detail ratings:
Judge Prompt
Your task is to judge which of the following reply given by an AI assistant is better.
Conversation
{{ chat }}
Replies
{{ replies }}
Guideline
Produce your output in the following JSON format (without comments and with correct escape characters):
{
"faults": {
"Assistant 1": "(string) List all the problems with the assistant 1 reply. For each problem try to determine whether this is due to lack of comprehension of the relevant material, a logical error, a factual error, a stylistic issue. If the answer is perfect, write none. If the question did not ask for a specific level of detail in the explanation, do not penalize the answer for being too detailed or too concise.",
"Assistant 2": "..."
},
"faults_discussion": "(string) Discuss the general strengths and weaknesses of each assistant. What are the main differences between the answers in terms of style, accuracy and level of detail?",
"accuracy": {
"Assistant 1": "(1-5) how would you rate assistant 1 in terms of accuracy?",
"...": "..."
},
"style": {
"Assistant 1": "(1-5) how would you rate assistant 1 in terms of style?",
"...": "..."
},
"detail": {
"Assistant 1": "(1-5) how would you rate assistant 1 in terms of level of detail?",
"...": "..."
}
}
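Given the judge's JSON verdict, constructing a preference pair could look roughly like this; `judge` is a hypothetical wrapper that fills the prompt above with the conversation and the two replies, calls GPT-4o, and returns the raw JSON, and the tie-breaking logic is my reading of "higher accuracy or higher overall score".

```python
import json

def build_preference_pair(chat, reply_a, reply_b, judge):
    """Sketch of judge-guided pair construction: pick the winner by
    accuracy first, then by the combined overall score."""
    verdict = json.loads(judge(chat, [reply_a, reply_b]))
    scores = []
    for name in ("Assistant 1", "Assistant 2"):
        acc = int(verdict["accuracy"][name])
        overall = acc + int(verdict["style"][name]) + int(verdict["detail"][name])
        scores.append((acc, overall))

    if scores[0] == scores[1]:
        return None  # no clear winner, skip this pair
    winner = 0 if scores[0] > scores[1] else 1
    replies = [reply_a, reply_b]
    return {"chosen": replies[winner], "rejected": replies[1 - winner]}
```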
Hallucination Mitigation in Post-Training
Hallucination is the tendency of LLMs to generate wrong answers for questions they don't have the knowledge to answer correctly, or for which they are unable to synthesize their knowledge correctly.
For phi-4, the authors focused on letting the model learn "as much...as possible" during pre-training, i.e. they did not perform any hallucination-mitigation training there, and instead performed it during post-training.
How was this done? First, phi-4 was run multiple times on each question to estimate its ability to solve it. For reference, consistency across multiple LLM generations is a simple but very useful approach to hallucination mitigation (for other common approaches, see my tutorial at last year's Toronto Machine Learning Summit: A Practitioner's Guide To Safeguarding Your LLM Applications).
Additionally, GPT-4o was used to generate
- a correct answer
- a refusal (to answer the question)
- a bogus question variant (that shouldn't be possible to solve)
- a refusal to the bogus variant
SFT Hallucination Mitigation
During SFT, the model was trained on the following Q&A pairs to ensure that it refuses to answer questions instead of coming up with bogus answers:
- (question, correct answer) for questions where phi-4 was usually correct
- (question, refusal) for questions where phi-4 was usually incorrect
- (bogus question, refusal) for all bogus questions
DPO Hallucination Mitigation
For preference rating, the phi-4 model was given the following signal:
- correct > refusal, for questions where base phi-4 is usually correct
- refusal > incorrect, for questions where base phi-4 is usually incorrect
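Putting the SFT and DPO rules above together, a toy assembly of the hallucination-mitigation data for a single question might look like this (the 0.5 accuracy threshold and the response variables are illustrative assumptions):

```python
def hallucination_pairs(question, is_bogus, accuracy, correct, refusal, wrong):
    """Toy assembly of hallucination-mitigation data for one question.
    `accuracy` is the fraction of sampled phi-4 answers that were correct;
    `correct`, `refusal`, and `wrong` are example responses (the first two
    generated with GPT-4o's help)."""
    if is_bogus:
        sft = (question, refusal)                       # bogus question -> always refuse
        dpo = {"chosen": refusal, "rejected": wrong}
    elif accuracy >= 0.5:                               # model is usually correct
        sft = (question, correct)
        dpo = {"chosen": correct, "rejected": refusal}  # correct > refusal
    else:                                               # model is usually incorrect
        sft = (question, refusal)
        dpo = {"chosen": refusal, "rejected": wrong}    # refusal > incorrect
    return sft, dpo
```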
Final Thoughts
Inductive biases (e.g. convolutions, graph neural networks) tailored towards exploiting inherent structure in the data are efficient at learning from little data, and pre-GPT-3 machine learning (ML) research focused a lot on getting these architectural details right. During my time doing ML research (2019-23), there was a slow but sure realization across the ML research community that model idiosyncrasies are usually overcome given enough of the right kind of training data. A testament to this in my own research journey is that I went from writing papers like Knowledge-enabled vqa model that can read and reason and Neural Structure Mapping For Learning Abstract Visual Analogies (both focused on exploiting inherent properties of the data via model structure) to papers like Beating power law scaling via data pruning and Photorealistic and semantically controllable synthetic data for representation learning (both aimed at obtaining training data that fits a desirable property to be modelled).

There is, however, a real concern that the low-hanging fruit of model performance gains from training on increasingly larger datasets is coming to an end (see Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data). I do not have very strong opinions either way on whether synthetic data is the salvation out of this forthcoming problem. I don't think it can, on its own, fill the data vacuum we are barreling towards in terms of novel training data for LLMs (see Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images), nor do I think it is complete garbage that will lead to an LLM ouroboros spouting gibberish after several iterations of the LLM -> synthetic data -> LLM training cycle (see AI models collapse when trained on recursively generated data). I don't think synthetic data as we can generate it today is good enough to replace "organic data" - as evidenced by the authors' use of code, research papers, and textbook data. However, I think it will serve its utility as the base of the training cake - providing a large number of easily generated tokens for early pre-training. This limits its utility in breaking through current model capabilities, but I think a regime of large amounts of synthetic data plus targeted, high-quality real data will continue to drive incremental gains in model performance and capabilities in the near future.