#AI #LLM
Over the past month, I have read a few research papers. To ensure I remember the details, I have decided to document these papers in this blog post and also share my analysis and viewpoints based on these studies.
## Scaling Synthetic Data Creation with 1,000,000,000 Personas
This paper, from [Tencent AI Lab](https://arxiv.org/abs/2406.20094) (Chan et al., 2024), introduces a novel and insightful data synthesis methodology that uses diverse personas to generate data. The approach is motivated by the observation that adding a persona to a data synthesis prompt steers the large language model (LLM) toward that perspective, so each persona yields distinct synthetic data. Consequently, building a large persona collection, which the authors call Persona Hub, is the foundation of the whole method.
Here are two methods to construct Persona Hub:
1. **Text-to-Persona**
- Text-to-Persona works by prompting an LLM with a piece of web text and asking, "Who is likely to write/read/like this text?" The more detailed the input text, the more specific the resulting persona. Below is an example that illustrates this process:
![[Pasted image 20240712212803.png]]
![[Pasted image 20240712213332.png]]
2. **Persona-to-Persona**
- Persona-to-Persona complements Text-to-Persona, which may still miss personas that have little visibility in web text; it derives new personas from the interpersonal relationships of personas already collected (e.g., by asking who is in a close relationship with a given persona). A prompt-level sketch of both methods follows the image below. The procedure is as follows:
![[Pasted image 20240713071642.png]]
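To make the two construction routes concrete, here is a minimal sketch of how the prompts might be issued. The `call_llm` helper and the exact prompt wording are my own assumptions, not the paper's released templates:
```python
# Minimal sketch of the two Persona Hub construction routes.
# `call_llm` is a hypothetical helper standing in for any chat-completion API;
# the prompt wording is my paraphrase, not the paper's released templates.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError

def text_to_persona(web_text: str) -> str:
    """Infer who is most likely to write/read/like a given piece of web text."""
    prompt = (
        "Who is likely to write, read, or like the following text? "
        "Describe this persona in one sentence.\n\n"
        f"Text: {web_text}"
    )
    return call_llm(prompt)

def persona_to_persona(persona: str, n: int = 3) -> str:
    """Derive new personas from the interpersonal relationships of an existing one."""
    prompt = (
        f'Given the persona: "{persona}"\n'
        f"List {n} people who are in close relationship with this persona, "
        "describing each of them as a new persona, one per line."
    )
    return call_llm(prompt)
```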
Additionally, there are three types of prompts used to create synthetic data:
- **Zero-shot prompting:** This method does not rely on any examples; instead, it harnesses the creativity of LLMs to generate data.
- **Few-shot prompting:** This approach utilizes several examples to guide the LLMs, typically yielding better results.
- **Persona-enhanced few-shot prompting:** This combines personas with few-shot demonstrations, but it requires that each demonstration be annotated with its corresponding persona beforehand (see the sketch after the example below).
Examples of each are provided below:
![[Pasted image 20240713073015.png]]
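As a rough illustration of how the three styles differ at the prompt level, here is a sketch that uses math-problem creation as the synthesis task (one of the use cases shown in the paper). The wording and demonstration format are my paraphrase, not the official templates:
```python
# Sketch of the three persona-driven prompting styles, using math-problem
# creation as the synthesis task. The wording and demonstration format are
# illustrative assumptions, not the paper's official templates.

def zero_shot_prompt(persona: str) -> str:
    # Rely purely on the model's creativity, conditioned on the persona.
    return f"Create a challenging math problem with this persona: {persona}"

def few_shot_prompt(persona: str, demos: list[str]) -> str:
    # Prepend a few example problems to guide style and difficulty.
    examples = "\n\n".join(f"Example:\n{d}" for d in demos)
    return (
        f"{examples}\n\n"
        f"Following the examples above, create a challenging math problem "
        f"with this persona: {persona}"
    )

def persona_enhanced_few_shot_prompt(persona: str,
                                     demos: list[tuple[str, str]]) -> str:
    # Each demonstration is paired with the persona that produced it.
    examples = "\n\n".join(f"Persona: {p}\nExample:\n{d}" for p, d in demos)
    return (
        f"{examples}\n\n"
        f"Persona: {persona}\n"
        "Create a challenging math problem with this persona."
    )
```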
## GenQA: Generating Millions of Instructions from a Handful of Prompts
The team from the University of Maryland (Chen et al., 2024) has developed a new [method](https://arxiv.org/abs/2406.10323) for generating large instruction datasets from a handful of hand-written prompts. To boost diversity, they use "generator" prompts that first ask the model to enumerate topics and subtopics, randomly pick one, and only then write a question and answer. An example generator prompt is shown below:
```
List 60 topics that you can answer questions about. Choose a topic uniformly from this list, and state it. Then write 60 subtopics about the chosen topic. Then choose a subtopic uniformly from this list, and state it. Then write a question that is not about the subtopic, but can only be answered with expertise in the subtopic. Then write the answer. Both the question and answer should be long. The name of the subtopic should not appear in the question, and none of the words in subtopic should be reused in the question. Begin your questions with "Question:" and your answer with "Answer:". Be creative
```
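As a sketch of how such a generator prompt might be used at scale, the snippet below repeatedly queries a model and parses out question-answer pairs. The `call_llm` helper is a stand-in for whatever chat-completion API is available, and the parsing simply assumes the model follows the "Question:"/"Answer:" convention requested above:
```python
import re

# Sketch: query the model with the generator prompt many times and parse out
# (question, answer) pairs. `call_llm` is a stand-in for any chat-completion
# API, and the parsing assumes the model follows the requested
# "Question:" / "Answer:" convention.

GENERATOR_PROMPT = "..."  # the generator prompt quoted above

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to an LLM and return its text response."""
    raise NotImplementedError

def extract_qa(completion: str):
    match = re.search(r"Question:(.*?)Answer:(.*)", completion, re.DOTALL)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()

def generate_dataset(n_samples: int) -> list[tuple[str, str]]:
    dataset = []
    for _ in range(n_samples):
        qa = extract_qa(call_llm(GENERATOR_PROMPT))
        if qa is not None:  # discard completions that do not follow the format
            dataset.append(qa)
    return dataset
```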
## Nemotron-4 340B Technical Report
This report is from NVIDIA, which has released Nemotron-4 340B, a new open model family with 340B parameters. I focused on Section 3.2, which details the synthetic data generation pipeline: during alignment, this pipeline produced over 98% of the data used for supervised fine-tuning and preference fine-tuning.
The pipeline includes several crucial components:
* **Prompt Preparation**: This involves the use of diverse prompts for data generation, including instruction-following prompts, single-turn synthetic prompts, and two-turn prompts.
* **Synthetic Dialogue Generation**: This component enhances the model's capabilities for multi-turn conversations.
* **Synthetic Preference Data Generation**: This produces (prompt, chosen response, rejected response) triplets from a diverse set of prompts, so the model can learn to prefer higher-quality responses (a minimal sketch follows this list).
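To illustrate the last component, here is a minimal sketch of turning several sampled responses into a preference triplet. The `generate_response` and `score_response` helpers are hypothetical stand-ins for a policy model and a reward model or LLM judge; this is my reading of the general recipe, not NVIDIA's actual code:
```python
# Rough sketch of turning several sampled responses into a preference triplet,
# in the spirit of the report's synthetic preference data generation.
# `generate_response` and `score_response` are hypothetical stand-ins for a
# policy model and a reward model (or LLM judge); this is not NVIDIA's code.

def generate_response(prompt: str) -> str:
    """Placeholder: sample one response from the policy model."""
    raise NotImplementedError

def score_response(prompt: str, response: str) -> float:
    """Placeholder: score a response with a reward model or LLM judge."""
    raise NotImplementedError

def build_preference_triplet(prompt: str, n_samples: int = 4) -> dict:
    """Keep the best- and worst-scoring responses as (chosen, rejected)."""
    responses = [generate_response(prompt) for _ in range(n_samples)]
    ranked = sorted(responses, key=lambda r: score_response(prompt, r))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}
```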
## Others
I have also reviewed several evaluation benchmark papers, including MMLU-Pro, IFEval, and GSM8K. Most of these papers discuss how to construct diverse, high-quality evaluation datasets. Inspired by these works, I am motivated to build my own personal evaluation benchmark, one that assesses a model's capabilities in mathematics, reasoning, instruction following, and truthfulness.
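As a first sketch of what such a personal benchmark could look like, I might start from a simple item schema like the one below. The categories, fields, and exact-match grader are placeholder choices of mine, not something prescribed by the benchmark papers:
```python
# A minimal sketch of how I might structure items in a personal benchmark.
# The categories, fields, and exact-match grader are placeholder choices of
# mine, not something prescribed by MMLU-Pro, IFEval, or GSM8K.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    category: str   # e.g. "math", "reasoning", "instruction-following", "truthfulness"
    prompt: str     # the question posed to the model
    reference: str  # reference answer (or rubric) used for grading

def grade(item: BenchmarkItem, model_answer: str) -> bool:
    """Naive exact-match grading; a rubric or LLM judge could replace this."""
    return model_answer.strip().lower() == item.reference.strip().lower()
```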
## Insights
The papers I've read focus primarily on constructing high-quality datasets, and for many researchers a key ingredient of quality is diversity. Whether for training or evaluation, a diverse dataset covers a wider range of topics, styles, and difficulty levels, which helps models learn and generalize more effectively. Currently, creating more diverse datasets still relies largely on prompt engineering, which remains a relatively simple and controllable way to steer data generation.