#AI #Data
This document provides a curated and valuable dataset resource which is `Open-Thoughts-114k`, complementing the concepts discussed in [[Dataset]].
### Open-Thoughts-114k
Open-Thoughts-114k is an open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles!
### Available Subsets
**default** subset containing ready-to-train data used to finetune the [OpenThinker-7B](https://huggingface.co/open-thoughts/OpenThinker-7B) and [OpenThinker-32B](https://huggingface.co/open-thoughts/OpenThinker-32B) models:
```Python
ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
```
**metadata** subset containing extra columns used in dataset construction:
- `problem`
- `ground_truth_solution`
- `deepseek_reasoning`
- `deepseek_solution`
- `domain`
- `source`
- `test_cases` (code only)
- `starter_code`(code only)
```Python
ds = load_dataset("open-thoughts/OpenThoughts-114k", "metadata", split="train")
```
### Data Curation Recipe
Code
- [BAAI/TACO](https://huggingface.co/datasets/BAAI/TACO)
- [codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)
- [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests)
- [MatrixStudio/Codeforces-Python-Submissions](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions)
Math
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)
Science
- [camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry)
- [camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology)
- [camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics)
Puzzle
- [INK-USC/riddle_sense](https://huggingface.co/datasets/INK-USC/riddle_sense)