#AI #Data

This document presents a curated dataset resource, `Open-Thoughts-114k`, complementing the concepts discussed in [[Dataset]].

### Open-Thoughts-114k

Open-Thoughts-114k is an open synthetic reasoning dataset with 114k high-quality examples covering math, science, code, and puzzles.

### Available Subsets

The **default** subset contains ready-to-train data used to fine-tune the [OpenThinker-7B](https://huggingface.co/open-thoughts/OpenThinker-7B) and [OpenThinker-32B](https://huggingface.co/open-thoughts/OpenThinker-32B) models:

```Python
from datasets import load_dataset

# Load the ready-to-train default subset
ds = load_dataset("open-thoughts/OpenThoughts-114k", split="train")
```

The **metadata** subset contains extra columns used in dataset construction:

- `problem`
- `ground_truth_solution`
- `deepseek_reasoning`
- `deepseek_solution`
- `domain`
- `source`
- `test_cases` (code only)
- `starter_code` (code only)

```Python
from datasets import load_dataset

# Load the metadata subset by passing its config name
ds = load_dataset("open-thoughts/OpenThoughts-114k", "metadata", split="train")
```

### Data Curation Recipe

Source datasets, grouped by domain:

Code
- [BAAI/TACO](https://huggingface.co/datasets/BAAI/TACO)
- [codeparrot/apps](https://huggingface.co/datasets/codeparrot/apps)
- [deepmind/code_contests](https://huggingface.co/datasets/deepmind/code_contests)
- [MatrixStudio/Codeforces-Python-Submissions](https://huggingface.co/datasets/MatrixStudio/Codeforces-Python-Submissions)

Math
- [AI-MO/NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)

Science
- [camel-ai/chemistry](https://huggingface.co/datasets/camel-ai/chemistry)
- [camel-ai/biology](https://huggingface.co/datasets/camel-ai/biology)
- [camel-ai/physics](https://huggingface.co/datasets/camel-ai/physics)

Puzzle
- [INK-USC/riddle_sense](https://huggingface.co/datasets/INK-USC/riddle_sense)
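
Because the metadata subset exposes a `domain` column, a quick way to inspect the mix of sources is to count rows per domain. The sketch below is a minimal illustration that works on any iterable of dict-like rows. The helper name `count_by_domain` and the sample rows are hypothetical, and the domain strings (`"code"`, `"math"`) are assumed values, not confirmed against the actual dataset.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_domain(rows: Iterable[Mapping[str, object]]) -> Counter:
    """Count rows per value of the `domain` column."""
    return Counter(str(row["domain"]) for row in rows)


# Hypothetical rows that mimic a few metadata columns.
# The domain values here are assumptions for illustration only.
sample_rows = [
    {"problem": "Sum two integers.", "domain": "code", "source": "example"},
    {"problem": "Solve x + 1 = 2.", "domain": "math", "source": "example"},
    {"problem": "Reverse a string.", "domain": "code", "source": "example"},
]

counts = count_by_domain(sample_rows)
for domain, n in counts.most_common():
    print(f"{domain}: {n}")
```

Iterating a Hugging Face `Dataset` yields one dict per row, so the same helper should also accept the `ds` object returned by `load_dataset(..., "metadata", split="train")`, although a full pass over every row will be slow compared with reading only the `domain` column.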