The Most Comprehensive Reasoning Dataset Sharing Series: CoT-Related Datasets
Reasoning Datasets and Chain-of-Thought
Reasoning datasets are designed to train and evaluate the reasoning capabilities of models. They typically cover tasks such as logical, commonsense, mathematical, and causal reasoning, and they target problems that take several inference steps to solve. With the development of large language models (LLMs) and reasoning methods such as Chain-of-Thought (CoT), reasoning tasks have become significantly more important in natural language processing (NLP).
Chain-of-Thought (CoT) is a reasoning strategy in NLP that is widely applied to large language models such as GPT. Its core idea is to mimic step-by-step human reasoning: rather than jumping directly to a conclusion, the model generates an ordered chain of intermediate steps. A complex task is decomposed into a series of subtasks, and each step supplies the information the next one needs until the final answer is reached. Reasoning step by step can improve accuracy on complex problems and makes the model's output easier to interpret.
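To make the idea concrete, here is a minimal sketch of how a few-shot CoT prompt might be assembled. The helper name `build_cot_prompt`, the exemplar question, and the "Let's think step by step" cue are illustrative assumptions, not a format prescribed by any of the datasets below.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot Chain-of-Thought prompt.

    exemplars: list of (question, reasoning, answer) tuples (hypothetical data).
    """
    parts = []
    for q, reasoning, answer in exemplars:
        # Each exemplar shows the reasoning before the answer.
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The final question is left open so the model continues with its own steps.
    parts.append(f"Q: {question}\nA: Let's think step by step.")
    return "\n\n".join(parts)


# Invented exemplar, used only for illustration.
exemplars = [
    (
        "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?",
        "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
        "12",
    )
]
prompt = build_cot_prompt(
    exemplars, "A train has 5 cars with 20 seats each. How many seats are there?"
)
print(prompt)
```

The exemplar demonstrates the intermediate step before the answer, which is what nudges a model toward producing its own chain of reasoning for the new question.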
To advance the development of the CoT method, researchers and developers have created multiple open-source datasets specifically designed to evaluate and train models' reasoning capabilities, particularly for tasks involving multi-step reasoning and complex problems. In the first installment of the Reasoning Dataset Sharing Series, we have compiled open-source reasoning datasets that incorporate the CoT approach. These datasets cover various domains, from commonsense reasoning to mathematical reasoning, and from situational reasoning to paragraph comprehension.
CoT-Related Open-Source Datasets
NuminaMath-CoT
- Publisher: AI-MO
- Download Address: https://projectnumina.ai/
- Release Year: 2024
- Size: Approximately 1GB (contains roughly 860,000 math problems with step-by-step reasoning processes)
- Description: NuminaMath-CoT is a mathematical reasoning dataset for training and evaluating large language models. Every problem comes with a step-by-step solution written in Chain-of-Thought (CoT) style. The problems range from basic arithmetic to more advanced topics such as algebra, geometry, and number theory. Because each solution spells out its multi-step logic, models trained on the data learn to show how they reach an answer rather than only stating it.
source synthetic_math
problem Consider the terms of an arithmetic sequence: $-\frac{1}{3}, y+2, 4y, \ldots$. Solve for $y$.
solution For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence given: \[ (y + 2) - \left(-\frac{1}{3}\right) = 4y - (y+2) \] Simplify and solve these equations: \[ y + 2 + \frac{1}{3} = 4y - y - 2 \] \[ y + \frac{7}{3} = 3y - 2 \] \[ \frac{7}{3} + 2 = 3y - y \] \[ \frac{13}{3} = 2y \] \[ y = \frac{13}{6} \] Thus, the value of $$y$$ that satisfies the given arithmetic sequence is $\boxed{\frac{13}{6}}$.
messages [ { "content": "Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, y+2, 4y, \\ldots$. Solve for $y$.", "role": "user" }, { "content": "For an arithmetic sequence, the difference between consecutive terms must be equal. Therefore, we can set up the following equations based on the sequence given:\n\\[ (y + 2) - \\left(-\\frac{1}{3}\\right) = 4y - (y+2) \\]\n\nSimplify and solve these equations:\n\\[ y + 2 + \\frac{1}{3} = 4y - y - 2 \\]\n\\[ y + \\frac{7}{3} = 3y - 2 \\]\n\\[ \\frac{7}{3} + 2 = 3y - y \\]\n\\[ \\frac{13}{3} = 2y \\]\n\\[ y = \\frac{13}{6} \\]\n\nThus, the value of $$y$$ that satisfies the given arithmetic sequence is $\\boxed{\\frac{13}{6}}$.", "role": "assistant" } ]
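In the sample above, the final answer is wrapped in `\boxed{...}`. The sketch below pulls that answer out of a solution string and checks the arithmetic with exact fractions. `extract_boxed` is a hypothetical helper written for this article, not part of any dataset tooling.

```python
from fractions import Fraction


def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward and track brace depth so nested braces (e.g. \frac{..}{..}) survive.
    for ch in solution[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


solution_tail = r"Thus, the value of $$y$$ is $\boxed{\frac{13}{6}}$."
boxed = extract_boxed(solution_tail)
print(boxed)

# Check the sample's answer: consecutive differences must match when y = 13/6.
y = Fraction(13, 6)
terms = [Fraction(-1, 3), y + 2, 4 * y]
differences_match = (terms[1] - terms[0]) == (terms[2] - terms[1])
print(differences_match)
```

Brace counting is used instead of a simple regular expression because boxed answers often contain nested braces.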
LLaVA-CoT-100k
- Publisher: PKU-YUAN-Lab
- Download Address: https://huggingface.co/datasets/Xkev/LLaVA-CoT-100k
- Release Year: 2024
- Size: Approximately 10GB (contains 100,000 multi-step reasoning tasks)
- Description: LLaVA-CoT-100k contains roughly 100,000 multi-step reasoning samples designed to strengthen the reasoning of vision-language models. Each problem requires the model to extract key information from the visual input and combine it with textual reasoning to derive the answer step by step. Because every task pairs an image with a question, the dataset is suited to multimodal reasoning training.

CoT-Collection
- Publisher: kaist-ai
- Download Address: https://huggingface.co/datasets/kaist-ai/CoT-Collection
- Release Year: 2023
- Size: Approximately 3GB (contains various types of reasoning tasks)
- Description: CoT-Collection is a diverse collection of reasoning tasks spanning fields from mathematics to logical reasoning. Each record pairs a task input with a detailed rationale and the target answer, so models trained on it learn to produce a complete reasoning chain. The collection is intended for training models on complex reasoning problems that exercise logical and abstract thinking as well as calculation.
source Article: Phytochemistry is a branch of plant biochemistry primarily concerned with the chemical substances produced by plants during secondary metabolism. Some of these compounds are toxins such as the alkaloid coniine from hemlock. Others, such as the essential oils peppermint oil and lemon oil are useful for their aroma, as flavourings and spices (e.g., capsaicin), and in medicine as pharmaceuticals as in opium from opium poppies. Many medicinal and recreational drugs, such as tetrahydrocannabinol (active ingredient in cannabis), caffeine, morphine and nicotine come directly from plants. Others are simple derivatives of botanical natural products. For example, the pain killer aspirin is the acetyl ester of salicylic acid, originally isolated from the bark of willow trees, and a wide range of opiate painkillers like heroin are obtained by chemical modification of morphine obtained from the opium poppy. Popular stimulants come from plants, such as caffeine from coffee, tea and chocolate, and nicotine from tobacco. Most alcoholic beverages come from fermentation of carbohydrate-rich plant products such as barley (beer), rice (sake) and grapes (wine). Now answer this question: Where do some medicines and recreational drugs come from?
target from plants
rationale The article states that many medicinal and recreational drugs, such as tetrahydrocannabinol (active ingredient in cannabis), caffeine, morphine and nicotine come directly from plants. These are some examples of the medicines found in plants mentioned by the author. Thus it can be stated with certainty that some medicines do indeed come from plants. Therefore, "from plants" is the correct answer option to this question based on the context provided."
task squad_v1
type CoT
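A record like the one above can be turned into a prompt/completion pair for fine-tuning. The following is a minimal sketch: the field names come from the sample, while the helper `to_training_pair`, the answer prefix, and the output keys are assumptions made for illustration. The record is abbreviated from the sample.

```python
def to_training_pair(record, answer_prefix="Therefore, the answer is"):
    """Join rationale and target so the model learns to reason before answering."""
    prompt = record["source"].strip()
    completion = (
        f'{record["rationale"].strip()} {answer_prefix} {record["target"].strip()}.'
    )
    return {"prompt": prompt, "completion": completion}


# Abbreviated version of the sample record shown above.
record = {
    "source": (
        "Article: Many medicinal and recreational drugs come directly from plants. "
        "Now answer this question: Where do some medicines and recreational drugs come from?"
    ),
    "target": "from plants",
    "rationale": (
        "The article states that many medicinal and recreational drugs "
        "come directly from plants."
    ),
    "task": "squad_v1",
    "type": "CoT",
}

pair = to_training_pair(record)
print(pair["completion"])
```

Placing the rationale before the answer keeps the target sequence in the same order a CoT model is expected to generate at inference time.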
cot_flan
- Publisher: causal-lm
- Download Address: https://huggingface.co/datasets/causal-lm/cot_flan
- Release Year: 2023
- Size: Approximately 3GB (contains a large number of reasoning tasks, suitable for language models)
- Description: cot_flan is a dataset for strengthening the reasoning of language models through the Chain-of-Thought (CoT) method. Its tasks span several domains, including general, logical, and mathematical reasoning. Each record pairs an instruction with an output that lays out the reasoning steps in detail before stating the final answer.
instruction There were students from three different schools taking part in a debate organized by the town's mayor on a stage. The first school had twice as many students as the second school, and the second school had 40 more students than the third school. If the third school had 200 students, and after the debate was over, every student shook the mayor's hand, calculate the total number of students who shook the mayor's hand on the stage. Thoughts? Step-by-step reasoning:
input
output If the third school had 200 students, and the second school had 40 more students than the third school, the second school had 200 + 40 = 240 students. The third and second schools had 240 + 200 = 440 students on stage taking part in the debate. Since the first school had twice as many students as the second school, and the second school had 240 students, the first school had 240 * 2 = 480 students on stage taking part in the debate. If after the debate was over, every student at the stage shook the mayor's hand, the number of students who shook the mayor's hand is 480 + 440 = 920 students. Thus, the answer is 920.
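Because the reasoning in this sample is written as explicit equations, each step can be checked mechanically. The sketch below finds simple integer equations of the form `a op b = c` in the text and recomputes them. `check_equations` is a hypothetical helper, and it assumes non-negative integers and a single operator per equation.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}
EQUATION = re.compile(r"(\d+)\s*([+\-*/])\s*(\d+)\s*=\s*(\d+)")


def check_equations(text):
    """Return (equation, is_correct) for every 'a op b = c' found in text."""
    results = []
    for a, op, b, claimed in EQUATION.findall(text):
        if op == "/" and int(b) == 0:
            ok = False  # division by zero can never match a claimed result
        else:
            ok = OPS[op](int(a), int(b)) == int(claimed)
        results.append((f"{a} {op} {b} = {claimed}", ok))
    return results


# Key sentences from the sample output above.
output = (
    "the second school had 200 + 40 = 240 students. "
    "The third and second schools had 240 + 200 = 440 students. "
    "the first school had 240 * 2 = 480 students. "
    "the number of students who shook the mayor's hand is 480 + 440 = 920 students."
)
checks = check_equations(output)
for equation, ok in checks:
    print(equation, "OK" if ok else "WRONG")
```

A check like this only validates the arithmetic inside each step. It does not confirm that the steps are the right ones for the question.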
GSM8K (Grade-School Math 8K)
- Publisher: OpenAI
- Download Address: https://github.com/openai/grade-school-math?tab=readme-ov-file
- Release Year: 2021
- Size: About 8,500 grade-school math word problems with step-by-step solutions
- Description: GSM8K contains about 8,500 grade-school math word problems. Each problem comes with a step-by-step natural-language solution that ends in the final answer, so a model trained or evaluated on it is expected to show each reasoning step rather than only the result. Solving a problem typically takes between two and eight steps of basic arithmetic, which makes the dataset a standard testbed for Chain-of-Thought (CoT) reasoning.

cot_gsm8k
- Publisher: Dahoas
- Download Address: https://huggingface.co/datasets/Dahoas/cot_gsm8k
- Release Year: 2023
- Size: Over 8,000 math problems
- Description: cot_gsm8k is a version of GSM8K prepared for Chain-of-Thought (CoT) training. As the sample below shows, each record keeps the original question and answer fields and adds prompt and response fields in a "Q: ... A:" format. Every answer contains the step-by-step reasoning that leads to the result, which makes the dataset suitable for training and evaluating models on grade-school math reasoning.
question Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
answer Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72
prompt Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? A:
response Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72
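The answer field in the sample uses two conventions: calculator annotations in `<<...>>` and a final answer after `####`. Here is a minimal sketch for separating them. `parse_gsm8k_answer` is a hypothetical helper name, and stripping commas from the final answer is an assumption about how large numbers may be written.

```python
import re

ANNOTATION = re.compile(r"<<([^<>]*)>>")


def parse_gsm8k_answer(answer):
    """Split an answer into clean rationale, calculator steps, and final answer."""
    rationale, marker, final = answer.partition("####")
    if not marker:
        raise ValueError("no '####' marker found")
    return {
        # Text with the <<...>> annotations removed, for display or training.
        "rationale": ANNOTATION.sub("", rationale).strip(),
        # The raw calculator steps, e.g. "48/2=24".
        "calculations": ANNOTATION.findall(rationale),
        "final_answer": final.strip().replace(",", ""),
    }


answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May. "
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May. #### 72"
)
parsed = parse_gsm8k_answer(answer)
print(parsed["calculations"])
print(parsed["final_answer"])
```

Comparing `final_answer` against the number a model produces gives a simple exact-match accuracy metric.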
MATH (Mathematics Dataset)
- Publisher: Hendrycks et al.
- Download Address: MATH GitHub
- Release Year: 2021
- Size: 12,500 competition-level math problems (7,500 for training and 5,000 for testing)
- Description: The MATH dataset contains problems from high school mathematics competitions, graded by difficulty and covering subjects such as algebra, geometry, number theory, and probability. Each problem requires multi-step reasoning and includes a full step-by-step solution. Chain-of-Thought (CoT) fits the dataset naturally: a model decomposes the problem, works through it incrementally, and exposes its reasoning instead of producing only a final answer.

CommonsenseQA
- Publisher: Tel Aviv University and the Allen Institute for AI
- Download Address: https://www.tau-nlp.org/commonsenseqa
- Release Year: 2019
- Size: 12,247 multiple-choice questions
- Description: CommonsenseQA is a commonsense reasoning dataset of multiple-choice questions whose answers depend on everyday background knowledge rather than on a supplied passage. The dataset itself contains no reasoning chains. Chain-of-Thought (CoT) is applied at inference time: the model breaks a question into several reasoning steps and then selects the most plausible option. This step-by-step approach helps most when explicit context is lacking.
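For multiple-choice benchmarks like this one, CoT is usually applied by asking the model to reason first and then name an option. The sketch below formats a question and extracts the chosen letter from a response. Both helpers, the question, and the model response are invented for illustration, and the "answer is (X)" pattern is an assumed output convention.

```python
import re

LABELS = "ABCDE"


def format_multiple_choice(question, options):
    """Render a question with lettered options and a step-by-step cue."""
    lines = [f"Question: {question}"]
    lines += [f"({LABELS[i]}) {option}" for i, option in enumerate(options)]
    lines.append("Answer: Let's think step by step.")
    return "\n".join(lines)


def extract_choice(response, num_options=5):
    """Return the last option letter stated as 'answer is (X)', or None."""
    labels = LABELS[:num_options]
    matches = re.findall(rf"answer is \(?([{labels}])\)?", response)
    return matches[-1] if matches else None


# Hypothetical question and model response.
mc_prompt = format_multiple_choice(
    "Where would you most likely store a book after reading it?",
    ["oven", "shelf", "river", "freezer", "mailbox"],
)
response = "Books are kept dry and within reach. A shelf fits that. So the answer is (B)."
choice = extract_choice(response)
print(mc_prompt)
print(choice)
```

Taking the last match guards against responses that mention a rejected option before settling on the final one.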

SWAG (Situations With Adversarial Generations)
- Publisher: University of Washington and the Allen Institute for AI
- Download Address: https://rowanzellers.com/swag/
- Release Year: 2018
- Size: About 113,000 multiple-choice questions
- Description: The SWAG dataset contains approximately 113,000 multiple-choice questions grounded in everyday situations, each asking the model to pick the most likely continuation of a described event. With Chain-of-Thought (CoT), the model works through the situation in several steps before choosing an ending. This helps it capture the underlying relationships in the context and select the most plausible answer.

DROP (Discrete Reasoning Over Paragraphs)
- Publisher: Allen Institute for AI and UC Irvine
- Download Address: https://github.com/allenai/allennlp-reading-comprehension/blob/master/allennlp_rc/eval/drop_eval.py
- Release Year: 2019
- Size: About 96,000 passage-based questions
- Description: The DROP dataset focuses on paragraph-level reasoning, with questions that require discrete operations such as addition, subtraction, counting, and sorting over facts stated in a passage. With Chain-of-Thought (CoT), the model first extracts the relevant quantities from the passage and then carries out the calculation step by step to reach the final answer. Separating extraction from computation helps the model stay accurate on questions that combine reading comprehension with numerical reasoning.

ReClor (Reading Comprehension Dataset Requiring Logical Reasoning)
- Publisher: National University of Singapore
- Download Address: https://whyu.me/reclor/
- Release Year: 2020
- Size: 6,138 multiple-choice questions
- Description: The ReClor dataset consists of logical reasoning questions drawn from standardized exams such as the GMAT and LSAT. Each question gives a short passage and asks the model to choose the option that follows logically. With the Chain-of-Thought (CoT) method, the model breaks each problem into several reasoning steps and follows a more systematic, structured path to the answer, which makes complex logical questions easier to analyze and solve.

AQUA-RAT (Algebra Question Answering with Rationales)
- Publisher: DeepMind
- Download Address: https://github.com/google-deepmind/AQuA
- Release Year: 2017
- Size: About 100,000 algebraic word problems
- Description: The AQUA-RAT dataset contains multiple-choice algebraic word problems, each paired with a natural-language rationale that explains how to reach the answer. These rationales make the dataset an early example of Chain-of-Thought (CoT) supervision: the model is expected to decompose the problem, work through the intermediate steps, and then select an option. Training and evaluating on the rationales improves both the accuracy and the interpretability of the answers.

Conclusion
In this installment of the Reasoning Dataset Sharing Series, we introduced a range of datasets built around the Chain-of-Thought (CoT) reasoning method. The Most Comprehensive Reasoning Dataset Sharing Series aims to provide researchers and developers with a rich collection of open-source datasets. Future articles will cover more reasoning datasets and more challenging reasoning tasks, to help readers better understand and apply them.