
Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

This repository contains the experiment code for the paper "Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus".

Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

Seungpil Lee, Woochang Sim, Donghyeon Shin, Sanha Hwang, Wongyu Seo, Jiwon Park, Seokki Lee, Sejin Kim, Sundong Kim

The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been result-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates the comparison of model inference abilities with humans. Experimental results confirm that while large language models possess weak inference abilities, they still lag in terms of logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs, proposing development paths for achieving human-level reasoning.


Setup

  1. Follow instructions from Create and deploy an Azure OpenAI Service resource.

  2. Follow instructions from Quickstart: Get started using GPT-35-Turbo and GPT-4 with Azure OpenAI Service.

  3. Set environment variables (a minimal usage sketch for these variables follows the setup steps).

export AZURE_OPENAI_API_KEY="REPLACE_WITH_YOUR_KEY_VALUE_HERE"
export AZURE_OPENAI_ENDPOINT="REPLACE_WITH_YOUR_ENDPOINT_HERE"
export AZURE_OPENAI_DEPLOYMENT_NAME="REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE"
  4. Clone this repository & install the required packages.
git clone https://github.com/GIST-DSLab/ARC_Prompt.git
cd ARC_Prompt
pip install -r requirements.txt
  5. Follow the Quick Start instructions for each experiment.
    1. Logical Coherence
    2. Compositionality
    3. Productivity
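
For reference, here is a minimal sketch of how the environment variables from step 3 might be consumed with the `openai` Python package (v1.x). The API version string and the prompt are placeholders, and the repository's actual client calls may differ:

```python
# Minimal sketch (not the repository's code): read the Azure OpenAI settings
# from the environment variables set in step 3 and issue a test completion.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",  # placeholder; use a version your resource supports
)

response = client.chat.completions.create(
    model=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],  # Azure expects the deployment name here
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```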

The accuracy below is based on solving 100 random ARC tasks with CoT, LtM, and ToT prompts, with each run repeated 5 times. The number outside the parentheses is the accuracy when only the final result must be correct; the number inside the parentheses is the accuracy when both the result and the reasoning process must be correct (see the sketch after the table).

| Iteration | Chain of Thought | Least to Most | Tree of Thoughts |
|-----------|------------------|---------------|------------------|
| 1         | 11% (3%)         | 6% (4%)       | 7% (3%)          |
| 2         | 10% (2%)         | 7% (4%)       | 5% (1%)          |
| 3         | 10% (5%)         | 6% (3%)       | 7% (2%)          |
| 4         | 10% (4%)         | 4% (2%)       | 7% (4%)          |
| 5         | 12% (6%)         | 5% (2%)       | 6% (2%)          |
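
As an illustration of how the two numbers could be derived, here is a short sketch with hypothetical per-trial records (the field names are not the repository's actual output format):

```python
# Hypothetical per-trial records: "result_correct" marks whether the predicted
# output grid matched the answer, "process_correct" whether the described
# transformation steps were also judged correct.
trials = [
    {"task": "t1", "result_correct": True,  "process_correct": False},
    {"task": "t2", "result_correct": True,  "process_correct": True},
    {"task": "t3", "result_correct": False, "process_correct": False},
]

# Accuracy outside the parentheses: only the final result has to be correct.
result_acc = sum(t["result_correct"] for t in trials) / len(trials)
# Accuracy inside the parentheses: both the result and the process have to be correct.
process_acc = sum(t["result_correct"] and t["process_correct"] for t in trials) / len(trials)

print(f"{result_acc:.0%}({process_acc:.0%})")  # printed in the table's "result%(process%)" form
```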

We also analyze LLMs’ reasoning capabilities by task difficulty, following the prior categorization from ARC-Game. The number of ARC tasks in each category is listed in the table below, and the experiment was repeated 5 times for each task.

|         | Entry   | Easy   | Medium | Hard  |
|---------|---------|--------|--------|-------|
| Tasks   | 2       | 20     | 46     | 14    |
| Trials  | 10      | 100    | 230    | 70    |
| CoT     | 100.00% | 30.00% | 0.00%  | 0.00% |
| LtM     | 20.00%  | 19.00% | 0.00%  | 2.85% |
| ToT     | 50.00%  | 22.00% | 0.00%  | 0.00% |
| Average | 56.67%  | 23.67% | 0.00%  | 0.95% |

The accuracy below is based on solving 99 random ARC tasks with the ToT prompt and a DSL. These tasks are included in the Logical_Coherence experiment.

|         | Entry | Easy  | Medium | Hard  | Tedious | Multiple solutions | Unfixed |
|---------|-------|-------|--------|-------|---------|--------------------|---------|
| Tasks   | 2     | 19    | 46     | 14    | 11      | 6                  | 1       |
| Correct | 0     | 0     | 3      | 0     | 0       | 0                  | 0       |
| ToT     | 0.00% | 0.00% | 6.52%  | 0.00% | 0.00%   | 0.00%              | 0.00%   |

Based on 160 ARC tasks classified by ConceptARC, we evaluated the validity of a total of 2,913 generated examples.

| Problem category    | Total available | Generated examples | Valid augmented examples | Ratio (valid/generated) |
|---------------------|-----------------|--------------------|--------------------------|-------------------------|
| Above Below         | 58              | 158                | 34                       | 21.52%                  |
| Center              | 65              | 236                | 35                       | 14.83%                  |
| Clean Up            | 106             | 183                | 83                       | 45.36%                  |
| Complete Shape      | 58              | 147                | 37                       | 25.17%                  |
| Copy                | 27              | 153                | 4                        | 2.61%                   |
| Count               | 56              | 202                | 29                       | 14.36%                  |
| Extend To Boundary  | 37              | 167                | 8                        | 4.79%                   |
| Extract Objects     | 44              | 176                | 21                       | 11.93%                  |
| Filled Not Filled   | 58              | 203                | 29                       | 14.29%                  |
| Horizontal Vertical | 32              | 114                | 7                        | 6.14%                   |
| Inside Outside      | 52              | 191                | 24                       | 12.57%                  |
| Move To Boundary    | 36              | 165                | 12                       | 7.27%                   |
| Order               | 47              | 162                | 26                       | 16.05%                  |
| Same Different      | 107             | 246                | 76                       | 30.89%                  |
| Top Bottom 2D       | 92              | 255                | 59                       | 23.14%                  |
| Top Bottom 3D       | 55              | 215                | 25                       | 11.63%                  |
| Total               | 930             | 2913               | 509                      | 17.12%                  |
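
The ratio column is simply valid/generated per category. A small aggregation sketch, using two rows from the table above as input:

```python
# Per-category counts of generated vs. valid augmented examples
# (the two entries below are copied from the table above).
counts = {
    "Clean Up": {"generated": 183, "valid": 83},
    "Copy": {"generated": 153, "valid": 4},
}

for category, c in counts.items():
    print(f"{category}: {c['valid']}/{c['generated']} = {c['valid'] / c['generated']:.2%}")

total_generated = sum(c["generated"] for c in counts.values())
total_valid = sum(c["valid"] for c in counts.values())
print(f"Total: {total_valid}/{total_generated} = {total_valid / total_generated:.2%}")
```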

Citation

If you find this repo useful for your research, please consider citing our paper:

@misc{lee2024reasoning,
      title={Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus}, 
      author={Seungpil Lee and Woochang Sim and Donghyeon Shin and Sanha Hwang and Wongyu Seo and Jiwon Park and Seokki Lee and Sejin Kim and Sundong Kim},
      year={2024},
      eprint={2403.11793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
