Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus

This repository contains the experiment code for the paper "Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus".

Seungpil Lee, Woochang Sim, Donghyeon Shin, Sanha Hwang, Wongyu Seo, Jiwon Park, Seokki Lee, Sejin Kim, Sundong Kim

Existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been result-centric, making it difficult to assess the inference process. We introduce a new approach that uses the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of LLMs in a process-centric manner. ARC demands rigorous logical structures for problem-solving, which makes it a benchmark that facilitates comparing model inference abilities with those of humans. Experimental results confirm that although LLMs possess weak inference abilities, they still lag in logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs and propose development paths toward human-level reasoning.

Setup

Follow the instructions in "Create and deploy an Azure OpenAI Service resource".
Follow the instructions in "Quickstart: Get started using GPT-35-Turbo and GPT-4 with Azure OpenAI Service".
Set the following environment variables.

export AZURE_OPENAI_API_KEY="REPLACE_WITH_YOUR_KEY_VALUE_HERE"
export AZURE_OPENAI_ENDPOINT="REPLACE_WITH_YOUR_ENDPOINT_HERE"
export AZURE_OPENAI_DEPLOYMENT_NAME="REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE"
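
Before running an experiment, it can help to confirm that the three variables above are actually visible to Python. The following is a minimal sketch using only the standard library; the function name `load_azure_settings` is hypothetical and is not part of this repository.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def load_azure_settings(environ=None):
    """Return the three settings as a dict; raise if any is missing or empty.

    `environ` defaults to os.environ and can be replaced by a plain dict
    for testing. This helper is an illustration, not repository code.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_azure_settings()` in a fresh shell raises a `RuntimeError` naming any variable that was not exported, which is usually easier to diagnose than an authentication failure later on.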

Clone this repository & install the required packages.

git clone https://github.com/GIST-DSLab/ARC_Prompt.git
cd ARC_Prompt
pip install -r requirements.txt

Follow the Quick Start instructions for each experiment.

Logical Coherence

Accuracy is measured on 100 random ARC tasks solved with Chain of Thought (CoT), Least to Most (LtM), and Tree of Thoughts (ToT) prompts, with each experiment repeated 5 times. The value outside the parentheses is the accuracy when only the final result is correct; the value inside the parentheses is the accuracy when both the result and the reasoning process are correct.

Iteration	Chain of Thought (CoT)	Least to Most (LtM)	Tree of Thoughts (ToT)
1	11%(3%)	6%(4%)	7%(3%)
2	10%(2%)	7%(4%)	5%(1%)
3	10%(5%)	6%(3%)	7%(2%)
4	10%(4%)	4%(2%)	7%(4%)
5	12%(6%)	5%(2%)	6%(2%)
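
To compare the three prompting methods at a glance, the five iterations can be averaged. The sketch below does only that arithmetic on the values copied from the table above; the names `RESULTS` and `summarize` are illustrative and are not part of the repository.

```python
from statistics import mean

# (result-only %, result-and-process %) per iteration, copied from the table above.
RESULTS = {
    "CoT": [(11, 3), (10, 2), (10, 5), (10, 4), (12, 6)],
    "LtM": [(6, 4), (7, 4), (6, 3), (4, 2), (5, 2)],
    "ToT": [(7, 3), (5, 1), (7, 2), (7, 4), (6, 2)],
}


def summarize(results):
    """Return {prompt: (mean result-only %, mean result-and-process %)}."""
    summary = {}
    for prompt, runs in results.items():
        result_only = mean(r for r, _ in runs)
        with_process = mean(p for _, p in runs)
        summary[prompt] = (result_only, with_process)
    return summary


for prompt, (result_only, with_process) in summarize(RESULTS).items():
    print(f"{prompt}: {result_only:.1f}% ({with_process:.1f}%)")
```

With the table values, this gives mean accuracies of 10.6% (4.0%) for CoT, 5.6% (3.0%) for LtM, and 6.4% (2.4%) for ToT.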

The following table analyzes LLMs' reasoning capabilities by task difficulty, following the prior categorization from ARC-Game. The number of ARC tasks in each category is listed in the Tasks row; each task was attempted 5 times, which gives the Trials row.

	Entry	Easy	Medium	Hard
Tasks	2	20	46	14
Trials	10	100	230	70
CoT	100.00%	30.00%	0.00%	0.00%
LtM	20.00%	19.00%	0.00%	2.85%
ToT	50.00%	22.00%	0.00%	0.00%
Average	56.67%	23.67%	0.00%	0.95%
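
The Trials and Average rows follow directly from the other rows: Trials is Tasks multiplied by the 5 repetitions, and Average is the unweighted mean over the three prompts. A small sketch of that arithmetic, with values copied from the table above and hypothetical helper names:

```python
# Values copied from the table above.
TASKS = {"Entry": 2, "Easy": 20, "Medium": 46, "Hard": 14}
REPEATS = 5
ACCURACY = {
    "CoT": {"Entry": 100.00, "Easy": 30.00, "Medium": 0.00, "Hard": 0.00},
    "LtM": {"Entry": 20.00, "Easy": 19.00, "Medium": 0.00, "Hard": 2.85},
    "ToT": {"Entry": 50.00, "Easy": 22.00, "Medium": 0.00, "Hard": 0.00},
}


def trials_per_category(tasks, repeats):
    """Number of trials per difficulty category (tasks x repetitions)."""
    return {category: count * repeats for category, count in tasks.items()}


def average_accuracy(accuracy):
    """Unweighted mean over prompts for each category, rounded to 2 decimals."""
    categories = next(iter(accuracy.values())).keys()
    return {
        category: round(
            sum(row[category] for row in accuracy.values()) / len(accuracy), 2
        )
        for category in categories
    }


print(trials_per_category(TASKS, REPEATS))
print(average_accuracy(ACCURACY))
```

The four categories cover 82 tasks in total, so this table describes a subset of the 100 tasks used in the experiment.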

Compositionality

Accuracy is measured on 99 random ARC tasks solved with the ToT prompt and a domain-specific language (DSL). These tasks are also included in the Logical_Coherence experiment.

	Entry	Easy	Medium	Hard	Tedious	Multiple solutions	Unfixed
Tasks	2	19	46	14	11	6	1
Correct	0	0	3	0	0	0	0
ToT	0.00%	0.00%	6.52%	0.00%	0.00%	0.00%	0.00%
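
The ToT row is the Correct row divided by the Tasks row. The sketch below reproduces it from the counts in the table above and checks that the categories add up to the 99 tasks; the helper name `accuracy_percent` is hypothetical.

```python
# Counts copied from the table above.
TASKS = {
    "Entry": 2, "Easy": 19, "Medium": 46, "Hard": 14,
    "Tedious": 11, "Multiple solutions": 6, "Unfixed": 1,
}
CORRECT = {
    "Entry": 0, "Easy": 0, "Medium": 3, "Hard": 0,
    "Tedious": 0, "Multiple solutions": 0, "Unfixed": 0,
}


def accuracy_percent(correct, tasks):
    """Percentage of correctly solved tasks per category, rounded to 2 decimals."""
    return {c: round(100 * correct[c] / tasks[c], 2) for c in tasks}


accuracy = accuracy_percent(CORRECT, TASKS)
total_tasks = sum(TASKS.values())

print(accuracy["Medium"])  # 6.52
print(total_tasks)         # 99
```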

Productivity

Based on 160 ARC tasks classified by ConceptARC, we evaluated the validity of a total of 2,913 generated examples.

Problem Category	Total available	Number of generated examples	Number of valid augmented examples	Ratio (valid/generated)
Above Below	58	158	34	21.52%
Center	65	236	35	14.83%
Clean Up	106	183	83	45.36%
Complete Shape	58	147	37	25.17%
Copy	27	153	4	2.61%
Count	56	202	29	14.36%
Extend To Boundary	37	167	8	4.79%
Extract Objects	44	176	21	11.93%
Filled Not Filled	58	203	29	14.29%
Horizontal Vertical	32	114	7	6.14%
Inside Outside	52	191	24	12.57%
Move To Boundary	36	165	12	7.27%
Order	47	162	26	16.05%
Same Different	107	246	76	30.89%
Top Bottom 2D	92	255	59	23.14%
Top Bottom 3D	55	215	25	11.63%
Total	930	2913	509	17.12%
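
The last column is the number of valid examples divided by the number of generated examples. As a rough sketch, the per-category ratios can be recomputed from the counts in the table above and used to find the categories with the highest and lowest validity; `COUNTS` and `valid_ratio` are illustrative names, not repository code.

```python
# (generated, valid) counts per ConceptARC category, copied from the table above.
COUNTS = {
    "Above Below": (158, 34),
    "Center": (236, 35),
    "Clean Up": (183, 83),
    "Complete Shape": (147, 37),
    "Copy": (153, 4),
    "Count": (202, 29),
    "Extend To Boundary": (167, 8),
    "Extract Objects": (176, 21),
    "Filled Not Filled": (203, 29),
    "Horizontal Vertical": (114, 7),
    "Inside Outside": (191, 24),
    "Move To Boundary": (165, 12),
    "Order": (162, 26),
    "Same Different": (246, 76),
    "Top Bottom 2D": (255, 59),
    "Top Bottom 3D": (215, 25),
}


def valid_ratio(generated, valid):
    """Percentage of generated examples judged valid, rounded to 2 decimals."""
    return round(100 * valid / generated, 2)


ratios = {name: valid_ratio(g, v) for name, (g, v) in COUNTS.items()}
best = max(ratios, key=ratios.get)
worst = min(ratios, key=ratios.get)

print(best, ratios[best])    # Clean Up 45.36
print(worst, ratios[worst])  # Copy 2.61
```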

Citation

If you find this repo useful for your research, please consider citing our paper:

@misc{lee2024reasoning,
      title={Reasoning Abilities of Large Language Models: In-Depth Analysis on the Abstraction and Reasoning Corpus}, 
      author={Seungpil Lee and Woochang Sim and Donghyeon Shin and Sanha Hwang and Wongyu Seo and Jiwon Park and Seokki Lee and Sejin Kim and Sundong Kim},
      year={2024},
      eprint={2403.11793},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
