#AI #LLM #benchmarks
This note summarizes the paper [SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines](https://arxiv.org/abs/2502.14739) by ByteDance Inc. and 2077.AI. The paper introduces **SuperGPQA**, a comprehensive benchmark designed to evaluate LLMs' knowledge and reasoning capabilities across 285 graduate-level disciplines.
To build **SuperGPQA**, the authors propose a large-scale Human-LLM collaborative annotation system and share the valuable lessons they learned along the way. They divide the annotation pipeline into three major stages:
* Source Screening
    - Expert annotators collect credible question sources across different disciplines
    - Focus on ensuring the reliability and appropriate difficulty of the raw questions
    - Emphasis on maintaining academic standards and domain expertise
* Transcription
    - Crowd-sourcing annotators perform several tasks:
        - Convert raw questions into multiple-choice format
        - Create plausible distractor options
        - Assess question difficulty and reliability
        - Evaluate questions based on expert input and LLM performance
* Quality Inspection
    - Systematic identification of suspicious questions using LLM response patterns
    - Expert review and revision of flagged questions with web-based verification
    - Difficulty calibration based on LLM performance to ensure the benchmark discriminates between models (a minimal sketch of this idea follows the list)
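The difficulty-calibration step lends itself to a simple heuristic: score each question by the fraction of evaluated LLMs that answer it correctly, then bucket questions into difficulty tiers. The sketch below is my own illustration of that idea, not the authors' implementation; the `Question` fields, the `responses` format, and the tier thresholds are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Question:
    qid: str
    discipline: str
    correct_option: str          # e.g. "C"
    responses: dict[str, str]    # model name -> chosen option (assumed format)

def pass_rate(q: Question) -> float:
    """Fraction of evaluated LLMs that picked the correct option."""
    if not q.responses:
        return 0.0
    correct = sum(1 for choice in q.responses.values() if choice == q.correct_option)
    return correct / len(q.responses)

def difficulty_tier(q: Question) -> str:
    """Bucket a question by LLM pass rate (thresholds are illustrative only)."""
    rate = pass_rate(q)
    if rate >= 0.8:
        return "easy"    # most models solve it; weak discrimination
    if rate >= 0.3:
        return "medium"
    return "hard"        # candidate for extra expert verification

# Example: a question that only 1 of 3 models answers correctly lands in "medium".
q = Question(
    qid="phys-001",
    discipline="Condensed Matter Physics",
    correct_option="B",
    responses={"model_a": "B", "model_b": "D", "model_c": "A"},
)
print(difficulty_tier(q))  # -> "medium"
```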
The authors share several valuable lessons learned during this building process:
* Source Screening Insights
- Crowd-sourcing is ineffective for collecting high-expertise content
- Online question repositories require careful verification, even when "verified"
- Questions derived from calculation and reasoning problems show better discrimination than pre-existing multiple-choice questions
* Transcription Challenges
- Crowd-sourcing annotators struggle with distractor quality assessment
- Questions phrased as "select the correct/incorrect option" need to be reformatted into a standardized multiple-choice form
- Special attention needed for maintaining consistency in question format
* Quality Inspection Findings
- Consistent incorrect answers across LLMs indicate potential question issues
- Uniform errors across SOTA LLMs often suggest memorization from unreliable sources
- Regular pattern analysis of LLM responses helps identify problematic questions
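One way to make the last two findings concrete is to flag questions where the evaluated LLMs all answer incorrectly and converge on the same wrong option, the pattern the authors associate with memorization of unreliable sources. The snippet below is a hedged sketch of that heuristic; the `responses` structure, the flag labels, and the thresholds are hypothetical and not taken from the paper.

```python
from collections import Counter

def flag_suspicious(correct_option: str, responses: dict[str, str]) -> str | None:
    """Classify a question's LLM response pattern.

    Returns a flag label, or None if the pattern looks unremarkable.
    Labels and logic are illustrative, not the authors' criteria.
    """
    if not responses:
        return None
    choices = Counter(responses.values())
    top_option, top_count = choices.most_common(1)[0]
    all_wrong = all(choice != correct_option for choice in responses.values())

    if all_wrong and top_count == len(responses):
        # Every model picks the same wrong option: possible memorization of a
        # bad source answer, or a mislabeled key -- send to expert review.
        return "uniform_wrong_answer"
    if all_wrong:
        # All models fail but disagree: plausibly just a hard question.
        return "all_wrong_mixed"
    return None

# Example: three models agree on "D" while the key says "B" -> flagged for review.
print(flag_suspicious("B", {"model_a": "D", "model_b": "D", "model_c": "D"}))
```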