#AI #LLM #benchmarks  
This note summarizes the paper [SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines](https://arxiv.org/abs/2502.14739) by ByteDance Inc. and 2077.AI. The paper introduces **SuperGPQA**, a comprehensive benchmark designed to evaluate LLMs' knowledge and reasoning capabilities across 285 graduate-level disciplines.
To build **SuperGPQA**, the authors propose a large-scale human-LLM collaborative annotation system and share the lessons learned along the way. They divide the annotation pipeline into three major stages:
* Source Screening
	- Expert annotators collect credible question sources across different disciplines
	- Focus on ensuring reliability and appropriate difficulty level of raw questions
	- Emphasis on maintaining academic standards and domain expertise
* Transcription
	- Crowd-sourcing annotators perform several tasks:
		- Convert raw questions into multiple-choice format (a sketch of a possible normalized record follows this list)
		- Create plausible distractor options
		- Assess question difficulty and reliability
		- Evaluate questions based on expert input and LLM performance
* Quality Inspection
	A rigorous three-tier inspection process:
	- Systematic identification of suspicious questions using LLM response patterns
	- Expert review and revision of flagged questions with web-based verification
	- Difficulty calibration based on LLM performance to ensure benchmark discrimination (see the calibration sketch after this list)
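
The transcription stage normalizes every item into a uniform multiple-choice record. The paper does not publish its internal data schema, so the sketch below is only an assumed illustration of what such a record could look like; the class name `MCQItem`, its field names, and the `to_prompt` helper are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MCQItem:
    """One transcribed multiple-choice item (hypothetical schema, not the paper's)."""
    uid: str                     # stable question identifier
    discipline: str              # one of the 285 graduate disciplines (label format assumed)
    question: str                # stem rewritten from the raw source question
    options: list[str] = field(default_factory=list)  # gold answer plus plausible distractors
    answer_index: int = 0        # position of the gold answer within `options`
    difficulty: str = "unknown"  # annotator/LLM-informed label, refined during quality inspection
    source_note: str = ""        # provenance note kept for the quality-inspection stage


def to_prompt(item: MCQItem) -> str:
    """Render the item as a lettered multiple-choice prompt for an LLM."""
    letters = "ABCDEFGHIJ"
    lines = [item.question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with the letter of the single best option.")
    return "\n".join(lines)
```

Keeping the gold index separate from the rendered letters makes it easy to shuffle option order between evaluation runs without touching the stored record.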
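Difficulty calibration can be read as scoring each question by how many models in an evaluation panel answer it correctly and binning the result. The paper does not give this procedure as code, so the function below is a hedged sketch under assumed data shapes; the bin thresholds and the name `calibrate_difficulty` are illustrative, not the authors'.

```python
from collections import defaultdict


def calibrate_difficulty(responses, gold):
    """
    responses: {model_name: {question_id: chosen_option_index}}
    gold:      {question_id: correct_option_index}
    Returns {question_id: (accuracy_across_models, difficulty_bin)}.
    Bin thresholds are illustrative, not taken from the paper.
    """
    correct_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for model, answers in responses.items():
        for qid, choice in answers.items():
            total_counts[qid] += 1
            correct_counts[qid] += int(choice == gold[qid])

    calibrated = {}
    for qid, total in total_counts.items():
        acc = correct_counts[qid] / total
        if acc >= 0.8:
            bin_label = "easy"    # most models solve it: weak discrimination
        elif acc >= 0.3:
            bin_label = "medium"
        else:
            bin_label = "hard"
        calibrated[qid] = (acc, bin_label)
    return calibrated
```

Questions landing in the "easy" bin contribute little discrimination between strong models, which is the property the calibration step is meant to protect.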
The authors highlight several valuable lessons learned during this building process:
* Source Screening Insights 
	- Crowd-sourcing is ineffective for collecting high-expertise content
	- Online question repositories require careful verification, even when "verified"
	- Questions derived from calculation and reasoning problems show better discrimination than pre-existing multiple-choice questions    
* Transcription Challenges 
	- Crowd-sourcing annotators struggle with distractor quality assessment
	- Questions that ask for the correct or the incorrect option need standardized reformatting
	- Special attention needed for maintaining consistency in question format 
* Quality Inspection Findings 
	- Consistent incorrect answers across LLMs indicate potential question issues
	- Uniform errors across SOTA LLMs often suggest memorization from unreliable sources
	- Regular pattern analysis of LLM responses helps identify problematic questions (a minimal sketch of such a check follows)
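
One way to operationalize these findings is to flag questions on which every model in the panel converges on the same wrong option, the pattern the authors associate with flawed gold labels or memorization of an unreliable source. The function below is a sketch under assumed data shapes, not the paper's inspection code; `flag_suspicious` and the `min_models` threshold are hypothetical.

```python
from collections import Counter


def flag_suspicious(responses, gold, min_models=3):
    """
    responses: {model_name: {question_id: chosen_option_index}}
    gold:      {question_id: correct_option_index}
    Flags question ids where all responding models agree on one option
    that differs from the gold answer.
    """
    by_question = {}
    for model, answers in responses.items():
        for qid, choice in answers.items():
            by_question.setdefault(qid, []).append(choice)

    flagged = []
    for qid, choices in by_question.items():
        if len(choices) < min_models:
            continue  # too few model responses to trust the pattern
        most_common, count = Counter(choices).most_common(1)[0]
        if count == len(choices) and most_common != gold[qid]:
            flagged.append(qid)  # unanimous wrong answer -> send to expert review
    return flagged
```

Flagged items would then go to the expert review and web-based verification step described above, rather than being dropped automatically.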