#AI #LLM #benchmarks

This note summarizes the paper [SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines](https://arxiv.org/abs/2502.14739) by ByteDance Inc. and 2077.AI. The paper introduces **SuperGPQA**, a comprehensive benchmark designed to evaluate LLMs' knowledge and reasoning capabilities across 285 graduate-level disciplines. To build **SuperGPQA**, the authors propose a large-scale human-LLM collaboration system and share the valuable lessons learned in the paper.

They divide the annotation system of **SuperGPQA** into three major stages:

* Source Screening
  - Expert annotators collect credible question sources across different disciplines
  - Focus on ensuring reliability and appropriate difficulty level of raw questions
  - Emphasis on maintaining academic standards and domain expertise
* Transcription
  - Crowd-sourcing annotators perform several tasks:
    - Convert raw questions into multiple-choice format
    - Create plausible distractor options
    - Assess question difficulty and reliability
    - Evaluate questions based on expert input and LLM performance
* Quality Inspection
  - Systematic identification of suspicious questions using LLM response patterns
  - Expert review and revision of flagged questions with web-based verification
  - Difficulty calibration based on LLM performance to ensure benchmark discrimination

There are some valuable lessons the authors learned during this building process:

* Source Screening Insights
  - Crowd-sourcing is ineffective for collecting high-expertise content
  - Online question repositories require careful verification, even when "verified"
  - Questions derived from calculation and reasoning problems show better discrimination than pre-existing multiple-choice questions
* Transcription Challenges
  - Crowd-sourcing annotators struggle with distractor quality assessment
  - Questions requiring correct/incorrect option selection need standardized reformatting
  - Special attention is needed to maintain consistency in question format
* Quality Inspection Findings
  - Consistent incorrect answers across LLMs indicate potential question issues
  - Uniform errors across SOTA LLMs often suggest memorization from unreliable sources
  - Regular pattern analysis of LLM responses helps identify problematic questions (see the sketch below)
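
To make the quality-inspection idea concrete, here is a minimal sketch of the kind of response-pattern heuristic described above. It is not the paper's actual pipeline; the class, function names, and the 0.8 consensus threshold are my own assumptions. The idea: if most LLMs converge on the same *wrong* option, the answer key or source is probably suspect and the question should be routed to expert review, while the LLM accuracy doubles as a rough difficulty score for calibration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QuestionRecord:
    question_id: str
    answer_key: str              # gold option label, e.g. "C"
    llm_answers: dict[str, str]  # model name -> selected option label


def inspect(record: QuestionRecord, flag_threshold: float = 0.8) -> dict:
    """Hypothetical heuristic inspired by the paper's quality-inspection stage.

    Signals:
      * strong agreement on the same wrong option -> possibly flawed key or
        memorization from an unreliable source
      * fraction of correct LLM answers -> crude difficulty estimate
    """
    answers = list(record.llm_answers.values())
    n = len(answers)
    accuracy = sum(a == record.answer_key for a in answers) / n if n else 0.0

    # Most common option among the LLMs and how dominant it is.
    top_option, top_count = Counter(answers).most_common(1)[0]
    consensus = top_count / n if n else 0.0

    suspicious = top_option != record.answer_key and consensus >= flag_threshold
    return {
        "question_id": record.question_id,
        "llm_accuracy": accuracy,        # used for difficulty calibration
        "consensus_option": top_option,
        "suspicious": suspicious,        # route to expert review if True
    }


# Example: every model picks "B" while the key says "C" -> flagged for review.
record = QuestionRecord(
    question_id="math-0042",
    answer_key="C",
    llm_answers={"model_a": "B", "model_b": "B", "model_c": "B", "model_d": "B"},
)
print(inspect(record))
```

In practice the authors combine such automated flags with expert review and web-based verification rather than discarding questions automatically; the sketch only illustrates the flagging step.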