LLM Evaluation Framework
This directory contains tools for evaluating large language models on various benchmarks.
Overview
The evaluation framework supports multiple benchmark datasets across different domains:
- Math: AIME24, AIME25 (evaluation scripts provided)
- Coding: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
- Multiple Choice: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
- Instruction Following: IFEval, IFBench (refer to official evaluation toolkits)
- General Helpfulness: Arena-Hard (refer to official evaluation toolkit)
Installation
Install required dependencies:
pip install transformers vllm torch tqdm pandas
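If you prefer to keep these dependencies isolated, a minimal setup using a virtual environment might look like the sketch below (the environment name is arbitrary):

```bash
# Optional: create and activate an isolated environment before installing
python -m venv eval-env
source eval-env/bin/activate
pip install transformers vllm torch tqdm pandas
```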
Directory Structure
evaluation/
├── inference.py                  # Main inference script
├── arguments.py                  # Command-line argument definitions
│
├── data/                         # Benchmark datasets and preprocessing
│   ├── benchmark.py              # Dataset preprocessing functions
│   ├── aime24/, aime25/          # AIME competition problems
│   ├── gpqa/                     # GPQA dataset
│   ├── livecodebench/            # LiveCodeBench v5 and v6
│   ├── mmlu/, mmlu_pro/          # MMLU variants
│   ├── arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
│   ├── ifeval/, IFBench/         # Instruction following benchmarks
│   └── mt_bench/                 # MT-Bench data
│
├── eval/                         # Evaluation scripts
│   ├── get_scores_math.py        # Math benchmarks (AIME24, AIME25)
│   ├── get_scores_mmlu_batch.py  # MMLU, MMLU-Pro evaluation
│   ├── get_scores_gpqa.py        # GPQA evaluation
│   ├── get_scores_code.py        # Code benchmarks (LiveCodeBench)
│   └── tools/                    # Evaluation utilities
│       ├── grader.py             # Math answer grading
│       ├── code_verifier_utils.py  # Code execution and verification
│       └── latex2sympy/          # LaTeX to SymPy conversion
│
├── run.sh                        # Example single benchmark run
├── run_local.sh                  # Local evaluation script
├── run_all.sh                    # Run multiple benchmarks in parallel
└── README.md                     # This file
Usage
Quick Start
- Edit run.sh to configure your model and data paths
- Run the evaluation:
bash run.sh
Advanced Usage
Run inference directly with custom parameters:
python inference.py \
--model-folder /path/to/models \
--model-name your-model \
--tokenizer-folder /path/to/tokenizers \
--tokenizer-name your-tokenizer \
--benchmark-folder /path/to/benchmarks \
--eval-dataset aime24 \
--temperature 0.6 \
--topp 0.95 \
--batch-size 2048
We suggest following the configuration reported in the paper and running each benchmark with k different random seeds; the evaluation scripts report the mean and standard deviation across runs.
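For example, a minimal sketch of a four-seed sweep over AIME24 (the seed values and sampling settings here are illustrative, not prescribed by the paper):

```bash
# Run the same benchmark with several seeds; each run is written to a
# seed-specific output directory (see Output Format below).
for seed in 42 43 44 45; do
    python inference.py \
        --model-folder /path/to/models \
        --model-name your-model \
        --tokenizer-folder /path/to/tokenizers \
        --tokenizer-name your-tokenizer \
        --benchmark-folder /path/to/benchmarks \
        --eval-dataset aime24 \
        --temperature 0.6 \
        --topp 0.95 \
        --seed "$seed"
done
```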
Key Arguments
Model Configuration (Required)
- --model-folder: Directory containing model weights
- --model-name: Name of the model subdirectory
- --tokenizer-folder: Directory containing tokenizer files
- --tokenizer-name: Name of the tokenizer subdirectory
Dataset Selection (Required for evaluation)
- --benchmark-folder: Root directory containing all benchmark datasets
- --eval-dataset: Name of the evaluation dataset (see supported datasets above)
Inference Parameters (Optional)
- --temperature: Sampling temperature (default: 0 for greedy decoding)
- --topp: Top-p (nucleus) sampling threshold (default: 1.0)
- --topk: Top-k sampling threshold (default: 1)
- --max-output-len: Maximum output length in tokens (default: 2048)
- --batch-size: Batch size for inference (default: 16)
- --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 1)
Dataset Subsetting (Optional)
- --start-idx: Starting index for dataset subsetting (default: -1, disabled)
- --end-idx: Ending index for dataset subsetting (default: -1, disabled)
Other Options
- --seed: Random seed for reproducibility (default: 42)
- --no-think: Disable thinking mode (flag; thinking is enabled by default)
- --yarn-factor: Scaling factor for YaRN RoPE extension (default: 1)
- --device-id: Comma-separated GPU device IDs (optional)
- --model-output-path: Path to first-turn output (required for mtbench_secondturn only)
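As an example, a hypothetical multi-GPU run combining several of the optional arguments above (the specific values are illustrative):

```bash
# 4-way tensor parallelism on GPUs 0-3 with a larger output budget
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset lcb5 \
    --tensor-parallel-size 4 \
    --device-id 0,1,2,3 \
    --max-output-len 16384
```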
Supported Datasets
- aime24 / aime25: AIME competition problems
- lcb5 / lcb6: LiveCodeBench (versions 5 and 6)
- mmlu: MMLU 5-shot evaluation
- mmlu_pro: MMLU Pro dataset
- gpqa_diamond: GPQA Diamond subset
- ifeval: IFEval instruction following
- ifbench: IFBench instruction following
- arena_hard: Arena-Hard v0.1
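run_all.sh runs multiple benchmarks in parallel; a simple sequential sketch using the dataset names above (shared arguments factored into a shell variable for brevity, paths are placeholders) would be:

```bash
# Generate outputs for every supported dataset, one after another
COMMON_ARGS="--model-folder /path/to/models --model-name your-model \
  --tokenizer-folder /path/to/tokenizers --tokenizer-name your-tokenizer \
  --benchmark-folder /path/to/benchmarks"

for ds in aime24 aime25 lcb5 lcb6 mmlu mmlu_pro gpqa_diamond ifeval ifbench arena_hard; do
    python inference.py $COMMON_ARGS --eval-dataset "$ds"
done
```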
Running Evaluation Scripts
After generating model outputs using inference.py, you can compute metrics using the evaluation scripts in the eval/ directory.
We also provide our cached generation files in the corresponding model repository for reproducibility.
Math Benchmarks (AIME24, AIME25)
Evaluate math problem-solving performance:
cd eval
python get_scores_math.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
This script:
- Evaluates AIME24 and AIME25 benchmarks
- Extracts answers from \boxed{} and other formats
- Computes accuracy with mathematical equivalence checking
- Reports mean accuracy and standard deviation across multiple runs
Multiple Choice (MMLU, MMLU-Pro, GPQA)
Evaluate MMLU and variants:
cd eval
python get_scores_mmlu_batch.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks \
--verbose # Optional: print per-category accuracy
This script evaluates:
- MMLU: Standard MMLU with 4 choices (A-D)
- MMLU-Pro: Extended version with up to 16 choices (A-P)
Features:
- Supports boxed answer format (e.g., \boxed{A})
- Extracts letter choices from various formats (parentheses, text, etc.)
- Handles batch-split output files automatically
- Computes accuracy across all MMLU variants
- Optional per-category breakdown with the --verbose flag
Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:
cd eval
python get_scores_gpqa.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
This script:
- Evaluates GPQA Diamond subset
- Extracts answers from boxed and text formats
- Uses mathematical equivalence checking for complex answers
- Reports accuracy with standard deviation
Code Generation (LiveCodeBench)
Evaluate code generation performance:
cd eval
python get_scores_code.py \
--modelfolder /path/to/model/outputs \
--testfolder /path/to/test_benchmarks
This script:
- Evaluates LiveCodeBench v5 and v6
- Executes generated code against test cases
- Computes pass rate (percentage of problems solved correctly)
- Reports finish rate (percentage of valid code generations)
Note: Code execution requires:
pip install numpy tqdm
Other Benchmarks
For the following benchmarks, please refer to their official evaluation repositories due to licensing restrictions:
- Arena-Hard: Use the official Arena-Hard evaluation toolkit
- IFEval: Use the official IFEval evaluation script
- IFBench: Use the official IFBench evaluation toolkit
These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.
Output Format
Results are saved as JSONL files in:
{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl
Each line contains:
- task_id or question_id: Unique identifier for the question
- output: Model's generated response
- reason: Whether reasoning was used (boolean)
- reason_text: The reasoning/thinking content (if applicable)
- Additional dataset-specific fields
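To spot-check a run, you can pretty-print a single record; the path below simply follows the pattern above with illustrative placeholder values:

```bash
# Pretty-print the first generated record of an AIME24 run
head -n 1 /path/to/models/your-model/outputs_vllm073_topp0.95_seed42/aime24.jsonl \
    | python -m json.tool
```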
Adding New Datasets
To add a new dataset:
1. Add a preprocessing function in data/benchmark.py:

       def preprocess_your_dataset(data_file):
           """Preprocess your dataset.

           Args:
               data_file: Path to dataset file

           Returns:
               tuple: (prompt_list, qid_list) or just prompt_list
           """
           # Your preprocessing logic
           pass

2. Add the dataset path argument in arguments.py:

       group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')

3. Add the dataset case in inference.py in the get_prompt_list() function:

       elif args.eval_dataset == "your_dataset":
           from data.benchmark import preprocess_your_dataset
           input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
           prompt_list, qid_list = preprocess_your_dataset(input_datapath)
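Once the three pieces are in place, it can help to smoke-test the new dataset on a handful of examples with the subsetting flags before launching a full run (the dataset name and indices below are illustrative):

```bash
# Quick check: run inference on only the first 8 examples of the new dataset
python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset your_dataset \
    --start-idx 0 \
    --end-idx 8
```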
Notes
- The framework uses vLLM for efficient inference with batching and tensor parallelism support
- Special handling is provided for models like DeepSeek-R1 that require eager mode
- Thinking mode (<think> tags) is supported for models trained with reasoning capabilities
- YaRN RoPE scaling is supported for extended context lengths
License
See the main repository LICENSE file for licensing information.