LLM Evaluation Framework

This directory contains tools for evaluating large language models on various benchmarks.

Overview

The evaluation framework supports multiple benchmark datasets across different domains:

  • Math: AIME24, AIME25 (evaluation scripts provided)
  • Coding: LiveCodeBench v5, LiveCodeBench v6 (evaluation scripts provided)
  • Multiple Choice: MMLU, MMLU Pro, GPQA (MMLU evaluation script provided)
  • Instruction Following: IFEval, IFBench (refer to official evaluation toolkits)
  • General Helpfulness: Arena-Hard (refer to official evaluation toolkit)

Installation

Install required dependencies:

pip install transformers vllm torch tqdm pandas

Directory Structure

evaluation/
β”œβ”€β”€ inference.py                    # Main inference script
β”œβ”€β”€ arguments.py                    # Command-line argument definitions
β”‚
β”œβ”€β”€ data/                          # Benchmark datasets and preprocessing
β”‚   β”œβ”€β”€ benchmark.py               # Dataset preprocessing functions
β”‚   β”œβ”€β”€ aime24/, aime25/           # AIME competition problems
β”‚   β”œβ”€β”€ gpqa/                      # GPQA dataset
β”‚   β”œβ”€β”€ livecodebench/             # LiveCodeBench v5 and v6
β”‚   β”œβ”€β”€ mmlu/, mmlu_pro/           # MMLU variants
β”‚   β”œβ”€β”€ arena-hard-v0.1/, arena-hard-v2.0/  # Arena-Hard benchmarks
β”‚   β”œβ”€β”€ ifeval/, IFBench/          # Instruction following benchmarks
β”‚   └── mt_bench/                  # MT-Bench data
β”‚
β”œβ”€β”€ eval/                          # Evaluation scripts
β”‚   β”œβ”€β”€ get_scores_math.py         # Math benchmarks (AIME24, AIME25)
β”‚   β”œβ”€β”€ get_scores_mmlu_batch.py   # MMLU, MMLU-Pro evaluation
β”‚   β”œβ”€β”€ get_scores_gpqa.py         # GPQA evaluation
β”‚   β”œβ”€β”€ get_scores_code.py         # Code benchmarks (LiveCodeBench)
β”‚   └── tools/                     # Evaluation utilities
β”‚       β”œβ”€β”€ grader.py              # Math answer grading
β”‚       β”œβ”€β”€ code_verifier_utils.py # Code execution and verification
β”‚       └── latex2sympy/           # LaTeX to SymPy conversion
β”‚
β”œβ”€β”€ run.sh                         # Example single benchmark run
β”œβ”€β”€ run_local.sh                   # Local evaluation script
β”œβ”€β”€ run_all.sh                     # Run multiple benchmarks in parallel
└── README.md                      # This file

Usage

Quick Start

  1. Edit run.sh to configure your model and data paths
  2. Run the evaluation:
bash run.sh

Advanced Usage

Run inference directly with custom parameters:

python inference.py \
    --model-folder /path/to/models \
    --model-name your-model \
    --tokenizer-folder /path/to/tokenizers \
    --tokenizer-name your-tokenizer \
    --benchmark-folder /path/to/benchmarks \
    --eval-dataset aime24 \
    --temperature 0.6 \
    --topp 0.95 \
    --batch-size 2048

We suggest following the configuration reported in the paper and running each benchmark with k different random seeds.
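
As a rough illustration (not part of this repository), per-seed results can be aggregated into a mean and standard deviation; the accuracy values below are placeholders:

    # Aggregate hypothetical per-seed accuracies into mean and standard deviation.
    # Real scores come from the eval/ scripts described below.
    import statistics

    seed_accuracies = [0.62, 0.58, 0.60]  # placeholder results from k = 3 seeds
    mean_acc = statistics.mean(seed_accuracies)
    std_acc = statistics.stdev(seed_accuracies)
    print(f"accuracy: {mean_acc:.3f} ± {std_acc:.3f}")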

Key Arguments

Model Configuration (Required)

  • --model-folder: Directory containing model weights
  • --model-name: Name of the model subdirectory
  • --tokenizer-folder: Directory containing tokenizer files
  • --tokenizer-name: Name of the tokenizer subdirectory

Dataset Selection (Required for evaluation)

  • --benchmark-folder: Root directory containing all benchmark datasets
  • --eval-dataset: Name of the evaluation dataset (see supported datasets above)

Inference Parameters (Optional)

  • --temperature: Sampling temperature (default: 0 for greedy decoding)
  • --topp: Top-p (nucleus) sampling threshold (default: 1.0)
  • --topk: Top-k sampling threshold (default: 1)
  • --max-output-len: Maximum output length in tokens (default: 2048)
  • --batch-size: Batch size for inference (default: 16)
  • --tensor-parallel-size: Number of GPUs for tensor parallelism (default: 1)
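
For orientation only, the sketch below shows how these options typically map onto vLLM's sampling API; inference.py is the authoritative implementation, and the model path and values here are placeholders:

    # Hedged sketch of the vLLM calls behind the inference parameters above.
    from vllm import LLM, SamplingParams

    llm = LLM(model="/path/to/models/your-model", tensor_parallel_size=1)
    sampling = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048, seed=42)
    outputs = llm.generate(["Solve: 1 + 1 = ?"], sampling)
    print(outputs[0].outputs[0].text)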

Dataset Subsetting (Optional)

  • --start-idx: Starting index for dataset subsetting (default: -1, disabled)
  • --end-idx: Ending index for dataset subsetting (default: -1, disabled)

Other Options

  • --seed: Random seed for reproducibility (default: 42)
  • --no-think: Disable thinking mode (flag, thinking enabled by default)
  • --yarn-factor: Scaling factor for YaRN RoPE extension (default: 1)
  • --device-id: Comma-separated GPU device IDs (optional)
  • --model-output-path: Path to first turn output (required for mtbench_secondturn only)

Supported Datasets

  • aime24 / aime25: AIME competition problems
  • lcb5 / lcb6: LiveCodeBench (versions 5 and 6)
  • mmlu: MMLU 5-shot evaluation
  • mmlu_pro: MMLU Pro dataset
  • gpqa_diamond: GPQA Diamond subset
  • ifeval: IFEval instruction following
  • ifbench: IFBench instruction following
  • arena_hard: Arena-Hard v0.1

Running Evaluation Scripts

After generating model outputs using inference.py, you can compute metrics using the evaluation scripts in the eval/ directory.

For reproducibility, we also provide our cached generation files in the corresponding model repository.

Math Benchmarks (AIME24, AIME25)

Evaluate math problem-solving performance:

cd eval
python get_scores_math.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates AIME24 and AIME25 benchmarks
  • Extracts answers from \boxed{} and other formats
  • Computes accuracy with mathematical equivalence checking
  • Reports mean accuracy and standard deviation across multiple runs
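
The \boxed{} extraction mentioned above can be sketched roughly as follows; the actual grading logic lives in eval/tools/grader.py and handles many more answer formats:

    # Rough illustration of \boxed{} answer extraction (not the repo's grader).
    import re

    def extract_boxed(completion):
        """Return the content of the last \\boxed{...} span, or None."""
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
        return matches[-1].strip() if matches else None

    print(extract_boxed(r"Therefore the answer is \boxed{204}."))  # -> 204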

Multiple Choice (MMLU, MMLU-Pro, GPQA)

Evaluate MMLU and variants:

cd eval
python get_scores_mmlu_batch.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks \
    --verbose  # Optional: print per-category accuracy

This script evaluates:

  • MMLU: Standard MMLU with 4 choices (A-D)
  • MMLU-Pro: Extended version with up to 16 choices (A-P)

Features:

  • Supports boxed answer format (e.g., \boxed{A})
  • Extracts letter choices from various formats (parentheses, text, etc.)
  • Handles batch-split output files automatically
  • Computes accuracy across all MMLU variants
  • Optional per-category breakdown with --verbose flag
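
A simplified version of the letter-choice extraction might look like the sketch below; get_scores_mmlu_batch.py covers additional formats as well as the batch-split handling:

    # Simplified letter-choice extraction, for illustration only.
    import re

    def extract_choice(completion, num_choices=16):
        """Return the answer letter, preferring a boxed choice if present."""
        letters = "ABCDEFGHIJKLMNOP"[:num_choices]
        boxed = re.findall(rf"\\boxed\{{([{letters}])\}}", completion)
        if boxed:
            return boxed[-1]
        plain = re.findall(rf"\b([{letters}])\b", completion)
        return plain[-1] if plain else None

    print(extract_choice(r"The correct option is \boxed{C}."))  # -> C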

Evaluate GPQA (Graduate-Level Google-Proof Q&A) performance:

cd eval
python get_scores_gpqa.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates GPQA Diamond subset
  • Extracts answers from boxed and text formats
  • Uses mathematical equivalence checking for complex answers
  • Reports accuracy with standard deviation
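
The equivalence checking mentioned above is roughly in the spirit of the sketch below; the real logic lives in eval/tools/grader.py and eval/tools/latex2sympy/, and sympy is assumed to be installed:

    # Illustration of symbolic answer comparison (not the repo's grader).
    from sympy import simplify, sympify

    def answers_match(pred, gold):
        """Fall back to plain string comparison when parsing fails."""
        try:
            return simplify(sympify(pred) - sympify(gold)) == 0
        except Exception:
            return pred.strip() == gold.strip()

    print(answers_match("1/2", "0.5"))    # True
    print(answers_match("2*pi", "pi*2"))  # True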

Code Generation (LiveCodeBench)

Evaluate code generation performance:

cd eval
python get_scores_code.py \
    --modelfolder /path/to/model/outputs \
    --testfolder /path/to/test_benchmarks

This script:

  • Evaluates LiveCodeBench v5 and v6
  • Executes generated code against test cases
  • Computes pass rate (percentage of problems solved correctly)
  • Reports finish rate (percentage of valid code generations)
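
Conceptually, checking a single test case works along the lines of the sketch below; eval/tools/code_verifier_utils.py implements the real execution with time limits and additional test formats:

    # Hedged sketch of stdin/stdout test-case checking (illustration only).
    import subprocess

    def passes_test(solution_file, stdin_text, expected_stdout, timeout=10):
        """Run one candidate program on one test case and compare outputs."""
        try:
            result = subprocess.run(
                ["python", solution_file],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.stdout.strip() == expected_stdout.strip()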

Note: Code execution requires:

pip install numpy tqdm

Other Benchmarks

For IFEval, IFBench, and Arena-Hard, please refer to their official evaluation repositories due to licensing restrictions.

These benchmarks require specific evaluation logic and may have licensing terms that restrict redistribution of evaluation code.

Output Format

Results are saved as JSONL files in:

{model_folder}/{model_name}/outputs_vllm073[_topp{topp}_seed{seed}]/{eval_dataset}.jsonl

Each line contains:

  • task_id or question_id: Unique identifier for the question
  • output: Model's generated response
  • reason: Whether reasoning was used (boolean)
  • reason_text: The reasoning/thinking content (if applicable)
  • Additional dataset-specific fields
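
For example, a run with top-p 0.95 and seed 42 could be inspected as follows; the concrete folder suffix and dataset name are examples only:

    # Load one result file following the path pattern above (paths are examples).
    import json, os

    run_dir = os.path.join("/path/to/model/outputs", "your-model",
                           "outputs_vllm073_topp0.95_seed42")
    with open(os.path.join(run_dir, "aime24.jsonl")) as f:
        records = [json.loads(line) for line in f]
    print(len(records), records[0].get("task_id") or records[0].get("question_id"))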

Adding New Datasets

To add a new dataset:

  1. Add a preprocessing function in data/benchmark.py:

    def preprocess_your_dataset(data_file):
        """Preprocess your dataset.
        
        Args:
            data_file: Path to dataset file
        
        Returns:
            tuple: (prompt_list, qid_list) or just prompt_list
        """
        # Your preprocessing logic
        pass
    
  2. Add the dataset path argument in arguments.py:

    group.add_argument('--your-dataset-path', type=str, default='path/to/dataset')
    
  3. Add a case for the dataset in the get_prompt_list() function in inference.py:

    elif args.eval_dataset == "your_dataset":
        from data.benchmark import preprocess_your_dataset
        input_datapath = os.path.join(args.benchmark_folder, args.your_dataset_path)
        prompt_list, qid_list = preprocess_your_dataset(input_datapath)
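
Putting the three steps together, here is a hypothetical end-to-end example for a dataset stored as JSONL with "question" and "id" fields; the field names and file format are assumptions, so adapt them to your data:

    # Hypothetical preprocessing function for a JSONL dataset; adapt the field
    # names ("question", "id") to your dataset's actual schema.
    import json

    def preprocess_your_dataset(data_file):
        prompt_list, qid_list = [], []
        with open(data_file, "r", encoding="utf-8") as f:
            for line in f:
                item = json.loads(line)
                prompt_list.append(item["question"])
                qid_list.append(item["id"])
        return prompt_list, qid_list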
    

Notes

  • The framework uses vLLM for efficient inference with batching and tensor parallelism support
  • Special handling is provided for models like DeepSeek-R1 that require eager mode
  • Thinking mode (<think> tags) is supported for models trained with reasoning capabilities
  • YaRN RoPE scaling is supported for extended context lengths
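
For reference, separating the thinking content from the final answer can be sketched as below; the framework's own parsing in inference.py may differ:

    # Hedged sketch of splitting <think> content from the final answer,
    # mirroring the reason_text / output fields described above.
    def split_thinking(completion):
        if "</think>" in completion:
            reason_text, answer = completion.split("</think>", 1)
            return reason_text.replace("<think>", "").strip(), answer.strip()
        return "", completion.strip()

    reason, answer = split_thinking("<think>Check small cases.</think>The answer is 7.")
    print(reason)  # Check small cases.
    print(answer)  # The answer is 7.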

License

See the main repository LICENSE file for licensing information.