[Illustration generated by ChatGPT, in the style of Van Gogh: a million LLMs in humanoid form (as a metaphor), working together to respond to user queries.]

TREC 2025: Million LLMs Track

Benchmarking the Discovery of Expert LLMs

1. Introduction

The Million LLMs Track introduces a novel challenge: ranking large language models (LLMs) based on their expected ability to answer specific user queries.

As organizations deploy ensembles of LLMs—ranging from general-purpose to domain-specific—it becomes crucial to determine which models to consult for a given task. This track focuses on evaluating systems that can effectively identify the most capable LLM(s) for a query, without issuing new queries to the models.

Participants are provided with LLM responses and metadata in advance, and must rank the LLMs for each test query based solely on this information.

2. Task

LLM Ranking Task

Given a user query, your system must rank a fixed set of LLMs. The goal is to predict which LLMs are most likely to produce high-quality answers.

Your submission should consist of a ranked list of LLMs for each query.
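
To make the expected output concrete, here is a minimal baseline sketch in Python. It ranks LLMs by TF-IDF cosine similarity between the query and a text "profile" built from each model's previously provided responses. The LLM ids, profile construction, and scoring approach are all illustrative assumptions, not the track's official method or data schema.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_llms(query: str, llm_profiles: dict[str, str]) -> list[str]:
    """Rank LLM ids, best first, by TF-IDF similarity between the query
    and each LLM's profile (e.g., its concatenated past responses).
    The profile format here is a hypothetical stand-in for the real data.
    """
    llm_ids = list(llm_profiles)
    vectorizer = TfidfVectorizer()
    profile_matrix = vectorizer.fit_transform([llm_profiles[i] for i in llm_ids])
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, profile_matrix).ravel()
    # Highest-scoring LLMs first: predicted most likely to answer well.
    return [llm_ids[i] for i in scores.argsort()[::-1]]

# Toy usage with hypothetical LLM ids and profiles.
profiles = {
    "llm-medical": "hypertension symptoms diagnosis treatments clinical guidelines",
    "llm-legal": "contracts liability precedent statutes litigation",
}
print(rank_llms("What are common treatments for hypertension?", profiles))
# ['llm-medical', 'llm-legal']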

3. Data Provided

The dataset is split into two parts:

- Discovery data: LLM responses and metadata for a set of development queries, used to learn each LLM's areas of expertise.
- Test data: the queries for which your system must produce its final LLM rankings.

Use the discovery data to predict the expertise of each LLM and develop your ranking system, then submit your results on the test set.

Note: No access to raw document collections or LLMs is required. The task is designed to benchmark ranking systems under fixed conditions using provided data.
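
As a toy illustration of working only from the provided data, the sketch below aggregates a query-independent expertise prior for each LLM. It assumes, hypothetically, that each discovery record carries a numeric quality score; the actual record schema is whatever the data release specifies.

from collections import defaultdict

def expertise_priors(discovery_records):
    """Average a per-response quality signal into one score per LLM.

    Assumes each record is a (llm_id, quality_score) pair; whether such
    a quality signal is provided is an assumption, not a track guarantee.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for llm_id, quality in discovery_records:
        totals[llm_id] += quality
        counts[llm_id] += 1
    return {llm_id: totals[llm_id] / counts[llm_id] for llm_id in totals}

# Toy usage with hypothetical records.
records = [("llm-a", 1.0), ("llm-a", 0.5), ("llm-b", 0.25)]
print(expertise_priors(records))  # {'llm-a': 0.75, 'llm-b': 0.25}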

4. Submission Guidelines

Submission Format

A CSV or TSV file with one row per query-LLM pair, giving the rank your system assigns to that LLM for that query.
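
A minimal sketch of writing such a run file, with hypothetical column names (query_id, llm_id, rank, run_id) standing in for the official schema:

import csv

def write_run(path, rankings, run_id="my-run"):
    """Write one row per (query, LLM) pair to a TSV run file.

    `rankings` maps query_id -> ranked list of llm_ids, best first.
    The column names below are placeholders; use the official schema
    once the organizers announce it.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")  # TSV; use "," for CSV
        writer.writerow(["query_id", "llm_id", "rank", "run_id"])
        for query_id, llm_ids in rankings.items():
            for rank, llm_id in enumerate(llm_ids, start=1):
                writer.writerow([query_id, llm_id, rank, run_id])

write_run("run.tsv", {"q1": ["llm-medical", "llm-legal"]})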

Runs Allowed

Each team may submit up to 3 official runs.

Submission Deadline

September 2025

Submission Portal

CodaLab link TBD

System Description

You must provide a short description of your method, including any training data or models used.

5. Evaluation

Submissions will be evaluated using relevance-based metrics.

Evaluation Metrics

Evaluation will be based on hidden ground-truth labels. Results will be reported in aggregate and per query category.
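
The track does not name its metrics here; as one example of a standard relevance-based ranking metric, here is a minimal nDCG@k sketch. Whether nDCG is among the official metrics is an assumption, and the hidden ground-truth labels play the role of `relevance` below.

import math

def ndcg_at_k(ranked_llm_ids, relevance, k=10):
    """nDCG@k for one query's ranked list of LLM ids.

    `relevance` maps llm_id -> graded ground-truth label (hidden from
    participants at submission time). Shown only as an illustration of
    a relevance-based metric, not as the track's official measure.
    """
    dcg = sum(
        relevance.get(llm_id, 0) / math.log2(i + 2)
        for i, llm_id in enumerate(ranked_llm_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage with hypothetical graded labels.
labels = {"llm-a": 3, "llm-b": 1, "llm-c": 0}
print(ndcg_at_k(["llm-b", "llm-a", "llm-c"], labels))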

6. Timeline

July 1, 2025: Training and development data released
September 2025: Test queries released
September 2025: Submission deadline
October 2025: Evaluation & results shared
November 2025: TREC conference

7. Participation & Contact

We welcome participation from all interested teams and researchers.

Organizers

Evangelos Kanoulas
University of Amsterdam, The Netherlands
Panagiotis Eustratiadis
University of Amsterdam, The Netherlands
Mark Sanderson
RMIT University, Australia
Jamie Callan
Carnegie Mellon University, USA

Co-Organizers

Vaishali Pal
University of Amsterdam, The Netherlands
Yougang Lyu
University of Amsterdam, The Netherlands
Zihan Wang
University of Amsterdam, The Netherlands