[Illustration: a million LLMs in humanoid form (as a metaphor) working together to respond to user queries; image generated by ChatGPT in the style of Van Gogh.]

TREC 2025: Million LLMs Track

Benchmarking the Discovery of Expert LLMs

1. Introduction

The way we search for and access information is rapidly evolving. Instead of retrieving documents and snippets, users now interact with AI front-ends—Large Language Models (LLMs) that deliver direct, synthesized answers. In this emerging landscape, the traditional search engine is replaced by an ecosystem of LLMs, each with specialized knowledge, unique capabilities, and distinct access models.

How do we evaluate expertise, compare models, and build reliable multi-agent IR systems in this new world?

The Million LLMs Track introduces a novel challenge: ranking large language models (LLMs) based on their expected ability to answer specific user queries. As organizations deploy ensembles of LLMs, ranging from general-purpose to domain-specific, it becomes crucial to determine which models to consult for a given task. This track focuses on evaluating systems that can effectively identify the most capable LLM(s) for a query, without issuing new queries to the models.

Participants are provided with LLM responses and metadata in advance, and must rank the LLMs for each test query based solely on this information.

2. Task

LLM Ranking Task

Goal:

Given a user query and a set of LLM IDs, your system must rank the LLMs by predicted expertise — that is, how likely each model is to provide a high-quality answer to the query. This task tests your system’s ability to assess and match LLM capabilities to information needs before seeing any answers to the test query.

What you submit:

For each query, submit a ranked list of LLM IDs, ordered from most to least likely to produce a good answer.
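To make the expected output concrete, the Python sketch below shows one possible way to structure a per-query ranking: a scoring function maps each LLM ID to a predicted-expertise score, and the submission is the list of IDs sorted by that score. The scoring function is a placeholder for whatever signal your system derives from the discovery data.

from typing import Callable, List, Tuple

def rank_llms(
    query: str,
    llm_ids: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    # Score every candidate LLM for the query and sort highest-first.
    # score_fn(query, llm_id) stands in for any predicted-expertise signal.
    scored = [(llm_id, score_fn(query, llm_id)) for llm_id in llm_ids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)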

3. Data Provided

The dataset is split into two parts: discovery data (queries with precomputed LLM responses) and test queries.

Use the discovery data to predict the expertise of each LLM and develop your ranking system, then submit your results on the test set.

Note: No access to raw document collections or LLMs is required. The task is designed to benchmark ranking systems under fixed conditions using provided data.

Downloadable Files

File Name             Description
discovery_data_1      LLM expertise discovery queries and precomputed LLM responses.
discovery_metadata_1  The same as discovery_data_1, plus metadata (token log probabilities).
discovery_data_2      LLM expertise discovery queries and precomputed LLM responses.
discovery_metadata_2  The same as discovery_data_2, plus metadata (token log probabilities).
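As one illustration of how the discovery files might be used, the sketch below builds a simple lexical "expertise profile" per LLM from its precomputed responses and scores a query by term overlap, reusing rank_llms from the sketch above. The JSONL layout (llm_id and response fields) and the file name discovery_data_1.jsonl are assumptions made for illustration; adapt the loader to the released schema.

import json
from collections import Counter, defaultdict

def load_profiles(path: str) -> dict:
    # Build a bag-of-words profile per LLM from its discovery responses.
    # Assumes one JSON record per line with "llm_id" and "response" fields.
    profiles = defaultdict(Counter)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            profiles[record["llm_id"]].update(record["response"].lower().split())
    return profiles

def overlap_score(query: str, profile: Counter) -> float:
    # Crude expertise signal: how often the query's terms appear in the LLM's responses.
    total = sum(profile.values()) or 1
    return sum(profile[t] for t in query.lower().split()) / total

# Example usage with the hypothetical file name:
# profiles = load_profiles("discovery_data_1.jsonl")
# ranking = rank_llms("how do mRNA vaccines work", list(profiles),
#                     lambda q, llm_id: overlap_score(q, profiles[llm_id]))

A stronger system would replace the term-overlap profile with dense embeddings or a learned relevance model, but the ranking interface stays the same.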

4. Submission Guidelines

We follow the classic TREC run format, outlined below. Columns are separated by whitespace; column width does not matter, but each line must contain exactly six columns with at least one space between them.

1 Q0 llmid1 1 2.73 runid1
1 Q0 llmid2 2 2.71 runid1
1 Q0 llmid3 3 2.61 runid1
1 Q0 llmid4 4 2.05 runid1
1 Q0 llmid5 5 1.89 runid1

Where:
The first column is the topic (query) number.
The second column is currently unused and should always be "Q0".
The third column is the official identifier of the ranked LLM.
The fourth column is the rank position of the LLM.
The fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order.
The sixth column is the ID of the run being submitted.
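A minimal sketch of writing this format from per-query rankings; the run ID and output file name below are placeholders.

def write_run(rankings: dict, run_id: str, path: str) -> None:
    # rankings maps a topic (query) number to a list of (llm_id, score)
    # pairs already sorted by descending score.
    with open(path, "w", encoding="utf-8") as out:
        for topic, ranked in rankings.items():
            for rank, (llm_id, score) in enumerate(ranked, start=1):
                out.write(f"{topic} Q0 {llm_id} {rank} {score:.4f} {run_id}\n")

# Example with placeholder values:
# write_run({1: [("llmid1", 2.73), ("llmid2", 2.71)]}, "runid1", "my_run.txt")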

Submission Types

The main type of TREC submission is automatic, meaning there is no manual intervention when running the test queries: you should not adjust your runs, rewrite the queries, retrain your model, or make any other manual changes after seeing the test queries. Ideally, you should only check the test queries to verify that they ran properly (i.e., no bugs) before submitting your automatic runs.

However, if you want to have a human in the loop for your run or make any manual adjustments to the model or ranking after seeing the test queries, you can mark your run as manual and provide a description of the types of alterations performed.

Further, we expect that only the discovery data is used to assess the expertise of each LLM. If this is the case, we mark the submission as internal. However, if you use external tools (e.g., LLMs, search engines, or additional data) to assess the expertise of each LLM, you should mark your run as external.

Runs Allowed

Each team may submit up to 5 official runs.

Submission Deadline

September 2025

Submission Portal

Codalab link TBD

System Description

You must provide a short description of your method, including any training data or models used.

5. Evaluation

Submissions will be evaluated using relevance-based metrics.

Evaluation Metrics

Evaluation will be based on hidden ground-truth labels. Results will be reported in aggregate and per query category.
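The track does not name specific metrics here, and the ground-truth labels are hidden. For local validation, a rank-based measure such as nDCG can be computed against labels you construct yourself (for example, from held-out discovery queries), as in the sketch below; the graded labels in the example are made up for illustration.

import math

def ndcg_at_k(ranked_llm_ids: list, relevance: dict, k: int = 10) -> float:
    # nDCG@k for one query. ranked_llm_ids is the submitted order;
    # relevance maps llm_id to a graded label constructed locally.
    dcg = sum(
        relevance.get(llm_id, 0) / math.log2(rank + 1)
        for rank, llm_id in enumerate(ranked_llm_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example with made-up labels:
# ndcg_at_k(["llmid1", "llmid3", "llmid2"], {"llmid1": 3, "llmid2": 2, "llmid3": 0}, k=3)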

6. Timeline

Date              Event
July 7, 2025      Expertise discovery and development data released
September 2025    Test queries released
September 2025    Submission deadline
October 2025      Evaluation & results shared
November 2025     TREC conference

7. Participation & Contact

We welcome participation from all interested teams.

Organizers

Evangelos Kanoulas
University of Amsterdam, The Netherlands
Panagiotis Eustratiadis
University of Amsterdam, The Netherlands
Mark Sanderson
RMIT University, Australia
Jamie Callan
Carnegie Mellon University, USA

Co-Organizers

Yongkang Li
University of Amsterdam, The Netherlands
Jingfen Qiao
University of Amsterdam, The Netherlands
Vaishali Pal
University of Amsterdam, The Netherlands