TREC 2025: Million LLMs Track

1. Introduction

The way we search for and access information is rapidly evolving. Instead of retrieving documents and snippets, users now interact with AI front-ends—Large Language Models (LLMs) that deliver direct, synthesized answers. In this emerging landscape, the traditional search engine is replaced by an ecosystem of LLMs, each with specialized knowledge, unique capabilities, and distinct access models:

- Some LLMs excel at in-depth answers; others provide concise, factual responses.
- Some are trained on general knowledge, while others focus on specific domains like law, medicine, or technology.
- Some LLMs are multilingual, while others specialize in a single language.
- Some LLMs are designed for specific tasks like summarization or translation, while others are general-purpose.
- Some LLMs are designed for specific user groups, such as children or professionals, while others are more general.

How do we evaluate expertise, compare models, and build reliable multi-agent IR systems in this new world?

The Million LLMs Track introduces a novel challenge: ranking large language models (LLMs) based on their expected ability to answer specific user queries. As organizations deploy ensembles of LLMs, ranging from general-purpose to domain-specific, it becomes crucial to determine which models to consult for a given task. This track focuses on evaluating systems that can effectively identify the most capable LLM(s) for a query, without issuing new queries to the models.

Participants are provided with LLM responses and metadata in advance, and must rank the LLMs for each test query based solely on this information.

2. Task

LLM Ranking Task

Goal:

Given a user query and a set of LLM IDs, your system must rank the LLMs by predicted expertise — that is, how likely each model is to provide a high-quality answer to the query. This task tests your system’s ability to assess and match LLM capabilities to information needs, before seeing any answers.

What you submit:

For each query, submit a ranked list of LLM IDs, ordered from most to least likely to produce a good answer.

3. Data Provided

The dataset is split into three parts:

Discovery Set: Ranking LLM IDs by expertise is not possible if you know nothing about the LLMs behind them. The discovery set provides a set of queries and precomputed responses from all LLMs, along with metadata that can be used to predict the expertise of each LLM. It includes:
- A set of training queries
- Precomputed responses from all LLMs
- Response metadata
Development Set: This dataset is used to develop and evaluate your ranking system, and it imitates the conditions of the final evaluation conducted by NIST. It includes:
- A set of development queries
- A qrel with an expertise label for each query-LLM pair
Test Set: This is a held-out set of queries for final evaluation. It includes:
- A set of held-out queries

Use the discovery data to predict the expertise of each LLM and develop your ranking system, then submit your results on the test set.

Note: No access to raw document collections or LLMs is required. The task is designed to benchmark ranking systems under fixed conditions using provided data.
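
As an illustration of how the discovery data might feed a ranking system, the following is a minimal baseline sketch, not a recommended or official method. It assumes each discovery record can be reduced to a hypothetical triple of query text, LLM ID, and mean token log-probability (the actual file schema is not described here), and it treats the log-probability as a rough confidence proxy, which is itself an assumption. LLMs are scored by a similarity-weighted average over discovery queries, using simple token overlap.

```python
from collections import defaultdict
from typing import List, Tuple

# Hypothetical record layout (an assumption, not the official schema):
# (discovery_query_text, llm_id, mean_token_logprob)
Record = Tuple[str, str, float]


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two query strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rank_llms(query: str, records: List[Record]) -> List[Tuple[str, float]]:
    """Rank LLMs by similarity-weighted mean log-probability."""
    sim_sum = defaultdict(float)
    weighted = defaultdict(float)
    plain_sum = defaultdict(float)
    count = defaultdict(int)
    for text, llm_id, logprob in records:
        sim = jaccard(query, text)
        sim_sum[llm_id] += sim
        weighted[llm_id] += sim * logprob
        plain_sum[llm_id] += logprob
        count[llm_id] += 1
    scores = {}
    for llm_id in count:
        if sim_sum[llm_id] > 0:
            scores[llm_id] = weighted[llm_id] / sim_sum[llm_id]
        else:
            # No similar discovery query: fall back to the unweighted mean.
            scores[llm_id] = plain_sum[llm_id] / count[llm_id]
    # Highest score first; ties broken by LLM ID for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy records for illustration only; not taken from the track data.
records = [
    ("symptoms of seasonal flu", "med_llm", -0.2),
    ("symptoms of seasonal flu", "law_llm", -1.5),
    ("statute of limitations for contracts", "med_llm", -1.8),
    ("statute of limitations for contracts", "law_llm", -0.3),
]
ranking = rank_llms("flu symptoms in children", records)
print(ranking)
```

A stronger system would likely replace the token overlap with learned query embeddings and the log-probability proxy with a quality signal validated against the development qrels.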

Downloadable Files

File Name	Description
discovery_data_1	LLM expertise discovery queries and precomputed LLM responses.
discovery_metadata_1	LLM expertise discovery queries, precomputed LLM responses, along with metadata (the same as discovery_data_1 but with logprobs).
discovery_data_2	LLM expertise discovery queries and precomputed LLM responses.
discovery_metadata_2	LLM expertise discovery queries, precomputed LLM responses, along with metadata (the same as discovery_data_2 but with logprobs).
dev_data	Development queries for which LLMs should be ranked according to expertise; the file contains 342 queries and their IDs.
dev_qrels	Qrels for the development queries. For each query, each LLM is "judged" on whether it possesses the expertise to answer the query.

4. Submission Guidelines

We will be following the classic TREC submission formatting, which is outlined below. White space is used to separate columns. The width of the columns in the format is not important, but it is crucial to have exactly six columns per line with at least one space between the columns.


        1 Q0 llmid1 1 2.73 runid1
        1 Q0 llmid2 2 2.71 runid1
        1 Q0 llmid3 3 2.61 runid1
        1 Q0 llmid4 4 2.05 runid1
        1 Q0 llmid5 5 1.89 runid1

Where:
The first column is the topic (query) number.
The second column is currently unused and should always be "Q0".
The third column is the official identifier of the ranked LLM.
The fourth column is the rank at which the LLM is ranked.
The fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order.
The sixth column is the ID of the run being submitted.
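
The six-column format above can be produced with a few lines of code. The following is a minimal sketch, assuming your system yields a score per LLM for each query; the function name `format_run` and the nested-dictionary input are illustrative choices, not part of any official tooling.

```python
from typing import Dict, List


def format_run(scores: Dict[str, Dict[str, float]], run_id: str) -> List[str]:
    """Return lines of the form: query_id Q0 llm_id rank score run_id."""
    lines = []
    for query_id, llm_scores in scores.items():
        # Sort by descending score; break ties by LLM ID for determinism.
        ranked = sorted(llm_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        for rank, (llm_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {llm_id} {rank} {score:.4f} {run_id}")
    return lines


# Toy scores for illustration only.
example_scores = {"1": {"llmid2": 2.71, "llmid1": 2.73, "llmid3": 2.61}}
run_lines = format_run(example_scores, "runid1")
print("\n".join(run_lines))
```

Sorting before assigning ranks keeps the rank and score columns consistent, so the scores within each query are non-increasing as required.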

Submission Types

The main type of TREC submission is automatic, which means that there is no manual intervention when running the test queries: you should not adjust your runs, rewrite the queries, retrain your model, or make any other manual adjustments after seeing the test queries. Ideally, you should only check the test queries to verify that they ran properly (i.e., no bugs) before submitting your automatic runs.

However, if you want to have a human in the loop for your run or make any manual adjustments to the model or ranking after seeing the test queries, you can mark your run as manual and provide a description of the types of alterations performed.

Further, we expect that only the discovery data is used to assess the expertise of each LLM. Runs that meet this condition are marked as internal. However, if you wish to use external resources (e.g., other LLMs, search engines, or additional data) to verify the expertise of each LLM, then you should mark your run as external.

Runs Allowed

Each team may submit up to 5 official runs.

Submission Deadline

September 21, 2025

Submission Portal

Codalab link TBD

System Description

You must provide a short description of your method, including any training data or models used.

5. Evaluation

Submissions will be evaluated using relevance-based metrics.

Evaluation Metrics

nDCG@10 – Normalized discounted cumulative gain at rank 10, computed over relevance grades
MRR – Mean reciprocal rank of the first relevant LLM

Evaluation will be based on hidden ground-truth labels. Results will be reported in aggregate and per query category.
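
For local development, both metrics can be approximated against dev_qrels. The sketch below is an illustration, not the official scorer: it assumes integer expertise labels per query-LLM pair, linear gain with a log2 rank discount for nDCG (other gain functions exist), and that any label greater than zero counts as relevant for MRR. The official evaluation may differ on each of these points.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    # Linear gain with a log2 discount; the item at rank 1 is undiscounted.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranking: List[str], labels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; unjudged LLMs are treated as label 0."""
    gains = [labels.get(llm_id, 0) for llm_id in ranking[:k]]
    ideal_dcg = dcg(sorted(labels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranking: List[str], labels: Dict[str, int]) -> float:
    """1 / rank of the first LLM with a positive label, or 0 if none."""
    for rank, llm_id in enumerate(ranking, start=1):
        if labels.get(llm_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def mean_over_queries(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Toy labels and ranking for a single query, for illustration only.
labels = {"llm_a": 2, "llm_b": 0, "llm_c": 1}
ranking = ["llm_b", "llm_a", "llm_c"]
print(ndcg_at_k(ranking, labels), reciprocal_rank(ranking, labels))
```

Averaging the per-query values with `mean_over_queries` gives the aggregate nDCG@10 and MRR for a run.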

6. Timeline

Date	Event
July 7, 2025	Expertise discovery and development data released
September 5, 2025	Test queries released
September 21, 2025	Submission deadline
October 2025	Evaluation & results shared
November 2025	TREC conference