1. Introduction
The way we search for and access information is rapidly evolving. Instead of retrieving documents and snippets, users now interact with AI front-ends—Large Language Models (LLMs) that deliver direct, synthesized answers. In this emerging landscape, the traditional search engine is replaced by an ecosystem of LLMs, each with specialized knowledge, unique capabilities, and distinct access models:
- Some LLMs excel at in-depth answers; others provide concise, factual responses.
- Some are trained on general knowledge, while others focus on specific domains like law, medicine, or technology.
- Some LLMs are multilingual, while others specialize in a single language.
- Some LLMs are designed for specific tasks like summarization or translation, while others are general-purpose.
- Some LLMs are designed for specific user groups, such as children or professionals, while others are more general.
How do we evaluate expertise, compare models, and build reliable multi-agent IR systems in this new world?
The Million LLMs Track introduces a novel challenge: ranking large language models (LLMs) by their expected ability to answer specific user queries. As organizations deploy ensembles of LLMs, ranging from general-purpose to domain-specific, it becomes crucial to determine which models to consult for a given task. This track evaluates systems that identify the most capable LLM(s) for a query without issuing new queries to the models.
Participants are provided with LLM responses and metadata in advance, and must rank the LLMs for each test query based solely on this information.
2. Task
LLM Ranking Task
Goal: Given a user query and a set of LLM IDs, your system must rank the LLMs by predicted expertise, that is, how likely each model is to provide a high-quality answer to the query. This task tests your system’s ability to assess and match LLM capabilities to information needs before seeing any answers.
What you submit: For each query, submit a ranked list of LLM IDs, ordered from most to least likely to produce a good answer.
3. Data Provided
The dataset is split into three parts:
- Discovery Set: Ranking LLM IDs by expertise is not possible if you know nothing about the LLMs. The discovery set provides a set of queries and precomputed responses from all LLMs, along with metadata that can be used to predict the expertise of each LLM. It includes:
  - A set of training queries
  - Precomputed responses from all LLMs
  - Response metadata
- Development Set: This dataset is used to develop and evaluate your ranking system, and imitates the conditions of the final evaluation by NIST. It includes:
  - A set of development queries
  - A qrel with an expertise label for each query-LLM pair
- Test Set: This is a held-out set of queries for final evaluation. It includes:
  - A set of held-out queries
Use the discovery data to predict the expertise of each LLM and develop your ranking system, then submit your results on the test set.
Note: No access to raw document collections or LLMs is required. The task is designed to benchmark ranking systems under fixed conditions using provided data.
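As a rough illustration of how the discovery data could be turned into an expertise predictor, the sketch below builds one bag-of-words profile per LLM from its precomputed responses and ranks LLMs by cosine similarity to a new query. This is a hypothetical baseline, not a method prescribed by the track. The record layout `(llm_id, response_text)` and all function names are assumptions made for the example; the actual file format may differ.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_profiles(discovery_records):
    """Aggregate each LLM's discovery responses into a term-count profile.

    ASSUMPTION: `discovery_records` is an iterable of
    (llm_id, response_text) pairs; this layout is hypothetical.
    """
    profiles = defaultdict(Counter)
    for llm_id, response_text in discovery_records:
        profiles[llm_id].update(tokenize(response_text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_llms(query_text, profiles):
    """Return (llm_id, score) pairs, highest score first; ties break on llm_id."""
    query_vec = Counter(tokenize(query_text))
    scored = [(llm_id, cosine(query_vec, prof)) for llm_id, prof in profiles.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

A profile built this way only captures vocabulary overlap; it says nothing about answer quality, so the development qrels would still be needed to check whether the predicted ordering matches judged expertise.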
Downloadable Files
| File Name | Description |
|---|---|
| | LLM expertise discovery queries and precomputed LLM responses. |
| | LLM expertise discovery queries and precomputed LLM responses, along with metadata (the same as discovery_data_1 but with logprobs). |
| | LLM expertise discovery queries and precomputed LLM responses. |
| | LLM expertise discovery queries and precomputed LLM responses, along with metadata (the same as discovery_data_2 but with logprobs). |
| | Development queries for which LLMs should be ranked according to expertise; the file contains 342 queries and their IDs. |
| | Qrels for the development queries. For each query, LLMs are judged on whether they possess the expertise to answer it. |
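The qrels file can be loaded into a nested dictionary for use during development. The sketch below assumes the classic four-column TREC qrels layout (query ID, an unused column, LLM ID, label); the track page does not state the exact format, so this layout and the function name `load_qrels` are assumptions to verify against the downloaded file.

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse qrels lines into {query_id: {llm_id: label}}.

    ASSUMPTION: each non-empty line has four whitespace-separated
    columns (query_id, unused, llm_id, integer label), as in classic
    TREC qrels. Adjust if the released file differs.
    """
    qrels = defaultdict(dict)
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(parts)}")
        query_id, _unused, llm_id, label = parts
        qrels[query_id][llm_id] = int(label)
    return dict(qrels)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `load_qrels(open(path, encoding="utf-8"))`, where `path` points at the downloaded qrels file.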
4. Submission Guidelines
Submissions follow the classic TREC run format, outlined below. Columns are separated by white space. The width of the columns is not important, but each line must have exactly six columns with at least one space between them.
1 Q0 llmid1 1 2.73 runid1
1 Q0 llmid2 2 2.71 runid1
1 Q0 llmid3 3 2.61 runid1
1 Q0 llmid4 4 2.05 runid1
1 Q0 llmid5 5 1.89 runid1
Where:
The first column is the topic (query) number.
The second column is currently unused and should always be "Q0".
The third column is the official identifier of the ranked LLM.
The fourth column is the rank at which the LLM is ranked.
The fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order.
The sixth column is the ID of the run being submitted.
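The six-column layout described above can be produced and sanity-checked with a few lines of code. The sketch below is one possible helper, not an official tool: the function names `format_run` and `validate_run` and the input shape `{query_id: {llm_id: score}}` are assumptions made for the example.

```python
def format_run(scores_by_query, run_id):
    """Return run lines for {query_id: {llm_id: score}}, best score first.

    Ties are broken by llm_id so that the output is deterministic.
    """
    lines = []
    for query_id, scores in scores_by_query.items():
        ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
        for rank, (llm_id, score) in enumerate(ranked, start=1):
            lines.append(f"{query_id} Q0 {llm_id} {rank} {score} {run_id}")
    return lines


def validate_run(lines):
    """Check six columns, the "Q0" literal, and per-query ordering.

    Within each query, ranks must increase and scores must not increase.
    Raises ValueError on the first problem; returns True otherwise.
    """
    last_seen = {}  # query_id -> (rank, score) of the previous line
    for line_no, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            raise ValueError(f"line {line_no}: expected 6 columns, got {len(cols)}")
        query_id, q0, _llm_id, rank, score, _run_id = cols
        if q0 != "Q0":
            raise ValueError(f"line {line_no}: second column must be Q0")
        rank, score = int(rank), float(score)
        prev_rank, prev_score = last_seen.get(query_id, (0, float("inf")))
        if rank <= prev_rank:
            raise ValueError(f"line {line_no}: rank does not increase")
        if score > prev_score:
            raise ValueError(f"line {line_no}: score increases")
        last_seen[query_id] = (rank, score)
    return True
```

For example, `format_run({"1": {"llmid2": 2.71, "llmid1": 2.73}}, "runid1")` yields the first two lines of the sample above, and `validate_run` can be run on the file contents before uploading.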
Submission Types
The main type of TREC submission is automatic, which means that there is no manual intervention when running the test queries. That is, you should not adjust your runs, rewrite the queries, retrain your model, or make any other manual adjustments after seeing the test queries. Ideally, you should only check the test queries to verify that they ran properly (i.e., no bugs) before submitting your automatic runs.
However, if you want to have a human in the loop for your run or make any manual adjustments to the model or ranking after seeing the test queries, you can mark your run as manual and provide a description of the types of alterations performed.
Further, we expect that only the discovery data is used to assess the expertise of each LLM; such submissions are marked as internal. If you use external resources (e.g., other LLMs, search engines, or additional data) to verify the expertise of each LLM, you should mark your run as external.
Runs Allowed
Each team may submit up to 5 official runs.
Submission Deadline
September 21, 2025
Submission Portal
System Description
You must provide a short description of your method, including any training data or models used.
5. Evaluation
Submissions will be evaluated using relevance-based metrics.
Evaluation Metrics
- nDCG@10 – Normalized discounted cumulative gain over relevance grades, computed on the top 10 ranked LLMs
- MRR – Mean reciprocal rank
Evaluation will be based on hidden ground-truth labels. Results will be reported in aggregate and per query category.
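For local experiments on the development set, the two metrics can be approximated as follows. This is a minimal sketch under stated assumptions: it uses linear gain with a log2 rank discount for nDCG and treats any label greater than zero as relevant for MRR. The track page does not specify the gain function or relevance threshold, so official scores may differ; the function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 discount (rank 1 is undiscounted)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, labels, k=10):
    """nDCG@k for one query.

    `ranked_ids` is the submitted order of LLM IDs and `labels` maps
    llm_id -> relevance grade. Unjudged LLMs count as grade 0.
    ASSUMPTION: linear gain (the grade itself), not 2**grade - 1.
    """
    gains = [labels.get(llm_id, 0) for llm_id in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids, labels):
    """1 / rank of the first LLM with a label above zero, or 0.0 if none."""
    for rank, llm_id in enumerate(ranked_ids, start=1):
        if labels.get(llm_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def mean_over_queries(per_query_values):
    """Average a list of per-query metric values (0.0 for an empty list)."""
    return sum(per_query_values) / len(per_query_values) if per_query_values else 0.0
```

Averaging `reciprocal_rank` over all development queries with `mean_over_queries` gives MRR; averaging `ndcg_at_k` gives mean nDCG@10.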
6. Timeline
| Date | Event |
|---|---|
| July 7, 2025 | Expertise discovery and development data released |
| September 5, 2025 | Test queries released |
| September 21, 2025 | Submission deadline |
| October 2025 | Evaluation & results shared |
| November 2025 | TREC conference |
7. Participation & Contact
We welcome participation from:
- Academic researchers
- Industry teams
- Independent developers
- Open-source contributors
Organizers
- Evangelos Kanoulas, University of Amsterdam, The Netherlands
- Panagiotis Eustratiadis, University of Amsterdam, The Netherlands
- Mark Sanderson, RMIT University, Australia
- Jamie Callan, Carnegie Mellon University, USA
Co-Organizers
- Yongkang Li, University of Amsterdam, The Netherlands
- Jingfen Qiao, University of Amsterdam, The Netherlands
- Vaishali Pal, University of Amsterdam, The Netherlands