Benchmarking the Discovery of Expert LLMs
The way we search for and access information is rapidly evolving. Instead of retrieving documents and snippets, users now interact with AI front-ends—Large Language Models (LLMs) that deliver direct, synthesized answers. In this emerging landscape, the traditional search engine is replaced by an ecosystem of LLMs, each with specialized knowledge, unique capabilities, and distinct access models.
How do we evaluate expertise, compare models, and build reliable multi-agent IR systems in this new world?
The Million LLMs Track introduces a novel challenge: ranking large language models (LLMs) based on their expected ability to answer specific user queries. As organizations deploy ensembles of LLMs, ranging from general-purpose models to domain-specific specialists, it becomes crucial to determine which models to consult for a given task. This track focuses on evaluating systems that can effectively identify the most capable LLM(s) for a query, without issuing new queries to the models.
Participants are provided with LLM responses and metadata in advance, and must rank the LLMs for each test query based solely on this information.
Given a user query and a set of LLM IDs, your system must rank the LLMs by predicted expertise — that is, how likely each model is to provide a high-quality answer to the query. This task tests your system’s ability to assess and match LLM capabilities to information needs, before seeing any answers.
What you submit: for each query, submit a ranked list of LLM IDs, ordered from most to least likely to produce a good answer.
The dataset is split into two parts: discovery data and test queries.
Use the discovery data to predict the expertise of each LLM and develop your ranking system, then submit your results on the test set.
Note: No access to raw document collections or LLMs is required. The task is designed to benchmark ranking systems under fixed conditions using provided data.
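As a purely illustrative starting point (not the official method), one could score each LLM for a test query by its lexical similarity to the discovery queries that LLM has answered. The record fields `llm_id` and `query` below are assumptions about the released schema.

```python
# Minimal baseline sketch: rank LLMs by the best lexical match between the
# test query and the discovery queries each LLM was observed answering.
from collections import Counter, defaultdict
import math


def _term_counts(text):
    """Lower-cased whitespace tokenization into a term-frequency vector."""
    return Counter(text.lower().split())


def _cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0


def rank_llms(test_query, discovery_records):
    """Return (llm_id, score) pairs sorted from most to least promising.

    `discovery_records` is an iterable of dicts with (assumed) keys
    "llm_id" and "query".
    """
    q_vec = _term_counts(test_query)
    scores = defaultdict(float)
    for rec in discovery_records:
        # Keep the best match between the test query and any discovery
        # query answered by this LLM.
        sim = _cosine(q_vec, _term_counts(rec["query"]))
        scores[rec["llm_id"]] = max(scores[rec["llm_id"]], sim)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Stronger systems would also exploit the precomputed responses and logprob metadata; this sketch only shows the expected input/output shape of a ranking function.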
File Name | Description |
---|---|
discovery_data_1 | LLM expertise discovery queries and precomputed LLM responses. |
discovery_data_1 (with logprobs) | LLM expertise discovery queries, precomputed LLM responses, along with metadata (the same as discovery_data_1 but with logprobs). |
discovery_data_2 | LLM expertise discovery queries and precomputed LLM responses. |
discovery_data_2 (with logprobs) | LLM expertise discovery queries, precomputed LLM responses, along with metadata (the same as discovery_data_2 but with logprobs). |
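The released files define their own exact names and schema. As a loose sketch, assuming the data ships as JSON Lines with fields such as `query_id`, `query`, `llm_id`, `response`, and (for the metadata variants) `logprobs`, loading a discovery file might look like this:

```python
import json


def load_discovery(path):
    """Read one discovery file (assumed JSON Lines) into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records
```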
We will be following the classic TREC submission format, which is outlined below. Whitespace is used to separate columns. The width of the columns is not important, but each line must contain exactly six columns with at least one space between them.
1 Q0 llmid1 1 2.73 runid1
1 Q0 llmid2 2 2.71 runid1
1 Q0 llmid3 3 2.61 runid1
1 Q0 llmid4 4 2.05 runid1
1 Q0 llmid5 5 1.89 runid1
Where:
The first column is the topic (query) number.
The second column is currently unused and should always be "Q0".
The third column is the official identifier of the ranked LLM.
The fourth column is the rank assigned to the LLM.
The fifth column shows the score (integer or floating point) that generated the ranking. This score must be in descending (non-increasing) order.
The sixth column is the ID of the run being submitted.
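For example, a small helper along the following lines (the names `write_run` and `runid1` are placeholders, not part of the track's tooling) can produce a correctly formatted run file and keep ranks consistent with scores:

```python
def write_run(rankings, path, run_id="runid1"):
    """Write a six-column TREC run file.

    `rankings` maps each query (topic) ID to a list of (llm_id, score) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        for qid, scored in rankings.items():
            # Sort by score so ranks follow the required non-increasing order.
            ordered = sorted(scored, key=lambda item: item[1], reverse=True)
            for rank, (llm_id, score) in enumerate(ordered, start=1):
                f.write(f"{qid} Q0 {llm_id} {rank} {score:.4f} {run_id}\n")


# Example: one query with two candidate LLMs.
write_run({"1": [("llmid2", 2.71), ("llmid1", 2.73)]}, "my_run.txt")
```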
The main type of TREC submission is automatic, meaning there is no manual intervention when running the test queries: you should not adjust your runs, rewrite queries, retrain your model, or make any other manual changes after seeing the test queries. Ideally, you should only check the test queries to verify that your system ran properly (i.e., no bugs) before submitting your automatic runs.
However, if you want to have a human in the loop for your run or make any manual adjustments to the model or ranking after seeing the test queries, you can mark your run as manual and provide a description of the types of alterations performed.
Further, we expect that only the discovery data is used to assess the expertise of each LLM; such submissions are marked as internal. If you use external tools (e.g., LLMs, search engines, or additional data) to assess the expertise of each LLM, you should mark your run as external.
Each team may submit up to 5 official runs.
You must provide a short description of your method, including any training data or models used.
Submissions will be evaluated using relevance-based metrics.
Evaluation will be based on hidden ground-truth labels. Results will be reported in aggregate and per query category.
Date | Event |
---|---|
July 7, 2025 | Expertise discovery and development data released |
September 2025 | Test queries released |
September 2025 | Submission deadline |
October 2025 | Evaluation & results shared |
November 2025 | TREC conference |
We welcome participation from researchers and practitioners across academia and industry.
Organizers:

Evangelos Kanoulas, University of Amsterdam, The Netherlands
Panagiotis Eustratiadis, University of Amsterdam, The Netherlands
Mark Sanderson, RMIT University, Australia
Jamie Callan, Carnegie Mellon University, USA
Yongkang Li, University of Amsterdam, The Netherlands
Jingfen Qiao, University of Amsterdam, The Netherlands
Vaishali Pal, University of Amsterdam, The Netherlands