Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads
https://ipsj.ixsq.nii.ac.jp/records/2005948
| Name / File | License | Price |
|---|---|---|
| Available for download from November 24, 2027. | Copyright (c) 2025 by the Information Processing Society of Japan | Non-member: ¥660, IPSJ member: ¥330, OS member: ¥0, DLIB member: ¥0 |
| Item type | Symposium(1) |
|---|---|
| Publication date | 2025-11-24 |
| Title (ja) | Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads |
| Title (en) | Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads |
| Language | eng |
| Resource type identifier | http://purl.org/coar/resource_type/c_5794 |
| Resource type | conference paper |
| Author affiliation | IBM Research―Tokyo |
| Author affiliation | Delft University of Technology |
| Author affiliation (en) | IBM Research―Tokyo |
| Author affiliation (en) | Delft University of Technology |
| Author names | Valentijn Dymphnus, Van De Beek; Takeshi, Yoshimura |
| Author names (en) | Valentijn Dymphnus Van De Beek; Takeshi Yoshimura |
| Abstract description type | Other |
| Abstract | After the introduction of the Transformer architecture in 2017, neural networks have seen widespread adoption across industry, academia, and the wider public. One notable aspect of Large Language Models (LLMs) is their ability to develop emergent capabilities on tasks they have not been trained for, such as prompt engineering, video generation, and multistep reasoning. These capabilities can be unlocked by increasing the hardware, training data, or context available to the model, which has led model providers to steadily increase context sizes from 8k tokens in 2023 to 10M tokens in 2025. Alongside this growth, a large body of literature has been published that aims to measure the impact of these increases in context size. In this paper, we analyze 16 of these benchmarks, published between 2023 and 2025, in terms of what they measure, what tasks they perform, and the distribution of various attributes. We found that the papers in question do not consider attributes inherent to the benchmarks, such as the token-size distribution, the variance between tasks, or the variance between prompts within the same task. These attributes nevertheless have a significant impact on benchmark accuracy, which makes it difficult to compare tasks within a benchmark and to trust the accuracy reported for a given task. The amount of variance has increased significantly between generations of models, while the median token size has grown at a slower pace; the claimed increase in context size therefore seems to rely on the addition of outliers rather than on increasing the overall size of tasks. Due to these attributes, the current set of long-context benchmarks is unsuitable for measuring the results of systems research or for evaluating large-scale cloud inference solutions. |
| Abstract (en) description type | Other |
| Abstract (en) | After the introduction of the Transformer architecture in 2017, neural networks have seen widespread adoption across industry, academia, and the wider public. One notable aspect of Large Language Models (LLMs) is their ability to develop emergent capabilities on tasks they have not been trained for, such as prompt engineering, video generation, and multistep reasoning. These capabilities can be unlocked by increasing the hardware, training data, or context available to the model, which has led model providers to steadily increase context sizes from 8k tokens in 2023 to 10M tokens in 2025. Alongside this growth, a large body of literature has been published that aims to measure the impact of these increases in context size. In this paper, we analyze 16 of these benchmarks, published between 2023 and 2025, in terms of what they measure, what tasks they perform, and the distribution of various attributes. We found that the papers in question do not consider attributes inherent to the benchmarks, such as the token-size distribution, the variance between tasks, or the variance between prompts within the same task. These attributes nevertheless have a significant impact on benchmark accuracy, which makes it difficult to compare tasks within a benchmark and to trust the accuracy reported for a given task. The amount of variance has increased significantly between generations of models, while the median token size has grown at a slower pace; the claimed increase in context size therefore seems to rely on the addition of outliers rather than on increasing the overall size of tasks. Due to these attributes, the current set of long-context benchmarks is unsuitable for measuring the results of systems research or for evaluating large-scale cloud inference solutions. |
| Bibliographic information | コンピュータシステム・シンポジウム論文集 (Computer System Symposium Proceedings), Vol. 2025, pp. 59-70, published 2025-11-24 |
| Publisher | 情報処理学会 (Information Processing Society of Japan) |