Item

  1. Symposium
  2. Symposium Series
  3. Computer System Symposium
  4. 2025

Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads

https://ipsj.ixsq.nii.ac.jp/records/2005948
dd7a43e4-354f-45fb-ace6-e27b41ecfc6f
Name / File | License | Action
IPSJ-ComSys2025007.pdf (2.1 MB)
Available for download from November 24, 2027.
Copyright (c) 2025 by the Information Processing Society of Japan
Non-member: ¥660 / IPSJ member: ¥330 / OS member: ¥0 / DLIB member: ¥0
Item type Symposium(1)
Publication date: 2025-11-24
Title (ja): Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads
Title (en): Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads
Language: eng
Resource type
Resource type identifier: http://purl.org/coar/resource_type/c_5794
Resource type: conference paper
Author affiliation: IBM Research―Tokyo
Author affiliation: Delft University of Technology
Author(s): Valentijn Dymphnus Van De Beek; Takeshi Yoshimura
Abstract
Description type: Other
Description: After the introduction of the Transformer architecture in 2017, neural networks have seen widespread adoption across industry, academia, and the wider public. One notable aspect of Large Language Models (LLMs) is their ability to develop emergent capabilities on tasks they have not been trained for, such as prompt engineering, video generation, and multistep reasoning. This can be done by increasing the hardware, training data, or context data available to the model. As a result, model providers have steadily increased context sizes from 8k tokens in 2023 to 10M tokens in 2025. Alongside this growth, a large body of literature has been published aimed at measuring the impact of these increases in context size. In this paper, we analyze 16 of these benchmarks published between 2023 and 2025 in terms of what they measure, what tasks they perform, and the distribution of various attributes. We found that the papers in question do not consider attributes inherent to the benchmarks, such as the token size distribution, the variance between tasks, or the variance among prompts within the same task. Additionally, these attributes have a significant impact on the accuracy of the benchmark. This makes it difficult to compare tasks within a benchmark and to trust the accuracy reported for a given task. The amount of variance has increased significantly between generations of models, while the median token size has increased at a slower pace. The claimed increase in context size therefore seems to rely on the addition of outliers rather than on increasing the overall size of tasks. Due to these attributes, the current set of long-context benchmarks is unsuitable for measuring the results of systems research or for evaluating large-scale cloud inferencing solutions.
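The abstract argues that benchmark reports should surface per-task statistics such as the median token size and the variance among prompts, since medians can grow slowly while outliers inflate the claimed context size. A minimal sketch of that kind of analysis is below; the task names and token counts are invented placeholders, not data from the paper.

```python
# Sketch of per-task prompt-length statistics for a long-context benchmark.
# All benchmark data here is hypothetical illustration, not from the paper.
import statistics

# prompt token counts per task (invented example data)
benchmark = {
    "needle_in_haystack": [8000, 9000, 8500, 120000],  # one large outlier
    "multi_doc_qa":       [16000, 15000, 17000, 16500],
}

for task, tokens in benchmark.items():
    median = statistics.median(tokens)
    cv = statistics.stdev(tokens) / statistics.mean(tokens)  # spread relative to mean
    print(f"{task}: median={median}, max={max(tokens)}, cv={cv:.2f}")
```

In the first (invented) task, the maximum prompt is over 13x the median and the coefficient of variation is high, so quoting only the maximum context length would misrepresent what most prompts in the task actually exercise.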
Bibliographic information: コンピュータシステム・シンポジウム論文集 (Proceedings of the Computer System Symposium)

Volume 2025, pp. 59-70, issue date 2025-11-24
Publisher (ja): 情報処理学会 (Information Processing Society of Japan)
Versions

Ver.1 2025-11-18 08:00:36.023262


Powered by WEKO3
