Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads
https://ipsj.ixsq.nii.ac.jp/records/2005948
| Name / File | License | Price |
|---|---|---|
| Available for download from November 24, 2027. | Copyright (c) 2025 by the Information Processing Society of Japan | Non-member: ¥660, IPSJ member: ¥330, OS member: ¥0, DLIB member: ¥0 |
| Item type | Symposium(1) |
|---|---|
| Publication date | 2025-11-24 |
| Title (ja) | Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads |
| Title (en) | Lies, Damned Lies and Benchmarks: An Exploration of LLM Inference Benchmarks for Long Context Workloads |
| Language | eng |
| Resource type identifier | http://purl.org/coar/resource_type/c_5794 |
| Resource type | conference paper |
| Author affiliation | IBM Research―Tokyo |
| Author affiliation | Delft University of Technology |
| Author affiliation (en) | IBM Research―Tokyo |
| Author affiliation (en) | Delft University of Technology |
| Author names | Valentijn Dymphnus, Van De Beek; Takeshi, Yoshimura |
| Author names (en) | Valentijn Dymphnus Van De Beek; Takeshi Yoshimura |
| Abstract description type | Other |
| Abstract | After the introduction of the Transformer architecture in 2017, neural networks have seen widespread adoption across industry, academia, and the wider public. One notable aspect of Large Language Models (LLMs) is their ability to develop emergent capabilities on tasks they have not been trained for, such as prompt engineering, video generation, and multistep reasoning. These capabilities can be unlocked by increasing the hardware, training data, or context available to the model, which has led model providers to steadily increase context sizes from 8k tokens in 2023 to 10M tokens in 2025. Alongside this growth, a large body of literature has been published that aims to measure the impact of these increases in context size. In this paper, we analyze 16 of these benchmarks, published between 2023 and 2025, in terms of what they measure, what tasks they perform, and the distribution of various attributes. We found that the papers in question do not consider attributes inherent to the benchmarks, such as the token-size distribution, the variance between tasks, or the variance between prompts within the same task. These attributes nevertheless have a significant impact on benchmark accuracy, which makes it difficult to compare tasks within a benchmark and to trust the accuracy reported for a given task. The amount of variance has increased significantly between generations of models, while the median token size has grown at a slower pace; the claimed increase in context size therefore seems to rely on the addition of outliers rather than on increasing the overall size of tasks. Due to these attributes, the current set of long-context benchmarks is unsuitable for measuring the results of systems research or for evaluating large-scale cloud inference solutions. |
| Abstract (en) description type | Other |
| Abstract (en) | After the introduction of the Transformer architecture in 2017, neural networks have seen widespread adoption across industry, academia, and the wider public. One notable aspect of Large Language Models (LLMs) is their ability to develop emergent capabilities on tasks they have not been trained for, such as prompt engineering, video generation, and multistep reasoning. These capabilities can be unlocked by increasing the hardware, training data, or context available to the model, which has led model providers to steadily increase context sizes from 8k tokens in 2023 to 10M tokens in 2025. Alongside this growth, a large body of literature has been published that aims to measure the impact of these increases in context size. In this paper, we analyze 16 of these benchmarks, published between 2023 and 2025, in terms of what they measure, what tasks they perform, and the distribution of various attributes. We found that the papers in question do not consider attributes inherent to the benchmarks, such as the token-size distribution, the variance between tasks, or the variance between prompts within the same task. These attributes nevertheless have a significant impact on benchmark accuracy, which makes it difficult to compare tasks within a benchmark and to trust the accuracy reported for a given task. The amount of variance has increased significantly between generations of models, while the median token size has grown at a slower pace; the claimed increase in context size therefore seems to rely on the addition of outliers rather than on increasing the overall size of tasks. Due to these attributes, the current set of long-context benchmarks is unsuitable for measuring the results of systems research or for evaluating large-scale cloud inference solutions. |
| Bibliographic information | コンピュータシステム・シンポジウム論文集 (Computer System Symposium Proceedings), Vol. 2025, pp. 59-70, published 2025-11-24 |
| Publisher | 情報処理学会 (Information Processing Society of Japan) |