Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References

https://ipsj.ixsq.nii.ac.jp/records/2001757
File: IPSJ-JNL6604013.pdf (25.8 MB)
Available for download from April 15, 2027.
License: Copyright (c) 2025 by the Information Processing Society of Japan
Access fees: Non-members: ¥0; IPSJ members: ¥0; Journal subscribers: ¥0; DLIB subscribers: ¥0

Item type: Journal(1)
Release date: 2025-04-15
Title (ja / en): Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References
Language: eng
Keywords (scheme: Other): [Regular Paper] driving video captioning, behavior-aware captioning, multimodal fusion
Resource type: journal article (http://purl.org/coar/resource_type/c_6501)
Author affiliations (ja / en):
  • Bosch Corporation Headquarters
  • National Institute of Informatics
  • Nagoya University
Author names (ja / en):
  • Hongkuan Zhang
  • Koichi Takeda
  • Ryohei Sasano

Abstract (ja / en)
Description type: Other
Description: Driving video captioning aims to automatically generate descriptions for videos from driving recorders. Driving video captions are generally required to describe first-person driving behaviors, which implicitly characterize the driving videos but are challenging to anchor to concrete visual evidence. To generate captions with better driving behavior descriptions, existing work has introduced behavior-related in-vehicle sensors into a captioning model for behavior-aware captioning. However, better methods for fusing the sensor modality with the visual modalities have not been fully investigated, and the accuracy and informativeness of generated behavior-related descriptions remain unsatisfactory. In this paper, we compare three modality fusion methods using a Transformer-based video captioning model and propose two training strategies to improve both the accuracy and the informativeness of generated behavior descriptions: 1) jointly training the captioning model with multilabel behavior classification, explicitly using annotated behavior tags; and 2) weighted training, assigning weights to reference captions (references) according to the informativeness of the behavior descriptions they contain. Experiments on a Japanese driving video captioning dataset, City Traffic (CT), show the efficacy and positive interaction of the proposed training strategies. Moreover, larger improvements on out-of-distribution data demonstrate improved generalization ability.
------------------------------
This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing, Vol.33 (2025) (online).
DOI: http://dx.doi.org/10.2197/ipsjjip.33.284
------------------------------
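
The two training strategies named in the abstract amount to a single combined objective: a captioning loss weighted per reference caption, plus a multilabel behavior-classification loss. The following is a minimal PyTorch sketch of that objective, not the authors' implementation; the model interface and batch fields (model.fuse, model.decode, model.classify_behaviors, ref_weight, behavior_tags, lambda_cls) are all placeholder assumptions, and since the abstract does not specify how the reference weights are computed, they are taken as given per example.

    import torch.nn.functional as F

    def training_step(model, batch, lambda_cls=1.0, pad_id=0):
        # Fuse video features with in-vehicle sensor features; one of the
        # three fusion methods compared in the paper would sit behind
        # `model.fuse` (hypothetical interface).
        fused = model.fuse(batch["video_feats"], batch["sensor_feats"])

        # (1) Weighted training: token-level cross-entropy per reference,
        # scaled by a precomputed informativeness weight for that reference.
        logits = model.decode(fused, batch["caption_in"])           # (B, T, V)
        tok_ce = F.cross_entropy(
            logits.transpose(1, 2),                                 # (B, V, T)
            batch["caption_out"],                                   # (B, T)
            ignore_index=pad_id,
            reduction="none",
        )                                                           # (B, T)
        mask = (batch["caption_out"] != pad_id).float()
        ref_ce = (tok_ce * mask).sum(1) / mask.sum(1).clamp(min=1)  # (B,)
        caption_loss = (batch["ref_weight"] * ref_ce).mean()

        # (2) Joint training: one-vs-all BCE against the annotated
        # multilabel behavior tags.
        tag_logits = model.classify_behaviors(fused)                # (B, n_tags)
        cls_loss = F.binary_cross_entropy_with_logits(
            tag_logits, batch["behavior_tags"].float()
        )

        # lambda_cls trades off captioning against behavior classification.
        return caption_loss + lambda_cls * cls_loss
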
Bibliographic record ID (NCID): AN00116647
Bibliographic information: 情報処理学会論文誌 (IPSJ Journal), Vol. 66, No. 4, issued 2025-04-15
ISSN: 1882-7764
Publisher (ja): 情報処理学会 (Information Processing Society of Japan)
Versions
Ver.1 2025-04-09 00:41:32
