{"links":{},"id":2001757,"metadata":{"_oai":{"id":"oai:ipsj.ixsq.nii.ac.jp:02001757","sets":["581:11839:11843"]},"path":["11843"],"owner":"80578","recid":"2001757","title":["Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References"],"pubdate":{"attribute_name":"PubDate","attribute_value":"2025-04-15"},"_buckets":{"deposit":"9d8376cf-db2d-4b50-89a5-aff95089da97"},"_deposit":{"id":"2001757","pid":{"type":"depid","value":"2001757","revision_id":0},"owners":[80578],"status":"published","created_by":80578},"item_title":"Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References","author_link":[],"item_titles":{"attribute_name":"タイトル","attribute_value_mlt":[{"subitem_title":"Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References","subitem_title_language":"ja"},{"subitem_title":"Improving Behavior-aware Driving Video Captioning through Better Use of In-vehicle Sensors and References","subitem_title_language":"en"}]},"item_keyword":{"attribute_name":"キーワード","attribute_value_mlt":[{"subitem_subject":"[一般論文] driving video captioning, behavior-aware captioning, multimodal fusion","subitem_subject_scheme":"Other"}]},"item_type_id":"2","publish_date":"2025-04-15","item_2_text_3":{"attribute_name":"著者所属","attribute_value_mlt":[{"subitem_text_value":"Bosch Corporation Headquarters"},{"subitem_text_value":"National Institute of Informatics"},{"subitem_text_value":"Nagoya University"}]},"item_2_text_4":{"attribute_name":"著者所属(英)","attribute_value_mlt":[{"subitem_text_value":"Bosch Corporation Headquarters","subitem_text_language":"en"},{"subitem_text_value":"National Institute of Informatics","subitem_text_language":"en"},{"subitem_text_value":"Nagoya University","subitem_text_language":"en"}]},"item_language":{"attribute_name":"言語","attribute_value_mlt":[{"subitem_language":"eng"}]},"control_number":"2001757","publish_status":"0","weko_shared_id":-1,"item_file_price":{"attribute_name":"Billing file","attribute_type":"file","attribute_value_mlt":[{"url":{"url":"https://ipsj.ixsq.nii.ac.jp/record/2001757/files/IPSJ-JNL6604013.pdf","label":"IPSJ-JNL6604013.pdf"},"date":[{"dateType":"Available","dateValue":"2027-04-15"}],"format":"application/pdf","billing":["billing_file"],"filename":"IPSJ-JNL6604013.pdf","filesize":[{"value":"25.8 MB"}],"mimetype":"application/pdf","priceinfo":[{"tax":["include_tax"],"price":"0","billingrole":"5"},{"tax":["include_tax"],"price":"0","billingrole":"6"},{"tax":["include_tax"],"price":"0","billingrole":"8"},{"tax":["include_tax"],"price":"0","billingrole":"44"}],"accessrole":"open_date","version_id":"952a9363-14e5-4826-813f-00b4ec1f4007","displaytype":"detail","licensetype":"license_note","license_note":"Copyright (c) 2025 by the Information Processing Society of Japan"}]},"item_2_creator_5":{"attribute_name":"著者名","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"Hongkuan,Zhang"}]},{"creatorNames":[{"creatorName":"Koichi,Takeda"}]},{"creatorNames":[{"creatorName":"Ryohei,Sasano"}]}]},"item_2_creator_6":{"attribute_name":"著者名(英)","attribute_type":"creator","attribute_value_mlt":[{"creatorNames":[{"creatorName":"Hongkuan Zhang","creatorNameLang":"en"}]},{"creatorNames":[{"creatorName":"Koichi Takeda","creatorNameLang":"en"}]},{"creatorNames":[{"creatorName":"Ryohei Sasano","creatorNameLang":"en"}]}]},"item_2_source_id_9":{"attribute_name":"書誌レコードID","attribute_value_mlt":[{"subitem_source_identifier":"AN00116647","subitem_source_identifier_type":"NCID"}]},"item_resource_type":{"attribute_name":"資源タイプ","attribute_value_mlt":[{"resourceuri":"http://purl.org/coar/resource_type/c_6501","resourcetype":"journal article"}]},"item_2_publisher_15":{"attribute_name":"公開者","attribute_value_mlt":[{"subitem_publisher":"情報処理学会","subitem_publisher_language":"ja"}]},"item_2_source_id_11":{"attribute_name":"ISSN","attribute_value_mlt":[{"subitem_source_identifier":"1882-7764","subitem_source_identifier_type":"ISSN"}]},"item_2_description_7":{"attribute_name":"論文抄録","attribute_value_mlt":[{"subitem_description":"Driving video captioning aims to automatically generate descriptions for videos from driving recorders. Driving video captions are generally required to describe first-person driving behaviors which implicitly characterize the driving videos but are challenging to anchor to concrete visual evidence. To generate captions with better driving behavior descriptions, existing work has introduced behavior-related in-vehicle sensors into a captioning model for behavior-aware captioning. However, a better method for fusing the sensor modality with visual modalities has not been fully investigated, and the accuracy and informativeness of generated behavior-related descriptions remain unsatisfactory. In this paper, we compare three modality fusion methods by using a Transformer-based video captioning model and propose two training strategies to improve both the accuracy and the informativeness of generated behavior descriptions: 1) joint training the captioning model with multilabel behavior classification by explicitly using annotated behavior tags; and 2) weighted training by assigning weights to reference captions (references) according to the informativeness of behavior descriptions in references. Experiments on a Japanese driving video captioning dataset, City Traffic (CT), show the efficacy and positive interaction of the proposed training strategies. Moreover, larger improvements on out-of-distribution data demonstrate the improved generalization ability.\n------------------------------\nThis is a preprint of an article intended for publication Journal of\nInformation Processing(JIP). This preprint should not be cited. This\narticle should be cited as: Journal of Information Processing Vol.33(2025) (online)\nDOI　http://dx.doi.org/10.2197/ipsjjip.33.284\n------------------------------","subitem_description_type":"Other"}]},"item_2_description_8":{"attribute_name":"論文抄録(英)","attribute_value_mlt":[{"subitem_description":"Driving video captioning aims to automatically generate descriptions for videos from driving recorders. Driving video captions are generally required to describe first-person driving behaviors which implicitly characterize the driving videos but are challenging to anchor to concrete visual evidence. To generate captions with better driving behavior descriptions, existing work has introduced behavior-related in-vehicle sensors into a captioning model for behavior-aware captioning. However, a better method for fusing the sensor modality with visual modalities has not been fully investigated, and the accuracy and informativeness of generated behavior-related descriptions remain unsatisfactory. In this paper, we compare three modality fusion methods by using a Transformer-based video captioning model and propose two training strategies to improve both the accuracy and the informativeness of generated behavior descriptions: 1) joint training the captioning model with multilabel behavior classification by explicitly using annotated behavior tags; and 2) weighted training by assigning weights to reference captions (references) according to the informativeness of behavior descriptions in references. Experiments on a Japanese driving video captioning dataset, City Traffic (CT), show the efficacy and positive interaction of the proposed training strategies. Moreover, larger improvements on out-of-distribution data demonstrate the improved generalization ability.\n------------------------------\nThis is a preprint of an article intended for publication Journal of\nInformation Processing(JIP). This preprint should not be cited. This\narticle should be cited as: Journal of Information Processing Vol.33(2025) (online)\nDOI　http://dx.doi.org/10.2197/ipsjjip.33.284\n------------------------------","subitem_description_type":"Other"}]},"item_2_biblio_info_10":{"attribute_name":"書誌情報","attribute_value_mlt":[{"bibliographic_titles":[{"bibliographic_title":"情報処理学会論文誌"}],"bibliographicIssueDates":{"bibliographicIssueDate":"2025-04-15","bibliographicIssueDateType":"Issued"},"bibliographicIssueNumber":"4","bibliographicVolumeNumber":"66"}]},"relation_version_is_last":true,"weko_creator_id":"80578"},"created":"2025-04-09T00:41:29.964205+00:00","updated":"2025-05-01T05:08:45.830877+00:00"}