Spatial Hierarchical Attention Network Based Video-guided Machine Translation
https://ipsj.ixsq.nii.ac.jp/records/225940
Name / File | License | Price
---|---|---
Available for download from May 15, 2025 | Copyright (c) 2023 by the Information Processing Society of Japan | Non-member: ¥0, IPSJ member: ¥0, Journal subscriber: ¥0, DLIB member: ¥0
Item type | Journal(1)
---|---
Publication date | 2023-05-15
Title (en) | Spatial Hierarchical Attention Network Based Video-guided Machine Translation
Language | eng
Keywords (scheme: Other) | [General Paper] multimodal machine translation, video-guided machine translation, hierarchical attention network, spatial features
Resource type | journal article (http://purl.org/coar/resource_type/c_6501)
Author affiliation (ja/en) | Graduate School of Informatics, Kyoto University (all four authors)
Authors (ja/en) | Weiqi Gu, Haiyue Song, Chenhui Chu, Sadao Kurohashi
Abstract (description type: Other) | Video-guided machine translation, as one type of multimodal machine translation, aims to use video content as auxiliary information to address the word-sense ambiguity problem in machine translation. Previous studies only use features from pre-trained action detection models as motion representations of the video to resolve verb-sense ambiguity, and neglect the noun-sense ambiguity problem. To address this, we propose a video-guided machine translation system using both spatial and motion representations. For the spatial part, we propose a hierarchical attention network to model the spatial information from object-level to video-level. We investigate and discuss spatial features extracted from objects with pre-trained convolutional neural network models, and spatial concept features extracted from object labels and attributes with pre-trained language models. We further investigate spatial feature filtering by referring to the corresponding source sentences. Experiments on the VATEX dataset show that our system achieves a 35.86 BLEU-4 score, which is 0.51 points higher than the single model of the SOTA method. Experiments on the How2 dataset further verify the generalization ability of our proposed system.

Note: This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing Vol.31 (2023) (online). DOI: http://dx.doi.org/10.2197/ipsjjip.31.299
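The abstract describes modeling spatial information "from object-level to video-level" with a hierarchical attention network. A minimal sketch of that idea follows, using scaled dot-product attention in NumPy, with the query standing in for a source-sentence representation; the function names, dimensions, and pooling scheme here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    # Scaled dot-product attention pooling.
    # query: (d,), keys: (n, d) -> pooled vector (d,), weights (n,)
    scores = keys @ query / np.sqrt(keys.shape[-1])
    weights = softmax(scores)
    return weights @ keys, weights

def hierarchical_spatial_attention(query, object_feats):
    # object_feats: (frames, objects, d) spatial features per frame.
    # Level 1: attend over objects within each frame -> one vector per frame.
    frame_vecs = np.stack([attend(query, objs)[0] for objs in object_feats])
    # Level 2: attend over frame vectors -> one video-level vector.
    video_vec, frame_weights = attend(query, frame_vecs)
    return video_vec, frame_weights
```

The video-level vector would then be fused with the motion representation inside the translation decoder; the two-level structure lets the source-sentence query first pick out relevant objects in each frame, then relevant frames across the video.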
Bibliographic record ID | NCID: AN00116647
Bibliographic information | IPSJ Journal (情報処理学会論文誌), Vol. 64, No. 5, publication date 2023-05-15
ISSN | 1882-7764
Publisher | Information Processing Society of Japan (情報処理学会)