Item type |
SIG Technical Reports(1) |
Publication date |
2020-10-29 |
Title |
How Far Can We Go with Scene Descriptions for Visual Question Answering? |
Language |
eng |
Keyword |
Subject scheme |
Other |
Subject |
Session 4 |
Resource type |
Resource type identifier |
http://purl.org/coar/resource_type/c_18gh |
Resource type |
technical report |
Author affiliation |
Osaka University |
Author affiliation |
Osaka University |
Author affiliation |
CyberAgent, Inc. |
Author affiliation |
Kyoto University |
Author affiliation |
Osaka University |
Author affiliation |
Osaka University |
Author affiliation |
Osaka University |
Author name |
Yusuke Hirota
Noa Garcia
Mayu Otani
Chenhui Chu
Yuta Nakashima
Ittetsu Taniguchi
Takao Onoye
|
Abstract |
Description type |
Other |
Description |
Visual question answering (VQA) is the task of answering questions about an image's visual content. To represent images, bounding box-based visual representations have been widely used as the de facto standard. In contrast, recent progress in Transformer language models has made it possible to simultaneously represent inter-relationships between two consecutive sentences as well as intra-relationships between the individual words in a sentence. The outstanding performance of such language models on multiple language-based tasks inspired us to consider textual representations of images for VQA. Thus, instead of using visual features directly extracted from images, we propose to generate scene descriptions using state-of-the-art recognition models. Results on VQA-CP v2 show that our proposed textual descriptions have the potential to be a faithful representation for VQA. Even so, our experiments reveal that there is still room for improvement in our generated scene descriptions. |
Bibliographic record ID |
Source identifier type |
NCID |
Source identifier |
AN10100541 |
Bibliographic information |
IPSJ SIG Technical Report: Computer Graphics and Visual Informatics (CG)
Vol. 2020-CG-180,
No. 11,
pp. 1-7,
Issue date: 2020-10-29
|
ISSN |
Source identifier type |
ISSN |
Source identifier |
2188-8949 |
Notice |
SIG Technical Reports are non-refereed and hence may later appear in any journals, conferences, symposia, etc. |
Publisher |
Information Processing Society of Japan (情報処理学会) |