Speaking Style Captioning Using Speech Factor Conditioning (音声因子句による条件付けを用いた発話スタイルキャプショニング)

安藤,厚志; 森谷,崇史; 堀口,翔太; 増村,亮

https://ipsj.ixsq.nii.ac.jp/records/2000403

Name / File	License	Action
IPSJ-SLP25155079.pdf (1.0 MB), available for download from February 23, 2027.	Copyright (c) 2025 by the Information Processing Society of Japan
Non-members: ¥660, IPSJ members: ¥330, SLP members: ¥0, DLIB members: ¥0

Item type

SIG Technical Reports(1)

Publication date

2025-02-23

Title

音声因子句による条件付けを用いた発話スタイルキャプショニング

Title (English)

Speaking Style Captioning Using Speech Factor Conditioning

Language

jpn

Keyword

Subject scheme

Other

Subject

Poster presentation

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

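Author affiliation

Nippon Telegraph and Telephone Corporation

Author affiliation

Nippon Telegraph and Telephone Corporation

Author affiliation

Nippon Telegraph and Telephone Corporation

Author affiliation

Nippon Telegraph and Telephone Corporation

Author name

安藤,厚志
森谷,崇史
堀口,翔太
増村,亮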

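Abstract

Description type

Other

Description

This paper proposes a new speaking-style captioning method that generates diverse expressions while accurately recognizing speaking-style information (speaker gender, volume, pitch, ...). Conventional methods train directly on reference captions that contain not only words related to speaking style but also words related to syntax. This makes speaking-style information hard to learn from speech, and the resulting models tend to generate sentences that are grammatically correct but wrong about the speaking style. To solve this problem, the proposed method introduces a speech factor phrase that represents speaking-style information, and trains the model to generate the speech factor phrase first and the caption afterwards, so that speaking-style information is learned explicitly. We also propose a new decoding method that produces diverse captions while keeping speaking-style recognition accurate. Experiments confirmed that, compared with conventional methods, the proposed method recognizes speaking-style information more accurately and generates more diverse captions.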
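As a rough illustration of the "factor phrase first, caption second" training target described in the abstract, the sketch below serializes a few style factors into a phrase and prepends it to the caption. The factor inventory, the `name: value` phrase format, the `<sep>` token, and all function names are assumptions made for this example; the record does not specify the report's actual format.

```python
from typing import Dict, Tuple

# Hypothetical factor inventory and separator token; the report's actual
# phrase format is not given in the abstract.
FACTOR_ORDER = ("gender", "pitch", "volume")
SEP = "<sep>"


def build_factor_phrase(factors: Dict[str, str]) -> str:
    """Serialize style factors in a fixed order, e.g. 'gender: female, ...'."""
    return ", ".join(f"{name}: {factors[name]}" for name in FACTOR_ORDER)


def build_target(factors: Dict[str, str], caption: str) -> str:
    """Training target: factor phrase first, then the free-form caption."""
    return f"{build_factor_phrase(factors)} {SEP} {caption}"


def split_output(text: str) -> Tuple[str, str]:
    """Recover (factor phrase, caption) from a generated sequence."""
    phrase, _, caption = text.partition(f" {SEP} ")
    return phrase, caption


example = build_target(
    {"gender": "female", "pitch": "high", "volume": "low"},
    "A woman speaks quietly in a high-pitched voice.",
)
```

Putting the factor phrase at the front means the style decision is made before any syntactic words are generated, which is one plausible reading of how the explicit learning signal arises.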

Abstract (English)

Description type

Other

Description

This work presents a novel speaking-style captioning method that generates diverse descriptions while accurately including speaking-style information such as gender, pitch, and volume. Conventional methods rely on original captions, which contain not only speaking-style-related terms but also syntactic words, making it difficult to learn speaking-style characteristics from speech and often resulting in incorrect captions. To address this problem, the proposed method introduces factor-conditioned captioning (FCC), which first outputs a factor phrase representing speaking-style information and then generates a caption to ensure the model explicitly learns speaking-style factors. Additionally, we propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that the proposed method generates more diverse captions while improving style prediction performance compared to conventional methods.
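The greedy-then-sampling idea in the abstract can be sketched as a two-phase decoding loop: take the argmax token until the factor phrase ends, then sample from the model distribution for the caption. This is a minimal toy under stated assumptions, not the authors' implementation: the `<sep>` and `<eos>` tokens, the `next_token_probs` interface, and the scripted `toy_model` are all hypothetical stand-ins for a real captioning model.

```python
import random
from typing import Callable, Dict, List, Sequence

SEP = "<sep>"  # hypothetical marker ending the factor phrase
EOS = "<eos>"  # hypothetical end-of-sequence token

NextTokenFn = Callable[[Sequence[str]], Dict[str, float]]


def greedy_then_sampling(next_token_probs: NextTokenFn,
                         rng: random.Random,
                         max_len: int = 50) -> List[str]:
    """Decode factors greedily, then sample the caption tokens."""
    tokens: List[str] = []
    in_factor_phase = True
    while len(tokens) < max_len:
        probs = next_token_probs(tokens)
        if in_factor_phase:
            # Deterministic: pick the most probable token.
            token = max(probs, key=lambda t: probs[t])
        else:
            # Stochastic: sample in proportion to the model probabilities.
            candidates = list(probs)
            weights = [probs[t] for t in candidates]
            token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == EOS:
            break
        tokens.append(token)
        if token == SEP:
            in_factor_phase = False
    return tokens


# Toy stand-in for a captioning model (assumption, for illustration only).
FACTOR_SCRIPT = ["female", "high", SEP]
CAPTION_CHOICES = [("a", "the"), ("woman", "lady"), ("speaks", "talks")]


def toy_model(prefix: Sequence[str]) -> Dict[str, float]:
    if SEP not in prefix:
        best = FACTOR_SCRIPT[len(prefix)]
        return {best: 0.6, "male": 0.4}  # "male" is a distractor token
    n_caption = len(prefix) - prefix.index(SEP) - 1
    if n_caption < len(CAPTION_CHOICES):
        first, second = CAPTION_CHOICES[n_caption]
        return {first: 0.5, second: 0.5}
    return {EOS: 1.0}


outputs = [greedy_then_sampling(toy_model, random.Random(seed))
           for seed in range(20)]
```

In this toy, every run shares the same factor prefix because that phase never samples, while the caption tokens after `<sep>` vary with the random seed.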

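Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10442647

Bibliographic information

研究報告音声言語情報処理（SLP） (IPSJ SIG Technical Reports, Spoken Language Processing)

Vol. 2025-SLP-155, No. 79, pp. 1-7, issued 2025-02-23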

ISSN

Source identifier type

ISSN

Source identifier

2188-8663

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

Information Processing Society of Japan (情報処理学会)

Versions

Ver.1

2025-02-18 06:12:29.493978
