| Item type |
SIG Technical Reports(1) |
| Publication date |
2023-10-07 |
| Title |
|
|
Title |
言語表現による喉頭摘出者のための音声強調システム |
| Title |
|
|
Language |
en |
|
Title |
Electrolaryngeal Speech Enhancement Through Strong Linguistic Encoding Methods |
| Language |
|
|
Language |
eng |
| Keywords |
|
|
Subject scheme |
Other |
|
Subject |
Speech |
| Resource type |
|
|
Resource type identifier |
http://purl.org/coar/resource_type/c_18gh |
|
Resource type |
technical report |
| Author affiliation |
|
|
|
名古屋大学情報学研究科知能システム学専攻 |
| Author affiliation |
|
|
|
名古屋大学情報学研究科知能システム学専攻 |
| Author affiliation |
|
|
|
名古屋大学情報学研究科知能システム学専攻 |
| Author affiliation |
|
|
|
名古屋大学情報学研究科知能システム学専攻/株式会社TARVO |
| Author affiliation |
|
|
|
名古屋大学情報基盤センター |
| Author affiliation (English) |
|
|
|
en |
|
|
Graduate School of Informatics, Nagoya University |
| Author affiliation (English) |
|
|
|
en |
|
|
Graduate School of Informatics, Nagoya University |
| Author affiliation (English) |
|
|
|
en |
|
|
Graduate School of Informatics, Nagoya University |
| Author affiliation (English) |
|
|
|
en |
|
|
Graduate School of Informatics, Nagoya University / TARVO, Inc. |
| Author affiliation (English) |
|
|
|
en |
|
|
Information Technology Center, Nagoya University |
| Author names |
Violeta, Lester Phillip
Huang, Wen-Chin
Ma, Ding
山本, 龍一
小林, 和弘
戸田, 智基
|
| Author names (English) |
Violeta, Lester Phillip
Huang, Wen-Chin
Ma, Ding
Yamamoto, Ryuichi
Kobayashi, Kazuhiro
Toda, Tomoki
|
| Abstract |
|
|
Description type |
Other |
|
Description |
Although pretraining and fine-tuning approaches have proven effective for speech intelligibility enhancement, mismatches between the datasets used in each stage, such as a speech type mismatch or a speaker mismatch, can degrade the conversion performance of this framework. We propose a linguistic encoder robust enough to project both electrolaryngeal (EL) and typical speech into the same latent space while still extracting accurate linguistic information, creating a unified representation that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features into the proposed framework to reduce the speaker mismatch. Such a framework makes it possible to effectively use a large-scale parallel dataset during pretraining. Compared with the conventional framework using mel-spectrogram input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score. |
| Bibliographic record ID |
|
|
Source identifier type |
NCID |
|
Source identifier |
AN10442647 |
| Bibliographic information |
研究報告音声言語情報処理(SLP)
Vol. 2023-SLP-148,
No. 8,
p. 1-6,
Issue date 2023-10-07
|
| ISSN |
|
|
Source identifier type |
ISSN |
|
Source identifier |
2188-8663 |
| Notice |
|
|
|
SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc. |
| Publisher |
|
|
Language |
ja |
|
Publisher |
情報処理学会 |