An experimental survey on speaker embedding spaces for controlling speaker identity in speech synthesis system

Wakuto Morita (森田 湧大); Daisuke Saito (齋藤 大輔); Nobuaki Minematsu (峯松 信明)

https://ipsj.ixsq.nii.ac.jp/records/232517

Name / File	License	Action
IPSJ-SLP24151047.pdf (2.6 MB)	Copyright (c) 2024 by the Institute of Electronics, Information and Communication Engineers This SIG report is only available to those in membership of the SIG.
SLP: members ¥0, DLIB: members ¥0

Item type

SIG Technical Reports(1)

Publication date

2024-02-22

Title

話者性を制御可能な音声合成のための話者埋め込み空間に関する実験的検討

Title (English)

An experimental survey on speaker embedding spaces for controlling speaker identity in speech synthesis system

Language

jpn

Keywords

Subject scheme

Other

Subject

Poster Session 2 SP/SLP

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Authors and affiliations

Wakuto Morita (森田 湧大), Graduate School of Engineering, The University of Tokyo
Daisuke Saito (齋藤 大輔), Graduate School of Engineering, The University of Tokyo
Nobuaki Minematsu (峯松 信明), Graduate School of Engineering, The University of Tokyo

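Abstract (translated from the Japanese)

Description type

Other

Description

This paper reports comparative experiments on speech synthesis models with controllable speaker identity, built on speaker embedding extraction models that differ in discriminative ability. We trained multi-speaker speech synthesis models based on several speaker embedding spaces and generated synthesized speech corresponding to interpolation between pairs of two different speakers, in evaluation experiments that assume a user of a speech synthesis application that controls speaker identity. The subjective evaluation of speaker identity confirmed that, among the synthesis models based on low-dimensional speaker embedding spaces, there was a model whose change in speaker identity was close to the listeners' subjective perception. We also evaluated the quality of the synthesized speech and showed that, when atypical speakers are included in the training data, using a low-dimensional embedding space suppresses quality degradation of the synthesis model as a whole. Finally, we examined a method for removing such speakers from the training data; the results suggest that speakers who may cause quality degradation can be detected during preprocessing, before the synthesis model is trained.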

Abstract (English)

Description type

Other

Description

This study investigated how the discriminability of speaker encoders influences speech synthesis models that control speaker individuality through speaker embeddings. We trained multi-speaker speech synthesis models based on several speaker embedding spaces and evaluated speech synthesized from intermediate speaker embeddings between two different speakers, simulating a use case in which a user controls speaker individuality. The subjective evaluation of "individuality" shows that a synthesis model based on a low-dimensional speaker embedding space changes speaker individuality in a way that is close to human perception. In addition, the "naturalness" results confirm that high-dimensional embedding spaces tend to degrade the overall quality of synthesized speech when the training data contain unique speakers. Finally, an additional experiment on excluding speakers that may cause quality degradation before training the synthesis models suggests that such speakers can be detected from ground-truth wav files alone.
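
The "intermediate speaker embeddings between two different speakers" described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the intermediate points were computed, so the code below assumes plain linear interpolation. The function names, the 4-dimensional toy vectors, and the number of steps are hypothetical and are not taken from the report.

```python
from typing import List, Sequence


def interpolate_embeddings(
    emb_a: Sequence[float], emb_b: Sequence[float], alpha: float
) -> List[float]:
    """Return (1 - alpha) * emb_a + alpha * emb_b, element by element.

    Assumption: linear interpolation; the report may use another scheme.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]


def interpolation_path(
    emb_a: Sequence[float], emb_b: Sequence[float], steps: int
) -> List[List[float]]:
    """Embeddings at `steps` evenly spaced points from emb_a to emb_b, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        interpolate_embeddings(emb_a, emb_b, i / (steps - 1))
        for i in range(steps)
    ]


# Toy 4-dimensional embeddings standing in for two speakers (made-up values).
speaker_a = [0.0, 1.0, 2.0, -1.0]
speaker_b = [1.0, 0.0, 4.0, 1.0]
path = interpolation_path(speaker_a, speaker_b, steps=5)
```

In such a setup, each vector in `path` would be passed to a multi-speaker synthesis model in place of a real speaker's embedding, and listeners would judge whether the perceived speaker identity shifts gradually from the first speaker to the second.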

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10442647

Bibliographic information

研究報告音声言語情報処理（SLP） (SIG Technical Reports: Spoken Language Processing)

Vol. 2024-SLP-151, No. 47, pp. 1-6, issued 2024-02-22

ISSN

Source identifier type

ISSN

Source identifier

2188-8663

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

Information Processing Society of Japan (情報処理学会)

Versions

Ver.1

2025-01-19 10:25:17.141830
