Selecting N-lowest scores for training MOS prediction models (下位N位スコア平均に基づくMOS予測モデル学習)

近藤, 祐斗; 亀岡, 弘和; 田中, 宏; 金子, 卓弘; Yuto, Kondo; Hirokazu, Kameoka; Kou, Tanaka; Takuhiro, Kaneko

https://ipsj.ixsq.nii.ac.jp/records/232518

Name / File	License	Action
IPSJ-SLP24151048.pdf (1.4 MB)	Copyright (c) 2024 by the Institute of Electronics, Information and Communication Engineers. This SIG report is available only to members of the SIG.
SLP: members: ¥0, DLIB: members: ¥0

Item type

SIG Technical Reports(1)

Publication date

2024-02-22

Title

下位N位スコア平均に基づくMOS予測モデル学習

Title (English)

Selecting N-lowest scores for training MOS prediction models

Language

jpn

Keywords

Subject scheme

Other

Subject

Poster Session 2 SP/SLP

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Author affiliation

NTT Corporation (日本電信電話株式会社), for all four authors

Author names

近藤, 祐斗
亀岡, 弘和
田中, 宏
金子, 卓弘

Author names (English)

Yuto, Kondo
Hirokazu, Kameoka
Kou, Tanaka
Takuhiro, Kaneko

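Abstract

Description type

Other

Description

Subjective speech quality prediction is the task of automatically computing the subjective quality of speech without conducting time-consuming and laborious listener surveys. In particular, training neural models that predict the MOS of speech generated by text-to-speech (TTS) or voice conversion (VC) systems has been actively studied in recent years. A difficulty in predicting the MOS of TTS and VC speech is that speech quality varies from one time segment to another. Consequently, in a MOS test, which segments a listener focuses on when assigning a score depends on the individual listener. We hypothesize that listeners tend to focus on low-quality segments when rating, and that the spread of ratings across listeners for a single utterance is therefore influenced by listeners who overlooked low-quality time intervals and mistakenly gave high scores. In this report, we partially support this hypothesis by analyzing the VCC2018 and BVCC datasets, and propose using the N-lowest MOS (Nlow-MOS), the mean of the N lowest scores, for training MOS prediction models. Experiments show that training MOSNet with Nlow-MOS improves LCC and SRCC compared with training on regular MOS. This indicates that Nlow-MOS is a representative value that reflects subjective speech quality more accurately.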
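The Nlow-MOS described in the abstract, the mean of the N lowest opinion scores given to one speech sample, can be illustrated with a minimal sketch. The function name `nlow_mos`, the example ratings, and the choice to average all ratings when a sample has fewer than N of them are assumptions made for illustration; this record does not describe the authors' implementation.

```python
from statistics import mean


def nlow_mos(scores, n):
    """Mean of the n lowest opinion scores for a single speech sample.

    Assumption: if fewer than n scores are available, all of them are
    averaged, which reduces to the regular MOS for that sample.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not scores:
        raise ValueError("scores must be non-empty")
    lowest = sorted(scores)[:n]  # the n smallest ratings
    return mean(lowest)


# Hypothetical ratings from four listeners for one utterance (1-5 scale).
ratings = [2, 4, 5, 3]

regular_mos = mean(ratings)          # average over all listeners
two_low_mos = nlow_mos(ratings, 2)   # average over the two lowest ratings

print(regular_mos, two_low_mos)
```

With these ratings the regular MOS is 3.5, while the 2-lowest MOS is 2.5, since only the ratings 2 and 3 are averaged. Setting N to the number of listeners recovers the regular MOS.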

Abstract (English)

Description type

Other

Description

Automatic speech quality assessment (SQA) is a task to evaluate the quality of speech samples without resorting to time-consuming listener questionnaires. Attempts have recently been made to train neural-based SQA models to predict the mean opinion score (MOS) of the speech samples produced by text-to-speech or voice conversion systems. One difficulty in the MOS prediction is that the quality of a (particularly automatically generated) speech sample can vary from segment to segment. Thus, in subjective MOS evaluation, it is up to each listener what segments of the speech sample to focus on to determine the score. We hypothesize that listeners tend to base their judgments on low-quality segments, and that the variation among listeners in their ratings of each speech sample is primarily due to their mistakenly assigning higher scores by overlooking such segments. We analyze the VCC2018 and BVCC datasets to support this hypothesis, and propose the use of Nlow-MOS, the mean of the N-lowest opinion scores, for training MOS predictor models. Experimental results show that when Nlow-MOS was used to train MOSNet, higher LCC and SRCC were obtained than when regular MOS was used, suggesting that Nlow-MOS is more likely to reflect subjective speech quality.
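The two evaluation metrics named in the abstract, LCC (linear correlation coefficient) and SRCC (Spearman rank correlation coefficient), can be sketched in plain Python as follows. The function names, the example values, and the idea of correlating predicted scores with target scores per sample are assumptions for illustration; the record does not state at which level the authors computed these metrics.

```python
import math


def pearson(x, y):
    """Linear correlation coefficient (LCC) of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical target scores and model predictions for four samples.
target = [1.5, 2.0, 3.5, 4.0]
predicted = [1.0, 2.0, 3.0, 10.0]

lcc = pearson(target, predicted)
srcc = spearman(target, predicted)
print(lcc, srcc)
```

In this example the predictions preserve the ordering of the targets, so SRCC reaches its maximum of 1, while LCC is lower because the last prediction departs from a linear relationship.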

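Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10442647

Bibliographic information

研究報告音声言語情報処理（SLP） (IPSJ SIG Technical Reports: Spoken Language Processing)

Vol. 2024-SLP-151, No. 48, pp. 1-6, issued 2024-02-22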

ISSN

Source identifier type

ISSN

Source identifier

2188-8663

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

情報処理学会 (Information Processing Society of Japan)
