An Implementation and Its Evaluation of Zero-shot Real-time Voice Conversion Method Using AutoVC

鈴木, 大志; 鷹合, 大輔; 中沢, 実; Daishi, Suzuki; Daisuke, Takago; Minoru, Nakazawa

https://doi.org/10.20729/00232320

Name / File	License
IPSJ-JNL6502048.pdf (3.0 MB)	Copyright (c) 2024 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2024-02-15

Title

AutoVCを用いたゼロショットリアルタイム声質変換手法の実装と評価

Title (English)

An Implementation and Its Evaluation of Zero-shot Real-time Voice Conversion Method Using AutoVC

Language

jpn

Keywords

Subject scheme

Other

Subject

[Special Issue: Network Services and Distributed Processing] voice conversion, zero-shot, real-time, deep learning

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

ID registration

10.20729/00232320

ID registration type

JaLC

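Author affiliation

金沢工業大学

Author affiliation

金沢工業大学

Author affiliation

金沢工業大学

Author affiliation (English)

Kanazawa Institute of Technology

Author affiliation (English)

Kanazawa Institute of Technology

Author affiliation (English)

Kanazawa Institute of Technology

Author name

鈴木, 大志
鷹合, 大輔
中沢, 実

Author name (English)

Daishi, Suzuki
Daisuke, Takago
Minoru, Nakazawa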

Abstract

Description type

Other

Description

Voice conversion is a technology that converts the voice quality of one speaker into that of another speaker. Among voice conversion techniques, methods that convert between voice qualities not present in the speech used to train the conversion model are called zero-shot voice conversion. AutoVC and other zero-shot voice conversion methods often follow a three-step procedure: (1) convert the input speaker's speech into a mel-spectrogram, (2) convert the input speaker's mel-spectrogram into that of the output speaker, and (3) generate a speech signal from the output speaker's mel-spectrogram. Real-time voice conversion is possible if the time required for conversion is shorter than the duration of the input speech, but using deep learning models in both (2) and (3) increases the amount of computation, which makes real-time voice conversion difficult to achieve. Therefore, in this research we use three speech features, namely the spectral envelope, the fundamental frequency, and the aperiodicity index, and propose a method that reduces the amount of computation by applying the deep learning model only to the conversion of the spectral envelope. The deep learning model is based on the structure of AutoVC, with modified pre-processing and post-processing parts. Experiments show that the time required to process one second of audio signal is 0.2 seconds or less in a GPU environment, indicating that real-time voice conversion is possible. In addition, evaluation by MOS (Mean Opinion Score) showed that quality was improved compared with the conventional AutoVC.
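The real-time criterion stated in the abstract (conversion must take less time than the input audio lasts) can be expressed as a real-time factor below 1. The following is a minimal sketch of that arithmetic only, not the authors' implementation: the function names `real_time_factor` and `measure_rtf`, the pass-through stand-in `passthrough`, and the 16 kHz sample rate are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time per second of input audio (below 1.0 means real-time capable)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(convert: Callable[[Sequence[float]], Sequence[float]],
                samples: Sequence[float], sample_rate: int) -> float:
    """Time a single call to `convert` and relate it to the duration of `samples`."""
    start = time.perf_counter()
    convert(samples)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples) / sample_rate)


def passthrough(samples: Sequence[float]) -> Sequence[float]:
    """Hypothetical stand-in for a conversion pipeline: returns the input unchanged."""
    return samples


if __name__ == "__main__":
    sample_rate = 16000                 # assumed rate, not taken from the paper
    one_second = [0.0] * sample_rate    # 1 s of silence
    rtf = measure_rtf(passthrough, one_second, sample_rate)
    print(f"real-time factor: {rtf:.6f}, real-time capable: {rtf < 1.0}")
```

Under this definition, the reported 0.2 seconds of processing per second of audio corresponds to a real-time factor of 0.2, which leaves headroom for buffering and I/O that the sketch does not model.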

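Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

情報処理学会論文誌 (IPSJ Journal)

Vol. 65, No. 2, pp. 529-537, issue date 2024-02-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764

Publisher

Information Processing Society of Japan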

Versions

Ver.1

2025-01-19 10:22:27.942406
