Fusing deep speaker specific features and MFCC for robust speaker verification
https://ipsj.ixsq.nii.ac.jp/records/94544
License | Access |
---|---|
Copyright (c) 2013 by the Information Processing Society of Japan | Open access |
Item type | SIG Technical Reports(1) |
---|---|
Publication date | 2013-07-18 |
Title | Fusing deep speaker specific features and MFCC for robust speaker verification |
Title language | en |
Language | eng |
Keyword (subject scheme: Other) | Speaker (話者) |
Resource type | technical report |
Resource type identifier | http://purl.org/coar/resource_type/c_18gh |
Author affiliation | Tokyo Institute of Technology |
Author affiliation | Tokyo Institute of Technology |
Author affiliation | Tokyo Institute of Technology |
Author | Ryan, Price |
Abstract (description type: Other) | Acoustic representations typically used in speaker recognition are general and carry mixed information, including information that is irrelevant to the specific task of speaker recognition. Extracting specific information components from the speech signal for a desired task, such as extracting the speaker information component for speaker verification, is challenging. In this study, a nonlinear feature transformation discriminatively trained to extract speaker-specific features from MFCCs is combined with a Gaussian mixture model support vector machine (GMM-SVM) system. Separation of the speaker information component and non-speaker-related information in the speech signal is accomplished using a regularized siamese deep network (RSDN). RSDN learns a hidden representation that characterizes speaker information well by training a subset of the hidden units using pairs of speech segments. The hybrid RSDN GMM-SVM system achieves about 5% relative improvement over the baseline GMM-SVM system when applied to text-independent speaker verification using a subset of the NIST SRE 2006 1conv4w-1conv4w task. Speaker verification systems that fuse information typically provide better performance than those based on a single input modality. Score-level fusion, in which scores from several classifiers are combined, is commonly employed as a fusion method for speaker verification. This study explores several fusion methods for RSDN and MFCC information, including score fusion and the much less widely utilized methods of GMM supervector fusion and feature fusion. Score fusion and GMM supervector fusion offered further performance improvement, both achieving a 6.6% relative improvement over the baseline GMM-SVM system. |
Bibliographic record ID (NCID) | AN10442647 |
Bibliographic information | 研究報告音声言語情報処理(SLP) (SIG Technical Reports: Spoken Language Processing), vol. 2013-SLP-97, no. 3, pp. 1-7, issued 2013-07-18 |
Notice | SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc. |
Publisher | Information Processing Society of Japan (情報処理学会) |
Publisher name language | ja |
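
The abstract describes score-level fusion as combining the scores of several classifiers, and reports gains as relative improvement over a baseline. The record does not state the fusion rule or the error metric used, so the following is only a minimal sketch under assumptions: scores are z-normalized per system and combined by a weighted sum, and relative improvement is computed on an error rate where lower is better. The function names `z_normalize`, `fuse_scores`, and `relative_improvement`, the `weight` parameter, and all numbers are hypothetical and are not taken from the report.

```python
from statistics import fmean, pstdev


def z_normalize(scores):
    """Shift and scale a list of scores to zero mean and unit variance."""
    mean = fmean(scores)
    std = pstdev(scores)
    if std == 0.0:
        # All scores identical: no spread to normalize, return zeros.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


def fuse_scores(mfcc_scores, rsdn_scores, weight=0.5):
    """Weighted-sum fusion of two systems' trial scores.

    Assumption: both lists score the same trials in the same order.
    `weight` is the share given to the first system (hypothetical default).
    """
    if len(mfcc_scores) != len(rsdn_scores):
        raise ValueError("both systems must score the same trials")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    a = z_normalize(mfcc_scores)
    b = z_normalize(rsdn_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, b)]


def relative_improvement(baseline_error, new_error):
    """Relative reduction in an error rate, in percent (lower error is better)."""
    if baseline_error <= 0.0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Made-up scores for four trials, for illustration only.
fused = fuse_scores([0.2, 1.4, -0.3, 0.9], [0.1, 1.1, 0.0, 0.5], weight=0.6)
print(fused)

# Made-up error rates, not results from the report.
print(relative_improvement(10.0, 9.5))
```

In practice the fusion weight would be tuned on held-out data. Normalizing each system's scores first keeps one system's score range from dominating the sum.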