@techreport{oai:ipsj.ixsq.nii.ac.jp:00216620, author = {森谷, 崇史 and 芦原, 孝典 and 安藤, 厚志 and 佐藤, 宏 and 田中, 智大 and 松浦, 孝平 and 増村, 亮 and デルクロア, マーク and 篠崎, 隆宏 and Takafumi, Moriya and Takanori, Ashihara and Atsushi, Ando and Hiroshi, Sato and Tomohiro, Tanaka and Kohei, Matsuura and Ryo, Masumura and Marc, Delcroix and Takahiro, Shinozaki}, issue = {19}, month = {Feb}, note = {This study describes improvements to the Hybrid RNN-T/Attention model for streaming speech recognition, which combines a recurrent neural network-transducer (RNN-T) with an attention-based decoder (AD). In general, an AD needs the input speech from the start to the end of the utterance to compute its attention weights, which makes streaming operation difficult. In our previous work, we therefore proposed a Hybrid RNN-T/Attention model capable of streaming operation by combining RNN-T with a triggered attention-based decoder (TAD), which computes attention weights from the acoustic features between the start of the utterance and each trigger position. Although the conventional TAD made streaming processing possible, its computational cost and memory consumption remained problematic. In this study we propose a Hybrid RNN-T/Attention model that uses a triggered chunkwise attention-based decoder (TCAD), which reduces the computational cost while maintaining recognition accuracy. To further improve recognition accuracy, we also investigate a language model integration method that uses the two internal language models of the Hybrid RNN-T/Attention model. In this paper we propose improvements to our recently proposed hybrid RNN-T/Attention architecture, which consists of a shared encoder followed by recurrent neural network-transducer (RNN-T) and triggered attention-based decoder (TAD) heads. Triggered attention enables the attention-based decoder (AD) to operate in a streaming manner. When RNN-T detects a trigger point, TAD uses the context from the start of speech up to that trigger point to compute the attention weights. Consequently, the computation cost and memory consumption increase quadratically with utterance duration, because all input features must be stored and reused to recompute the attention weights. In this paper, we instead compute the attention weights from a short context of a few frames preceding each trigger point, which reduces the computation and memory costs. We call the proposed framework triggered chunkwise AD (TCAD).
We also investigate the effectiveness of an internal language model (ILM) estimation approach that uses the ILMs of both the RNN-T and TCAD heads to improve RNN-T performance.}, title = {Hybrid RNN-T/Attention 構造を用いたストリーミング型End-to-End 音声認識モデルと内部言語モデル統合の検討}, year = {2022} }