Extended Statistical Model for Morphological Analysis

Masayuki Asahara (浅原 正幸); Yuji Matsumoto (松本 裕治)

https://ipsj.ixsq.nii.ac.jp/records/11693

Name / File	License	Action
IPSJ-JNL4303001.pdf (254.0 kB)	Copyright (c) 2002 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2002-03-15

Title

形態素解析のための拡張統計モデル

Title (English)

Extended Statistical Model for Morphological Analysis

Language

jpn

Keywords

Subject scheme

Other

Subject

Paper

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Other title

Natural Language Processing

Author affiliation

Graduate School of Information Science, Nara Institute of Science and Technology (奈良先端科学技術大学院大学情報科学研究科)

Authors

Masayuki Asahara (浅原 正幸); Yuji Matsumoto (松本 裕治)

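Abstract (translated from the Japanese)

Description type

Other

Description

Morphological analysis is the most fundamental processing step in natural language processing. In recent years, large tagged corpora have become available, and corpus-based statistical morphological analyzers have been developed. However, simple statistical methods cannot handle exceptional linguistic phenomena that do not appear in the corpus. To address this problem, this paper proposes a more flexible extended statistical model. To handle exceptional phenomena, we use word-level statistics. With this extension, the amount of corpus data required increases when a large number of finely classified tags are handled. In general, to train on a reasonable amount of corpus data, the number of tags is reduced by grouping multiple tags into equivalence classes. We extend this approach so that, in computing the conditional probabilities of the Markov model, separate equivalence classes of part-of-speech tags are introduced for the preceding part-of-speech tag set and for the following part-of-speech tag set. When a tri-gram model is built from an insufficient amount of corpus data, it overfits the training data. To avoid this, we introduce a selective tri-gram model. On the other hand, because of these extensions, it is difficult to choose by hand which tags to lexicalize and which tri-gram contexts to use. We therefore introduce an error-driven method for this feature selection, making it semi-automatic. We conducted evaluation experiments on Japanese and Chinese morphological analysis and on English part-of-speech tagging, and verified the effectiveness of these extensions.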

Abstract (English)

Description type

Other

Description

Recently, large-scale part-of-speech tagged corpora have become available, making it possible to develop statistical morphological analyzers trained on these corpora. Nevertheless, statistical approaches in isolation cannot cover exceptional language phenomena which do not appear in the corpora. In this paper, we propose three extensions to statistical models in order to cope with such exceptional language phenomena. First of all, we incorporate lexicalized part-of-speech tags into the model by using the word itself as a part-of-speech tag. Second, because the tag set becomes fragmented by the use of lexicalized tags, we reduce the size of the tag set by introducing a new type of grouping technique where the tag set is partitioned, creating two different equivalence classes for the events in the conditional probabilities of a Markov Model. Third, to avoid over-fitting, we selectively introduce tri-gram contexts into a bi-gram model. In order to implement these extensions, we introduce error-driven methods to semi-automatically determine the words to be used as lexicalized tags and the tri-gram contexts to be introduced. We investigate how effective our extensions are through experiments on Japanese, Chinese and English.
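The first two extensions described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the tag groupings, the hand-picked lexicalized word, and all function names are hypothetical, and the transition probability is a plain relative-frequency estimate without smoothing. The sketch only shows the idea of using a word as its own tag and of applying one equivalence-class mapping to the preceding (conditioning) tag and a different one to the following (predicted) tag in a bi-gram model.

```python
from collections import Counter

# Hypothetical toy corpus: each sentence is a list of (word, tag) pairs.
CORPUS = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("cats", "NNS"), ("run", "VBP")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

# Words treated as lexicalized tags (assumption: picked by hand here,
# whereas the abstract describes an error-driven, semi-automatic choice).
LEXICALIZED = {"the"}

# Two different groupings of the same tag set (both hypothetical).
# Preceding position: keep noun number, merge verb forms.
PREV_CLASS = {"NN": "NN", "NNS": "NNS", "VBZ": "V", "VBP": "V"}
# Following position: merge noun number, keep verb forms.
NEXT_CLASS = {"NN": "N", "NNS": "N", "VBZ": "VBZ", "VBP": "VBP"}


def lexicalize(word, tag):
    """Turn the tag into a word-specific tag if the word is lexicalized."""
    return f"{tag}/{word}" if word in LEXICALIZED else tag


def prev_class(tag):
    """Equivalence class of a tag in the conditioning position."""
    return PREV_CLASS.get(tag, tag)


def next_class(tag):
    """Equivalence class of a tag in the predicted position."""
    return NEXT_CLASS.get(tag, tag)


def train(corpus):
    """Count class-level bi-grams, using a separate mapping per position."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in corpus:
        tags = ["<s>"] + [lexicalize(w, t) for w, t in sentence]
        for prev, nxt in zip(tags, tags[1:]):
            c, n = prev_class(prev), next_class(nxt)
            pair_counts[(c, n)] += 1
            context_counts[c] += 1
    return pair_counts, context_counts


def transition_prob(pair_counts, context_counts, prev, nxt):
    """Relative-frequency estimate of P(class of nxt | class of prev)."""
    c, n = prev_class(prev), next_class(nxt)
    if context_counts[c] == 0:
        return 0.0
    return pair_counts[(c, n)] / context_counts[c]


pair_counts, context_counts = train(CORPUS)
print(transition_prob(pair_counts, context_counts, "NN", "VBZ"))
print(transition_prob(pair_counts, context_counts, "<s>", "DT/the"))
```

Note that the estimate is over the class of the following tag, so in a full tagger it would still need to be combined with a within-class tag probability and a word emission probability, neither of which is shown here.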

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

情報処理学会論文誌 (IPSJ Journal)

Vol. 43, No. 3, pp. 685-695, issue date 2002-03-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764
