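A Statistical Machine Learning Approach to Automatic Labeling of Voiced Consonants for Historical Texts (統計的機械学習を用いた歴史的資料への濁点付与の自動化)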

Teruaki Oka; Mamoru Komachi; Toshinobu Ogiso; Yuji Matsumoto (岡 照晃; 小町 守; 小木曽 智信; 松本 裕治)

https://ipsj.ixsq.nii.ac.jp/records/91615

Name / File	License	Action
IPSJ-JNL5404049.pdf (844.5 kB)	Copyright (c) 2013 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2013-04-15

Title

統計的機械学習を用いた歴史的資料への濁点付与の自動化

Title (English)

A Statistical Machine Learning Approach to Automatic Labeling of Voiced Consonants for Historical Texts

Language

jpn

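Keywords

Subject scheme

Other

Subject

[Regular paper] natural language processing, machine learning, historical texts, voiced consonant marks (dakuten), near-modern literary-style editorial texts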

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

Nara Institute of Science and Technology

Author affiliation

Nara Institute of Science and Technology

Author affiliation

Nara Institute of Science and Technology / National Institute for Japanese Language and Linguistics

Author affiliation

Nara Institute of Science and Technology

Author names

Teruaki Oka; Mamoru Komachi; Toshinobu Ogiso; Yuji Matsumoto (岡 照晃; 小町 守; 小木曽 智信; 松本 裕治)

Abstract

Description type

Other

Description

Raw historical texts often contain mark-lacking characters: characters that should carry a voiced consonant mark (dakuten) but are written without one. Because mark-lacking characters degrade readability and retrievability, voiced consonant marks are added when a historical corpus is built. However, only experts can perform this labeling, so securing annotators is a major challenge, and the sheer volume of material makes the work time-consuming. In this paper, we propose an approach to automatically labeling mark-lacking characters with voiced consonant marks. We formulate the task as a character-level classification problem. Because our method uses only surface information about the surrounding characters as features, training does not require a corpus annotated with word boundaries and POS tags. Trained on a large near-modern Japanese corpus, the method achieved 96% precision and 98% recall on the near-modern Japanese magazine Kokumin-no-Tomo (国民之友).
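The character-level formulation in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not name the classifier or the exact feature set, so a plain perceptron and a two-character context window stand in for them. The function names (`strip_dakuten`, `add_dakuten`, `features`, `make_examples`, `train`, `restore`, `precision_recall`) are hypothetical. Training pairs are produced by stripping marks from already-voiced text, which is one plausible way to obtain labels without morphological annotation.

```python
import unicodedata
from collections import defaultdict

DAKUTEN = "\u3099"  # COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK

def strip_dakuten(text):
    """Simulate mark-lacking text by removing voiced marks, one character at a time."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        base = unicodedata.normalize("NFD", ch).replace(DAKUTEN, "")
        out.append(unicodedata.normalize("NFC", base))
    return "".join(out)

def add_dakuten(ch):
    """Return the voiced form of ch, or None if ch has no precomposed voiced form."""
    voiced = unicodedata.normalize("NFC", ch + DAKUTEN)
    return voiced if len(voiced) == 1 else None

def features(chars, i, window=2):
    """Surface features only: the target character and its neighbours by offset."""
    feats = [f"c0={chars[i]}"]
    for off in range(-window, window + 1):
        if off != 0:
            j = i + off
            ctx = chars[j] if 0 <= j < len(chars) else "<pad>"
            feats.append(f"c{off:+d}={ctx}")
    return feats

def make_examples(voiced_text, window=2):
    """Yield (features, label) for every character that could take a voiced mark."""
    gold = unicodedata.normalize("NFC", voiced_text)
    bare = strip_dakuten(gold)
    for i, (g, b) in enumerate(zip(gold, bare)):
        if add_dakuten(b) is not None:
            yield features(bare, i, window), g != b

def train(examples, epochs=10):
    """Plain binary perceptron over sparse string features (an assumed stand-in)."""
    examples = list(examples)
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, label in examples:
            predicted = sum(w[f] for f in feats) > 0
            if predicted != label:
                for f in feats:
                    w[f] += 1.0 if label else -1.0
    return w

def restore(text, w, window=2):
    """Add a voiced mark wherever the classifier scores a candidate above zero."""
    out = list(text)
    for i, ch in enumerate(text):
        voiced = add_dakuten(ch)
        if voiced is not None and sum(w.get(f, 0.0) for f in features(text, i, window)) > 0:
            out[i] = voiced
    return "".join(out)

def precision_recall(gold, predicted):
    """Score added marks against gold text; both strings must be NFC and aligned."""
    bare = strip_dakuten(gold)
    tp = fp = fn = 0
    for g, p, b in zip(gold, predicted, bare):
        tp += (p != b and g != b)
        fp += (p != b and g == b)
        fn += (p == b and g != b)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

if __name__ == "__main__":
    weights = train(make_examples("がくもん"))
    print(restore("かくもん", weights))
```

Precision here is the share of added marks that are correct, and recall is the share of expected marks that were added, matching the two figures reported in the abstract. The toy input is far too small to say anything about real accuracy; it only shows the data flow from voiced text to training pairs to restored output.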

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

IPSJ Journal (情報処理学会論文誌)

Vol. 54, No. 4, pp. 1641-1654, issued 2013-04-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764
