Webコーパスの提案 (Web Corpus Construction with Quality Improvement)

関口, 洋一; 山本, 和英; Sekiguchi, Youichi; Yamamoto, Kazuhide

https://ipsj.ixsq.nii.ac.jp/records/48268

Name / File	License
IPSJ-NL03157017.pdf (920.0 kB)	Copyright (c) 2003 by the Information Processing Society of Japan
Open access

Item type

SIG Technical Reports(1)

Publication date

2003-09-29

Title

Webコーパスの提案

Title (English)

Web Corpus Construction with Quality Improvement

Language

jpn

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Author affiliation

長岡技術科学大学電気系

Author affiliation

長岡技術科学大学電気系

Author affiliation (English)

Department of Electrical Engineering, Nagaoka University of Technology

Author affiliation (English)

Department of Electrical Engineering, Nagaoka University of Technology

Author name

関口, 洋一; 山本, 和英

Author name (English)

Sekiguchi, Youichi; Yamamoto, Kazuhide

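Abstract

Description type

Other

Description

We propose a method for building a Web corpus, that is, a corpus whose source is the Web. The newspaper corpora in common use are undeniably limited in size and, as a result, in the number of usage examples they contain. We therefore turned to the Web. Using the Web solves the quantity problem, but Web text used as is has problems in its expressions and in its sentence structure. We therefore examine the corpus from the standpoint of quality. As quality-improvement measures, we first consider external quality, which we try to improve using HTML tags and the writing conventions of Japanese text. We further propose measures that address internal quality: deleting sentences that overuse symbols and sentences written in broken colloquial language, and deleting sentences based on the character-type ratio (字面比), which indicates the proportion of each character type. We ran two experiments on the resulting Web corpus. In the first, we examined the characteristics of its words using the number of distinct words and a thesaurus. In the second, we assessed its usefulness using case frames. As a result, we were able to build a corpus that exceeds both newspaper text and unprocessed Web text in the number of distinct words and in the number of case frames.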

Abstract (English)

Description type

Other

Description

We present a method for constructing a Web corpus. Newspaper corpora, which are commonly used as text corpora for natural language processing, are limited in size. We use a collection of Web pages to address this shortage of data. However, some Web texts are of low quality. We therefore propose methods for removing such texts from the Web corpus. The methods include determining sentence boundaries using a subset of HTML tags, and filtering out sentences whose proportions of each character type fall outside a given range. We confirmed that our Web corpus outperforms a newspaper corpus in terms of the number of words and case frames. We also show that our Web corpus is superior to unprocessed Web texts.
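The character-type filtering mentioned in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the report: the five coarse character classes, the Unicode ranges used to detect them, the function names `char_type`, `char_type_ratios`, and `keep_sentence`, and all three threshold values are hypothetical.

```python
from collections import Counter

# Hypothetical thresholds; the report's actual ranges are not given in this record.
MAX_SYMBOL_RATIO = 0.3
MIN_HIRAGANA_RATIO = 0.1
MAX_HIRAGANA_RATIO = 0.8


def char_type(ch: str) -> str:
    """Classify one character into a coarse class (an illustrative scheme)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    if ch.isalnum():
        return "alnum"
    return "symbol"


def char_type_ratios(sentence: str) -> dict:
    """Return the share of each character class, ignoring whitespace."""
    counts = Counter(char_type(ch) for ch in sentence if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


def keep_sentence(sentence: str) -> bool:
    """Keep a sentence only if its class ratios fall inside the assumed ranges."""
    ratios = char_type_ratios(sentence)
    if not ratios:
        return False
    # Drop sentences that overuse symbols.
    if ratios.get("symbol", 0.0) > MAX_SYMBOL_RATIO:
        return False
    # Drop sentences whose hiragana share is out of range.
    hiragana = ratios.get("hiragana", 0.0)
    return MIN_HIRAGANA_RATIO <= hiragana <= MAX_HIRAGANA_RATIO


if __name__ == "__main__":
    for s in ["我々はWebに着目した．", "★☆★ !!! ★☆★"]:
        print(s, keep_sentence(s))
```

In this sketch a sentence made up mostly of decorative symbols is rejected, while an ordinary sentence mixing kanji, hiragana, and Latin letters passes. Real thresholds would need to be tuned on data.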

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10115061

Bibliographic information

情報処理学会研究報告自然言語処理（NL） (IPSJ SIG Technical Reports: Natural Language Processing (NL))

Vol. 2003, No. 98 (2003-NL-157), pp. 123-130, issued 2003-09-29

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

情報処理学会 (Information Processing Society of Japan)

Cite as

山本, 和英, 2003: 情報処理学会, pp. 123–130.
