Multiclass Text Classification Using Ziv-Merhav Crossparsing (多クラス文書分類問題におけるZiv-Merhav Crossparsingの適用と評価)

Aizawa, Akiko (相澤, 彰子)

https://ipsj.ixsq.nii.ac.jp/records/78162

Name / File	License
IPSJ-JNL5210007.pdf (1.2 MB)	Copyright (c) 2011 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2011-10-15

Title

多クラス文書分類問題におけるZiv-Merhav Crossparsingの適用と評価

Title (English)

Multiclass Text Classification Using Ziv-Merhav Crossparsing

Language

jpn

Keywords

Subject scheme

Other

Subject

Regular paper

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

National Institute of Informatics (国立情報学研究所)

Author name

Aizawa, Akiko (相澤, 彰子)

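Abstract

Description type

Other

Description

This paper surveys recent research on data similarity measures based on compression programs and coding, focusing on their application to text documents, and proposes a new text classification method that combines a sequence parsing method called Ziv-Merhav crossparsing with the naive Bayes method. Experiments on different types of classification problems show that the proposed method substantially improves classification performance over both conventional Ziv-Merhav crossparsing and the naive Bayes method. Comparisons against baseline multiclass classifiers based on support vector machines and logistic regression further show that these machine learning methods have the advantage on problems where categories are defined by document topic, such as Reuters-21578 and TechTC-300, whereas the proposed method can have the advantage on problems where categories correspond to the creator of the document, such as identifying the authors of papers. Finally, the method is discussed from the viewpoint of a similarity measure based on variable-length n-grams.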

Abstract (English)

Description type

Other

Description

In this paper, we first present an overview of recent studies on compression- and encoding-based similarity measures for textual documents. Next, we propose a new method that combines Ziv-Merhav crossparsing with a naive Bayes classifier. We then investigate its performance on different types of text classification problems. The experimental results show that the proposed method considerably outperforms both the conventional use of Ziv-Merhav crossparsing and naive Bayes classifiers. We also show that, while multiclass versions of two well-known machine learning methods, a support vector machine and logistic regression, perform better than the proposed method on standard test sets such as Reuters-21578 and TechTC-300, the proposed method performs better on some types of author identification problems. Lastly, we offer a perspective on the proposed method as a similarity measure based on variable-length n-grams.
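To make the crossparsing step named in the abstract concrete, here is a minimal sketch. It is not the paper's proposed method (the combination with naive Bayes is not specified in this record). It assumes the commonly described definition of Ziv-Merhav crossparsing: a sequence z is split left to right into the longest phrases that occur as substrings of a reference sequence x, and a symbol that does not occur in x forms a one-symbol phrase. The function names `crossparse_count` and `classify`, and the rule of choosing the class with the fewest phrases, are illustrative assumptions.

```python
def crossparse_count(z: str, x: str) -> int:
    """Count the phrases obtained when z is parsed against x.

    Each phrase is the longest prefix of the unparsed part of z that
    occurs somewhere in x (assumed definition; see the lead-in).
    Fewer phrases suggest that z resembles x more closely.
    """
    count = 0
    i = 0
    while i < len(z):
        k = 0
        # Extend the match while the next longer prefix still occurs in x.
        while i + k < len(z) and z[i:i + k + 1] in x:
            k += 1
        # A symbol that never occurs in x becomes a one-symbol phrase.
        i += max(k, 1)
        count += 1
    return count


def classify(doc: str, class_texts: dict[str, str]) -> str:
    """Hypothetical decision rule: pick the class whose reference text
    parses the document into the fewest phrases."""
    return min(class_texts, key=lambda c: crossparse_count(doc, class_texts[c]))


if __name__ == "__main__":
    refs = {"a": "aaaaabbbbb", "b": "xyzxyzxyz"}
    print(crossparse_count("abcab", "abxbca"))  # phrases: "ab", "ca", "b"
    print(classify("xyzx", refs))
```

The substring test makes this sketch quadratic or worse in the input length; it is meant only to show the parsing rule, and a practical version would likely use a suffix-based index over x.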

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

IPSJ Journal (情報処理学会論文誌)

Vol. 52, No. 10, pp. 2953-2964, issued 2011-10-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764
