Experiments on Automatic Web Page Categorization for Information Retrieval System
https://ipsj.ixsq.nii.ac.jp/records/12091
Name / File | License | Action |
---|---|---|
| Copyright (c) 2001 by the Information Processing Society of Japan | |
Open access |
Item type | Journal(1) | |||||||
---|---|---|---|---|---|---|---|---|
Publication date | 2001-02-15 | |||||||
Title | ||||||||
Title | Experiments on Automatic Web Page Categorization for Information Retrieval System | |||||||
Language | en | |||||||
Language | ||||||||
Language | eng | |||||||
Keywords | ||||||||
Subject scheme | Other | |||||||
Subject | Paper | |||||||
Resource type | ||||||||
Resource type identifier | http://purl.org/coar/resource_type/c_6501 | |||||||
Resource type | journal article | |||||||
Other title | ||||||||
Other title | Document Processing | |||||||
Author affiliation | ||||||||
Systems Development Laboratory, Hitachi, Ltd. | ||||||||
Author affiliation | ||||||||
Systems Development Laboratory, Hitachi, Ltd. | ||||||||
Author affiliation (English) | ||||||||
en | ||||||||
Systems Development Laboratory, Hitachi, Ltd. | ||||||||
Author affiliation (English) | ||||||||
en | ||||||||
Systems Development Laboratory, Hitachi, Ltd. | ||||||||
Author name | Hisao, Mase | |||||||
Author name (English) | Hisao, Mase | |||||||
Abstract | ||||||||
Description type | Other | |||||||
Description | Our goal is to embed a keyword-based categorization technique into information retrieval systems for Web pages to facilitate end-users' search tasks. To that end, search results must be categorized quickly while keeping accuracy high. Typical keyword-based categorization systems use a knowledge base (KB) to assign categories. The KB contains keywords with weights by category and is generated automatically from training texts. With this keyword-based approach, the algorithms that extract keywords and assign weights to them must be chosen carefully, because they strongly affect accuracy and processing speed. Furthermore, we must take two characteristics of Web pages into account: (1) text length is variable, which makes it harder to use statistics to calculate keyword weights, and (2) too many distinct words are used, which makes the KB larger and therefore processing slower. We propose five methods to normalize the word frequency distribution for higher accuracy, and three methods to filter unimportant words out of the KB for faster processing. We performed experiments to compare these methods in terms of accuracy and KB size. The results show that the accuracy improvement obtained by combining our normalization and filtering methods is statistically significant. The results also show that KBs with various accuracy values and sizes can be generated, and that end-users can select an appropriate KB according to their preferences for accuracy and speed. | |||||||
Bibliographic record ID | ||||||||
Source identifier type | NCID | |||||||
Source identifier | AN00116647 | |||||||
Bibliographic information | 情報処理学会論文誌, Vol. 42, No. 2, pp. 334-348, issued 2001-02-15 | |||||||
ISSN | ||||||||
Source identifier type | ISSN | |||||||
Source identifier | 1882-7764 |