転置ファイルとビット配列を用いた高速文字列あいまい照合アルゴリズム

下村, 秀樹; 福島, 俊一; Hideki, Shimomura; Toshikazu, Fukushima

WEKO3

lat lon distance

[[sub_check.contents]]

[[sub_radio.contents]]

Field does not validate

[[sub_attr.contents]]　

インデックスツリー

アイテム

転置ファイルとビット配列を用いた高速文字列あいまい照合アルゴリズム

https://ipsj.ixsq.nii.ac.jp/records/12746

名前 / ファイル	ライセンス	アクション
IPSJ-JNL4004052.pdf (1.7 MB)	Copyright (c) 1999 by the Information Processing Society of Japan
オープンアクセス

Item type

Journal(1)

公開日

1999-04-15

タイトル

転置ファイルとビット配列を用いた高速文字列あいまい照合アルゴリズム

タイトル

言語

タイトル

A Fast Approximate String Matching Algorithm Using an Inverted File and Bit - arrays

言語

jpn

キーワード

主題Scheme

Other

主題

論文

資源タイプ

資源タイプ識別子

http://purl.org/coar/resource_type/c_6501

資源タイプ

journal article

その他タイトル

その他のタイトル

テキスト処理

著者所属

日本電気株式会社ヒューマンメディア研究所

著者所属

日本電気株式会社ヒューマンメディア研究所

著者所属(英)

Human Media Research Laboratories, NEC Corporation

著者所属(英)

Human Media Research Laboratories, NEC Corporation

著者名

下村, 秀樹

著者名(英)

Hideki, Shimomura

論文抄録

内容記述タイプ

Other

内容記述

本稿では文字ベースの転置ファイルとビット配列を用いた高速文字列あいまい照合アルゴリズムを提案し実験によってその有効性を示す. 文字列あいまい照合の代表的アルゴリズムに WuとManberのアルゴリズム(以下「Bit-Arrayアルゴリズム」)がある. これは照合状況を示すビット配列に単純なシフト演算と論理演算を逐次的に施すことで長さnのテキストと長さmの検索文字列に対するk回以内の挿入/削除/置換を許した文字列あいまい照合をワード長がmビット以上の計算機においてO(nk)の計算量で実行できるという優れた特長を持つ. しかし照合のためにテキストの全文字を参照する必要がありそのままで大量テキストのあいまい検索に適した方法とはいえない. これに対して本稿では Bit-Arrayアルゴリズムの考え方を大量の固定的テキストの検索に有効な文字ベースの転置ファイルと組み合わせた「スキップ型Bit-Arrayアルゴリズム(SBAアルゴリズム)」を提案する. SBAアルゴリズムは転置ファイルを参照することで検索文字列の構成文字以外のテキスト位置でのビット配列計算をスキップする. 照合処理の計算量はO(kmn/|Σ|)であるが(|Σ|は文字セット規模) |Σ|に比べてmが十分小さければBit-Arrayアルゴリズムよりも高速な検索速度が得られる. 1000万文字の日本語テキストに対する検索実験(2≤m≤10 k≤m)では従来のBit-Arrayアルゴリズムと比べて照合処理時間が約1728?約1/178に短縮され SBAアルゴリズムの優位性を確認できた.

論文抄録(英)

内容記述タイプ

Other

内容記述

This paper proposes a new approximate string matching (ASM) algorithm using a character-base inverted file and bit-arrays, and also describes experimental results to show its effectiveness. One of the well-known ASM algorithms is Wu and Manber's (Bit-Array algorithm), which applies bit-shift and bit-AND/OR operations on bit-arrays representing matching situations. It implemented on a computer with more than m bits for the word size, in time of O(nk), where n is the text length, m is the pattern length, and mismatches are allowed with in k times. However, it is not suitable for some applications searching a large amount of text, because it has to scan all characters in the text at least once. The new algorithm (SBA algorithm), proposed in this paper, solves the problem by combining the idea of the Bit-Array algorithm with a character-based inverted file method. It requires O(kmn/|Σ|) time for ASM, where |Σ| is the size of a character set. When m is smaller enough than |Σ|, the SBA algorithm runs faster than the Bit-Array algorithm, because its cans only the index data corresponding to the characters which the pattern consists of. The experimental results for 10 million character Japanese text with 2≤m≤10 and k≤m show that the SBA algorithm has the advantage of its matching speed 28縲鰀178 times faster than the Bit-Array algorithm.

書誌レコードID

収録物識別子タイプ

NCID

収録物識別子

AN00116647

書誌情報

情報処理学会論文誌

巻 40, 号 4, p. 1816-1830, 発行日 1999-04-15

ISSN

収録物識別子タイプ

ISSN

収録物識別子

1882-7764

戻る

views

See details

	Views

Versions

Ver.1

2025-01-23 01:31:51.653141

Show All versions

Cite as

エクスポート

OAI-PMH

JPCOAR
DublinCore
DDI

Other Formats

JSON
BIBTEX

インデックスリンク

インデックスツリー

アイテム

転置ファイルとビット配列を用いた高速文字列あいまい照合アルゴリズム

× 下村, 秀樹

× Hideki, Shimomura

Versions

Share

Cite as

エクスポート