Splog Filtering Method Based on Copy String Detection

竹田, 隆治; 高須, 淳宏; Takaharu, Takeda; Atsuhiro, Takasu

https://ipsj.ixsq.nii.ac.jp/records/60721

Name / File	License	Action
IPSJ-TOD0201009.pdf (445.7 kB)	Copyright (c) 2009 by the Information Processing Society of Japan
Open access

Item type

Trans(1)

Publication date

2009-03-31

Title

複製文字列検知に基づいた Splog フィルタリング手法

Title (English)

Splog Filtering Method Based on Copy String Detection

Language

jpn

Keyword

Subject scheme

Other

Subject

Research paper

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

総合研究大学院大学

Author affiliation

国立情報学研究所

Author affiliation (English)

The Graduate University for Advanced Studies

Author affiliation (English)

National Institute of Informatics

Author name

竹田, 隆治

Author name (English)

Takaharu, Takeda

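Abstract

Description type

Other

Description

Data from consumer generated media (CGM) such as blogs contain consumers' first-hand experiences and candid opinions, and are becoming increasingly important as an information source for analyzing customer needs and verifying the effects of promotions. However, blogs include spam content called splogs, created for purposes such as promoting product sales or raising the rank of particular web sites, and these adversely affect blog search and analysis. This paper focuses on detecting copied content, a characteristic of Japanese-language splogs in particular, and proposes a filtering method for it. Japanese splogs are often generated mechanically by copying strings from various documents and stitching them together. We therefore propose an algorithm that uses dynamic programming and a suffix array to efficiently detect the strings in each blog that also appear in other documents, and a splog filtering method based on the proportion of the blog that such strings occupy. We also construct a corpus for evaluating filtering performance, show that the proposed method achieves high filtering performance, and analyze its characteristics.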

Abstract (English)

Description type

Other

Description

CGM (Consumer Generated Media) data such as blogs contain valuable information about customer reputation and are becoming an important information source for detecting customers' needs and analyzing the effects of product promotions. However, CGM data also contain spam content, so-called "splogs", generated to promote products or to improve the rank of search results. Splogs are harmful to CGM content retrieval and analysis. This paper proposes a splog filtering method based on a characteristic of Japanese splogs: they are often generated by combining words and phrases that appear in various documents. The paper proposes an efficient copy string detection algorithm that uses dynamic programming and a suffix array, and applies it to calculate the ratio of copied strings in a blog. We construct an evaluation corpus for splog filters and use it to show that the proposed method achieves high filtering performance.
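
The abstract describes scoring a blog by the share of its text that also appears in other documents. The following Python sketch illustrates that idea only. It is not the authors' algorithm: the suffix array is built naively, the dynamic-programming step mentioned in the abstract is not reproduced, and the names `build_suffix_array`, `occurs`, `copied_ratio` and `is_splog`, the `min_len` parameter, and the `threshold` value are all hypothetical choices made for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction (sorting full suffixes); adequate for a sketch only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs(text, sa, pattern):
    """Binary-search the suffix array sa of text for pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first m characters of the suffix.
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern


def copied_ratio(blog, reference, min_len=4):
    """Fraction of characters in blog covered by substrings of length
    >= min_len that also occur in reference (min_len is an assumption)."""
    if not blog:
        return 0.0
    sa = build_suffix_array(reference)
    covered = [False] * len(blog)
    # Any longer shared substring is a union of shared min_len windows,
    # so checking every window of length min_len is sufficient.
    for start in range(len(blog) - min_len + 1):
        if occurs(reference, sa, blog[start:start + min_len]):
            for k in range(start, start + min_len):
                covered[k] = True
    return sum(covered) / len(blog)


def is_splog(blog, reference, min_len=4, threshold=0.5):
    """Flag a blog whose copied ratio reaches a hypothetical threshold."""
    return copied_ratio(blog, reference, min_len) >= threshold


if __name__ == "__main__":
    print(copied_ratio("abcdXYZabcd", "zzabcdzz"))
```

In this sketch `reference` is a single string; several reference documents could be concatenated with a separator character that does not occur in any blog, which is again an assumption rather than something the abstract specifies.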

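Bibliographic record ID

Source identifier type

NCID

Source identifier

AA11464847

Bibliographic information

情報処理学会論文誌データベース (IPSJ Transactions on Databases, TOD)

Vol. 2, No. 1, pp. 93-103, issued 2009-03-31

ISSN

Source identifier type

ISSN

Source identifier

1882-7799

Publisher

情報処理学会 (Information Processing Society of Japan)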
