A Parallel Distributed Web Crawler Consisting of Producer-Consumer Modules (Producer-Consumer型モジュールで構成された並列分散Webクローラの開発)

上田, 高徳; 佐藤, 亘; 鈴木, 大地; 打田, 研二; 森本, 浩介; 秋岡, 明香; 山名, 早人; Takanori, Ueda; Koh, Satoh; Daichi, Suzuki; Kenji, Uchida; Kousuke, Morimoto; Sayaka, Akioka; Hayato, Yamana


https://ipsj.ixsq.nii.ac.jp/records/91388

Name / File	License	Action
IPSJ-TOD0602008.pdf (1.6 MB)	Copyright (c) 2013 by the Information Processing Society of Japan
Open access

Item type

Trans(1)

Publication date

2013-03-29

Title

Producer-Consumer型モジュールで構成された並列分散Webクローラの開発

Title (English)

A Parallel Distributed Web Crawler Consisting of Producer-Consumer Modules

Language

jpn

Keywords

Subject scheme

Other

Subject

[Case study / practice paper] Web crawler, parallel distributed processing, Producer-Consumer model

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

Waseda University (早稲田大学): first through sixth entries

Waseda University / National Institute of Informatics (早稲田大学／国立情報学研究所): seventh entry

Author names

上田, 高徳; 佐藤, 亘; 鈴木, 大地; 打田, 研二; 森本, 浩介; 秋岡, 明香; 山名, 早人

Author names (English)

Takanori, Ueda; Koh, Satoh; Daichi, Suzuki; Kenji, Uchida; Kousuke, Morimoto; Sayaka, Akioka; Hayato, Yamana

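Abstract

Description type

Other

Description

A Web crawler must collect data while performing tasks such as detecting already-crawled URLs and preventing consecutive accesses to the same Web server. Parallel distributed crawling is needed to collect the enormous number of URLs on the Web at high speed, but crawling with few resources also requires load balancing across machines, in addition to reducing the time and space complexity of the processing. The Web crawler proposed in this paper executes the crawling process as a group of Producer-Consumer modules, and thereby achieves load balancing per crawler module rather than per crawled Web site, as in earlier approaches. In other words, processing can be distributed according to the computing resources each module requires, which reduces the imbalance in computational load and memory usage across machines. In addition, the modules that manage host names and URLs are built on data structures with good time and space complexity, so large-scale crawling becomes possible with few resources.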

Abstract (English)

Description type

Other

Description

Web crawlers must collect Web data while performing tasks such as detecting already-crawled URLs and preventing consecutive accesses to a particular Web server. Parallel distributed crawling is needed to collect the enormous number of URLs existing on the Web at high speed. However, in order to crawl with few resources, a crawler must balance load between computers in addition to reducing the time and space complexity of the crawling process. The Web crawler proposed in this paper crawls the Web using the producer-consumer modules that compose it, and it balances load per module rather than per crawled Web site. In other words, it distributes processing according to the computer resources each module requires; in this way, it reduces imbalances in computation load and memory utilization between computers. Moreover, the crawler is able to crawl the Web on a large scale while conserving resources, because the modules that manage host names and URLs are implemented with data structures that are efficient in both time and space.
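The producer-consumer arrangement described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not list the crawler's modules, so the two stages below (`url_filter`, `fetcher`), the `STOP` sentinel, and the `fake_fetch` stub are hypothetical names chosen for illustration. A plain Python `set` stands in for whatever space-efficient URL structure the paper uses, and no network access is performed.

```python
import queue
import threading

STOP = object()  # sentinel telling a downstream module to shut down


def url_filter(inbox, outbox):
    """Consume candidate URLs; produce only URLs not seen before."""
    seen = set()  # assumption: a plain set, not the paper's data structure
    while (url := inbox.get()) is not STOP:
        if url not in seen:
            seen.add(url)
            outbox.put(url)
    outbox.put(STOP)  # propagate shutdown to the next module


def fetcher(inbox, outbox, fetch):
    """Consume URLs; produce (url, body) pairs via the supplied fetch function."""
    while (url := inbox.get()) is not STOP:
        outbox.put((url, fetch(url)))
    outbox.put(STOP)


def run_pipeline(urls, fetch):
    """Wire the modules together with queues and collect the fetched pages."""
    candidates, unique, pages = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=url_filter, args=(candidates, unique)),
        threading.Thread(target=fetcher, args=(unique, pages, fetch)),
    ]
    for worker in workers:
        worker.start()
    for url in urls:
        candidates.put(url)
    candidates.put(STOP)
    results = []
    while (item := pages.get()) is not STOP:
        results.append(item)
    for worker in workers:
        worker.join()
    return results


def fake_fetch(url):
    """Stand-in for an HTTP request so the sketch runs offline."""
    return f"<html>{url}</html>"


if __name__ == "__main__":
    demo = ["http://a.example/1", "http://b.example/", "http://a.example/1"]
    for url, body in run_pipeline(demo, fake_fetch):
        print(url, len(body))
```

Because each module touches only its input and output queues, a module could in principle be moved to a different machine by swapping the in-process queue for a network channel. That is the property which makes load balancing per module, rather than per crawled site, plausible in a design of this kind.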

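Bibliographic record ID

Source identifier type

NCID

Source identifier

AA11464847

Bibliographic information

IPSJ Transactions on Databases (情報処理学会論文誌データベース, TOD)

Vol. 6, No. 2, pp. 85-97, issued 2013-03-29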

ISSN

Source identifier type

ISSN

Source identifier

1882-7799

Publisher

Information Processing Society of Japan (情報処理学会)


Versions

Ver.1

2025-01-21 15:31:41.594499
