Item type |
SIG Technical Reports(1) |
Publication date |
2017-12-15 |
Title |
|
|
Title |
Detection of mergeable Wikipedia articles based on multiple embedding results |
Language |
|
|
Language |
eng |
Resource type |
|
|
Resource type identifier |
http://purl.org/coar/resource_type/c_18gh |
|
Resource type |
technical report |
Author affiliation |
|
|
|
Graduate School of Information, Production and Systems, Waseda University |
Author affiliation |
|
|
|
Graduate School of Information, Production and Systems, Waseda University |
Author name |
Renzhi, Wang
Mizuho, Iwaihara
|
Abstract |
|
|
Description type |
Other |
|
Description |
Wikipedia is the largest online encyclopedia, and its articles are edited by volunteers with different viewpoints and writing styles. Sometimes two or more articles have different titles but cover themes that are identical or strongly similar. Administrators and editors are expected to detect such article pairs and decide whether they should be merged. In this paper, we propose a method that automatically determines whether an article pair should be merged. We consider both the duplicate case and the overlap case. In the duplicate case, the two articles cover exactly the same content. In the overlap case, the two articles cover related subjects with significant overlap. The overlapping content is similar, but the wording in the two articles is likely to differ, so methods that exploit semantic relatedness are necessary. To address this problem, we propose combining multiple embedding results and rebuilding word vectors to detect mergeable article pairs. We also handle various mergeable cases by combining distinct text fragments. Our experiments show that our method performs better than existing embedding methods. |
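The abstract does not say how the multiple embedding results are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that each embedding is available as a plain word-to-vector dictionary, that word vectors are rebuilt by concatenating the L2-normalized vectors from each embedding, that an article is represented by the mean of its rebuilt word vectors, and that a pair is flagged as mergeable when the cosine similarity reaches a threshold. The function names, the toy vectors, and the threshold of 0.8 are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_embeddings(word, tables):
    """Rebuild a word vector by concatenating its normalized vector from each table.

    Concatenation is an illustrative assumption. Returns None if the word
    is missing from any of the embedding tables.
    """
    parts = []
    for table in tables:
        if word not in table:
            return None
        parts.extend(l2_normalize(table[word]))
    return parts


def article_vector(tokens, tables):
    """Represent an article as the mean of its rebuilt word vectors."""
    vectors = [v for v in (combine_embeddings(t, tables) for t in tokens) if v is not None]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def is_mergeable(tokens_a, tokens_b, tables, threshold=0.8):
    """Flag an article pair as a merge candidate (hypothetical threshold)."""
    vec_a = article_vector(tokens_a, tables)
    vec_b = article_vector(tokens_b, tables)
    if vec_a is None or vec_b is None:
        return False
    return cosine(vec_a, vec_b) >= threshold


# Toy embedding tables with made-up values, standing in for two trained models.
table_a = {"car": [1.0, 0.0], "automobile": [0.9, 0.1], "banana": [0.0, 1.0]}
table_b = {"car": [0.0, 2.0, 0.0], "automobile": [0.1, 1.9, 0.0], "banana": [3.0, 0.0, 0.0]}
tables = [table_a, table_b]

print(is_mergeable(["car"], ["automobile"], tables))
print(is_mergeable(["car"], ["banana"], tables))
```

Normalizing each vector before concatenation keeps one embedding from dominating the similarity score simply because its vectors have a larger scale.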
Bibliographic record ID |
|
|
Source identifier type |
NCID |
|
Source identifier |
AN10112482 |
Bibliographic information |
SIG Technical Reports: Database System (DBS)
Vol. 2017-DBS-166,
No. 15,
p. 1-5,
Issue date 2017-12-15
|
ISSN |
|
|
Source identifier type |
ISSN |
|
Source identifier |
2188-871X |
Notice |
|
|
|
SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc. |
Publisher |
|
|
Language |
ja |
|
Publisher |
Information Processing Society of Japan (情報処理学会) |