Title: 文字列が単語になる確率を用いた未知語抽出
English title: Extraction of unknown words by the probability to accept the kanji character sequence as one word
Authors: 池谷, 昌紀; 新納, 浩幸
Affiliation: Department of Systems Engineering, Faculty of Engineering, Ibaraki University (both authors)
Language: jpn
Type: Technical Report
Publication: IPSJ SIG Technical Reports, Natural Language Processing (NL) (情報処理学会研究報告自然言語処理(NL)), Vol. 2000, No. 11 (1999-NL-135), pp. 49-54
NCID: AN10115061
Dates: 2000-01-27; 2009-06-30
URI: http://id.nii.ac.jp/1001/00048680/
Download: https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=48680&item_no=1&attribute_id=1&file_no=1
Rights: Copyright (c) 2000 by the Information Processing Society of Japan

Abstract: In this paper, we propose a method for extracting unknown words composed of two or three kanji characters from Japanese text. In general, morphological analysis segments an unknown word composed of kanji characters into several words, and the appearance probability of each segmented word is small. Using this characteristic, we define a measure for accepting a sequence of two or three kanji characters as an unknown word. We can also identify some patterns in how unknown words are segmented. By applying the measure to kanji character sequences that match these patterns, we can extract unknown words. In the experiment, the F-measure was 0.684 for unknown words composed of two kanji characters and 0.182 for unknown words composed of three kanji characters. Our method does not need the frequency of a character sequence α in the training corpus to judge whether α is an unknown word. It therefore has the advantage that low-frequency unknown words can also be extracted.
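
The idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual measure or the actual segmentation patterns, so everything concrete here is an assumption made for illustration: the score (negative log of the product of segment unigram probabilities), the `ALLOWED_PATTERNS` set, the `floor` probability, the threshold, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical segmentation patterns, given as tuples of segment lengths.
# The paper's real patterns are not listed in the abstract.
ALLOWED_PATTERNS = {(1, 1), (1, 2), (2, 1), (1, 1, 1)}


def unknown_word_score(segments, word_prob, floor=1e-9):
    """Assumed stand-in measure: negative log of the product of the
    appearance probabilities of the segments. A higher score means the
    segments are individually rarer, so the joined sequence is a better
    unknown-word candidate. Segments missing from word_prob get `floor`."""
    return -sum(math.log(word_prob.get(seg, floor)) for seg in segments)


def extract_candidates(segmented_sequences, word_prob, threshold):
    """Return joined sequences whose segmentation matches an allowed
    pattern and whose score reaches the threshold."""
    candidates = []
    for segments in segmented_sequences:
        pattern = tuple(len(seg) for seg in segments)
        if pattern not in ALLOWED_PATTERNS:
            continue
        if unknown_word_score(segments, word_prob) >= threshold:
            candidates.append("".join(segments))
    return candidates


# Toy data (invented): output of a morphological analyzer and
# unigram appearance probabilities for the resulting segments.
word_prob = {"携": 1e-6, "帯": 1e-6, "日本": 1e-2, "語": 1e-2}
sequences = [["携", "帯"], ["日本", "語"]]
result = extract_candidates(sequences, word_prob, threshold=20.0)
```

Note that the score uses only the probabilities of the segments, not the frequency of the joined sequence itself, which mirrors the property the abstract highlights: a sequence can be accepted even if it occurs only once.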