Response date: 2024-03-29T08:46:01Z
Request: https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_oaipmh
Identifier: oai:ipsj.ixsq.nii.ac.jp:00109615
Datestamp: 2023-11-17T02:17:36Z
Set: 06504:06739:07814
Title: Detection of Paragraph Boundaries in Complex Page Layouts for Electronic Documents
Language: eng
Subject: Databases and Media
URI: http://id.nii.ac.jp/1001/00109591/
Type: Conference Paper
Download: https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=109615&item_no=1&attribute_id=1&file_no=1
Rights: Copyright (c) 2012 by the Information Processing Society of Japan
Authors: Yimin Chu (University of Tokyo); 高須淳宏 (National Institute of Informatics); 安達淳 (National Institute of Informatics)
Abstract: The precision of paragraph segmentation is critical for the subsequent information retrieval tasks in the reverse engineering of paginated electronic documents such as PDF files. Current layout-analysis solutions designed for simple layouts are not flexible enough to adapt to various complex layouts. Here we propose a method that determines paragraph boundaries with machine learning techniques. We decide the paragraph boundaries based on the features of other, less ambiguous parts of the paragraph. A tree structure is also designed so that the text content can be grouped flexibly.
NCID: AN00349328
Source: Proceedings of the 74th National Convention
Volume: 2012
Issue: 1
Pages: 539-540
Dates: 2012-03-06; 2014-12-18