NDLTableSet: Construction of a dataset for structuring table areas in digitized materials, and investigation of machine learning methods

Aoike, Toru (青池, 亨)

https://ipsj.ixsq.nii.ac.jp/records/231368

Name / File	License
IPSJ-CH2023033 (916.3 kB)	Copyright (c) 2023 by the Information Processing Society of Japan
Open access

Item type

Symposium(1)

Publication date

2023-12-02

Title

NDLTableSet：デジタル化資料中の表領域の構造化を目的としたデータセットの構築及び機械学習手法の検討

Title (English)

NDLTableSet: Construction of a dataset for structuring table areas in digitized materials, and investigation of machine learning methods

Language

jpn

Keywords

Subject scheme

Other

Subject

Digital archive; dataset; machine learning; table data; OCR

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Resource type

conference paper

Author affiliation

National Diet Library (国立国会図書館)

Author name

Aoike, Toru (青池, 亨)

Abstract

Description type

Other

Description

To reanalyze or graphically visualize the data in a table that appears in an image of a digitized document, it is not enough to convert the table's contents, such as numerical values, into text using OCR (optical character recognition). The table must also be structured according to the positional relationships of its rows and columns and the merging relationships among its cells. In the machine learning field, table structuring has been studied actively for born-digital resources such as PDFs of academic papers, whereas very few public datasets are available for scanned, non-born-digital materials. To address this problem and to make usable the table data in images of digitized materials held by the National Diet Library whose copyright protection period has expired, this study constructed a dataset for structuring tables from images of digitized materials. Using this dataset, it developed a machine learning model that estimates table structure and examined whether the structuring process can be automated.

Bibliographic information

Proceedings of Jinmonkon 2023 (じんもんこん2023論文集)

Vol. 2023, pp. 231-236, issue date 2023-12-02

Publisher

Information Processing Society of Japan (情報処理学会)
