Item type |
Symposium(1) |
Publication date |
2024-11-30 |
Title |
|
|
Title |
Beyond OCR: Enhancing Classical Japanese Transcription with Large Language Models |
Title |
|
|
Language |
en |
|
Title |
Beyond OCR: Enhancing Classical Japanese Transcription with Large Language Models |
Language |
|
|
Language |
eng |
Keywords |
|
|
Subject scheme |
Other |
|
Subject |
Large Language Model, Historical Document, Classical Japanese, OCR |
Resource type |
|
|
Resource type identifier |
http://purl.org/coar/resource_type/c_5794 |
|
Resource type |
conference paper |
Author affiliation |
|
|
|
Sakana AI |
Author affiliation |
|
|
|
ROIS-DS Center for Open Data in the Humanities |
Author affiliation |
|
|
|
National Institute of Informatics |
Author affiliation (English) |
|
|
|
en |
|
|
Sakana AI |
Author affiliation (English) |
|
|
|
en |
|
|
ROIS-DS Center for Open Data in the Humanities |
Author affiliation (English) |
|
|
|
en |
|
|
National Institute of Informatics |
Author name |
Clanuwat, Tarin
Zhao, Tianyu
Imajuku, Yuki
Kitamoto, Asanobu
|
Author name (English) |
Tarin Clanuwat
Tianyu Zhao
Yuki Imajuku
Asanobu Kitamoto
|
Abstract |
|
|
Description type |
Other |
|
Description |
This paper presents a methodology for improving Optical Character Recognition (OCR) accuracy on historical Japanese documents using Large Language Models (LLMs). We experimented with six open-source LLMs ranging from 7 to 14 billion parameters and developed two models, a next-token prediction model and an OCR text refiner, both fine-tuned on classical Japanese text from the Minna de Honkoku project. Our approach significantly reduces the Character Error Rate (CER) by correcting misidentified characters and reordering incorrect sequences, and it particularly improves recognition of the Katakana and Kanji characters that the RURI Kuzushiji OCR model often misinterprets. The findings demonstrate the potential of advanced LLMs to improve the digitization and preservation of Japanese historical documents. |
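The abstract reports its results as Character Error Rate (CER). As a point of reference, CER is commonly defined as the character-level edit distance between the OCR output and the ground-truth transcription, divided by the length of the ground truth. The sketch below illustrates that common definition only. The function names `levenshtein` and `cer` and the sample strings are illustrative assumptions, and the record does not state which exact CER variant or tooling the authors used.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference side
                curr[j - 1] + 1,     # insert h
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Edit distance normalised by reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical example: two characters of a 7-character line are swapped.
reference = "いろはにほへと"
ocr_output = "いろほにはへと"
error_rate = cer(reference, ocr_output)
```

Under this definition a swapped pair of characters counts as two substitutions, so a refiner that restores the correct order lowers the CER for that line to zero.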
Bibliographic information |
Proceedings of Jinmoncom 2024 (じんもんこん2024論文集)
Volume 2024,
pp. 75-82,
Issue date 2024-11-30
|
Publisher |
|
|
Language |
ja |
|
Publisher |
Information Processing Society of Japan (情報処理学会) |