情報学広場: IPSJ Electronic Library


Identification of Auto-generated Files Using Naturalness of Source Code by Language Model (言語モデルによるソースコードの「自然さ」を利用した自動生成ファイルの特定)

https://ipsj.ixsq.nii.ac.jp/records/194391

Name / File	License
IPSJ-JNL6002044.pdf (897.5 kB)	Copyright (c) 2019 by the Information Processing Society of Japan
Open access

Item type

Journal(1)

Publication date

2019-02-15

Title

Title (Japanese)

言語モデルによるソースコードの「自然さ」を利用した自動生成ファイルの特定

Title (English)

Identification of Auto-generated Files Using Naturalness of Source Code by Language Model

Language

jpn

Keywords

Subject scheme

Other

Subject

[Regular paper] auto-generated code, N-gram language model, source code analysis

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Author affiliation

大阪大学大学院情報科学研究科

Author affiliation

大阪大学大学院情報科学研究科

Author affiliation

大阪大学大学院情報科学研究科

Author affiliation

大阪大学大学院情報科学研究科

Author affiliation

大阪大学大学院情報科学研究科

Author affiliation (English)

en

Graduate School of Information Science and Technology, Osaka University

Author affiliation (English)

en

Graduate School of Information Science and Technology, Osaka University

Author affiliation (English)

en

Graduate School of Information Science and Technology, Osaka University

Author affiliation (English)

en

Graduate School of Information Science and Technology, Osaka University

Author affiliation (English)

en

Graduate School of Information Science and Technology, Osaka University

Author names

土居, 真之
肥後, 芳樹
有馬, 諒
下仲, 健斗
楠本, 真二

Author names (English)

Masayuki Doi
Yoshiki Higo
Ryo Arima
Kento Shimonaka
Shinji Kusumoto

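Abstract

Description type

Other

Description

In source code analysis, the source files under analysis sometimes include auto-generated files. Because auto-generated files can adversely affect the analysis, they usually need to be excluded beforehand. One way to exclude them is to judge by visual inspection whether each source file is auto-generated, but this is time-consuming. Another way is to identify auto-generated files by searching for the characteristic comments they contain, but when those comments have been deleted the files can no longer be identified automatically. In this study, we therefore propose a technique that automatically identifies arbitrary auto-generated files by comparing their "naturalness" as auto-generated code with their "naturalness" as human-written code. The naturalness of code, that is, how plausible it is as auto-generated or as human-written code, is quantified with an N-gram language model, which is a probabilistic language model. To evaluate the proposed technique, we conducted experiments on sets of auto-generated files produced by four auto-generation programs. As a result, the technique identified auto-generated files with high accuracy.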

Abstract (English)

Description type

Other

Description

In source code analysis, the target source files sometimes include auto-generated files. Auto-generated files may adversely affect the analysis, so it is often necessary to exclude them before analyzing. One way of excluding auto-generated files is to determine by visual inspection whether each source file is auto-generated, but inspecting source files manually takes too much time. Another way is to search for the special comments that auto-generated files contain. The drawback of this approach is that when such comments have been deleted for some reason, the file cannot be identified automatically. Therefore, in this paper, we propose a technique that automatically identifies auto-generated files by comparing the “naturalness” of a file as auto-generated code with its naturalness as handwritten code. The naturalness of a file, that is, the degree to which it is likely to be auto-generated or handwritten code, is quantified by an N-gram language model, which is a probabilistic language model. To evaluate the proposed technique, experiments were conducted on datasets consisting of auto-generated Java files, handwritten Java files, JavaScript files translated from TypeScript files, and JavaScript files, with the auto-generated files produced by four auto-generation programs. As a result, we were able to identify auto-generated files with very high accuracy.
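The comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regex tokenizer, the trigram order, the add-one smoothing, and all names (`tokenize`, `NgramModel`, `is_auto_generated`) are assumptions introduced here, since the record does not state those details. One model is trained on auto-generated files and one on handwritten files. A file is labeled auto-generated when it has lower cross-entropy (is more "natural") under the auto-generated model.

```python
import math
import re
from collections import Counter


def tokenize(source):
    # Crude lexer (an assumption): identifiers, integers, or single symbols.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


class NgramModel:
    """Token-level N-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of (context..., token)
        self.contexts = Counter()  # counts of (context...)
        self.vocab = set()

    def _padded(self, tokens):
        return ["<s>"] * (self.n - 1) + tokens + ["</s>"]

    def train(self, sources):
        for src in sources:
            toks = self._padded(tokenize(src))
            self.vocab.update(toks)
            for i in range(self.n - 1, len(toks)):
                ctx = tuple(toks[i - self.n + 1:i])
                self.ngrams[ctx + (toks[i],)] += 1
                self.contexts[ctx] += 1

    def cross_entropy(self, source):
        # Average negative log2 probability per token; lower means more natural.
        toks = self._padded(tokenize(source))
        v = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        count = 0
        for i in range(self.n - 1, len(toks)):
            ctx = tuple(toks[i - self.n + 1:i])
            p = (self.ngrams[ctx + (toks[i],)] + 1) / (self.contexts[ctx] + v)
            total -= math.log2(p)
            count += 1
        return total / count


def is_auto_generated(source, generated_model, handwritten_model):
    # Compare naturalness under the two models.
    return (generated_model.cross_entropy(source)
            < handwritten_model.cross_entropy(source))


# Toy training data, invented purely for illustration.
generated_model = NgramModel()
generated_model.train([
    "public void jj_1() { jj_consume(1); }",
    "public void jj_2() { jj_consume(2); }",
])
handwritten_model = NgramModel()
handwritten_model.train([
    "int add(int a, int b) { return a + b; }",
    'String greet(String name) { return "Hello " + name; }',
])

print(is_auto_generated("public void jj_3() { jj_consume(3); }",
                        generated_model, handwritten_model))
```

A real setup would train each model on a large corpus and would probably use a proper lexer for the target language and a stronger smoothing method. The point of the sketch is only that the decision does not depend on any generator-specific comment surviving in the file.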

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN00116647

Bibliographic information

情報処理学会論文誌 (IPSJ Journal)

Vol. 60, No. 2, pp. 642-650, issued 2019-02-15

ISSN

Source identifier type

ISSN

Source identifier

1882-7764


Versions

Ver.1

2025-01-19 23:30:09.576901
