Evaluation of a cGAN Model and Random Seed Oversampling on Imbalanced JavaScript Datasets
https://ipsj.ixsq.nii.ac.jp/records/220191
Copyright (c) 2022 by the Information Processing Society of Japan
Open access
| Item type | Journal(1) |
|---|---|
| Publication date | 2022-09-15 |
| Title | Evaluation of a cGAN Model and Random Seed Oversampling on Imbalanced JavaScript Datasets |
| Title language | en |
| Language | eng |
| Subject scheme | Other |
| Keywords | [Special Issue: Computer Security Technologies with a View to the Quantum Era] malicious JavaScript, LSI model, natural language processing, oversampling, machine learning, cGAN |
| Resource type | journal article |
| Resource type identifier | http://purl.org/coar/resource_type/c_6501 |
| Author affiliation | National Defense Academy |
| Authors | Ngoc, Minh Phung; Mamoru, Mimura |
| Abstract (description type: Other) | Malicious JavaScript detection using machine learning models has shown many strong results over the years. The main problem is that the dataset used to train the model tends to be imbalanced, as the malicious dataset is far smaller than the benign one. Many previous techniques ignore most of the benign samples and focus on training a machine learning model with a balanced dataset. However, real-world data contains only a small fraction of malicious JavaScript, making it an imbalanced dataset. This paper proposes a cGAN-based filter model that can quickly classify JavaScript malware using Natural Language Processing (NLP) and oversampling. The features of each JavaScript file are converted into vector form and used to train an SVM classifier. Different NLP models, such as Doc2Vec and Latent Semantic Indexing (LSI), and different oversampling methods are tested to achieve a high recall score. In this paper, a cGAN model is used to generate new malicious training data based on the original training dataset. We evaluate our models with a dataset of over 30,000 samples obtained from popular websites, PhishTank, and GitHub. The experimental results show that the best recall score, 0.78, is achieved with the LSI model. |
| Note | This is a preprint of an article intended for publication in the Journal of Information Processing (JIP). This preprint should not be cited. This article should be cited as: Journal of Information Processing Vol.30 (2022) (online), DOI http://dx.doi.org/10.2197/ipsjjip.30.591 |
| Bibliographic record ID (NCID) | AN00116647 |
| Bibliographic information | 情報処理学会論文誌 (IPSJ Journal), Vol. 63, No. 9, issued 2022-09-15 |
| ISSN | 1882-7764 |
| Publisher | 情報処理学会 (Information Processing Society of Japan) |
| Publisher language | ja |
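
The abstract refers to oversampling an imbalanced training set and to recall as the evaluation metric. The sketch below illustrates only those two ideas on toy data, using plain random oversampling (duplicating minority samples). It is not the paper's cGAN generator, and the record does not describe the authors' exact procedure; the function names, labels, and toy vectors are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until the minority
    class is as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples with the minority label")
    counts = Counter(labels)
    extra_needed = max(counts.values()) - counts[minority_label]
    extra = [rng.choice(minority) for _ in range(extra_needed)]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


def recall(y_true, y_pred, positive_label):
    """Recall = TP / (TP + FN) for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_label and p == positive_label)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_label and p != positive_label)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy feature vectors standing in for document embeddings (assumption:
# real vectors would come from a model such as Doc2Vec or LSI).
vectors = [[0.1, 0.2]] * 8 + [[0.9, 0.8], [0.7, 0.9]]
labels = ["benign"] * 8 + ["malicious"] * 2

balanced_vectors, balanced_labels = random_oversample(
    vectors, labels, "malicious", seed=42
)
balanced_counts = Counter(balanced_labels)
```

In a full pipeline, oversampling would be applied to the training split only, so that the test set keeps the real-world class imbalance against which recall is measured.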