Maximum Margin Classifier Working in a Metric Space of Strings and Its Application to Protein Science

小谷野, 仁; 林田, 守広; 阿久津, 達也; Hitoshi, Koyano; Morihiro, Hayashida; Tatsuya, Akutsu

https://ipsj.ixsq.nii.ac.jp/records/101824

Name / File	License	Action
IPSJ-MPS14098013.pdf (908.1 kB)	Copyright (c) 2014 by the Information Processing Society of Japan
Open access

Item type

SIG Technical Reports(1)

Publication date

2014-06-18

Title

Maximum Margin Classifier Working in a Metric Space of Strings and Its Application to Protein Science

Original title (Japanese)

文字列の距離空間上の最大マージン識別器とそのタンパク質科学への応用

Language

jpn

Keyword

Subject scheme

Other

Subject

Joint Session: Bio-Data Mining

Resource type identifier

http://purl.org/coar/resource_type/c_18gh

Resource type

technical report

Author affiliations

Institute for Advancement of Clinical and Translational Science, Graduate School of Medicine, Kyoto University

Bioinformatics Center, Institute for Chemical Research, Kyoto University

Bioinformatics Center, Institute for Chemical Research, Kyoto University

Author name

小谷野, 仁

Author name (English)

Hitoshi, Koyano

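Abstract

Description type

Other

Description

Until now, most data have been numbers or numerical vectors. In recent years, however, computer science and biology have produced large volumes of string data, such as text data and biological sequences, and classifying string data has become a problem common to many fields. The method most widely used for this problem today is to convert strings into numerical vectors with a string kernel and apply a support vector machine to the result. This conversion is not one-to-one, however, and discards a considerable amount of information about the order of the characters that make up a string. A more serious problem with this approach is that it makes it impossible to take into account the important fact that the data given for training and testing a learning machine are strings generated according to some probability law, and hence to evaluate the generalization error of the learning machine theoretically using probability theory. Why convert string data into numerical vectors and use a learning machine that operates on a numerical vector space in order to classify them? To classify strings, it is natural to use a learning machine that operates on a set of strings. We approached this classification problem by constructing a learning machine that takes the strings themselves as input, without converting them into numerical vectors. Evaluating the generalization error of such a learning machine theoretically requires a probability theory for strings. Strings have so far been objects of computer science rather than of mathematics, and no one had given a set of strings a topological or algebraic structure and developed probability theory on it. In earlier work, however, one of the authors and his collaborator developed probability theory on a metric space of strings equipped with the Levenshtein distance and proved an analogue, in this space, of the strong law of large numbers in a vector space. In this study, by applying this probability theory on a set of strings, we proved that, under certain regularity conditions, our learning machine classifies strings in an asymptotically optimal manner. Furthermore, we applied our learning machine to the problem of predicting protein-protein interactions from amino acid sequences and demonstrated its usefulness in the analysis of real data.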

Abstract (English)

Description type

Other

Description

Numbers and numerical vectors account for a large portion of data. However, recently, the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely used approach to this problem is to convert strings into numerical vectors using string kernels and subsequently apply a support vector machine that works in a numerical vector space. However, this non-one-to-one conversion involves information loss and makes it impossible to evaluate, using probability theory, the generalization error of a learning machine, considering that the given data to train and test the machine are strings generated according to probability laws. We approach this classification problem by constructing a classifier that receives the strings themselves as inputs. To evaluate the generalization error of such a classifier theoretically, probability theory for strings is required. A string is an object of computer science rather than mathematics, and probability theory for strings has not been constructed. However, one of the authors and his colleague, in previous studies, first developed a probability theory on a metric space of strings provided with the Levenshtein distance and demonstrated an analogy of the strong law of large numbers in a numerical vector space. In this study, by applying this probability theory on a set of strings, we demonstrate that our developed learning machine classifies strings in an asymptotically optimal manner. Furthermore, we demonstrate the usefulness of our machine in practical data analysis by applying it to predicting protein-protein interactions using amino acid sequences.
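
The abstract's two central ingredients, the Levenshtein distance that makes a set of strings a metric space and the information lost when strings are mapped to numerical vectors, can be illustrated with a minimal sketch. The functions `levenshtein` and `kmer_counts` below are hypothetical names introduced only for illustration. The 2-mer count vector is an assumed stand-in for a string-kernel feature map (the abstract does not say which kernels it has in mind), and nothing here reproduces the authors' classifier.

```python
from collections import Counter


def levenshtein(s: str, t: str) -> int:
    """Levenshtein (edit) distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn s into t."""
    # prev[j] is the distance between the previous prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def kmer_counts(s: str, k: int = 2) -> Counter:
    """Counts of all length-k substrings of s (an assumed, spectrum-style
    feature map used here only to illustrate a non-injective conversion)."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


if __name__ == "__main__":
    a, b = "aaba", "abaa"
    # Two different strings are mapped to the same 2-mer count vector ...
    print(kmer_counts(a) == kmer_counts(b))  # True
    # ... although they are distinct points in the Levenshtein metric space.
    print(levenshtein(a, b))                 # 2
```

In this toy case the vector representation cannot tell the two strings apart, whereas a method that works directly on the metric space still sees them at distance 2.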

Bibliographic record ID

Source identifier type

NCID

Source identifier

AN10505667

Bibliographic information

IPSJ SIG Technical Report: Mathematical Modeling and Problem Solving (MPS)

Vol. 2014-MPS-98, No. 13, pp. 1-8, issued 2014-06-18

Notice

SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc.

Publisher

Information Processing Society of Japan
