Level-3 BLAS and LU Factorization on a Matrix Processor

https://ipsj.ixsq.nii.ac.jp/records/18204
File IPSJ-TACS4902005.pdf (1.1 MB)
License Copyright (c) 2008 by the Information Processing Society of Japan
Access Open access
Item type Trans(1)
Publication date 2008-03-15
Title Level-3 BLAS and LU Factorization on a Matrix Processor
Language eng
Keyword
Subject scheme Other
Subject Numerical computation (数値計算)
Resource type
Resource type identifier http://purl.org/coar/resource_type/c_6501
Resource type journal article
Authors Ahmed S. Zekri, Stanislav G. Sedukhin
Author affiliation Department of Information Systems, The University of Aizu
Abstract
Description type Other
Description As clock frequency scaling approaches its physical limits, a good way to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors, so that they can handle the different workloads in scientific, engineering, and signal processing applications. In this paper, we propose a many-core matrix processor model consisting of a scalar unit augmented with b×b simple cores tightly connected in a 2D torus matrix unit to accelerate matrix-based kernels. Data load/store is overlapped with computing by a decoupled data access unit that moves b×b blocks of data between memory and the scalar and matrix processing units. The matrix unit mainly performs fine-grained b×b matrix multiply-add (MMA) operations. We formulate the data alignment operations, including matrix transposition and skewing, as MMA operations in order to overlap them with data load/store. Two fundamental linear algebra algorithms are designed and analytically evaluated on the proposed matrix processor: the Level-3 BLAS kernel GEMM, and LU factorization with partial pivoting, the main step in solving linear systems of equations. For the GEMM kernel, the maximum computing speed, measured in FLOPs/cycle, is approached for different matrix sizes n and block sizes b. For relatively large values of n, the speed of the LU factorization ranges from around 50% to 90% of the maximum speed, depending on the model parameters. Overall, the analytical results show the merits of using the matrix unit to accelerate matrix-based applications.
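The two kernels named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `mma`, `tile`, `blocked_gemm`, and `lu_partial_pivot` are hypothetical, the torus matrix unit, the data access unit, and all timing are left unmodeled, and the LU routine is the textbook unblocked algorithm rather than the blocked design evaluated in the paper. It only shows how GEMM decomposes into b×b multiply-add steps and what partial pivoting does, assuming square matrices stored as lists of rows and n divisible by b.

```python
def mma(c, a, b):
    """In-place multiply-add on square tiles: c += a * b."""
    n = len(c)
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]


def tile(m, bi, bj, b):
    """Return a copy of the b x b tile at block position (bi, bj)."""
    return [row[bj * b:(bj + 1) * b] for row in m[bi * b:(bi + 1) * b]]


def blocked_gemm(a, bm, c, b):
    """Compute C += A * B for n x n matrices using only b x b MMA steps."""
    n = len(a)
    assert n % b == 0, "sketch assumes n is a multiple of the block size b"
    nb = n // b
    for bi in range(nb):
        for bj in range(nb):
            acc = tile(c, bi, bj, b)          # accumulator tile of C
            for bk in range(nb):
                mma(acc, tile(a, bi, bk, b), tile(bm, bk, bj, b))
            for i in range(b):                # write the tile back into C
                c[bi * b + i][bj * b:(bj + 1) * b] = acc[i]
    return c


def lu_partial_pivot(a):
    """Unblocked LU with partial pivoting.

    Returns (lu, perm): lu holds U on and above the diagonal and the
    multipliers of unit-lower-triangular L below it; row i of the permuted
    matrix is row perm[i] of the input.
    """
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        # Pick the row with the largest magnitude in column k as the pivot.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]              # multiplier stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm
```

In this sketch, the innermost loop over `bk` in `blocked_gemm` is the sequence of b×b MMA operations on one output tile. The trailing update inside `lu_partial_pivot` is the part that a blocked LU would recast as GEMM-like tile updates.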
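Bibliographic record ID
Source identifier type NCID
Source identifier AA11833852
Bibliographic information 情報処理学会論文誌コンピューティングシステム(ACS) (IPSJ Transactions on Advanced Computing Systems)

Vol. 49, No. SIG2(ACS21), pp. 37-52, issued 2008-03-15
ISSN
Source identifier type ISSN
Source identifier 1882-7829
Publisher
Language ja
Publisher 情報処理学会 (Information Processing Society of Japan)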