Level-3 BLAS and LU Factorization on a Matrix Processor
https://ipsj.ixsq.nii.ac.jp/records/18204
License | Copyright (c) 2008 by the Information Processing Society of Japan
Access | Open Access
Item type | Trans(1)
Release date | 2008-03-15
Title | Level-3 BLAS and LU Factorization on a Matrix Processor
Title language | en
Language | eng
Keyword (subject scheme: Other) | Numerical computation
Resource type identifier | http://purl.org/coar/resource_type/c_6501
Resource type | journal article
Author affiliation | Department of Information Systems, The University of Aizu
Author | Ahmed S. Zekri
Abstract (en) | As increasing clock frequency approaches its physical limits, a good approach to enhance performance is to increase parallelism by integrating more cores as coprocessors to general-purpose processors in order to handle the different workloads in scientific, engineering, and signal processing applications. In this paper, we propose a many-core matrix processor model consisting of a scalar unit augmented with b×b simple cores tightly connected in a 2D torus matrix unit to accelerate matrix-based kernels. Data load/store is overlapped with computing using a decoupled data access unit that moves b×b blocks of data between memory and the two scalar and matrix processing units. The operation of the matrix unit is mainly processing fine-grained b×b matrix multiply-add (MMA) operations. We formulate the data alignment operations, including matrix transposition and skewing, as MMA operations in order to overlap them with data load/store. Two fundamental linear algebra algorithms are designed and analytically evaluated on the proposed matrix processor: the Level-3 BLAS kernel, GEMM, and the LU factorization with partial pivoting, the main step in solving linear systems of equations. For the GEMM kernel, the maximum speed of computing measured in FLOPs/cycle is approached for different matrix sizes, n, and block sizes, b. The speed of the LU factorization for relatively large values of n ranges from around 50–90% of the maximum speed, depending on the model parameters. Overall, the analytical results show the merits of using the matrix unit for accelerating matrix-based applications.
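The abstract describes building GEMM entirely out of fine-grained b×b matrix multiply-add (MMA) operations. Below is a minimal NumPy sketch of that blocked-GEMM structure, not the paper's implementation: the `mma` helper, the block size `b`, and the square, b-divisible matrix shapes are illustrative assumptions, and on the proposed processor each MMA would be a single operation executed on the 2D torus of b×b cores, overlapped with block load/store.

```python
import numpy as np

def mma(c_blk, a_blk, b_blk):
    # Fine-grained b x b matrix multiply-add: C <- C + A*B.
    # Hypothetical software stand-in for what the paper's matrix unit
    # would execute as one operation on its b x b torus of cores.
    return c_blk + a_blk @ b_blk

def blocked_gemm(A, B, C, b):
    # Blocked GEMM C <- C + A*B expressed purely as b x b MMA updates,
    # mirroring the kernel structure analyzed in the paper.
    # Assumes square matrices whose order n is a multiple of b.
    n = A.shape[0]
    assert n % b == 0 and A.shape == B.shape == C.shape == (n, n)
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] = mma(C[i:i+b, j:j+b],
                                      A[i:i+b, k:k+b],
                                      B[k:k+b, j:j+b])
    return C

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, b = 8, 2
    A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
    C = np.zeros((n, n))
    blocked_gemm(A, B, C, b)
    print(np.allclose(C, A @ B))  # True: the blocked form matches A*B
```

The trailing-submatrix update in blocked LU with partial pivoting has this same GEMM shape, so the bulk of that factorization can likewise be expressed as b×b MMA operations, which is consistent with the paper evaluating both kernels on the same matrix unit.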
Bibliographic record ID (NCID) | AA11833852
Bibliographic information | IPSJ Transactions on Advanced Computing Systems (ACS), Vol. 49, No. SIG2(ACS21), pp. 37-52, published 2008-03-15
ISSN | 1882-7829
Publisher | Information Processing Society of Japan