Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs

Daichi Mukunoki; Daisuke Takahashi

https://ipsj.ixsq.nii.ac.jp/records/71791

Name / File	License	Action
IPSJ-HPCS2011054.pdf (661.4 kB)	Copyright (c) 2011 by the Information Processing Society of Japan
Open access

Item type

Symposium(1)

Publication date

2011-01-11

Title

Implementation and Evaluation of Quadruple and Octuple Precision BLAS on GPUs

Title (original, Japanese)

GPUによる4倍・8倍精度BLASの実装と評価

Language

jpn

Keywords

Subject scheme

Other

Subject

Applications and High-Performance Implementation

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_5794

Resource type

conference paper

Author affiliation

Graduate School of Systems and Information Engineering, University of Tsukuba

Author affiliation

Graduate School of Systems and Information Engineering, University of Tsukuba

Authors

Mukunoki, Daichi (椋木 大地); Takahashi, Daisuke (高橋 大介)

Abstract

Description type

Other

Description

We implemented Basic Linear Algebra Subprograms (BLAS) functions supporting quadruple and octuple precision on graphics processing units (GPUs) and evaluated their performance. For quadruple precision we used double-double (DD) arithmetic, which represents a quadruple precision value by combining two double precision values; for octuple precision we used quad-double (QD) arithmetic, which combines four double precision values. On an NVIDIA Tesla C2050, quadruple precision AXPY ran approximately 9.5 times faster and octuple precision AXPY approximately 19 times faster than the same operations on an Intel Core i7 920. Quadruple precision GEMM was approximately 29 times faster and octuple precision GEMM approximately 24 times faster than on the CPU. Moreover, on the Tesla C2050 quadruple precision AXPY took at most about 2.1 times as long as double precision AXPY, and for GEMV and GEMM the increase in execution time relative to double precision was also much smaller than on the CPU. When PCI-Express (PCIe) data transfer time is taken into account, double precision GEMM tends to be limited by PCIe transfer performance, whereas for quadruple and octuple precision GEMM this limitation almost disappears. We show that quadruple and octuple precision BLAS operations are well suited to GPUs and achieve practical performance compared with CPUs.
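To make the double-double (DD) representation mentioned in the abstract concrete, the following is a minimal CPU-side sketch in Python, not the authors' GPU implementation. It assumes the textbook building blocks (Knuth's error-free addition, Dekker's error-free product, and a simplified "sloppy" DD addition); the record does not say which variants the paper uses. All function names (`two_sum`, `two_prod`, `dd_add`, `dd_mul`, `dd_axpy`, `dd_exact`) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free addition (Knuth): s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def quick_two_sum(a, b):
    """Error-free addition that assumes |a| >= |b|."""
    s = a + b
    e = b - (s - a)
    return s, e


def split(a):
    """Dekker's split of a double into two halves with hi + lo == a."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    lo = a - hi
    return hi, lo


def two_prod(a, b):
    """Error-free product (Dekker): p = fl(a * b) and p + e == a * b exactly,
    barring overflow and underflow."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e


def dd_add(x, y):
    """Add two DD numbers given as (hi, lo) pairs (simplified variant)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return quick_two_sum(s, e)


def dd_mul(x, y):
    """Multiply two DD numbers given as (hi, lo) pairs."""
    p, e = two_prod(x[0], y[0])
    e += x[0] * y[1] + x[1] * y[0]
    return quick_two_sum(p, e)


def dd_axpy(alpha, x, y):
    """AXPY in DD arithmetic: returns alpha * x + y element by element."""
    return [dd_add(dd_mul(alpha, xi), yi) for xi, yi in zip(x, y)]


def dd_exact(x):
    """Exact rational value of a DD number, for checking results."""
    return Fraction(x[0]) + Fraction(x[1])
```

For instance, `dd_axpy((1.0, 0.0), [(1e-20, 0.0)], [(1.0, 0.0)])` returns `[(1.0, 1e-20)]`: the low word keeps the small term that plain double arithmetic loses, since `1.0 + 1e-20 == 1.0` in double precision. Each DD operation expands into many double precision operations while the data per element only doubles in size, which is consistent with the abstract's observation that the higher-precision routines are less constrained by data transfer.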

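Bibliographic information

Proceedings of the Symposium on High Performance Computing and Computational Science (ハイパフォーマンスコンピューティングと計算科学シンポジウム論文集)

Vol. 2011, pp. 148-156, issued 2011-01-11

Publisher

Information Processing Society of Japan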
