Item type |
SIG Technical Reports(1) |
Publication date |
2022-11-24 |
Title |
|
|
Title |
Breaking the Memory Bottleneck for Iterative Memory-bound Applications Via Persistent Kernels |
Language |
en |
Language |
|
|
Language |
eng |
Keywords |
|
|
Subject scheme |
Other |
|
Subject |
High-performance computing |
Resource type |
|
|
Resource type identifier |
http://purl.org/coar/resource_type/c_18gh |
|
Resource type |
technical report |
Author affiliation |
|
|
|
Tokyo Institute of Technology / National Institute of Advanced Industrial Science and Technology |
Author affiliation |
|
|
|
RIKEN Center for Computational Science |
Author affiliation |
|
|
|
National Institute of Advanced Industrial Science and Technology |
Author affiliation |
|
|
|
Shenzhen Institutes of Advanced Technology |
Author affiliation |
|
|
|
Oak Ridge National Laboratory |
Author affiliation |
|
|
|
Tokyo Institute of Technology |
Author affiliation |
|
|
|
RIKEN Center for Computational Science / Tokyo Institute of Technology |
Authors |
Lingqi, Zhang
Mohamed, Wahib
Peng, Chen
Jintao, Meng
Xiao, Wang
Toshio, Endo
Satoshi, Matsuoka
|
Abstract |
|
|
Description type |
Other |
|
Description |
Iterative memory-bound solvers are common in HPC codes. Spatial blocking optimizations for iterative solvers aim to improve the data locality of the code executed within a single time step of the solver. Temporal blocking optimizations combine multiple consecutive iterations in a scheme that requires resolving neighborhood dependencies. We propose a novel data-locality optimization scheme for memory-bound iterative kernels: PERsistent KernelS (PERKS). This scheme targets the data movement that occurs between time steps: we eliminate or reduce traffic to memory by caching a subset of each time step's output in on-chip resources, where it serves as input for the following time step. PERKS can be generalized to any iterative solver: they are largely independent of the solver's implementation and run independently on top of spatial/temporal blocking optimizations. We implement PERKS in CUDA because NVIDIA GPUs provide low-latency device-wide synchronization and a large volume of on-chip resources, i.e., scratch-pad memory and register files. We explain the design principles of PERKS and demonstrate their effectiveness on a wide range of iterative 2D/3D stencil benchmarks (geomean speedup of 2.35x for 2D stencils and 1.53x for 3D stencils). |
Bibliographic record ID |
|
|
Source identifier type |
NCID |
|
Source identifier |
AN10463942 |
Bibliographic information |
IPSJ SIG Technical Reports: High Performance Computing (HPC)
Vol. 2022-HPC-187,
No. 18,
p. 1-10,
Issue date 2022-11-24
|
ISSN |
|
|
Source identifier type |
ISSN |
|
Source identifier |
2188-8841 |
Notice |
|
|
|
SIG Technical Reports are nonrefereed and hence may later appear in any journals, conferences, symposia, etc. |
Publisher |
|
|
Language |
ja |
|
Publisher |
Information Processing Society of Japan (情報処理学会) |
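
The abstract describes keeping a subset of each time step's output in on-chip resources so that it does not travel through memory between time steps. The following is a minimal sketch of that idea, not the authors' implementation: it is a plain-Python traffic model for a 1D 3-point stencil, not CUDA. The function names, the `cached` parameter, and the cost model (one unit per cell loaded or stored, no halo exchange between thread blocks) are all assumptions made for illustration.

```python
def jacobi_step(cur):
    """Apply one 3-point averaging step; the two boundary cells stay fixed."""
    nxt = list(cur)
    for i in range(1, len(cur) - 1):
        nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
    return nxt


def run_baseline(grid, steps):
    """One kernel launch per time step: all n cells are loaded from and
    stored to (simulated) off-chip memory in every step."""
    cur, traffic = list(grid), 0
    for _ in range(steps):
        cur = jacobi_step(cur)
        traffic += 2 * len(grid)  # n loads + n stores
    return cur, traffic


def run_persistent(grid, steps, cached):
    """A single long-running kernel (hypothetical model): `cached` cells stay
    on-chip across time steps; only the rest go through off-chip memory."""
    n = len(grid)
    if not 0 <= cached <= n:
        raise ValueError("cached must be between 0 and len(grid)")
    cur = list(grid)
    traffic = cached  # cached cells are loaded once, up front
    for _ in range(steps):
        cur = jacobi_step(cur)
        traffic += 2 * (n - cached)  # uncached cells: load + store
        # A real GPU kernel would need a device-wide barrier here.
    traffic += cached  # cached cells are written back once, at the end
    return cur, traffic
```

Under this model, a 16-cell grid run for 10 steps moves 320 cell transfers in the baseline and 104 when 12 cells are kept on-chip, while both variants compute the same result. The model says nothing about the speedups reported in the abstract, which depend on hardware details it leaves out.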