Title: Breaking the Limitation of GPU Memory for Deep Learning Workloads

Authors:
- Haoyu Zhang (Tokyo Institute of Technology, Dept. of Mathematical and Computing Science)
- Mohamed Wahib (AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory)
- Lingqi Zhang (Tokyo Institute of Technology, Dept. of Mathematical and Computing Science)
- Yohei Tsuji (Tokyo Institute of Technology, Dept. of Mathematical and Computing Science)
- Satoshi Matsuoka (RIKEN Center for Computational Science / Tokyo Institute of Technology, Dept. of Mathematical and Computing Science)

Type: Technical Report (IPSJ SIG Technical Report, High Performance Computing (HPC), 2019-HPC-170)
Keyword: Machine Learning
Language: English
Date: 2019-07-17
ISSN: 2188-8841
Identifier: http://id.nii.ac.jp/1001/00198057/
Full text: https://ipsj.ixsq.nii.ac.jp/ej/?action=repository_action_common_download&item_id=198147&item_no=1&attribute_id=1&file_no=1
Copyright (c) 2019 by the Information Processing Society of Japan

Abstract: GPU memory can be insufficient for Deep Learning workloads with respect to the model and dataset sizes. Although model parallelism could help, it requires significant code modification for every case. An alternative, general solution to this problem is to use out-of-core methods. Recent work proposed data-swapping and CUDA Unified Memory (UM) methods to break the limitation of GPU memory capacity. However, there is a lack of detailed analysis, via performance modeling, of the behavior and limitations of those methods. In this paper we analyze the behavior at the level of both a single layer and the whole model, and propose a performance model based on this analysis to study how out-of-core training behaves and hence empower the co-design process for Deep Learning workloads.