ABSTRACT
The training datasets used for deep neural networks (DNNs) keep growing. As a training dataset grows, the time needed to read it becomes a bottleneck. High-performance computing (HPC) clusters often provide high-performance I/O storage devices, such as NVMe SSDs, as local storage on each compute node, and this storage can mitigate the I/O bottleneck. However, such devices provide only temporary storage, so users must copy the training dataset from shared storage (such as Lustre) to local storage before use. For large datasets (more than a few hundred GiB), copying between shared storage and local storage takes a long time. To solve this problem, we propose a method that hides the time spent copying the dataset to local storage by overlapping the copying with the reading of training data. We implemented the proposed method in the machine learning framework Chainer. In our experiments, the read I/O bandwidth of our method was 1.38 to 6.19 times higher than that of reading the dataset directly from Lustre using the standard Chainer class. Moreover, in an evaluation of data-parallel training, our method improved performance by a factor of 1.26 to 1.74 in the same comparison.
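The overlap described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any Chainer API: the class name `StagingReader`, its methods, and the flat file layout are hypothetical, and a plain directory stands in for both Lustre and the node-local SSD. A background thread copies files to local storage while the reader serves each file from local storage if it has already been staged, and from shared storage otherwise.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class StagingReader:
    """Hypothetical reader that overlaps staging with reads (illustration only)."""

    def __init__(self, shared_dir, local_dir, names):
        self.shared_dir = Path(shared_dir)   # stands in for shared storage
        self.local_dir = Path(local_dir)     # stands in for node-local storage
        self.names = list(names)             # assumed: flat file names, no subdirectories
        self.local_dir.mkdir(parents=True, exist_ok=True)
        self._staged = set()
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._stage_all, daemon=True)

    def start(self):
        """Begin copying in the background; reads may start immediately."""
        self._thread.start()

    def wait(self):
        """Block until every file has been staged."""
        self._thread.join()

    def _stage_all(self):
        for name in self.names:
            tmp = self.local_dir / (name + ".part")
            shutil.copyfile(self.shared_dir / name, tmp)
            # Publish only complete files so a reader never sees a partial copy.
            tmp.replace(self.local_dir / name)
            with self._lock:
                self._staged.add(name)

    def is_staged(self, name):
        with self._lock:
            return name in self._staged

    def read(self, name):
        # Prefer the local copy once it exists; otherwise fall back to shared storage.
        base = self.local_dir if self.is_staged(name) else self.shared_dir
        return (base / name).read_bytes()


def demo():
    """Read every file during staging and again after it; contents must match."""
    with tempfile.TemporaryDirectory() as root:
        shared = Path(root) / "shared"
        local = Path(root) / "local"
        shared.mkdir()
        names = [f"sample{i}.bin" for i in range(4)]
        for i, name in enumerate(names):
            (shared / name).write_bytes(bytes([i]) * 16)
        reader = StagingReader(shared, local, names)
        reader.start()
        during = [reader.read(n) for n in names]  # may come from either storage
        reader.wait()
        after = [reader.read(n) for n in names]   # now served from local storage
        return during == after, all(reader.is_staged(n) for n in names)
```

In this sketch the training loop never waits for the full copy to finish. Early reads pay the shared-storage cost, and later reads benefit from local storage as staging progresses.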
Accelerating Machine Learning I/O by Overlapping Data Staging and Mini-batch Generations




