DOI 10.1145/3365109.3368768 · BDCAT Conference Proceedings · Short Paper

Accelerating Machine Learning I/O by Overlapping Data Staging and Mini-batch Generations

ABSTRACT

Training datasets used in deep neural networks (DNNs) keep growing. As a training dataset grows larger, reading it efficiently becomes a problem. A high-performance computing (HPC) cluster provides high-performance I/O storage devices, such as NVMe SSDs, as local storage on each compute node, and this local storage can mitigate the I/O bottleneck. However, such devices provide only temporary storage, so users must copy the training dataset from shared storage (such as Lustre) to local storage. For large datasets (over a few hundred GiB), this copy between shared and local storage takes a long time. To solve this problem, we propose a method that conceals the time spent copying the dataset to local storage by overlapping the copy with the reading of the training data. We implemented the proposed method in the machine learning framework Chainer. Our experiments showed that the read I/O bandwidth of our method was 1.38 to 6.19 times higher than reading the dataset directly from Lustre using the standard Chainer class. Moreover, an evaluation of data-parallel training showed that our method improved performance by a factor of 1.26 to 1.74 in the same comparison.
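The overlap the abstract describes can be illustrated with a minimal sketch: a background thread stages files from shared storage to node-local storage, while the reader consumes each file as soon as its copy completes instead of waiting for the entire dataset. This is not the paper's actual Chainer implementation; the function name and per-file synchronization scheme below are illustrative assumptions.

```python
import os
import shutil
import threading

def stage_and_read(shared_dir, local_dir, filenames):
    """Overlap data staging with reading.

    A background thread copies files from shared_dir (e.g. a Lustre
    mount) to local_dir (e.g. a node-local NVMe SSD). The reader blocks
    only until the file it currently needs has been staged, so copying
    and mini-batch generation proceed concurrently.
    """
    staged = {name: threading.Event() for name in filenames}

    def stager():
        for name in filenames:
            shutil.copy(os.path.join(shared_dir, name),
                        os.path.join(local_dir, name))
            staged[name].set()  # mark this file as available locally

    threading.Thread(target=stager, daemon=True).start()

    data = []
    for name in filenames:   # reading order = mini-batch order
        staged[name].wait()  # wait only for this file, not the whole set
        with open(os.path.join(local_dir, name), "rb") as f:
            data.append(f.read())
    return data
```

In a real training loop the read side would feed a mini-batch iterator rather than return a list, and staging order could be matched to the (shuffled) access order so the reader rarely blocks.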

