Abstract
The current design trend in large-scale machine learning is to use distributed clusters of CPUs and GPUs with MapReduce-style programming. Some have been led to believe that this type of horizontal scaling can reduce or even eliminate the need for traditional algorithm development, careful parallelization, and performance engineering. This paper is a case study showing the contrary: the benefits of algorithms, parallelization, and performance engineering can sometimes be so vast that it is possible to solve "cluster-scale" problems on a single commodity multicore machine.
Connectomics is an emerging area of neurobiology that uses cutting-edge machine learning and image processing to extract brain connectivity graphs from electron microscopy images. It has long been assumed that processing connectomics data will require mass storage and farms of CPUs and GPUs, and will take months (if not years) of processing time. We present a high-throughput connectomics-on-demand system that runs on a multicore machine with fewer than 100 cores and extracts connectomes at the terabyte-per-hour pace of modern electron microscopes.
A Multicore Path to Connectomics-on-Demand
PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming