Abstract
The world has experienced phenomenal growth in data production and storage in recent years, much of it in the form of media files. At the same time, computing power has become abundant with multi-core machines, grids, and clouds. Yet it remains a challenge to harness the available power and move toward gracefully searching and retrieving from web-scale media collections. Several researchers have experimented with automatically distributed computing frameworks, notably Hadoop and Spark, for processing multimedia material, but mostly with small collections on small computing clusters. In this article, we describe a prototype of a (near) web-scale, throughput-oriented multimedia retrieval service built on the Spark framework and running on the AWS cloud service. We present retrieval results using up to 43 billion SIFT feature vectors from the public YFCC 100M collection, making this the largest high-dimensional feature vector collection reported in the literature. We also present a publicly available demonstration retrieval system, running on our own servers, where the implementation of the Spark pipelines can be observed in practice on standard image benchmarks and downloaded for research purposes. Finally, we describe a method for evaluating the retrieval quality of the prototype's ever-growing high-dimensional index without actually indexing a web-scale media collection.
Prototyping a Web-Scale Multimedia Retrieval Service Using Spark