research-article

Minersoft: Software retrieval in grid and cloud computing infrastructures

Published: 05 July 2012
Abstract

One of the main goals of Cloud and Grid infrastructures is to make their services easily accessible and attractive to end-users. In this article we investigate the problem of supporting keyword-based searching for the discovery of software files that are installed on the nodes of large-scale, federated Grid and Cloud computing infrastructures. We address a number of challenges that arise from the unstructured nature of software and the unavailability of software-related metadata in large-scale networked environments. We present Minersoft, a harvester that visits Grid/Cloud infrastructures, crawls their file systems, identifies and classifies software files, and discovers implicit associations between them. The results of Minersoft harvesting are encoded in a weighted, typed graph, called the Software Graph. A number of information retrieval (IR) algorithms are used to enrich this graph with structural and content associations, to annotate software files with keywords, and to build inverted indexes that support keyword-based searching for software. We evaluate our approach on a real testbed, with data extracted from production-quality Grid and Cloud computing infrastructures. Experimental results show that Minersoft is a powerful tool for software search and discovery.
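To make the pipeline in the abstract more concrete, the following is a minimal sketch of a weighted, typed graph of files and an inverted index built from it. It is not Minersoft's implementation: the class and function names, the file paths, the edge type "content", the edge weight, and the scoring rule (a file inherits a neighbour's keywords at the edge weight) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class SoftwareGraph:
    """Hypothetical weighted, typed graph: nodes are files, edges are associations."""

    def __init__(self):
        self.category = {}              # path -> file category (e.g. "binary")
        self.keywords = {}              # path -> set of keywords
        self.edges = defaultdict(list)  # path -> [(neighbour, edge_type, weight)]

    def add_file(self, path, category, text=""):
        self.category[path] = category
        self.keywords[path] = tokenize(path) | tokenize(text)

    def add_edge(self, a, b, edge_type, weight):
        # Associations are stored in both directions.
        self.edges[a].append((b, edge_type, weight))
        self.edges[b].append((a, edge_type, weight))


def build_inverted_index(graph):
    """Map keyword -> {path: score}. Own keywords score 1.0; keywords
    inherited from a neighbour score the edge weight (assumed rule)."""
    index = defaultdict(dict)
    for path, words in graph.keywords.items():
        for word in words:
            index[word][path] = 1.0
    for path, neighbours in graph.edges.items():
        for other, _edge_type, weight in neighbours:
            for word in graph.keywords[other]:
                if index[word].get(path, 0.0) < weight:
                    index[word][path] = weight
    return index


def search(index, query):
    """Sum per-term scores and return (path, score) pairs, best first."""
    scores = defaultdict(float)
    for term in tokenize(query):
        for path, score in index.get(term, {}).items():
            scores[path] += score
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


graph = SoftwareGraph()
graph.add_file("/opt/blast/bin/blastn", "binary")
graph.add_file("/opt/blast/doc/README", "documentation", "sequence alignment tool")
graph.add_edge("/opt/blast/bin/blastn", "/opt/blast/doc/README", "content", 0.5)
index = build_inverted_index(graph)
print(search(index, "alignment"))
```

In this sketch the binary has no text of its own, so it is found for the query "alignment" only through its association with the documentation file. That is the kind of benefit the abstract attributes to enriching the graph before indexing.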

        Reviews

        Xiannong Meng

        Commercial cloud providers such as Amazon and research grids such as Enabling Grids for E-sciencE (EGEE) are playing an increasingly important role in computing, especially in handling big data. However, with hundreds of cloud and grid servers, and files on the order of millions, it is a challenge for users to find an appropriate piece of software in the cloud or grid efficiently and effectively. Researchers are trying to answer this challenge. Minersoft, a search engine for software packages distributed across computing clouds and grids, is one such attempt.

        Minersoft consists of crawlers that collect software-related information from the clouds, indexers that build inverted indices for search, data storage that holds all information and data for search, a query manager that accepts and processes queries and returns the results to the user, and a job manager that coordinates the work among the components. Users can examine, index, and retrieve on request software and related documents in various forms, including binaries, source code, software libraries, and software description documents. One of the interesting concepts used by the authors is that of a software graph, which is similar to a map of a file system starting from a root. Each of the leaves is a file, and each node in the tree (interior or leaf) contains metadata that helps identify or categorize the node (file).

        The authors conducted two types of experiments to evaluate Minersoft. The first examines system performance, measuring the number of files the system can index and the time needed to index them. On a grid, the crawling software, written in Python, reads 100,000 files in five to 30 minutes on average. The indexing software takes 15 to 65 minutes per 100,000 files. On clouds, the rates are of the same order but slower.

        The second type of measurement concerns query/answer correctness. The authors used several measures, including precision at a fixed rank cutoff (precision@k), mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). All three measures show that the system performs very well. The system is very interesting and will be useful to cloud and grid communities. Tools like these can help users locate and identify pieces of software that match their needs.

        Online Computing Reviews Service
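The ranking measures named in the review above can be sketched in a few lines. This is an illustrative computation only, not the authors' evaluation code: the function names are hypothetical, relevance judgments are assumed to be given as lists ordered by rank, and NDCG uses the common log2(rank + 1) discount, which is one of several variants in the literature.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant (k must be positive)."""
    return sum(1 for r in relevance[:k] if r) / k


def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, r in enumerate(relevance, start=1):
        if r:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a non-empty list of per-query judgments."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG divided by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Binary judgments for two queries, listed in ranked order (made-up data).
judgments = [[0, 1, 0, 1], [1, 0, 0, 0]]
print(precision_at_k(judgments[0], 2))
print(mean_reciprocal_rank(judgments))
# Graded judgments for one query (made-up data).
print(ndcg([1, 3, 2]))
```

Precision@k and MRR use binary relevance, while NDCG also accepts graded judgments and rewards placing the most relevant results first.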

        • Published in

          ACM Transactions on Internet Technology, Volume 12, Issue 1
          June 2012
          83 pages
          ISSN:1533-5399
          EISSN:1557-6051
          DOI:10.1145/2220352

          Copyright © 2012 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 5 July 2012
          • Accepted: 1 April 2012
          • Revised: 1 February 2012
          • Received: 1 May 2011

          Qualifiers

          • research-article
          • Research
          • Refereed
