Abstract
One of the main goals of Cloud and Grid infrastructures is to make their services easily accessible and attractive to end-users. In this article, we investigate the problem of supporting keyword-based search for discovering software files installed on the nodes of large-scale, federated Grid and Cloud computing infrastructures. We address a number of challenges that arise from the unstructured nature of software and the unavailability of software-related metadata in large-scale networked environments. We present Minersoft, a harvester that visits Grid/Cloud infrastructures, crawls their file systems, identifies and classifies software files, and discovers implicit associations between them. The results of Minersoft harvesting are encoded in a weighted, typed graph called the Software Graph. A number of information retrieval (IR) algorithms are used to enrich this graph with structural and content associations, to annotate software files with keywords, and to build inverted indexes that support keyword-based searching for software. We evaluate our approach on a real testbed, using data extracted from production-quality Grid and Cloud computing infrastructures. Experimental results show that Minersoft is a powerful tool for software search and discovery.
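To make the pipeline in the abstract more concrete, the following is a minimal sketch of how a weighted, typed graph of files could feed an inverted index for keyword search. It is not Minersoft's implementation: the function names, the edge type "described-by", the file paths, the edge weight, and the scoring rule (a file's own keywords score 1.0, keywords inherited from a neighbour score the edge weight) are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy Software Graph: all names and values below are assumptions.
keywords = {}               # file path -> set of keywords annotating the file
edges = defaultdict(list)   # source path -> [(target path, edge type, weight)]


def add_file(path, words):
    """Register a file together with its keyword annotations."""
    keywords[path] = {w.lower() for w in words}


def add_edge(src, dst, edge_type, weight):
    """Record a typed, weighted association from src to dst."""
    edges[src].append((dst, edge_type, weight))


def build_index():
    """Build an inverted index: keyword -> {file path: score}."""
    index = defaultdict(dict)
    # A file's own keywords get the full score of 1.0.
    for path, words in keywords.items():
        for word in words:
            index[word][path] = 1.0
    # Keywords of a neighbour are inherited, scored by the edge weight.
    for src, outgoing in edges.items():
        for dst, _edge_type, weight in outgoing:
            for word in keywords.get(dst, ()):
                index[word][src] = max(index[word].get(src, 0.0), weight)
    return index


def search(index, query):
    """Rank files by the summed scores of the query terms they match."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for path, score in index.get(term, {}).items():
            scores[path] += score
    # Highest score first; ties broken by path for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


add_file("/opt/blast/bin/blastn", ["blast", "blastn"])
add_file("/opt/blast/doc/README", ["blast", "sequence", "alignment"])
add_edge("/opt/blast/bin/blastn", "/opt/blast/doc/README", "described-by", 0.5)

index = build_index()
results = search(index, "sequence alignment")
```

In this toy example the binary carries neither query term itself, yet it is still returned (ranked below the README) because it inherits them through its association with the documentation file. That is the general idea behind enriching files with keywords from related files, shown here in its simplest form.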
Index Terms
Minersoft: Software retrieval in grid and cloud computing infrastructures