ABSTRACT
This paper evaluates scalable distributed crawling based on a geographical partition of the Web. The approach relies on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. Pages are assigned to crawlers according to the geographical scope of their content. For the initial assignment of a page to a partition, we use a simple heuristic that places the page in the same scope as the geographical location of its hosting web server. During download, if analysis of the page's contents suggests a different geographical scope, the page is forwarded to the appropriately located server. A sample of Portuguese Web pages, extracted during 2005, was used to evaluate (a) page download communication times and (b) the overhead of page exchange among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.
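The assignment scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Crawler` class, the function names, the zone names, and the host-to-zone table are all hypothetical, and the content analysis step is assumed to have already produced a zone label (or `None`). The hash baseline uses MD5 over the host name purely as an example of a conventional hash partition.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """One crawler responsible for a set of geographical zones (hypothetical model)."""
    name: str
    zones: set
    queue: list = field(default_factory=list)


def initial_zone(host, host_zone):
    """Initial heuristic: a page inherits the zone of the server that hosts it."""
    return host_zone[host]


def crawler_for_zone(zone, crawlers):
    """Return the crawler whose partition contains the given zone."""
    for crawler in crawlers:
        if zone in crawler.zones:
            return crawler
    raise KeyError(f"no crawler covers zone {zone!r}")


def route_after_download(current, content_zone, crawlers):
    """Return (target, forwarded) once the page contents have been analysed.

    content_zone is the scope suggested by content analysis, or None if the
    analysis was inconclusive (assumption: the page then stays where it is).
    """
    if content_zone is None or content_zone in current.zones:
        return current, False
    return crawler_for_zone(content_zone, crawlers), True


def hash_partition(host, crawlers):
    """Baseline for comparison: hash partitioning ignores geography."""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return crawlers[int(digest, 16) % len(crawlers)]


# Illustrative data only; zones and host are made up for the example.
crawlers = [
    Crawler("north", {"Porto", "Braga"}),
    Crawler("south", {"Lisboa", "Faro"}),
]
host_zone = {"www.example.pt": "Porto"}

owner = crawler_for_zone(initial_zone("www.example.pt", host_zone), crawlers)
target, forwarded = route_after_download(owner, "Lisboa", crawlers)
```

In this sketch each forwarded page counts as one unit of exchange overhead, which is one of the two quantities the evaluation measures; download communication time would have to be measured separately.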
Index Terms
Geographical partition for distributed web crawling