DOI: 10.1145/1096985.1096999

Geographical partition for distributed web crawling

ABSTRACT

This paper evaluates scalable distributed crawling based on a geographical partition of the Web. The approach relies on multiple distributed crawlers, each responsible for the pages belonging to one or more previously identified geographical zones. The work considers a distributed crawler in which pages are assigned to crawlers according to the geographical scope of their content. For the initial assignment of a page to a partition, we use a simple heuristic that places the page in the same scope as the geographical location of its hosting web server. During download, if analysis of the page's content suggests a different geographical scope, the page is forwarded to the server located in that scope. A sample of Portuguese Web pages, collected during 2005, was used to evaluate (a) page download communication times and (b) the overhead of exchanging pages among servers. The evaluation results allow our approach to be compared with conventional hash partitioning strategies.
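
The assignment scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the host-to-zone table, the keyword-based scope detector, the zone names, and all function names are hypothetical stand-ins chosen for illustration. A real system would presumably use an IP geolocation database and a proper geographic scope classifier. The sketch shows the initial server-location heuristic, the forwarding decision made after content analysis, and a hash-partitioning baseline for comparison.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical lookup tables (assumptions, not from the paper).
HOST_ZONE = {"www.uminho.pt": "braga", "www.ul.pt": "lisboa"}
ZONE_KEYWORDS = {"braga": ("braga", "minho"), "lisboa": ("lisboa", "lisbon")}
DEFAULT_ZONE = "lisboa"  # fallback when the host location is unknown


def initial_zone(url):
    """Initial heuristic: a page takes the zone of its hosting web server."""
    host = urlparse(url).hostname or ""
    return HOST_ZONE.get(host, DEFAULT_ZONE)


def content_zone(text):
    """Toy content analysis: the zone whose keywords occur most often.

    Returns None when no keyword of any zone appears in the text.
    """
    lowered = text.lower()
    counts = {
        zone: sum(lowered.count(word) for word in words)
        for zone, words in ZONE_KEYWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def final_zone(url, text):
    """Return (zone, forwarded) for a downloaded page.

    The page is forwarded only when content analysis suggests a scope
    different from the one given by the server-location heuristic.
    """
    start = initial_zone(url)
    scope = content_zone(text)
    if scope is not None and scope != start:
        return scope, True
    return start, False


def hash_partition(url, n_crawlers):
    """Baseline: conventional hash partitioning on the host name."""
    host = urlparse(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_crawlers
```

In this sketch, a page hosted on a server mapped to one zone but whose text mentions only another zone is reported as forwarded, which corresponds to the page exchange overhead the paper measures. The hash baseline ignores geography and depends only on the host name.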
