Do not crawl in the DUST: Different URLs with similar text

Published: 17 January 2009

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
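To make the idea of mining rules from a URL list alone more concrete, here is a minimal Python sketch. It is not the DustBuster algorithm from the paper: the function name `rule_support`, the scoring scheme, and the example URLs are all assumptions made for illustration. It scores one candidate substring substitution by counting how many logged URLs map to another logged URL when the rule is applied, without fetching any page.

```python
def rule_support(urls, alpha, beta):
    """Count logged URLs that turn into a *different* logged URL when one
    occurrence of `alpha` is replaced by `beta`.

    Hypothetical helper for illustration only; it looks at URL strings
    and never at page contents.
    """
    if not alpha:
        return 0  # an empty pattern carries no information
    url_set = set(urls)
    support = 0
    for url in url_set:
        start = url.find(alpha)
        while start != -1:
            candidate = url[:start] + beta + url[start + len(alpha):]
            if candidate != url and candidate in url_set:
                support += 1
                break  # count each URL at most once
            start = url.find(alpha, start + 1)
    return support


# Made-up log entries (assumption, not data from the paper).
log_urls = [
    "http://example.com/story?id=1",
    "http://example.com/story_1",
    "http://example.com/story?id=2",
    "http://example.com/story_2",
    "http://example.com/about",
]

print(rule_support(log_urls, "story?id=", "story_"))
```

A high count only suggests that a rule is worth checking. As the abstract notes, candidate rules would still need to be verified by sampling a few actual pages.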

        Reviews

        Fazli Can

        The identification of different uniform resource locators (URLs) with similar text (DUST) is important: the elimination of such URLs would increase caching, crawling, and indexing efficiency, as well as the search effectiveness of Web search engines. This paper focuses on URLs with similar content. Such Web pages are common, and their brute-force elimination by examining page content is costly. First, the authors point out that some general rules can explain DUST. They present an algorithm for uncovering DUST that uses "previous crawl logs or Web server logs" and is "verified (or refuted) by sampling a small number of actual Web pages." In the algorithm, they "use a URL list to discover two types of DUST rules": substring substitutions and parameter substitutions; the second type is commonly used in URLs for dynamically generated pages. Experiments with four different Web sites show that the discovered rules explain about half of the DUST cases and "can reduce a crawl by up to 26 percent." The paper also includes experiments and proofs on some string operations. Since it has both practical and theoretical flavors, both practitioners and researchers may like to read it.

        Online Computing Reviews Service
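The two rule types named in the review can be illustrated with a short Python sketch. This is not the paper's implementation: the function names, the example rules (`/index.html` to `/`, dropping a `sessionid` parameter), and the URLs are assumptions, and treating a parameter substitution as "remove the parameter or set it to a fixed value" is one plausible reading rather than the authors' definition.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def apply_substring_rule(url, alpha, beta):
    """Substring substitution: replace the first occurrence of alpha with beta."""
    return url.replace(alpha, beta, 1)


def apply_parameter_rule(url, param, default=None):
    """Parameter substitution (assumed form): drop `param` from the query
    string, or pin it to `default` when one is given."""
    parts = urlsplit(url)
    pairs = [(k, v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != param]
    if default is not None:
        pairs.append((param, default))
    return urlunsplit(parts._replace(query=urlencode(pairs)))


def canonicalize(url):
    """Apply two hypothetical, already-verified rules in a fixed order."""
    url = apply_substring_rule(url, "/index.html", "/")
    return apply_parameter_rule(url, "sessionid")


# Made-up crawl frontier (assumption, not data from the paper).
urls = [
    "http://example.com/index.html",
    "http://example.com/",
    "http://example.com/page?id=7&sessionid=abc",
    "http://example.com/page?id=7",
]

distinct = {canonicalize(u) for u in urls}
print(f"{len(urls)} URLs collapse to {len(distinct)} canonical URLs")
```

Collapsing URLs to a canonical form before fetching is one way such rules could shrink a crawl; the rules used here are invented, so the reduction in this toy example says nothing about the figures reported in the paper.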

        • Published in

          ACM Transactions on the Web, Volume 3, Issue 1
          January 2009
          123 pages
          ISSN:1559-1131
          EISSN:1559-114X
          DOI:10.1145/1462148
          Copyright © 2009 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 17 January 2009
          • Accepted: 1 October 2008
          • Revised: 1 July 2008
          • Received: 1 July 2007

          Qualifiers

          • research-article
          • Research
          • Refereed
