Abstract
We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching only a few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
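To make the idea of a URL-transforming rule concrete, here is a minimal sketch. It is not the DustBuster algorithm: it assumes, for illustration only, that a rule is a simple substring substitution, and it counts how many URL pairs in a log are related by that rule, using URLs alone and no page contents. The names `SubstitutionRule` and `rule_support`, and the example URLs, are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class SubstitutionRule(NamedTuple):
    """Hypothetical rule form: replace the first occurrence of `old` with `new`."""
    old: str
    new: str

    def apply(self, url: str) -> Optional[str]:
        # Return the transformed URL, or None when the rule does not apply.
        if self.old not in url:
            return None
        return url.replace(self.old, self.new, 1)


def rule_support(rule: SubstitutionRule,
                 logged_urls: Iterable[str]) -> List[Tuple[str, str]]:
    """Return pairs (u, v) where both URLs appear in the log and v == rule(u).

    Only URL strings are inspected; whether u and v really serve similar
    content would still have to be checked by fetching a sample of pairs.
    """
    seen = set(logged_urls)
    pairs = []
    for url in sorted(seen):
        target = rule.apply(url)
        if target is not None and target != url and target in seen:
            pairs.append((url, target))
    return pairs


# Illustrative log entries (made up for this sketch).
log = [
    "http://example.com/index.html",
    "http://example.com/",
    "http://example.com/a/index.html",
    "http://example.com/a/",
    "http://example.com/b/index.html",
]
rule = SubstitutionRule(old="/index.html", new="/")
candidate_pairs = rule_support(rule, log)
```

A rule supported by many pairs in the log is only a candidate. As the abstract notes, a rule still has to be verified by sampling a few actual pages.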
Index Terms
Do not crawl in the DUST: Different URLs with similar text