DOI: 10.1145/1099554.1099583
ACM Conferences: CIKM Conference Proceedings
Article

Focused crawling for both topical relevance and quality of medical information

Published: 31 October 2005

ABSTRACT

Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful.

To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality.

We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler.

Analysis suggests that quality of content might be improved by post-filtering a very big breadth-first crawl, at the cost of substantially increased network traffic.
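
The priority rule stated in the abstract (overall priority as the product of link relevance and source quality) can be illustrated with a small crawl frontier. This is only a sketch under assumptions: the class name `Frontier`, its methods, and the example scores are hypothetical, and the two scores are taken as precomputed numbers in [0, 1]. How the paper actually estimates relevance from anchor context or quality from the learned query is not reproduced here.

```python
import heapq


class Frontier:
    """Hypothetical crawl frontier ordered by relevance * quality."""

    def __init__(self):
        self._heap = []     # min-heap of (-priority, url)
        self._seen = set()  # URLs already queued, to avoid duplicates

    def add(self, url, link_relevance, source_quality):
        # link_relevance: assumed score for the link (e.g. from anchor context).
        # source_quality: assumed score for the page the link was found on.
        if url in self._seen:
            return
        self._seen.add(url)
        priority = link_relevance * source_quality
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        # Return the queued URL with the highest priority.
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


# Example with made-up scores: a highly relevant link on a low quality
# page ranks below a moderately relevant link on a high quality page.
frontier = Frontier()
frontier.add("http://example.org/a", link_relevance=0.9, source_quality=0.2)
frontier.add("http://example.org/b", link_relevance=0.6, source_quality=0.8)
frontier.add("http://example.org/c", link_relevance=0.5, source_quality=0.5)

order = []
while len(frontier):
    url, _ = frontier.pop()
    order.append(url)
# order is b, c, a
```

Using a product means a link scores highly only when both factors are high, so a page of low estimated quality suppresses all of its outgoing links regardless of their anchor text.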
