ABSTRACT
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be harmful.

To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated the quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality.

We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. Relative to the relevance focused crawler, the quality focused crawler fetched less low quality material and more high quality content.

Analysis suggests that quality of content might be improved by post-filtering a very large breadth-first crawl, at the cost of substantially increased network traffic.
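The priority rule described in the abstract (link relevance multiplied by the quality of the linking page) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `quality_score` and `Frontier`, the simple additive scoring of matched words and adjacent word pairs, and all weights are assumptions made for illustration. The abstract only states that quality was estimated with a learned query of weighted single words and word pairs, without giving the scoring formula.

```python
import heapq
from itertools import count


def quality_score(text, term_weights, pair_weights):
    """Hypothetical quality estimate for a linking page.

    Sums the weights of distinct single words and distinct adjacent
    word pairs found in the text. The real system learned its weights;
    here they are supplied by the caller.
    """
    words = text.lower().split()
    score = sum(term_weights.get(w, 0.0) for w in set(words))
    pairs = set(zip(words, words[1:]))
    score += sum(pair_weights.get(p, 0.0) for p in pairs)
    return score


class Frontier:
    """Crawl frontier ordered by link relevance times source quality."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._tie = count()  # tie-breaker keeps insertion order stable

    def add(self, url, link_relevance, source_quality):
        # Skip URLs that were already queued.
        if url in self._seen:
            return
        self._seen.add(url)
        priority = link_relevance * source_quality
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap, (-priority, next(self._tie), url))

    def pop(self):
        """Return (url, priority) for the highest-priority link."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = Frontier()
    # Relevance and quality values are made up for illustration.
    frontier.add("http://example.org/a", 0.9, 0.2)
    frontier.add("http://example.org/b", 0.5, 0.8)
    frontier.add("http://example.org/c", 0.9, 0.9)
    while len(frontier):
        print(frontier.pop())
```

In this sketch a highly relevant link on a low quality page (`a`) is fetched after a moderately relevant link on a high quality page (`b`), which is the behaviour the product rule is meant to produce.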
Index Terms
Focused crawling for both topical relevance and quality of medical information