ABSTRACT
Search engine logs have great potential for tracking and predicting outbreaks of infectious disease. More precisely, the search volume of certain search terms can be used to predict the infection rate of an infectious disease in near real time. However, accurate and stable outbreak prediction from search engine logs is challenging because the logs are unstable in two ways. First, the search volume of a term may change irregularly in the short term, for example due to environmental factors such as the amount of media or news coverage. Second, the search volume may also change in the long term as the demographics of the search engine's users change. Consequently, a model trained on such logs without accounting for these characteristics will produce serious mispredictions when these changes occur. In this work, we propose a novel feature selection method to overcome this instability problem. In particular, we employ a seasonal-adjustment method that decomposes each time series into three components (seasonal, trend, and irregular) and builds prediction models for each component individually. We also carefully design a feature selection method that selects appropriate search terms for predicting each component. We conducted comprehensive experiments on ten different infectious diseases. The experimental results show that the proposed method outperforms all comparative methods in prediction accuracy for seven of the ten diseases, in both the now-casting and forecasting settings. The proposed method is also more successful at selecting search terms that are semantically related to the target diseases.
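The decomposition idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific seasonal-adjustment procedure or the feature selection criterion, so the code below substitutes a classical additive decomposition (centered moving average for the trend, per-phase means for the seasonal part) and a simple correlation ranking of search terms per component. The function names `decompose`, `pearson`, and `rank_terms`, the `period` parameter, and the dictionary of term volumes are all hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean


def decompose(series, period):
    """Classical additive decomposition: series = trend + seasonal + irregular.

    This is a stand-in for the paper's seasonal-adjustment method (assumption).
    Trend and irregular are None at the edges where the window does not fit.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points share one phase, so halve them.
            total -= 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    # Seasonal part: mean detrended value per phase, centered to average zero.
    phase_means = []
    for p in range(period):
        vals = [series[t] - trend[t] for t in range(p, n, period)
                if trend[t] is not None]
        phase_means.append(mean(vals))
    offset = mean(phase_means)
    seasonal = [phase_means[t % period] - offset for t in range(n)]
    irregular = [None if trend[t] is None
                 else series[t] - trend[t] - seasonal[t] for t in range(n)]
    return trend, seasonal, irregular


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def rank_terms(target, term_volumes, period, component):
    """Rank search terms by |correlation| with one component of the target.

    Hypothetical selection rule (assumption): compare the chosen component of
    each term's search volume with the same component of the infection series.
    """
    idx = {"trend": 0, "seasonal": 1, "irregular": 2}[component]
    target_part = decompose(target, period)[idx]
    scores = {}
    for term, volumes in term_volumes.items():
        term_part = decompose(volumes, period)[idx]
        pairs = [(a, b) for a, b in zip(target_part, term_part)
                 if a is not None and b is not None]
        scores[term] = abs(pearson([a for a, _ in pairs],
                                   [b for _, b in pairs]))
    return sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, calling `rank_terms(infections, volumes_by_term, period, "seasonal")` returns the terms ordered by how closely their seasonal pattern follows that of the infection series, and the same call with `"trend"` or `"irregular"` gives a separate ranking for each component. Each input series needs at least two full periods of data. A separate prediction model would then be fit on the terms selected for each component.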
Seasonal-adjustment Based Feature Selection Method for Predicting Epidemic with Large-scale Search Engine Logs




