ABSTRACT
Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., purchases, events) into templates. Extracting structured data from B2C emails allows users to track important information on various devices.
However, this extraction poses several challenges: response times must be short despite massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information.
In this paper, we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on how to annotate product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structured learning methods, such as Conditional Random Fields (CRFs), can solve this problem well.
To accomplish this task, we propose a hybrid approach that trains a CRF model on labels predicted by binary classifiers (weak learners). Because the accuracy of the weak learners can be low, we apply the Expectation Maximization (EM) algorithm to the CRF to remove label noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over both the weak learners and plain CRFs.
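The idea of treating weak-learner predictions as noisy observations of hidden true labels can be sketched in a few lines. The sketch below is not the EM-CRF model described above: it swaps the CRF for a much simpler generative two-state chain trained with Baum-Welch EM, and it uses the weak labels as the only observations. All names (`forward_backward`, `em_denoise`, `product_posteriors`) and all initial parameter values are assumptions made for illustration.

```python
import math

LABELS = (0, 1)  # 0 = other token, 1 = product-name token


def normalize(row):
    total = sum(row)
    return [v / total for v in row]


def forward_backward(weak, pi, T, C):
    """Scaled forward-backward for one sequence of weak labels.

    pi[y]   : probability that the first true label is y
    T[x][y] : probability of true label y following true label x
    C[y][w] : probability that the weak learner emits w when the truth is y
    Returns per-token posteriors, pairwise posteriors, and the log-likelihood.
    """
    n = len(weak)
    alpha = [[0.0, 0.0] for _ in range(n)]
    scale = [0.0] * n
    for t in range(n):
        for y in LABELS:
            prior = pi[y] if t == 0 else sum(alpha[t - 1][x] * T[x][y] for x in LABELS)
            alpha[t][y] = prior * C[y][weak[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for x in LABELS:
            beta[t][x] = sum(
                T[x][y] * C[y][weak[t + 1]] * beta[t + 1][y] for y in LABELS
            ) / scale[t + 1]
    gamma = [[alpha[t][y] * beta[t][y] for y in LABELS] for t in range(n)]
    xi = [
        [[alpha[t][x] * T[x][y] * C[y][weak[t + 1]] * beta[t + 1][y] / scale[t + 1]
          for y in LABELS] for x in LABELS]
        for t in range(n - 1)
    ]
    return gamma, xi, sum(math.log(s) for s in scale)


def em_denoise(sequences, iters=10):
    """Baum-Welch EM over weak-label sequences; no gold labels are used."""
    # Assumed starting point: labels are "sticky" and weak learners beat chance.
    pi = [0.5, 0.5]
    T = [[0.8, 0.2], [0.2, 0.8]]
    C = [[0.8, 0.2], [0.3, 0.7]]
    history = []  # log-likelihood of the parameters used in each E-step
    for _ in range(iters):
        pi_n = [0.0, 0.0]
        T_n = [[0.0, 0.0], [0.0, 0.0]]
        C_n = [[0.0, 0.0], [0.0, 0.0]]
        total_ll = 0.0
        for weak in sequences:
            gamma, xi, seq_ll = forward_backward(weak, pi, T, C)  # E-step
            total_ll += seq_ll
            for y in LABELS:
                pi_n[y] += gamma[0][y]
            for t, w in enumerate(weak):
                for y in LABELS:
                    C_n[y][w] += gamma[t][y]
            for step in xi:
                for x in LABELS:
                    for y in LABELS:
                        T_n[x][y] += step[x][y]
        history.append(total_ll)
        # M-step: re-estimate parameters from expected counts.
        pi = normalize(pi_n)
        T = [normalize(row) for row in T_n]
        C = [normalize(row) for row in C_n]
    return pi, T, C, history


def product_posteriors(weak, pi, T, C):
    """Posterior probability that each token is a product-name token."""
    gamma, _, _ = forward_backward(weak, pi, T, C)
    return [g[1] for g in gamma]
```

Given a list of weak-label sequences such as `[[0, 0, 1, 1, 1, 0], [0, 1, 0, 0, 1, 1]]`, calling `em_denoise` returns the fitted parameters, and `product_posteriors` then gives a per-token probability that can be thresholded. A discriminative model with token features, as in the approach described above, would replace the emission table `C` with feature-based potentials; that part is not shown here.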
Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails