DOI: 10.1145/2783258.2788580
research-article
Open Access

Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails

Published: 10 August 2015

ABSTRACT

Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g., purchases, events) into templates. Extracting structured data from B2C emails allows users to track important information on various devices.

However, it also poses several challenges: short response times are required over massive data volumes, templates are diverse and complex, and privacy and legal constraints apply. Most notably, email data is legally protected content, which means that no one except the receiver can review the messages or derived information.

In this paper, we first introduce a system that extracts structured information automatically without requiring human review of any personal content. We then focus on annotating product names in the extracted text, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structured learning methods, such as Conditional Random Fields (CRFs), solve this problem well.

To accomplish this task, we propose a hybrid approach that trains a CRF model on labels predicted by binary classifiers (weak learners). Because the accuracy of the weak learners can be low, we apply the Expectation Maximization (EM) algorithm to the CRF to remove label noise and improve accuracy, without the need to label or inspect specific emails. In our experiments, the EM-CRF model significantly improves product name annotation over both the weak learners and plain CRFs.
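The general idea of denoising weak labels with EM over a sequence model can be sketched in a few lines. This is only an illustration, not the paper's EM-CRF: to stay short and dependency-free it swaps the CRF for a small HMM-style binary chain, and it assumes a fixed, symmetric weak-label flip rate. The toy sentences, the flip rate, and all function names (`posteriors`, `m_step`, `em_denoise`) are hypothetical.

```python
from collections import defaultdict

LABELS = (0, 1)  # 1 = token is part of a product name

def posteriors(xs, ws, trans, emit, noise):
    """E-step (forward-backward): P(y_t | tokens, weak labels) plus pairwise posteriors."""
    n = len(xs)
    # Local score = token emission times the assumed weak-label noise channel.
    local = [[emit[k][xs[t]] * noise[k][ws[t]] for k in LABELS] for t in range(n)]
    norm = lambda v: [a / sum(v) for a in v]
    alpha = [norm([0.5 * local[0][k] for k in LABELS])]
    for t in range(1, n):
        alpha.append(norm([local[t][k] * sum(alpha[t - 1][j] * trans[j][k] for j in LABELS)
                           for k in LABELS]))
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = norm([sum(trans[j][k] * local[t + 1][k] * beta[t + 1][k] for k in LABELS)
                        for j in LABELS])
    gamma = [norm([alpha[t][k] * beta[t][k] for k in LABELS]) for t in range(n)]
    xi = []
    for t in range(n - 1):
        raw = [[alpha[t][j] * trans[j][k] * local[t + 1][k] * beta[t + 1][k]
                for k in LABELS] for j in LABELS]
        z = sum(map(sum, raw))
        xi.append([[v / z for v in row] for row in raw])
    return gamma, xi

def m_step(seqs, gammas, xis, vocab, smooth=0.1):
    """M-step: re-estimate transitions and emissions from expected (soft) counts."""
    emit_c = [defaultdict(float), defaultdict(float)]
    trans_c = [[smooth, smooth], [smooth, smooth]]
    for (xs, _), gamma, xi in zip(seqs, gammas, xis):
        for x, g in zip(xs, gamma):
            for k in LABELS:
                emit_c[k][x] += g[k]
        for pair in xi:
            for j in LABELS:
                for k in LABELS:
                    trans_c[j][k] += pair[j][k]
    emit = []
    for k in LABELS:
        total = sum(emit_c[k].values()) + smooth * len(vocab)
        emit.append({x: (emit_c[k][x] + smooth) / total for x in vocab})
    trans = [[c / sum(row) for c in row] for row in trans_c]
    return trans, emit

def em_denoise(seqs, flip=0.2, iters=5):
    """Return P(token is a product-name token) for every token after EM."""
    vocab = sorted({x for xs, _ in seqs for x in xs})
    noise = [[1 - flip, flip], [flip, 1 - flip]]  # noise[y][w] = P(weak w | true y), assumed
    # Start by trusting the weak labels completely (one-hot "posteriors").
    gammas = [[[1.0 - w, float(w)] for w in ws] for _, ws in seqs]
    xis = [[[[float(a == j and b == k) for k in LABELS] for j in LABELS]
            for a, b in zip(ws, ws[1:])] for _, ws in seqs]
    for _ in range(iters):
        trans, emit = m_step(seqs, gammas, xis, vocab)
        results = [posteriors(xs, ws, trans, emit, noise) for xs, ws in seqs]
        gammas = [g for g, _ in results]
        xis = [x for _, x in results]
    return [[g[1] for g in gamma] for gamma in gammas]

# Made-up toy data: (tokens, weak labels). The weak labeler misses "kettle" in the second line.
SEQS = [
    ("your order acme kettle has shipped".split(), [0, 0, 1, 1, 0, 0]),
    ("your order acme kettle is ready".split(),    [0, 0, 1, 0, 0, 0]),
    ("thanks for buying acme kettle".split(),      [0, 0, 0, 1, 1]),
    ("your acme kettle has shipped".split(),       [0, 1, 1, 0, 0]),
    ("acme kettle is ready".split(),               [1, 1, 0, 0]),
    ("your order has shipped".split(),             [0, 0, 0, 0]),
]

if __name__ == "__main__":
    for (tokens, weak), probs in zip(SEQS, em_denoise(SEQS)):
        print([(tok, w, round(p, 2)) for tok, w, p in zip(tokens, weak, probs)])
```

Running the script prints each token with its weak label and its posterior probability of being a product-name token. No example is ever hand-labeled or inspected: the only supervision is the noisy weak labels plus the assumed noise rate, which mirrors the constraint described in the abstract.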


Published in

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN: 9781450336642
DOI: 10.1145/2783258

Copyright © 2015 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
