Abstract
We develop an unsupervised learning framework for extracting popular product attributes from product description pages originating from different E-commerce Web sites. Unlike existing information extraction methods, which do not consider the popularity of product attributes, our proposed framework not only detects popular product features from a collection of customer reviews but also maps these popular features to the related product attributes. One novelty of our framework is that it can bridge the vocabulary gap between the text in product description pages and the text in customer reviews. Technically, we develop a discriminative graphical model based on hidden Conditional Random Fields. As an unsupervised model, our framework can be easily applied to a variety of new domains and Web sites without the need to label training samples. Extensive experiments demonstrate the effectiveness and robustness of our framework.
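For readers unfamiliar with hidden Conditional Random Fields, the general form of such a model can be sketched as follows. This is the standard textbook formulation, not necessarily the exact model of this work; the symbols $x$ (observed text), $y$ (output label), $h$ (hidden variables), $\Psi$ (potential function), and $\theta$ (parameters) are notation assumed here for illustration and do not come from the abstract.

```latex
% Generic hidden CRF: the hidden variables h are summed out,
% so only x and y need to be observed.
P(y \mid x; \theta)
  = \sum_{h} P(y, h \mid x; \theta)
  = \frac{\sum_{h} \exp\bigl(\Psi(y, h, x; \theta)\bigr)}
         {\sum_{y'} \sum_{h} \exp\bigl(\Psi(y', h, x; \theta)\bigr)}
```

Because $h$ is marginalized rather than annotated, a model of this kind can in principle relate text from two sources, such as description pages and reviews, through latent assignments instead of labeled examples.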
Index Terms
Unsupervised Extraction of Popular Product Attributes from E-Commerce Web Sites by Considering Customer Reviews