Creating Hardware Component Knowledge Bases with Training Data Generation and Multi-task Learning

Published: 29 September 2020

Abstract

Hardware component databases are vital resources in designing embedded systems. Since creating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have random data entry errors.

We present a machine learning based approach for creating hardware component databases directly from datasheets. Extracting data directly from datasheets is challenging because: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. Addressing this complexity has traditionally relied on human input, making it costly to scale. Our approach uses a rich data model, weak supervision, data augmentation, and multi-task learning to create these knowledge bases in a matter of days.

We evaluate the approach on datasheets of three types of components and achieve an average quality of 77 F1 points—quality comparable to existing human-curated knowledge bases. We perform application studies that demonstrate the extraction of multiple data modalities including numerical properties and images. We show how different sources of supervision such as heuristics and human labels have distinct advantages that can be utilized together to improve knowledge base quality. Finally, we present a case study to show how this approach changes the way practitioners create hardware component knowledge bases.
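To make the quality metric concrete: an F1 score is the harmonic mean of precision and recall, and "77 F1 points" corresponds to an F1 of 0.77 on a 0 to 1 scale. The sketch below is only an illustration of that metric, not the evaluation code used in the article. The function name `f1_score`, the tuple layout `(part, attribute, value)`, and all entries are hypothetical.

```python
def f1_score(extracted: set, gold: set) -> float:
    """Return the F1 score (0..1) of extracted entries against a gold set."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # share of extractions that are correct
    recall = true_positives / len(gold)          # share of gold entries that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical (part, attribute, value) entries, for illustration only.
gold = {
    ("PART-A", "max_current_mA", 100),
    ("PART-A", "max_voltage_V", 65),
    ("PART-B", "max_current_mA", 200),
    ("PART-B", "max_voltage_V", 45),
}
extracted = {
    ("PART-A", "max_current_mA", 100),  # correct
    ("PART-B", "max_voltage_V", 45),    # correct
    ("PART-B", "max_current_mA", 20),   # wrong value, counts as a false positive
}

score = f1_score(extracted, gold)
print(f"F1 = {score:.3f} ({100 * score:.1f} F1 points)")
```

Here precision is 2/3 and recall is 2/4, so the score is 4/7. An entry with a wrong value hurts both precision (as a spurious extraction) and recall (as a missed gold entry).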



      • Published in

        ACM Transactions on Embedded Computing Systems, Volume 19, Issue 6
        Special Issue on LCETES, Part 2, Learning, Distributed, and Optimizing Compilers
        November 2020
        271 pages
        ISSN:1539-9087
        EISSN:1558-3465
        DOI:10.1145/3427195
        Copyright © 2020 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 29 September 2020
        • Online AM: 7 May 2020
        • Revised: 1 March 2020
        • Accepted: 1 March 2020
        • Received: 1 October 2019

        Qualifiers

        • research-article
        • Research
        • Refereed

      This article is provided by ACM and the author Luke Hsiao through the ACM Author-Izer service.