Abstract
Hardware component databases are vital resources in designing embedded systems. Since creating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have random data entry errors.
We present a machine-learning-based approach for creating hardware component databases directly from datasheets. Extracting data directly from datasheets is challenging because (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. Addressing this complexity has traditionally relied on human input, making it costly to scale. Our approach uses a rich data model, weak supervision, data augmentation, and multi-task learning to create these knowledge bases in a matter of days.
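To make the weak-supervision idea concrete, here is a minimal sketch, not the authors' implementation. A few hypothetical heuristic labeling functions vote on whether a value found in a datasheet table is a valid maximum storage temperature, and an unweighted majority vote turns those votes into a noisy training label. The candidate fields (`row_header`, `unit`, `value`), the function names, the thresholds, and the majority-vote combiner are all assumptions made for illustration; a real system would typically weight the sources rather than count them equally.

```python
from collections import Counter
from typing import Callable, Dict, List

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical candidate: one value pulled from a datasheet table row.
Candidate = Dict[str, object]


def lf_row_mentions_storage(c: Candidate) -> int:
    """Vote positive when the row header mentions storage; otherwise abstain."""
    return POSITIVE if "storage" in str(c["row_header"]).lower() else ABSTAIN


def lf_unit_is_celsius(c: Candidate) -> int:
    """Vote negative when the unit is not degrees Celsius; otherwise abstain."""
    return ABSTAIN if c["unit"] == "C" else NEGATIVE


def lf_plausible_range(c: Candidate) -> int:
    """Vote positive inside an assumed plausible range, negative outside it."""
    return POSITIVE if 100 <= float(c["value"]) <= 200 else NEGATIVE


LABELING_FUNCTIONS: List[Callable[[Candidate], int]] = [
    lf_row_mentions_storage,
    lf_unit_is_celsius,
    lf_plausible_range,
]


def majority_vote(c: Candidate) -> int:
    """Combine the non-abstaining votes; abstain on a tie or when nobody votes."""
    counts = Counter(v for v in (lf(c) for lf in LABELING_FUNCTIONS) if v != ABSTAIN)
    pos, neg = counts[POSITIVE], counts[NEGATIVE]
    if pos == neg:
        return ABSTAIN
    return POSITIVE if pos > neg else NEGATIVE
```

Candidates that end up with a non-abstain label can serve as noisy training data for a downstream classifier, while abstentions are simply left unlabeled.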
We evaluate the approach on datasheets of three types of components and achieve an average quality of 77 F1 points, which is comparable to existing human-curated knowledge bases. We perform application studies that demonstrate the extraction of multiple data modalities, including numerical properties and images. We show how different sources of supervision, such as heuristics and human labels, have distinct advantages that can be used together to improve knowledge base quality. Finally, we present a case study to show how this approach changes the way practitioners create hardware component knowledge bases.
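For readers unfamiliar with the metric, the sketch below shows one common way to compute F1 for a knowledge base: compare the set of extracted entries against a gold-standard set and take the harmonic mean of precision and recall. A score of 77 F1 points corresponds to 0.77 on this scale. The entry layout (part number, attribute, value) and the toy entries are assumptions for illustration, not data or evaluation code from the paper.

```python
from typing import Set, Tuple

# Hypothetical knowledge-base entry: (part number, attribute, value).
Entry = Tuple[str, str, str]


def f1_score(extracted: Set[Entry], gold: Set[Entry]) -> float:
    """Harmonic mean of precision and recall over exact-match entries."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # share of extractions that are correct
    recall = true_positives / len(gold)          # share of gold entries that were found
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up parts: three of four extractions match the gold set.
gold_entries = {
    ("PART-A", "max_storage_temp_C", "150"),
    ("PART-A", "polarity", "NPN"),
    ("PART-B", "max_storage_temp_C", "125"),
    ("PART-B", "polarity", "PNP"),
}
extracted_entries = {
    ("PART-A", "max_storage_temp_C", "150"),
    ("PART-A", "polarity", "NPN"),
    ("PART-B", "max_storage_temp_C", "125"),
    ("PART-B", "polarity", "NPN"),  # wrong value, so it counts against precision
}
score = f1_score(extracted_entries, gold_entries)
```

Averaging such scores over several component types gives a single summary figure of the kind reported above.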
Creating Hardware Component Knowledge Bases with Training Data Generation and Multi-task Learning