ABSTRACT
Hardware component databases are critical resources in designing embedded systems. Since generating these databases requires hundreds of thousands of hours of manual data entry, they are proprietary, limited in the data they provide, and have many random data entry errors.
We present a machine-learning-based approach for automating the generation of component databases directly from datasheets. Extracting data directly from datasheets is challenging for three reasons: (1) the data is relational in nature and relies on non-local context, (2) the documents are filled with technical jargon, and (3) the datasheets are PDFs, a format that decouples visual locality from locality in the document. The proposed approach uses a rich data model and weak supervision to address these challenges.
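To make the idea of weak supervision concrete, the following is a minimal sketch of how several noisy heuristics could vote on whether a candidate extracted from a datasheet table is a voltage rating. The candidate fields (`unit`, `row_header`, `value`), the three heuristics, and the numeric range are hypothetical and chosen only for illustration. Majority voting is used here as a simple stand-in for combining the heuristics; it is not claimed to be the method used in the paper.

```python
from collections import Counter

# Label values: a heuristic may vote TRUE, vote FALSE, or abstain.
ABSTAIN, FALSE, TRUE = -1, 0, 1


def lf_unit_is_volts(candidate):
    """Hypothetical heuristic: a value given in volts is likely a voltage rating."""
    return TRUE if candidate["unit"] == "V" else ABSTAIN


def lf_row_mentions_voltage(candidate):
    """Hypothetical heuristic: the table row header mentions 'voltage'."""
    return TRUE if "voltage" in candidate["row_header"].lower() else ABSTAIN


def lf_value_out_of_range(candidate):
    """Hypothetical heuristic: values outside (0, 1000) are implausible ratings."""
    return FALSE if not 0 < candidate["value"] < 1000 else ABSTAIN


LABELING_FUNCTIONS = [lf_unit_is_volts, lf_row_mentions_voltage, lf_value_out_of_range]


def weak_label(candidate):
    """Combine the heuristic votes by simple majority; ties and no votes abstain."""
    votes = [lf(candidate) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if counts[TRUE] == counts[FALSE]:
        return ABSTAIN
    return TRUE if counts[TRUE] > counts[FALSE] else FALSE


if __name__ == "__main__":
    candidate = {"unit": "V", "row_header": "Collector-Emitter Voltage", "value": 40}
    print(weak_label(candidate))
```

The labels produced this way are noisy, so in a weak-supervision setting they would serve as training data for a downstream classifier rather than as final output.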
We evaluate the approach on datasheets of three classes of hardware components and achieve an average quality of 75 F1 points, which is comparable to existing human-curated knowledge bases. We perform two application studies that demonstrate the extraction of multiple data modalities, such as numerical properties and images. We show that different sources of supervision, such as heuristics and human labels, have distinct advantages that can be combined within a single methodology to automatically generate hardware component knowledge bases.
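For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, and "75 F1 points" corresponds to an F1 score of 0.75. The sketch below computes it for a set of extracted (part, value) pairs against a gold set. The part numbers and values are made up for illustration and are not results from the paper.

```python
def f1_score(extracted, gold):
    """Return the F1 score of extracted entries measured against gold entries."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)  # fraction of extractions that are correct
    recall = true_positives / len(gold)          # fraction of gold entries that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical (part number, maximum voltage) pairs.
    gold = {("PART-A", "40"), ("PART-A", "6"), ("PART-B", "40"), ("PART-B", "60")}
    extracted = {("PART-A", "40"), ("PART-A", "6"), ("PART-B", "40"), ("PART-B", "99")}
    print(f1_score(extracted, gold))
```

In this made-up example, three of the four extractions are correct and three of the four gold entries are found, so precision and recall are both 0.75.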