Abstract
Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors, and most of these detection models require either user-defined rules or ample hand-labeled training examples. In this article, we therefore present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer adopts an encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. The model also uses the Gaussian Error Linear Unit (GELU) activation function together with a Masked Data Model objective to achieve a deeper probabilistic understanding of the data. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection, with a data augmentation module generating additional erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution improves recall by 32.95% on average compared with state-of-the-art techniques while reducing manual effort by up to 48.86%.
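To make the pretraining ingredients mentioned above concrete, the sketch below shows the exact GELU activation (x·Φ(x), with Φ the standard normal CDF) and a hypothetical cell-masking step of the kind a Masked Data Model objective implies; the `mask_tuple` helper, its 15% masking rate, and the `[MASK]` token are assumptions borrowed from BERT-style pretraining, not details given in the abstract.

```python
import math
import random

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

MASK_TOKEN = "[MASK]"

def mask_tuple(cells, mask_prob=0.15, rng=None):
    # Hypothetical masking step: each cell of a tuple is replaced by
    # [MASK] with probability mask_prob; the indices of masked cells
    # become the prediction targets for the Masked Data Model objective.
    rng = rng or random.Random(0)
    masked, targets = [], []
    for i, cell in enumerate(cells):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(i)
        else:
            masked.append(cell)
    return masked, targets
```

A fine-tuned detector would then score each reconstructed cell against its observed value, flagging low-likelihood cells as candidate errors.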
TabReformer: Unsupervised Representation Learning for Erroneous Data Detection