TabReformer: Unsupervised Representation Learning for Erroneous Data Detection

Published: 18 May 2021

Abstract

Error detection is a crucial preliminary phase in any data analytics pipeline. Existing error detection techniques typically target specific types of errors, and most detection models require either user-defined rules or ample hand-labeled training examples. In this article, we therefore present TabReformer, a model that learns bidirectional encoder representations for tabular data. The proposed model consists of two main phases. In the first phase, TabReformer follows an encoder architecture with multiple self-attention layers to model the dependencies between cells and capture tuple-level representations. The model also combines a Gaussian Error Linear Unit (GELU) activation function with a Masked Data Model objective to achieve a deeper probabilistic understanding of the data. In the second phase, the model parameters are fine-tuned for the task of erroneous data detection, with a data augmentation module generating additional erroneous examples to represent the minority class. The experimental evaluation considers a wide range of databases with different types of errors and distributions. The empirical results show that our solution improves recall by 32.95% on average compared with state-of-the-art techniques while reducing manual effort by up to 48.86%.
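Two concrete ingredients named in the abstract are the GELU activation and a BERT-style Masked Data Model objective applied to table cells. The sketch below is not the authors' code; the tanh-based GELU approximation, the `[MASK]` token, and the 15% masking rate are assumptions carried over from the BERT literature, intended only to illustrate what pre-training on masked tuples might look like.

```python
import numpy as np

rng = np.random.default_rng(0)

def gelu(x):
    # Tanh approximation of the Gaussian Error Linear Unit
    # (Hendrycks and Gimpel, 2018).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

MASK = "[MASK]"

def mask_cells(tuple_cells, mask_prob=0.15):
    """Randomly replace cells of a tuple with a mask token and return
    the corrupted tuple plus the indices the encoder must reconstruct
    (a Masked Data Model objective, by analogy with BERT's masked LM)."""
    corrupted, targets = list(tuple_cells), []
    for i in range(len(corrupted)):
        if rng.random() < mask_prob:
            targets.append(i)
            corrupted[i] = MASK
    return corrupted, targets
```

During pre-training, the encoder would be asked to predict the original value of each masked cell from the surrounding cells of the same tuple, which is what forces it to learn the inter-cell dependencies the abstract refers to.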



• Published in

  ACM/IMS Transactions on Data Science, Volume 2, Issue 3 (August 2021), 302 pages
  ISSN: 2691-1922
  DOI: 10.1145/3465442

Copyright © 2020 held by the owner/author(s). Publication rights licensed to ACM.

Publisher

Association for Computing Machinery, New York, NY, United States

          Publication History

          • Published: 18 May 2021
          • Revised: 1 January 2021
          • Accepted: 1 January 2021
          • Received: 1 July 2020

          Qualifiers

          • research-article
          • Refereed