Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data

Published: 08 April 2021

Abstract

A recurring problem in data mining, and one that mining alone can hardly solve, is that a dataset may contain little information that is meaningful for a given requirement. When faced with multiple unknown datasets, a user who wants to allocate data mining resources so as to acquire more of the desired data needs a data quality assessment framework based on the relevance between each dataset and the requirement. Such a framework lets the user judge the potential benefit of each candidate dataset in advance and allocate resources accordingly. However, unstructured data (e.g., image data) often exists as dark data, which makes it difficult for the user to understand in real time how relevant the content of a dataset is. Even if all data have label descriptions, efficiently measuring the relevance between data under semantic propagation remains an urgent problem. We therefore propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which consists of off-line learning and relevance mining parts and an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate an importance score for each node (image), which yields a relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance score by hash code and then quickly obtain the assessment result from the importance list. On the one hand, introducing GCN and co-occurrence probability into GAH improves its ability to perceive dark data. On the other hand, CPR exploits hash collisions to reduce the scale of the graph and of the iteration matrix, which greatly reduces the consumption of space and computing resources.
We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements and to test resource allocation. Experimental results show that our framework acquires the most desired data with the same mining resources. In addition, results on the Tencent1M dataset demonstrate that the framework completes the assessment stably across different requirements.
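
The graph construction, ranking, and on-line lookup steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' CPR algorithm: the abstract does not give the weighting scheme, so the cluster-size weights, the damping factor, the zero score for unseen codes, and all function names (`hamming`, `cluster_pagerank`, `assess`) are assumptions made for illustration. Hash codes are taken as given integers; the GAH model that would produce them is out of scope here.

```python
from collections import defaultdict


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hash codes."""
    return bin(a ^ b).count("1")


def cluster_pagerank(codes, max_dist=1, damping=0.85, iters=100):
    """Rank distinct hash codes; images sharing a code form one node.

    codes: dict mapping image id -> integer hash code (hypothetical input).
    Returns a dict mapping hash code -> importance score (scores sum to 1).
    """
    # Hash collisions shrink the graph: one node per distinct code.
    size = defaultdict(int)
    for code in codes.values():
        size[code] += 1
    nodes = list(size)
    total = sum(size.values())

    # Connect codes whose Hamming distance is within the restriction.
    neighbours = {
        u: [v for v in nodes if v != u and hamming(u, v) <= max_dist]
        for u in nodes
    }

    # Power iteration; teleport and link weights follow cluster size
    # (an assumed weighting, not taken from the paper).
    rank = {u: size[u] / total for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - damping) * size[u] / total for u in nodes}
        for u in nodes:
            out = sum(size[v] for v in neighbours[u])
            if out == 0:
                # Dangling node: spread its mass by cluster size.
                for v in nodes:
                    nxt[v] += damping * rank[u] * size[v] / total
            else:
                for v in neighbours[u]:
                    nxt[v] += damping * rank[u] * size[v] / out
        rank = nxt
    return rank


def assess(query_code, rank):
    """On-line step: look up the score of a query's hash code.

    Returning 0.0 for an unseen code is an assumption of this sketch.
    """
    return rank.get(query_code, 0.0)


# Four toy images; "a" and "b" collide on the same 4-bit code.
codes = {"a": 0b0000, "b": 0b0000, "c": 0b0001, "d": 0b1111}
rank = cluster_pagerank(codes)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```

Because colliding images share a node, the iteration runs over distinct codes rather than over images, which is the source of the space and compute savings the abstract attributes to hash collision. The on-line step is then a single dictionary lookup.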

    • Published in

      ACM/IMS Transactions on Data Science  Volume 2, Issue 2
      May 2021
      149 pages
      ISSN: 2691-1922
      DOI: 10.1145/3454114

      Copyright © 2021 Association for Computing Machinery.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 April 2021
      • Revised: 1 August 2020
      • Accepted: 1 August 2020
      • Received: 1 March 2020

      Qualifiers

      • research-article
      • Refereed