Abstract
Data mining routinely faces a problem it can hardly solve on its own: a dataset may contain little information that is meaningful for a given requirement. When faced with multiple unknown datasets, allocating data mining resources so as to acquire more of the desired data requires a data quality assessment framework based on the relevance between each dataset and the requirements. Such a framework helps the user judge the potential benefits in advance and thus optimize the allocation of resources among the candidate datasets. However, unstructured data (e.g., image data) often exists as dark data, which makes it difficult for the user to understand, in real time, how relevant a dataset's content is. Even if all data have label descriptions, how to efficiently measure the relevance between data under semantic propagation remains an urgent problem. To address these issues, we propose a Deep Hash-based Relevance-aware Data Quality Assessment framework, which consists of an off-line learning and relevance mining part and an on-line assessing part. In the off-line part, we first design a Graph Convolution Network (GCN)-AutoEncoder hash (GAH) algorithm to recognize the data (i.e., lighten the dark data), then construct a graph with restricted Hamming distance, and finally design a Cluster PageRank (CPR) algorithm to calculate an importance score for each node (image), thereby obtaining a relevance representation based on semantic propagation. In the on-line part, we first retrieve the importance scores by hash code and then quickly reach an assessment conclusion from the importance list. On the one hand, introducing GCN and co-occurrence probability into GAH improves its ability to perceive dark data. On the other hand, CPR exploits hash collisions to reduce the scale of the graph and the iteration matrix, which greatly decreases the consumption of space and computing resources.
We conduct extensive experiments on both single-label and multi-label datasets to assess the relevance between data and requirements and to test resource allocation. Experimental results show that our framework can acquire the most desired data with the same mining resources. In addition, test results on the Tencent1M dataset demonstrate that the framework completes the assessment stably across different given requirements.
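The graph construction, ranking, and on-line lookup steps described above can be illustrated with a minimal Python sketch. It is not the paper's GAH or CPR implementation: the hash codes are assumed to be already computed and stored as integers, the names `build_graph`, `pagerank`, and `lookup` are hypothetical, and weighting the teleport step by cluster size is an assumption made only for illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def build_graph(codes, max_dist):
    """Merge identical codes into one node, then link nodes within max_dist bits."""
    sizes = Counter(codes)  # hash collisions: code -> number of images sharing it
    nodes = sorted(sizes)
    neighbors = {u: set() for u in nodes}
    for u, v in combinations(nodes, 2):
        if hamming(u, v) <= max_dist:
            neighbors[u].add(v)
            neighbors[v].add(u)
    return sizes, neighbors


def pagerank(sizes, neighbors, damping=0.85, iters=100):
    """Power-iteration PageRank over the merged (collision-reduced) graph.

    Assumption: teleport probability is proportional to cluster size, so a
    code shared by many images starts with more weight.
    """
    total = sum(sizes.values())
    teleport = {u: sizes[u] / total for u in sizes}
    rank = dict(teleport)
    for _ in range(iters):
        # Rank held by isolated nodes is redistributed via the teleport vector.
        dangling = sum(rank[u] for u in sizes if not neighbors[u])
        rank = {
            u: (1 - damping) * teleport[u]
            + damping * (
                sum(rank[v] / len(neighbors[v]) for v in neighbors[u])
                + dangling * teleport[u]
            )
            for u in sizes
        }
    return rank


def lookup(code, rank):
    """On-line step: fetch the importance score of a hash code (0.0 if unseen)."""
    return rank.get(code, 0.0)


# Five images, two of which collide on the same 4-bit code.
codes = [0b0000, 0b0000, 0b0001, 0b0011, 0b1111]
sizes, neighbors = build_graph(codes, max_dist=1)
rank = pagerank(sizes, neighbors)
for code in sorted(rank, key=rank.get, reverse=True):
    print(f"{code:04b}  images={sizes[code]}  score={rank[code]:.4f}")
```

Because colliding images share a node, the iteration runs over the number of distinct codes and not over the number of images, which is the source of the space and compute savings the abstract attributes to CPR.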
Index Terms
Deep Hash-based Relevance-aware Data Quality Assessment for Image Dark Data