Welcome to ACM MULTIMEDIA 2010
Alberto del Bimbo, Shih-Fu Chang, Arnold Smeulders
Pages: i
DOI: 10.1145/1873951.1913787
Full text: Mp4

SIGMM Award Presentation: Life - Experiences (Events) + Vision
Ramesh Jain
Pages: ii
DOI: 10.1145/1873951.1913788
Full text: Mp4

SIGMM Award Presentation: Geometry-aware analysis of high-dimensional visual information sets
Effrosyni Kokiopoulou
Pages: iii
DOI: 10.1145/1873951.1913789
Full text: Mp4

SESSION: Plenary -- P1
Shih-Fu Chang, Alberto del Bimbo

Using the web to do social science
Duncan Watts
Pages: 1-2
DOI: 10.1145/1873951.1873953
Full text: PDF
Other formats: Mp4

Social science is often concerned with the emergence of collective behavior out of the interactions of large numbers of individuals, but in this regard it has long suffered from a severe measurement problem - namely that individual-level behavior and interactions are hard to observe, especially at scale and over time. In this talk, I will argue that the technological revolution of the Internet is beginning to lift this constraint. To illustrate, I will describe several examples of internet-based research that would have been impractical to perform until recently, and that shed light on some longstanding sociological questions. Although internet-based research still faces serious methodological and procedural obstacles, I propose that the ability to study truly "social" dynamics at individual-level resolution will have dramatic consequences for social science.

Visual crowd surveillance is like hydrodynamics
Mubarak Shah
Pages: 3-4
DOI: 10.1145/1873951.1873954
Full text: PDF
Other formats: Mp4

Video Surveillance and Monitoring is a very active area of research in Computer Vision. However, most of the current approaches assume that the observed scene is not crowded, and that reliable tracks of objects are available over longer durations. Therefore, these approaches are not extendable to more challenging surveillance videos of crowded environments like markets, subways, religious festivals, parades, concerts, and football matches, where tracking of individual objects is very hard, if not impossible. We have proposed a framework which views the flow of a high-density crowd like the flow of a liquid, prompting the use of ideas and techniques often found in the study of hydrodynamics. Therefore, we treat interactions of people in the scene like moving particles in a liquid on three different length scales (macroscopic, mesoscopic, and microscopic), each scale corresponding to one of the three problems: tracking individuals, detection of abnormal behaviors, and segmentation of crowd motion.

SESSION: Full - F1/content track/automatic image tagging
Jiebo Luo

Leveraging loosely-tagged images and inter-object correlations for tag recommendation
Yi Shen, Jianping Fan
Pages: 5-14
DOI: 10.1145/1873951.1873956
Full text: PDF

Large-scale loosely-tagged images (i.e., multiple object tags are given loosely at the image level) are available on the Internet, and it is very attractive to leverage such loosely-tagged images for automatic image annotation applications. In this paper, a multi-task structured SVM algorithm is developed to leverage both the inter-object correlations and the loosely-tagged images for achieving more effective training of a large number of inter-related object classifiers. To leverage the loosely-tagged images for object classifier training, each loosely-tagged image is partitioned into a set of image instances (image regions) and a multiple instance learning algorithm is developed for instance label identification by automatically identifying the correspondences between multiple tags (given at the image level) and the image instances. An object correlation network is constructed for characterizing the inter-object correlations explicitly and identifying the inter-related learning tasks automatically. To enhance the discrimination power of a large number of inter-related object classifiers, the multi-task structured SVM algorithm models the inter-task relatedness more precisely and leverages the inter-object correlations for classifier training. Our experiments on a large number of inter-related object classes have provided very positive results.

Multi-label boosting for image annotation by structural grouping sparsity
Fei Wu, Yahong Han, Qi Tian, Yueting Zhuang
Pages: 15-24
DOI: 10.1145/1873951.1873957
Full text: PDF

We can obtain high-dimensional heterogeneous features from real-world images to describe their various aspects of visual characteristics, such as color, texture and shape. Different kinds of heterogeneous features have different intrinsic discriminative power for image understanding. The selection of groups of discriminative features for certain semantics is hence crucial to make image understanding more interpretable. This paper formulates multi-label image annotation as a regression model with a regularized penalty. We call it Multi-label Boosting by the selection of heterogeneous features with structural Grouping Sparsity (MtBGS). MtBGS induces a (structural) sparse selection model to identify subgroups of homogeneous features for predicting a certain label. Moreover, the correlations among multiple tags are utilized in MtBGS to boost the performance of multi-label annotation. Extensive experiments on public image datasets show that the proposed approach has better multi-label image annotation performance and leads to a quite interpretable model for image understanding.

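A group-sparsity penalty of the kind MtBGS builds on can be written generically as follows (illustrative notation, not the paper's exact formulation): with y the label indicator for one tag, X the stacked heterogeneous features, and w_g the coefficient sub-vector for feature group g of size d_g, the group-wise L2 penalty drives entire groups to zero, so whole feature types are selected or discarded per label.

```latex
\min_{\mathbf{w}} \; \big\|\mathbf{y} - \mathbf{X}\mathbf{w}\big\|_2^2
\;+\; \lambda \sum_{g=1}^{G} \sqrt{d_g}\,\big\|\mathbf{w}_g\big\|_2
```
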
Unified tag analysis with multi-edge graph
Dong Liu, Shuicheng Yan, Yong Rui, Hong-Jiang Zhang
Pages: 25-34
DOI: 10.1145/1873951.1873958
Full text: PDF

Image tags have become a key intermediate vehicle to organize, index and search the massive online image repositories. Extensive research has been conducted on different yet related tag analysis tasks, e.g., tag refinement, tag-to-region assignment, and automatic tagging. In this paper, we propose a new concept of multi-edge graph, through which a unified solution is derived for the different tag analysis tasks. Specifically, each vertex of the graph is first characterized by a unique image. Then each image is encoded as a region bag with multiple image segmentations, and the thresholding of the pairwise similarities between regions naturally constructs the multiple edges between each vertex pair. Unified tag analysis is then generally described as the tag propagation between a vertex and its edges, as well as between all edges across the entire image repository. We develop a core vertex-vs-edge tag equation unique to the multi-edge graph to unify the image/vertex tag(s) and region-pair/edge tag(s). Finally, unified tag analysis is formulated as a constrained optimization problem, where the objective function characterizing the cross-patch tag consistency is constrained by the core equations for all vertex pairs, and the cutting plane method is used for efficient optimization. Extensive experiments on various tag analysis tasks over three widely used benchmark datasets validate the effectiveness of our proposed unified solution.

Efficient large-scale image annotation by probabilistic collaborative multi-label propagation
Xiangyu Chen, Yadong Mu, Shuicheng Yan, Tat-Seng Chua
Pages: 35-44
DOI: 10.1145/1873951.1873959
Full text: PDF

Annotating a large-scale image corpus requires a huge amount of human effort and is thus generally unaffordable, which directly motivates the recent development of semi-supervised or active annotation methods. In this paper we revisit this notoriously challenging problem and develop a novel multi-label propagation scheme, whereby both the efficacy and accuracy of large-scale image annotation are further enhanced. Our investigation starts from a survey of previous graph-propagation-based annotation approaches, wherein we analyze their main drawbacks when scaling up to large-scale datasets and handling the multi-label setting. Our proposed scheme outperforms the state-of-the-art algorithms by making the following contributions. 1) Unlike previous approaches that propagate each label independently, our proposed large-scale multi-label propagation (LSMP) scheme encodes the tag information of an image as a unit label confidence vector, which naturally imposes inter-label constraints and manipulates labels interactively. It then utilizes the probabilistic Kullback-Leibler divergence to formulate multi-label propagation. 2) We perform the multi-label propagation on a so-called hashing-based L1-graph, which is efficiently derived with a Locality Sensitive Hashing approach followed by sparse L1-graph construction within the individual hashing buckets. 3) An efficient, provably convergent iterative procedure is presented for the optimization problem. Extensive experiments on the NUS-WIDE dataset (both the lite version with 56k images and the full version with 270k images) validate the effectiveness and scalability of the proposed approach.

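As a rough illustration of KL-divergence-based multi-label propagation (a simplified sketch, not the authors' exact LSMP formulation), one iterative scheme keeps a label-confidence distribution per image and repeatedly smooths it over graph neighbors while staying close to the initial tags:

```python
import numpy as np

def propagate_labels(W, Y0, mu=1.0, iters=50, eps=1e-12):
    """Toy multi-label propagation with KL-style smoothing.

    W  : (n, n) nonnegative affinity matrix (e.g., from an L1-graph).
    Y0 : (n, k) initial label-confidence vectors, rows sum to 1.
    mu : weight tying each node to its initial confidences.
    """
    P = Y0.copy()
    for _ in range(iters):
        # Geometric-mean update: each row mixes the logs of its
        # neighbors' distributions with its own initial distribution,
        # the fixed point of a weighted sum of KL divergence terms.
        num = W @ np.log(P + eps) + mu * np.log(Y0 + eps)
        den = W.sum(axis=1, keepdims=True) + mu
        P = np.exp(num / den)
        P /= P.sum(axis=1, keepdims=True)  # renormalize rows
    return P
```
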
SESSION: Full - F6/applications/human-centered multimedia track/user-adapted media access
Ralf Steinmetz

Sketch-based 3D model retrieval using diffusion tensor fields of suggestive contours
Sang Min Yoon, Maximilian Scherer, Tobias Schreck, Arjan Kuijper
Pages: 193-200
DOI: 10.1145/1873951.1873961
Full text: PDF

The number of available 3D models in various areas increases steadily. Effective methods to search for those 3D models by content, rather than textual annotations, are crucial. For this purpose, we propose a new approach for content-based 3D model retrieval by hand-drawn sketch images. This approach to retrieve visually similar mesh models from a large database consists of three major steps: (1) suggestive contour renderings from different viewpoints to compare against the user-drawn sketches; (2) descriptor computation by analyzing diffusion tensor fields of the suggestive contour images or the query sketch, respectively; (3) similarity measurement to retrieve the models and the most probable viewpoint from which a model was sketched. Our proposed sketch-based 3D model retrieval system is very robust against variations of shape, pose, or partial occlusion in the user-drawn sketches. Experimental results are presented and indicate the effectiveness of our approach for sketch-based 3D model retrieval.

Crowdsourced automatic zoom and scroll for video retargeting
Axel Carlier, Vincent Charvillat, Wei Tsang Ooi, Romulus Grigoras, Geraldine Morin
Pages: 201-210
DOI: 10.1145/1873951.1873962
Full text: PDF

Screen size and display resolution limit the experience of watching videos on mobile devices. The viewing experience can be improved by determining important or interesting regions within the video (called regions of interest, or ROIs) and displaying only the ROIs to the viewer. Previous work focuses on analyzing the video content using a visual attention model to infer the ROIs. Such content-based techniques, however, have limitations. In this paper, we propose an alternative paradigm to infer ROIs from a video. We crowdsource from a large number of users through their implicit viewing behavior using a zoom and pan interface, and infer the ROIs from their collective wisdom. A retargeted video, consisting of relevant shots determined from historical user behavior, can be automatically generated and replayed to subsequent users who would prefer a less interactive viewing experience. This paper presents how we collect the user traces, infer the ROIs and their dynamics, group the ROIs into shots, and automatically reframe those shots to improve the aesthetics of the video. A user study with 48 participants shows that our automatically retargeted video is of comparable quality to one handcrafted by an expert user.

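A minimal sketch of the crowdsourcing idea (hypothetical data layout, not the authors' implementation): given per-frame viewport rectangles logged from many users' zoom-and-pan sessions, a simple ROI estimate per frame is the average viewport:

```python
import numpy as np

def infer_roi(viewports):
    """Estimate a per-frame ROI from many users' viewing traces.

    viewports: list of (x, y, w, h) rectangles, one per user, recorded
               for the same frame while zooming/panning.
    Returns the mean rectangle; a real system would cluster instead,
    to handle multiple simultaneous ROIs.
    """
    rects = np.asarray(viewports, dtype=float)  # shape (n_users, 4)
    return rects.mean(axis=0)

# Example: three users roughly agree on one region of the frame.
print(infer_roi([(100, 50, 320, 240), (110, 60, 300, 230), (90, 55, 330, 250)]))
```
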
Personalized photograph ranking and selection system
Che-Hua Yeh, Yuan-Chen Ho, Brian A. Barsky, Ming Ouhyoung
Pages: 211-220
DOI: 10.1145/1873951.1873963
Full text: PDF

In this paper, we propose a novel personalized ranking system for amateur photographs. Although some of the features used in our system are similar to previous work, new features, such as texture, RGB color, portrait (through face detection), and black-and-white, are included for individual preferences. Our goal of automatically ranking photographs is not intended for award-winning professional photographs but for photographs taken by amateurs, especially when individual preference is taken into account. The performance of our system in terms of the precision-recall diagram and binary classification accuracy (93%) is close to the best results to date for both the overall system and individual features. Two personalized ranking user interfaces are provided: one is feature-based and the other is example-based. Although both interfaces are effective in providing personalized preferences, our user study showed that example-based was preferred by twice as many people as feature-based.

SESSION: Full - F3/content track/classification of content elements
Alan Smeaton

Affective image classification using features inspired by psychology and art theory
Jana Machajdik, Allan Hanbury
Pages: 83-92
DOI: 10.1145/1873951.1873965
Full text: PDF

Images can affect people on an emotional level. Since the emotions that arise in the viewer of an image are highly subjective, they are rarely indexed. However, there are situations when it would be helpful if images could be retrieved based on their emotional content. We investigate and develop methods to extract and combine low-level features that represent the emotional content of an image, and use these for image emotion classification. Specifically, we exploit theoretical and empirical concepts from psychology and art theory to extract image features that are specific to the domain of artworks with emotional expression. For testing and training, we use three data sets: the International Affective Picture System (IAPS); a set of artistic photography from a photo sharing site (to investigate whether the conscious use of colors and textures displayed by the artists improves the classification); and a set of peer-rated abstract paintings to investigate the influence of the features and ratings on pictures without contextual content. Improved classification results are obtained on the International Affective Picture System (IAPS), compared to state-of-the-art work.

CO3 for ultra-fast and accurate interactive segmentation
Yibiao Zhao, Song-Chun Zhu, Siwei Luo
Pages: 93-102
DOI: 10.1145/1873951.1873966
Full text: PDF

This paper presents an interactive image segmentation framework which is ultra-fast and accurate. Our framework, termed "CO3", consists of three components: COupled representation, COnditional model and COnvex inference. (i) In representation, we pose the segmentation problem as partitioning an image domain into regions (foreground vs. background) or boundaries (on vs. off), which are dual but simultaneously compete with each other. Then, we formulate the segmentation process as a combinatorial posterior ratio test in both the region and boundary partition spaces. (ii) In modeling, we use discriminative learning methods to train conditional models for both region and boundary based on interactive scribbles. We exploit rich image features at multiple scales, and simultaneously incorporate the user's intention behind the interactive scribbles. (iii) In computing, we relax the energy function into an equivalent continuous form which is convex. Then, we adopt the Bregman iteration method to enforce the "coupling" of region and boundary terms with fast global convergence. In addition, a multigrid technique is further introduced, which is a coarse-to-fine mechanism and guarantees both feature discriminativeness and boundary preciseness by adjusting the size of image features gradually. The proposed interactive system is evaluated on three public datasets: the Berkeley segmentation dataset, the MSRC dataset and the LHI dataset. Compared to five state-of-the-art approaches, including Boykov et al., Bai et al., Grady, Unger et al. and Couprie et al., our system outperforms those established approaches in both accuracy and efficiency by a large margin and achieves state-of-the-art results.

A generic framework for event detection in various video domains
Tianzhu Zhang, Changsheng Xu, Guangyu Zhu, Si Liu, Hanqing Lu
Pages: 103-112
DOI: 10.1145/1873951.1873967
Full text: PDF

Event detection is essential for the extensively studied video analysis and understanding area. Although various approaches have been proposed for event detection, there is a lack of a generic event detection framework that can be applied to various video domains (e.g. sports, news, movies, surveillance). In this paper, we present a generic event detection approach based on semi-supervised learning and Internet vision. Concretely, a Graph-based Semi-Supervised Multiple Instance Learning (GSSMIL) algorithm is proposed to jointly explore small-scale expert-labeled videos and large-scale unlabeled videos to train the event models to detect video event boundaries. The expert-labeled videos are obtained from the analysis and alignment of well-structured video-related text (e.g. movie scripts, web-casting text, closed captions). The unlabeled data are obtained by querying related events from a video search engine (e.g. YouTube) in order to provide more distributive information for event modeling. A critical issue of GSSMIL in constructing a graph is the weight assignment, where the weight of an edge specifies the similarity between two data points. To tackle this problem, we propose a novel Multiple Instance Learning Induced Similarity (MILIS) measure by learning instance-sensitive classifiers. We perform thorough experiments in three popular video domains: movies, sports and news. The results compared with the state of the art are promising and demonstrate that our proposed approach is effective.

Image segmentation with patch-pair density priors
Xiaobai Liu, Jiashi Feng, Shuicheng Yan, Hai Jin
Pages: 113-122
DOI: 10.1145/1873951.1873968
Full text: PDF

In this paper, we investigate how an unlabeled image corpus can facilitate the segmentation of any given image. A simple yet efficient multi-task joint sparse representation model is presented to augment the patch-pair similarities by harnessing the newly discovered patch-pair density priors. First, each image is over-segmented into a set of patches, and the adjacent patch-pair density priors, statistically calculated from the unlabeled image corpus, bring an intuitively explainable and informative observation that kindred patch-pairs generally have higher densities than inhomogeneous patch-pairs. Then for each adjacent patch-pair within the given image, high-density-biased multi-task joint sparse reconstruction is pursued such that 1) both individual patches and the patch-pair can be reconstructed with few patch-pairs from the unlabeled image corpus, and 2) the patch-pairs selected for reconstruction are high-density biased, namely, preferring patch-pairs belonging to the same semantic region. In this way, the overall reconstruction residue well conveys the discriminative information on whether these two patches belong to the same semantic region, and consequently the patch affinity matrix is augmented by reconstruction residues for all adjacent patch-pairs within the given image. The ultimate image segmentation is derived by employing the popular normalized cut approach over the augmented patch affinity matrix. Extensive image segmentation experiments over two public databases clearly demonstrate the superiority of the proposed solution over several state-of-the-art algorithms. Furthermore, the algorithmic practicality is well validated with comparison experiments on content-based image retrieval and multi-label image annotation performed over image segmentation outputs.

SESSION: Full - F4/applications track/applications of geo-tagging
Touradj Ebrahimi

W2Go: a travel guidance system by automatic landmark ranking
Yue Gao, Jinhui Tang, Richang Hong, Qionghai Dai, Tat-Seng Chua, Ramesh Jain
Pages: 123-132
DOI: 10.1145/1873951.1873970
Full text: PDF

In this paper, we present a travel guidance system W2Go (Where to Go), which can automatically recognize and rank landmarks for travellers. In this system, a novel Automatic Landmark Ranking (ALR) method is proposed that utilizes the tag and geo-tag information of photos in Flickr and user knowledge from Yahoo Travel Guide. ALR selects popular tourist attractions (landmarks) based not only on the subjective opinion of travel editors, as is currently done on sites like WikiTravel and Yahoo Travel Guide, but also on the ranking derived from popularity among tourists. Our approach utilizes geo-tag information to locate the positions of the tag-indicated places, and computes the probability of a tag being a landmark/site name. For potential landmarks, impact factors are calculated from the frequency of tags, user numbers in Flickr, and user knowledge in Yahoo Travel Guide. These tags are then ranked based on the impact factors. Several representative views for popular landmarks are generated from the crawled images with geo-tags to describe and present them in the context of information derived from several relevant reference sources. Experimental comparisons with other systems are conducted on eight famous cities around the world. User-based evaluation demonstrates the effectiveness of the proposed ALR method and the W2Go system.

Mining people's trips from large scale geo-tagged photos
Yuki Arase, Xing Xie, Takahiro Hara, Shojiro Nishio
Pages: 133-142
DOI: 10.1145/1873951.1873971
Full text: PDF

Photo sharing is one of the most popular Web services. Photo sharing sites provide functions to add tags and geo-tags to photos to make photo organization easy. Considering that people take photos to record something that attracts them, geo-tagged photos are a rich data source that reflects people's memorable events associated with locations. In this paper, we focus on geo-tagged photos and propose a method to detect people's frequent trip patterns, i.e., typical sequences of visited cities and durations of stay as well as descriptive tags that characterize the trip patterns. Our method first segments photo collections into trips and categorizes them based on their trip themes, such as visiting landmarks or communing with nature. Our method mines frequent trip patterns for each trip theme category. We crawled 5.7 million geo-tagged photos and performed photo trip pattern mining. The experimental result shows that our method outperforms other baseline methods and can correctly segment photo collections into photo trips with an accuracy of 78%. For trip categorization, our method can categorize about 80% of trips using tags and titles of photos and visited cities as features. Finally, we illustrate interesting examples of trip patterns detected from our dataset and show an application with which users can search frequent trip patterns by querying a destination, visit duration, and trip theme.

Photo2Trip: generating travel routes from geo-tagged photos for trip planning
Xin Lu, Changhu Wang, Jiang-Ming Yang, Yanwei Pang, Lei Zhang
Pages: 143-152
DOI: 10.1145/1873951.1873972
Full text: PDF

Travel route planning is an important step for a tourist to prepare his/her trip. As a common scenario, a tourist usually asks the following questions when he/she is planning a trip in an unfamiliar place: 1) Are there any travel route suggestions for a one-day or three-day trip in Beijing? 2) What is the most popular travel path within the Forbidden City? To facilitate a tourist's trip planning, in this paper, we target solving the problem of automatic travel route planning. We propose to leverage existing travel clues recovered from 20 million geo-tagged photos collected from www.panoramio.com to suggest customized travel route plans according to users' preferences. As the footprints of tourists at memorable destinations, the geo-tagged photos can be naturally used to discover the travel paths within a destination (attractions/landmarks) and travel routes between destinations. Based on the information discovered from geo-tagged photos, we can provide a customized trip plan for a tourist, i.e., the popular destinations to visit, the visiting order of destinations, the time arrangement in each destination, and the typical travel path within each destination. Users are also enabled to specify personal preferences such as visiting location, visiting time/season, travel duration, and destination style in an interactive manner to guide the system. Owing to 20 million geo-tagged photos and 200,000 travelogues, an online system has been developed to help users plan travel routes for over 30,000 attractions/landmarks in more than 100 countries and territories. Experimental results show the intelligence and effectiveness of the proposed framework.

Retrieving landmark and non-landmark images from community photo collections
Yannis Avrithis, Yannis Kalantidis, Giorgos Tolias, Evaggelos Spyrou
Pages: 153-162
DOI: 10.1145/1873951.1873973
Full text: PDF

State-of-the-art data mining and image retrieval in community photo collections typically focus on popular subsets, e.g. images containing landmarks or associated to Wikipedia articles. We propose an image clustering scheme that, seen as vector quantization, compresses a large corpus of images by grouping visually consistent ones while providing a guaranteed distortion bound. This allows us, for instance, to represent the visual content of all thousands of images depicting the Parthenon in just a few dozen scene maps and still be able to retrieve any single, isolated, non-landmark image like a house or graffiti on a wall. Starting from a geo-tagged dataset, we first group images geographically and then visually, where each visual cluster is assumed to depict different views of the same scene. We align all views to one reference image and construct a 2D scene map by preserving details from all images while discarding repeating visual features. Our indexing, retrieval and spatial matching scheme then operates directly on scene maps. We evaluate the precision of the proposed method on a challenging one-million urban image dataset.

SESSION: Full - F5/content track/learning concepts in images
Rita Cucchiara

S3MKL: scalable semi-supervised multiple kernel learning for image data mining
Shuhui Wang, Shuqiang Jiang, Qingming Huang, Qi Tian
Pages: 163-172
DOI: 10.1145/1873951.1873975
Full text: PDF

For large-scale image data mining, a challenging problem is to design a method that can work efficiently with little ground-truth annotation and a mass of unlabeled or noisy data. As one of the major solutions, semi-supervised learning (SSL) has been deeply investigated and widely used in image classification, ranking and retrieval. However, most SSL approaches are not able to incorporate multiple information sources. Furthermore, no sample selection is done on unlabeled data, leading to unpredictable risk from uncontrolled unlabeled data and a heavy computational burden that is unsuitable for learning on real-world datasets. In this paper, we propose a scalable semi-supervised multiple kernel learning method (S3MKL) to deal with the first problem. Our method imposes group LASSO regularization on the kernel coefficients to avoid over-fitting, and conditional expectation consensus for regularizing the behaviors of different kernels on the unlabeled data. To reduce the risk of using unlabeled data, we also design a hashing system where multiple kernel locality sensitive hashing (MKLSH) structures are constructed with respect to different kernels to identify an "informative" and "compact" unlabeled training subset from a large unlabeled data corpus. Combining S3MKL with MKLSH, the method is suitable for real-world image classification and personalized web image re-ranking with very little user interaction. Comprehensive experiments are conducted to test the performance of our method, and the results show that it is promising for large-scale real-world image classification and retrieval.

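A generic multiple kernel learning objective of the kind the abstract describes (illustrative notation, not the paper's exact formulation) combines base kernels with learned coefficients, penalizes the coefficients with a group-LASSO term, and adds a consensus-style regularizer on unlabeled points:

```latex
K = \sum_{m=1}^{M} \beta_m K_m, \quad \beta_m \ge 0, \qquad
\min_{f,\,\beta} \; \sum_{i \in \mathcal{L}} \ell\big(y_i, f(x_i)\big)
\;+\; \lambda_1\, \Omega_{\text{group}}(\beta)
\;+\; \lambda_2 \sum_{j \in \mathcal{U}} \Psi\big(f, x_j\big)
```

Here the K_m are base kernels (one per feature type or parameter setting), L and U index labeled and unlabeled data, and Ψ stands in for the conditional-expectation-consensus term across kernels.
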
Discriminative codeword selection for image representation
Lijun Zhang, Chun Chen, Jiajun Bu, Zhengguang Chen, Shulong Tan, Xiaofei He
Pages: 173-182
DOI: 10.1145/1873951.1873976
Full text: PDF

Bag of features (BoF) representation has attracted an increasing amount of attention in large-scale image processing systems. BoF representation treats images as loose collections of local invariant descriptors extracted from them. The visual codebook is generally constructed by using an unsupervised algorithm such as K-means to quantize the local descriptors into clusters. Images are then represented by the frequency histograms of the codewords contained in them. To build a compact and discriminative codebook, codeword selection has become an indispensable tool. However, most of the existing codeword selection algorithms are supervised, and the human labeling may be very expensive. In this paper, we consider the problem of unsupervised codeword selection, and propose a novel algorithm called Discriminative Codeword Selection (DCS). Motivated by recent studies on discriminative clustering, the central idea of our proposed algorithm is to select those codewords so that the cluster structure of the image database can be best respected. Specifically, a multi-output linear function is fitted to model the relationship between the data matrix after codeword selection and the indicator matrix. The most discriminative codewords are thus defined as those leading to minimal fitting error. Experiments on image retrieval and clustering have demonstrated the effectiveness of the proposed method.

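In the spirit of the selection criterion sketched in the abstract (my notation, assuming a codeword subset S, a data matrix X_S restricted to those codewords, and a cluster indicator matrix G obtained from discriminative clustering), the most discriminative codewords are those minimizing the regression residual:

```latex
S^{\ast} = \arg\min_{|S| = k} \; \min_{\mathbf{W}} \;
\big\| \mathbf{G} - \mathbf{X}_{S}\mathbf{W} \big\|_F^2
\;+\; \alpha\, \big\| \mathbf{W} \big\|_F^2
```
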
Supervised reranking for web image search
Linjun Yang, Alan Hanjalic
Pages: 183-192
DOI: 10.1145/1873951.1873977
Full text: PDF

Visual search reranking, which aims to improve text-based image search with the help of visual content analysis, has rapidly grown into a hot research topic. The interestingness of the topic stems mainly from the fact that search reranking is an unsupervised process and therefore has the potential to scale better than its main alternative, namely search based on offline-learned semantic concepts. However, the unsupervised nature of the reranking paradigm also makes it suffer from problems, chief among which is the difficulty of optimally determining the role of the visual modality across different application scenarios. Inspired by the success of the "learning-to-rank" idea proposed in the field of information retrieval, we propose in this paper the "learning-to-rerank" paradigm, which derives the reranking function in a supervised fashion from human-labeled training data. Although supervised learning is introduced, our approach does not suffer from scalability issues, since a unified reranking model is learned that can be applied to all queries. In other words, a query-independent reranking model is learned for all queries using query-dependent reranking features. Query-dependent reranking feature extraction is challenging since the textual query and the visual documents have different representations. In this paper, 11 lightweight reranking features are proposed by representing the textual query using visual context and pseudo-relevant images from the initial search result. The experiments performed on two representative Web image datasets demonstrate that the proposed learning-to-rerank algorithm outperforms state-of-the-art unsupervised reranking methods, which makes the learning-to-rerank paradigm a promising alternative for robust and reliable Web-scale image search.

SESSION: Full - F2/systems track/improving media delivery
Shervin Shirmohammadi

Tenor: making coding practical from servers to smartphones
Hassan Shojania, Baochun Li
Pages: 45-54
DOI: 10.1145/1873951.1873979
Full text: PDF

It has been theoretically shown that performing coding in networked systems, including Reed-Solomon codes, fountain codes, and random network coding, has a clear advantage with respect to simplifying the design of protocols. These coding techniques can be deployed on a wide range of networked nodes, from servers in the "cloud" to smartphone devices. However, large-scale real-world deployment of systems using coding is still rare, mainly due to the computational complexity of coding algorithms. This is especially a concern at both extremes: in high-bandwidth servers, where coding may not be able to saturate the uplink bandwidth, and in smartphone devices, where hardware limitations prevail. In this paper, we present Tenor, a comprehensive toolkit that makes coding practical across a wide range of networked nodes, from servers to smartphones. We strive to push the performance of our cross-platform coding toolkit to the limits allowed by off-the-shelf hardware. To show the practicality of the Tenor toolkit in real-world network applications, it has been used to build coded on-demand media streaming systems from a GPU-based server to up to 3000 emulated nodes, and to iPhone devices with actual playback.

Improving online gaming quality using detour paths
Cong Ly, Cheng-Hsin Hsu, Mohamed Hefeeda
Pages: 55-64
DOI: 10.1145/1873951.1873980
Full text: PDF

We study the problem of improving the user-perceived quality of online games in which multiple players form a game session and exchange game-state updates over an overlay network. We propose an Indirect Relay System (IRS) to forward game-state updates over detour paths in order to reduce the round-trip time (RTT) among players. The IRS system efficiently identifies and ranks potential detour paths between any two players, and dynamically selects the most suitable one based on network and client conditions. To the best of our knowledge, this is the first system that directly reduces RTTs among players in online games, while previous works in the literature mitigate the network latency issue by either hiding it from players or preventing players with high RTTs from being in the same game session. We implement the proposed IRS system and deploy it on 500 PlanetLab nodes. The results from real experiments show that the IRS system improves online gaming quality in several respects, while incurring negligible network and processing overheads. We also deploy the IRS system on a number of residential computers with DSL and cable modem access links, successfully finding several detour paths among them. To evaluate the IRS system over wider ranges of system parameters, we conduct extensive trace-driven simulations using a large number of real game client IPs. The experimental and simulation results show that the proposed IRS system: (i) significantly reduces RTTs among players, (ii) increases the number of peers a player can connect to while maintaining good gaming quality, (iii) imposes negligible network and processing overheads, and (iv) improves gaming quality and player performance.

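The core detour idea is easy to illustrate (a toy sketch with a hypothetical RTT matrix, not the IRS ranking algorithm itself): a relay r is useful between players a and b whenever the two-hop latency beats the direct path, which happens when the underlying routing violates the triangle inequality:

```python
def best_detour(rtt, a, b):
    """Pick the relay minimizing two-hop RTT between players a and b.

    rtt: dict-of-dicts of measured round-trip times in milliseconds.
    Returns (relay, detour_rtt), or (None, direct_rtt) if no relay helps.
    """
    direct = rtt[a][b]
    best, best_rtt = None, direct
    for r in rtt:
        if r in (a, b):
            continue
        two_hop = rtt[a][r] + rtt[r][b]
        if two_hop < best_rtt:
            best, best_rtt = r, two_hop
    return best, best_rtt

# Hypothetical measurements where p1 -> p3 -> p2 is faster than p1 -> p2.
rtt = {"p1": {"p2": 180, "p3": 40}, "p3": {"p2": 60, "p1": 40}}
print(best_detour(rtt, "p1", "p2"))  # ('p3', 100)
```
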
Subjective evaluation of scalable video coding for content distribution
Jong-Seok Lee, Francesca De Simone, Naeem Ramzan, Zhijie Zhao, Engin Kurutepe, Thomas Sikora, Jörn Ostermann, Ebroul Izquierdo, Touradj Ebrahimi
Pages: 65-72
DOI: 10.1145/1873951.1873981
Full text: PDF

This paper investigates the influence of the combination of the scalability parameters in scalable video coding (SVC) schemes on the subjective visual quality. We aim at providing guidelines for an adaptation strategy of SVC that can select the optimal scalability options for resource-constrained networks. Extensive subjective tests are conducted using two different scalable video codecs and high-definition content. The results are analyzed with respect to five dimensions, namely, codec, content, spatial resolution, temporal resolution, and frame quality.

Self-diagnostic peer-assisted video streaming through a learning framework
Di Niu, Baochun Li, Shuqiao Zhao
Pages: 73-82
DOI: 10.1145/1873951.1873982
Full text: PDF

Quality control and resource optimization are challenging problems in peer-assisted video streaming systems, due to their large scales and unreliable peer behavior. Such systems are also prone to performance degradation in the event of drastic demand changes, such as flash crowds and large-scale simultaneous peer departures. In this paper, we demonstrate the deficiency of state-of-the-art video streaming systems by analyzing real-world traces from UUSee, a popular commercial P2P media streaming system based in China, during the 2008 Beijing Olympics. We show how simple machine learning techniques combined with periodic collection of statistics can be used for automated monitoring and diagnosis of peer-assisted video streaming systems. With such a framework, it is possible to estimate performance given certain resource usage patterns, making resource utilization more efficient. It also enables the prediction of large-scale performance degradation due to irregular demand patterns. The effectiveness of our proposed framework is validated with extensive trace-driven evaluations.

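A toy version of the diagnosis idea (hypothetical feature names and synthetic data; the paper's models and UUSee statistics are far richer): fit a regression from periodically collected system statistics to a streaming-quality metric, then use it both to estimate performance and to flag likely degradation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-ins for periodically collected statistics:
# columns = [server_bandwidth_usage, peer_upload_ratio, arrival_rate]
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
# Synthetic "streaming quality" that degrades with a high arrival rate.
y = 0.6 * X[:, 0] + 0.5 * X[:, 1] - 0.7 * X[:, 2] + rng.normal(0, 0.05, 200)

model = LinearRegression().fit(X, y)

# Predict quality for a flash-crowd-like snapshot and raise an alarm.
snapshot = np.array([[0.4, 0.3, 0.95]])
pred = model.predict(snapshot)[0]
print(f"predicted quality: {pred:.2f}", "ALERT" if pred < 0.2 else "ok")
```
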
SESSION: Full - F7/applications/content track/multimodal image and video search
Bernard Merialdo

iLike: integrating visual and textual features for vertical search
Yuxin Chen, Nenghai Yu, Bo Luo, Xue-wen Chen
Pages: 221-230
DOI: 10.1145/1873951.1873984
Full text: PDF

Content-based image search on the Internet is a challenging problem, mostly due to the semantic gap between low-level visual features and high-level content, as well as the excessive computation brought by the huge number of images and high-dimensional features. In this paper, we present iLike, a new approach that truly combines textual features from web pages and visual features from image content for better image search in a vertical search engine. We tackle the first problem by trying to capture the meaning of each text term in the visual feature space, and re-weight visual features according to their significance to the query content. Our experimental results in product search for apparel and accessories demonstrate the effectiveness of iLike and its capability of bridging the semantic gap between visual features and abstract concepts.

Feature map hashing: sub-linear indexing of appearance and global geometry
Yannis Avrithis, Giorgos Tolias, Yannis Kalantidis
Pages: 231-240
DOI: 10.1145/1873951.1873985
Full text: PDF

We present a new approach to image indexing and retrieval which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter. We exploit shape parameters of local features to estimate image alignment via a single correspondence. Then, for each feature, we construct a sparse spatial map of all remaining features, encoding their normalized position and appearance, typically vector quantized to a visual word. An image is represented by a collection of such feature maps, and RANSAC-like matching is reduced to a number of set intersections. Because the induced dissimilarity is still not a metric, we extend min-wise independent permutations to collections of sets and derive a similarity measure for feature map collections. We then exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images, ideally linear in the number of relevant ones. We achieve excellent performance on 10^4 images, with a query time on the order of milliseconds.

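Min-wise independent permutations, which the paper extends to collections of sets, estimate set similarity from hash collisions; a basic single-set min-hash sketch (the standard technique, not the paper's extension) looks like this:

```python
import random

def minhash_signature(items, num_perms=64, seed=42):
    """Return a min-hash signature: per salted hash, the minimum value."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perms)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = {"w1", "w2", "w3", "w4"}
B = {"w2", "w3", "w4", "w5"}
# True Jaccard similarity is 3/5 = 0.6; the estimate is close to it.
print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))
```
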
TalkMiner: a lecture webcast search engine
John Adcock, Matthew Cooper, Laurent Denoue, Hamed Pirsiavash, Lawrence A. Rowe
Pages: 241-250
DOI: 10.1145/1873951.1873986
Full text: PDF

The design and implementation of a search engine for lecture webcasts is described. A searchable text index is created, allowing users to locate material within lecture videos found on a variety of websites such as YouTube and Berkeley webcasts. The index is created from words on the presentation slides appearing in the video, along with any associated metadata such as the title and abstract when available. The video is analyzed to identify a set of distinct slide images, to which OCR and lexical processes are applied, which in turn generate a list of indexable terms. Several problems were discovered when trying to identify distinct slides in the video stream. For example, picture-in-picture compositing of a speaker and a presentation slide, switching cameras, and slide builds confuse basic frame-differencing algorithms for extracting keyframe slide images. Algorithms are described that improve slide identification. A prototype system was built to test the algorithms and the utility of the search engine. Users can browse lists of lectures, browse slides in a specific lecture, or play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website will be published in mid 2010 that allows users to experiment with the search engine.

A new approach to cross-modal multimedia retrieval
Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R.G. Lanckriet, Roger Levy, Nuno Vasconcelos
Pages: 251-260
DOI: 10.1145/1873951.1873987
Full text: PDF

The problem of jointly modeling the text and image components of multimedia documents is studied. The text component is represented as a sample from a hidden topic model, learned with latent Dirichlet allocation, and images are represented as bags of visual (SIFT) features. Two hypotheses are investigated: that 1) there is a benefit to explicitly modeling correlations between the two components, and 2) this modeling is more effective in feature spaces with higher levels of abstraction. Correlations between the two components are learned with canonical correlation analysis. Abstraction is achieved by representing text and images at a more general, semantic level. The two hypotheses are studied in the context of the task of cross-modal document retrieval. This includes retrieving the text that most closely matches a query image, or retrieving the images that most closely match a query text. It is shown that accounting for cross-modal correlations and semantic abstraction both improve retrieval accuracy. The cross-modal model is also shown to outperform state-of-the-art image retrieval systems on a unimodal retrieval task.

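The correlation step is standard canonical correlation analysis; a minimal sketch with scikit-learn (synthetic features standing in for the LDA topic vectors and SIFT bag-of-words histograms) projects both modalities into a shared space where cross-modal retrieval becomes a nearest-neighbor lookup:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_docs = 100
# Synthetic stand-ins: 16-d text topic vectors and 32-d visual histograms
# sharing a latent factor, so that CCA has correlation to find.
latent = rng.normal(size=(n_docs, 4))
text = latent @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(n_docs, 16))
imgs = latent @ rng.normal(size=(4, 32)) + 0.1 * rng.normal(size=(n_docs, 32))

cca = CCA(n_components=4).fit(text, imgs)
text_c, imgs_c = cca.transform(text, imgs)  # projections into shared space

# Cross-modal retrieval: for a query image, rank texts by cosine similarity.
def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

scores = normalize(imgs_c[:1]) @ normalize(text_c).T
print("best matching text for image 0:", int(scores.argmax()))  # ideally 0
```
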
SESSION: Full - F8/applications track/assisted authoring of media content
Mohan Kankanhalli

Color and luminance compensation for mobile panorama construction
Yingen Xiong, Kari Pulli
Pages: 261-270
DOI: 10.1145/1873951.1873989
Full text: PDF

This paper addresses the problem of color and luminance compensation for sequences of overlapping images where the source images have very different colors and luminance. We apply the method to panoramic image construction on mobile phones. A simple approach is proposed that minimizes both the color differences of neighboring images and the overall color correction over the whole sequence. We compare several combinations of gamma correction and linear adjustment over different color representations, and select the method with the best results: use YCbCr and apply gamma correction for the luminance component and linear correction for the chrominance components.

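A minimal sketch of the selected scheme (my own formulation from the abstract's description, using overlap-region statistics): match the mean luminance of the source image to the reference with a gamma curve, and the chrominance means with linear gains:

```python
import numpy as np

def compensate_ycbcr(src, ref_overlap, src_overlap):
    """Align a YCbCr image `src` (H, W, 3) to a reference via overlap stats.

    All channels assumed normalized to [0, 1] (Cb/Cr in offset-binary form).
    Gamma matches mean luminance; a linear gain matches each chrominance.
    """
    out = src.astype(float).copy()
    # Gamma for luminance: solve mean_ref = mean_src ** gamma.
    gamma = np.log(ref_overlap[..., 0].mean()) / np.log(src_overlap[..., 0].mean())
    out[..., 0] = np.clip(out[..., 0], 1e-6, 1.0) ** gamma
    # Linear gain per chrominance channel from overlap-region means.
    for c in (1, 2):
        gain = ref_overlap[..., c].mean() / (src_overlap[..., c].mean() + 1e-6)
        out[..., c] = np.clip(out[..., c] * gain, 0.0, 1.0)
    return out
```
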
A framework for photo-quality assessment and enhancement based on visual aesthetics
Subhabrata Bhattacharya, Rahul Sukthankar, Mubarak Shah
Pages: 271-280
DOI: 10.1145/1873951.1873990
Full text: PDF
Other formats: Mp4

We present an interactive application that enables users to improve the visual aesthetics of their digital photographs using spatial recomposition. Unlike earlier work that focuses either on photo quality assessment or interactive tools for photo editing, we enable the user to make informed decisions about improving the composition of a photograph and to implement them in a single framework. Specifically, the user interactively selects a foreground object and the system presents recommendations for where it can be moved in a manner that optimizes a learned aesthetic metric while obeying semantic constraints. For photographic compositions that lack a distinct foreground object, our tool provides the user with cropping or expanding recommendations that improve its aesthetic quality. We learn a support vector regression model for capturing image aesthetics from user data and seek to optimize this metric during recomposition. Rather than prescribing a fully-automated solution, we allow user-guided object segmentation and inpainting to ensure that the final photograph matches the user's criteria. Our approach achieves 86% accuracy in predicting the attractiveness of unrated images, when compared to their respective human rankings. Additionally, 73% of the images recomposited using our tool are ranked more attractive than their original counterparts by human raters.

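The learned aesthetic metric is support vector regression over composition features; a small sketch with scikit-learn (hypothetical feature names such as a rule-of-thirds offset, not the paper's exact feature set) shows how candidate object placements could be scored:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
# Hypothetical composition features per photo:
# [thirds_offset, foreground_area_ratio, edge_clutter]
X = rng.uniform(0, 1, size=(300, 3))
# Synthetic aesthetic ratings: better when the subject sits near a
# rule-of-thirds line (small offset) and background clutter is low.
y = 1.0 - 0.8 * X[:, 0] - 0.4 * X[:, 2] + rng.normal(0, 0.05, 300)

svr = SVR(kernel="rbf", C=1.0).fit(X, y)

# Score two candidate placements of the foreground object; pick the best.
candidates = np.array([[0.05, 0.3, 0.2], [0.45, 0.3, 0.2]])
print(svr.predict(candidates))  # the first placement should score higher
```
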
Automated aesthetic enhancement of videos
Yang Yang Xiang, Mohan S. Kankanhalli
Pages: 281-290
DOI: 10.1145/1873951.1873991
Full text: PDF

In this paper, we present a content-based single-shot video editing scheme. We follow classic long-take directing and editing schemes. The system automatically adjusts the projection velocity of raw video clips to enhance their aesthetic interest. We build a mathematical model for projection rhythm manipulation based on film theories. The system segments interesting sub-shots and ordinary sub-shots within the single video clip. Different sub-shots are projected at different durations to maximize the video's interest. The output video is rendered according to the adjusted projection durations. Within this framework, we transform the screen rhythm and camera motion of a given single video shot. Motion interests of frames are re-distributed through projection duration modification, and certain special projection patterns are introduced to enhance the aesthetic interest of the original video. A user study shows that our scheme is very effective.

Learning to photograph
Bin Cheng, Bingbing Ni, Shuicheng Yan, Qi Tian
Pages: 291-300
DOI: 10.1145/1873951.1873992
Full text: PDF

In this paper, we propose an intelligent photography system, which automatically and professionally generates/recommends user-favorite photo(s) from a wide view or a continuous view sequence. This task is quite challenging given that the evaluation of photo quality is under-determined and usually subjective. Motivated by the recent prevalence of online media, we present a solution y mining the underlying knowledge and experience of the photographers from massively crawled professional photos (about 100,000 images, which are highly ranked by users) of those popular photo sharing websites, e.g. Flickr.com. Generally far contexts are critical in characterizing the composition rules for professional photos, and thus we present a method called omni-range context modeling to learn the patch/object spatial correlation distribution for the concurrent patch/object pair of arbitrary distance. The learned photo omni-range context priors then serve as rules to guide the composition of professional photos. When a wide view is fed into the system, these priors are utilized together with other cues (e.g., placements of faces at different poses, patch number, etc) to form a posterior probability formulation for professional sub-view finding. Moreover, this system can function as intelligent professionalview guider based on real-time view quality assessment and the embedded compass (for recording capture direction). Beyond the salient areas targeted by most existing view recommendation algorithms, the proposed system targets at professional photo composition. Qualitative experiments as well as comprehensive user studies well demonstrate the validity and efficiency of the proposed omnirange context learning method as well as the automatic view finding framework. expand

SESSION: Full - F9/human-centered multimedia/systems track/enriched and extended media presentation
Session chair: Maja Pantic

Ink jet olfactory display enabling instantaneous switches of scents
Sayumi Sugimoto, Daisuke Noguchi, Yuichi Bannai, Kenichi Okada
Pages: 301-310
doi>10.1145/1873951.1873994
Full text: PDF
Trials on transmitting olfactory information together with audio/visual information are currently being conducted in the field of multimedia. However, continuous emission of scents creates problems of olfactory adaptation and scents lingering in the air. To overcome these problems, we developed an ink-jet olfactory display. This display has precise emission control and can provide stable pulsed emission of scents. Humans detect scents when they breathe in and inhale scent molecules from the air; it is therefore important to synchronize the pulsed ejection of scent with inspiration. Using pulsed ejection, we constructed an ejection pattern that enables instantaneous switches of scents. We first measured the response time to sense a shift of scents in order to derive a pulse-ejection pattern suited to switching scents. Then, using the pattern found in the first experiment, we measured the shortest period of scent switching. Finally, we measured the limit of switching scents. As a result, we obtained both a presentation pattern for scent switching and the limit of scent switching when scents were switched at the shortest period. Using this pattern with movies is expected to heighten the sense of realism.

e-Fovea: a multi-resolution approach with steerable focus to large-scale and high-resolution monitoring
Kuan-Wen Chen, Chih-Wei Lin, Mike Y. Chen, Yi-Ping Hung
Pages: 311-320
doi>10.1145/1873951.1873995
Full text: PDF
This paper presents e-Fovea, a system that combines both multi-resolution camera input and multi-resolution steerable projector output to support large-scale and high-resolution visual monitoring. e-Fovea utilizes a design similar to the human eye, which provides peripheral vision with a steerable fovea at higher resolution. e-Fovea is implemented using a steerable telephoto camera and a wide-angle camera. The telephoto image is displayed using a projector with a steerable mirror, and overlaid on the wide-angle image that is displayed using a second projector. We have deployed e-Fovea in two installations to demonstrate its feasibility. We have also conducted two user studies, with a total of 36 participants, to compare e-Fovea to two existing multi-resolution visual monitoring designs. The results show that for visual monitoring tasks, our e-Fovea design with steerable focus is significantly faster than existing approaches and preferred by users.

Impact of zooming and enhancing region of interests for optimizing user experience on mobile sports video
Wei Song, Dian W. Tjondronegoro, Shu-Hsien Wang, Michael J. Docherty
Pages: 321-330
doi>10.1145/1873951.1873996
Full text: PDF
In mobile videos, small viewing size and bitrate limitations often cause unpleasant viewing experiences, which is particularly problematic for fast-moving sports videos. For optimizing the overall user experience of viewing sports videos on mobile phones, this paper explores the benefits of emphasizing the Region of Interest (ROI) by 1) zooming in and 2) enhancing its quality. The main goal is to measure the effectiveness of these two approaches and determine which is more effective. To obtain a more comprehensive understanding of the overall user experience, the study considers users' interest in video content and their acceptance of the perceived video quality, and compares the user experience in sports videos with other content types such as talk shows. The results from a user study with 40 subjects demonstrate that zooming and ROI enhancement are both effective in improving the overall user experience with talk show and mid-shot soccer videos. However, for full-shot scenes in soccer videos, only zooming is effective, while ROI enhancement has a negative effect. Moreover, users' interest in video content directly affects not only the user experience and the acceptance of video quality, but also the effect of content type on the user experience. Finally, the overall user experience is closely related to the degree of acceptance of video quality and the degree of interest in video content. This study is valuable for identifying effective approaches to improve user experience, especially in mobile sports video streaming contexts, where the available bandwidth is usually low or limited. It also provides further understanding of the factors influencing user experience.

Webcams in context: web interfaces to create live 3D environments
Austin D. Abrams, Robert B. Pless
Pages: 331-340
doi>10.1145/1873951.1873997
Full text: PDF
Web services supporting deep integration between video data and geographic information systems (GIS) empower a large user base to build on popular tools such as Google Earth and Google Maps. Here we extend web interfaces designed explicitly for novice users to integrate streaming video with 3D GIS, and work to dramatically simplify the task of retexturing 3D scenes from live imagery. We also derive and implement constraints to use corresponding points to calibrate popular pan-tilt-zoom webcams with respect to GIS applications, so that the calibration is automatically updated as web users adjust the camera zoom and view direction. These contributions are demonstrated in a live web application implemented on the Google Earth Plug-in, within which hundreds of users have already geo-registered streaming cameras in hundreds of scenes to create live, updating textures in 3D scenes.

SESSION: Full - F10/human-centered multimedia track/improved interactivity
Session chair: Jie Yang

Toward more efficient user interfaces for mobile video browsing: an in-depth exploration of the design space
Jochen Huber, Jürgen Steimle, Max Mühlhäuser
Pages: 341-350
doi>10.1145/1873951.1873999
Full text: PDF
Increasingly powerful mobile devices enable users to access and watch videos in mobile settings. While some concepts for mobile video browsing have been presented, the field still lacks a general understanding of the design space and of the characteristics of interaction concepts. In order to improve user interfaces for mobile video browsing, this paper makes three contributions. First, we set up a design space for mobile video browsing and contribute seven novel interface concepts, relying on GUI-based, touch-gesture-based, and physical interaction. Second, we present the results of an in-depth evaluation and comparison of these concepts, based on a descriptive analysis of 18 hours of video observations from a controlled experiment with 44 participants. The results provide insights into common usability errors and misconceptions. Third, we derive implications for the design of mobile video browsers to minimize errors and to increase usability.

Error-resilient perceptual coding for networked haptic interaction
Fernanda Brandi, Julius Kammerl, Eckehard Steinbach
Pages: 351-360
doi>10.1145/1873951.1874000
Full text: PDF
The performance of haptic interaction across communication networks critically depends on the successful reconstruction of the bidirectionally transmitted haptic signals, and hence on the quality of the communication channel. We propose a novel error-resilient data reduction scheme for haptic communication which exploits known limits of human haptic perception. In particular, we show that missing haptic information due to packet loss may strongly impair the user's experience during haptic interaction. We present and compare methods that eliminate the disturbing artifacts resulting from packet loss. Our approach keeps the estimated impact of packet losses below human perception thresholds. A tree of possible cases (packets received or not received) and their respective occurrence probabilities is maintained at the sender side, and the system predicts unacceptable error cases to decide whether extra packets should be sent. We introduce different criteria that can be employed to trigger additional packets. In our experiments, we evaluate both the objective data reduction performance and the subjective system transparency by performing extensive tests using packet loss probability and round trip time as parameters. The proposed scheme shows excellent performance in terms of data reduction while sustaining good subjective ratings for a wide range of packet loss values and round trip times.
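
Schemes in this family build on the perceptual-deadband principle: a haptic sample is transmitted only when it differs from the last transmitted value by more than the human just-noticeable difference. Below is a minimal sketch of that principle alone; the paper's loss-resilient extension with its probability tree is considerably more involved, and k here is just a placeholder Weber fraction.

```python
def deadband_filter(samples, k=0.1):
    """Weber-style perceptual deadband: send a haptic sample only when it
    deviates from the last transmitted value by more than a fraction k of
    that value (approximating the just-noticeable difference)."""
    sent = []
    last = None
    for t, x in enumerate(samples):
        if last is None or abs(x - last) > k * abs(last):
            sent.append((t, x))   # packet goes on the wire
            last = x              # receiver holds this value until the next update
    return sent

print(deadband_filter([1.00, 1.02, 1.05, 1.30, 1.31, 0.80]))
# -> [(0, 1.0), (3, 1.3), (5, 0.8)]
```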

FACT: fine-grained cross-media interaction with documents via a portable hybrid paper-laptop interface
Chunyuan Liao, Hao Tang, Qiong Liu, Patrick Chiu, Francine Chen
Pages: 361-370
doi>10.1145/1873951.1874001
Full text: PDF
FACT is an interactive paper system for fine-grained interaction with documents across the boundary between paper and computers. It consists of a small camera-projector unit, a laptop, and ordinary paper documents. With the camera-projector unit pointing to a paper document, the system allows a user to issue pen gestures on the paper document for selecting fine-grained content and applying various digital functions. For example, the user can choose individual words, symbols, figures, and arbitrary regions for keyword search, copy and paste, web search, and remote sharing. FACT thus enables a computer-like user experience on paper. This paper interaction can be integrated with laptop interaction for cross-media manipulations on multiple documents and views. We present the infrastructure, supporting techniques and interaction design, and demonstrate the feasibility via a quantitative experiment. We also propose applications such as document manipulation, map navigation and remote collaboration.

An immersive system for browsing and visualizing surveillance video
Philip DeCamp, George Shaw, Rony Kubat, Deb Roy
Pages: 371-380
doi>10.1145/1873951.1874002
Full text: PDF
HouseFly is an interactive data browsing and visualization system that synthesizes audio-visual recordings from multiple sensors, as well as the meta-data derived from those recordings, into a unified viewing experience. The system is being applied to study human behavior in both domestic and retail situations grounded in longitudinal video recordings. HouseFly uses an immersive video technique to display multiple streams of high resolution video using a realtime warping procedure that projects the video onto a 3D model of the recorded space. The system interface provides the user with simultaneous control over both playback rate and vantage point, enabling the user to navigate the data spatially and temporally. Beyond applications in video browsing, this system serves as an intuitive platform for visualizing patterns over time in a variety of multi-modal data, including person tracks and speech transcripts.

SESSION: Full - F11/applications/content track/novel aids for music retrieval
Session chair: Gerald Friedland

Combining multi-probe histogram and order-statistics based LSH for scalable audio content retrieval
Yi Yu, Michel Crucianu, Vincent Oria, Ernesto Damiani
Pages: 381-390
doi>10.1145/1873951.1874004
Full text: PDF
In order to improve the reliability and the scalability of content-based retrieval of variant audio tracks from large music databases, we suggest a new multi-stage LSH scheme that consists of (i) extracting compact but accurate representations from audio tracks by exploiting the LSH idea to summarize audio tracks, and (ii) adequately organizing the resulting representations in LSH tables, retaining almost the same accuracy as exact kNN retrieval. In the first stage, we use the major bins of successive chroma features to calculate a multi-probe histogram (MPH) that is concise but retains information about local temporal correlations. In the second stage, based on the order statistics (OS) of the MPH, we propose a new LSH scheme, OS-LSH, to organize and probe the histograms. The representation and organization of the audio tracks are storage efficient and support robust and scalable retrieval. Extensive experiments over a large dataset with 30,000 real audio tracks confirm the effectiveness and efficiency of the proposed scheme.
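
A loose sketch of the first stage as described, assuming 12-bin chroma frames: the major bins of adjacent frames vote into a 12x12 co-occurrence histogram, giving a compact signature that keeps some local temporal ordering. The bin count and normalization details are guesses, not the paper's exact construction.

```python
import numpy as np

def multi_probe_histogram(chroma, n_major=2):
    """Summarize a chroma sequence (frames x 12) into a multi-probe
    histogram: count co-occurrences of the major chroma bins of adjacent
    frames, yielding a compact 144-dimensional signature."""
    mph = np.zeros((12, 12))
    major = np.argsort(chroma, axis=1)[:, -n_major:]   # major bins per frame
    for prev, cur in zip(major[:-1], major[1:]):
        for i in prev:
            for j in cur:
                mph[i, j] += 1
    return (mph / max(mph.sum(), 1)).ravel()           # normalized signature
```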

Music recommendation by unified hypergraph: combining social media information and music content
Jiajun Bu, Shulong Tan, Chun Chen, Can Wang, Hao Wu, Lijun Zhang, Xiaofei He
Pages: 391-400
doi>10.1145/1873951.1874005
Full text: PDF
Other formats: Mp4
Acoustic-based music recommender systems have received increasing interest in recent years. Due to the semantic gap between low-level acoustic features and high-level music concepts, many researchers have explored collaborative filtering techniques in music recommender systems. Traditional collaborative filtering music recommendation methods focus only on user rating information. However, there are various kinds of social media information, including different types of objects and relations among these objects, in music social communities such as Last.fm and Pandora. This information is valuable for music recommendation. However, there are two challenges in exploiting this rich social media information: (a) there are many different types of objects and relations in music social communities, which makes it difficult to develop a unified framework taking all objects and relations into account; (b) in these communities, some relations are much more sophisticated than pairwise relations, and thus cannot simply be modeled by a graph. In this paper, we propose a novel music recommendation algorithm that uses both multiple kinds of social media information and acoustic-based music content. Instead of a graph, we use a hypergraph to model the various objects and relations, and consider music recommendation as a ranking problem on this hypergraph. While an edge of an ordinary graph connects only two objects, a hyperedge represents a set of objects; in this way, a hypergraph can naturally model high-order relations. Experiments on a data set collected from the music social community Last.fm demonstrate the effectiveness of the proposed algorithm.
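
The generic machinery behind ranking on a hypergraph can be sketched compactly. Below is a standard hypergraph-regularized ranking iteration over an incidence matrix H (vertices standing for users/songs/tags, hyperedges for observed relations); the paper's unified hypergraph construction and exact regularizer may differ in detail.

```python
import numpy as np

def hypergraph_rank(H, y, w=None, alpha=0.9, iters=50):
    """Ranking on a hypergraph: H is the |V| x |E| incidence matrix,
    y the query vector (1 for the target user's observed items, else 0),
    w optional hyperedge weights. Returns ranking scores f."""
    n_v, n_e = H.shape
    w = np.ones(n_e) if w is None else w
    dv = H @ w                        # vertex degrees
    de = H.sum(axis=0)                # hyperedge degrees
    Dv = np.diag(1.0 / np.sqrt(np.maximum(dv, 1e-12)))
    inv_De = np.diag(w / np.maximum(de, 1e-12))
    Theta = Dv @ H @ inv_De @ H.T @ Dv
    f = y.astype(float).copy()
    for _ in range(iters):            # converges to (1-a)(I - a*Theta)^-1 y
        f = alpha * Theta @ f + (1 - alpha) * y
    return f
```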

Large-scale music tag recommendation with explicit multiple attributes
Zhendong Zhao, Xinxi Wang, Qiaoliang Xiang, Andy M. Sarroff, Zhonghua Li, Ye Wang
Pages: 401-410
doi>10.1145/1873951.1874006
Full text: PDF
Social tagging can provide rich semantic information for large-scale retrieval in music discovery. Such collaborative intelligence, however, also generates many tags unhelpful to discovery, some of which obfuscate critical information. Towards addressing these shortcomings, tag recommendation for more robust music discovery is an emerging topic of significance for researchers. However, current methods do not consider the diversity of music attributes, often using simple heuristics such as tag frequency to filter out irrelevant tags. Music attributes encompass any number of perceived dimensions, for instance vocalness, genre, and instrumentation, many of which are underrepresented by current tag recommenders. We propose a scheme for tag recommendation using Explicit Multiple Attributes based on tag semantic similarity and music content. In our approach, the attribute space is explicitly constrained at the outset to a set that minimizes semantic loss and tag noise, while ensuring attribute diversity. Once a user uploads or browses a song, the system recommends a list of relevant tags in each attribute independently. To the best of our knowledge, this is the first method to consider Explicit Multiple Attributes for tag recommendation. Our system is designed for large-scale deployment, on the order of millions of objects. For processing large-scale music data sets, we design parallel algorithms based on the MapReduce framework to perform large-scale music content and social tag analysis, train a model, and compute tag similarity. We evaluate our tag recommendation system on CAL-500 and a large-scale data set (N = 77,448 songs) generated by crawling YouTube and Last.fm. Our results indicate that the proposed method is both effective at recommending attribute-diverse relevant tags and efficient at scalable processing.
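
As a flavor of the MapReduce-style tag analysis the abstract mentions, here is a toy map/reduce pair counting tag co-occurrences across songs, a typical building block for tag similarity. It runs locally and only mimics the shuffle phase; the exact similarity measure used in the paper is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations

# Map step: each song's tag list emits its co-occurring tag pairs.
def map_song(tags):
    for a, b in combinations(sorted(set(tags)), 2):
        yield (a, b), 1

# Reduce step: sum counts per pair; the counts would feed a similarity
# measure (e.g. Jaccard over per-tag frequencies, omitted here).
def reduce_pairs(emitted):
    counts = defaultdict(int)
    for key, v in emitted:
        counts[key] += v
    return counts

songs = [["rock", "guitar", "male"], ["rock", "guitar"], ["jazz", "piano"]]
pairs = reduce_pairs(kv for s in songs for kv in map_song(s))
print(pairs[("guitar", "rock")])   # -> 2
```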

Social audio features for advanced music retrieval interfaces
Michael Kuhn, Roger Wattenhofer, Samuel Welten
Pages: 411-420
doi>10.1145/1873951.1874007
Full text: PDF
The size of personal music collections has constantly increased over the past years. As a result, the traditional metadata-based lists used to browse these collections have reached their limits. Interfaces based on music similarity offer an alternative and are thus gaining increasing attention. Music similarity is typically derived either from audio features (the objective approach) or from user-driven information sources, such as collaborative filtering or social tags (the subjective approach). Studies show that the latter techniques outperform audio-based approaches when it comes to describing perceived music similarity. However, subjective approaches typically only define pairwise relations, as opposed to the global notion of similarity given by audio-feature spaces. Many of the proposed interfaces for similarity-based music access inherently depend on this global notion and are thus not applicable to user-driven music similarity measures. The first contribution of this paper is a high-dimensional music space that is based on user-driven similarity measures. It combines the advantages of audio-feature spaces (a global view) with the advantages of subjective sources that better reflect users' perception. The proposed space compactly represents similarity and is therefore well suited for offline use, such as in mobile applications. To demonstrate its practical applicability, the second contribution is a comprehensive mobile music player that incorporates several smart interfaces to access the user's music collection. Based on this application, we finally present a large-scale user study that underlines the benefits of the introduced interfaces and shows strong user acceptance.
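
One classical way to turn purely pairwise (subjective) dissimilarities into a global Euclidean "music map" is multidimensional scaling. The sketch below uses textbook classical MDS on a full dissimilarity matrix; the paper's actual embedding construction is not given in the abstract, so this is only a generic illustration.

```python
import numpy as np

def classical_mds(D, dim=32):
    """Embed items into a Euclidean space from a pairwise dissimilarity
    matrix D (derived, e.g., from collaborative-filtering co-occurrence),
    so that nearest-neighbour browsing works on a global map."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:dim]           # keep largest eigenvalues
    vals = np.clip(vals[idx], 0, None)
    return vecs[:, idx] * np.sqrt(vals)          # n x dim coordinates
```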

SESSION: Full - F16/systems track/3D video
Session chair: Wolfgang Effelsberg

A cognitive approach for effective coding and transmission of 3D video
Simone Milani, Giancarlo Calvagno
Pages: 581-590
doi>10.1145/1873951.1874009
Full text: PDF
Other formats: Mp4
Reliable delivery of 3D video contents to a wide set of users is expected to be the next big revolution in multimedia applications, provided that it is possible to grant a certain level of Quality-of-Experience (QoE) to the end user. In recent years, several cross-layer solutions have proved to be extremely effective in tuning the transmission parameters at the different layers of the protocol stack and in maximizing the perceptual quality of the reconstructed 3D scene. Among these, Cognitive Source Coding (CSC) schemes (defined in analogy with Cognitive Radio systems) make it possible to improve the 3D QoE at the receiver by adapting the source coding strategy according to the state of the transmission channel and to the characteristics of the coded signal. This knowledge also permits an optimization of the computational complexity required at the encoder. The paper presents a CSC architecture that analyzes the 3D scene, identifies its different elements, and chooses the most appropriate coding strategy by classifying the features of each element using Support Vector Machine theory. Experimental results show that the proposed approach improves the quality of the received 3D signal with respect to traditional cross-layer techniques and reduces the computational complexity of coding operations.

Modeling 3D facial expressions using geometry videos
Jiazhi Xia, Ying He, Dao P.T. Quynh, Xiaoming Chen, Steven C.H. Hoi
Pages: 591-600
doi>10.1145/1873951.1874010
Full text: PDF
The significant advances in developing high-speed shape acquisition devices make it possible to capture the moving and deforming objects at video speeds. However, due to its complicated nature, it is technically challenging to effectively model and store the captured motion data. In this paper, we present a set of algorithms to construct geometry videos for 3D facial expressions, including hole filling, geodesic-based face segmentation, and expression-invariant parametrization. Our algorithms are efficient and robust, and can guarantee the exact correspondence of the salient features (eyes, mouth and nose). Geometry video naturally bridges the 3D motion data and 2D video, and provides a way to borrow the well-studied video processing techniques to motion data processing. With our proposed intra-frame prediction scheme based on H.264/AVC, we are able to compress the geometry videos into a very compact size while maintaining the video quality. Our experimental results on real-world datasets demonstrate that geometry video is effective for modeling the high-resolution 3D expression data.

A high-quality low-delay remote rendering system for 3D video
Shu Shi, Mahsa Kamali, Klara Nahrstedt, John C. Hart, Roy H. Campbell
Pages: 601-610
doi>10.1145/1873951.1874011
Full text: PDF
As an emerging technology, 3D video has shown great potential to become the next generation medium for tele-immersion. However, streaming and rendering this dynamic 3D data in real time requires tremendous network bandwidth and computing resources. In this paper, we build a remote rendering model to better study different remote rendering designs, and define 3D video rendering as an optimization problem. Moreover, we design a 3D video remote rendering system that significantly reduces delay while maintaining high rendering quality. We also propose a reference viewpoint prediction algorithm with super-sampling support that requires far fewer computational resources but provides better performance than the search-based algorithms proposed in related work.
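
As a baseline for the kind of viewpoint prediction the abstract refers to, a constant-velocity extrapolation of the last two observed viewer poses over the expected lookahead (e.g., the network round-trip time) looks like this; the paper's predictor with super-sampling support is more elaborate.

```python
import numpy as np

def predict_viewpoint(history, lookahead_s):
    """Constant-velocity prediction of the viewer pose used to pre-render
    reference views: extrapolate the last two timestamped poses forward
    by the lookahead interval."""
    (t0, p0), (t1, p1) = history[-2], history[-1]
    v = (np.asarray(p1) - np.asarray(p0)) / (t1 - t0)
    return np.asarray(p1) + v * lookahead_s

hist = [(0.00, [0.0, 0.0, 2.0]), (0.10, [0.1, 0.0, 2.0])]
print(predict_viewpoint(hist, lookahead_s=0.15))   # -> [0.25 0.   2. ]
```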

SESSION: Full - F12/applications/human-centered multimedia track/narrowing the experience gap
Session chair: Abed El Saddik

Dynamic captioning: video accessibility enhancement for hearing impairment
Richang Hong, Meng Wang, Mengdi Xu, Shuicheng Yan, Tat-Seng Chua
Pages: 421-430
doi>10.1145/1873951.1874013
Full text: PDF
Other formats: Mp4
There are more than 66 million people suffering from hearing impairment, and this disability brings them difficulty in video content understanding due to the loss of audio information. If scripts are available, captioning technology can help to a certain degree by synchronously illustrating the scripts while videos play. However, we show that existing captioning techniques are far from satisfactory in helping hearing-impaired audiences enjoy videos. In this paper, we introduce a video accessibility enhancement scheme with a Dynamic Captioning approach, which explores a rich set of technologies including face detection and recognition, visual saliency analysis, and text-speech alignment. Different from existing methods, which are categorized here as static captioning, dynamic captioning puts scripts at suitable positions to help hearing-impaired audiences better recognize the speaking characters. In addition, it progressively highlights the scripts word by word by aligning them with the speech signal, and illustrates the variation of voice volume. In this way, the audience can better track the scripts and perceive the moods conveyed by the variation of volume. We implemented the technology on 20 video clips and conducted an in-depth study with 60 hearing-impaired users; the results demonstrate the effectiveness and usefulness of the video accessibility enhancement scheme.

The third eye: mining the visual cognition across multi-language communities
Chunxi Liu, Qingming Huang, Shuqiang Jiang, Changsheng Xu
Pages: 431-440
doi>10.1145/1873951.1874014
Full text: PDF
Existing research work in the multimedia domain mainly focuses on image/video indexing, retrieval, annotation, tagging, re-ranking, etc. However, little work has addressed people's visual cognition. In this paper, we propose a novel framework to mine people's visual cognition across multi-language communities. Two challenges are addressed: the representation of visual cognition for a specific language community, and the comparison of visual cognition between different language communities. We call it "the third eye", meaning that through it, people with different backgrounds can better understand each other's cognition and can view a concept more objectively, avoiding cultural conflict. In this study, we utilize image search engines to mine the visual cognition of different communities, under the assumption that the image semantic distribution over the search results reflects the visual cognition of the community. When a user submits a text query, it is first translated into different languages and fed into the corresponding image search engine ports to retrieve images from these communities. After retrieval, the obtained images are automatically categorized into semantic clusters. Finally, inter-cluster ranking is employed to rank the semantic clusters according to their relationship to the query, and intra-cluster ranking is used to rank the images according to their representativeness. The visual cognition difference among language communities is obtained by comparing the communities' image distributions over these semantic clusters. The experimental results are promising and show that the proposed visual cognition mining approach is effective.

Green multimedia: informing people of their carbon footprint through two simple sensors
Aiden R. Doherty, Zhengwei Qiu, Colum Foley, Hyowon Lee, Cathal Gurrin, Alan F. Smeaton
Pages: 441-450
doi>10.1145/1873951.1874015
Full text: PDF
In this work we discuss a new, but highly relevant, topic for the multimedia community: systems that inform individuals of their carbon footprint, which could ultimately effect change in community carbon footprint-related activities. The reduction of carbon emissions is now an important policy driver of many governments, and one of the major areas of focus is reducing the energy demand from consumers, i.e., all of us individually. In terms of CO2 generated from energy consumption, there are three predominant factors, namely electricity usage, thermal-related costs, and transport usage. Standard home electricity and heating sensors can be used to measure the former two aspects, and in this paper we evaluate a novel technique to estimate an individual's transport-related carbon emissions through the use of a simple wearable accelerometer. We investigate how providing this novel estimate of transport-related carbon emissions through an interactive web site and mobile phone app engages a set of users in becoming more aware of their carbon emissions. Our evaluations involve a group of 6 users collecting 25 million accelerometer readings and 12.5 million power readings, versus a control group of 16 users collecting 29.7 million power readings.
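
A toy version of the transport-emissions pipeline, purely to fix ideas: classify each accelerometer window into a transport mode by its motion energy, then weight the distance travelled by a per-mode factor. The thresholds and emission factors below are invented placeholders, not the paper's calibrated values.

```python
import numpy as np

# Hypothetical per-mode CO2 factors in kg per km (illustrative values only).
EMISSION_KG_PER_KM = {"walk": 0.0, "cycle": 0.0, "bus": 0.10, "car": 0.20}

def transport_co2(accel_windows, distances_km):
    """Classify each time window's transport mode from accelerometer
    energy, then accumulate CO2 as emission factor times distance."""
    total = 0.0
    for window, dist in zip(accel_windows, distances_km):
        energy = np.var(window)          # movement-intensity proxy
        if energy > 4.0:   mode = "walk" # strong body motion
        elif energy > 1.0: mode = "cycle"
        elif energy > 0.2: mode = "bus"  # engine vibration, little body motion
        else:              mode = "car"
        total += EMISSION_KG_PER_KM[mode] * dist
    return total
```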

Bridging low-level features and high-level semantics via fMRI brain imaging for video classification
Xintao Hu, Fan Deng, Kaiming Li, Tuo Zhang, Hanbo Chen, Xi Jiang, Jinglei Lv, Dajiang Zhu, Carlos Faraco, Degang Zhang, Arsham Mesbah, Junwei Han, Xiansheng Hua, Li Xie, Stephen Miller, Lei Guo, Tianming Liu
Pages: 451-460
doi>10.1145/1873951.1874016
Full text: PDF
The multimedia content analysis community has made significant efforts to bridge the gap between low-level features and the high-level semantics perceived by human cognitive systems, such as real-world objects and concepts. In the two fields of multimedia analysis and brain imaging, both low-level features and high-level semantics are extensively studied. For instance, in the multimedia analysis field, many algorithms are available for multimedia feature extraction, and benchmark datasets such as TRECVID are available. In the brain imaging field, the brain regions responsible for vision, auditory perception, language, and working memory are well studied via functional magnetic resonance imaging (fMRI). This paper presents our initial effort at marrying these two fields in order to bridge the gap between low-level features and high-level semantics via fMRI brain imaging. Our experimental paradigm is that we performed fMRI brain imaging while university student subjects watched video clips selected from the TRECVID datasets. At the current stage, we focus on the three concepts of sports, weather, and commercial/advertisement specified in TRECVID 2005. Meanwhile, the brain regions in the vision, auditory, language, and working memory networks are quantitatively localized and mapped via task-based fMRI, and the fMRI responses in these regions are used to extract features representing the brain's comprehension of semantics. Our computational framework aims to learn the low-level feature sets that best correlate with the fMRI-derived semantics, based on the training videos with fMRI scans; the learned models are then applied to larger-scale test datasets without fMRI scans for category classification. Our results show that: 1) there are meaningful couplings between the brain's fMRI responses and video stimuli, suggesting the validity of linking semantics and low-level features via fMRI; and 2) the low-level feature sets learned from fMRI-derived semantic features can significantly improve the classification of video categories in comparison with the original low-level features.

SESSION: Full - F14/applications/content track/detection of near-duplicate content
Session chair: Alan Hanjalic

Building contextual visual vocabulary for large-scale image applications
Shiliang Zhang, Qingming Huang, Gang Hua, Shuqiang Jiang, Wen Gao, Qi Tian
Pages: 501-510
doi>10.1145/1873951.1874018
Full text: PDF
Notwithstanding its great success and wide adoption in Bag-of-Visual-Words representation, a visual vocabulary created from single-image local features is often shown to be ineffective, largely for three reasons. First, many detected local features are not stable enough, resulting in many noisy and non-descriptive visual words in images. Second, a single visual word discards the rich spatial contextual information among local features, which has been proven to be valuable for visual matching. Third, the distance metric commonly used for generating visual vocabularies does not take semantic context into consideration, which renders them prone to noise. To address these three problems, we propose an effective visual vocabulary generation framework containing three novel contributions: 1) an effective unsupervised local feature refinement strategy; 2) the modeling of spatial contexts by considering local features in groups; 3) a learned discriminant distance metric between local feature groups, which we call discriminant group distance. This group distance is further leveraged to induce a visual vocabulary from groups of local features. We name it the contextual visual vocabulary, as it captures both spatial and semantic contexts. We evaluate the proposed local feature refinement strategy and the contextual visual vocabulary in two large-scale image applications: large-scale near-duplicate image retrieval on a dataset containing 1.5 million images, and image search re-ranking tasks. Our experimental results show that the contextual visual vocabulary yields significant improvement over the classic visual vocabulary. Moreover, it outperforms the state-of-the-art Bundled Feature in terms of retrieval precision, memory consumption and efficiency.

Spatial coding for large scale partial-duplicate web image search
Wengang Zhou, Yijuan Lu, Houqiang Li, Yibing Song, Qi Tian
Pages: 511-520
doi>10.1145/1873951.1874019
Full text: PDF
State-of-the-art image retrieval approaches represent images with a high-dimensional vector of visual words by quantizing local features, such as SIFT, in the descriptor space. The geometric clues among visual words in an image are usually either ignored or exploited only for full geometric verification, which is computationally expensive. In this paper, we focus on partial-duplicate web image retrieval and propose a novel scheme, spatial coding, to encode the spatial relationships among local features in an image. Spatial coding is both efficient and effective at discovering false matches of local features between images, and can greatly improve retrieval performance. Experiments in partial-duplicate web image search, using a database of one million images, show that our approach achieves a 53% improvement in mean average precision and a 46% reduction in time cost over the baseline bag-of-words approach.
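
The heart of spatial coding can be shown in a few lines: for the matched features of an image, record for every feature pair whether one lies to the right of (X-map) or above (Y-map) the other; comparing the maps of the query and a candidate image then counts geometrically inconsistent matches. This is a simplified rendering of the idea, without the paper's quantized multi-direction refinements.

```python
import numpy as np

def spatial_maps(points):
    """Binary relative-position maps for a set of 2D feature positions:
    X[i, j] = 1 if feature j lies to the right of feature i; Y likewise
    for 'above'."""
    pts = np.asarray(points, dtype=float)
    X = (pts[None, :, 0] > pts[:, None, 0]).astype(int)
    Y = (pts[None, :, 1] > pts[:, None, 1]).astype(int)
    return X, Y

def inconsistency(maps_q, maps_db):
    """Count disagreeing relative positions; assumes both maps were built
    over the same ordered list of matched features. 0 means fully
    geometrically consistent."""
    (Xq, Yq), (Xd, Yd) = maps_q, maps_db
    return int(np.abs(Xq - Xd).sum() + np.abs(Yq - Yd).sum())
```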

Monitoring near duplicates over video streams
Xiangmin Zhou, Lei Chen
Pages: 521-530
doi>10.1145/1873951.1874020
Full text: PDF
Since near duplicates are ubiquitous across different data sources, increasing research effort has recently been devoted to near-duplicate detection. Among near-duplicate detection tasks, an important one is continuous near-duplicate monitoring over video streams. Existing video monitoring techniques are not effective at handling the variations that commonly exist among near duplicates. Moreover, approaches proposed for near-duplicate detection in archived video databases are inefficient when applied to high-speed video streams. In this work, we propose a framework for effective online monitoring of near duplicates over video streams. Specifically, we first propose a novel representation, the video cuboid signature, to describe a video segment. To capture the local spatio-temporal information of video subclips, we employ the Earth Mover's Distance (EMD) to measure the similarity between two signatures. Both the signature construction and the sequence similarity measure are processed incrementally by exploiting the inherent properties of signature series. Then, we propose a novel scheme called locality sensitive multi-leveled approximation (LSMA) that optimizes near-duplicate video similarity matching over streams based on locality sensitive hashing under the EMD metric. Extensive experiments demonstrate the high performance of our approach in terms of detection accuracy and time cost.

Real-time large scale near-duplicate web video retrieval
Lifeng Shang, Linjun Yang, Fei Wang, Kwok-Ping Chan, Xian-Sheng Hua
Pages: 531-540
doi>10.1145/1873951.1874021
Full text: PDF
Near-duplicate video retrieval is becoming more and more important with the exponential growth of the Web. Though various approaches have been proposed to address this problem, they mainly focus on retrieval accuracy and remain infeasible for querying Web-scale video databases in real time. This paper proposes a novel method to address the efficiency and scalability issues of near-duplicate Web video retrieval. We introduce a compact spatiotemporal feature to represent videos and construct an efficient data structure to index the feature, achieving real-time retrieval performance. The feature leverages the relative gray-level intensity distribution within a frame and the temporal structure of videos along the frame sequence. The index structure is based on an inverted file and allows fast histogram-intersection computation between videos. To demonstrate the effectiveness and efficiency of the proposed method, we evaluate its performance on an open Web video data set containing about 10K videos and compare it with four existing methods in terms of precision and time complexity. We also test our method on a data set containing about 50K videos and 11M key-frames: it takes on average 17 ms to execute a query against the whole 50K Web video data set.
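
A miniature version of the indexing scheme described, assuming video signatures are sparse histograms over quantized feature bins: an inverted file lets histogram intersection touch only the videos sharing a non-zero bin with the query.

```python
from collections import defaultdict

class InvertedIndex:
    """Tiny inverted file over quantized video signatures: postings map a
    feature bin to (video_id, weight) pairs, so a query only scans videos
    that share at least one non-zero bin."""
    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, vid, hist):                  # hist: {bin: weight}
        for b, w in hist.items():
            self.postings[b].append((vid, w))

    def query(self, hist):
        scores = defaultdict(float)
        for b, wq in hist.items():
            for vid, wd in self.postings[b]:
                scores[vid] += min(wq, wd)     # histogram intersection
        return sorted(scores.items(), key=lambda kv: -kv[1])

idx = InvertedIndex()
idx.add("v1", {3: 0.5, 7: 0.5})
idx.add("v2", {7: 0.9, 9: 0.1})
print(idx.query({7: 0.6, 9: 0.4}))   # -> [('v2', 0.7), ('v1', 0.5)]
```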

SESSION: Full - F15/applications/human-centered multimedia track/automatic generation of media content
Session chair: Mohamed M. Hefeeda

Automatic mashup generation from multiple-camera concert recordings
Prarthana Shrestha, Peter H.N. de With, Hans Weda, Mauro Barbieri, Emile H.L. Aarts
Pages: 541-550
doi>10.1145/1873951.1874023
Full text: PDF
A large number of videos are captured and shared by the audience at musical concerts. However, such recordings are typically perceived as boring, mainly because of their limited view, poor visual quality and incomplete coverage. Our objective is to enrich the viewing experience of these recordings by exploiting the abundance of content from multiple sources. In this paper, we propose a novel Virtual Director system that automatically combines the most desirable segments from different recordings into a single video stream, called a mashup. We start by eliciting requirements from focus groups, interviewing professional video editors and consulting the film grammar literature. We design a formal model for automatic mashup generation based on maximizing the degree of fulfillment of the requirements. Various audio-visual content analysis techniques are used to determine how well the requirements are satisfied by a recording. To validate the system, we compare our mashups with two others: one manually created by a professional video editor and one machine-generated by random segment selection. The mashups are evaluated in terms of visual quality, content diversity and pleasantness by 40 subjects. The results show that our mashups and the manual mashups are perceived as comparable, while both are rated significantly higher than the random mashups on all three criteria.
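
The requirement-maximization model can be caricatured as a per-slot selection problem; the greedy sketch below picks, for each time slot, the camera whose segment scores highest under some combined requirement score. Real systems, presumably including this one, also penalize over-frequent camera switches, which is omitted here.

```python
def greedy_mashup(segments, slot_s=3.0, duration_s=60.0):
    """Greedy mashup: 'segments' maps camera -> list of per-slot scores
    (image quality, diversity, etc., already combined into one number).
    Returns a cut list of (start_time, camera) pairs."""
    n_slots = int(duration_s / slot_s)
    cut = []
    for t in range(n_slots):
        best = max(segments, key=lambda cam: segments[cam][t])
        cut.append((t * slot_s, best))
    return cut

scores = {"camA": [0.9, 0.2, 0.4], "camB": [0.5, 0.8, 0.7]}
print(greedy_mashup(scores, slot_s=3.0, duration_s=9.0))
# -> [(0.0, 'camA'), (3.0, 'camB'), (6.0, 'camB')]
```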

Toward an automatically generated soundtrack from low-level cross-modal correlations for automotive scenarios
Marco Cristani, Anna Pesarin, Carlo Drioli, Vittorio Murino, Antonio Rodà, Michele Grapulin, Nicu Sebe
Pages: 551-560
doi>10.1145/1873951.1874024
Full text: PDF
In this paper, we propose a novel recommendation policy for driving scenarios. While driving a car, listening to an audio track may enrich the atmosphere, conveying emotions that let the driver sense a more arousing experience. Here, we introduce a recommendation policy that, given a video sequence taken by a camera mounted onboard a car, chooses the most suitable audio piece from a predetermined set of melodies. The mixing mechanism takes inspiration from a set of generic qualitative aesthetic rules for cross-modal linking, realized by associating audio and video features. The contribution of this paper is to translate such qualitative rules into quantitative terms, learning cross-modal statistical correlations from an extensive training dataset and validating them thoroughly. In this way, we are able to define which audio and video features correlate best (i.e., promoting or rejecting some aesthetic rules) and what their correlation intensities are. This knowledge is then employed for the realization of the recommendation policy. A set of user studies illustrates and validates the policy, encouraging further development toward a real implementation in an automotive application.

Supporting personal photo storytelling for social albums
Pere Obrador, Rodrigo de Oliveira, Nuria Oliver
Pages: 561-570
doi>10.1145/1873951.1874025
Full text: PDF
Information overload is one of today's major concerns. As high-resolution digital cameras become increasingly pervasive, unprecedented amounts of social media are being uploaded to online social networks on a daily basis. In order to support users in selecting the best photos to create an online photo album, attention has been devoted to the development of automatic approaches to photo storytelling. In this paper, we present a novel photo collection summarization system that learns some of a user's social context by analyzing their online photo albums, and incorporates storytelling principles together with face and image aesthetic ranking in order to assist users in creating new photo albums to be shared online. In an in-depth user study conducted with 12 subjects, the proposed system was validated as a first step in the photo album creation process, helping users reduce the workload of such a task. Our findings suggest that a human audio/video professional with cinematographic skills does not perform better than our proposed system.

Multimedia content creation using societal-scale ubiquitous camera networks and human-centric wearable sensing
Mathew Laibowitz, Nan-wei Gong, Joseph A. Paradiso
Pages: 571-580
doi>10.1145/1873951.1874026
Full text: PDF
We present a novel approach to the creation of user-generated, documentary video using a distributed network of sensor-enabled video cameras and wearable on-body sensor devices. The wearable sensors are used to identify the subjects in view of the camera system and label the captured video with real-time human-centric social and physical behavioral information. With these labels, massive amounts of continually recorded video can be browsed, searched, and automatically stitched into cohesive multimedia content. This system enables naturally occurring human behavior to drive and control a multimedia content creation system in order to create video output that is understandable, informative, and/or enjoyable to its human audience. The collected sensor data is further utilized to enhance the created multimedia content such as by using the data to edit and/or generate audio score, determine appropriate pacing of edits, and control the length and type of audio and video transitions directly from the content of the captured media. We present the design of the platform, the design of the multimedia content creation application, and the evaluated results from several live runs of the complete system.

SESSION: Full - F13/applications/content/human-centered multimedia track/processing of social media
Session chair: Yong Rui

Image tag refinement towards low-rank, content-tag prior and error sparsity
Guangyu Zhu, Shuicheng Yan, Yi Ma
Pages: 461-470
doi>10.1145/1873951.1874028
Full text: PDF
The vast number of user-provided image tags on popular photo sharing websites may greatly facilitate image retrieval and management. However, these tags are often imprecise and/or incomplete, resulting in unsatisfactory performance in tag-related applications. In this work, the tag refinement problem is formulated as a decomposition of the user-provided tag matrix D into a low-rank refined matrix A and a sparse error matrix E, namely D = A + E, targeting optimality measured by four aspects: 1) low rank: A is of low rank owing to the semantic correlations among the tags; 2) content consistency: if two images are visually similar, their tag vectors (i.e., column vectors of A) should also be similar; 3) tag correlation: if two tags co-occur with high frequency in general images, their co-occurrence frequency (described by two row vectors of A) should also be high; and 4) error sparsity: the matrix E is sparse, since the tag matrix D is sparse and humans can provide reasonably accurate tags. These components together constitute a constrained yet convex optimization problem, and an efficient, provably convergent iterative procedure based on the accelerated proximal gradient method is proposed for the optimization. Extensive experiments on two benchmark Flickr datasets, with 25K and 270K images respectively, demonstrate the effectiveness of the proposed tag refinement approach.
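
The core D = A + E split can be approximated with two alternating proximal steps, singular-value thresholding for the low-rank part and soft thresholding for the sparse part, as sketched below. This omits the paper's content-consistency and tag-correlation terms as well as the accelerated proximal gradient machinery, so it is only the skeleton of the optimization.

```python
import numpy as np

def svt(M, tau):
    """Singular-value thresholding: proximal step for the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def shrink(M, tau):
    """Soft thresholding: proximal step for the l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0)

def decompose(D, lam=None, iters=100, mu=1.0):
    """Alternate the two proximal steps so that D ~ A (low-rank refined
    tag matrix) + E (sparse tag errors)."""
    lam = lam or 1.0 / np.sqrt(max(D.shape))
    A = np.zeros_like(D, dtype=float)
    E = np.zeros_like(D, dtype=float)
    for _ in range(iters):
        A = svt(D - E, 1.0 / mu)
        E = shrink(D - A, lam / mu)
    return A, E
```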

Quantifying tag representativeness of visual content of social images
Aixin Sun, Sourav S. Bhowmick
Pages: 471-480
doi>10.1145/1873951.1874029
Full text: PDF
Social tags describe images from many aspects, including the visual content observable from the images, the context and usage of images, user opinions, and others. Not all tags are therefore useful for image search or appropriate for tag recommendation with respect to the visual content of images. However, the relationship between a given tag and the visual content of its tagged images is largely ignored in existing studies on tags and in tagging applications. In this paper, we bridge two orthogonal areas, social image tagging and query performance prediction in Web search, to quantify how well a tag represents the visual content of the images it annotates, known as tag visual-representativeness. In simple words, tag visual-representativeness characterizes the effectiveness of a tag in describing the visual content of the set of images annotated with it. A tag is visually representative if its annotated images are visually similar to each other, containing a common visual concept such as an object or a scene. We propose two distance metrics, namely cohesion and separation, to quantify tag visual-representativeness from the set of images annotated by a tag and the entire image collection. Through extensive experiments on a subset of Flickr images, we demonstrate the characteristics of seven variants of the distance metrics derived from different low-level image representations, and show that visually representative tags can be identified with high precision. Importantly, the proposed distance measures are parameter-free, with linear or constant computational complexity, and are thus suitable for practical applications.
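
In the spirit of the cohesion/separation pair (the paper's exact definitions may differ), one parameter-free rendering is: cohesion measures how tightly a tag's images cluster around their own centroid, and separation measures how far that centroid sits from the whole collection's centroid.

```python
import numpy as np

def visual_representativeness(tag_feats, all_feats):
    """Two parameter-free scores for a tag over low-level image features:
    cohesion (higher = the tag's images are visually tighter) and
    separation (higher = the tag is more distinctive in the collection)."""
    tag_feats = np.asarray(tag_feats)
    all_feats = np.asarray(all_feats)
    c_tag = tag_feats.mean(axis=0)
    c_all = all_feats.mean(axis=0)
    cohesion = -np.linalg.norm(tag_feats - c_tag, axis=1).mean()
    separation = np.linalg.norm(c_tag - c_all)
    return cohesion, separation
```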

Social pixels: genesis and evaluation
Vivek K. Singh, Mingyan Gao, Ramesh Jain
Pages: 481-490
doi>10.1145/1873951.1874030
Full text: PDF
Huge amounts of social multimedia are being created daily by a combination of globally distributed, disparate sensors, including human sensors (e.g., tweets) and video cameras. Taken together, this represents information about multiple aspects of the evolving world. Understanding the various events, patterns and situations emerging in such data has applications in multiple domains. We develop abstractions and tools to decipher the spatio-temporal phenomena which manifest themselves across such social media data. We describe an approach for aggregating the social interest of users about any particular theme from any particular location into 'social pixels'. Aggregating such pixels spatio-temporally allows the creation of social versions of images and videos, which then become amenable to various media processing techniques (like segmentation and convolution) to derive semantic situation information. We define a declarative set of operators upon such data that allows users to formulate queries to visualize, characterize, and analyze it. Results of applying these operations over an evolving corpus of millions of Twitter and Flickr posts, to answer situation-based queries in multiple application domains, are promising.
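
A minimal sketch of the social-pixel construction: rasterize theme-matching geotagged posts into a grid "image" over a bounding box, after which ordinary image operators apply. A 3x3 box blur stands in for the paper's convolution operator, and the grid size is arbitrary.

```python
import numpy as np

def social_pixels(posts, bbox, grid=(100, 100)):
    """Aggregate geotagged posts (lat, lon) about one theme into a grid of
    'social pixels'; each cell counts local interest."""
    (lat0, lon0, lat1, lon1), (h, w) = bbox, grid
    img = np.zeros((h, w))
    for lat, lon in posts:
        r = int((lat - lat0) / (lat1 - lat0) * (h - 1))
        c = int((lon - lon0) / (lon1 - lon0) * (w - 1))
        if 0 <= r < h and 0 <= c < w:
            img[r, c] += 1
    # 3x3 box-blur convolution as a simple spatial aggregation operator
    k = np.ones((3, 3)) / 9.0
    pad = np.pad(img, 1)
    smooth = sum(pad[i:i+h, j:j+w] * k[i, j]
                 for i in range(3) for j in range(3))
    return smooth
```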

Image retagging
Dong Liu, Xian-Sheng Hua, Meng Wang, Hong-Jiang Zhang
Pages: 491-500
doi>10.1145/1873951.1874031
Full text: PDF
Online social media repositories such as Flickr and Zooomr allow users to manually annotate their images with freely chosen tags, which are then used as indexing keywords to facilitate image search and other applications. However, these tags are frequently imprecise and incomplete, even though they are provided by human beings, and many of them are meaningful only to the image owners (such as the name of a dog). Thus there is still a gap between these tags and the actual content of the images, which significantly limits tag-based applications such as search and browsing. To tackle this issue, this paper proposes a social image "retagging" scheme that aims at assigning images better content descriptors. The refining process, including denoising and enriching, is formulated as an optimization framework based on the consistency between "visual similarity" and "semantic similarity" in social images: visually similar images tend to have similar semantic descriptors, and vice versa. An effective iterative bound optimization algorithm is applied to learn the improved tag assignment. In addition, as many tags are intrinsically not closely related to the visual content of the images, we employ a knowledge-based method to differentiate visual-content-related tags from unrelated ones, and then constrain the tagging vocabulary of our automatic algorithm to the content-related tags. Finally, to improve the coverage of the tags, we further enrich the tag set with appropriate synonyms and hypernyms based on an external knowledge base. Experimental results on a Flickr image collection demonstrate the effectiveness of this approach. We also show the remarkable performance improvements brought by retagging in two applications, i.e., tag-based search and automatic annotation.
|
|
|
SESSION: Short - S1/applications/human-centered multimedia track |
| |
Marcel Worring
|
|
|
|
|
Movie2Comics: a feast of multimedia artwork |
| |
Richang Hong,
Xiao-Tong Yuan,
Mengdi Xu,
Meng Wang,
Shuicheng Yan,
Tat-Seng Chua
|
|
Pages: 611-614 |
|
doi>10.1145/1873951.1874033 |
|
Full text: PDF
|
|
As a type of artwork, comics are prevalent and popular around the world. However, although several assistive software packages and tools are available, the creation of comics is still a tedious and labor-intensive process. This paper proposes a scheme that is able to automatically turn a movie into comics under two principles: (1) optimizing the preservation of the movie's information; and (2) generating output that follows the rules and styles of comics. The scheme mainly contains three components: script-face mapping, key-scene extraction, and cartoonization. Script-face mapping utilizes face recognition and tracking techniques to accomplish the mapping between characters' faces and their scripts. Key-scene extraction then combines the frames derived from subshots and the index frames extracted based on subtitles to select a sequence of frames for cartoonization. Finally, the cartoonization is accomplished via four steps: panel scaling, stylization, word balloon placement and comics layout. Experiments conducted on a set of movie clips have demonstrated the usefulness and effectiveness of the scheme.
|
|
|
NudgeCam: toward targeted, higher quality media capture |
| |
Scott Carter,
John Adcock,
John Doherty,
Stacy Branham
|
|
Pages: 615-618 |
|
doi>10.1145/1873951.1874034 |
|
Full text: PDF
|
|
NudgeCam is a mobile application that can help users capture more relevant, higher quality media. To guide users to capture media more relevant to a particular project, third-party template creators can show users media that demonstrates relevant content and can tell users what content should be present in each captured media using tags and other meta-data such as location and camera orientation. To encourage higher quality media capture, NudgeCam provides real time feedback based on standard media capture heuristics, including face positioning, pan speed, audio quality, and many others. We describe an implementation of NudgeCam on the Android platform as well as field deployments of the application.
|
|
|
Tagging tags |
| |
Kuiyuan Yang,
Xian-Sheng Hua,
Meng Wang,
Hong-Jiang Zhang
|
|
Pages: 619-622 |
|
doi>10.1145/1873951.1874035 |
|
Full text: PDF
|
|
Social image sharing websites like Flickr have successfully motivated users around the world to annotate images with tags, which greatly facilitate search and organization of social image content. However, these manually-input tags are far from a comprehensive description of the image content, which limits the effectiveness of the tags in content-based image search. In this paper, we propose an automatic scheme called tagging tags to supplement semantic image descriptions by associating a group of property tags with each existing tag. For example, an initial tag "tiger" will be further tagged with "white", "stripes" and "bottom-right" along three tag properties: color, texture and location, respectively. In the proposed scheme, a lazy learning approach is first applied to estimate the corresponding image regions of each initial tag, and then a set of property tags, which involve six exemplary property aspects including location, color, texture, shape, size and dominance, are derived for each tag according to the content of the regions and the entire image. These tag properties enable much more precise image search, especially when certain tag properties are included in the query. The results of the empirical evaluation show that tag properties remarkably boost the performance of social image retrieval.
|
|
|
i-m-Space: interactive multimedia-enhanced space for rehabilitation of breast cancer patients |
| |
Ju-Chun Ko,
Wei-Han Chen,
Meng-Chieh Yu,
Han-Hung Lin,
Jin-Yao Lin,
Szu-Wei Wu,
Yi-Yu Chung,
I-Ling Hu,
Wei-Ting Peng,
Shih-Yao Lin,
Chia Han Chang,
Pei-Hsuan Chou,
King-Jen Chang,
Mei-Lan Chang,
Sue-huei Chen,
Jin-Shing Chen,
Ming-Sui Lee,
Mike Y. Chen,
Yi-Ping Hung
|
|
Pages: 623-626 |
|
doi>10.1145/1873951.1874036 |
|
Full text: PDF
|
|
This paper presents i-m-Space, an interactive multimedia rehabilitation space that helps the post-surgery recovery of breast cancer patients. Our goal is to improve patients' physical therapy and psychological relaxation experience through careful applications of multimedia technology. i-m-Space consists of three types of breathing-based relaxation and three types of interactive exercise-based rehabilitation. Our inter-disciplinary team includes medical professionals, multimedia engineers, designers, and artists. We have implemented i-m-Space in an experimental space in collaboration with a local breast cancer foundation. To evaluate i-m-Space, we have recruited several patients who recently recovered from breast cancer to use i-m-Space and to share their first-hand experiences. Our contributions include the following: 1) injecting a sense of fun and playfulness into traditional therapy to attract patients; 2) providing therapists with sufficient flexibility so they can personalize therapy sessions for each patient; 3) maintaining the safety of patients.
|
|
|
A music search engine for therapeutic gait training |
| |
Zhonghua Li,
Qiaoliang Xiang,
Jason Hockman,
Jianqing Yang,
Yu Yi,
Ichiro Fujinaga,
Ye Wang
|
|
Pages: 627-630 |
|
doi>10.1145/1873951.1874037 |
|
Full text: PDF
|
|
A music retrieval system is introduced that incorporates tempo, cultural, and beat strength features to help music therapists provide appropriate music for gait training for Parkinson's patients. Unlike current methods available to music therapists (e.g., personal CD/MP3 library search), we propose a domain-specific search engine that utilizes a database of music found on YouTube. We independently evaluate the efficacy of our tempo, cultural, and beat strength features on a music database extracted from YouTube. Results from our user study demonstrate the effectiveness and usefulness of our search engine for this application.
|
|
|
Beyond GPS: determining the camera viewing direction of a geotagged image |
| |
Minwoo Park,
Jiebo Luo,
Robert T. Collins,
Yanxi Liu
|
|
Pages: 631-634 |
|
doi>10.1145/1873951.1874038 |
|
Full text: PDF
|
|
Increasingly, geographic information is being associated with personal photos. Recent research results have shown that the additional global positioning system (GPS) information helps visual recognition for geotagged photos by providing location context. However, the current GPS data only identifies the camera location, leaving the viewing direction uncertain. To produce more precise location information, i.e. the viewing direction for geotagged photos, we utilize both Google Street View and Google Earth satellite images. Our proposed system is two-pronged: 1) visual matching between a user photo and any available street views in the vicinity determines the viewing direction, and 2) when only an overhead satellite view is available, near-orthogonal view matching between the user photo and satellite imagery computes the viewing direction. Experimental results have shown the promise of the proposed framework.
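The street-view branch can be pictured with a small sketch: match the user photo against panoramic crops at candidate bearings around the GPS fix and report the bearing with the most feature matches. Everything here is illustrative; `match_count` is a toy stand-in for a real SIFT/RANSAC matcher:

```python
# A minimal sketch of bearing estimation by feature matching. Descriptors are
# random placeholders; a real system would extract SIFT descriptors and verify
# matches geometrically.
import numpy as np

def match_count(photo_desc, view_desc, thresh=1.5):
    # Toy matcher: count photo descriptors with a close neighbor in the view.
    d = np.linalg.norm(photo_desc[:, None] - view_desc[None, :], axis=2)
    return int((d.min(axis=1) < thresh).sum())

def viewing_direction(photo_desc, views_by_bearing):
    return max(views_by_bearing, key=lambda b: match_count(photo_desc, views_by_bearing[b]))

rng = np.random.default_rng(2)
views = {b: rng.random((50, 16)) for b in range(0, 360, 45)}  # 8 candidate bearings
print(viewing_direction(rng.random((20, 16)), views), "degrees")
```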
|
|
|
Real-world trajectory extraction for attack pattern analysis in soccer video |
| |
Zhenxing Niu,
Qi Tian,
Xinbo Gao
|
|
Pages: 635-638 |
|
doi>10.1145/1873951.1874039 |
|
Full text: PDF
|
|
Most existing approaches to tactic analysis of soccer video are based on mosaic trajectory analysis, which loses much semantic information compared to the real-world trajectory. Without effective extraction of real-world trajectories, soccer tactics cannot be properly represented and analyzed from the perspective of soccer professionals. In this paper, a real-world trajectory extraction method is proposed. Moreover, six attack patterns are defined to represent soccer tactics, and a novel attack pattern recognition algorithm is developed based on the analysis of the ball's state and real-world trajectory. To the best of our knowledge, this is the first work of its kind that systematically analyzes soccer tactics based on real-world trajectories for broadcast soccer video. Our experiments demonstrate that the defined attack patterns can be effectively recognized.
|
|
|
Tag transformer |
| |
Yicheng Song,
Juan Cao,
Zhineng Chen,
Yongdong Zhang,
Jintao Li
|
|
Pages: 639-642 |
|
doi>10.1145/1873951.1874040 |
|
Full text: PDF
|
|
Human annotations (titles and tags) of web videos facilitate most web video applications. However, the raw tags are noisy, sparse and structureless, which limits their effectiveness. In this paper, we propose a tag transformer scheme to solve these problems. We first eliminate imprecise and meaningless tags with Wikipedia, and then transform the remaining tags into the Wikipedia category set to gather a precise, complete and structured description of the tags. Our experimental results on web video categorization demonstrate the superiority of the transformed space. We also apply the tag transformer in the first study of using the Wikipedia category system to structurally recommend related videos. The online user study of the demo system suggests that our method brings a substantially improved experience to web users.
|
|
|
Gaze awareness and interaction support in presentations |
| |
Kar-Han Tan,
Dan Gelb,
Ramin Samadani,
Ian Robinson,
Bruce Culbertson,
John Apostolopoulos
|
|
Pages: 643-646 |
|
doi>10.1145/1873951.1874041 |
|
Full text: PDF
|
|
Modern digital presentation systems use rich media to bring highly sophisticated information visualization and highly effective storytelling capabilities to classrooms and corporate boardrooms. In this paper we address a number of issues that arise when the ubiquitous computer-projector setup is used in large venues like the cavernous auditoriums and hotel ballrooms often used in large scale academic meetings and industrial conferences. First, when the presenter is addressing a large audience the slide display needs to be very large and placed high enough so that it is clearly visible from all corners of the room. This makes it impossible for a presenter to walk up to the display and interact with the display with gestures, gaze, and other forms of paralanguage. Second, it is hard for the audience to know which part of the slide the presenter is looking at when he/she has to look the opposite way from the audience while interacting with the slide material. It is also hard for the presenter to see the audience in these cases. Even though there may be video captures of the presenter, slides, and even the audience, the above factors add up to make it very difficult for a user viewing either a live feed or a recording to grasp the interaction between all the components and participants of a presentation. We address these problems with a novel presentation system which creates a live video view that seamlessly combines the presenter and the presented material, capturing all graphical, verbal, and nonverbal channels of communication. The system also allows the local and remote audiences to have highly interactive exchanges with the presenter while creating a comprehensive view for recording or remote streaming.
|
|
|
Digesting omni-video along routes for navigation |
| |
Hongyuan Cai,
Jiang Yu Zheng
|
|
Pages: 647-650 |
|
doi>10.1145/1873951.1874042 |
|
Full text: PDF
|
|
Omni-directional video records complete visual information along a route. Though replaying an omni-video presents reality, it requires a significant amount of memory and communication bandwidth. This work extracts distinct views from an omni-video to form a visual digest named a route sheet for navigation. We sort scenes at the motion and visibility level and investigate the similarity/redundancy of scenes in the context of a route. We use source data from a 3D elevation map or omni-videos for the view selection. By condensing the flow in the video, our algorithm can generate distinct omni-view sequences with visual information as rich as the omni-video for further scene indexing and navigation with GIS data.
|
|
|
Building book inventories using smartphones |
| |
David M. Chen,
Sam S. Tsai,
Bernd Girod,
Cheng-Hsin Hsu,
Kyu-Han Kim,
Jatinder Pal Singh
|
|
Pages: 651-654 |
|
doi>10.1145/1873951.1874043 |
|
Full text: PDF
|
|
Manual generation of a book inventory is time-consuming and tedious, while deployment of barcode and radio-frequency identification (RFID) management systems is costly and affordable only to large institutions. In this paper, we design and implement a mobile book recognition system for conveniently generating an inventory of books by snapping photos of a bookshelf with a smartphone. Since smartphones are becoming ubiquitous and affordable, our inventory management solution is cost-effective and very easy to deploy. Automatic and robust book recognition is achieved in our system using a combination of spine segmentation and bag-of-features image matching. At the same time, the location of each book is inferred from the smartphone's sensor readings, including accelerometer traces, digital compass measurements, and WiFi signatures. This location information is combined with the image recognition results to construct a location-aware book inventory. We demonstrate the effectiveness of our book spine recognition and location estimation techniques in recognition experiments and in an actual mobile book recognition system.
|
|
|
Templated recursive image composition |
| |
C. Brian Atkins,
Nicholas P. Lyons,
Xuemei Zhang,
Daniel R. Tretter
|
|
Pages: 655-658 |
|
doi>10.1145/1873951.1874044 |
|
Full text: PDF
|
|
With the proliferation of image acquisition and consumption, there is an increasing need for solutions that help ordinary people create high quality image composites. In most solutions today, image layouts are provided as fixed templates, which offer the potential of visually diverse layout sets. However, the layout choices are limited to those selected in advance by the template designer; and the library may not support a particular image count, aspect ratio set or spatial distribution. To ameliorate these shortcomings, we propose an image layout framework called Templated Recursive Image Composition. TRIC is template-based in that every layout is based on a template specification. However, TRIC is also generative in that virtually any image set can be accommodated as long as there is at least one image for every region in the template specification. Constraints ensure respect for image aspect ratios; for spacing in the layout interior; and for proportions and placement of sublayouts corresponding to regions in the template specification. We present a description of TRIC, results that demonstrate its versatility, and a user study that supports its acceptability.
|
|
|
Putting the pieces together: multimodal analysis of social attention in meetings |
| |
Ramanathan Subramanian,
Jacopo Staiano,
Kyriaki Kalimeri,
Nicu Sebe,
Fabio Pianesi
|
|
Pages: 659-662 |
|
doi>10.1145/1873951.1874045 |
|
Full text: PDF
|
|
This paper presents a multimodal framework employing eye-gaze, head-pose and speech cues to explain observed social attention patterns in meeting scenes. We first investigate a few hypotheses concerning social attention and characterize meetings and individuals based on ground-truth data. This is followed by replication of ground-truth results through automated estimation of eye-gaze, head-pose and speech activity for each participant. Experimental results show that combining eye-gaze and head-pose estimates decreases error in social attention estimation by over 26%.
|
|
|
AIR conferencing: accelerated instant replay for in-meeting multimodal review |
| |
Kori Inkpen,
Rajesh Hegde,
Sasa Junuzovic,
Christopher Brooks,
John C. Tang,
Zhengyou Zhang
|
|
Pages: 663-666 |
|
doi>10.1145/1873951.1874046 |
|
Full text: PDF
|
|
When people attend meetings they may miss parts of the discussion if they, for example, step out to take a phone call, go to the bathroom, or have a momentary lapse in concentration. As a result, they may need to catch up on what they missed upon returning to the meeting. Asking other attendees for a recap is often disruptive. To avoid such disruptions, we have developed an Accelerated Instant Replay (AIR) Conferencing system for videoconferencing that enables participants to privately catch up to an ongoing meeting. We explored several mechanisms where the meeting content is replayed at an accelerated rate so that the participants can catch up to the live discussion reasonably quickly.
|
|
|
Making computers look the way we look: exploiting visual attention for image understanding |
| |
Harish Katti,
Ramanathan Subramanian,
Mohan Kankanhalli,
Nicu Sebe,
Tat-Seng Chua,
Kalpathi R. Ramakrishnan
|
|
Pages: 667-670 |
|
doi>10.1145/1873951.1874047 |
|
Full text: PDF
|
|
Human visual attention (HVA) is an important strategy to focus on specific information while observing and understanding visual stimuli. HVA involves making a series of fixations on select locations while performing tasks such as object recognition, scene understanding, etc. We present one of the first works that combines fixation information with automated concept detectors to (i) infer abstract image semantics, and (ii) enhance the performance of object detectors. We develop visual attention-based models that sample fixation distributions and fixation transition distributions in regions-of-interest (ROI) to infer abstract semantics such as expressive faces and interactions (such as look, read, etc.). We also exploit eye-gaze information to deduce possible locations and scales of salient concepts and to aid state-of-the-art detectors. We achieve an 18% performance increase with over 80% reduction in computational time for a state-of-the-art object detector [4].
|
|
|
MOGCLASS: a collaborative system of mobile devices for classroom music education |
| |
Yinsheng Zhou,
Graham Percival,
Xinxi Wang,
Ye Wang,
Shengdong Zhao
|
|
Pages: 671-674 |
|
doi>10.1145/1873951.1874048 |
|
Full text: PDF
|
|
We introduce MOGCLASS: a system of networked mobile devices to amplify and extend children's capabilities to perceive, perform and produce music collaboratively in a classroom context. MOGCLASS includes various features to enhance students' motivation, interest, and collaboration in music class. It provides a wide-ranging palette of easy-to-use musical instruments for students to choose from, and supports both collaborative silent practice with headphones and collaborative performance with loudspeakers. To facilitate classroom management, the teacher's interface is used to control students' activities. Our evaluation results indicate that MOGCLASS is effective in increasing students' motivation in learning music and in supporting teachers' classroom management.
|
|
|
Adaptive combination of tag and link-based user similarity in flickr |
| |
Nhat Hai Phan,
Van Duc Thong Hoang,
Hyoseop Shin
|
|
Pages: 675-678 |
|
doi>10.1145/1873951.1874049 |
|
Full text: PDF
|
|
Finding similar users is one of the key applications in social media. The similarity between users can be measured with two different approaches: semantic similarity and similarity in terms of social relations. These two approaches can be combined with different weight factors. However, the conventional combination scheme has a critical drawback: the weight factors are fixed for every user, and thus the combination is not optimized for users who use rare terms or do not have sufficient relations with other users. To address this problem, in this paper we propose an adaptive combination scheme of tag-based similarity and link-based similarity in which the weight factors are dynamically determined for each user by evaluating each user's characteristics, such as tag commonness and link strength. The experimental results with a Flickr data set show that the proposed scheme consistently outperforms the previous work by about 20%.
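The adaptive weighting can be pictured with a small sketch. The heuristic below (weighting toward tag similarity when the user's tags are common, toward link similarity when the user's links are strong) is an assumption for illustration, not the paper's exact formula:

```python
# A minimal sketch of adaptive per-user combination of two similarity signals.
# `tag_commonness` and `link_strength` are assumed per-user scores in [0, 1].
def combined_similarity(tag_sim, link_sim, tag_commonness, link_strength):
    # The more common (reliable) a user's tags are relative to their link
    # strength, the more weight the tag-based similarity receives.
    w = tag_commonness / (tag_commonness + link_strength + 1e-9)
    return w * tag_sim + (1 - w) * link_sim

print(combined_similarity(0.7, 0.4, tag_commonness=0.8, link_strength=0.2))
```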
|
|
|
Multi-display map touring with tangible widget |
| |
Marco Piovesana,
Ying-Jui Chen,
Neng-Hao Yu,
Hsiang-Tao Wu,
Li-Wei Chan,
Yi-Ping Hung
|
|
Pages: 679-682 |
|
doi>10.1145/1873951.1874050 |
|
Full text: PDF
|
|
Many map systems are created to help users find a place or define a route to follow. Google Maps extends the concept of "surfing the map" by adding a street view that allows the user to explore a place through real pictures, creating the same feeling as walking through the streets. The horizontal 2D map and vertical panoramic street view, however, cause usability problems when operated with a traditional mouse and keyboard and presented on a single vertical or horizontal display. This paper presents a new table system composed of a horizontal tabletop screen and a vertical screen. The map view and the street view are displayed on the horizontal and vertical displays of our system, respectively. Users can place a tangible pawn on the 2D map to gain direct access to the street view from the pawn's point of view. In the user study, we compare our system with a standard computer system on the navigation task. The results show that our system improves intuitiveness of use, efficiency of city exploration, and ease of remembering unfamiliar spaces. We also discuss limitations of using tangible objects for map navigation.
|
|
|
"Stray": a new multimedia music composition using the andantephone |
| |
Ryan Janzen,
Steve Mann
|
|
Pages: 683-686 |
|
doi>10.1145/1873951.1874051 |
|
Full text: PDF
|
|
The andantephone is an instrument that allows a performer to physically step through a piece of music by walking. Each note or chord of the piece is assigned to one footstep, so expressively varying velocity varies the tempo in turn. A new, more flexible design of andantephone was created for use in a new composition, using an array of tiles sensitive to geophonic seismic waves from footsteps. This user interface was combined with a real-time frequency shifterbank which shifted the geophonic vibrations of human feet to the frequencies of the musical notes and chords of the composition. Matrix multiplication in the shifterbank allowed chords from various tiles to be expressively layered together depending on the geophonic sounds coming from any number of tiles being stepped on at a time. Moreover, the new shifterbank adjusted its tuning dynamically (according to the composition) as the performer cycled around the track, creating a responsive multitouch landscape that unraveled around the track ahead of and behind the performer. The shifterbank output was also interfaced to a pipe organ, via FLUIDI, using the organ as an additional sounding device. A new andantephone tile configuration led to advantages over previous configurations, including less off-track radial acceleration required to change tempo, and the ability to multiplex between different types of vertex turns, which was found to improve spatial orientation when performing.
|
|
|
A user study of visual versus sonically-enhanced interfaces for use while walking |
| |
Yaohua Yu,
Zhengjie Liu
|
|
Pages: 687-690 |
|
doi>10.1145/1873951.1874052 |
|
Full text: PDF
|
|
This paper presents a user study on interaction with a mobile device. We investigate the use of non-speech sound in mobile interfaces and design a sonically-enhanced interface. The sonically-enhanced interface is compared to a visual interface when users are walking. The results show that the sonically-enhanced interface can improve the mobile interaction as compared to the visual interface.
|
|
|
Fast image rearrangement via multi-scale patch copying |
| |
Jiayao Hu,
Shifeng Chen,
Jianzhuang Liu,
Xiaoou Tang
|
|
Pages: 691-694 |
|
doi>10.1145/1873951.1874053 |
|
Full text: PDF
|
|
In this paper, we propose a simple interactive way to perform a novel type of image synthesis called image rearrangement, whose goal is to construct a new image based on objects cropped from source images. The synthesis results are obtained by copying patches from the source images in a globally consistent way. The patch copying problem is formulated with the Markov random field model, and belief propagation is used as the optimization tool. To speed up our algorithm, a two-step belief propagation and a multi-scale patch copying scheme are adopted. Experimental results indicate that our algorithm obtains satisfactory results in both performance and efficiency.
|
|
|
Learning parts-based representation for face transition |
| |
Xiong Li,
Liwei Wang,
Huanxi Liu,
Yuncai Liu
|
|
Pages: 695-698 |
|
doi>10.1145/1873951.1874054 |
|
Full text: PDF
|
|
This paper proposes to learn a parts-based face representation from real face samples and then applies it to face transition. It differs from previous works in two aspects. First, we learn a flexible face decomposition from real faces in an unsupervised manner instead of designing face templates manually, for which two simple priors are embedded into the learning procedure through a constrained EM formulation. Second, both face representation and transition are derived from a unified probabilistic framework. Based on the learned face representation, a face distance measurement can be defined, which enables us to synthesize faces by specifying distances with respect to reference faces and to depict the full transition trace of two or more given faces with distinct age, gender and race.
|
|
|
Gesture and touch controlled video player interface for mobile devices |
| |
Shelley Buchinger,
Ewald Hotop,
Helmut Hlavacs,
Francesca De Simone,
Touradj Ebrahimi
|
|
Pages: 699-702 |
|
doi>10.1145/1873951.1874055 |
|
Full text: PDF
|
|
Today, mobile communication devices allow users to access a wide variety of multimedia content and services. In order to improve user experience and device usability, the design of interfaces and interaction techniques for mobile devices has focused on new modalities, other than those used for desktop computers. In this paper, we describe a novel gesture-controlled video player interface for mobile devices. The results of a usability study confirm that users would definitely like to adopt most of the proposed features. Furthermore, the responsiveness and reliability of the interface have been studied. Measured response times have been found to be within acceptable boundaries, and the number of unrecognised haptic controls is limited.
|
|
|
Eyes do not lie: spontaneous versus posed smiles |
| |
Hamdi Dibeklioglu,
Roberto Valenti,
Albert Ali Salah,
Theo Gevers
|
|
Pages: 703-706 |
|
doi>10.1145/1873951.1874056 |
|
Full text: PDF
|
|
Automatic detection of spontaneous versus posed facial expressions has received a lot of attention in recent years. However, almost all published work in this area uses complex facial features or multiple modalities, such as head pose and body movements together with facial features. Moreover, the results of these studies are not reported on public databases. In this paper, we focus on eyelid movements to classify spontaneous versus posed smiles and propose distance-based and angular features for eyelid movements. We assess the reliability of these features with continuous HMM, k-NN and naive Bayes classifiers on two different public datasets. Experimentation shows that our system provides classification rates of up to 91 per cent for posed smiles and up to 80 per cent for spontaneous smiles using only eyelid movements. We additionally compare the discrimination power of movement features from different facial regions for the same task.
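A minimal sketch of the feature-plus-classifier pipeline, under our own assumptions: the specific eyelid statistics below (amplitude and speed of the eyelid opening signal) are illustrative stand-ins for the paper's distance-based and angular features, and k-NN is just one of the three classifiers mentioned:

```python
# Summarize a per-frame eyelid-opening trajectory with simple amplitude/speed
# statistics, then classify posed vs. spontaneous with k-NN on toy data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def eyelid_features(lid_gap):                # lid_gap: per-frame eyelid opening
    d = np.diff(lid_gap)
    return [lid_gap.max() - lid_gap.min(),   # amplitude of the movement
            np.abs(d).mean(),                # mean closing/opening speed
            d.max()]                         # fastest single-frame opening

rng = np.random.default_rng(1)
X = [eyelid_features(rng.random(40)) for _ in range(10)]
y = [0, 1] * 5                               # 0 = posed, 1 = spontaneous (toy labels)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(X[:2]))
```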
|
|
|
SESSION: Short - S2/content/systems track |
| |
Zhengyou Zhang
|
|
|
|
|
Integrating web 2.0 resources by wikipedia |
| |
Chen Liu,
Bing Cui,
Anthony K.H. Tung
|
|
Pages: 707-710 |
|
doi>10.1145/1873951.1874058 |
|
Full text: PDF
|
|
The concept of Web 2.0 has become prevalent and popular over the past few years. People are able to share and manage their own resources in Web 2.0 systems. The abundance of Web 2.0 resources in various media formats calls for better resource integration, intending to enrich the user experience in both browsing and searching. Though Web 2.0 resources come in various modalities, their tags act as an intuitive medium to connect resources together. However, tagging is by nature an ad hoc activity. Tags often contain noise and are affected by the subjective inclinations of taggers. Consequently, linking resources simply by tags is not reliable. In this paper, we propose an effective approach for linking tagged resources to concepts extracted from Wikipedia, which has become a fairly reliable reference over the last few years. Compared to the tags, the concepts are therefore of higher quality. Empirical experiments were conducted, and the results validate the effectiveness of our framework.
|
|
|
Vicept: link visual features to concepts for large-scale image understanding |
| |
Zhipeng Wu,
Shuqiang Jiang,
Liang Li,
Peng Cui,
Qingming Huang,
Wen Gao
|
|
Pages: 711-714 |
|
doi>10.1145/1873951.1874059 |
|
Full text: PDF
|
|
Noticing the paradox of visual polysemia and concept polymorphism, this paper proposes a new perspective called "Vicept" to associate elementary visual features and cognitive concepts. Firstly, a carefully prepared large image dataset and associated concepts are established. Secondly, we extract local interest points as the elementary visual features, cluster them into visual words, and use Fuzzy Concept Membership Updating (FCMU) to build the link between the codebook and concept membership distributions. This bottommost feature is called a "Vicept word". Then, global-level Vicept features are established to correlate concepts with (partial) images. Finally, we validate our Vicept approach and show its effectiveness in a concept detection task. Our approach is independent of case-specific training data and thus can be extended to web-scale scenarios.
|
|
|
Analyzing and predicting sentiment of images on the social web |
| |
Stefan Siersdorfer,
Enrico Minack,
Fan Deng,
Jonathon Hare
|
|
Pages: 715-718 |
|
doi>10.1145/1873951.1874060 |
|
Full text: PDF
|
|
In this paper we study the connection between sentiment of images expressed in metadata and their visual content in the social photo sharing environment Flickr. To this end, we consider the bag-of-visual words representation as well as the color distribution of images, and make use of the SentiWordNet thesaurus to extract numerical values for their sentiment from accompanying textual metadata. We then perform a discriminative feature analysis based on information theoretic methods, and apply machine learning techniques to predict the sentiment of images. Our large-scale empirical study on a set of over half a million Flickr images shows a considerable correlation between sentiment and visual features, and promising results towards estimating the polarity of sentiment in images.
|
|
|
Landmark image classification using 3D point clouds |
| |
Xian Xiao,
Changsheng Xu,
Jinqiao Wang
|
|
Pages: 719-722 |
|
doi>10.1145/1873951.1874061 |
|
Full text: PDF
|
|
Most of the existing approaches to landmark image classification utilize either holistic features or interest points from the whole image to train the classification model, which may lead to unsatisfactory results due to the involvement of much information not located on the landmark in the training process. In this paper, we propose a novel approach to improve landmark image classification via a process of 2D-to-3D reconstruction and 3D-to-2D projection of iconic landmark images. Particularly, we first select iconic images from labeled landmark image collections to reconstruct a 3D landmark represented as point clouds. Then, the 3D point clouds are projected back onto the same iconic images to obtain the landmark region of each iconic image, and SIFT features are subsequently extracted from the landmark region to construct a k-dimensional tree (kd-tree) for each landmark. This process is able to filter out noise points corresponding to cluttered background and non-landmark objects in the iconic images. Finally, unlabeled images can be classified into predefined landmark categories based on the number of matched feature points between the image features and the kd-trees. The experimental results and comparison with the state of the art demonstrate the effectiveness of our approach.
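The final classification step lends itself to a short sketch: count SIFT matches between an unlabeled image and each landmark's kd-tree, then pick the landmark with the most matches. Descriptor extraction is outside the scope of the sketch; `query_desc`, `landmark_trees`, and the ratio test are illustrative assumptions:

```python
# A minimal sketch of kd-tree based landmark classification with Lowe's ratio
# test on nearest/second-nearest neighbor distances.
import numpy as np
from scipy.spatial import cKDTree

def classify(query_desc, landmark_trees, ratio=0.8):
    scores = {}
    for name, tree in landmark_trees.items():
        d, _ = tree.query(query_desc, k=2)   # nearest and second-nearest
        scores[name] = int(np.sum(d[:, 0] < ratio * d[:, 1]))
    return max(scores, key=scores.get)

trees = {"eiffel": cKDTree(np.random.rand(100, 128)),      # toy 128-d "SIFT"
         "colosseum": cKDTree(np.random.rand(100, 128))}
print(classify(np.random.rand(20, 128), trees))
```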
|
|
|
Portfolio theory of multimedia fusion |
| |
Xiangyu Wang,
Mohan Kankanhalli
|
|
Pages: 723-726 |
|
doi>10.1145/1873951.1874062 |
|
Full text: PDF
|
|
The number of multimedia applications has been increasing over the past two decades. Multimedia information fusion has therefore attracted significant attention, with many techniques having been proposed. However, the uncertainty and correlation among different modalities have not been fully considered in the existing fusion methods. In general, the predictions of individual modalities are uncertain; furthermore, many modalities are correlated with each other. In this paper, we propose a novel multimedia fusion method based on portfolio theory. Portfolio theory is a widely used financial investment theory dealing with how to allocate funds across assets. The key idea is to maximize the performance of the allocated portfolio while minimizing the risk in returns. We adapt this approach to multimodal fusion to derive optimal weights that can achieve good fusion results. The optimization is formulated as a quadratic programming problem. Experimental results with both simulated data and real data confirm the theoretical insights and show promising results.
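The portfolio analogy can be made concrete with a small sketch: treat each modality as an "asset" whose expected return is its estimated accuracy and whose risk is the covariance of its errors with other modalities. The closed-form mean-variance weights below are a classical illustration of the idea; the paper itself solves a quadratic program:

```python
# A minimal sketch of portfolio-style fusion weights: w proportional to
# Sigma^{-1} mu (the classical tangency-portfolio solution), normalized to sum
# to one. mu and Sigma here are assumed toy values.
import numpy as np

def fusion_weights(mu, Sigma):
    w = np.linalg.solve(Sigma, mu)   # solve Sigma w = mu
    return w / w.sum()

mu = np.array([0.8, 0.7, 0.6])                 # per-modality expected accuracy
Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.05, 0.02],
                  [0.00, 0.02, 0.06]])         # error covariance across modalities
print(fusion_weights(mu, Sigma))
```

Modalities that are accurate but weakly correlated with the others receive larger weights, which is exactly the diversification effect the abstract appeals to.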
|
|
|
Exploiting noisy visual concept detection to improve spoken content based video retrieval |
| |
Stevan Rudinac,
Martha Larson,
Alan Hanjalic
|
|
Pages: 727-730 |
|
doi>10.1145/1873951.1874063 |
|
Full text: PDF
|
|
In this paper, we present a technique for unsupervised construction of concept vectors, concept-based representations of complete video units, from the noisy shot-level output of a set of visual concept detectors. We deploy these vectors to improve spoken-content-based video retrieval using Query Expansion Selection (QES). Our QES approach analyzes results lists returned in response to several alternative query expansions, applying a coherence indicator calculated on top-ranked items to choose the appropriate expansion. The approach is data driven, does not require prior training and relies solely on the analysis of the collection being queried and the results lists produced for the given query text. The experiments, performed on two datasets, TRECVID 2007/2008 and TRECVID 2009, demonstrate the effectiveness of our approach and show that a small set of well-selected visual concept detectors is sufficient to improve retrieval performance.
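A coherence-driven selection step can be sketched compactly: score each candidate expansion by the mean pairwise similarity of the concept vectors of its top-ranked results and keep the most coherent one. The indicator below (average cosine similarity) and the variable names are our assumptions for illustration:

```python
# A minimal sketch of query expansion selection via a coherence indicator.
# `results_per_expansion` maps an expansion string to the concept vectors of
# its top-k retrieved items.
import numpy as np

def coherence(vectors):
    V = np.asarray(vectors, dtype=float)
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-9
    sims = V @ V.T
    n = len(V)
    return (sims.sum() - n) / (n * (n - 1))   # mean cosine over distinct pairs

def select_expansion(results_per_expansion):
    return max(results_per_expansion,
               key=lambda e: coherence(results_per_expansion[e]))

cands = {"jaguar car": np.random.rand(5, 8), "jaguar cat": np.random.rand(5, 8)}
print(select_expansion(cands))
```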
|
|
|
End-to-end stochastic scheduling of scalable video over time-varying channels |
| |
Nesrine Changuel,
Nicholas Mastronarde,
Mihaela Van der Schaar,
Bessem Sayadi,
Michel Kieffer
|
|
Pages: 731-734 |
|
doi>10.1145/1873951.1874064 |
|
Full text: PDF
|
|
This paper addresses the problem of video-on-demand delivery over a time-varying wireless channel. Packet scheduling and buffer management are jointly considered for scalable video transmission to adapt to the changing channel conditions. A proxy-based filtering algorithm among scalable layers is considered to maximize the decoded video quality at the receiver side while keeping a minimum playback margin. This problem is cast in the context of Markov Decision Processes, which allows the design of foresighted policies maximizing a long-term reward. Experimental results illustrate the benefit of this approach compared to a short-term policy in terms of average PSNR improvement.
|
|
|
Context dependent SVMs for interconnected image network annotation |
| |
Hichem Sahbi,
Xi Li
|
|
Pages: 735-738 |
|
doi>10.1145/1873951.1874065 |
|
Full text: PDF
|
|
The exponential growth of interconnected networks, such as Flickr, currently makes them the standard way to share and explore data, where users contribute content and refer to others. These interconnections create valuable information that can enhance the performance of many tasks in information retrieval, including ranking and annotation. We introduce in this paper a novel image annotation framework based on support vector machines (SVMs) and a new class of kernels referred to as context-dependent. The method goes beyond the naive use of intrinsic low-level features (such as color, texture, shape, etc.) and context-free kernels, in order to design a kernel function applicable to interconnected databases such as social networks. The main contribution of our method is a variational framework which helps design this function using both intrinsic features and the underlying contextual information. This function also converges to a positive definite fixed point, usable for SVM training and other kernel methods. When plugged into SVMs, our context-dependent kernel consistently improves the performance of image annotation, compared to context-free kernels, on hundreds of thousands of Flickr images.
|
|
|
A novel video hash algorithm |
| |
Li Weng,
Bart Preneel
|
|
Pages: 739-742 |
|
doi>10.1145/1873951.1874066 |
|
Full text: PDF
|
|
Perceptual hashing is an emerging solution for identification and authentication of multimedia content. In this work, a video hash algorithm is proposed. This algorithm computes a 180-bit hash value for videos of arbitrary lengths. The hash value can resist common signal processing and slight geometric distortion. The basic mechanism of the algorithm is to compute and accumulate frame hash values. A frame hash algorithm is designed by combining semi-global and local features. Semi-global features are extracted by computing several statistics from image blocks. Local features are extracted by computing a compact edge density map around stable feature points. The good performance of the new algorithm has been demonstrated by experiments.
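The semi-global part of such a frame hash can be sketched in a few lines: split the frame into blocks, compute a simple statistic per block, and binarize against the median. The statistic, grid size, and binarization rule below are illustrative assumptions; the paper's exact statistics and its local-feature component are omitted:

```python
# A minimal sketch of block-statistics frame hashing. Accumulating such frame
# hashes over time yields a fixed-length video hash.
import numpy as np

def frame_hash_bits(frame, grid=6):
    h, w = frame.shape
    bh, bw = h // grid, w // grid
    means = np.array([frame[i*bh:(i+1)*bh, j*bw:(j+1)*bw].mean()
                      for i in range(grid) for j in range(grid)])
    # One bit per block: is the block brighter than the median block?
    return (means > np.median(means)).astype(np.uint8)

bits = frame_hash_bits(np.random.rand(120, 160))
print(bits, bits.size)   # grid*grid = 36 bits for this toy frame
```

Median-thresholded block means are robust to global brightness and mild compression changes, which is the kind of invariance a perceptual hash targets.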
|
|
|
Age classification for pose variant and occluded faces |
| |
Wei-Ta Chu,
Wen-Long Liu,
Jen-Yu Yu
|
|
Pages: 743-746 |
|
doi>10.1145/1873951.1874067 |
|
Full text: PDF
|
|
We extend the object class invariant (OCI) model to age classification, for pose variant and occluded faces. With the OCI model, we first localize faces from images captured in arbitrary views, and then determine the most distinctive features. Relationships between feature points and the invariant vector are described in terms of geometry and appearance information, in the form of a probabilistic model. In contrast to previous works on age classification/estimation, we emphasize that this method is especially useful for faces captured in real-world situations.
|
|
|
Movie genre classification via scene categorization |
| |
Howard Zhou,
Tucker Hermans,
Asmita V. Karandikar,
James M. Rehg
|
|
Pages: 747-750 |
|
doi>10.1145/1873951.1874068 |
|
Full text: PDF
|
|
This paper presents a method for movie genre categorization of movie trailers, based on scene categorization. We view our approach as a step forward from using only low-level visual feature cues, towards the eventual goal of high-level semantic understanding of feature films. Our approach decomposes each trailer into a collection of keyframes through shot boundary analysis. From these keyframes, we use state-of-the-art scene detectors and descriptors to extract features, which are then used for shot categorization via unsupervised learning. This allows us to represent trailers using a bag-of-visual-words (bovw) model with shot classes as vocabularies. We approach the genre classification task by mapping bovw temporally structured trailer features to four high-level movie genres: action, comedy, drama or horror films. We have conducted experiments on 1239 annotated trailers. Our experimental results demonstrate that exploiting scene structures improves film genre classification compared to using only low-level visual features.
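The bovw pipeline described above reduces to a few standard steps, sketched here on toy data. Keyframe extraction and scene descriptors are assumed to be given; the vocabulary size, classifier, and all names are our choices for illustration:

```python
# A minimal sketch of bag-of-visual-words trailer representation: cluster
# keyframe descriptors into "shot classes", histogram each trailer over those
# classes, and train a linear classifier on genre labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def bovw_histograms(trailer_descriptors, k=16, seed=0):
    all_desc = np.vstack(trailer_descriptors)
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(all_desc)
    hists = [np.bincount(km.predict(d), minlength=k) / len(d)
             for d in trailer_descriptors]
    return np.array(hists), km

rng = np.random.default_rng(0)
descs = [rng.random((30, 10)) for _ in range(8)]   # 8 toy "trailers"
X, _ = bovw_histograms(descs)
y = [0, 1, 0, 1, 0, 1, 0, 1]                       # toy genre labels
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:2]))
```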
|
|
|
Unsupervised summarization of rushes videos |
| |
Yang Liu,
Feng Zhou,
Wei Liu,
Fernando De la Torre,
Yan Liu
|
|
Pages: 751-754 |
|
doi>10.1145/1873951.1874069 |
|
Full text: PDF
|
|
This paper proposes a new framework that formulates summarization of rushes video as an unsupervised learning problem. We pose the problem of video summarization as one of time-series clustering, and propose Constrained Aligned Cluster Analysis (CACA). CACA combines kernel k-means with the Dynamic Time Alignment Kernel (DTAK) and, unlike previous work, jointly optimizes video segmentation and shot clustering. CACA is efficiently solved via dynamic programming. Experimental results on the TRECVID 2007 and 2008 BBC rushes video summarization databases validate the accuracy and effectiveness of CACA.
|
|
|
Negotiating multimedia advertising with attention owners |
| |
Yue Zhang,
Nadeem Jamali
|
|
Pages: 755-758 |
|
doi>10.1145/1873951.1874070 |
|
Full text: PDF
|
|
Advertising is increasingly an integral part of multimedia delivery over the Internet. Traditionally, brokers -- intermediaries between content providers, advertisers, and viewers -- have determined the fine balance between the content desired by viewers and the advertising embedded in the content. Parameters of this balance are informed by fields of psychology and marketing, which help target viewer segments identified by their viewing habits. Oddly, mechanisms available to individual viewers to inform this balance are coarse grained: one can change the channel. We take an owned attention view of the problem with an explicit treatment of both attention and its ownership. This approach specializes the CyberOrgs model for encapsulating computations with owned resources available for their execution. Particularly, we treat a multimedia consumer's attention space as a precious resource owned by the viewer. Viewers pay for the content they wish to view in dollars, as well as in terms of their attention. Advertisers pay for viewers' attention by subsidizing the cost of their content. This paper presents the rationale, design, implementation, and evaluation of our solution, FlexAdSense. Our approach affords finer grained control capability to viewers than what is offered by existing approaches. Pluggable customizable policies specify negotiation positions of different parties, scalably automating typical negotiations. Experimental work demonstrates that the approach scales well, and informs decisions about allocating resources to servers.
|
|
|
ReDi: an interactive virtual display system for ubiquitous devices |
| |
Wen Sun,
Yan Lu,
Shipeng Li
|
|
Pages: 759-762 |
|
doi>10.1145/1873951.1874071 |
|
Full text: PDF
|
|
In this paper, we present an interactive virtual display system to facilitate the ubiquitous user interaction with heterogeneous devices. By using small-size programmable hardware and wearable sensors, any display device (referred to as display surface) can act as a thin client for users to interact with the different remote devices. Under a flexible system architecture for local and remote devices' communication and collaboration, several techniques, such as adaptive screen compression, interactive ROI control, and accelerometer-based pointing input, are developed to improve the system performance and user experience. Evaluations show that the proposed system can efficiently utilize the remote computing resources and local display capabilities of ubiquitous devices, which will greatly benefit interactive multimedia applications.
|
|
|
A proxy-based mobile web browser |
| |
Huifeng Shen,
Zhaotai Pan,
Haicheng Sun,
Yan Lu,
Shipeng Li
|
|
Pages: 763-766 |
|
doi>10.1145/1873951.1874072 |
|
Full text: PDF
|
|
In this paper, we present a proxy-based mobile web browser with rich experiences. We use server-side web parsing and rendering to leverage the browser computing logic. We use a composite screen format to represent the display of the web content, incorporating the web background screen and the dynamic web objects. We then employ a slice-based screen encoding scheme to efficiently compress the web background screen. Besides the display screen of the web content, we also send the side information of the web objects to enable the designed object-level interaction mechanisms. The experimental results show that our browser achieves superior browsing speed compared with the native browser and yields much better visual quality than the existing proxy-based browser.
|
|
|
Optimal collusion attack for digital fingerprinting |
| |
Hui Feng,
Hefei Ling,
Fuhao Zou,
Weiqi Yan,
Zhengding Lu
|
|
Pages: 767-770 |
|
doi>10.1145/1873951.1874073 |
|
Full text: PDF
|
|
The collusion attack is a cost-efficient attack against digital fingerprinting in which groups of users combine their fingerprinted content for the purpose of attenuating or removing the fingerprints. A recently introduced gradient attack, which appeared in ACM MM 2004, demonstrated its efficacy in defeating most spread-spectrum based fingerprints. In this paper, we propose a novel collusion attack strategy, Iterative Optimization Collusion Attack (IOCA), which is based upon the gradient attack and the geometric principle of a Voronoi diagram. The simulation results, under the assumption that orthogonal fingerprints are used, show that the proposed collusion attack performs more effectively than the gradient attack. Fewer than five fingerprinted copies of the content suffice to defeat orthogonal fingerprints accommodating many thousands of users, while high perceptual quality of the attacked content is maintained after the proposed collusion attack.
|
|
|
Novel framework for single/multi-frame super-resolution using sequential Monte Carlo method |
| |
Toshie Misu,
Yasutaka Matsuo,
Shinichi Sakaida,
Yoshiaki Shishikui
|
|
Pages: 771-774 |
|
doi>10.1145/1873951.1874074 |
|
Full text: PDF
|
|
We propose a novel super-resolution (SR) framework based on a sequential Monte Carlo (SMC) method, which is capable of robust optimization, for solving the inverse problem of degradation processes of imagery and sampling. The SR image is estimated from a set of multiple hypotheses, which are sequentially reorganized by evaluating their consistency with the input image. The concepts of norm regularization and motion registration in single/multi-frame SR are mapped into stochastic processes of an SMC's proposal distribution. The experiments showed that our framework is capable of seamlessly restoring both static and moving regions of degraded pictures.
|
|
|
Similarity content search in content centric networks |
| |
Petros Daras,
Theodoros Semertzidis,
Lambros Makris,
Michael G. Strintzis
|
|
Pages: 775-778 |
|
doi>10.1145/1873951.1874075 |
|
Full text: PDF
|
|
Content searching and downloading are the two dominant activities of Internet users today, despite the fact that the Internet was not originally architected to serve them. Content Centric Networking is a new trend in the research community to build network architectures that route content by name rather than by the host's network address, so as to efficiently deal with content persistence, availability, and authenticity. In this work we propose an extension to the Content Centric Network protocol to support content search as a native process of the network. By using object descriptors to represent the actual content objects and integrating these descriptors into the network protocol, we can search for and retrieve content from the network not only by name but also by the content itself. In this approach, searching for information is a process distributed across the reachable network. Moreover, content aggregation is handled by the end user rather than by content aggregation portals, thematic content search engines, or information curators. Search thus becomes part of the network rather than an application, paving the way for many novel applications that were previously not possible.
|
|
|
Accelerated IPTV channel change with transcoded unicast bursting |
| |
Zhi Li,
Ali C. Begen,
Xiaoqing Zhu,
Bernd Girod
|
|
Pages: 779-782 |
|
doi>10.1145/1873951.1874076 |
|
Full text: PDF
|
|
We study video transcoding for accelerated channel changes in IPTV systems. Video transcoding at the Retransmission Server not only reduces the channel change latency, but also reduces the duration and data size of the unicast burst stream used for rapid acquisition. We develop an analytical model to capture the fundamental trade-offs in this system. The model is then used to characterize the potential savings from transcoding the unicast stream. Analysis and simulation results show that the stream compression factor linearly affects the savings in channel change latency, and superlinearly affects the savings in unicast burst duration (or data size).
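The linear versus superlinear behavior can be sanity-checked with a toy rapid-acquisition model: the burst must deliver the backlog plus everything that plays out while catching up, at a capped burst bandwidth. The model and parameter values below are a simplified reading for illustration, not the paper's derivation.

```python
def burst_stats(D=2.0, R=4.0, c=1.5, f=1.0):
    """Toy rapid-acquisition model (an assumption, not the paper's exact model).

    D: backlog to catch up (seconds of content)
    R: nominal stream rate (Mbps); burst bandwidth is capped at c*R
    f: transcoding compression factor (burst content encoded at f*R)
    Returns (burst duration in s, burst data in Mbit).
    """
    b = c * R
    # Deliver D + T seconds of content, at b/(f*R) content-seconds per second.
    T = D * f * R / (b - f * R)
    return T, b * T

for f in (1.0, 0.8, 0.6):
    T, data = burst_stats(f=f)
    print(f"f={f:.1f}: burst {T:.2f} s, {data:.1f} Mbit")
```

With these numbers, cutting the rate by 20% (f=0.8) shrinks the burst duration from 4.0 s to about 2.3 s, i.e. faster than linearly, which matches the superlinear claim.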
|
|
|
QoE-based rate adaptation scheme selection for resource-constrained wireless video transmission |
| |
Srisakul Thakolsri,
Wolfgang Kellerer,
Eckehard Steinbach
|
|
Pages: 783-786 |
|
doi>10.1145/1873951.1874077 |
|
Full text: PDF
|
|
This paper proposes a Quality of Experience (QoE) based rate adaptation scheme selection approach for multi-user wireless video delivery. Transcoding and packet dropping are used as example rate adaptation schemes, and we investigate their impact on user-perceived video quality. In the presence of constrained computation resources, the most suitable rate adaptation scheme is determined for each video stream such that the overall quality degradation is minimized. The proposed scheme selection approach is integrated with QoE-based resource allocation in the presence of constrained transmission resources. Simulation results obtained from an emulated High Speed Downlink Packet Access (HSDPA) network show that the QoE-based approach leads to significant improvements in user-perceived quality compared to other approaches, including a non-optimized HSDPA system.
|
|
|
Precise indoor localization using smart phones |
| |
Eladio Martin,
Oriol Vinyals,
Gerald Friedland,
Ruzena Bajcsy
|
|
Pages: 787-790 |
|
doi>10.1145/1873951.1874078 |
|
Full text: PDF
|
|
We present an indoor localization application leveraging the sensing capabilities of current state-of-the-art smart phones. To the best of our knowledge, our application is the first to be implemented on smart phones that integrates both the offline and online phases of fingerprinting, delivering an accuracy of up to 1.5 meters. In particular, we have studied the possibilities offered by the WiFi radio, cellular radio, accelerometer, and magnetometer already embedded in smart phones, with the intention of building a multimodal solution for localization. We have also implemented a new approach to the statistical processing of radio signal strengths, showing that it can outperform existing deterministic techniques.
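The online phase of fingerprinting reduces to comparing an observed signal-strength vector against a database of labeled scans. Below is a deterministic k-nearest-neighbor baseline of the kind the paper's statistical processing is compared against; the toy database, access points, and k are invented for illustration.

```python
# (x, y) location -> RSSI readings for access points [AP0, AP1, AP2] (dBm)
FINGERPRINT_DB = {
    (0.0, 0.0): [-40, -70, -80],
    (3.0, 0.0): [-55, -60, -75],
    (0.0, 3.0): [-50, -72, -65],
    (3.0, 3.0): [-65, -58, -60],
}

def locate(observed, k=2):
    """k-nearest-neighbor position estimate in RSSI signal space."""
    ranked = sorted(
        FINGERPRINT_DB.items(),
        key=lambda kv: sum((a - b) ** 2 for a, b in zip(kv[1], observed)),
    )[:k]
    xs = [pos[0] for pos, _ in ranked]
    ys = [pos[1] for pos, _ in ranked]
    return sum(xs) / k, sum(ys) / k  # centroid of the k closest fingerprints

print(locate([-52, -63, -70]))
```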
|
|
|
Trading bandwidth for playback lag: can active peers help? |
| |
Dongbo Huang,
Jin Zhao,
Xin Wang
|
|
Pages: 791-794 |
|
doi>10.1145/1873951.1874079 |
|
Full text: PDF
|
|
P2P live streaming systems suffer from long playback lag in lag-sensitive scenarios. In this paper, we propose a new approach to reducing the playback lag in P2P live streaming systems. According to measurement studies, there exists a certain number of active peers who stay longer and contribute more bandwidth than other peers. Inspired by this, we propose a tiered overlay design in which peers are organized into three tiers based on their degree of activity, and we develop a set of algorithms to evaluate the peers' degrees of activity. Specifically, the backbone of the overlay consists of the high-activity peers in tier 1. These active peers are responsible for diffusing newly generated fresh chunks to peers located in all the involved Autonomous Systems (ASes); they contribute more bandwidth and thus enjoy shorter playback lag. Furthermore, an adaptive biased neighbor selection algorithm is employed among non-backbone peers to keep traffic local. Extensive simulations show that the proposed algorithms can greatly reduce the average playback lag and cross-ISP traffic.
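One plausible reading of the tiering rule, sketched below: score each peer from its session age and contributed upload, then bucket peers by threshold. The weights, normalizers, and thresholds are hypothetical; the paper's activity-evaluation algorithms are more involved.

```python
def activity_score(uptime_min, uploaded_mb, w_time=0.5, w_up=0.5):
    # Crude normalization; weights and caps are hypothetical, not from the paper.
    return w_time * min(uptime_min / 120.0, 1.0) + w_up * min(uploaded_mb / 500.0, 1.0)

def assign_tier(score, t1=0.7, t2=0.4):
    """Tier 1 = backbone (most active), tier 3 = least active."""
    return 1 if score >= t1 else (2 if score >= t2 else 3)

peers = {"A": (180, 800), "B": (60, 120), "C": (5, 10)}
for name, (uptime, uploaded) in peers.items():
    s = activity_score(uptime, uploaded)
    print(name, round(s, 2), "-> tier", assign_tier(s))
```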
|
|
|
3D video transcoding for virtual views |
| |
Shujie Liu,
Chang Wen Chen
|
|
Pages: 795-798 |
|
doi>10.1145/1873951.1874080 |
|
Full text: PDF
|
|
The recent development of three-dimensional video (3DV) has been vigorously driving the Multiview Video Coding (MVC) standard developed by the Joint Video Team as an amendment to H.264/AVC, as well as the new 3DV standard developed by MPEG. It is expected that 3DV content will soon be available to a wide range of media consumers. Because of heterogeneous networks and terminal devices, many end users may lack (1) the 3DV decoder and view-synthesis software on their devices, or (2) adequate network bandwidth for 3DV content delivery. A 3DV transcoder is therefore necessary for users without a 3DV decoder to enjoy 3DV services, as well as for service providers to better control the virtual-view quality delivered to all 3DV customers. In this paper, we report the first 3DV transcoding scheme for virtual views, which is able to generate a single-view bitstream that can be decoded by H.264/AVC. The key idea of the proposed transcoding is to generate candidate motion vectors and modes from the motion information of other views in the original bitstream, making the best use of inter-view correlation. These candidate motion vectors and modes are then used to encode a virtual view with significantly reduced complexity for decoding by H.264/AVC. Simulation results show that, compared to the straightforward cascade algorithm, the proposed transcoding incurs only minor performance loss at a significantly reduced computation cost. Such low-complexity transcoding can be deployed at various media gateways to deliver 3DV content to non-3DV devices.
|
|
|
Pull-patching: a combination of multicast and adaptive segmented HTTP streaming |
| |
Espen Jacobsen,
Carsten Griwodz,
Pål Halvorsen
|
|
Pages: 799-802 |
|
doi>10.1145/1873951.1874081 |
|
Full text: PDF
|
|
Multicast delivery for video streaming has gained credibility with the introduction of commercial IPTV. We therefore revisit patching, a video-on-demand idea from the 1990s. We have built Pull-Patching, an approach that combines the patching ideas with adaptive segmented HTTP streaming, the unicast technique used by most commercial providers of large-scale, true video-on-demand on the Internet today. The prototype is tested in a combined Internet and lab environment where we show the influence of practical factors like packet loss, delay, and limited resource availability, and identify several details that require further study.
|
|
|
SESSION: Short - S3/applications/content track |
| |
Max Muehlhaeuser
|
|
|
|
|
K-way min-max cut for image clustering and junk images filtering from Google images |
| |
Feng Xie,
Yi Shen,
Xiaofei He
|
|
Pages: 803-806 |
|
doi>10.1145/1873951.1874083 |
|
Full text: PDF
|
|
Currently, most existing image search engines, such as Google Images, index web images mainly using text keywords extracted from the surrounding context, which may return a large amount of junk information. We propose a novel clustering-based filtering method to filter out these junk images. First, we apply a K-way min-max cut to cluster the images returned by Google into multiple clusters based on a mixture of feature kernels, with the kernel weights determined automatically instead of fixed by hand. Second, we select the best cluster in a robust way and rank all the remaining clusters according to their similarity to the best one. Finally, the low-ranked clusters are filtered out as junk clusters. In experiments we obtain filtering performance competitive with the current state of the art, and significantly improve Google Images search results.
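The clustering step can be approximated with a standard spectral relaxation on a precomputed kernel. The sketch below is a stand-in for illustration only: it uses normalized spectral clustering rather than the authors' K-way min-max cut solver, and a single RBF kernel replaces their learned mixture of kernels.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
good = rng.normal(0, 0.3, size=(30, 16))   # stand-in features: coherent images
junk = rng.normal(3, 1.5, size=(10, 16))   # stand-in features: scattered junk
X = np.vstack([good, junk])

# Single RBF kernel as a placeholder for the paper's mixture of feature kernels.
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-d2 / d2.mean())

labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(K)

# Simplified "best cluster" rule: keep the largest cluster as non-junk.
best = np.bincount(labels).argmax()
print("kept", int((labels == best).sum()), "images as the best cluster")
```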
|
|
|
Smart video systems in police cars |
| |
Amirali Jazayeri,
Hongyuan Cai,
Mihran Tuceryan,
Jiang Yu Zheng
|
|
Pages: 807-810 |
|
doi>10.1145/1873951.1874084 |
|
Full text: PDF
|
|
The use of video cameras in police cars has been found to have significant value, and the number of such installed systems has been increasing. In addition to recording the events of routine traffic stops for later use in legal settings, in-car video can be analyzed in real time or near real time to detect critical events and notify police headquarters for help. This paper presents methods for detecting critical events in such police car videos. The specific critical events are a person running out of a stopped car and an officer falling down while approaching a stopped car. In these situations, the aim is to alert the control center immediately for backup forces, especially in the latter case, when the officer is incapacitated. To enable real-time video processing and a quick response without employing complex, slow, and brittle video processing algorithms, we use a reduced spatiotemporal representation (a 1D projection profile) and a Hidden Markov Model to detect these events. The methods are tested on many video shots under various environmental and illumination conditions.
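A tiny sketch of this kind of machinery: frames are reduced to 1D projection profiles, quantized into discrete symbols, and decoded with a Viterbi pass over a small HMM. The two states, the symbol alphabet, and all probabilities below are invented placeholders, not the paper's trained models.

```python
import numpy as np

def projection_profile(frame_mask):
    """Reduce a binary foreground mask (H x W) to a 1D column profile."""
    return frame_mask.sum(axis=0)

def viterbi(obs, log_A, log_pi, log_B):
    """Minimal Viterbi decoder over discrete observation symbols (log-probs)."""
    T, N = len(obs), len(log_pi)
    d = log_pi + log_B[:, obs[0]]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        trans = d[:, None] + log_A        # score of moving from state i to j
        back[t] = trans.argmax(axis=0)
        d = trans.max(axis=0) + log_B[:, obs[t]]
    path = [int(d.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two toy states (0 = "walking", 1 = "running"); symbols = quantized profile motion.
log_A = np.log([[0.9, 0.1], [0.2, 0.8]])
log_pi = np.log([0.7, 0.3])
log_B = np.log([[0.8, 0.2], [0.3, 0.7]])   # P(symbol | state), symbols {0, 1}
print(viterbi([0, 0, 1, 1, 1, 0], log_A, log_pi, log_B))
```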
|
|
|
Interactive inquiry for object of interest in video playback by motion-augmented graph cut |
| |
Po-Nung Tseng,
Yen-Liang Lin,
Winston H. Hsu
|
|
Pages: 811-814 |
|
doi>10.1145/1873951.1874085 |
|
Full text: PDF
|
|
Touch-based displays (devices) have enabled rich interactions between videos and users. Objects appearing in videos often interest users, who want to learn more about them. In this paper, we propose a video playback system that lets users interactively query objects of interest in videos. Since the text information accompanying videos might not be strongly related to the object of interest, we adopt visual appearance features to retrieve similar objects from large image collections. The tags associated with the retrieved images are used to reveal information related to the object of interest for further exploration. Relying solely on a single viewpoint of the object for the query is not robust, as it may suffer from differences in pose and occlusions. We therefore present a novel video object segmentation approach, based on a 3D graph cut framework, to improve retrieval precision. To ensure prompt response and effectiveness, we augment the algorithm with compressed-domain motion vectors; compared with the prior method, the processing speed of our approach is significantly faster. Experiments on community-contributed videos demonstrate the effectiveness of our multi-frame object region query and the improvement in retrieval precision.
|
|
|
GPS, compass, or camera?: investigating effective mobile sensors for automatic search-based image annotation |
| |
An-Jung Cheng,
Fang-Erh Lin,
Yin-Hsi Kuo,
Winston H. Hsu
|
|
Pages: 815-818 |
|
doi>10.1145/1873951.1874086 |
|
Full text: PDF
|
|
Recently, more and more types of sensors are being built into smart phones, each providing a different source of information. When a user takes a photo, information such as the image content, the location, and even the direction the user faces can help us understand the photo itself. Each of these factors can be treated as an input to an image search system. However, most existing algorithms for image retrieval (or annotation) focus only on the content and location information of the images, completely ignore the important facing-direction factor, and lack insight into the capabilities of the sensors. In this paper, we propose a novel ranking algorithm that combines different sensors with a traditional content-based image retrieval system, and further apply it to annotate images. We evaluate different combinations of sensors and investigate how geolocation, image content, and compass direction influence image retrieval.
|
|
|
TwitterSigns: microblogging on the walls |
| |
Markus Buzeck,
Jörg Müller
|
|
Pages: 819-822 |
|
doi>10.1145/1873951.1874087 |
|
Full text: PDF
|
|
In this paper we present TwitterSigns, an approach to displaying microblogs on public displays. Two kinds of microblog entries (tweets) are selected for display: tweets that were posted in the immediate environment of the display, and tweets that were posted by people associated with the location where the displays are installed (locals). The prototype was tested in a university setting on 4 displays for 4 weeks and compared to the information system that usually runs on the displays (iDisplays). Using face detection, we show that people look significantly longer at TwitterSigns than at iDisplays. Interviews show that the relationship between viewer and poster, as well as the tweet content, are much more important than the time and location of the tweet. Viewers mostly recall and recognize tweets from people they know and tweets of apparent importance to themselves (like an apparent bomb found in the city center). Furthermore, TwitterSigns changes the way people use Twitter (e.g., they feel more responsible for what they tweet). Passers-by seem to scan only for keywords, and stop to read the whole tweet only if they find an interesting keyword.
|
|
|
Multi-exposure imaging on mobile devices |
| |
Natasha Gelfand,
Andrew Adams,
Sung Hee Park,
Kari Pulli
|
|
Pages: 823-826 |
|
doi>10.1145/1873951.1874088 |
|
Full text: PDF
|
|
Many natural scenes have a dynamic range that is larger than the dynamic range of a camera's image sensor. A popular approach to producing an image without under- and over-exposed areas is to capture several input images with varying exposure settings, and later merge them into a single high-quality result using offline image processing software. In this paper, we describe a system for creating images of high-dynamic-range (HDR) scenes that operates entirely on a mobile camera. Our system consists of an automatic HDR metering algorithm that determines which exposures to capture, a video-rate viewfinder preview algorithm that allows the user to verify the dynamic range that will be recorded, and a light-weight image merging algorithm that computes a high-quality result directly on the camera. By using our system, a photographer can capture, view, and share images of HDR scenes directly on the camera, without using offline image processing software.
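The merging step can be illustrated with a classic weight-and-average scheme in relative radiance space (a generic exposure-fusion sketch, not the paper's light-weight algorithm): mid-tone pixels get high weight, and each pixel value is normalized by its exposure time before blending.

```python
import numpy as np

def merge_exposures(images, exposure_times):
    """Weighted-average HDR merge of aligned 8-bit grayscale exposures.

    images: list of (H, W) uint8 arrays; exposure_times: seconds per shot.
    A hat-shaped weight favors mid-tone pixels; radiance = pixel / exposure.
    """
    acc = np.zeros(images[0].shape, dtype=np.float64)
    wsum = np.zeros_like(acc)
    for img, t in zip(images, exposure_times):
        z = img.astype(np.float64)
        w = 1.0 - np.abs(z - 127.5) / 127.5 + 1e-6  # hat weight, avoids div-by-zero
        acc += w * (z / t)
        wsum += w
    return acc / wsum                               # relative radiance map

rng = np.random.default_rng(1)
scene = rng.uniform(0, 4, size=(4, 4))              # synthetic "true" radiance
shots = [np.clip(scene * t * 64, 0, 255).astype(np.uint8) for t in (0.5, 1.0, 2.0)]
print(np.round(merge_exposures(shots, [0.5, 1.0, 2.0]), 1))
```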
|
|
|
Towards aesthetics: a photo quality assessment and photo selection system |
| |
Congcong Li,
Alexander C. Loui,
Tsuhan Chen
|
|
Pages: 827-830 |
|
doi>10.1145/1873951.1874089 |
|
Full text: PDF
|
|
Automatic photo quality assessment and selection systems are helpful for managing the large amount of consumer photos. In this paper, we present such a system based on evaluating the aesthetic quality of consumer photos. The proposed system focuses on photos with faces, which constitute an important part of consumer photo albums. The system makes three contributions: 1) we propose an aesthetics-based photo assessment algorithm that considers different aesthetics-related factors, including the technical characteristics of the photo and specific features related to faces; 2) based on the aesthetic measurement, we propose a cropping-based photo editing algorithm, which differs from prior work by eliminating unimportant faces before optimizing photo composition; 3) we also combine the aesthetic evaluation with other metrics to select quintessential photos from a large collection. The entire system is delivered through a web interface, which allows users to submit images or albums, and returns promising results for photo evaluation, editing recommendation, and photo selection.
|
|
|
Cast2Face: character identification in movie with actor-character correspondence |
| |
Mengdi Xu,
Xiaotong Yuan,
Jialie Shen,
Shuicheng Yan
|
|
Pages: 831-834 |
|
doi>10.1145/1873951.1874090 |
|
Full text: PDF
|
|
We investigate the problem of automatically identifying characters in a movie under the supervision of the actor-character name correspondence provided by the movie cast. Our proposed framework, Cast2Face, is characterized by the following: (i) we restrict the names to assign to the set of character names in the cast; (ii) for each character, using the corresponding actor's name as a keyword, we retrieve from Google image search a group of face images to form the gallery set; and (iii) the probe face tracks in the movie are then identified as one of the actors by a robust multi-task joint sparse representation and classification method. The actor name assigned to a face track is then mapped to the character name, again based on the cast. In addition to face naming, we further apply the proposed method to spotlight summarization of a particular actor in his/her movies. Empirical evaluations on several feature-length movies demonstrate the satisfying performance of our method.
|
|
|
Visual security evaluation for video encryption |
| |
Lingling Tong,
Feng Dai,
Yongdong Zhang,
Jintao Li
|
|
Pages: 835-838 |
|
doi>10.1145/1873951.1874091 |
|
Full text: PDF
|
|
Video encryption plays an important role in guaranteeing data security, which is increasingly important with the development of multimedia technology. A great deal of effort has been devoted in recent years to developing video encryption methods. However, few studies focus on visual security evaluation, which has a significant impact on measuring the effectiveness of these methods. In this paper, a new metric for video encryption is proposed, which evaluates visual security based on the color and edge features of the original and cipher videos. The metric is easy to incorporate into a video encryption system for visual-security-based encryption decisions. In addition, subjective tests for visual security assessment have been fully carried out. Experiments show that the proposed metric correlates better with subjective results than other metrics.
|
|
|
Automatic trailer generation |
| |
Go Irie,
Takashi Satou,
Akira Kojima,
Toshihiko Yamasaki,
Kiyoharu Aizawa
|
|
Pages: 839-842 |
|
doi>10.1145/1873951.1874092 |
|
Full text: PDF
|
|
This paper presents a content-based movie trailer generation method, named Vid2Trailer (V2T). Since trailers are intended to advertise movies, they must show specific symbols such as the title logo and the main theme music. Moreover, a trailer is expected to attract viewers through its visual and audio content. V2T satisfies these two requirements when creating a trailer from the original movie content. First, the title logo and the main theme music are extracted. Second, impressive speech and video segments are extracted using an affective content analysis technique. Third, all of the extracted components are concatenated into the form of a trailer; to realize this, we propose a method that estimates the affective impact of shot sequences, and introduce an algorithm that arranges a set of shots so as to maximize the affective impact of the sequence. Experiments show that V2T is better suited to trailer generation than conventional techniques.
|
|
|
Extracting captions from videos using temporal feature |
| |
Xiaoqian Liu,
Weiqiang Wang
|
|
Pages: 843-846 |
|
doi>10.1145/1873951.1874093 |
|
Full text: PDF
|
|
Captions in videos provide much useful semantic information for indexing and retrieving video content. In this paper, we present an effective approach to extracting captions from videos. Its novelty comes from exploiting temporal information in both the localization and segmentation of captions. Since only simple features such as edges, corners, and color are utilized, our approach is efficient. It involves four steps. First, we exploit the distribution of corners to spatially detect and locate the caption in a frame. Then, temporal localization of the different captions in a video is performed by identifying changes in stroke directions. After that, we segment the caption pixels in a clip sharing the same caption, based on the consistency and dominant distribution of the caption color. Finally, the segmentation results are further refined. Experimental results on two representative movies provide preliminary verification of the validity of our approach.
|
|
|
Automatic role recognition based on conversational and prosodic behaviour |
| |
Hugues Salamin,
Alessandro Vinciarelli,
Khiet Truong,
Gelareh Mohammadi
|
|
Pages: 847-850 |
|
doi>10.1145/1873951.1874094 |
|
Full text: PDF
|
|
This paper proposes an approach for the automatic recognition of roles in settings like news and talk-shows, where roles correspond to specific functions like Anchorman, Guest or Interview Participant. The approach is based on purely nonverbal vocal behavioral cues, including who talks when and how much (turn-taking behavior), and statistical properties of pitch, formants, energy and speaking rate (prosodic behavior). The experiments have been performed over a corpus of around 50 hours of broadcast material and the accuracy, the percentage of time correctly labeled in terms of role, is up to 89%. Both turn-taking and prosodic behavior lead to satisfactory results. Furthermore, on one database, their combination leads to a statistically significant improvement.
|
|
|
VERT: automatic evaluation of video summaries |
| |
Yingbo Li,
Bernard Merialdo
|
|
Pages: 851-854 |
|
doi>10.1145/1873951.1874095 |
|
Full text: PDF
|
|
Video summarization has become an important tool for multimedia information processing, but the automatic evaluation of a video summarization system remains a challenge. A major issue is that an ideal "best" summary does not exist, although people can easily distinguish "good" from "bad" summaries. A similar situation arises in machine translation and text summarization, where specific automatic procedures, BLEU and ROUGE respectively, evaluate the quality of a candidate by comparing its local similarities with several human-generated references. These procedures are now routinely used in various benchmarks. In this paper, we extend this idea to the video domain and propose the VERT (Video Evaluation by Relevant Threshold) algorithm to automatically evaluate the quality of video summaries. VERT mimics the principles of BLEU and ROUGE, and counts the weighted number of overlapping selected units between the computer-generated video summary and several human-made references. Several variants of VERT are suggested and compared.
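In the spirit of ROUGE-N, a recall-oriented VERT-like score can be sketched as the overlap between the shots a candidate summary selects and those selected by each human reference. This is a schematic reading of the idea, not the exact VERT formula or its weighting.

```python
def vert_recall(candidate, references):
    """candidate: set of selected shot ids; references: list of such sets.

    Counts overlapping selections against all references, ROUGE-style.
    """
    overlap = sum(len(candidate & ref) for ref in references)
    total = sum(len(ref) for ref in references)
    return overlap / total if total else 0.0

refs = [{1, 4, 7, 9}, {1, 3, 7, 8}, {2, 4, 7, 9}]
print(vert_recall({1, 4, 7}, refs))  # 7 of the 12 reference picks covered -> 0.583
```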
|
|
|
Character-based movie summarization |
| |
Jitao Sang,
Changsheng Xu
|
|
Pages: 855-858 |
|
doi>10.1145/1873951.1874096 |
|
Full text: PDF
|
|
A decent movie summary helps the producer promote the movie and helps the audience capture its theme before watching the whole film. Most existing automatic movie summarization approaches rely heavily on video content alone, which may not deliver ideal results due to the semantic gap between the low-level features computers calculate and the high-level understanding humans use. In this paper, we incorporate the script into movie analysis and propose a novel character-based movie summarization approach, motivated by modern film theory, which holds that what actually catches the audience's attention is the character. We first segment scenes in the movie by analyzing and aligning the script with the movie. Then we conduct substory discovery and content attention analysis based on the scene analysis and character interaction features. Given the obtained movie structure and content attention values, we calculate movie attraction scores at both the shot and scene levels and adopt these as the criterion for generating the movie summary. The promising experimental results demonstrate that character analysis is effective for movie summarization and movie content understanding.
|
|
|
Supervised manifold learning for image and video classification |
| |
Yang Liu,
Yan Liu,
Keith C.C. Chan
|
|
Pages: 859-862 |
|
doi>10.1145/1873951.1874097 |
|
Full text: PDF
|
|
This paper presents a supervised manifold learning model for dimensionality reduction in image and video classification tasks. Unlike most manifold learning models, which emphasize distance preservation, we propose a novel algorithm called maximum distance embedding (MDE), which aims to maximize the distances between particular pairs of data points, with the intention of flattening the local nonlinearity while keeping the discriminant information in the embedded feature space. Moreover, MDE measures the dissimilarity between data points using the L1-norm distance, which is more robust to outliers than the widely used Frobenius-norm distance. To accommodate the natural tensor structure of image and video data, we further propose multilinear MDE (M2DE). Experiments on various datasets demonstrate that both MDE and M2DE achieve impressive embedding results for image and video classification tasks.
|
|
|
Unsupervised object category discovery via information bottleneck method |
| |
Zhengzheng Lou,
Yangdong Ye,
Dong Liu
|
|
Pages: 863-866 |
|
doi>10.1145/1873951.1874098 |
|
Full text: PDF
|
|
We present a novel approach to automatically discovering object categories from a collection of unlabeled images. This is achieved with the Information Bottleneck method, which finds the optimal partitioning of the image collection by maximally preserving the relevant information with respect to the latent semantics residing in the image contents. In this method, the images are modeled by the bag-of-words representation, which naturally transforms each image into a visual document composed of visual words. Then the sIB algorithm is adopted to learn the object patterns by maximizing the semantic correlations between the images and their constituent visual words. Extensive experimental results on 15 benchmark image datasets show that the Information Bottleneck method is a promising technique for discovering the hidden semantics of images, and is superior to the state-of-the-art unsupervised object category discovery methods.
|
|
|
Probabilistic visual concept trees |
| |
Lexing Xie,
Rong Yan,
Jelena Tešić,
Apostol Natsev,
John R. Smith
|
|
Pages: 867-870 |
|
doi>10.1145/1873951.1874099 |
|
Full text: PDF
|
|
This paper presents probabilistic visual concept trees, a model for large visual semantic taxonomy structures and its use in visual concept detection. Organizing visual semantic knowledge systematically is one of the key challenges towards large-scale concept detection, and one that is complementary to optimizing visual classification for individual concepts. Semantic concepts have traditionally been treated as isolated nodes, a densely-connected web, or a tree. Our analysis shows that none of these models is sufficient to capture the typical relationships in a real-world visual taxonomy, which fall into three broad categories: semantic, appearance, and statistics. We propose probabilistic visual concept trees for modeling a taxonomy forest with observation uncertainty. As a Bayesian network with parameter constraints, this model is flexible enough to account for the key assumptions in all three types of taxonomy relations, yet robust enough to accommodate expansion or deletion of a taxonomy. Our evaluation results on a large web image dataset show that the classification accuracy improves considerably upon baselines with no, or only a subset of, concept relationships.
|
|
|
A conditional random field viewpoint of symbolic audio-to-score matching |
| |
Cyril Joder,
Slim Essid,
Gaël Richard
|
|
Pages: 871-874 |
|
doi>10.1145/1873951.1874100 |
|
Full text: PDF
|
|
We present a new approach to symbolic audio-to-score alignment using Conditional Random Fields (CRFs). Unlike Hidden Markov Models, these graphical models allow state-conditional probabilities to be calculated on the basis of several audio frames. The CRF models that we propose exploit this property to take into account the rhythmic information of the musical score. Assuming that the tempo is locally constant, they confront the neighborhood of each frame with several tempo hypotheses. Experiments on a pop-music database show that this use of contextual information leads to a significant improvement in alignment accuracy. In particular, the proportion of detected onsets inside a 100-ms tolerance window increases by more than 10% when a 1-s neighborhood is considered.
|
|
|
Bilingual query translation and expansion for supporting more effective cross-language image retrieval |
| |
Yuejie Zhang,
Lei Cen,
Cheng Jin,
Xiangyang Xue,
Ning Zhou
|
|
Pages: 875-878 |
|
doi>10.1145/1873951.1874101 |
|
Full text: PDF
|
|
To support more effective Cross-Language Image Retrieval (ImageCLIR), a novel algorithm is developed by integrating a bilingual semantic network to achieve more precise bilingual query translation and expansion. An English-Chinese bilingual parallel corpus is used to construct the bilingual semantic network for determining more meaningful text terms and characterizing the inter-term correlations and similarity contexts between multiple inter-related text terms more precisely. Our experiments on CWMT2009 and CLEF have provided very promising results.
|
|
|
The idiap wolf corpus: exploring group behaviour in a competitive role-playing game |
| |
Hayley Hung,
Gokul Chittaranjan
|
|
Pages: 879-882 |
|
doi>10.1145/1873951.1874102 |
|
Full text: PDF
|
|
In this paper we present the Idiap Wolf Database, an audio-visual corpus containing natural conversational data from volunteers who took part in a competitive role-playing game. Four groups of 8-12 people were recorded; in total, just over 7 hours of interactive conversational data was collected. The data has been annotated in terms of the roles and outcomes of the game, with 371 examples of different roles played over 50 games. Recordings were made with headset microphones, an 8-microphone array, and 3 video cameras, and are fully synchronized. The novelty of this data is that some players have deceptive roles and the participants do not know what roles other people play.
|
|
|
Face hallucination with shape parameters projection constraint |
| |
Chengdong Lan,
Ruimin Hu,
Kebin Huang,
Zhen Han
|
|
Pages: 883-886 |
|
doi>10.1145/1873951.1874103 |
|
Full text: PDF
|
|
In real surveillance scenarios, a variety of factors affect image quality, leading to pixel distortion and aliasing. Traditional face super-resolution algorithms use only the difference of image pixel values as the similarity criterion, which degrades the similarity and identifiability of reconstructed facial images. Image semantic information grounded in human understanding, especially structural data about shapes, is robust to degraded images. In this paper, we propose face hallucination with a shape-parameter projection constraint. This method uses a parametric model to represent face shapes, and shape information from the input image is introduced to improve the quality of the reconstructed image. A shape-model regularization term is first added to the original objective function. Then the shape parameters are projected into the domain of the image parameters by a linear regression model. Finally, the gradient descent method is used to obtain the unified parameters. Experimental results demonstrate that the proposed method significantly outperforms traditional schemes in both subjective and objective quality.
|
|
|
Automatic detection of malicious sound using segmental two-dimensional mel-frequency cepstral coefficients and histograms of oriented gradients |
| |
Myung Jong Kim,
Younggwan Kim,
JaeDeok Lim,
Hoirin Kim
|
|
Pages: 887-890 |
|
doi>10.1145/1873951.1874104 |
|
Full text: PDF
|
|
This paper addresses the problem of recognizing malicious sounds, such as sexual screams or moans, to detect and block objectionable multimedia content. Malicious sounds exhibit distinct characteristics: large temporal variations and fast spectral transitions. Extracting features that properly represent these characteristics is therefore important for achieving better performance. In this paper, we employ segment-based two-dimensional Mel-frequency cepstral coefficients and histograms of gradient directions as a feature set to characterize both the temporal variations and the spectral transitions within a long-range segment of the target signal. A Gaussian mixture model (GMM) is adopted to statistically represent the malicious and non-malicious sounds, and test sounds are classified by a maximum a posteriori (MAP) method. Evaluation of the proposed feature extraction method on a database of several hundred malicious and non-malicious sound clips yielded a precision of 91.31% and a recall of 94.27%. This result suggests that the approach could be used as an alternative to image-based methods.
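The GMM/MAP decision stage can be sketched with scikit-learn: fit one mixture per class on training feature vectors (synthetic stand-ins here for the segmental 2D-MFCC/HOG features), then classify by maximum posterior. Mixture sizes, priors, and the toy features are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_mal = rng.normal(2.0, 1.0, size=(200, 8))    # stand-ins for malicious features
X_neg = rng.normal(-2.0, 1.0, size=(200, 8))   # stand-ins for non-malicious ones

gmm_mal = GaussianMixture(n_components=4, random_state=0).fit(X_mal)
gmm_neg = GaussianMixture(n_components=4, random_state=0).fit(X_neg)
LOG_PRIOR = np.log(0.5)  # assumed equal class priors

def classify(x):
    """MAP decision: argmax over classes of log p(x | class) + log P(class)."""
    s_mal = float(gmm_mal.score_samples(x[None])) + LOG_PRIOR
    s_neg = float(gmm_neg.score_samples(x[None])) + LOG_PRIOR
    return "malicious" if s_mal > s_neg else "clean"

print(classify(rng.normal(2.0, 1.0, size=8)))
```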
|
|
|
Automatic interesting object extraction from images using complementary saliency maps |
| |
Haonan Yu,
Jia Li,
Yonghong Tian,
Tiejun Huang
|
|
Pages: 891-894 |
|
doi>10.1145/1873951.1874105 |
|
Full text: PDF
|
|
Automatic extraction of interesting objects is widely used in many image applications. Among the various extraction approaches, saliency-based ones usually perform better since they accord well with human visual perception. However, nearly all existing saliency-based approaches suffer from the integrity problem: the extracted result is either a small part of the object (referred to as sketch-like) or a large region that contains some redundant parts of the background (referred to as envelope-like). In this paper, we propose a novel object extraction approach that integrates two kinds of "complementary" saliency maps (i.e., sketch-like and envelope-like maps). In our approach, the extraction process is decomposed into two sub-processes: one extracts a high-precision result based on the sketch-like map, and the other extracts a high-recall result based on the envelope-like map. A classification step then extracts the exact object based on the two results. By transferring the complex extraction task to an easier classification problem, our approach effectively overcomes the integrity problem. Experimental results show that the proposed approach remarkably outperforms six state-of-the-art saliency-based methods in automatic object extraction, and is even comparable to some interactive approaches.
|
|
|
Interactive retrieval of targets for wide area surveillance |
| |
Saad Ali,
Omar Javed,
Niels Haering,
Takeo Kanade
|
|
Pages: 895-898 |
|
doi>10.1145/1873951.1874106 |
|
Full text: PDF
|
|
We address the problem of interactive search for a target of interest in surveillance imagery. Our solution consists of iteratively learning a distance metric for retrieval, based on user feedback. The approach employs (retrieval) rank based constraints and convex optimization to efficiently learn the distance metric. The algorithm uses both user labeled and unlabeled examples in the learning process. The method is fast enough for a new metric to be learned interactively for each target query. In order to reduce the burden on the user, a model-independent active learning method is used to select key examples, for response solicitation. This leads to a significant reduction in the number of user-interactions required for retrieving the target of interest. The proposed method is evaluated on challenging pedestrian and vehicle data sets, and compares favorably to the state of the art in target re-acquisition algorithms.
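A minimal flavor of rank-constrained metric learning: given user feedback triples (query, relevant, irrelevant), nudge a diagonal Mahalanobis metric until relevant items rank closer than irrelevant ones. This hinge-style gradient update is an illustrative stand-in for the paper's convex formulation, with invented data and learning rate.

```python
import numpy as np

def learn_diag_metric(feats, constraints, lr=0.1, epochs=50):
    """Learn a diagonal Mahalanobis metric from rank feedback.

    constraints: (query, relevant, irrelevant) index triples from user feedback;
    we push d(q, rel) below d(q, irr) with a margin, one hinge update at a time.
    """
    w = np.ones(feats.shape[1])
    for _ in range(epochs):
        for q, pos, neg in constraints:
            d_pos = w @ (feats[q] - feats[pos]) ** 2
            d_neg = w @ (feats[q] - feats[neg]) ** 2
            if d_pos + 1.0 > d_neg:  # margin violated: adjust the metric
                w -= lr * ((feats[q] - feats[pos]) ** 2 - (feats[q] - feats[neg]) ** 2)
                w = np.clip(w, 1e-6, None)  # keep the metric valid (nonnegative)
    return w

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 4))
print(learn_diag_metric(F, [(0, 1, 2), (0, 1, 3)]).round(2))
```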
|
|
|
SESSION: Short - S4/applications/content track |
| |
Lexing Xie
|
|
|
|
|
Restoration of out-of-focus lecture video by automatic slide matching |
| |
Ngai-Man Cheung,
David Chen,
Vijay Chandrasekhar,
Sam S. Tsai,
Gabriel Takacs,
Sherif A. Halawa,
Bernd Girod
|
|
Pages: 899-902 |
|
doi>10.1145/1873951.1874108 |
|
Full text: PDF
|
|
Restoring the fine detail in the slide area of a defocused lecture video is a challenging task. In this work, we propose to use the clean images of the slides, available alongside the defocused lecture video, to aid the restoration. Our method uses local feature descriptors and multiple defocused slide decks to automatically identify the slide displayed in the defocused frame. We then use the matching slide as side information to estimate the parameters for deconvolution and bilateral filtering. Experimental results show that the proposed algorithm compares favorably to a computationally intensive iterative deconvolution algorithm that does not employ any side information. In particular, it can recover small drawings and text that are severely blurred in a poorly focused lecture video.
|
|
|
Increasing interactivity in street view web navigation systems |
| |
Alexandre Devaux,
Nicolas Paparoditis
|
|
Pages: 903-906 |
|
doi>10.1145/1873951.1874109 |
|
Full text: PDF
|
|
This paper presents several interactive features we have added to our street-view web navigation application. Our system allows users to navigate through a huge amount of data (panoramas and laser point clouds) and also to interact with it. We detail four aspects of this interactivity. First, labelling: displaying features directly in the images in 3D space, useful for the general public but also for researchers in image processing and computer vision. Second, we propose a crowd-sourcing mode for blurring people who were not automatically detected. Third, we offer web users the possibility of localizing and measuring in 3D any object visible in the images by plotting in only one image. Finally, we developed a multimedia editor that allows public administrations (town halls, museums, operas, theaters, etc.) to add interactive content such as video or images at the exact 3D position, orientation, and size they choose, realistically augmenting the static scenes with dynamic or fresher elements.
|
|
|
Improving face clustering using social context |
| |
Peng Wu,
Feng Tang
|
|
Pages: 907-910 |
|
doi>10.1145/1873951.1874110 |
|
Full text: PDF
|
|
In this paper we describe an algorithm that improves the performance of face clustering using the social relationships of people. One common challenge for face clustering techniques is that the faces of the same person are often split across different face clusters, due to the imperfection of face features. To remedy this problem, the user has to scan all the clusters and manually merge the face clusters of the same person into one cluster. We propose to use the social context information inherent among the people in a collection to build a social network, and to combine this knowledge with the face similarity measure to generate a small number of ranked face clusters as candidates for a cluster to be merged into. A user can thus often avoid browsing the face clusters back and forth to find the right cluster to merge. Experimental results show that the proposed approach can improve the recall of face clustering, because more correct faces are merged into their significant cluster, while still maintaining high precision.
|
|
|
Region categorization with mobile applications |
| |
Jiang Gao
|
|
Pages: 911-914 |
|
doi>10.1145/1873951.1874111 |
|
Full text: PDF
|
|
We explore how to optimally categorize regions for faster and more reliable image matching and registration. We propose using the entropy of histogram of oriented gradients (HOG) features to characterize image regions, and propose a region-sensitive feature selection algorithm for image registration. We apply the region categorization algorithms to several mobile applications, including mobile visual search and image registration for panoramas. We demonstrate the effectiveness of our approach with experimental results on a large dataset.
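The entropy criterion is easy to sketch: histogram the gradient orientations of a region (magnitude-weighted) and compute the Shannon entropy. A region dominated by one orientation (e.g., a straight edge) scores low, while a textured region scores high. The bin count and any threshold on the score are assumptions.

```python
import numpy as np

def hog_entropy(patch, bins=9):
    """Shannon entropy of the gradient-orientation histogram of a 2D patch."""
    gy, gx = np.gradient(patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0, np.pi), weights=mag)
    p = hist / (hist.sum() + 1e-12)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

ramp = np.tile(np.arange(16.0), (16, 1))             # single dominant gradient
rng = np.random.default_rng(0)
texture = rng.uniform(0, 1, size=(16, 16))           # many orientations
print(hog_entropy(ramp), hog_entropy(texture))       # low vs. high entropy
```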
|
|
|
Topic discovery of web video using star-structured K-partite graph |
| |
Jian Shao,
Wentao Yin,
Shuai Ma,
Yueting Zhuang
|
|
Pages: 915-918 |
|
doi>10.1145/1873951.1874112 |
|
Full text: PDF
|
|
With the explosive growth of web videos on video-sharing sites like YouTube, the discovery of video topics has become a hot research area. In order to effectively utilize the many kinds of characteristics of web video, such as visual features (SIFT, shape, or color) and contextual cues (such as title or tags), this paper proposes an approach that represents the explicit and implicit correlations hidden in web videos with a star-structured K-partite graph model, on which a co-clustering process is conducted to discover video topics. The experimental results demonstrate the feasibility and effectiveness of the proposed approach.
|
|
|
Video retargeting for aesthetic enhancement |
| |
Yang-Yang Xiang,
Mohan S. Kankanhalli
|
|
Pages: 919-922 |
|
doi>10.1145/1873951.1874113 |
|
Full text: PDF
|
|
In this paper, we present a post-editing scheme for camera work. It is based on video retargeting, but aims to enhance the aesthetic interest of home-produced video sequences. The essential parts of video clips are emphasized by automatic zooming in, while the camera pans to preserve the important features within the frame. Different from traditional video retargeting schemes, we use a variable zooming factor based on the motion saliency of frames.
|
|
|
FireVolleyball: multi-player interactive game providing a sense of touching fire |
| |
Sei Ikeda,
Yuki Uranishi,
Yoshitsugu Manabe,
Kunihiro Chihara
|
|
Pages: 923-926 |
|
doi>10.1145/1873951.1874114 |
|
Full text: PDF
|
|
This paper describes a novel game system that provides multiple players with a sense of touching fire with their own hands. Players in this game are divided into two teams in front of a wall-type flat display and, as in volleyball, try to score points by grounding a fireball on the other team's court. The players can recognize their contacts with the fireball from the mirrored image of their own appearance and the superimposed fireball in the display. The computer detects these contacts using a real-time time-of-flight camera and renders the flame of the fireball with a fluid simulation. We chose fire as the ball because several characteristics of fire are advantageous for interaction between users and virtual objects. Thanks to these advantages, the game provides players with enjoyment and a sense of reality even though the implemented human detection algorithm is quite simple.
|
|
|
Memory matrix: a novel user experience for home video |
| |
Qianqian Xu,
Zhipeng Wu,
Guorong Li,
Lei Qin,
Shuqiang Jiang,
Qingming Huang
|
|
Pages: 927-930 |
|
doi>10.1145/1873951.1874115 |
|
Full text: PDF
|
|
Nowadays, various efforts aim to automatically analyze home videos and provide users with satisfactory experiences. In this paper, we present a novel user experience for home video called Memory Matrix, which helps users re-experience the joy of their memories by travelling along not only the time axis but also the space axis. In other words, the video clips (sub-shots) are organized by both capture time and capture location, which further allows the user to browse home videos taken at similar locations. Moreover, given a specific query in the Memory Matrix (row, column), the system can also provide the user with optional summaries along the time or space axis. The summarization scheme is based on a top-down interest score generation algorithm that automatically propagates pre-labeled video-level interest scores to sub-shot-level interest scores. First, the user is asked to assign interest scores to all the video sequences in the home video collection. Then, the video sequences are decomposed into sub-shots represented by keyframes. Next, we employ multi-scale spatial saliency analysis to remove the foregrounds and model the background scenes with histograms of visual words. Finally, the interest scores are propagated from the video level to the sub-shot level using a gradient descent algorithm. Experimental results demonstrate the effectiveness, efficiency, and robustness of our framework.
|
|
|
Artistic paper-cut of human portraits |
| |
Meng Meng,
Mingtian Zhao,
Song-Chun Zhu
|
|
Pages: 931-934 |
|
doi>10.1145/1873951.1874116 |
|
Full text: PDF
|
|
This paper presents a method for rendering artistic paper-cuts of human portraits. Rendering paper-cut images from photographs can be considered an inhomogeneous image binarization problem, whose ideal solutions reproduce vivid image details with sparse cuts. For portrait paper-cuts especially, good artworks should capture impressive facial features. To achieve this goal, our approach integrates bottom-up and top-down cues to better determine the binary values. In the bottom-up phase, facial components are localized on the input photograph, and draft binary versions of them are proposed. In the top-down phase, we use pre-collected representative paper-cut templates, from which we synthesize the final paper-cut image by matching them with the bottom-up proposals. Experimental results show that our approach produces visually satisfactory results.
|
|
|
Robust hashing for music copyright protection by combining beat segmentation and chroma |
| |
Wei Li,
Zhurong Wang,
Bilei Zhu,
Xiangyang Xue
|
|
Pages: 935-938 |
|
doi>10.1145/1873951.1874117 |
|
Full text: PDF
|
|
Time-scale modification and pitch shifting are two recognized challenging attacks on music copyright protection. To resist them simultaneously, a novel robust hashing method is proposed that combines the strengths of music beat segmentation and chroma-based music features. These two measures address the problems of desynchronization and frequency shifting, respectively. Moreover, two layers of scrambling are performed to ensure security. Experiments exhibit remarkable robustness against various attacks, including pitch scaling at 20%, time-scale modification at 40%, and jittering at 1/10.
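For readers who want to experiment, the core idea of beat-synchronous chroma features is straightforward to prototype. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the librosa library, a placeholder file path, and a simple median-based binarization that stands in for the paper's hashing and scrambling stages.

    # Illustrative beat-synchronous chroma hash (not the paper's exact method).
    import numpy as np
    import librosa

    def beat_chroma_hash(path):
        y, sr = librosa.load(path, sr=22050)              # mono audio
        _, beats = librosa.beat.beat_track(y=y, sr=sr)    # beat frame indices
        chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # 12 x T chromagram
        # Averaging chroma within beat segments yields features that follow the
        # music's own clock, which is what resists time-scale modification.
        beat_chroma = librosa.util.sync(chroma, beats, aggregate=np.mean)
        # Binarize: 1 where a pitch class exceeds the per-beat median energy.
        bits = beat_chroma > np.median(beat_chroma, axis=0, keepdims=True)
        return bits.astype(np.uint8)

Matching two such hashes then reduces to a bit-error (Hamming) rate between the binary matrices.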
|
|
|
Explicit and implicit concept-based video retrieval with bipartite graph propagation model |
| |
Lei Bao,
Juan Cao,
Yongdong Zhang,
Jintao Li,
Ming-yu Chen,
Alexander G. Hauptmann
|
|
Pages: 939-942 |
|
doi>10.1145/1873951.1874118 |
|
Full text: PDF
|
|
The major scientific problem for content-based video retrieval is the semantic gap. Generally speaking, there are two appropriate ways to bridge the semantic gap: one from the human perspective (top-down) and the other from the computer perspective (bottom-up). The top-down method defines a concept lexicon from the human perspective, trains a detector for each concept based on supervised learning, and then indexes the corpus with the concept detectors. Since each concept has an explicit semantic meaning, we call such a concept an explicit concept. The bottom-up approach directly discovers the underlying latent topics from the video corpus using unsupervised learning, and the corpus is subsequently indexed by these latent topics. In contrast to explicit concepts, we call latent topics implicit concepts. Because the explicit concept set is pre-defined and independent of the corpus, it cannot completely describe the corpus and users' queries. The implicit concepts, on the other hand, are dynamic and corpus-dependent, and can describe the corpus and users' queries more fully. Therefore, combining explicit and implicit concepts is a promising way to bridge the semantic gap effectively. In this paper, a Bipartite Graph Propagation Model (BGPM) is applied to automatically balance the influences of explicit and implicit concepts. Concept nodes with strong connections to queries are reinforced, whether explicit or implicit. As demonstrated by experiments on the TRECVID 2008 video dataset, BGPM successfully fuses explicit and implicit concepts to achieve a significant improvement on 48 search tasks.
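To make the propagation idea concrete, here is a minimal numpy sketch of score reinforcement on a bipartite graph between concept nodes and shot nodes. The affinity matrix W, the damping factor, and the update rule are assumptions chosen for illustration; they are not taken from the paper.

    # Minimal sketch of bipartite score propagation (illustrative assumptions).
    import numpy as np

    def bipartite_propagate(W, q, alpha=0.8, iters=50):
        """W: (n_concepts, n_shots) concept-shot affinity; q: query-to-concept scores."""
        P_cs = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)      # concept -> shot
        P_sc = (W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)).T  # shot -> concept
        c = q.astype(float).copy()
        for _ in range(iters):
            s = P_cs.T @ c                               # push concept scores to shots
            c = alpha * (P_sc.T @ s) + (1 - alpha) * q   # reinforce well-connected concepts
        return s, c  # shot ranking scores and refined concept scores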
|
|
|
Lightweight TV logo recognition based on image moment |
| |
Masaru Sugano,
Shigeyuki Sakazawa
|
|
Pages: 943-946 |
|
doi>10.1145/1873951.1874119 |
|
Full text: PDF
|
|
TV logo recognition is one of the suggested solutions for preventing unauthorized duplication and redistribution. The major problem with previous logo recognition approaches is that the matching process against the reference logos requires much time. Since millions of minutes of TV programs are produced and broadcast every day, an efficient algorithm is necessary. In this paper, we propose lightweight logo recognition using an image moment. An image moment represents a geometrical feature as a low-dimensional numerical vector, so matching can be achieved with a simple distance metric. Since an image moment itself is not very robust to noise, we introduce the conspicuousness of a transparent logo according to the visual properties of its background and surrounding regions. A validation experiment using actual TV programs shows much faster processing than the existing method, with only a small degradation in accuracy. In the proposed method, computational complexity is independent of the number of reference logos and the size of the logo region.
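For intuition, moment-based matching can be prototyped in a few lines. The sketch below uses Hu moments as a concrete choice of image moment (an assumption; the paper does not commit to Hu moments) together with OpenCV's moment routines.

    # Illustrative Hu-moment logo matching (assumed stand-in for the paper's moment).
    import cv2
    import numpy as np

    def hu_signature(gray_region):
        m = cv2.moments(gray_region)         # raw, central and normalized moments
        hu = cv2.HuMoments(m).flatten()      # 7 rotation/scale-invariant values
        # Log-scale to compress the large dynamic range of Hu moments.
        return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

    def best_match(candidate, references):
        """references: dict name -> grayscale logo image; returns closest name."""
        c = hu_signature(candidate)
        dists = {name: np.linalg.norm(c - hu_signature(ref))
                 for name, ref in references.items()}
        return min(dists, key=dists.get)     # simple distance metric, O(#refs)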
|
|
|
Representative views re-ranking for 3D model retrieval with multi-bipartite graph reinforcement model |
| |
Yue Gao,
You Yang,
Qionghai Dai,
Naiyao Zhang
|
|
Pages: 947-950 |
|
doi>10.1145/1873951.1874120 |
|
Full text: PDF
|
|
In this paper, we propose a multi-bipartite graph reinforcement model for representative-view re-ranking in 3D model retrieval. Given the views of a query 3D model, all query views are grouped into clusters to generate representative views and corresponding original weights. In the retrieval procedure, labeled positive retrieval results are employed to refine the query information. Each group of views from the positive retrieval results and the group of representative query views are employed to construct a bipartite graph, and a multi-bipartite graph reinforcement algorithm is performed on these bipartite graphs to re-rank all views. Then the weights of all representative query views are updated. Experimental results on two 3D model databases demonstrate the effectiveness of the proposed method.
|
|
|
Sorted label classifier chains for learning images with multi-label |
| |
Xi Liu,
Zhiping Shi,
Zhixin Li,
Xishun Wang,
Zhongzhi Shi
|
|
Pages: 951-954 |
|
doi>10.1145/1873951.1874121 |
|
Full text: PDF
|
|
In the real world, images often contain several visual objects rather than only one, which makes it difficult for conventional object recognition methods to deal with them. In this paper, we present a topologically sorted classifier chain method for learning images with multiple labels. We first provide a means of generating a topologically sorted label chain ordering by employing a topological sort algorithm, and then apply the chain ordering to the classifier chain model proposed by [1] to classify multi-label images. Our method can capture the correlations between labels very effectively, owing to the sorted label chain ordering and the advantages of the classifier chain method. We evaluate the proposed method on the Corel dataset and demonstrate micro and macro F1 measures superior to state-of-the-art methods.
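A minimal sketch of the idea follows: topologically sort the labels from an assumed dependency graph, then train a chain of per-label classifiers, each consuming the features plus the labels decided earlier in the chain. The dependency dictionary and the use of logistic regression are illustrative assumptions, not the paper's setup.

    # Sketch of a topologically sorted classifier chain (illustrative).
    import numpy as np
    from graphlib import TopologicalSorter
    from sklearn.linear_model import LogisticRegression

    def fit_sorted_chain(X, Y, label_deps):
        """Y: (n, L) binary labels; label_deps: {label index: set of prerequisites}."""
        order = list(TopologicalSorter(label_deps).static_order())  # sorted label chain
        chain, Xa = [], X.copy()
        for lbl in order:  # assumes each label column contains both classes
            clf = LogisticRegression(max_iter=1000).fit(Xa, Y[:, lbl])
            chain.append((lbl, clf))
            Xa = np.hstack([Xa, Y[:, [lbl]]])  # later classifiers see earlier labels
        return chain

    def predict_sorted_chain(chain, X):
        Xa, preds = X.copy(), {}
        for lbl, clf in chain:
            p = clf.predict(Xa)
            preds[lbl] = p
            Xa = np.hstack([Xa, p.reshape(-1, 1)])  # feed predictions down the chain
        return preds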
|
|
|
3D object retrieval with bag-of-region-words |
| |
Yue Gao,
You Yang,
Qionghai Dai,
Naiyao Zhang
|
|
Pages: 955-958 |
|
doi>10.1145/1873951.1874122 |
|
Full text: PDF
|
|
View-based methods have become an essential approach to 3D object retrieval in recent years. In the view-based 3D object retrieval framework, each object is described by a set of views, and representative features are extracted from these views to match the objects in the database. In this paper, we propose a novel 3D multi-view representation method, Bag-of-Region-Words (BoRW). It first selects points on a grid in each view and extracts local SIFT features. Each local feature is encoded into a visual word with a trained visual vocabulary. Then each view is split into several regions, and each region is represented by a bag-of-visual-words feature vector. All the obtained regions are further grouped into clusters based on the bag-of-visual-words feature, and one feature is selected from each cluster with a corresponding weight. In this way, each object is described by a set of BoRW features. The Earth Mover's Distance is employed to estimate the distance between two BoRW feature sets. Experimental results show that the proposed method achieves better retrieval performance than existing methods.
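The Earth Mover's Distance between two weighted region-feature sets can be computed as a small transportation linear program. The sketch below is illustrative: the Euclidean ground distance and the scipy LP solver are assumptions, not necessarily the paper's implementation.

    # EMD between weighted feature sets, posed as a transportation LP.
    import numpy as np
    from scipy.optimize import linprog

    def emd(F1, w1, F2, w2):
        """F1: (m, d) features with weights w1 summing to 1; F2: (n, d), w2 likewise."""
        m, n = len(F1), len(F2)
        C = np.linalg.norm(F1[:, None, :] - F2[None, :, :], axis=2).ravel()
        A_eq, b_eq = [], []
        for i in range(m):                      # source i ships exactly w1[i]
            row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
            A_eq.append(row); b_eq.append(w1[i])
        for j in range(n):                      # sink j receives exactly w2[j]
            col = np.zeros(m * n); col[j::n] = 1
            A_eq.append(col); b_eq.append(w2[j])
        res = linprog(C, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=(0, None), method="highs")
        return res.fun  # minimal total transport cost = EMD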
|
|
|
3D object search through semantic component |
| |
Chunjing Xu,
Zhengwu Zhang,
Jianzhuang Liu,
Xiaoou Tang
|
|
Pages: 959-962 |
|
doi>10.1145/1873951.1874123 |
|
Full text: PDF
|
|
In this paper, we present a novel concept for 3D object search named the semantic component, which describes a key component that semantically defines a 3D object. In most cases, the semantic component is intra-category stable and can therefore be used to construct an efficient 3D object retrieval scheme. By segmenting an object into segments and learning the similar segments shared by all the objects in the same category, we can summarize what humans use for object recognition, and from this analysis we develop a method to find the semantic component of an object. Our experiments justify the proposed method and demonstrate the effectiveness of our algorithm.
|
|
|
Keep moving!: revisiting thumbnails for mobile video retrieval |
| |
Wolfgang Hürst,
Cees G.M. Snoek,
Willem-Jan Spoel,
Mate Tomin
|
|
Pages: 963-966 |
|
doi>10.1145/1873951.1874124 |
|
Full text: PDF
|
|
Motivated by the increasing popularity of video on handheld devices and the resulting importance of effective video retrieval, this paper revisits the relevance of thumbnails in a mobile video retrieval setting. Our study indicates that users are quite able to handle and assess small thumbnails on a mobile's screen, especially with moving images, suggesting promising avenues for future research in the design of mobile video retrieval interfaces.
|
|
|
Semantic video indexing by fusing explicit and implicit context spaces |
| |
Yingbin Zheng,
Renzhong Wei,
Hong Lu,
Xiangyang Xue
|
|
Pages: 967-970 |
|
doi>10.1145/1873951.1874125 |
|
Full text: PDF
|
|
This paper addresses the problem of context-based concept fusion (CBCF) for concept detection and semantic video indexing. We introduce a novel framework based on constructing context spaces of concepts, such that contextual correlations are used to improve the performance of concept detectors. Unlike traditional CBCF approaches, we present two kinds of context spaces: an explicit context space for modeling the correlation of pairwise concepts, and an implicit context space for representing latent themes trained from a set of concepts. The final concept detection scores are then directly fused from the explicit and implicit context spaces. Experiments on the TRECVid 2006 benchmark and comparisons with several state-of-the-art approaches demonstrate the effectiveness of the proposed framework.
|
|
|
Effective logo retrieval with adaptive local feature selection |
| |
Jianlong Fu,
Jinqiao Wang,
Hanqing Lu
|
|
Pages: 971-974 |
|
doi>10.1145/1873951.1874126 |
|
Full text: PDF
|
|
Towards building a practical large-scale logo retrieval system, we propose a novel approach to extract and combine local features for effective logo retrieval. Instead of extracting global features by modeling the web logo as a whole, we extract local feature phrases to form a visual codebook and build an inverted file storing the features to accelerate indexing. We then divide logos into several groups according to which local feature type models the logo best, naming them "Point-type", "Shape-type", and "Patch-type", and develop a strategy of adaptive feature selection with a weight-updating mechanism. To evaluate the performance, we have built a new challenging dataset consisting of 60 international corporations' logos. Experiments and comparisons demonstrate performance superior to previous retrieval algorithms.
|
|
|
Mining and cropping common objects from images |
| |
Gangqiang Zhao,
Junsong Yuan
|
|
Pages: 975-978 |
|
doi>10.1145/1873951.1874127 |
|
Full text: PDF
|
|
Discovering common objects that appear frequently in a number of images is a challenging problem, due to (1) the appearance variations of the same common object and (2) the enormous computational cost involved in exploring the huge solution space, including the location, scale, and number of common objects. We characterize each image as a collection of visual primitives and propose a novel bottom-up approach that gradually prunes local primitives to recover the whole common object. A multi-layer candidate pruning procedure is designed to accelerate the image data mining process. Our solution provides accurate localization of the common object and is thus able to crop common objects despite variations due to scale, viewpoint, and lighting changes. Moreover, it can extract common objects even from a small number of images. Experiments on challenging image and video datasets validate the effectiveness and efficiency of our method.
|
|
|
Saliency detection based on 2D log-gabor wavelets and center bias |
| |
Min Wang,
Jia Li,
Tiejun Huang,
Yonghong Tian,
Lingyu Duan,
Guochen Jia
|
|
Pages: 979-982 |
|
doi>10.1145/1873951.1874128 |
|
Full text: PDF
|
|
Visual saliency can be a useful tool for image content analysis tasks such as automatic image cropping and image compression. Most existing visual saliency detection methods are based on models of the receptive field. In this paper, we propose a bottom-up model that introduces 2D Log-Gabor wavelets for saliency detection. Compared with the traditional receptive field model, 2D Log-Gabor wavelets better simulate the biological characteristics of simple cortical cells in the receptive field. Moreover, we incorporate the influence of center bias into our model, a common phenomenon that directs visual attention to the center of images in natural scenes. Experimental results show that our approach remarkably outperforms three state-of-the-art approaches.
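For readers who want to see the filter itself, here is a single-scale 2D log-Gabor band-pass filter built in the Fourier domain, combined with a Gaussian center-bias map. All parameter values, and the reduction to one scale with no orientation bank, are simplifying assumptions rather than the paper's configuration.

    # Single-scale 2D log-Gabor saliency with Gaussian center bias (illustrative).
    import numpy as np

    def log_gabor_saliency(img, f0=0.1, sigma_ratio=0.55, center_sigma=0.3):
        """img: 2-D grayscale float array."""
        h, w = img.shape
        fy = np.fft.fftfreq(h)[:, None]
        fx = np.fft.fftfreq(w)[None, :]
        r = np.hypot(fx, fy)
        r[0, 0] = 1.0                                  # avoid log(0) at DC
        # Radial log-Gabor transfer function: a Gaussian on a log frequency axis.
        G = np.exp(-(np.log(r / f0) ** 2) / (2 * np.log(sigma_ratio) ** 2))
        G[0, 0] = 0.0                                  # zero DC response
        resp = np.abs(np.fft.ifft2(np.fft.fft2(img) * G))  # band-pass energy
        # Center bias: Gaussian falloff from the image center.
        yy, xx = np.mgrid[0:h, 0:w]
        d2 = ((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2
        bias = np.exp(-d2 / (2 * center_sigma ** 2))
        sal = resp * bias
        return sal / sal.max()                         # normalized saliency map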
|
|
|
Heterogeneous feature selection by group lasso with logistic regression |
| |
Fei Wu,
Ying Yuan,
Yueting Zhuang
|
|
Pages: 983-986 |
|
doi>10.1145/1873951.1874129 |
|
Full text: PDF
|
|
The selection of groups of discriminative features is critical for image understanding, since irrelevant features can deteriorate its performance. This paper formulates the selection of groups of discriminative features as an extension of the group lasso with logistic regression for the high-dimensional feature setting; we call this heterogeneous feature selection by Group Lasso with Logistic Regression (GLLR). GLLR encodes a sparse grouping prior to obtain a more interpretable model for feature selection, and can identify most of the discriminative groups of homogeneous features. Applying GLLR to image annotation shows that it achieves better performance.
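The group-sparsity mechanism can be illustrated with a proximal-gradient solver for group-lasso-penalized logistic regression. This is a generic optimizer sketch under assumed group indices and hyper-parameters, not the authors' algorithm.

    # Group lasso + logistic loss via proximal gradient descent (illustrative).
    import numpy as np

    def gllr(X, y, groups, lam=0.1, lr=0.01, iters=500):
        """X: (n, d); y in {0,1}; groups: list of index arrays, one per feature group."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))        # logistic predictions
            w = w - lr * (X.T @ (p - y)) / n          # gradient step on logistic loss
            for g in groups:                          # proximal step: group soft-threshold
                norm = np.linalg.norm(w[g])
                w[g] = 0.0 if norm <= lr * lam else w[g] * (1 - lr * lam / norm)
        return w  # groups with w[g] == 0 everywhere are discarded features

    # Usage: selected = [g for g in groups if np.linalg.norm(w[g]) > 0]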
|
|
|
A novel audio fingerprinting method robust to time scale modification and pitch shifting |
| |
Bilei Zhu,
Wei Li,
Zhurong Wang,
Xiangyang Xue
|
|
Pages: 987-990 |
|
doi>10.1145/1873951.1874130 |
|
Full text: PDF
|
|
A novel audio fingerprinting method that is highly robust to Time Scale Modification (TSM) and pitch shifting is proposed. Instead of simply employing spectral or tempo-related features, our system is based on computer-vision techniques. We transform each 1-D audio signal into a 2-D image and treat TSM and pitch shifting of the audio signal as stretch and translation of the corresponding image. Robust local descriptors are extracted from the image and matched against those of the reference audio signals. Experimental results show that our system is highly robust to various audio distortions, including the challenging TSM and pitch shifting.
|
|
|
Image classification using the web graph |
| |
Dhruv Kumar Mahajan,
Malcolm Slaney
|
|
Pages: 991-994 |
|
doi>10.1145/1873951.1874131 |
|
Full text: PDF
|
|
Image classification is a well-studied and hard problem in computer vision. We extend a proven solution for classifying web spam to handle images. We exploit the link structure of the web graph: a web page related to a given category is normally linked to other pages describing related objects. Our approach combines information from the web-graph structure with semi-supervised learning from all the unlabeled images to create a superior image-classification model for multimedia data. We show that fusing image, text and web-graph features gives a 12% improvement (in the area under the ROC curve) over content features alone in an adult image-classification experiment.
|
|
|
SESSION: Short - S5/content/human-centered multimedia track |
| |
Marc Cavazza
|
|
|
|
|
Interactive learning of heterogeneous visual concepts with local features |
| |
Wajih Ouertani,
Michel Crucianu,
Nozha Boujemaa
|
|
Pages: 995-998 |
|
doi>10.1145/1873951.1874133 |
|
Full text: PDF
|
|
In the context of computer-assisted plant identification, we face challenging information retrieval problems because of the very high within-class variability and the limited number of training examples. To address these problems, we suggest a new interactive learning approach that combines similarity-based retrieval and re-ranking by SVM using local feature distributions. This approach leads to improved sample selection, yielding better results.
|
|
|
Index support for content-based multimedia exploration |
| |
Christian Beecks,
Philip Driessen,
Thomas Seidl
|
|
Pages: 999-1002 |
|
doi>10.1145/1873951.1874134 |
|
Full text: PDF
|
|
Content-based multimedia exploration systems support users in browsing and searching voluminous multimedia databases in an interactive and playful way. Guiding the user navigation and exploration process through the database contents, similarity-based layouts frequently serve as effective and intuitive user interfaces offering a multitude of easy-to-use query functionalities. In order to process queries arising in similarity-based layouts efficiently, we propose to support content-based exploration systems with index structures. For this purpose, we present a general approach which supports any kind of distance-based index structure. By evaluating our approach, we show that it improves the efficiency of content-based multimedia exploration.
|
|
|
Hybrid active learning for cross-domain video concept detection |
| |
Huan Li,
Yuan Shi,
Ming-yu Chen,
Alexander G. Hauptmann,
Zhang Xiong
|
|
Pages: 1003-1006 |
|
doi>10.1145/1873951.1874135 |
|
Full text: PDF
|
|
Cross-domain video concept detection is a challenging task due to the distribution difference between the source and target domains. To avoid expensive labeling of the target-domain data, active learning can be used to incrementally learn a target classifier by reusing the one from the source domain. It uses a discriminative query strategy and picks the most ambiguous samples to label, which can fail if the distribution difference is too large. In this paper, to deal with large differences in data distributions, we propose a generative query strategy, which is then combined with the existing discriminative one to yield a hybrid method. This method adaptively fits the distribution differences and gives a mixture strategy that performs more robustly than either single strategy. Experimental results on the TRECVID semantic concept detection task demonstrate the superior performance of our hybrid method.
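A minimal version of such a hybrid query strategy might score pool samples by mixing classifier uncertainty (the discriminative part) with density under a model of the target domain (the generative part). The SVM margin, the Gaussian-mixture density, and the mixing weight below are all illustrative assumptions.

    # Hybrid active-learning query scoring (illustrative assumptions throughout).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.mixture import GaussianMixture

    def hybrid_query(clf, X_pool, n_pick=10, beta=0.5):
        """clf: trained SVC with decision_function; X_pool: unlabeled target data."""
        margin = np.abs(clf.decision_function(X_pool))       # small = ambiguous
        uncertainty = 1.0 / (1.0 + margin)                   # discriminative score
        gmm = GaussianMixture(n_components=5, random_state=0).fit(X_pool)
        density = np.exp(gmm.score_samples(X_pool))          # generative score
        density = density / density.max()
        score = beta * uncertainty + (1 - beta) * density    # hybrid criterion
        return np.argsort(-score)[:n_pick]                   # indices to label next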
|
|
|
Hierarchical image feature extraction and classification |
| |
Min-Hsuan Tsai,
Shen-Fu Tsai,
Thomas S. Huang
|
|
Pages: 1007-1010 |
|
doi>10.1145/1873951.1874136 |
|
Full text: PDF
|
|
In the field of machine learning and pattern recognition, an alternative to conventional classification is hierarchical classification, which exploits hierarchical relations between concepts of interest. To the best of our knowledge, all hierarchical classification methods in the literature are designed to reduce computational complexity without sacrificing too much accuracy. In this work on image classification, we first propose hierarchical image feature extraction, which extracts image features based on the location of the current node in the hierarchy, to fit the images under that node and to better distinguish its subclasses. As far as we know, such node-dependent feature extraction has not been considered in the literature. In contrast to former hierarchical classification methods that only consider the local structure of the hierarchy, we propose a novel cross-level hierarchical classification method that utilizes both global and local concept structures throughout the entire path decision-making process. Our experimental results on two datasets show that the proposed hierarchical feature extraction combined with our novel hierarchical classification achieves better accuracy than conventional non-hierarchical classification methods, and hence conventional hierarchical methods as well.
|
|
|
Revealing real quality of double compressed MP3 audio |
| |
Mengyu Qiao,
Andrew H. Sung,
Qingzhong Liu
|
|
Pages: 1011-1014 |
|
doi>10.1145/1873951.1874137 |
|
Full text: PDF
|
|
MP3 is the most popular format for audio storage and a de facto standard of digital audio compression for transfer and playback. The flexible compression ratio of MP3 coding lets users choose their own trade-off between file size and quality. Double MP3 compression often occurs in audio forgery, steganography, and quality faking, by transcoding an MP3 audio file to a different compression ratio. To detect double MP3 compression, in this paper we extract statistical features on the modified discrete cosine transform, and apply support vector machines and a dynamic evolving neural-fuzzy inference system to the extracted features for classification. Experimental results show that our method effectively and accurately detects double MP3 compression for both up-transcoded and down-transcoded MP3 files. Our study also indicates the potential for mining the audio processing history for forensic purposes.
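To suggest the flavor of the feature-extraction step, the sketch below computes coarse statistics of frame-wise transform coefficients and feeds them to an SVM. Note the hedges: a plain windowed DCT stands in for the true MDCT, and the sub-band statistics are assumptions rather than the paper's feature set.

    # Transform-coefficient statistics + SVM for double-compression detection
    # (illustrative; DCT is an assumed stand-in for the MDCT).
    import numpy as np
    from scipy.fft import dct
    from sklearn.svm import SVC

    def coeff_stats(y, frame=1152):
        """Coarse per-sub-band statistics of frame-wise DCT coefficients."""
        n = len(y) // frame
        C = dct(y[:n * frame].reshape(n, frame), axis=1, norm="ortho")
        feats = []
        for band in np.array_split(np.abs(C), 8, axis=1):   # 8 coarse sub-bands
            feats += [band.mean(), band.std(), np.median(band)]
        return np.array(feats)

    # Training sketch:
    # X = np.vstack([coeff_stats(sig) for sig in signals]); y = labels
    # clf = SVC(kernel="rbf").fit(X, y)
    # clf.predict(coeff_stats(test_signal).reshape(1, -1))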
|
|
|
Prediction of favourite photos using social, visual, and textual signals |
| |
Roelof van Zwol,
Adam Rae,
Lluis Garcia Pueyo
|
|
Pages: 1015-1018 |
|
doi>10.1145/1873951.1874138 |
|
Full text: PDF
|
|
This paper focuses on the prediction of users' favourite photos on Flickr. We propose a multi-modal, machine-learned approach that combines social, visual, and textual signals into a single prediction system. Although each individual user has different motivations for calling a photo a favourite, we show that the textual, visual, and social modalities effectively capture the needs of most active Flickr users. We use gradient-boosted decision trees (GBDT) with a modified least-squares loss function for the classification of a user's favourite photos, and evaluate the performance of our classifier with respect to the individual modalities and various combinations thereof. Using a combination of the social and visual modalities, the GBDT creates a highly effective classifier. The addition of textual features allows us to significantly increase recall, with a slight trade-off in precision.
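As a rough illustration of the fusion setup, one can concatenate per-modality feature blocks and train a gradient-boosted tree classifier. scikit-learn's GBDT with its default loss is used here as a stand-in for the paper's GBDT with a modified least-squares loss; all variable names are assumptions.

    # Early fusion of modality blocks into one GBDT (illustrative stand-in).
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    def train_favourite_model(X_social, X_visual, X_text, y):
        X = np.hstack([X_social, X_visual, X_text])   # simple early fusion
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
        gbdt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                          max_depth=3).fit(X_tr, y_tr)
        print("held-out accuracy:", gbdt.score(X_te, y_te))
        return gbdt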
|
|
|
One person labels one million images |
| |
Jinhui Tang,
Qiang Chen,
Shuicheng Yan,
Tat-Seng Chua,
Ramesh Jain
|
|
Pages: 1019-1022 |
|
doi>10.1145/1873951.1874139 |
|
Full text: PDF
|
|
Targeting the same objective as automatic annotation, namely alleviating manual work, in this paper we propose a novel framework for manually annotating a large-scale image corpus with minimal human effort. In this framework, a dynamic multi-scale cluster labeling strategy is proposed to manually label clusters of similar image regions. Users label the multi-scale clusters of regions instead of individual images, so each labeling operation can annotate hundreds or even thousands of images simultaneously with much-reduced manual work. Meanwhile, the manual labeling guarantees the accuracy of the labels. Compared to automatic annotation, the proposed framework is more flexible, general, and effective, especially for annotating labels with large semantic gaps. Experiments on the NUS-WIDE dataset demonstrate that the proposed fast manual annotation framework is much more effective than automatic annotation, and comparatively efficient.
|
|
|
A novel virtual world based HCI paradigm for multimedia scholarly communication |
| |
Arturo Nakasone,
Tiago da Silva,
Andreas Budde,
Kugamoorthy Gajananan,
Tri T. Truong,
Helmut Prendinger
|
|
Pages: 1023-1026 |
|
doi>10.1145/1873951.1874140 |
|
Full text: PDF
|
|
The sharing of academic knowledge through printed publications has been widely and successfully practised for more than a hundred years. However, the need to process huge amounts of data in scientific analysis and communicate the results to the scientific community presents a big challenge for researchers in the data-intensive era. Ideally, in addition to providing accurate and flexible graphical representations of data, the entire research process should be made verifiable by peers. While web-based tools have been proposed to address this problem, most of them lack features important for scientific work, such as real-time collaboration, powerful multimedia visualization and interaction, and environment persistency. Therefore, in this paper we propose the use of virtual world technology to address these issues. In particular, we give an overview of the different methods that we have implemented in the context of multimedia interaction, which is considered the most critical factor in the development of virtual worlds as sound platforms for human-centered multimedia systems.
|
|
|
Kodak moments and Flickr diamonds: how users shape large-scale media |
| |
Radu Andrei Negoescu,
Alexander C. Loui,
Daniel Gatica-Perez
|
|
Pages: 1027-1030 |
|
doi>10.1145/1873951.1874141 |
|
Full text: PDF
|
|
In today's age of digital multimedia deluge, a clear understanding of the dynamics of online communities is crucial. Users have abandoned their role of passive consumers and are now the driving force behind large-scale media repositories, whose dynamics and shaping factors are not yet fully understood. In this paper we present a novel human-centered analysis of two major photo-sharing websites, Flickr and Kodak Gallery. On a combined dataset of over 5 million tagged photos, we investigate fundamental differences and similarities at the level of tag usage, and propose a joint probabilistic topic model to provide further insight into semantic differences between the two communities. Our results show that the effects of the users' motivations and needs can be strongly observed in this large-scale data, in the form of what we call Kodak Moments and Flickr Diamonds. They are an indication that system designers should carefully take into account the target audience and its needs.
|
|
|
Inter-ACT: an affective and contextually rich multimodal video corpus for studying interaction with robots |
| |
Ginevra Castellano,
Iolanda Leite,
Andre Pereira,
Carlos Martinho,
Ana Paiva,
Peter W. McOwan
|
|
Pages: 1031-1034 |
|
doi>10.1145/1873951.1874142 |
|
Full text: PDF
|
|
The Inter-ACT (INTEracting with Robots - Affect Context Task) corpus is an affective and contextually rich multimodal video corpus containing affective expressions of children playing chess with an iCat robot. It contains videos that capture the interaction from different perspectives and includes synchronised contextual information about the game and the behaviour displayed by the robot. The Inter-ACT corpus is mainly intended to be a comprehensive repository of naturalistic and contextualised, task-dependent data for the training and evaluation of an affect recognition system in an educational game scenario. The richness of contextual data that captures the whole human-robot interaction cycle, together with the fact that the corpus was collected in the same interaction scenario as the target application, makes the Inter-ACT corpus unique in its genre.
|
|
|
Multi-scale entropy analysis of dominance in social creative activities |
| |
Donald Glowinski,
Paolo Coletta,
Gualtiero Volpe,
Antonio Camurri,
Carlo Chiorri,
Andrea Schenone
|
|
Pages: 1035-1038 |
|
doi>10.1145/1873951.1874143 |
|
Full text: PDF
|
|
Our research focuses on ensemble musical performance, an ideal test-bed for the development of models and techniques for measuring creative social interaction in an ecologically valid framework. Starting from expressive behavioral data of a string quartet, this paper addresses the application of the Multi-Scale Entropy method to investigate dominance.
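Multi-Scale Entropy itself is easy to reproduce: coarse-grain the signal at increasing scales and compute sample entropy at each scale. The sketch below follows the standard Costa-style recipe with assumed parameters m and r; the paper's exact settings and input signals are not specified here.

    # Multi-scale (sample) entropy of a 1-D behavioral signal (illustrative).
    import numpy as np

    def sample_entropy(x, m=2, r=0.2):
        x = np.asarray(x, float)
        tol = r * x.std()                         # tolerance relative to signal SD
        def count(mm):
            # Count template pairs whose max element-wise distance is below tol.
            emb = np.array([x[i:i + mm] for i in range(len(x) - mm + 1)])
            d = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
            return np.sum(d < tol) - len(emb)     # exclude self-matches
        B, A = count(m), count(m + 1)
        return -np.log(A / B) if A > 0 and B > 0 else np.inf

    def multiscale_entropy(x, scales=range(1, 11), m=2, r=0.2):
        out = []
        for s in scales:                          # coarse-grain by factor s
            n = len(x) // s
            cg = np.asarray(x[:n * s], float).reshape(n, s).mean(axis=1)
            out.append(sample_entropy(cg, m, r))
        return np.array(out)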
|
|
|
Evaluation of digital games using QOL measurements |
| |
Yukari Hori,
Akira Baba
|
|
Pages: 1039-1042 |
|
doi>10.1145/1873951.1874144 |
|
Full text: PDF
|
|
Digital games have become part of everyday life all over the world. In this article, we suggest using Quality of Life (QOL) measurements to investigate the characteristics of game play and to evaluate games. We measured emotional changes caused by game play using a QOL questionnaire specialized for detecting mental change (the POMS Brief Form) and salivary enzyme analysis, a typical biomarker test for stress, in an interrupted time-series design. The analysis of variance showed that some negative feelings, such as uneasiness and anger, are significantly reduced by game play. Furthermore, an appreciative evaluation of the game was found to correlate with a positive effect, so the test participant's subjective evaluation of the game needs to be considered during the experiment. Additionally, the feelings affected clearly differed across the 16 titles analyzed; a new classification of games from the viewpoint of POMS-measured change in QOL is therefore possible.
|
|
|
MuVis: an application for interactive exploration of large music collections |
| |
Ricardo Dias,
Manuel J. Fonseca
|
|
Pages: 1043-1046 |
|
doi>10.1145/1873951.1874145 |
|
Full text: PDF
|
|
In this paper we present MuVis, an interactive visualization and exploration tool for large music collections, based on music content and metadata. We combined a user-centered design with three main components: information visualization techniques (based on semantically ordered treemaps), music information retrieval mechanisms (for semantic and content-based information extraction), and dynamic queries, to offer users a more efficient, flexible, and yet easy-to-use solution for browsing music collections and creating playlists. Preliminary results reveal that our solution is faster and easier to use than Windows Media Player, allowing users to navigate more effectively and quickly while gaining a deeper knowledge of their library. A satisfaction survey revealed that users liked our approach for browsing, filtering, and creating playlists, while at the same time they were able to "re-discover" forgotten music thanks to the similarity mechanisms incorporated in our solution.
|
|
|
From photo networks to social networks, creation and use of a social network derived with photos |
| |
Michel Plantié,
Michel Crampes
|
|
Pages: 1047-1050 |
|
doi>10.1145/1873951.1874146 |
|
Full text: PDF
|
|
With the new possibilities in communication and information management, social networks and photos have received plenty of attention in the digital age. In this paper, we show how social photos captured during family events, representing individuals or groups, can be visualized as a network that reveals social attributes. From this photo network, a social network is extracted that can help build personalized albums. The photo network organization makes use of Formal Concept Analysis methods.
|
|
|
Enriching audio-visual chat with conversation-based image retrieval and display |
| |
Jeroen Vanattenhoven,
Christof van Nimwegen,
Matthias Strobbe,
Olivier Van Laere,
Bart Dhoedt
|
|
Pages: 1051-1054 |
|
doi>10.1145/1873951.1874147 |
|
Full text: PDF
|
|
This paper presents the results of a user study carried out to evaluate an application prototype in which an audio-visual chat conversation between two users is augmented with pictures related to the topics of that conversation. The prototype analyses the conversation and deduces the topic of conversation by means of a keyword tree augmented by an ontology. It then retrieves pictures from Flickr based on this topic, after which the pictures are shown to the users. This mechanism is called conversation-based image retrieval. Fifteen participants were recruited for the user study; the duration of one session was approximately 30 minutes. Eye tracking and questionnaires were used to evaluate participants' experiences. We found that participants value the use of pictures to augment an audio-visual chat application. Furthermore, participants claimed they would use it in a social context: talking to family, friends, and acquaintances. One significant improvement over the prototype would be to use their own pictures (personal user-generated content) instead of random pictures.
|
|
|
A shape-free, designable 6-DoF marker tracking method for camera-based interaction in mobile environment |
| |
Hiroki Nishino
|
|
Pages: 1055-1058 |
|
doi>10.1145/1873951.1874148 |
|
Full text: PDF
|
|
We developed a novel marker tracking method with shape-free, designable markers, which can be visually meaningful to users. The method works fast enough to provide real-time camera-based interaction even on low-performance CPUs such as those used in mobile Internet devices. Features such as visually communicative design and inexpensive computational cost are very desirable for users with mobile devices in the mobile/pervasive interaction environment. The method utilizes topological region adjacency to detect marker candidates, and then applies a simple method similar to geometric hashing to identify the detected marker by voting into hash tables. By combining these two different approaches, our method can distinguish markers with the same topological structure and is also capable of 6-DoF pose estimation, whereas most existing topology-based systems cannot distinguish markers with the same topological structure.
|
|
|
iWalk: a tool for interacting with geo-located data through movement and gesture |
| |
Visruth Premraj,
Margaret Schedel,
Tamara L. Berg
|
|
Pages: 1059-1062 |
|
doi>10.1145/1873951.1874149 |
|
Full text: PDF
|
|
In this work, we present iWalk, a multimedia exploration tool that provides an interactive virtual environment for physically exploring geo-tagged data. The tool is flexible enough for users to easily explore their own collections or existing collections from the web. Two interaction modalities are incorporated into our tool: movement and gesture. Movement (walking around the physical space) is used to intuitively move through the digital data space of a collection. Gesture is used for direct data manipulation; the user is able to select the mapping between gestures and interactions. In addition, we also provide functionality for exploring data that is not geo-located: the user defines a rough mapping between the data collection space and the physical interaction space and then operates the program as usual. We have currently tested the system on three different datasets: a large collection of geo-tagged photographs from Flickr, a geo-located sound collection, and a museum collection that is not geo-located.
|
|
|
The colour of life: novel visualisations of population lifestyles |
| |
Philip Kelly,
Aiden R. Doherty,
Alan F. Smeaton,
Cathal Gurrin,
Noel E. O'Connor
|
|
Pages: 1063-1066 |
|
doi>10.1145/1873951.1874150 |
|
Full text: PDF
|
|
Colour permeates our daily lives, yet we rarely take notice of it. In this work we utilise the SenseCam (a visual lifelogging tool) to investigate the predominant colours in one million minutes of human life that a group of 20 individuals encounter throughout their normal daily activities. We also compare the colours that different groups of people are exposed to in their typical days. This information is presented using a novel colour-wheel visualisation, a new means of illustrating that people are exposed to bright colours over longer durations during summer months, and to more dark colours during winter months.
|
|
|
AudioFeeds: a mobile auditory application for monitoring online activities |
| |
Tilman Dingler,
Stephen Brewster
|
|
Pages: 1067-1070 |
|
doi>10.1145/1873951.1874151 |
|
Full text: PDF
|
|
User participation has transformed the way news travels the globe. With the rise of the 'Web 2.0' phenomenon, users have been empowered with the means of creating and distributing informational items, which we call social feeds. Platforms like Twitter and Facebook provide a variety of tools to facilitate real-time communication among people. But social sites are not limited to personal chat; they also provide an effective means of organizing large groups of people in response to catastrophic disasters. Monitoring these feeds can provide time-critical information, but can easily lead to information overload due to the large amount of data being shared. In this paper we introduce a mobile auditory display application called AudioFeeds that allows users to maintain an overview of activity in different social feeds. AudioFeeds runs on a mobile device and enables users to get an overview of their social networks and spot peaks in activity by sonifying social feeds and creating a spatialised soundscape around the user's head. We conducted a user study looking into different aspects of activity monitoring. Results show that our application provides an effective way of monitoring overall activity levels and allows users to identify activity peaks with 86.1% accuracy, even when mobile.
|
|
|
User driven audio content navigation for spoken web |
| |
Ketki A. Dhanesha,
Nitendra Rajput,
Kundan Srivastava
|
|
Pages: 1071-1074 |
|
doi>10.1145/1873951.1874152 |
|
Full text: PDF
|
|
It is common practice for us to skim textual content on a web page. While skimming, we usually skip words or phrases that are not of interest to us and slow down when the content seems relevant. But when we listen to audio content, which is not persistent and is sequential, such skimming is not possible. In developing countries, cell-phone penetration is much higher than Internet penetration; moreover, due to low literacy, voice is a convenient modality for accessing information. Skimming techniques are therefore even more critical in the audio domain. In this paper, we describe a technique for navigating audio content while interacting with information systems in a client-server environment, where a dumb phone is the client device. The paper presents techniques for skimming audio content and for placing markers in audio. User studies conducted with 18 users over more than a month in a live setting substantiate the usability and usefulness of the navigation techniques.
|
|
|
Structuring ordered nominal data for event sequence discovery |
| |
Chreston A. Miller,
Francis Quek,
Naren Ramakrishnan
|
|
Pages: 1075-1078 |
|
doi>10.1145/1873951.1874153 |
|
Full text: PDF
|
|
This work investigates using n-gram processing and a temporal relation encoding to provide relational information about events extracted from media streams. The event information is temporal and nominal in nature, categorized by descriptive labels or symbolic means, and can be difficult to compare relationally or to rank. Given a parsed sequence of events, relational information pertinent to comparison between events can be obtained by applying n-gram techniques borrowed from speech processing together with temporal relation logic. The procedure is discussed along with results computed on a representative dataset characterized by nominal event data.
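At its simplest, the n-gram step amounts to counting label subsequences over the event stream. The toy example below uses an invented event alphabet and ignores the temporal relation encoding, so it illustrates only the counting half of the method.

    # Counting n-grams over a symbolic event sequence (toy illustration).
    from collections import Counter

    def event_ngrams(events, n=3):
        """events: list of nominal labels, e.g. ['gaze', 'speech', 'gesture', ...]."""
        return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

    events = ["gaze", "speech", "gesture", "gaze", "speech", "gesture", "pause"]
    print(event_ngrams(events, n=2).most_common(2))
    # [(('gaze', 'speech'), 2), (('speech', 'gesture'), 2)]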
|
|
|
Automated sleep quality measurement using EEG signal: first step towards a domain specific music recommendation system |
| |
Wei Zhao,
Xinxi Wang,
Ye Wang
|
|
Pages: 1079-1082 |
|
doi>10.1145/1873951.1874154 |
|
Full text: PDF
|
|
With the rapid pace of modern life, millions of people suffer from sleep problems. Music therapy, as a non-medication approach to mitigating sleep problems, has attracted increasing attention recently. However, the applicability of music therapy is limited by the time-consuming task of choosing suitable music for users. Inspired by this observation, we discuss the concept of a domain-specific music recommendation system that automatically recommends music to users according to their sleep quality. The proposed system requires multidisciplinary efforts, including automated sleep quality measurement and content-based music similarity measures. As a first step, we focus on automated sleep quality measurement in this paper. An EEG-based approach is proposed to measure a user's sleep quality. The advantages of our approach over the standard polysomnography (PSG) method are: 1) it measures sleep quality by recognizing three sleep categories rather than six sleep stages, so higher accuracy can be expected; and 2) the three sleep categories are recognized by analyzing the electroencephalography (EEG) signal only, improving the user experience because fewer sensors are attached during sleep. We conduct experiments on a standard dataset. Our approach achieves high accuracy and shows promising potential for the music recommendation system.
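One plausible baseline for EEG-based sleep categorization is relative band power plus a generic classifier, sketched below. The band definitions, epoch length, sampling rate, and random-forest classifier are assumptions; the paper's actual features and model may differ.

    # Relative EEG band powers per epoch, fed to a generic classifier (illustrative).
    import numpy as np
    from scipy.signal import welch
    from sklearn.ensemble import RandomForestClassifier

    BANDS = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}

    def epoch_features(eeg_epoch, fs=100):
        f, pxx = welch(eeg_epoch, fs=fs, nperseg=fs * 4)   # power spectral density
        total = np.trapz(pxx, f)
        feats = []
        for lo, hi in BANDS.values():
            mask = (f >= lo) & (f < hi)
            feats.append(np.trapz(pxx[mask], f[mask]) / total)  # relative band power
        return np.array(feats)

    # Training sketch:
    # X = np.vstack([epoch_features(e) for e in epochs]); y = three-category labels
    # clf = RandomForestClassifier(n_estimators=200).fit(X, y)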
|
|
|
Deducing user's fatigue from haptic data |
| |
Abdelwahab Hamam,
Nicolas D. Georganas,
Fawaz Alsulaiman,
Abdulmotaleb El Saddik
|
|
Pages: 1083-1086 |
|
doi>10.1145/1873951.1874155 |
|
Full text: PDF
|
|
Undesired physical fatigue reduces the overall Quality of Experience (QoE) of virtual reality haptics applications. Detecting fatigue is the first step in rectifying this problem. In usability analysis, fatigue is usually detected through questionnaires and observations. This paper introduces an objective, indirect discovery of user fatigue through analyzing data from a haptic writing application. Our results show that when users feel tired, their kinetic energy decreases. We can compute this kinetic energy from the velocity of the arm movement during use of the haptic device.
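The kinetic-energy cue is simple enough to state as code: differentiate sampled stylus positions to get velocity and apply E = ½mv². The effective mass, sampling rate, and windowing below are assumptions for illustration.

    # Kinetic energy from sampled haptic positions (illustrative assumptions).
    import numpy as np

    def kinetic_energy(positions, fs=1000.0, mass=1.0):
        """positions: (n, 3) stylus positions in metres, sampled at fs Hz."""
        v = np.diff(positions, axis=0) * fs       # finite-difference velocity (m/s)
        speed2 = np.sum(v ** 2, axis=1)
        return 0.5 * mass * speed2                # E = (1/2) m v^2 per sample

    def fatigue_trend(energy, win=1000):
        """Mean energy per window; a decreasing sequence suggests growing fatigue."""
        n = len(energy) // win
        return energy[:n * win].reshape(n, win).mean(axis=1)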
|
|
|
Context-based indoor object detection as an aid to blind persons accessing unfamiliar environments |
| |
Xiaodong Yang,
YingLi Tian,
Chucai Yi,
Aries Arditi
|
|
Pages: 1087-1090 |
|
doi>10.1145/1873951.1874156 |
|
Full text: PDF
|
|
Independent travel is a well-known challenge for blind or visually impaired persons. In this paper, we propose a computer vision-based indoor wayfinding system for assisting blind people to independently access unfamiliar buildings. In order to find different rooms (e.g., an office, a lab, or a bathroom) and other building amenities (e.g., an exit or an elevator), we incorporate door detection with text recognition. First, we develop a robust and efficient algorithm to detect doors and elevators based on general geometric shape, by combining edges and corners. The algorithm is generic enough to handle large intra-class variations of the object model among different indoor environments, as well as small inter-class differences between different objects such as doors and elevators. Next, to distinguish an office door from a bathroom door, we extract and recognize the text information associated with the detected objects. We first extract text regions from indoor signs with multiple colors; then text character localization and layout analysis of text strings are applied to filter out background interference. The extracted text is recognized using off-the-shelf optical character recognition (OCR) software. The object type, orientation, and location can then be conveyed to blind travelers as speech.
|
|
|
SESSION: Short - S6/applications/content track |
| |
Wei Tsang Ooi
|
|
|
|
|
Multimedia cross-platform content distribution for mobile peer-to-peer networks using network coding |
| |
Morten Videbæk Pedersen,
Janus Heide,
Péter Vingelmann,
László Blázovics,
Frank H.P. Fitzek
|
|
Pages: 1091-1094 |
|
doi>10.1145/1873951.1874158 |
|
Full text: PDF
|
|
This paper investigates the possibility of multimedia content distribution over multiple mobile platforms forming wireless peer-to-peer networks. State-of-the-art mobile networks are centralized and oriented around base stations or access points, but current developments are breaking ground for device-to-device communication. In this paper we introduce a mobile application that runs on Symbian as well as iPhone/iPod devices and is able to exchange multimedia content in a point-to-multipoint fashion. The application, coined PictureViewer, can convey pictures from one source device to many neighboring devices using a wireless 802.11 network together with network coding and user cooperation. The advantage of network coding in this context is that the source device needs only minimal knowledge about the sinks' received packets, so only a minimal amount of feedback is needed to ensure reliable data delivery.
|
|
|
Topical summarization of web videos by visual-text time-dependent alignment |
| |
Song Tan,
Hung-Khoon Tan,
Chong-Wah Ngo
|
|
Pages: 1095-1098 |
|
doi>10.1145/1873951.1874159 |
|
Full text: PDF
|
|
Search engines often return a long list of hundreds or even thousands of videos in response to a query topic. Efficient navigation of these videos becomes difficult, and users often need to painstakingly explore the search list for a gist of the search result. This paper addresses the challenge of topical summarization by providing a timeline-based visualization of videos through the matching of heterogeneous sources. To overcome the so-called sparse-text problem of web videos, auxiliary information from Google context is exploited. Google Trends is used to predict the milestone events of a topic, while the typical scenes of web videos are extracted by visual near-duplicate threading. Visual-text alignment is then conducted to align scenes from videos with articles from Google News. The outcome is a set of scene-news pairs, each representing an event mapped to the milestone timeline of a topic. The timeline-based visualization provides a glimpse of the major events of a topic. We conduct both quantitative and subjective studies to evaluate the practicality of the application.
|
|
|
Improved saliency detection based on superpixel clustering and saliency propagation |
| |
Zhixiang Ren,
Yiqun Hu,
Liang-Tien Chia,
Deepu Rajan
|
|
Pages: 1099-1102 |
|
doi>10.1145/1873951.1874160 |
|
Full text: PDF
|
|
Saliency detection is useful for high-level applications such as adaptive compression, image retargeting, and object recognition. In this paper, we introduce an effective region-based solution for saliency detection. We first use the adaptive mean shift algorithm to extract superpixels from the input image, then apply a Gaussian Mixture Model (GMM) to cluster superpixels based on their color similarity, and finally calculate the saliency value for each cluster using a compactness metric together with modified PageRank propagation. This solution is able to represent the image in a perceptually meaningful way and is robust to over-segmentation. It highlights salient regions at full resolution with well-defined boundaries. Experimental results show that both the adaptive mean shift and the modified PageRank algorithm contribute substantially to the saliency detection result. In addition, ROC analysis demonstrates that our approach significantly outperforms five existing popular methods.
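The propagation step can be pictured as diffusing per-cluster compactness scores over a cluster-similarity graph. The sketch below is a generic PageRank-style power iteration with assumed parameters (damping d, a toy three-cluster similarity matrix), not the authors' exact modification:

    import numpy as np

    def propagate_saliency(W, s0, d=0.85, iters=50):
        """Diffuse initial compactness scores s0 over a cluster
        similarity matrix W with a PageRank-style iteration."""
        P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transitions
        s = s0.copy()
        for _ in range(iters):
            s = d * P.T @ s + (1 - d) * s0     # teleport back to the prior
        return s / s.max()                     # normalize to [0, 1]

    W = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
    print(propagate_saliency(W, np.array([0.2, 0.3, 0.9])))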
|
|
|
Refining video annotation by exploiting inter-shot context |
| |
Jian Yi,
Yuxin Peng,
Jianguo Xiao
|
|
Pages: 1103-1106 |
|
doi>10.1145/1873951.1874161 |
|
Full text: PDF
|
|
This paper proposes a new approach to refine video annotation by exploiting inter-shot context. Our method is novel mainly in two ways. On one hand, to refine the annotation results of the target concept, we model the sequence of shots in a video as a conditional random field with chain structure. In this way, we can capture different kinds of concept relationships in the inter-shot context to improve annotation accuracy. On the other hand, to exploit inter-shot context for the target concept, we classify shots into different types according to their correlation to the target concept, which are used to represent different kinds of concept relationships in the inter-shot context. Experiments on the widely used TRECVID 2006 data set show that our method is effective for refining video annotation, achieving a significant performance improvement over several state-of-the-art methods.
|
|
|
Web video categorization based on Wikipedia categories and content-duplicated open resources |
| |
Zhineng Chen,
Juan Cao,
Yicheng Song,
Yongdong Zhang,
Jintao Li
|
|
Pages: 1107-1110 |
|
doi>10.1145/1873951.1874162 |
|
Full text: PDF
|
|
This paper presents a novel approach for web video categorization by leveraging Wikipedia categories (WikiCs) and open resources describing the same content as the video, i.e., content-duplicated open resources (CDORs). Note that current approaches only collect CDORs within one or a few media forms and ignore CDORs of other forms. We explore all these resources by utilizing WikiCs and commercial search engines. Given a web video, its discriminative Wikipedia concepts are first identified and classified. Then a textual query is constructed, from which CDORs are collected. Based on these CDORs, we propose to categorize web videos in the space spanned by WikiCs rather than that spanned by raw tags. Experimental results demonstrate the effectiveness of both the proposed CDOR collection method and the WikiC voting categorization algorithm. In addition, the categorization model built on both WikiCs and CDORs achieves better performance than the models built on only one of them, as well as a state-of-the-art approach.
|
|
|
Supporting children's social communication skills through interactive narratives with virtual characters |
| |
Mary Ellen Foster,
Katerina Avramides,
Sara Bernardini,
Jingying Chen,
Christopher Frauenberger,
Oliver Lemon,
Kaska Porayska-Pomsta
|
|
Pages: 1111-1114 |
|
doi>10.1145/1873951.1874163 |
|
Full text: PDF
|
|
The development of social communication skills in children relies on multimodal aspects of communication such as gaze, facial expression, and gesture. We introduce a multimodal learning environment for social skills which uses computer vision to estimate the children's gaze direction, processes gestures from a large multi-touch screen, estimates in real time the affective state of the users, and generates interactive narratives with embodied virtual characters. We also describe how the structure underlying this system is currently being extended into a general framework for the development of interactive multimodal systems.
|
|
|
Automatic image tagging via category label and web data |
| |
Shenghua Gao,
Zhengxiang Wang,
Liang-Tien Chia,
Ivor Wai-Hung Tsang
|
|
Pages: 1115-1118 |
|
doi>10.1145/1873951.1874164 |
|
Full text: PDF
|
|
Image tagging is an important technique for image content understanding and text-based image processing. Given a selection of images, how to tag these images efficiently and effectively is an interesting problem. In this paper, a novel semi-automatic image tagging technique is proposed: by assigning each image a category label first, our method can automatically recommend promising tags for each image by utilizing existing vast web data. The main contributions of our paper can be highlighted as follows: (i) By assigning each image a category label, our method can automatically recommend other tags for the image, thus reducing human annotation effort; meanwhile, our method guarantees tag diversity due to the abundant web data. (ii) We use sparse coding to automatically select semantically related images for tag propagation. (iii) Local and global ranking agglomeration makes our method robust to noisy tags. We use the Event dataset as the images to be tagged, and crawl Flickr images with their associated tags according to the category labels in the Event dataset as the auxiliary web data. Experimental results show that our method achieves promising performance for image tagging, which proves its effectiveness.
|
|
|
Auto-tagging of images in non-english languages using tag language conversion |
| |
Keiichiro Hoashi,
Hiromi Ishizaki,
Hjalmar Wennerstrom,
Yasuhiro Takishima
|
|
Pages: 1119-1122 |
|
doi>10.1145/1873951.1874165 |
|
Full text: PDF
|
|
Utilization of web images with social tags as training data has been a major trend in the development of automatic image tagging/classification systems. While the amount of information available on web sites such as Flickr is abundant, the majority of information obtained from such sites is in English, the dominant language on the web. This linguistic imbalance is expected to affect auto-tagging results negatively for non-English users, who demand image tags in their native languages. The objective of this research is to develop an image auto-tagging system which can generate tags in languages other than English. This paper examines the effect of linguistic imbalance in training data used to construct auto-tagging systems which aim to generate tags in minor languages. Furthermore, we propose methods which utilize an auto-tagging model generated on English training data and convert the auto-tagging results to the target language. Subjective evaluations show that the proposed method is capable of generating auto-tagging results of better quality than conventional approaches.
|
|
|
Landmark image retrieval using visual synonyms |
| |
Efstratios Gavves,
Cees G.M. Snoek
|
|
Pages: 1123-1126 |
|
doi>10.1145/1873951.1874166 |
|
Full text: PDF
|
|
In this paper, we consider the incoherence problem of the visual words in bag-of-words vocabularies. Different from existing work, which performs assignment of words based solely on closeness in descriptor space, we focus on identifying pairs of independent, distant words - the visual synonyms - that are still likely to host image patches with similar appearance. To study this problem we focus on landmark images, where we can examine whether image geometry is an appropriate vehicle for detecting visual synonyms. We propose an algorithm for the extraction of visual synonyms in landmark images. To show the merit of visual synonyms, we perform two experiments. We examine the closeness of synonyms in descriptor space, and we show a first application of visual synonyms in a landmark image retrieval setting. Using visual synonyms, we perform on par with the state-of-the-art, but with six times fewer visual words.
|
|
|
Approximate image color correlograms |
| |
Claudio Taranto,
Nicola Di Mauro,
Stefano Ferilli,
Floriana Esposito
|
|
Pages: 1127-1130 |
|
doi>10.1145/1873951.1874167 |
|
Full text: PDF
|
|
The recent explosion in Internet usage and the growing amount of digital images caused by the increasingly ubiquitous presence of digital cameras have created a demand for effective and flexible techniques for automatic image retrieval. As the volume of data increases, memory and processing requirements need to increase at the same rapid pace, and this is often prohibitively expensive. Image collections on this scale make performing even the most common and simple image processing and machine learning tasks non-trivial. In this paper we present a method to reduce the computational complexity of a widely known method for image indexing and retrieval based on a second-order statistical measure. The aim of the paper is twofold: Q1) is it possible to efficiently extract an approximate distribution of the image features with a low resulting error? Q2) how does the resulting approximate distribution affect similarity-based accuracy? In particular, we propose a sampling method to approximate the distribution of correlograms, adopting a Monte Carlo approach to compute the distribution on a subset of pixels uniformly sampled from the original image. A further variant is to sample the neighborhood of each pixel too. Validation on the Caltech 101 dataset shows that the proposed approximate distribution, obtained with a considerable decrease in computation time, has a very low error when compared to the exact distribution. Results obtained in a second experiment on a similarity-based ranking task are encouraging.
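To make the sampling idea concrete, the toy sketch below estimates an auto-correlogram (the probability that a pixel at L-infinity distance k shares a pixel's color) from uniformly sampled positions rather than a full scan; the image, color count, and sample budget are placeholders:

    import random

    def approx_autocorrelogram(img, n_colors, k, n_samples, rng=random.Random(0)):
        """Monte Carlo estimate of the color auto-correlogram at distance k."""
        h, w = len(img), len(img[0])
        hits, trials = [0] * n_colors, [0] * n_colors
        offsets = [(-k, 0), (k, 0), (0, -k), (0, k)]
        for _ in range(n_samples):
            y, x = rng.randrange(h), rng.randrange(w)
            dy, dx = rng.choice(offsets)          # sample the neighborhood too
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                c = img[y][x]
                trials[c] += 1
                hits[c] += (img[ny][nx] == c)
        return [hit / t if t else 0.0 for hit, t in zip(hits, trials)]

    img = [[0, 0, 1], [0, 1, 1], [2, 2, 2]]       # a tiny color-quantized image
    print(approx_autocorrelogram(img, n_colors=3, k=1, n_samples=1000))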
|
|
|
Data-oriented locality sensitive hashing |
| |
Wei Zhang,
Ke Gao,
Yong-dong Zhang,
Jin-tao Li
|
|
Pages: 1131-1134 |
|
doi>10.1145/1873951.1874168 |
|
Full text: PDF
|
|
Locality Sensitive Hashing (LSH) has been proposed as a scalable, high-dimensional index for approximate similarity search. Euclidean LSH is a variation of LSH and has been successfully used in many multimedia applications. However, the hash functions of basic Euclidean LSH project data points onto randomly selected directions, which reduces accuracy when data are non-uniformly distributed. More hash tables are then needed to guarantee accuracy, and thus more memory is consumed. Since heavy memory cost is a significant drawback of Euclidean LSH, we propose Data-Oriented LSH to reduce memory consumption when data are non-uniformly distributed. Most existing methods are query-directed, such as multi-probe and query expansion methods. We focus on hash table construction, so the query-directed methods can be applied on top of our index for further improvement. Experiments show that, to achieve the same accuracy, our method uses less time and less memory than the original Euclidean LSH.
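For reference, the basic Euclidean (p-stable) LSH that the paper improves upon hashes a vector v as floor((a . v + b) / w) along Gaussian random directions a. A minimal single-table sketch, with made-up parameter values:

    import math
    import random

    class E2LSH:
        """One table of basic Euclidean LSH with m concatenated functions."""
        def __init__(self, dim, m=4, w=4.0, rng=random.Random(0)):
            self.w = w
            self.a = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
            self.b = [rng.uniform(0, w) for _ in range(m)]
            self.table = {}

        def key(self, v):
            return tuple(
                math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
                for a, b in zip(self.a, self.b))

        def insert(self, v, label):
            self.table.setdefault(self.key(v), []).append(label)

        def query(self, v):                 # nearby points usually collide
            return self.table.get(self.key(v), [])

    lsh = E2LSH(dim=3)
    lsh.insert([1.0, 2.0, 3.0], "p1")
    print(lsh.query([1.01, 2.0, 3.0]))

A data-oriented variant would replace the purely random directions with ones informed by the data distribution, which is the gap the paper addresses.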
|
|
|
Automatically protecting privacy in consumer generated videos using intended human object detector |
| |
Yuta Nakashima,
Noboru Babaguchi,
Jianping Fan
|
|
Pages: 1135-1138 |
|
doi>10.1145/1873951.1874169 |
|
Full text: PDF
|
|
The growing popularity of video sharing services such as YouTube enables us to upload and share consumer generated videos (CGVs) easily, resulting in disclosure of the privacy sensitive information (PSI) of persons, i.e., their appearances. Therefore, we need a technique for automatically protecting privacy in CGVs; the main problem, however, is how to determine PSI regions automatically. In this paper, we propose a novel system for automatically protecting privacy in CGVs. The proposed system tackles the problem of determining PSI regions by using an intended human object detector, which detects the human objects that the camera person wanted to capture to achieve his/her capture intention. In addition, the proposed system adopts several PSI obscuring methods such as blocking out, blurring and seam carving. We present the results of subjective evaluations of a privacy-protected video in terms of visual quality and acceptability of PSI disclosure, as well as the performance of the intended human object detector.
|
|
|
Non-parametric anomaly detection exploiting space-time features |
| |
Lorenzo Seidenari,
Marco Bertini
|
|
Pages: 1139-1142 |
|
doi>10.1145/1873951.1874170 |
|
Full text: PDF
|
|
In this paper a real-time anomaly detection system for video streams is proposed. Spatio-temporal features are exploited to capture scene dynamic statistics together with appearance. Anomaly detection is performed in a non-parametric fashion, evaluating directly local descriptor statistics. A method to update scene statistics, to cope with scene changes that typically happen in real world settings, is also provided. The proposed method is tested on publicly available datasets.
|
|
|
A scalable cover identification engine |
| |
Emanuele Di Buccio,
Nicola Montecchio,
Nicola Orio
|
|
Pages: 1143-1146 |
|
doi>10.1145/1873951.1874171 |
|
Full text: PDF
|
|
This paper describes the implementation of a content-based cover song identification system which has been released under an open source license. The system is centered around the Apache Lucene text search engine library, and demonstrates how classic techniques derived from textual Information Retrieval, in particular the bag-of-words paradigm, can successfully be adapted to music identification. The paper focuses on extensive experimentation with the most influential system parameters, in order to find an optimal tradeoff between retrieval accuracy and query speed.
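The bag-of-words adaptation can be pictured as quantizing audio descriptors (e.g., chroma vectors) into discrete "words" and then ranking songs with ordinary text-retrieval scoring. A toy cosine-similarity sketch with hypothetical word IDs, unrelated to the released system's actual pipeline:

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two bags of quantized audio words."""
        dot = sum(a[t] * b.get(t, 0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    index = {"song_a": Counter([3, 3, 7, 9]),     # songs as word histograms
             "song_b": Counter([1, 2, 2, 8])}
    query = Counter([3, 7, 9, 9])                 # a cover's word histogram
    print(max(index, key=lambda s: cosine(query, index[s])))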
|
|
|
Interactive visual object search through mutual information maximization |
| |
Jingjing Meng,
Junsong Yuan,
Yuning Jiang,
Nitya Narasimhan,
Venu Vasudevan,
Ying Wu
|
|
Pages: 1147-1150 |
|
doi>10.1145/1873951.1874172 |
|
Full text: PDF
|
|
Searching for small objects (e.g., logos) in images is a critical yet challenging problem. It becomes more difficult when target objects differ significantly from the query object due to changes in scale, viewpoint or style, not to mention partial occlusion or cluttered backgrounds. With the goal of retrieving and accurately locating the small object in images, we formulate object search as the problem of finding subimages with the largest mutual information toward the query object. Each image is characterized by a collection of local features. Instead of only using the query object for matching, we propose discriminative matching using both positive and negative queries to obtain the mutual information score. The user can verify the retrieved subimages and improve the search results incrementally. Our experiments on a challenging logo database of 10,000 images highlight the effectiveness of this approach.
|
|
|
Nearest-neighbor classification using unlabeled data for real world image application |
| |
Shuhui Wang,
Qingming Huang,
Shuqiang Jiang,
Qi Tian
|
|
Pages: 1151-1154 |
|
doi>10.1145/1873951.1874173 |
|
Full text: PDF
|
|
Currently, Nearest-Neighbor (NN) approaches are widely applied to real-world image data mining. These approaches have three disadvantages: (i) performance is inferior on small datasets; (ii) the performance of approximate nearest neighbor search degrades for high-dimensional data; (iii) they are heavily dependent on the chosen feature and distance measure. To overcome these intrinsic weaknesses, we propose a novel Nearest-Neighbor method, which improves the original NN approaches in three aspects. Firstly, we propose a novel neighborhood similarity measure, where the similarity between test images and labeled images in the database is calculated jointly from the original image-to-image similarity and the average similarity of their neighboring unlabeled data. Secondly, we adopt kernelized locality sensitive hashing to effectively conduct the nearest neighbor search for high-dimensional data. Finally, to enhance the robustness of the method on different genres of images, we propose to fuse the discriminative power of different features by considering all the retrieved nearest neighbors from hashing systems using different features/kernels. Experimental results show the advantage over traditional Nearest-Neighbor methods that use the labeled data only. Even when the ratio of labeled data is very small, our method achieves remarkable results, thanks to the help of unlabeled data and multiple features.
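One illustrative form of such a neighborhood similarity blends the direct kernel value with how similar both images are to shared unlabeled neighbors; the blending rule, alpha, and k below are guesses for illustration, not the paper's exact definition:

    import numpy as np

    def nn_similarity(K, i, j, unlabeled, alpha=0.6, k=2):
        """Similarity between test image i and labeled image j, blended
        with the average similarity of their top-k unlabeled neighbors."""
        nbrs = sorted(unlabeled, key=lambda u: -(K[i, u] + K[j, u]))[:k]
        context = float(np.mean([(K[i, u] + K[j, u]) / 2 for u in nbrs]))
        return alpha * K[i, j] + (1 - alpha) * context

    K = np.array([[1.0, 0.3, 0.9, 0.8],      # precomputed kernel matrix;
                  [0.3, 1.0, 0.4, 0.2],      # items 2 and 3 are unlabeled
                  [0.9, 0.4, 1.0, 0.7],
                  [0.8, 0.2, 0.7, 1.0]])
    print(nn_similarity(K, i=0, j=1, unlabeled=[2, 3]))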
|
|
|
Behavior and properties of spatio-temporal local features under visual transformations |
| |
Julian Stöttinger,
Bogdan Tudor Goras,
Nicu Sebe,
Allan Hanbury
|
|
Pages: 1155-1158 |
|
doi>10.1145/1873951.1874174 |
|
Full text: PDF
|
|
Successful state-of-the-art video retrieval and classification applications are predominantly carried out by means of spatio-temporal features. Typically, the evaluation of these tasks is done exclusively on the basis of final performance, and no systematic analysis of feature robustness, invariance and stability has yet been done for large-scale video retrieval. In this work, we analyze the impact of visual transformations on spatio-temporal features in large-scale experiments. Following the recipe of recent state-of-the-art evaluations, we choose the best performing approaches, namely the spatio-temporal Harris3D, Hessian3D, and Cuboid detectors and the HOG/HOF, SURF3D, and HOG3D descriptors. We show that these features have different properties and behave differently under varying transformations (challenges). This helps researchers justify the choice of features for new applications and helps optimize the choice of input video in terms of resolution, compression, frames per second or noise suppression. We make the extracted features accessible online for further independent evaluation and applications.
|
|
|
Boosting-based multiple kernel learning for image re-ranking |
| |
I-Hong Jhuo,
D. T. Lee
|
|
Pages: 1159-1162 |
|
doi>10.1145/1873951.1874175 |
|
Full text: PDF
|
|
Re-ranking the images returned for a query relies on two important steps to improve its effectiveness: the estimation of image relevance and the enhancement of the similarity function. However, attaining an effective visual similarity and an accurate re-ranking is quite challenging. We address these issues by first evaluating the relevance of the images in the dataset to the query according to visual features and the co-occurrence of local patches of images. Then we boost the visual similarity measure associated with image relevance and propose an enhancement algorithm, called Boosting-MKL, which not only incrementally learns the feature fusion but also generally preserves the initial local ranking. Specifically, we perform a random walk over a similarity graph for re-ranking. The experimental results demonstrate that our proposed approach significantly improves the effectiveness of the visual similarity measure and the performance of image re-ranking.
|
|
|
SESSION: Short - S7/content/systems track |
| |
Pascal Frossard
|
|
|
|
|
Multi-layer stereo video matting |
| |
M. Jiang,
Danny Crookes,
Min Chen
|
|
Pages: 1163-1166 |
|
doi>10.1145/1873951.1874177 |
|
Full text: PDF
|
|
In this paper, an unsupervised scheme for stereo video matting is presented, in which stereo and motion analysis are combined to provide an automatic multi-layer clustering scheme for alpha components. With this multi-layer matting scheme, objects in both the foreground and background can be extracted for background substitution. Experiments show that the proposed scheme performs better in terms of automatic grouping of alpha components than min-cut matte grouping.
|
|
|
GPU acceleration of Eff2 descriptors using CUDA |
| |
Kristleifur Daðason,
Herwig Lejsek,
Ársæll Þ. Jóhansson,
Björn Þór Jónsson,
Laurent Amsaleg
|
|
Pages: 1167-1170 |
|
doi>10.1145/1873951.1874178 |
|
Full text: PDF
|
|
Video analysis using local descriptors requires a high-throughput descriptor creation process. This speed can be obtained from modern GPUs. In this paper, we adapt the computation of the Eff2 descriptors, a SIFT variant, to the GPU. We compare our GPU-Eff descriptors to SiftGPU and show that while both variants yield similar results, the GPU-Eff descriptors require significantly less processing time.
|
|
|
Dynamic multi-cue tracking with detection responses association |
| |
Guochen Jia,
Yonghong Tian,
Yaowei Wang,
Tiejun Huang,
Min Wang
|
|
Pages: 1171-1174 |
|
doi>10.1145/1873951.1874179 |
|
Full text: PDF
|
|
Multi-cue integration has proved successful at increasing the robustness of tracking algorithms and overcoming the failure cases of individual cues. However, given the dynamic appearance of objects and cluttered backgrounds, integration based on constant weights may weaken the performance of this scheme. In this paper, we propose a dynamic weight update mechanism for multiple-cue tracking with detection responses as supervision. We integrate multiple cues based on the observation hypotheses compared with detection association results and adjust the weights according to the degree of approximation. The integration is adapted on the fly during tracking, in order to keep the tracker adaptive. The proposed method allows flexible combination of different cues, and we select cues based on color and local features for tracking. Experiments are carried out on 602 trajectories extracted from the TRECVID 2008 event detection dataset, which was recorded in an airport scenario. Comparison results demonstrate the effectiveness of our method.
|
|
|
KPB-SIFT: a compact local feature descriptor |
| |
Gangqiang Zhao,
Ling Chen,
Gencai Chen,
Junsong Yuan
|
|
Pages: 1175-1178 |
|
doi>10.1145/1873951.1874180 |
|
Full text: PDF
|
|
Invariant feature descriptors such as SIFT and GLOH have been demonstrated to be very robust for image matching and object recognition. However, such descriptors are typically of high dimensionality, e.g. 128 dimensions in the case of SIFT. This limits the performance of feature matching techniques in terms of speed and scalability. A new compact feature descriptor, called Kernel Projection Based SIFT (KPB-SIFT), is presented in this paper. Like SIFT, our descriptor encodes the salient aspects of image information in the feature point's neighborhood. However, instead of using SIFT's smoothed weighted histograms, we apply kernel projection techniques to orientation gradient patches. The resulting KPB-SIFT descriptor is more compact than the state-of-the-art, does not require the pre-training step needed by PCA-based descriptors, and shows superior advantages in terms of distinctiveness, invariance to scale, and tolerance to geometric distortions. We extensively evaluated the effectiveness of KPB-SIFT on datasets acquired under varying circumstances.
|
|
|
Fast feature selection and training for AdaBoost-based concept detection with large scale datasets |
| |
Shi Chen,
Jinqiao Wang,
Yang Liu,
Changsheng Xu,
Hanqing Lu
|
|
Pages: 1179-1182 |
|
doi>10.1145/1873951.1874181 |
|
Full text: PDF
|
|
AdaBoost has proved to be a successful statistical learning method for concept detection, with strong discrimination and generalization performance. However, it is computationally expensive to train a concept detector using boosting, especially on large-scale datasets. The bottleneck of the training phase is selecting the best learner among massive numbers of learners. Traditional approaches for selecting a weak classifier usually run in O(NT), with N examples and T learners. In this paper, we treat best-learner selection as a Nearest Neighbor Search problem in function space instead of feature space. With the help of the Locality Sensitive Hashing (LSH) algorithm, the best-learner search can be sped up to O(NL), where L is the number of buckets in LSH. Compared with T (~500,000), L (~600) is much smaller in our experiments. In addition, by studying the distribution of weak learners and candidate query points, we present an efficient method that partitions the weak learner points and the feasible region of query points as uniformly as possible, which achieves significant improvement in both recall and precision compared with the random projection in the traditional LSH algorithm. Experimental results reveal that our method significantly reduces training time, while its performance remains comparable with state-of-the-art methods.
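The reduction from learner selection to nearest-neighbor search follows because, for labels and predictions in {-1, +1}, the weighted error of learner t is (sum(w) - <w * y, h_t>) / 2, so minimizing error means maximizing an inner product with the vector w * y. A brute-force sketch of that equivalence (the LSH bucketing that accelerates it is omitted):

    import numpy as np

    def best_learner(H, y, w):
        """H[t, i] = prediction (+/-1) of learner t on example i.
        The weighted-error minimizer is the learner whose prediction
        vector best aligns with w * y, a search in function space."""
        return int(np.argmax(H @ (w * y)))

    rng = np.random.default_rng(0)
    H = rng.choice([-1, 1], size=(500, 20))   # 500 weak learners, 20 examples
    y = rng.choice([-1, 1], size=20)          # labels
    w = np.full(20, 1 / 20)                   # AdaBoost example weights
    print(best_learner(H, y, w))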
|
|
|
Large-scale robust visual codebook construction |
| |
Darui Li,
Linjun Yang,
Xian-Sheng Hua,
Hong-Jiang Zhang
|
|
Pages: 1183-1186 |
|
doi>10.1145/1873951.1874182 |
|
Full text: PDF
|
|
Web-scale image retrieval systems demand a large-scale visual codebook, which is difficult to generate with the commonly adopted K-means vector quantization due to applicability issues. While approximate K-means has been proposed to scale up visual codebook construction, it needs to employ a high-precision approximate nearest neighbor search in the assignment step and is difficult to converge, which limits its scalability. In this paper, we propose an improved approximate K-means that leverages the assignment information from the history, namely the previous iterations, to improve assignment precision. By further randomizing the employed approximate nearest neighbor search in each iteration, the proposed algorithm improves assignment precision, conceptually similarly to randomized k-d trees, while introducing nearly no additional cost. The algorithm can be proved to be convergent, and we demonstrate experimentally and analytically that it improves the quality of the generated visual codebook as well as its scalability.
|
|
|
Image annotation using multi-correlation probabilistic matrix factorization |
| |
Zechao Li,
Jing Liu,
Xiaobin Zhu,
Tinglin Liu,
Hanqing Lu
|
|
Pages: 1187-1190 |
|
doi>10.1145/1873951.1874183 |
|
Full text: PDF
|
|
Image-word correlation estimation is an essential issue in image annotation. In this paper, we propose a multi-correlation probabilistic matrix factorization (MPMF) algorithm for correlation estimation. Different from traditional solutions, which treat the image-word correlation, image similarity and word relation independently or sequentially, in the proposed MPMF these three elements are integrated simultaneously and seamlessly. Specifically, we conduct a joint factorization of the word-to-image relation matrix, the image similarity matrix, and the word relation matrix to derive two low-dimensional sets of latent word factors and latent image factors. Finally, the annotation words of each untagged or noisily tagged image can be predicted by reconstructing the image-word correlations with both derived latent factors. Experimental results on the Corel dataset and a Flickr image dataset show the superior performance of our proposed algorithm over the state of the art.
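A rough gradient-descent sketch of such a joint factorization is given below; the shared-factor objective, learning rate, and the omission of the probabilistic priors are simplifications for illustration, not the authors' exact model:

    import numpy as np

    def mpmf(R, Sw, Si, k=2, lr=0.01, lam=0.1, iters=500, seed=0):
        """Jointly factorize the word-image matrix R, the word relation
        matrix Sw, and the image similarity matrix Si into latent word
        factors U and latent image factors V (regularization omitted)."""
        rng = np.random.default_rng(seed)
        U = 0.1 * rng.standard_normal((R.shape[0], k))
        V = 0.1 * rng.standard_normal((R.shape[1], k))
        for _ in range(iters):
            E, Ew, Ei = R - U @ V.T, Sw - U @ U.T, Si - V @ V.T
            U += lr * (E @ V + 2 * lam * Ew @ U)    # descend the joint loss
            V += lr * (E.T @ U + 2 * lam * Ei @ V)
        return U @ V.T        # reconstructed image-word correlations

    R = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], float)
    print(np.round(mpmf(R, Sw=R @ R.T / 3, Si=R.T @ R / 4), 2))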
|
|
|
Interactive panoramic video streaming system over restricted bandwidth network |
| |
Masayuki Inoue,
Hideaki Kimata,
Katsuhiko Fukazawa,
Norihiko Matsuura
|
|
Pages: 1191-1194 |
|
doi>10.1145/1873951.1874184 |
|
Full text: PDF
|
|
Many new applications are being created around panoramic video services. A typical system divides the high-resolution panoramic video into tiles, and the sender transmits a set of tiles, the partial panoramic video. Coding each tile at a uniform bitrate yields poor video quality because each tile has different visual characteristics. This paper proposes a new data format and tile-adaptive rate control to achieve high-quality partial panoramic video transmission, even over restricted-bandwidth networks. The proposed data format, based on the MVC standard, has two types of video stream and meta data. Each tile is encoded at multiple bitrates and multiplexed synchronously. The meta data holds quality values for each tile at the multiple bitrates, and is used to determine the view_ids associated with the bitrates and the user's desired view. Our tile-adaptive rate control maximizes partial panoramic video quality even in restricted-bandwidth networks. An experiment shows that the proposed method achieves higher video quality. The method ensures that the facial elements in the user's view, which often exhibit the greatest motion and to which we are most sensitive, have high quality.
|
|
|
Understanding the security and robustness of SIFT |
| |
Thanh-Toan Do,
Ewa Kijak,
Teddy Furon,
Laurent Amsaleg
|
|
Pages: 1195-1198 |
|
doi>10.1145/1873951.1874185 |
|
Full text: PDF
|
|
Many content-based retrieval systems (CBIRS) describe images using SIFT local features because of their very robust recognition capabilities. While SIFT features have proved to cope with a wide spectrum of general-purpose image distortions, their security has not yet been fully assessed. In one of their scenarios, Hsu et al. [2] show that very specific anti-SIFT attacks can jeopardize keypoint detection. These attacks can delude systems using SIFT for applications such as image authentication and (pirated) copy detection. Having some expertise in CBIRS, we were extremely concerned by their analysis. This paper presents our own investigation of the impact of these anti-SIFT attacks on a real CBIRS indexing a large collection of images. The attacks are in fact not able to break the system, and a detailed analysis explains this assessment.
|
|
|
Implementation and demonstration of a credit-based home access point |
| |
Choong-Soo Lee,
Mark Claypool,
Robert Kinicki
|
|
Pages: 1199-1202 |
|
doi>10.1145/1873951.1874186 |
|
Full text: PDF
|
|
The increasing availability of high-speed Internet access and the decreasing cost of wireless technologies have increased the number of devices in the home that connect wirelessly to the Internet. While home user applications often have different network requirements, the wireless access point (AP) typically gives them all the same treatment. It has been shown that applications that are sensitive to delay, such as VoIP, remote login and online games, suffer degraded performance when running concurrently with applications that expand to fill the available capacity, such as file sharing and downloading. Unfortunately, there are few mechanisms available at the AP to mitigate these effects, other than for users to explicitly classify traffic based on port numbers or host IP addresses. This work proposes a novel home access point called CHAP that features credit-based queue management designed to eliminate the need for explicit classification and configuration of per-application quality. CHAP is implemented as a Linux queuing discipline and compared with a traditional AP for Web browsing and online game activities. The comparisons demonstrate the merits of our approach.
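The credit mechanism can be pictured with a toy scheduler: every flow earns credit at an equal rate and pays per byte sent, so a low-rate VoIP flow keeps a credit surplus over a bulk download and wins the dequeue without any port- or host-based classification. This is only a sketch of the general idea, not CHAP's actual queuing discipline:

    from collections import deque

    class CreditQueue:
        """Toy credit-based AP scheduler."""
        def __init__(self):
            self.credit, self.queue = {}, {}

        def enqueue(self, flow, packet):
            self.queue.setdefault(flow, deque()).append(packet)
            self.credit.setdefault(flow, 0.0)

        def tick(self, rate=1000.0):
            for f in self.credit:                  # flows earn credit equally
                self.credit[f] += rate / len(self.credit)

        def dequeue(self):
            backlogged = [f for f, q in self.queue.items() if q]
            if not backlogged:
                return None
            f = max(backlogged, key=lambda g: self.credit[g])
            pkt = self.queue[f].popleft()
            self.credit[f] -= len(pkt)             # pay one credit per byte
            return f, pkt

    q = CreditQueue()
    q.enqueue("bulk", b"x" * 1500)
    q.tick(); q.dequeue()                          # bulk transmits, pays 1500
    q.enqueue("voip", b"x" * 100)
    q.enqueue("bulk", b"x" * 1500)
    q.tick(); print(q.dequeue()[0])                # -> voip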
|
|
|
Spatially refined inter-sequence error concealment for a multi-broadcast receiver using frequency selective approximation |
| |
Tobias Tröger,
Jürgen Seiler,
André Kaup
|
|
Pages: 1203-1206 |
|
doi>10.1145/1873951.1874187 |
|
Full text: PDF
|
|
Mobile reception of digital TV often suffers from severe signal degradation. Inter-sequence error concealment reconstructs lost image blocks of a distorted high-resolution TV signal by inserting corresponding error-free blocks from a low-resolution reference TV signal. It is well suited to future automotive multi-broadcast receivers and can outperform state-of-the-art methods by up to 15 dB PSNRY, depending on the quality of the reference signal. In this contribution, we show that inter-sequence error concealment can be improved by approximating inserted blocks jointly with neighboring pixels according to a well-known frequency selective method. The reconstruction quality can be increased significantly, especially for low-bitrate reference signals, and inserted blocks can be further refined for high bitrates as well. On average, a gain of 1.7 dB PSNRY can be achieved; the peak gain, evaluated on a per-frame basis, reaches 5.6 dB PSNRY.
|
|
|
Performance improvement of distributed video coding by using block mode selection |
| |
Bo-Ruei Chiou,
Yun-Chung Shen,
Han-Ping Cheng,
Ja-Ling Wu
|
|
Pages: 1207-1210 |
|
doi>10.1145/1873951.1874188 |
|
Full text: PDF
|
|
Block mode selection is a new way to improve the performance of distributed video coding (DVC). Since many factors influence the correctness of block mode selection, deciding the block mode is not an easy task. In this paper, a low-complexity block mode selection model is proposed to select block modes more correctly in the WZ (Wyner-Ziv) frame. The proposed block mode selection module can increase RD performance by up to 2.8 dB compared with traditional DVC codecs; moreover, subjective quality can also be improved by using the proposed deblocking filter while objective quality is kept the same.
|
|
|
Fast decoding for LDPC based distributed video coding |
| |
Yu-Shan Pai,
Han-Ping Cheng,
Yun-Chung Shen,
Ja-Ling Wu
|
|
Pages: 1211-1214 |
|
doi>10.1145/1873951.1874189 |
|
Full text: PDF
|
|
Distributed video coding (DVC) is a new coding paradigm targeting applications that need low-complexity encoding at the cost of higher decoding complexity. In a DVC architecture based on a feedback channel, the high decoding complexity is mainly due to the request-decode operation with a repetitively fixed step size (induced by Slepian-Wolf decoding). In this paper, a parallel message-passing decoding algorithm for low density parity check (LDPC) syndromes is implemented through the Compute Unified Device Architecture (CUDA) on a general-purpose graphics processing unit (GPGPU). Furthermore, we propose an approach to reduce the number of requests, dubbed Ladder Request Step Size (LRSS), which leads to further speedup. Experimental results show that our work achieves a significant speedup in decoding time with negligible loss in rate-distortion (RD) performance.
|
|
|
Shape-stable region boundary extraction via affine morphological scale space (AMSS) |
| |
Petros Kapsalas,
Stefanos Kollias
|
|
Pages: 1215-1218 |
|
doi>10.1145/1873951.1874190 |
|
Full text: PDF
|
|
In this paper we present a new approach to the extraction of affine image regions based on detecting shape-stable boundaries in a multi-scale image representation. We construct an affine morphological scale space (AMSS) representation [1], which performs anisotropic diffusion while preserving boundaries and remaining invariant to affine transformations. We extract the transition boundaries of the diffusivity velocity map and track their evolution at each level of the scale space. We then determine the stability of the boundary shape through a minimization process over different scales. Unlike most state-of-the-art detectors, which use the Gaussian scale space for multi-scale image representation, our approach is intrinsically affine invariant. We evaluate our detector by measuring the repeatability of regions in transformed images of the same scene and comparing it to state-of-the-art region detectors [2].
|
|
|
Robust digital watermarking in videos based on geometric transformations |
| |
Philipp Schaber,
Stephan Kopf,
Fabian Bauer,
Wolfgang Effelsberg
|
|
Pages: 1219-1222 |
|
doi>10.1145/1873951.1874191 |
|
Full text: PDF
|
|
In the effort to fight piracy of high-value media content, forensic digital watermarking as a passive content security scheme is a potential alternative to current, restrictive approaches like DRM. In this paper, we present a novel watermarking scheme for videos based on affine geometric transformations. Frames can be modified in an imperceptible manner by applying a small, global rotation, translation, or zoom, which can be detected later by comparison with the originals. To compensate for geometric distortions introduced while a video travels down legal as well as illegal distribution chains, a spatio-temporal synchronization is performed using our video registration toolkit application. To evaluate our approach, we compare it with several other schemes regarding robustness against common attacks, including camcorder capture.
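For intuition, the sketch below applies the kind of tiny global rotation such a scheme could embed; inverse mapping with nearest-neighbor resampling keeps it dependency-free, and the angle and frame are placeholders (a real detector would compare the result against the original frame):

    import math

    def rotate_frame(frame, degrees):
        """Rotate a 2-D frame by a small angle about its center."""
        h, w = len(frame), len(frame[0])
        cy, cx = (h - 1) / 2, (w - 1) / 2
        t = math.radians(degrees)
        out = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                # inverse-map each output pixel to its source position
                sx = cx + (x - cx) * math.cos(t) + (y - cy) * math.sin(t)
                sy = cy - (x - cx) * math.sin(t) + (y - cy) * math.cos(t)
                si, sj = round(sy), round(sx)
                if 0 <= si < h and 0 <= sj < w:
                    out[y][x] = frame[si][sj]
        return out

    frame = [[y * 8 + x for x in range(8)] for y in range(8)]
    print(rotate_frame(frame, 0.5)[4])    # visually near-identical row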
|
|
|
Joint layered video and digital fountain coding for multi-channel video broadcasting |
| |
Wen Ji,
Zhu Li
|
|
Pages: 1223-1226 |
|
doi>10.1145/1873951.1874192 |
|
Full text: PDF
|
|
In this paper, we consider a scenario where multiple video content channels are broadcast to a set of heterogeneous mobile users with diverse display devices and different channel conditions. The objective is to design a joint coding and rate allocation algorithm that achieves the maximum overall receiving quality of the heterogeneous users, measured by broadcasting utility. We use hierarchical optimization to solve this problem and decompose it into a two-tier solution: the inner loop targets single-content broadcasting, solved with a joint coding algorithm, while the outer loop focuses on multiple-content broadcasting, using a dynamic programming approach to find the optimal rate allocation policy. Numerical experiments demonstrate the effectiveness of the solution.
|
|
|
A novel P2P and cloud computing hybrid architecture for multimedia streaming with QoS cost functions |
| |
Irena Trajkovska,
Joaquin Salvachua Rodriguez,
Alberto Mozo Velasco
|
|
Pages: 1227-1230 |
|
doi>10.1145/1873951.1874193 |
|
Full text: PDF
|
|
Since its appearance, peer-to-peer technology has given rise to various multimedia streaming applications. Today, cloud computing offers different service models as a base for successful end-user applications. In this paper we propose joining peer-to-peer and cloud computing into a new architectural realization of a distributed cloud computing network for multimedia streaming, in both a centralized and a peer-to-peer distributed manner. This architecture merges private and public clouds; it is intended for commercial use, but at the same time is scalable enough to offer the possibility of non-profit use. In order to take advantage of the cloud paradigm and make multimedia streaming more efficient, we introduce APIs in the cloud containing built-in functions for automatic QoS calculation, which permit negotiating QoS parameters such as bandwidth, jitter and latency between a cloud service provider and its potential clients.
|
|
|
Hybrid load balancing for online games |
| |
Rynson W.H. Lau
|
|
Pages: 1231-1234 |
|
doi>10.1145/1873951.1874194 |
|
Full text: PDF
|
|
As massively multiplayer online games become very popular, how to support a large number of concurrent users while maintaining game performance has become an important research topic. There are two main research directions based on the multi-server architecture: global load balancing, which is optimal but computationally expensive, and local load balancing, which is not optimal but efficient. In this paper, we propose a hybrid load balancing approach to support massively multiplayer online gaming. Our idea is to augment a local load balancing algorithm with some global load information, which may be obtained less frequently. We propose two methods to implement the hybrid approach. Our results show that the proposed methods reduce the frequency of server overloading and improve overall game performance significantly.
|
|
|
SESSION: Brave new ideas - BNI1 track |
| |
Nozha Boujemaa
|
|
|
|
|
The wisdom of social multimedia: using flickr for prediction and forecast |
| |
Xin Jin,
Andrew Gallagher,
Liangliang Cao,
Jiebo Luo,
Jiawei Han
|
|
Pages: 1235-1244 |
|
doi>10.1145/1873951.1874196 |
|
Full text: PDF
|
|
Social multimedia hosting and sharing websites, such as Flickr, Facebook, Youtube, Picasa, ImageShack and Photobucket, are increasingly popular around the globe. A major trend in current studies on social multimedia is using social media sites as a source of huge amounts of labeled data for solving large-scale computer science problems in computer vision, data mining and multimedia. In this paper, we take a new path and explore the global trends and sentiments that can be drawn by analyzing the sharing patterns of uploaded and downloaded social multimedia. In a sense, each time an image or video is uploaded or viewed, it constitutes an implicit vote for (or against) the subject of the image. This vote carries with it a rich set of associated data including time and (often) location information. By aggregating such votes across millions of Internet users, we reveal the wisdom that is embedded in social multimedia sites for social science applications such as politics, economics, and marketing. We believe that our work opens a brand new arena for the multimedia research community, with a potentially big impact on society and the social sciences.
|
|
|
Multimodal location estimation |
| |
Gerald Friedland,
Oriol Vinyals,
Trevor Darrell
|
|
Pages: 1245-1252 |
|
doi>10.1145/1873951.1874197 |
|
Full text: PDF
|
|
In this article we define a multimedia content analysis problem, which we call multimodal location estimation: Given a video/image/audio file, the task is to determine where it was recorded. A single indication, such as a unique landmark, might already pinpoint a location precisely. In most cases, however, a combination of evidence from the visual and the acoustic domain will only narrow down the set of possible answers. Therefore, approaches to tackle this task should be inherently multimedia. While the task is hard, in fact sometimes unsolvable, training data can be leveraged from the Internet in large amounts. Moreover, even partially successful automatic estimation of location opens up new possibilities in video content matching, archiving, and organization. It could revolutionize law enforcement and computer-aided intelligence agency work, especially since both semi-automatic and fully automatic approaches would be possible. In this article, we describe our idea of growing multimodal location estimation as a research field in the multimedia community. Based on examples and scenarios, we propose a multimedia approach to leverage cues from the visual and the acoustic portions of a video as well as from given metadata. We also describe experiments to estimate the amount of available training data that could potentially be used as publicly available infrastructure for research in this field. Finally, we present an initial set of results based on acoustic and visual cues and discuss the massive challenges involved and some possible paths to solutions.
|
|
|
Video genetics: a case study from YouTube |
| |
John R. Kender,
Matthew L. Hill,
Apostol (Paul) Natsev,
John R. Smith,
Lexing Xie
|
|
Pages: 1253-1258 |
|
doi>10.1145/1873951.1874198 |
|
Full text: PDF
|
|
We explore in a single but large case study how videos within YouTube, competing for view counts, are like organisms within an ecology, competing for survival. We develop this analogy, whose core idea shows that short video clips, best detected across ...
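Since the Smith-Waterman step may be unfamiliar to multimedia readers, here is a minimal local-alignment sketch over sequences of near-dup cluster IDs; the scoring constants are arbitrary choices for illustration, not the paper's settings:

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
        """Local alignment score of two keyframe 'gene' sequences; a high
        score indicates one video copies a run of clips from the other."""
        H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        best = 0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                H[i][j] = max(0, H[i - 1][j - 1] + s,
                              H[i - 1][j] + gap, H[i][j - 1] + gap)
                best = max(best, H[i][j])
        return best

    # the shared run 7, 8, 9 of near-dup IDs scores highly
    print(smith_waterman([1, 7, 8, 9, 4], [5, 7, 8, 9, 6]))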
We explore in a single but large case study how videos within YouTube, competing for view counts, are like organisms within an ecology, competing for survival. We develop this analogy, whose core idea shows that short video clips, best detected across videos as near-duplicate keyframes, behave similarly to genes. We report work in progress, on a dataset of 5.4K videos with 210K keyframes on a single topic, which traces sequences, not bags, of "near-dups" over time, both within videos and across them. We demonstrate their utility to: cleanse responses to queries contaminated by over-eager YouTube query expansion; separate videos temporally according to their responses to external events; track the evolution and lifespan of continuing video "stories"; automatically locate video summaries already present within a video ecology; quickly verify video copying via a direct application of the Smith-Waterman algorithm used in genetics - which also provides useful feedback for tuning the near-dup detection and clustering process; and quickly classify videos via a kind of Lempel-Ziv encoding into the categories of news, monologue, dialogue, and slideshow. We demonstrate a number of novel visualizations of this large dataset, including a direct use of the Matlab black-body "hot" false-color map, together with the GraphViz package, to display the gene-like inheritance of viral properties of keyframes. We further speculate that, as with genes, there are "functional roles" for semantic categories of clips, and, as with species, there are differing rates of "genetic drift" for each video genre. expand
|
|
|
Content without context is meaningless |
| |
Ramesh Jain,
Pinaki Sinha
|
|
Pages: 1259-1268 |
|
doi>10.1145/1873951.1874199 |
|
Full text: PDF
|
|
We revisit one of the most fundamental problems in multimedia, one that is receiving enormous attention from researchers without much progress being made toward solving it: the problem of bridging the semantic gap. Research in this area has focused on developing increasingly rigorous techniques using the content. Researchers consider that Content is King and ignore everything else. In this paper, we first discuss how this infatuation with content continues to be the biggest hurdle to the success of, ironically, content-based approaches to multimedia search. Lately, many commercial systems have ignored content in favor of context and demonstrated better success. Given that mobile phones are the major platform for the next generation of computing, context becomes easily available and more relevant. We show that it is not Content versus Context; rather, it is Content and Context together that are required to bridge the semantic gap. We will first discuss the reasons for our approach and then present approaches that appropriately combine context with content to help bridge the semantic gap and solve important problems in multimedia computing.
|
|
|
SESSION: Brave new ideas - BNI2 track |
| |
Alejandro Jaimes
|
|
|
|
|
Human animal machine interaction: animal behavior awareness and digital experience |
| |
Karin Fahlquist,
Johannes Karlsson,
Haibo Li,
Li Liu,
Keni Ren,
Shafiq ur Réhman,
Tim Wark
|
|
Pages: 1269-1274 |
|
doi>10.1145/1873951.1874201 |
|
Full text: PDF
|
|
This paper proposes an intuitive wireless sensor/actuator based communication network for human-animal interaction in a digital zoo. In order to enable effective observation and control of wildlife, we have built a wireless sensor network: 25 video transmitting nodes are installed for animal behavior observation, and experimental vibrotactile collars have been designed for effective control in an animal park. The goal of our research is twofold: first, to provide interaction between digital users and animals and to monitor animal behavior for safety purposes; second, to investigate how animals can be controlled or trained using vibrotactile stimuli instead of electric stimuli. We have designed a multimedia sensor network for human-animal-machine interaction and have evaluated the effect of the human-animal-machine state communication model in field experiments.
|
|
|
Enriching social situational awareness in remote interactions: insights and inspirations from disability focused research |
| |
Sreekar Krishna,
Vineeth Balasubramanian,
Sethuraman Panchanathan
|
|
Pages: 1275-1284 |
|
doi>10.1145/1873951.1874202 |
|
Full text: PDF
|
|
In this paper we present a new perspective on developing technologies for enriching social presence among remote interaction partners. Inspired by the abilities and limitations faced by people who are disabled during their everyday social interactions, we propose novel portable and wearable technologies that could potentially enrich remote interactions even in audio- and video-deprived settings. We describe the important challenges faced by people who are disabled during everyday dyadic and group social interactions and correlate them to the challenges faced by participants in remote interactions. With a case study of visually impaired individuals, we demonstrate how assistive technologies developed for the social assistance of people who are disabled can help in increasing the social situational awareness, and hence social presence, of remote interaction partners.
|
|
|
Requirements and design space for interactive public displays |
| |
Jörg Müller,
Florian Alt,
Daniel Michelis,
Albrecht Schmidt
|
|
Pages: 1285-1294 |
|
doi>10.1145/1873951.1874203 |
|
Full text: PDF
|
|
Digital immersion is moving into public space. Interactive screens and public displays are deployed in urban environments, malls, and shop windows. Inner city areas, airports, train stations and stadiums are experiencing a transformation from traditional to digital displays, enabling new forms of multimedia presentation and new user experiences. Imagine a walkway with digital displays that allows a user to immerse herself in her favorite content while moving through public space. In this paper we discuss the fundamentals for creating exciting public displays and multimedia experiences that enable new forms of engagement with digital content. Interaction in public space and with public displays can be categorized into phases, each having specific requirements. Attracting, engaging and motivating the user are central design issues addressed in this paper. We provide a comprehensive analysis of the design space, explaining mental models and interaction modalities, and derive a taxonomy for interactive public displays from this analysis. Our analysis and the taxonomy are grounded in a large number of research projects, art installations and practical experience. With our contribution we aim to provide a comprehensive guide for designers and developers of interactive multimedia on public displays.
|
|
|
SESSION: Video - VID1 track |
| |
Shin'ichi Satoh,
Jenny Benois Pineau
|
|
|
|
|
Acqua vellutata sospesa: interactive video painting |
| |
Laurel L. Johannesson
|
|
Pages: 1295-1298 |
|
doi>10.1145/1873951.1874205 |
|
Full text: PDF
|
|
Other formats:
Mov
|
|
In this paper I present the interactive video painting artwork "Acqua Vellutata Sospesa". I will describe the viewer interface for the interactive component as well as the conceptual approach to the project. Additionally, a comprehensive survey of works related to fluidity, interactivity, installation, and video is included.
|
|
|
The IMMED project: wearable video monitoring of people with age dementia |
| |
Rémi Mégret,
Vladislavs Dovgalecs,
Hazem Wannous,
Svebor Karaman,
Jenny Benois-Pineau,
Elie El Khoury,
Julien Pinquier,
Philippe Joly,
Régine André-Obrecht,
Yann Gaëstel,
Jean-François Dartigues
|
|
Pages: 1299-1302 |
|
doi>10.1145/1873951.1874206 |
|
Full text: PDF
|
|
Other formats:
M4v
|
|
In this paper, we describe a new application for multimedia indexing: a system that monitors the instrumental activities of daily living to assess the cognitive decline caused by dementia. The system is composed of a wearable camera device designed to capture audio and video data of the instrumental activities of a patient, which is leveraged with multimedia indexing techniques in order to allow medical specialists to analyze several-hour-long observations efficiently.
|
|
|
A multimodal virtual environment for interacting with 3d deformable models |
| |
Ziying Tang,
Anant Patel,
Xiaohu Guo,
Balakrishnan Prabhakaran
|
|
Pages: 1303-1306 |
|
doi>10.1145/1873951.1874207 |
|
Full text: PDF
|
|
Other formats:
Mov
|
|
In this video presentation, we introduce an immersive multimodal virtual environment which supports real-time interaction with 3D deformable models through a haptic device. We use a system called "FakeSpace" to simulate the 3D environment, and a PHANTOM device to simulate touching forces. Movements of 3D deformable models are simulated based on a spectral method, and forces are simulated as spring forces. We update both visual and haptic feedback in real time, providing a more realistic user interaction. In addition, with the help of stereoscopic display, we can present an immersive 3D experience. This video illustrates the setup of our environment and demonstrates how users manipulate 3D models in real time in this immersive system through several interactive examples. Our system is reconfigurable and is useful for different applications in the fields of education, entertainment, medical simulation and so on.
|
|
|
Real-time detection of unusual regions in image streams |
| |
Rene Schuster,
Roland Mörzinger,
Werner Haas,
Helmut Grabner,
Luc Van Gool
|
|
Pages: 1307-1310 |
|
doi>10.1145/1873951.1874208 |
|
Full text: PDF
|
|
Automatic and real-time identification of unusual incidents is important for event detection and alarm systems. In today's camera surveillance solutions, video streams are displayed on-screen for human operators, e.g. in large multi-screen control centers, which in turn requires operators' attention to spot unusual events and respond urgently. This paper presents a method for the automatic identification of unusual visual content in video streams in real time. In contrast to explicitly modeling specific unusual events, the proposed approach incrementally learns the usual appearance of the visual source and simultaneously identifies potentially unusual image regions in the scene. Experiments demonstrate the general applicability on a variety of large-scale datasets including different scenes from public web cams and from traffic monitoring. To further demonstrate the real-time capabilities of the unusual scene detection, we actively control a Pan-Tilt-Zoom camera to get close-up views of the unusual incidents.
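As a greatly simplified illustration of incrementally learning the usual appearance, the sketch below keeps a per-block running mean of gray values and flags blocks that deviate strongly; block size, learning rate, threshold, and the synthetic frames are assumptions for the example, not the paper's actual model.

    import numpy as np

    class BlockAppearanceModel:
        """Per-block running mean; blocks far from the model are 'unusual'."""
        def __init__(self, block=16, alpha=0.05, thresh=40.0):
            self.block, self.alpha, self.thresh = block, alpha, thresh
            self.model = None

        def _blocks(self, frame):
            b = self.block
            h = frame.shape[0] - frame.shape[0] % b
            w = frame.shape[1] - frame.shape[1] % b
            return frame[:h, :w].reshape(h // b, b, w // b, b).mean(axis=(1, 3))

        def update(self, frame):
            """Return a boolean mask of unusual blocks, then learn the frame."""
            blocks = self._blocks(frame.astype(np.float64))
            if self.model is None:
                self.model = blocks.copy()
            unusual = np.abs(blocks - self.model) > self.thresh
            self.model = (1 - self.alpha) * self.model + self.alpha * blocks
            return unusual

    detector = BlockAppearanceModel()
    scene = np.full((128, 128), 100, dtype=np.uint8)   # static background
    for _ in range(50):
        detector.update(scene)                         # learn the 'usual' scene
    intruder = scene.copy()
    intruder[32:64, 32:64] = 250                       # bright new object
    print(detector.update(intruder).sum(), "unusual blocks flagged")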
|
|
|
Video exploration: from multimedia content analysis to interactive visualization |
| |
Marie-luce Viaud,
Olivier Buisson,
Agnes Saulnier,
Clement Guenais
|
|
Pages: 1311-1314 |
|
doi>10.1145/1873951.1874209 |
|
Full text: PDF
|
|
Other formats:
Mp4
|
|
This paper presents three interfaces for accessing video content. The stream explorer allows users to explore and segment video streams. The video explorer shows a synthetic view of structured TV programmes. The collection explorer proposes cartographies of large video collections. Based on automatic visual and textual processing, proximities and redundancies are analyzed, allowing different levels of structure to emerge. This is made possible by the volume of data considered: 7 channels over 100 days, i.e. 16,000 hours or 20 million key frames. These three tools allow efficient exploration of video content at different levels of interest: image, shot and sequence, programme, and collection level.
|
|
|
A 3d data intensive tele-immersive grid |
| |
Benjamin Petit,
Thomas Dupeux,
Benoit Bossavit,
Joeffrey Legaux,
Bruno Raffin,
Emmanuel Melin,
Jean-Sébastien Franco,
Ingo Assenmacher,
Edmond Boyer
|
|
Pages: 1315-1318 |
|
doi>10.1145/1873951.1874210 |
|
Full text: PDF
|
|
Other formats:
Mov
|
|
Networked virtual environments like Second Life enable distant people to meet for leisure as well as work. But users are represented through avatars controlled by keyboards and mice, leading to a low sense of presence, especially regarding body language. Multi-camera real-time 3D modeling offers a way to ensure a significantly higher sense of presence. But producing quality, well-textured geometries and enabling distant-user tele-presence in non-trivial virtual environments is still a challenge today. In this paper we present a tele-immersive system based on multi-camera 3D modeling. Users from distant sites are immersed in a rich virtual environment served by a parallel terrain rendering engine. Distant users, present through their 3D models, can perform local interactions while having a strong visual presence. We experimented with our system between three large cities a few hundred kilometers apart. This work demonstrates the feasibility of a rich 3D multimedia environment giving users a strong sense of presence.
|
|
|
Real-time soccer player tracking method by utilizing shadow regions |
| |
Nozomu Kasuya,
Itaru Kitahara,
Yoshinari Kameda,
Yuichi Ohta
|
|
Pages: 1319-1322 |
|
doi>10.1145/1873951.1874211 |
|
Full text: PDF
|
|
Other formats:
Mov
|
|
Our research aims to generate a player's-view video stream by using a 3D free-viewpoint video technique. Since player trajectories are necessary to generate the video, we propose a real-time player trajectory estimation method that utilizes the shadow regions in soccer scenes. This paper describes our effort to realize real-time processing. We divide the processing between capture and server computers, and we reduce the processing cost with pipeline parallelization and optimization. We apply the proposed method to an actual soccer match held in a stadium and show its effectiveness.
|
|
|
The mediamill search engine video |
| |
Cees G.M. Snoek
|
|
Pages: 1323-1324 |
|
doi>10.1145/1873951.1874212 |
|
Full text: PDF
|
|
Other formats:
Mov
|
|
In this video demonstration, we advertise the MediaMill video search engine, a system that facilitates semantic access to video based on a large lexicon of visual concept detectors and interactive video browsers. With an ultimate aim to disseminate video retrieval research to a non-technical audience, we explain the need for a visual video retrieval solution, summarize the MediaMill technology, and hint at future perspectives.
|
|
|
SESSION: Interactive art -- IA1/cultural heritage tools track |
| |
James Wang
|
|
|
|
|
Determining the sexual identities of prehistoric cave artists using digitized handprints: a machine learning approach |
| |
James Z. Wang,
Weina Ge,
Dean R. Snow,
Prasenjit Mitra,
C. Lee Giles
|
|
Pages: 1325-1332 |
|
doi>10.1145/1873951.1874214 |
|
Full text: PDF
|
|
The sexual identities of human handprints inform hypotheses regarding the roles of males and females in prehistoric contexts. Sexual identity has previously been determined manually by measuring the ratios of the lengths of the individual's fingers as well as by using other physical features. Most conventional studies measure the lengths manually and thus are often constrained by the lack of scaling information on published images. We have created a method that determines sex by applying modern machine-learning techniques to relative measures obtained from images of human hands. This is the first known attempt at substituting automated methods for time-consuming manual measurement in the study of sexual identities of prehistoric cave artists. Our study provides quantitative evidence relevant to sexual dimorphism and the sexual division of labor in Upper Paleolithic societies. In addition to analyzing historical handprint records, this method has potential applications in criminal forensics and human-computer interaction.
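For flavor, here is a minimal sketch of the general approach, a classifier over relative hand measurements, using scikit-learn on synthetic data; the feature (a 2D:4D-style finger-length ratio), the distributions, and the learner are all assumptions for illustration, not the paper's pipeline.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # Synthetic 2D:4D finger-length ratios; females tend slightly higher
    # than males on average (the values here are made up for the sketch).
    rng = np.random.default_rng(0)
    n = 200
    female = rng.normal(0.985, 0.02, (n, 1))
    male = rng.normal(0.955, 0.02, (n, 1))
    X = np.vstack([female, male])
    y = np.array([1] * n + [0] * n)    # 1 = female, 0 = male

    clf = SVC(kernel="rbf")
    print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())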
|
|
|
Enhanced exploration of oral history archives through processed video and synchronized text transcripts |
| |
Michael G. Christel,
Scott M. Stevens,
Bryan S. Maher,
Julieanna Richardson
|
|
Pages: 1333-1342 |
|
doi>10.1145/1873951.1874215 |
|
Full text: PDF
|
|
A digital video library of over 900 hours of video and 18000 stories from The HistoryMakers was used by 266 students, faculty, librarians, and life-long learners interacting with a system providing multiple search and viewing capabilities over a trial period of several months. User demographics and actions were logged with this multimedia collection, providing quantitative and qualitative metrics on system use. These transaction logs were complemented with heuristic evaluation, interviews, and contextual inquiry with representative users. Collectively, these mixed methods informed the development of the next generation web-based interface for the HistoryMakers video oral histories to improve access to and dissemination of this rich cultural resource. In particular, the feature of a synchronized text transcript in the video player for the narratives merited further investigation. Such an interface has not seen widespread use in digital video players available on the web, yet was valued highly by oral history archive viewers. A user study with 27 participants measured the utility of the HistoryMakers web interface incorporating the synchronized transcript video player for stated fact-finding and open-ended tasks. For life oral histories, an aligned text transcript is valued for both tasks, with the video rated significantly more useful for open-ended tasks over fact-finding. These results suggest a task-dependent role of modality in presentation of oral histories, with synchronized transcripts rated highly across tasks.
|
|
|
Surfing on artistic documents with visually assisted tagging |
| |
Daniele Borghesani,
Costantino Grana,
Rita Cucchiara
|
|
Pages: 1343-1352 |
|
doi>10.1145/1873951.1874216 |
|
Full text: PDF
|
|
This paper describes a complete architecture for the interactive exploration and annotation of artistic collections. In particular the focus is on Renaissance illuminated manuscripts, which typically contain thousands of pictures used to comment on or embellish the manuscript's Gothic text. The final aim is to create a human-centered multimedia application allowing non-practitioners to enjoy these masterpieces and expert users to share their knowledge. The system is composed of a modern user interface for browsing, surfing and querying, an automatic segmentation module to ease the initial picture extraction task, and a similarity-based retrieval engine used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results. Experiments are reported regarding the adopted visual features based on covariance matrices and the Mean Shift Feature Space Warping relevance feedback. Finally some hints on the user interface for museum installations are discussed.
|
|
|
SESSION: Interactive art -- IA2/art and multimedia track |
| |
Tiziana Catarci
|
|
|
|
|
Bateau ivre: an artistic markerless outdoor mobile augmented reality installation on a riverboat |
| |
Christian Jacquemin,
Wai Kit Chan,
Mathieu Courgeon
|
|
Pages: 1353-1362 |
|
doi>10.1145/1873951.1874218 |
|
Full text: PDF
|
|
Bateau Ivre is a project presented on the Seine River to make a large audience aware of the possible developments of Augmented Reality through an artistic installation in a mobile outdoor environment. The installation could be viewed from a ship by a large audience without specific equipment, through nightly video-projection on the river banks. The augmentation of the physical world was implemented through real-time image processing for live special effects such as contouring, particles, or non-realistic rendering. The artistic purpose of the project was to immerse the audience in a non-realistic view of the river banks that would differ from the traditional tourist tours highlighting the main landmarks of Paris's classical architecture. The implemented software applied standard algorithms for special effects to a live video stream and reprojected these effects onto the captured scenes to combine the physical world with its modified image. An analysis of the project output reveals that the impact of the effects in mobile SAR varies considerably and does not correspond to the visual impact on a standard desktop screen.
|
|
|
Sonify your face: facial expressions for sound generation |
| |
Roberto Valenti,
Alejandro Jaimes,
Nicu Sebe
|
|
Pages: 1363-1372 |
|
doi>10.1145/1873951.1874219 |
|
Full text: PDF
|
|
We present a novel visual creativity tool that automatically recognizes facial expressions and tracks facial muscle movements in real time to produce sounds. The facial expression recognition module detects and tracks a face and outputs a feature vector of motions of specific locations in the face. The feature vector is used as input to a Bayesian network which classifies facial expressions into several categories (e.g., angry, disgusted, happy, etc.). The classification results are used along with the feature vector to generate a combination of sounds that change in real time depending on the person's facial expressions. We explain the artistic motivation behind the work, the basic components of our tool, and possible applications in the arts (performance, installation) and in the medical domain. Finally, we report on the experience of approximately 25 users of our system at a conference demonstration session, of 9 participants in a pilot study to assess the system's usability, and discuss our experience installing the work at an important digital arts festival (RE-NEW 2009).
|
|
|
Flow: an interactive public artwork |
| |
Fiona Bowie,
Sidney Fels,
Morgan Hibbert
|
|
Pages: 1373-1382 |
|
doi>10.1145/1873951.1874220 |
|
Full text: PDF
|
|
This paper describes the conceptual, aesthetic, hardware, and software design of Flow, a photo/media-based permanent public interactive artwork in Vancouver, Canada. The work is located at street level in a new local community centre at one of the city's oldest intersections. In addition to the community centre location, it has a related interactive web component. It involves the animation and projection of continually recombining photographic images onto a large, interactive 4x4 array of electronically controlled switch glass windows. Over the course of the day and night, these photographic tableaux appear on the glass in combinations that depend upon image-to-image relationships, time of day, season and weather. In addition, lighting elements including water effect gobos are integrated at selected times of day. Images disappear when viewers inside the building come within close proximity to the work: the interactive windows respond to movement by changing from translucent to clear. In the daytime, when the projected image is off at the site, the work continues on the project's website, offering the visitor an interaction with the work. The work aims to provide an experience of the flux of people, animals, landscape and urban environment over time. It addresses the way landscape has transformed in response to colonialism, capital and local pressures, where change is rapid and histories are lost and rewritten.
|
|
|
Ozone: continuous state-based media choreography system for live performance |
| |
Xin Wei Sha,
Michael Fortin,
Navid Navab,
Timothy Sutton
|
|
Pages: 1383-1392 |
|
doi>10.1145/1873951.1874221 |
|
Full text: PDF
|
|
This paper describes Ozone, a new media choreography system based on layered, continuous physical models, designed for building a diverse range of interactive spaces that coordinate arbitrary streams of video and audio synthesized in real-time response to continuous, concurrent activity by people in a live event. We aim to build rich responsive spaces that sustain the free improvisation of collectively or individually meaningful non-linguistic gesture. Ozone provides an expressive way to compose the potential "landscape" of an event evolving according to the designer's intent as well as contingent activity. A potential-energy engine evolves superposed states over simplicial complexes modeling the topological space of metaphorical states.
|
|
|
SESSION: Interactive art exhibit track |
| |
Luca Farulli,
Frank Nack,
Andruid Kerne
|
|
|
|
|
Tempo universale |
| |
Giovanna Bianco,
Pino Valente
|
|
Pages: 1395-1396 |
|
doi>10.1145/1873951.1874223 |
|
Full text: PDF
|
|
This paper describes the artwork 'Tempo Universale', a video installation composed of three projections and a 10-channel audio track, performed as an endless loop. Tempo Universale is presented at the ACM MM 2010 art exhibition.
|
|
|
Liquid views: memory stage activating perception |
| |
Monika Fleischmann,
Wolfgang Strauss
|
|
Pages: 1397-1398 |
|
doi>10.1145/1873951.1874224 |
|
Full text: PDF
|
|
The central theme of the interactive media art installation Liquid Views is the well in which Narcissus discovers his reflection. The work, from 1992-93, was first exhibited at Siggraph 1993 in Simon Penny's Machine Culture show and has since been exhibited in over 50 cities with different cultural backgrounds worldwide. Nearly 20 years later the work is on exhibition again to study the changed conditions of human media communication.
|
|
|
Exploring touch and breath in networked wearable installation design |
| |
Thecla Schiphorst,
Jinsil Seo,
Norm Jaffe
|
|
Pages: 1399-1400 |
|
doi>10.1145/1873951.1874225 |
|
Full text: PDF
|
|
This paper describes the artistic design concepts for the interactive wearable artworks tendrils and exhale, exhibited at the ACM Multimedia 2010 Interactive Art Program in Firenze, Italy, at the Palazzo Medici-Riccardi from 25 October through 6 November, 2010. These wearable artworks are based in artistic exploration influenced by the somatic turn: an approach to designing for experience using body-based practices that highlight the concept of somaesthetics as an approach to the design of expressive interaction. The interactive wearable art installations tendrils and exhale emphasize the experience of self-observation, poetics, materiality, and computational semantics that support sensory input such as touch and breath. In the context of interaction, somaesthetics offers a bridging strategy between embodied practices based in somatics and the design for an aesthetics of interaction in wearable technology. These artworks illustrate the value of exploring artistic design strategies that employ a somaesthetic approach, and exemplify this approach in the design of networked, wearable interactive art.
|
|
|
Living wall: programmable wallpaper for interactive spaces |
| |
Leah Buechley,
David Mellis,
Hannah Perner-Wilson,
Emily Lovell,
Bonifaz Kaufmann
|
|
Pages: 1401-1402 |
|
doi>10.1145/1873951.1874226 |
|
Full text: PDF
|
|
The Living Wall project explores the construction and application of interactive wallpaper. Using conductive, resistive, and magnetic paints we produced wallpaper that enables us to create dynamic, reconfigurable, programmable spaces. The wallpaper consists of circuitry that is painted onto a sheet of paper and a set of electronic modules that are attached to it with magnets. The wallpaper can be used for a multitude of functional and fanciful applications involving lighting, environmental sensing, appliance control, and ambient information display.
|
|
|
Blue morph: metaphor and metamorphosis |
| |
Victoria Vesna,
James K. Gimzewski
|
|
Pages: 1403-1404 |
|
doi>10.1145/1873951.1874227 |
|
Full text: PDF
|
|
The authors describe the Blue Morph installation they developed and produced in full collaboration as an art/science hybrid. Together, Vesna and Gimzewski created an art | science project that uses nano-scale images and sounds derived from the metamorphosis of a chrysalis into a butterfly as the overarching metaphor for the collective shift in consciousness. This is a condensed version of the conceptual and scientific background for the artwork that was developed with the goal of creating many different interpretations and experiences.
|
|
|
Chromatic perspectives... scaling my art |
| |
Franz Fischnaller
|
|
Pages: 1405-1406 |
|
doi>10.1145/1873951.1874228 |
|
Full text: PDF
|
|
This paper describes Chromatic Perspectives... Scaling my Art, which presents the results of a trans-medial exploration departing from an "unframed" process of creativity and multi-layered convergence between traditional media art and virtual art, mathematics, motion-Golden Ratio, motion perspective, nonlinear dimensionality, and immersive virtual representation, with particular attention to color, sound, locative and emotional involvement, and cognitive processes in visual perception.
|
|
|
CCC trilogy: the italian garden |
| |
Davide Venturini,
Francesco Gandi
|
|
Pages: 1407-1408 |
|
doi>10.1145/1873951.1874229 |
|
Full text: PDF
|
|
In this paper we outline the interactive installation 'The Italian garden', which is based on our CCC [children cheering carpet] technology (created with Max/MSP Jitter). The installation, which invites users to play in a typical Italian Renaissance garden, is presented at the ACM MM 2010 art exhibition.
|
|
|
SESSION: Interactive art short -- IA3 track |
| |
Sethuraman Panchanathan
|
|
|
|
|
Building with a memory: responsive color interventions |
| |
Andreea Danielescu,
Ryan Spicer,
David Tinapple,
Aisling Kelliher,
Ellen Campana
|
|
Pages: 1409-1412 |
|
doi>10.1145/1873951.1874231 |
|
Full text: PDF
|
|
Building with a Memory is a subtle responsive intervention that aims to provide cohesion and community awareness through the use of light and color. The installation delivers thought-provoking information by capturing, analyzing and rendering real-time and archived human activity in a workplace setting. The installation senses movement in the space through an IR camera and computer vision techniques. Two custom lighting fixtures and a video monitor render the aggregated movements. The visually simple aesthetic of the piece aims to balance active engagement and passive contribution, providing a rewarding experience for both occasional passersby and regular users of the space. This paper describes the motivations and contributions of the installation, together with insights gained from an informal evaluation and directions for future explorations.
|
|
|
The rumentarium project |
| |
Andrea Valle
|
|
Pages: 1413-1416 |
|
doi>10.1145/1873951.1874232 |
|
Full text: PDF
|
|
The paper describes the design, production and usage of the "Rumentarium", a computer-based sound generating system involving physical objects as sound sources. The Rumentarium is a set of handmade resonators, acoustically excited by DC motors, interfaced to a computer by four microcontrollers. Following an ecological/anthropological perspective, in the Rumentarium discarded materials are used as sound sources. While entirely computationally-controlled, the Rumentarium is an acoustic sound generator. The paper provides a general description of the Rumentarium and discusses some artistic applications.
|
|
|
Alan01: slivers of color, media and a soul |
| |
Mika Tuomola,
Teemu Korpilahti,
Jaakko Pesonen
|
|
Pages: 1417-1420 |
|
doi>10.1145/1873951.1874233 |
|
Full text: PDF
|
|
This paper introduces the interactive art installation Alan01, which wakes up Alan Turing, criminally convicted in 1952, as a piece of code within the artwork - thus fulfilling Turing's own vision of preserving human consciousness in a computer.
|
|
|
Encounter (resonances) |
| |
Hayley Hung,
Christian Jacquemin
|
|
Pages: 1421-1424 |
|
doi>10.1145/1873951.1874234 |
|
Full text: PDF
|
|
This work is about the remediation of one of Mark Rothko's Seagram murals through the composition of several online sources and additional digital rendering. Based on reproductions of Rothko's "Red on Maroon" found on the Internet, and using computer graphics compositing associated with moiré and specular lighting effects, "Encounter (Resonances)" offers a new approach to the presentation of a piece of work that allows a viewer to perceive some of its very subtle nuances. The work echoes Rothko's mixed media layered painting technique by using reproductions of various color palettes and resolutions as metaphors for the layers of paint in his original works. While each of these copies may instantly remind us of the original work, the graphical rendering of "Encounter (Resonances)" combines them at three levels of representation (global shape, micro and macro structure), in an effort to encourage a level of prolonged engagement and gradual discovery in the artwork.
|
|
|
Thrii |
| |
Nicole Lehrer,
David Tinapple,
Tatyana Koziupa,
Meng Chen,
Assegid Kidane,
Stjepan Rajko,
Isaac Wallis,
Michael Baran,
David Lorig,
Diana Siwiak,
Loren Olson
|
|
Pages: 1425-1428 |
|
doi>10.1145/1873951.1874235 |
|
Full text: PDF
|
|
Thrii is a multimodal interactive installation that explores levels of movement similarity among its participants. Each of the three participants manipulates a large spherical object whose movement is tracked via an embedded accelerometer. An analysis engine computes the similarity of movement for each possible pair of objects, as well as self-similarity (e.g., repetition of movement over time) for each object. The extent of similarity among the movements of each object is communicated by a visualization projected on a three-sided pyramid, a non-directional audio environment, and lighting produced by the spherical objects. The installation's focus is intended to examine notions of collaboration between participants. We have found that participants engage with Thrii through exploration of collaborative gestures.
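A rough sketch of what a pairwise movement-similarity measure can look like: normalized correlation of accelerometer-style magnitude windows. The signals and the measure below are illustrative assumptions, not the installation's actual analysis engine.

    import numpy as np

    def similarity(a, b):
        """Normalized correlation of two equal-length signal windows."""
        a = a - a.mean()
        b = b - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    rng = np.random.default_rng(1)
    t = np.linspace(0, 2 * np.pi, 100)
    s1 = np.sin(3 * t) + 0.1 * rng.standard_normal(100)  # object 1
    s2 = np.sin(3 * t) + 0.1 * rng.standard_normal(100)  # object 2, similar motion
    s3 = rng.standard_normal(100)                        # object 3, unrelated motion

    for pair, (x, y) in {"1-2": (s1, s2), "1-3": (s1, s3)}.items():
        print(pair, round(similarity(x, y), 2))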
|
|
|
HUM, an interactive and collaborative art installation |
| |
Jean-Julien Filatriau,
François Zajéga
|
|
Pages: 1429-1432 |
|
doi>10.1145/1873951.1874236 |
|
Full text: PDF
|
|
This paper describes HUM, an interactive art installation which interprets the behavior of the visitors on different time scales to render visual and sonic artwork in real-time. HUM was presented at BRASS cultural center (Brussels, Belgium) in May 2009.
|
|
|
Coming together: composition by negotiation |
| |
Arne Eigenfeldt
|
|
Pages: 1433-1436 |
|
doi>10.1145/1873951.1874237 |
|
Full text: PDF
|
|
In this paper, we describe a software system that generates unique musical compositions in real time, created by four autonomous multi-agents. Given no explicit musical data, agents explore their environment, building beliefs through interactions with other agents via messaging and listening (to audio and/or MIDI data), generating goals, and executing plans. The artistic focus of Coming Together is the actual process of convergence, heard during performance (each of which usually lasts about ten minutes): the movement from random individualism to converged ensemble interaction. If convergence is successful, four additional agents are instantiated that exploit the emergent harmony and rhythm through brief but beautiful melodic gestures. Once these agents have completed their work, or if the original "explorer" agents fail to converge, the system resets itself and the process begins again.
|
|
|
RTiVISS: real-time video interactive systems for sustainability |
| |
Mónica Mendes,
Nuno Correia
|
|
Pages: 1437-1440 |
|
doi>10.1145/1873951.1874238 |
|
Full text: PDF
|
|
RTiVISS is an exploratory project that investigates innovative concepts and design methods regarding environmental and sustainability issues. It is concerned with natural resources, especially forests, and their preservation, through critical research and experimental artistic approaches. The project proposes multiplatform access to real-time networked video and allows users to adopt selected forests under surveillance. The interactive system feeds a whole community that establishes connections by sharing "the emotion of real-time" and the challenge of uncertainty, while remotely monitoring natural environments for forest protection. This enables artistic explorations with digital media in interactive installations that engage the audience's senses in unconventional ways.
|
|
|
Chroma space: affective colors in interactive 3d world |
| |
Wendy Ann C. Mansilla,
Jordi Puig,
Andrew Perkis,
Touradj Ebrahimi
|
|
Pages: 1441-1444 |
|
doi>10.1145/1873951.1874239 |
|
Full text: PDF
|
|
We have developed an installation called Chroma Space to serve as a platform for experimenting with the novel usage of affective colors in an interactive synthetic scenario. Chroma Space demonstrates the impact of using a stylistic approach to address emotional sensations by making colors move in space. We conducted a study to assess the impact of exposure to achromatic colors in different interactive 3D scenarios. The results suggest that the presentation of color has an emotional impact on viewers. We argue that designers of synthetic 3D environments should consider the stylistic or thematic use of color on screen to increase the emotional attractiveness of an application.
|
|
|
An interactive multimedia framework for digital heritage narratives |
| |
Neeharika Adabala,
Naren Datha,
Joseph Joy,
Chinmay Kulkarni,
Ajay Manchepalli,
Aditya Sankar,
Rebecca Walton
|
|
Pages: 1445-1448 |
|
doi>10.1145/1873951.1874240 |
|
Full text: PDF
|
|
The cultural heritage of a region is conveyed by both tangible physical artifacts and intangible aspects in the form of stories, dance styles, rituals, etc. Hitherto, the task of creating digital representations for each of these aspects has been addressed in isolation, i.e. using specific media most suited to the artifact such as video, audio, three-dimensional (3D) models, scanning, etc. The challenge of bringing together these separate elements to create a coherent story, however, has remained unaddressed until recently. In this paper we present a unified digital framework that enables the integration of disparate representations of heritage elements into a holistic entity. Our approach results in a compelling and engaging narration that affords a unified user experience. Our solution supports both active (user-controlled explorations) and passive (watching pre-orchestrated narrations) user interactions. We demonstrate the capabilities of our framework through a qualitative user study based on two rich interactive narratives built using our framework: (1) history and folklore surrounding a temple in South India, and (2) a historical account of an educational institution also in South India.
|
|
|
Natural interaction for cultural heritage: the archaeological site of Shawbak |
| |
Thomas Matteo Alisi,
Gianpaolo D'Amico,
Andrea Ferracani,
Lea Landucci,
Nicola Torpei
|
|
Pages: 1449-1452 |
|
doi>10.1145/1873951.1874241 |
|
Full text: PDF
|
|
One of the most interesting issues in the field of cultural heritage is the adoption of multimedia systems for the visualization and organization of information. In this paper we present a natural interaction based system designed to present multimedia contents related to the archaeological site of Shawbak, situated in the Petra region of Jordan. Contents are composed of texts, images and videos showing and explaining the archaeological site areas and the history of the castle. This system was installed at the Limonaia di Palazzo Pitti (Italy) for the archaeological exhibition called "From Petra to Shawbak".
|
|
|
Yongzheng emperor's interactive tabletop: seamless multimedia system in a museum context |
| |
Chun-Ko Hsieh,
I-Ling Liu,
Neng-Hao Yu,
Yueh-Hsuan Chiang,
Hsiang-Tao Wu,
Ying-Jui Chen,
Yi-Ping Hung
|
|
Pages: 1453-1456 |
|
doi>10.1145/1873951.1874242 |
|
Full text: PDF
|
|
In this paper, we propose the seamless multimedia system Yongzheng Emperor's Interactive Tabletop, which has been incorporated into the special exhibition "Harmony and Integrity: The Yongzheng Emperor and His Times" at the National Palace Museum in Taiwan. The multimedia system features the innovative use of physical artifacts - Yongzheng figurines and a model of a Yongzheng-era calendar clock - as tangible user interfaces, which activate on the Surface displays of the emperor's life at court and the chronological events of his times. Museum audiences can naturally and intuitively explore the Emperor's stories through hand gestures. The system vividly connects the modern world of the audience with the emperor's virtual world, engaging museum audiences in a highly interactive and compelling way to learn about the emperor. The paper presents the development of the seamless tabletop system in a historical museum context, including design principles, implementation, applications, and the effectiveness of the system. Our contribution in this project is to demonstrate a new exhibition display model for the museum sector.
|
|
|
SESSION: Interactive art open workshop: interactive multimedia computing for creativity and expression track |
| |
Andruid Kerne
|
|
|
|
|
Interactive multimedia computing for creativity and expression |
| |
Andruid Kerne,
Frank Nack,
Luca Farulli
|
|
Pages: 1457-1458 |
|
doi>10.1145/1873951.1874244 |
|
Full text: PDF
|
|
In this paper we outline the aims and organization of the ACM MM 10 workshop on 'Interactive Multimedia Computing for Creativity and Expression'.
|
|
|
SESSION: Open source software competition -- OS1 track |
| |
Nicu Sebe
|
|
|
|
|
openSMILE: the Munich versatile and fast open-source audio feature extractor |
| |
Florian Eyben,
Martin Wöllmer,
Björn Schuller
|
|
Pages: 1459-1462 |
|
doi>10.1145/1873951.1874246 |
|
Full text: PDF
|
|
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency, and formant frequencies are supported. Delta regression and various statistical functionals can be applied to the low-level descriptors. openSMILE is implemented in C++ with no third-party dependencies for the core functionality. It is fast, runs on Unix and Windows platforms, and has a modular, component based architecture which makes extensions via plug-ins easy. It supports on-line incremental processing for all implemented features as well as off-line and batch processing. Numeric compatibility with future versions is ensured by means of unit tests. openSMILE can be downloaded from http://opensmile.sourceforge.net/.
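In day-to-day use the toolkit is typically driven from its SMILExtract command-line tool; a minimal batch invocation, wrapped in Python here, might look like the following. Paths are placeholders, and the config file name is assumed to match one shipped with the toolkit.

    import subprocess

    # Extract MFCC-style low-level descriptors from one audio file.
    subprocess.run([
        "SMILExtract",
        "-C", "config/MFCC12_0_D_A.conf",  # assumed shipped config: MFCCs + deltas
        "-I", "input.wav",                 # audio file to analyse
        "-O", "features.csv",              # extracted descriptors
    ], check=True)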
|
|
|
Open SVC decoder: a flexible SVC library |
| |
Médéric Blestel,
Mickaël Raulet
|
|
Pages: 1463-1466 |
|
doi>10.1145/1873951.1874247 |
|
Full text: PDF
|
|
This paper describes the Open SVC Decoder project, an open source library which implements the Scalable Video Coding (SVC) standard, the latest standardized by the Joint Video Team (JVT). This library has been integrated into the open source players The Core Pocket Media Player (TCPMP) and MPlayer, in order to be deployed over different platforms with different operating systems.
|
|
|
Sonic visualiser: an open source application for viewing, analysing, and annotating music audio files |
| |
Chris Cannam,
Christian Landone,
Mark Sandler
|
|
Pages: 1467-1468 |
|
doi>10.1145/1873951.1874248 |
|
Full text: PDF
|
|
Sonic Visualiser is a friendly and flexible end-user desktop application for analysis, visualisation, and annotation of music audio files. Its stated goal is to be "the first program you reach for when you want to study a musical recording rather than simply listen to it". To this end, it has a user interface that resembles familiar audio editing applications, a set of useful standard visualisation facilities, and support for a plugin format for additional automated analysis methods.
|
|
|
VLFeat: an open and portable library of computer vision algorithms |
| |
Andrea Vedaldi,
Brian Fulkerson
|
|
Pages: 1469-1472 |
|
doi>10.1145/1873951.1874249 |
|
Full text: PDF
|
|
VLFeat is an open and portable library of computer vision algorithms. It aims at facilitating fast prototyping and reproducible research for computer vision scientists and students. It includes rigorous implementations of common building blocks such as feature detectors, feature extractors, (hierarchical) k-means clustering, randomized kd-tree matching, and super-pixelization. The source code and interfaces are fully documented. The library integrates directly with MATLAB, a popular language for computer vision research.
|
|
|
TOP-SURF: a visual words toolkit |
| |
Bart Thomee,
Erwin M. Bakker,
Michael S. Lew
|
|
Pages: 1473-1476 |
|
doi>10.1145/1873951.1874250 |
|
Full text: PDF
|
|
TOP-SURF is an image descriptor that combines interest points with visual words, resulting in a high performance yet compact descriptor that is designed with a wide range of content-based image retrieval applications in mind. TOP-SURF offers the flexibility to vary descriptor size and supports very fast image matching. In addition to the source code for the visual word extraction and comparisons, we also provide a high level API and very large pre-computed codebooks targeting web image content for both research and teaching purposes.
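The visual-words idea behind the descriptor can be illustrated in a few lines: each image becomes a histogram over a codebook of quantized interest-point descriptors, and images are compared by cosine similarity. This is an illustrative sketch with made-up word IDs, not the library's API or its exact distance measure.

    import numpy as np

    VOCAB = 1000  # assumed codebook size

    def bow_histogram(word_ids, vocab_size=VOCAB):
        """Normalized visual-word count histogram for one image."""
        h = np.bincount(word_ids, minlength=vocab_size).astype(float)
        norm = np.linalg.norm(h)
        return h / norm if norm else h

    rng = np.random.default_rng(0)
    img_a = rng.integers(0, VOCAB, 300)  # word IDs from interest points
    img_b = np.concatenate([img_a[:200], rng.integers(0, VOCAB, 100)])

    # Cosine similarity of the two normalized histograms.
    sim = float(bow_histogram(img_a) @ bow_histogram(img_b))
    print("similarity:", round(sim, 2))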
|
|
|
SESSION: Open source software competition -- OS2 track |
| |
Marco Bertini
|
|
|
|
|
FALCON: FAst Lucene-based Cover sOng identification |
| |
Emanuele Di Buccio,
Nicola Montecchio,
Nicola Orio
|
|
Pages: 1477-1480 |
|
doi>10.1145/1873951.1874252 |
|
Full text: PDF
|
|
We present FALCON, an open-source engine for content-based cover song identification written in Java. The popular Lucene search engine library is used as the core of the software, proving that textual methods in information retrieval can be successfully adapted to multimedia tasks. An overview of the system methodology and of the implementation is provided, along with experimental results on a medium-size test collection.
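The core trick, making audio descriptors look like text so that a text engine such as Lucene can index them, can be sketched as follows; the chroma-to-token encoding below is an invented illustration, not FALCON's actual quantization.

    import numpy as np

    PITCHES = "C C# D D# E F F# G G# A A# B".split()

    def chroma_to_word(chroma, top=2):
        """Encode a 12-dim chroma frame as a token of its strongest pitch classes."""
        idx = np.argsort(chroma)[-top:][::-1]
        return "-".join(PITCHES[i] for i in idx)

    rng = np.random.default_rng(3)
    frames = rng.random((5, 12))            # stand-in chroma frames
    document = " ".join(chroma_to_word(f) for f in frames)
    print(document)                         # a 'text document' ready for indexing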
|
|
|
The Python computer vision framework |
| |
Bertrand Nouvel,
Shin'Ichi Satoh
|
|
Pages: 1481-1484 |
|
doi>10.1145/1873951.1874253 |
|
Full text: PDF
|
|
PyCVF is an open source framework for computer vision and video mining. It allows rapid development of applications and provides standardized tools for common operations such as browsing datasets, applying transformations to a dataset on-the-fly, computing features, indexing multimedia datasets, querying for nearest neighbors, training a statistical model, and browsing the results in a 3D space. PyCVF has a Python API, and it also provides command-line programs, a Qt GUI, and a web front-end. It interacts nicely with other leading frameworks such as Weka, Orange, OpenCV, Django...
|
|
|
Torchvision: the machine-vision package of Torch |
| |
Sébastien Marcel,
Yann Rodriguez
|
|
Pages: 1485-1488 |
|
doi>10.1145/1873951.1874254 |
|
Full text: PDF
|
|
This paper presents Torchvision, an open source machine vision package for Torch. Torch is a machine learning library providing a series of state-of-the-art algorithms such as Neural Networks, Support Vector Machines, Gaussian Mixture Models, Hidden Markov Models and many others. Torchvision provides additional functionality to manipulate and process images with standard image processing algorithms. Hence, the resulting images can be used directly with the Torch machine learning algorithms, as Torchvision is fully integrated with Torch. Both Torch and Torchvision are written in C++ and are publicly available under the Free-BSD License.
|
|
|
The openIP open source image processing library |
| |
György Kovács,
János István Iván,
Árpád Pányik,
Attila Fazekas
|
|
Pages: 1489-1492 |
|
doi>10.1145/1873951.1874255 |
|
Full text: PDF
|
|
The openIP open source image processing library is a set of C++ libraries providing tools for education, research and industrial purposes. The aim of the development is to fill the gap between the academic and commercial utilization of image processing. The openIP libraries are interoperable, open source and easy to install. To provide fast code, assembler optimization, OpenMP parallelization and OpenCL-based GPU utilization are integrated.
|
|
|
An open-source SIFT library |
| |
Rob Hess
|
|
Pages: 1493-1496 |
|
doi>10.1145/1873951.1874256 |
|
Full text: PDF
|
|
Recent years have seen an explosion in the use of invariant keypoint methods across nearly every area of computer vision research. Since its introduction, the scale-invariant feature transform (SIFT) has been one of the most effective and widely-used of these methods and has served as a major catalyst in their popularization. In this paper, I present an open-source SIFT library, implemented in C and freely available at http://eecs.oregonstate.edu/~hess/sift.html, and I briefly compare its performance with that of the original SIFT executable released by David Lowe.
|
|
|
SESSION: Industrial exhibit -- IE1 track |
| |
Berna Erol
|
|
|
|
|
Virtual environment for surprises |
| |
Lara Oliveti,
Marcella Albiero,
Paolo Giordano
|
|
Pages: 1497-1498 |
|
doi>10.1145/1873951.1874258 |
|
Full text: PDF
|
|
Creation of a virtual, interactive, and highly evolved environment with Surprises characters.
|
|
|
RAPID: a reliable protocol for improving delay |
| |
Sanjeev Mehrotra,
Jin Li,
Cheng Huang
|
|
Pages: 1499-1500 |
|
doi>10.1145/1873951.1874259 |
|
Full text: PDF
|
|
Recently, there has been a dramatic increase in interactive cloud-based software applications (e.g., working on remote machines, online games, and interactive websites such as financial services and web search) and other soft real-time applications (traffic within data centers). Compared to classical real-time media applications (VoIP/conferencing) and non-real-time file delivery, these interactive software applications have unique characteristics: they are delay sensitive yet demand in-order and reliable data delivery. Therefore, existing protocols for delivery of lossless data (such as TCP) and other delivery protocols using UDP do not work well. In this demo, we show substantially improved performance for such traffic by using a transport protocol built on top of UDP which uses intelligent adaptive forward error correction (FEC) and improved congestion control (rate control). The transport protocol is made lossless by using a hybrid FEC/ARQ strategy. The congestion control technique improves delay performance by preventing congestion-induced loss and minimizing queuing delay while still fully utilizing network capacity and maintaining fairness across flows. In this demo, we present RAPID (a ReliAble transport Protocol for Improving end-to-end Delay) and show its effectiveness in improving the performance of interactive client-server applications.
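RAPID's actual FEC and rate-control algorithms are not spelled out in the abstract; the toy sketch below shows only the basic idea of packet-level forward error correction, where one XOR parity packet per group lets the receiver repair a single loss without waiting for a retransmission.

    # Toy packet-level FEC (illustrative only, not RAPID's scheme): one XOR
    # parity packet per group of k data packets recovers any single loss.
    from functools import reduce

    def xor_packets(packets):
        """XOR a list of equal-length byte strings together."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

    group = [b"pkt0____", b"pkt1____", b"pkt2____"]   # placeholder payloads
    parity = xor_packets(group)

    # Receiver got packets 0 and 2; XOR-ing them with the parity restores 1.
    recovered = xor_packets([group[0], group[2], parity])
    assert recovered == group[1]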
|
|
|
A location based reminder system for advertisement |
| |
Yiqun Li,
Aiyuan Guo,
Siying Liu,
Yan Gao,
Yan-Tao Zheng
|
|
Pages: 1501-1502 |
|
doi>10.1145/1873951.1874260 |
|
Full text: PDF
|
|
In this paper, we propose a location-based reminder system using image recognition technology. With this system, mobile phone users can actively capture pictures of their favorite product or event promotional materials. After the phone user sends the picture to a server, location-based reminders are downloaded to the phone. The mobile phone then alerts the user when he or she is close to the place where the product is sold or the event is happening. Kd-tree image matching and geometric validation are used to identify which product the user is interested in. A mobile client application is developed to take pictures, conduct GPS location tracking, and pop up the reminders.
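At its core, the alerting step is a proximity test between the phone's GPS fix and the stored reminder location. A minimal sketch, with hypothetical coordinates and a hypothetical 200 m trigger radius:

    # Proximity test behind a location-based reminder (illustrative values).
    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points in meters."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 \
            + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    reminder = {"lat": 1.3000, "lon": 103.8000, "text": "Promotion nearby"}
    phone_fix = (1.3011, 103.8004)                    # current GPS reading

    if haversine_m(*phone_fix, reminder["lat"], reminder["lon"]) < 200:
        print("Reminder:", reminder["text"])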
|
|
|
Embedded media marker: linking multimedia to paper |
| |
Qiong Liu,
Chunyuan Liao,
Lynn Wilcox,
Anthony Dunnigan,
Bee Liew
|
|
Pages: 1503-1504 |
|
doi>10.1145/1873951.1874261 |
|
Full text: PDF
|
|
An Embedded Media Marker (EMM) is a transparent mark printed on a paper document that signifies the availability of additional media associated with that part of the document. Users take a picture of the EMM using a camera phone, and the media associated with that part of the document is displayed on the phone. Unlike bar codes, EMMs are nearly transparent and thus do not interfere with the document appearance. Retrieval of media associated with an EMM is based on image features of the document within the EMM boundary. Unlike other feature-based retrieval methods, the EMM clearly indicates to the user the existence and type of media associated with the document location. A semi-automatic authoring tool is used to place an EMM at a location in a document, in such a way that it encompasses sufficient identification features with minimal disturbance to the original document. We will demonstrate how to create an EMM-enhanced document, and how the EMM enables access to the associated media on a cell phone.
|
|
|
The virtual chocolate factory: mixed reality industrial collaboration and control |
| |
Maribeth Back,
Don Kimber,
Eleanor Rieffel,
Anthony Dunnigan,
Bee Liew,
Sagar Gattepally,
Jonathan Foote,
Jun Shingu,
James Vaughan
|
|
Pages: 1505-1506 |
|
doi>10.1145/1873951.1874262 |
|
Full text: PDF
|
|
We show several aspects of a complex mixed reality system that we have built and deployed in a real-world factory setting. In our system, virtual worlds, augmented realities, and social and mobile applications are all fed from the same infrastructure. In collaboration with TCHO[1], a chocolate maker in San Francisco, we built a virtual "mirror" world of a real-world chocolate factory and its processes. Sensor data is imported into the multi-user 3D environment from hundreds of sensors on the factory floor. The resulting virtual factory is used for simulation, visualization, and collaboration, using a set of interlinked, real-time layers of information. Another part of our infrastructure is designed to support appropriate industrial uses for mobile devices such as cell phones and tablet computers. We deployed this system at the real-world factory in 2009, and it is now in daily use there. By simultaneously developing mobile, virtual, and web-based display and collaboration environments, we aimed to create an infrastructure that did not skew toward one type of application but could serve many at once, interchangeably. Through this mixture of mobile, social, mixed, and virtual technologies, we hope to create systems for enhanced collaboration in industrial settings between physically remote people and places, such as factories in China with managers in the US.
|
|
|
TalkMiner: a search engine for online lecture video |
| |
John Adcock,
Matthew Cooper,
Laurent Denoue,
Hamed Pirsiavash,
Lawrence A. Rowe
|
|
Pages: 1507-1508 |
|
doi>10.1145/1873951.1874263 |
|
Full text: PDF
|
|
TalkMiner is a search engine for lecture webcasts. Lecture videos are processed to recover a set of distinct slide images, and OCR is used to generate a list of indexable terms from the slides. On our prototype system, users can search and browse lists of lectures and slides in a specific lecture, and play the lecture video. Over 10,000 lecture videos have been indexed from a variety of sources. A public website will be published in mid-2010 that will allow users to experiment with the search engine.
|
|
|
Multi-sensor fusion for interactive visual computing in mixed environment |
| |
Peng Patricia Wang,
Tao Wang,
Dayong Ding,
Yimin Zhang,
Kai Miao,
Cynthia K. Pickering,
Phil Tian,
Jinxue Zhang
|
|
Pages: 1509-1510 |
|
doi>10.1145/1873951.1874264 |
|
Full text: PDF
|
|
Mobile augmented reality, as an emerging application for handheld devices, explores more natural interactions in real and virtual environments. For the purpose of accurate system response and manipulating objects in real time, extensive efforts have been made to estimate six-degree-of-freedom pose and to extract robust features to track. However, there are still many challenges today in achieving a rich user experience. To allow for a seamless transition from outdoor to indoor service, we investigated and integrated various sensing techniques: GPS, wireless, inertial measurement units, and optical. A parallel tracking and matching scheme is presented to address the speed-accuracy tradeoff. Two prototypes, fine-scale mirror-world navigation and context-aware troubleshooting, have been developed to demonstrate the suitability of our approach.
|
|
|
Uffizi Touch®: a new experience with art |
| |
Marco Cappellini,
Paolo De Rocco,
Leonardo Serni
|
|
Pages: 1511-1512 |
|
doi>10.1145/1873951.1874265 |
|
Full text: PDF
|
|
Centrica (www.centrica.it) has developed Uffizi Touch®, which brings the world's most famous art gallery into an interactive digital signage solution that you can place in your personal space. Demo videos at www.uffizitouch.com.
|
|
|
Social recommendation and visual analysis on the TV |
| |
Cathal Gurrin,
Hyowon Lee,
Paul Ferguson,
Alan F. Smeaton,
Noel E. O'Connor,
Yoonhee Choi,
Heeseon Park
|
|
Pages: 1513-1514 |
|
doi>10.1145/1873951.1874266 |
|
Full text: PDF
|
|
In this paper, we present prototype interactive TV software that incorporates visual content analysis tools and social networking into the home TV. We present the challenges of working within the living room TV environment and outline how we have utilized visual processing and search technologies to address these challenges and create a novel prototype interactive TV system.
|
|
|
Multimedia security technologies for movie protection |
| |
Michael Arnold,
Séverine Baudry,
Peter Baum,
Xiao-Ming Chen,
Bertrand Chupeau,
Olivier Courtay,
Gwenaël Doërr,
Ulrich Gries,
Frédéric Lefèbvre,
Michel Morvan,
Antoine Robert,
Charles Salmon-Legagneur,
Christophe Vincent,
Mario de Vito
|
|
Pages: 1515-1516 |
|
doi>10.1145/1873951.1874267 |
|
Full text: PDF
|
|
In this industrial exhibit, Technicolor will showcase various technologies developed to protect digital audio-visual material all along the media value chain, from production to distribution, including forensics services in case pirated content is detected. The exhibit will be organized around Technicolor's proprietary technologies in video fingerprinting and audio/video watermarking.
|
|
|
Efficient and robust near-duplicate detection in large and growing image data-sets |
| |
Thomas Pönitz,
Julian Stöttinger
|
|
Pages: 1517-1518 |
|
doi>10.1145/1873951.1874268 |
|
Full text: PDF
|
|
Due to the increasing flood of digital images and the overall increase in storage capacity, large-scale image databases are common these days. This work deals with the problem of finding replicas in image databases containing more than 100,000 images. A clustering algorithm is developed that has linear runtime and can be carried out in parallel. We observe that with increasing database size, the problem of decreasing discrimination between high-frequency images arises: features of images with natural repetitive texture become similar to those of other images and show up in most of the search results. This problem is addressed by developing an asymmetric Hamming distance measure for bags of visual words. It allows better discrimination in large databases, while being robust to image transformations such as rotation, cropping, or changes of resolution and size.
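The asymmetric weighting itself is not detailed in the abstract; the sketch below shows only the underlying representation, comparing binarized bag-of-visual-words signatures by plain Hamming distance (vocabulary size and word ids are placeholders).

    # Binarized bag-of-visual-words signatures compared by Hamming distance.
    # The paper's asymmetric variant is not reproduced here.
    import numpy as np

    def bow_signature(word_ids, vocab_size):
        """Binary occupancy vector: which visual words occur in the image."""
        sig = np.zeros(vocab_size, dtype=bool)
        sig[word_ids] = True
        return sig

    a = bow_signature([3, 17, 42, 99], vocab_size=1000)
    b = bow_signature([3, 17, 42, 500], vocab_size=1000)
    print(np.count_nonzero(a != b))    # Hamming distance: 2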
|
|
|
File-based media workflows using LTFS tapes |
| |
Arnon Amir,
David Pease,
Rainer Richter,
Brian Biskeborn,
Michael Richmond,
Lucas Villa Real
|
|
Pages: 1519-1520 |
|
doi>10.1145/1873951.1874269 |
|
Full text: PDF
|
|
While digital video cameras have existed for over two decades, digital video cassettes are still the primary storage medium in professional video archives. One of the major inhibitors in the transition to file-based workflows and media archives is the lack of an affordable, portable, and archive-compatible storage medium for the vast amounts of content produced. We address this need by a) defining the Linear Tape File System (LTFS) tape format for storing files, file properties, hierarchical directories, and extended attributes; b) building file system software that allows LTFS tapes to be used in the same way as portable storage devices; and c) leveraging LTFS to create efficient file-based media workflows. In the exhibit we present LTFS on LTO-5 (Linear Tape Open, Gen 5) tapes. We demonstrate file-based workflows with storyboards, video proxies, and partial video restore of MXF (Material Exchange Format) professional video content. LTFS on LTO-5 tape can be 20 times higher in capacity, 10 times faster, and 40 times cheaper than digital video cassette media. Furthermore, it combines the benefits of tape-based and file-based workflows. The new tape format streamlines file-based production, from video capture and transport to long-term archive. The tape format and file system implementation are available as open source.
|
|
|
PopApp: the first 3D popular application for non-linear presentations |
| |
Claudio Mazzanti
|
|
Pages: 1521-1522 |
|
doi>10.1145/1873951.1874270 |
|
Full text: PDF
|
|
PopApp is an application for organizing and publishing multimedia content in a 3D graphical interface with a non-linear hierarchical structure. Presentations can be quickly and easily created and edited using the PopApp Editor. A step forward from presentation to "presentainment".
|
|
|
Large scale partially duplicated web image retrieval |
| |
Wengang Zhou,
Yijuan Lu,
Houqiang Li,
Yibing Song,
Qi Tian
|
|
Pages: 1523-1524 |
|
doi>10.1145/1873951.1874271 |
|
Full text: PDF
|
|
State-of-the-art image retrieval approaches represent images with a high-dimensional vector of visual words by quantizing local features, such as SIFT, in the descriptor space. The geometric clues among visual words in an image are usually either ignored or exploited for full geometric verification, which is computationally expensive. In recent years, partially duplicated images have become prevalent on the web. In this demo, we focus on partial-duplicate web image retrieval and propose a retrieval system based on a novel scheme, spatial coding, to encode the spatial information among local features in an image. Our spatial coding is both efficient and effective at discovering false matches of local features between images, and can greatly improve retrieval performance.
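As a rough sketch of the spatial coding idea (not the authors' exact encoding), the pairwise relative positions of an image's matched features can be stored as binary maps; entries where the two images disagree expose geometrically inconsistent, likely false matches.

    # Pairwise relative-position maps for matched features (illustrative).
    import numpy as np

    def spatial_maps(points):
        """points: (n, 2) array of matched-feature (x, y) locations."""
        x, y = points[:, 0], points[:, 1]
        x_map = x[:, None] < x[None, :]   # x_map[i, j]: is feature i left of j?
        y_map = y[:, None] < y[None, :]   # y_map[i, j]: is feature i above j?
        return x_map, y_map

    p = np.array([[10, 5], [40, 25], [70, 60]])    # locations in image 1
    q = np.array([[12, 7], [45, 20], [15, 65]])    # third match is inconsistent
    (xa, ya), (xb, yb) = spatial_maps(p), spatial_maps(q)
    print(np.count_nonzero(xa != xb) + np.count_nonzero(ya != yb))  # > 0 flags false matches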
|
|
|
Visual search applications for connecting published works to digital material |
| |
Jamey Graham,
Jorge Moraleda,
Jonathan J. Hull,
Timothee Bailloeul,
Xu Liu,
Andrea Mariotta
|
|
Pages: 1525-1526 |
|
doi>10.1145/1873951.1874272 |
|
Full text: PDF
|
|
Visual search connects physical (offline) objects with (online) digital media. Using objects from the environment, like newspapers, magazines, books, and posters, we can retrieve supplemental information from the online world. In this demonstration, we show a framework for delivering visual search services to users of mobile devices. We show how users can point a mobile device at any location in a document, magazine, or book to view related online material on the device. We describe client applications now being deployed for the iPhone and the server architecture used for recognition of scanned images.
|
|
|
SCOTT: set cover tracing technology |
| |
Dulce Ponceleon,
Jeff Lotspiech,
Hongxia Jin,
Eric Wilcox
|
|
Pages: 1527-1528 |
|
doi>10.1145/1873951.1874273 |
|
Full text: PDF
|
|
In this paper, we describe SCOTT: a demonstration system that uses the Set Cover Tracing algorithm for determining the source of pirated content. This algorithm is very efficient in dealing with collusion attacks: its performance is close to linear in the number of colluders. However, the algorithm is based on the Set Cover Problem, which is known to be NP-hard. SCOTT confirms the assertion in the original paper that a set cover algorithm is efficient in this particular application. The SCOTT system is suitable for use in a commercial application, the most notable of which is tracing the source of pirated Blu-ray movies. (Blu-ray players contain a built-in tracing-traitors key assignment.) It also contains a visualization of the tracing process. After each pirated movie, SCOTT displays the universe of all players and its estimate of guilt for each player.
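The tracing algorithm itself is not given in the abstract, but it builds on set cover; the classic greedy heuristic for that underlying problem looks roughly like this (toy data, not the SCOTT implementation):

    # Greedy set cover: repeatedly take the set covering the most
    # still-uncovered elements (toy illustration).
    def greedy_set_cover(universe, subsets):
        uncovered, chosen = set(universe), []
        while uncovered:
            best = max(subsets, key=lambda s: len(uncovered & s))
            if not (uncovered & best):
                raise ValueError("universe cannot be covered")
            chosen.append(best)
            uncovered -= best
        return chosen

    universe = range(1, 8)
    subsets = [{1, 2, 3}, {2, 4}, {3, 4, 5}, {5, 6, 7}, {1, 6}]
    print(greedy_set_cover(universe, subsets))   # three sets covering 1..7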
|
|
|
Physical hyperlinks for citizen interaction |
| |
Dustin Haisler,
Phil Tate
|
|
Pages: 1529-1530 |
|
doi>10.1145/1873951.1874274 |
|
Full text: PDF
|
|
In 2008, the City of Manor deployed Quick Response barcodes, also known as QR codes, throughout its community of 6,500 people. The QR codes were initially intended as a document management solution, but eventually turned into a powerful and engaging government transparency initiative that reshaped how information was disseminated to citizens.
|
|
|
Mobile document scanning and copying |
| |
Jian Fan,
Qian Lin,
Jerry Liu
|
|
Pages: 1531-1532 |
|
doi>10.1145/1873951.1874275 |
|
Full text: PDF
|
|
In this paper, we show a multimedia system for processing documents captured with a mobile camera. Using a client application on a mobile phone, a user can capture a document image and send it to a processing server, where the document image is restored using automatic perspective and illumination corrections. The restored document can then be sent to a web-connected printer to complete the copying task.
|
|
|
Data-driven behavioural algorithms for online advertising |
| |
Antonio Tomarchio,
Francesco Bellacci,
Filippo Privitera
|
|
Pages: 1533-1534 |
|
doi>10.1145/1873951.1874276 |
|
Full text: PDF
|
|
In this paper, we describe an innovative data-driven behavioural approach that we developed for the optimization of performance-based online advertising on Simply, the new international ad network developed by Dada S.p.A.
|
|
|
DEMONSTRATION SESSION: Demo - D1 track |
| |
Daniel Gatica-Perez
|
|
|
|
|
Crowdsourcing rock n' roll multimedia retrieval |
| |
Cees G.M. Snoek,
Bauke Freiburg,
Johan Oomen,
Roeland Ordelman
|
|
Pages: 1535-1538 |
|
doi>10.1145/1873951.1874278 |
|
Full text: PDF
|
|
In this technical demonstration, we showcase a multimedia search engine that facilitates semantic access to archival rock n' roll concert video. The key novelty is the crowdsourcing mechanism, which relies on online users to improve, extend, and share automatically detected results in video fragments using an advanced timeline-based video player. The user feedback serves as valuable input to further improve automated multimedia retrieval results, such as automatically detected concepts and automatically transcribed interviews. The search engine has been operational online to harvest valuable feedback from rock n' roll enthusiasts.
|
|
|
Media distribution over 2D communication sheet |
| |
Youiti Kado,
Bing Zhang,
Jiang Yu Zheng
|
|
Pages: 1539-1542 |
|
doi>10.1145/1873951.1874279 |
|
Full text: PDF
|
|
This paper demonstrates a media infrastructure that can distribute multimedia signals and power via a two-dimensional communication sheet. Small, low-power devices placed on top of the sheet are able to receive audio and video signals transmitted from a computer. The 2D sheet is printed in three layers, and the electromagnetic waves are distributed in the middle layer. Through a grid of slits printed on the top layer, the electromagnetic waves leak out and are picked up by small multimedia devices with antennas. Compared to wired and wireless communication, this 2D sheet has unique properties: devices can be freely placed and moved on it, and it offers power efficiency, wide transmission bandwidth, and secure communication.
|
|
|
Melog |
| |
Hongzhi Li,
Xian-Sheng Hua,
Xijia Liu
|
|
Pages: 1543-1546 |
|
doi>10.1145/1873951.1874280 |
|
Full text: PDF
|
|
We demonstrate Melog, a "mobile + cloud" multimedia system enabling efficient and near-realtime experience sharing through automatic blogging and micro-blogging, based on multi-modal media content analysis and synthesis. Unlike existing mobile blogging methods, Melog aims to reduce users' manual effort by summarizing the trip intelligently and sharing the travel experience automatically or semi-automatically with few user interactions.
|
|
|
Color and luminance compensation for mobile panorama construction |
| |
Yingen Xiong,
Kari Pulli
|
|
Pages: 1547-1550 |
|
doi>10.1145/1873951.1874281 |
|
Full text: PDF
|
|
We provide an efficient technique of color and luminance compensation for sequences of overlapping images. It can be used in the construction of high-resolution, high-quality panoramic images even when the input images have very different colors and luminance. The technique uses color matching in the overlapping areas of source image pairs to balance colors and luminance across the whole image sequence. It performs gamma correction for the luminance component and linear correction for the chrominance components of the source images. Compared to existing approaches, our technique is simple and efficient, yet it avoids color saturation problems during color correction and performs color and luminance compensation globally over the whole image sequence. We apply the technique to create high-quality panoramic images on mobile phones.
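A minimal sketch of the stated correction model, assuming YUV-like channels normalized to [0, 1] and placeholder overlap data: a gamma is fitted for the luminance channel and a linear gain for a chrominance channel from the statistics of the overlap region.

    # Gamma for luminance, linear gain for chrominance, fitted on the overlap.
    import numpy as np

    def estimate_gamma(src_y, ref_y, eps=1e-6):
        """Gamma g such that mean(src_y) ** g ~= mean(ref_y)."""
        return np.log(ref_y.mean() + eps) / np.log(src_y.mean() + eps)

    def estimate_gain(src_c, ref_c, eps=1e-6):
        """Linear gain matching the chrominance means."""
        return ref_c.mean() / (src_c.mean() + eps)

    # Placeholder overlap data for one source/reference image pair.
    src_y, ref_y = np.random.rand(80, 80) * 0.4, np.random.rand(80, 80) * 0.6
    src_u, ref_u = np.random.rand(80, 80) * 0.5, np.random.rand(80, 80) * 0.45

    g, k = estimate_gamma(src_y, ref_y), estimate_gain(src_u, ref_u)
    corrected_y = src_y ** g                       # apply to the whole source luma
    corrected_u = np.clip(src_u * k, 0.0, 1.0)     # and to the chroma channel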
|
|
|
iPhotobook: creating photo books on mobile devices |
| |
Jun Xiao,
Nic Lyons,
C. Brian Atkins,
Yuli Gao,
Hui Chao,
Xuemei Zhang
|
|
Pages: 1551-1554 |
|
doi>10.1145/1873951.1874282 |
|
Full text: PDF
|
|
The number of photos captured and stored on mobile devices is growing rapidly. We regularly see traditional desktop multimedia applications being ported to mobile devices; however, less often do we see novel interaction mechanisms developed to deal effectively with the physical limitations of the display and input of such devices. We built a mobile application called iPhotobook that enables users to create and edit photo books on iPhones. The application leverages image analysis algorithms to automate the user tasks of photo selection, grouping, editing, and layout that would be difficult to accomplish without a large-screen input. In this paper, we present these technologies and, more importantly, illustrate how they work together seamlessly with a gesture-based user interface that creates a fun photo book authoring experience. Although the iPhone is our platform of choice and photo book creation is our target application, our innovations in UI design and automation algorithms may well be generalized to other small-screen devices and applied to other mobile media applications.
|
|
|
Blog2Book: transforming blogs into photo books employing aesthetic principles |
| |
Philipp Sandhaus,
Mohammad Rabbath,
Ilja Erbis,
Susanne Boll
|
|
Pages: 1555-1556 |
|
doi>10.1145/1873951.1874283 |
|
Full text: PDF
|
|
For many people, web blogs are the preferred means of documenting important moments of their lives, e.g., a holiday trip or a year abroad. Such blogs contain photos and textual descriptions of events in a well-structured form. However, while being a perfect means to share such important moments with friends and family, blogs remain digital and thus do not provide the valuable experience of a physical souvenir of the documented event, such as a photo book. In this paper we therefore propose a solution that combines the advantages of both web blogs and printed photo books. We bridge the gap between the digital and physical worlds by providing a system that automatically transforms a blog into a photo book. For this we employ our system for automatic page layout following aesthetic principles. The structure of the blog is reflected in the overall layout of the photo book. We also enrich the resulting photo book with additional content from the web by exploiting links and location information present in the blog entries. The result is a nicely laid out physical counterpart to the original blog. Our approach is implemented as a web-based rich media application.
|
|
|
TagCaptcha: annotating images with CAPTCHAs |
| |
Donn Morrison,
Stéphane Marchand-Maillet,
Éric Bruno
|
|
Pages: 1557-1558 |
|
doi>10.1145/1873951.1874284 |
|
Full text: PDF
|
|
We demonstrate our TagCaptcha image annotation system. TagCaptcha presents the user with a number of images that must be correctly labelled in order to pass a human verification test on the web. The images are divided into two subsets: a control or verification set for which annotations are known, and an unknown set for which no verified annotations exist. The verification set is used to check the tags provided for the unknown set: if the user provides correct verification tags, the tags for the unknown set are promoted. An image with a promoted tag must be validated by other users before it can be classed as annotated and added to the verification set. Given a partially annotated database, the images can thus be incrementally annotated over time. The TagCaptcha system is intended to replace traditional text-based CAPTCHA systems currently used for human verification on the web in the fight against spam.
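The pass/promote logic reduces to checking the control images against their known labels; a minimal sketch with hypothetical image ids and tags:

    # Verify the control tags; if they pass, promote tags for unknown images.
    known = {"img_control": {"dog", "puppy"}}             # verification set
    user_tags = {"img_control": "dog", "img_unknown": "beach"}

    def verify(user_tags, known):
        ok = all(user_tags.get(i, "").lower() in known[i] for i in known)
        promoted = {i: t for i, t in user_tags.items() if i not in known} if ok else {}
        return ok, promoted

    passed, promoted = verify(user_tags, known)
    print(passed, promoted)   # promoted tags still await validation by other users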
|
|
|
DEMONSTRATION SESSION: Demo - D2 track |
| |
James Lynch
|
|
|
|
|
A technical demonstration of large-scale image object retrieval by efficient query evaluation and effective auxiliary visual feature discovery |
| |
Yin-Hsi Kuo,
Yi-Lun Wu,
Kuan-Ting Chen,
Yi-Hsuan Yang,
Tzu-Hsuan Chiu,
Winston H. Hsu
|
|
Pages: 1559-1562 |
|
doi>10.1145/1873951.1874286 |
|
Full text: PDF
|
|
In this demonstration, we present a real-time system that addresses three essential issues of large-scale image object retrieval: 1) image object retrieval: facilitating pseudo-objects in inverted indexing and novel object-level pseudo-relevance feedback for retrieval accuracy; 2) time efficiency: boosting the time efficiency and memory usage of object-level image retrieval by a novel inverted indexing structure and efficient query evaluation; 3) recall-rate improvement: mining semantically relevant auxiliary visual features through visual and textual clusters in an unsupervised and scalable (i.e., MapReduce) manner. We are able to search over a one-million-image collection in response to a user query in 121 ms, with significantly better accuracy (+99%) than the traditional bag-of-words model.
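As a much-simplified sketch of inverted indexing with tf-idf scoring (the system's pseudo-object indexing and feedback loop are not reproduced; word ids are placeholders):

    # Inverted index over bag-of-visual-words images with smoothed tf-idf.
    import math
    from collections import Counter, defaultdict

    docs = {0: [3, 3, 17, 42], 1: [3, 99], 2: [17, 42, 42, 7]}  # image -> words

    index = defaultdict(list)                 # word -> [(image_id, term_freq)]
    for doc_id, words in docs.items():
        for w, tf in Counter(words).items():
            index[w].append((doc_id, tf))

    def search(query_words, n_docs=len(docs)):
        scores = Counter()
        for w in set(query_words):
            postings = index.get(w, [])
            if postings:
                idf = math.log(1 + n_docs / len(postings))   # smoothed idf
                for doc_id, tf in postings:
                    scores[doc_id] += tf * idf
        return scores.most_common()

    print(search([3, 42]))                    # ranked (image_id, score) pairs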
|
|
|
SIVA suite: authoring system and player for interactive non-linear videos |
| |
Britta Meixner,
Beate Siegel,
Günther Hölbling,
Franz Lehner,
Harald Kosch
|
|
Pages: 1563-1566 |
|
doi>10.1145/1873951.1874287 |
|
Full text: PDF
|
|
In this paper, an intuitive authoring system and player for interactive non-linear video, called SIVA Suite, is presented for demonstration. Such videos are enriched with additional content; possible forms are plain text, rich text, images, and videos. Interactivity is implemented via selection buttons which allow the user to follow different plotlines. Additional forms of interactivity are realized as clickable objects in the video and a table of contents for the video. The software provides a tool for manually cutting videos and an automated shot detection. The non-linear flow of the video can be designed using a scene graph with fork nodes. Editors for text and images support the user in adding information to the video. A finished video project is exported to an XML file with a specific schema and Flash video (flv) files. The player processes the XML file, plays the interactive video, and shows the additional content. It can be customized to the requirements of the presentation of the video and the corporate design of the homepage the video is embedded in.
|
|
|
Integrated mobile visualization and interaction of events and POIs |
| |
Daniel Schmeiß,
Ansgar Scherp,
Steffen Staab
|
|
Pages: 1567-1570 |
|
doi>10.1145/1873951.1874288 |
|
Full text: PDF
|
|
We propose a new approach for mobile visualization of and interaction with temporal information by integrating support for time with today's most prevalent visualization of spatial information: the map. Our approach allows for an easy and precise selection of the time that is of interest and provides immediate feedback to users when they interact with it. It has been developed in an evolutionary process, gaining formative feedback from end users.
|
|
|
3D ancient mosaics |
| |
Sebastiano Battiato,
Giovanni Puglisi
|
|
Pages: 1571-1574 |
|
doi>10.1145/1873951.1874289 |
|
Full text: PDF
|
|
Digital 3D mosaic generation is a current trend in the NPR (Non-Photorealistic Rendering) field. In this demo we present an interactive system, realized in Java, in which the user can simulate ancient mosaics in a 3D environment starting from any input image. Different simulation engines able to render the so-called "Opus Musivum" and "Opus Vermiculatum" are employed. Different parameters can be dynamically adjusted to obtain very impressive results.
|
|
|
Training data collection system for a learning-based photographic aesthetic quality inference engine |
| |
Razvan Orendovici,
James Z. Wang
|
|
Pages: 1575-1578 |
|
doi>10.1145/1873951.1874290 |
|
Full text: PDF
|
|
We present a novel data collection system deployed for ACQUINE, the Aesthetic Quality Inference Engine. The goal of the system is to collect online user opinions, both structured and unstructured, for training future-generation learning-based aesthetic quality inference engines. The development of the system was based on an analysis of over 60,000 user comments on photographs. For photos processed and rated by our engine, all users are invited to provide manual ratings. Users can also choose up to three key photographic features that they liked from a list, or add features not in the list. Within the few months that the system has been available for public use, more than 20,000 photos have received manual ratings, and key features for over 1,800 photos have been identified. We expect the data generated over time will be critical in the study of computational inference of visual aesthetics in photographs. The system is demonstrated at http://acquine.alipr.com.
|
|
|
Photo2Trip: an interactive trip planning system based on geo-tagged photos |
| |
Huagang Yin,
Xin Lu,
Changhu Wang,
Nenghai Yu,
Lei Zhang
|
|
Pages: 1579-1582 |
|
doi>10.1145/1873951.1874291 |
|
Full text: PDF
|
|
In this technical demonstration, we present a novel interactive trip planning system, Photo2Trip, which leverages existing travel clues recovered from 20 million geo-tagged photos. Compared with the most common ways of trip planning, such as surveying travelogues and resorting to travel forums, Photo2Trip enables users to plan their trips in a more effective way. To meet users' diverse travel requirements, the system considers the following preferences: travel location (e.g., Beijing, Paris, or New York), travel duration (e.g., a two-day trip or a five-day trip), visiting time (e.g., summer, winter, March, or October), and travel style (e.g., a preference for historic or scenic sites). According to user requirements, Photo2Trip can automatically recommend popular travel routes among multiple destinations (attractions/landmarks) and suggest typical internal paths within each destination. Moreover, users are allowed to interactively adjust the suggested plans by adding or removing destinations to get more customized travel routes from the system. Owing to its 20 million geo-tagged photos and 200,000 travelogues, Photo2Trip is capable of supporting users in planning travel routes for over 30,000 attractions/landmarks in more than 100 countries and territories.
|
|
|
Coming together: negotiated content by multi-agents |
| |
Arne Eigenfeldt
|
|
Pages: 1583-1586 |
|
doi>10.1145/1873951.1874292 |
|
Full text: PDF
|
|
In this paper, we describe a software system that generates unique musical compositions in real time, created by four autonomous multi-agents. Given no explicit musical data, agents explore their environment, building beliefs through interactions with other agents via messaging and listening (to audio and/or MIDI data), generating goals, and executing plans. The artistic focus of Coming Together is the actual process of convergence, heard during performance (each of which usually lasts about ten minutes): the movement from random individualism to converged ensemble interaction. If convergence is successful, four additional agents are instantiated that exploit the emergent harmony and rhythm through brief but beautiful melodic gestures. Once these agents have completed their work, or if the original "explorer" agents fail to converge, the system resets itself and the process begins again.
|
|
|
Mobile product recognition |
| |
Sam S. Tsai,
David Chen,
Vijay Chandrasekhar,
Gabriel Takacs,
Ngai-Man Cheung,
Ramakrishna Vedantham,
Radek Grzeszczuk,
Bernd Girod
|
|
Pages: 1587-1590 |
|
doi>10.1145/1873951.1874293 |
|
Full text: PDF
|
|
We present a mobile product recognition system for the camera-phone. By snapping a picture of a product with a camera-phone, the user can retrieve online information about the product. The product is recognized by an image-based retrieval system located on a remote server. Our database currently comprises more than one million entries, primarily products packaged in rigid boxes with printed labels, such as CDs, DVDs, and books. We extract low bit-rate descriptors from the query image and compress the locations of the descriptors using location histogram coding on the camera-phone. We transmit the compressed query features, instead of a query image, to reduce the transmission delay. We use inverted index compression and fast geometric re-ranking on our database to provide a low-delay image recognition response for large-scale databases. Experimental timing results for different parts of the mobile product recognition system are reported in this work.
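A rough sketch of the location-histogram idea (grid size and keypoints are placeholders, and the entropy-coding stage of the real system is omitted): keypoint coordinates are quantized into a coarse grid so that only per-cell counts need to be transmitted.

    # Quantize keypoint locations into a coarse grid of per-cell counts.
    import numpy as np

    def location_histogram(points, img_w, img_h, grid=8):
        """Count keypoints falling into each cell of a grid x grid partition."""
        cols = np.minimum(points[:, 0] * grid // img_w, grid - 1).astype(int)
        rows = np.minimum(points[:, 1] * grid // img_h, grid - 1).astype(int)
        hist = np.zeros((grid, grid), dtype=int)
        np.add.at(hist, (rows, cols), 1)
        return hist

    kp = np.array([[10, 20], [300, 150], [310, 160], [620, 470]])  # placeholder
    print(location_histogram(kp, img_w=640, img_h=480))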
|
|
|
DEMONSTRATION SESSION: Demo - D3 track |
| |
Kiyoharu Aizawa
|
|
|
|
|
Joke-o-Mat HD: browsing sitcoms with human derived transcripts |
| |
Adam Janin,
Luke Gottlieb,
Gerald Friedland
|
|
Pages: 1591-1594 |
|
doi>10.1145/1873951.1874295 |
|
Full text: PDF
|
|
Joke-o-mat HD is a system that allows a user to navigate sitcoms (such as Seinfeld) by "narrative themes", including scenes, punchlines, and dialog segments. The themes can be filtered by the main actors and by keyword. For example, the user can select to see only punchlines by Kramer that contain the word "armoire". The system infers the narrative themes using segmentation of the audio track into laughter, actors, words, and music. The segmentation can be generated either by an expert annotator, via automatic methods, or by exploiting human derived (HD) "found" data such as fan-generated scripts and closed captions. We demonstrate browsing one episode of Seinfeld using all three methods of generating segmentations.
|
|
|
Multi-exposure imaging on mobile devices (demo) |
| |
Natasha Gelfand,
Andrew Adams,
Sung Hee Park,
Kari Pulli
|
|
Pages: 1595-1598 |
|
doi>10.1145/1873951.1874296 |
|
Full text: PDF
|
|
Many natural scenes have a dynamic range that is larger than the dynamic range of a camera's image sensor. A popular approach to producing an image without under- and over-exposed areas is to capture several input images with varying exposure settings and later merge them into a single high-quality result using offline image processing software. We present a system for creating images of high-dynamic-range (HDR) scenes that operates entirely on a mobile camera. Our system consists of an automatic HDR metering algorithm that determines which exposures to capture, a video-rate viewfinder preview algorithm that allows the user to verify the dynamic range that will be recorded, and a light-weight image merging algorithm that computes a high-quality result directly on the camera. By using our system, a photographer can capture, view, and share images of HDR scenes directly on the camera, without using offline image processing software.
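A toy version of the merging step (the on-camera algorithm is not shown): per-pixel weights favor well-exposed mid-range values when combining the exposure stack, here with synthetic images in [0, 1].

    # Weighted merge of an exposure stack; hat-shaped weights peak at 0.5.
    import numpy as np

    def merge_exposures(images):
        """images: list of same-shaped float arrays in [0, 1]."""
        stack = np.stack(images)
        weights = 1.0 - np.abs(stack - 0.5) * 2.0 + 1e-6   # avoid zero division
        return (weights * stack).sum(axis=0) / weights.sum(axis=0)

    dark = np.random.rand(64, 64) * 0.3          # underexposed placeholder
    bright = np.clip(dark * 3.0, 0.0, 1.0)       # overexposed placeholder
    print(merge_exposures([dark, bright]).mean())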
|
|
|
iComics: automatic conversion of movie into comics |
| |
Richang Hong,
Meng Wang,
Guangda Li,
Xiao-Tong Yuan,
Shuicheng Yan,
Tat-Seng Chua
|
|
Pages: 1599-1602 |
|
doi>10.1145/1873951.1874297 |
|
Full text: PDF
|
|
This demonstration presents a system, named iComics, for the automatic conversion of movies into comics. We design three components to realize the system: script-face mapping, key-scene extraction, and cartoonization. Script-face mapping utilizes face recognition and tracking techniques to accomplish the mapping between characters' faces and their scripts. Key-scene extraction combines the frames derived from subshots with the index frames extracted based on subtitles to select a sequence of frames for cartoonization. Finally, cartoonization is accomplished via four steps: panel scaling, stylization, word balloon placement, and comics layout.
|
|
|
vESP: enriching enterprise document search results with aligned video summarization |
| |
Pål Halvorsen,
Dag Johansen,
Bjørn Olstad,
Tomas Kupka,
Sverre Tennøe
|
|
Pages: 1603-1604 |
|
doi>10.1145/1873951.1874298 |
|
Full text: PDF
|
|
In this demo, we present a video-enabled enterprise search platform (vESP), an application prototype that enhances a widely deployed commercial enterprise search engine with video streaming. The idea is that in a large enterprise, such as Microsoft, a lot of information exists in the form of presentations with corresponding video. Using our enhancements, a user can select and combine slides from different presentations, generating a new slide deck dynamically; the corresponding video clips are concatenated and presented vis-a-vis the slides on the fly. The prototype is evaluated using a data set from Microsoft, and our initial user surveys indicate that the opportunity to enrich search results with corresponding video is embraced by potential users.
|
|
|
MindFinder: interactive sketch-based image search on millions of images |
| |
Yang Cao,
Hai Wang,
Changhu Wang,
Zhiwei Li,
Liqing Zhang,
Lei Zhang
|
|
Pages: 1605-1608 |
|
doi>10.1145/1873951.1874299 |
|
Full text: PDF
|
|
In this paper, we showcase the MindFinder system, an interactive sketch-based image search engine. Different from existing work, most of which is limited to small-scale databases or enables only single-modality input, MindFinder is a sketch-based multimodal search engine for a million-level database. It enables users to sketch the major curves of the target image in their mind, and also supports tagging and coloring operations to better express their search intentions. Owing to a friendly interface, our system supports multiple actions, which help users flexibly design their queries. After each operation, the top returned images are updated in real time, based on which users can interactively refine their initial thoughts until the ideal images are returned. The novelty of the MindFinder system includes the following two aspects: 1) a multimodal searching scheme is proposed to retrieve images which meet users' requirements not only in structure, but also in semantic meaning and color tone; 2) an indexing framework is designed to make MindFinder scalable in terms of database size, memory cost, and response time. By scaling up the database to more than two million images, MindFinder not only helps users easily present whatever they are imagining, but also has the potential to retrieve the most desired images in their mind.
|
|
|
Facilitating interactive search and navigation in videos |
| |
Klaus Schoeffmann
|
|
Pages: 1609-1612 |
|
doi>10.1145/1873951.1874300 |
|
Full text: PDF
|
|
We present a tool that can efficiently facilitate interactive navigation and search in videos. In addition to browsing a video by shots, it allows a user to navigate through a video with extended seeker bars showing time-related content abstractions. Users having rough knowledge of the content characteristics of the scenes to be found can efficiently use these extended seeker bars to quickly find them by interactive navigation. Moreover, users can easily perform similarity queries by utilizing content knowledge gained during the browsing/navigation process. These queries can also be stored and reused for search in other videos having similar or the same content. Furthermore, the tool can execute many queries at once and visualize the results as semantic events in a separate seeker bar. Our tool provides a real alternative for situations where a user currently needs to employ a common video player for the task of search and navigation.
|
|
|
BIOFACE: a biometric face demonstrator |
| |
Mourad Ouaret,
Antitza Dantcheva,
Rui Min,
Lionel Daniel,
Jean Luc Dugelay
|
|
Pages: 1613-1616 |
|
doi>10.1145/1873951.1874301 |
|
Full text: PDF
|
|
In this paper, a demonstrator called BIOFACE incorporating several facial biometric techniques is described. It includes the well-established Eigenfaces and the recently published Tomofaces techniques, which perform face recognition based on facial appearance and dynamics, respectively. Both techniques are based on space dimensionality reduction, and enrollment requires the projection of several positive face samples into the reduced space. Alternatively, BIOFACE also performs face recognition based on the matching of Scale Invariant Feature Transform (SIFT) features. Moreover, BIOFACE extracts a facial soft biometric profile, which consists of a bag of facial soft biometric traits such as skin, hair, and eye color, and the presence of glasses, beard, and moustache. The fast and efficient detection of the facial soft biometrics is performed as a pre-processing step and employed for pruning the search for the facial recognition module. Finally, the demonstrator also detects facial events such as blinking, yawning, and looking away. The car driver scenario is a good example of the importance of such traits for detecting fatigue. The BIOFACE demonstrator is an attempt to show the potential and the performance of such facial processing techniques in a real-life scenario. The demonstrator is built using the C/C++ programming language, which is suitable for implementing image and video processing techniques due to its fast execution. On top of that, the Open Source Computer Vision Library (OpenCV), which is optimized for Intel processors, is used to implement the image processing algorithms.
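As a sketch of the Eigenfaces step mentioned above, written in Python rather than the demonstrator's C/C++ (the training data is a random placeholder; a real system uses aligned face crops): face images are projected into a PCA-reduced space and matched by nearest neighbor.

    # Eigenfaces-style recognition: PCA projection + nearest neighbor.
    import numpy as np
    from sklearn.decomposition import PCA

    train = np.random.rand(50, 32 * 32)        # 50 flattened 32x32 face images
    labels = np.repeat(np.arange(10), 5)       # 10 identities, 5 samples each

    pca = PCA(n_components=20).fit(train)
    train_proj = pca.transform(train)

    probe_proj = pca.transform(np.random.rand(1, 32 * 32))
    nearest = np.argmin(np.linalg.norm(train_proj - probe_proj, axis=1))
    print("predicted identity:", labels[nearest])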
|
|
|
ClustTour: city exploration by use of hybrid photo clustering |
| |
Symeon Papadopoulos,
Christos Zigkolis,
Stefanos Kapiris,
Yiannis Kompatsiaris,
Athena Vakali
|
|
Pages: 1617-1620 |
|
doi>10.1145/1873951.1874302 |
|
Full text: PDF
|
|
We present a technical demonstration of an online city exploration application that helps users identify interesting spots in a city by use of photo clusters corresponding to landmarks and events. Our application, called ClustTour, is based on an efficient landmark and event detection scheme for tagged photo collections. The proposed scheme relies on the combination of a graph-based photo clustering algorithm, making use of both visual and tag information of photos, with a cluster classification and merging module. ClustTour creates a map-based visualization of the identified photo clusters, which are classified into prominent categories and are filterable by time and tag. We believe that such an application can greatly facilitate the task of getting to know a city through its landmarks and events. So far, the demo has been based on a large photo dataset focused on Barcelona, and it is gradually expanding to contain photo clusters of several major European cities. Furthermore, an Android application has been developed that complements the web-based version of ClustTour.
|
|
|
DEMONSTRATION SESSION: Demo - D4 track |
| |
Winston Hsu
|
|
|
|
|
Rerum novarum: interactive exploration of illuminated manuscripts |
| |
Daniele Borghesani,
Costantino Grana,
Rita Cucchiara
|
|
Pages: 1621-1624 |
|
doi>10.1145/1873951.1874304 |
|
Full text: PDF
|
|
This paper describes an interactive application for the exploration and annotation of illuminated manuscripts, which typically contain thousands of pictures used to comment on or embellish the manuscript's Gothic text. The system is composed of a modern user interface for browsing, surfing, and querying; an automatic segmentation module to ease the initial picture extraction task; and a similarity-based retrieval engine used to provide visually assisted tagging capabilities. A relevance feedback procedure is included to further refine the results.
|
|
|
Sirio, orione and pan: an integrated web system for ontology-based video search and annotation |
| |
Marco Bertini,
Gianpaolo D'Amico,
Andrea Ferracani,
Marco Meoni,
Giuseppe Serra
|
|
Pages: 1625-1628 |
|
doi>10.1145/1873951.1874305 |
|
Full text: PDF
|
|
In this technical demonstration we show an integrated web system for video search and annotation based on ontologies. The system is composed of three components: the Orione ontology-based search engine, the Sirio search interface (Sirio was the hound of Orione, a dog so swift that no prey could escape it), and the Pan web-based video annotation tool. The system is currently being developed within the EU IM3I project. The goal of the system is to provide an integrated environment for video annotation and retrieval for both technical and non-technical users. In fact, the search engine has different interfaces that permit different query modalities: free text, natural language, graphical composition of concepts using Boolean and temporal relations, and query by visual example. In addition, the ontology structure is exploited to encode semantic relations between concepts, permitting, for example, queries to be expanded to synonyms and concept specializations. The annotation tool can be used to create ground-truth annotations to train automatic annotation systems, or to complement the results of automatic annotation, e.g., by adding geolocalized information.
|
|
|
Web-based semantic browsing of video collections using multimedia ontologies |
| |
Marco Bertini,
Gianpaolo D'Amico,
Andrea Ferracani,
Marco Meoni,
Giuseppe Serra
|
|
Pages: 1629-1632 |
|
doi>10.1145/1873951.1874306 |
|
Full text: PDF
|
|
In this technical demonstration we present a novel web-based tool that allows user-friendly semantic browsing of video collections, based on ontologies, concepts, concept relations, and concept clouds. The system is developed as a Rich Internet Application (RIA) to achieve a responsiveness and ease of use that cannot be obtained with other web application paradigms, and uses streaming to access and inspect the videos. Users can also use the tool to browse the content of social and media sharing sites like YouTube, Flickr, and Twitter, accessing these external resources through the ontologies used in the system. The tool won second prize in the Adobe YouGC contest, in the RIA category.
|
|
|
MediaTable: a tool for categorizing multimedia collections |
| |
Ork de Rooij,
Marcel Worring
|
|
Pages: 1633-1636 |
|
doi>10.1145/1873951.1874307 |
|
Full text: PDF
|
|
In this technical demonstration, we present MediaTable, our interactive multimedia collection search and categorization tool. MediaTable allows users to search through and categorize a multimedia collection with ease by employing several familiar interface components specifically adapted for multimedia collections. In our demonstration we expand on the visual interface and show how several types of search tasks can be completed with MediaTable.
|
|
|
Interactive person-retrieval in TV series and distributed surveillance video |
| |
Martin Bäuml,
Mika Fischer,
Keni Bernardin,
Hazim K. Ekenel,
Rainer Stiefelhagen
|
|
Pages: 1637-1638 |
|
doi>10.1145/1873951.1874308 |
|
Full text: PDF
|
|
Tracking and identifying persons in videos are important building blocks in many applications. For browsing of multimedia data or interactive investigation of surveillance footage it is not even necessary to uniquely identify a person. Rather, it often suffices to find occurrences of a person indicated by the user with an exemplary image sequence. We present two systems in which the search for a specific person can be initiated by a sample image sequence and then further refined through interactive feedback from the operator. In the first system, episodes of TV series have been processed offline and can be searched for occurrences of the different characters. The second system tracks people online in multiple cameras and makes the sequences immediately searchable from a central station.
|
|
|
Trajectory-based visualization of web video topics |
| |
Juan Cao,
Chong-Wah Ngo,
YongDong Zhang,
DongMing Zhang,
Liang Ma
|
|
Pages: 1639-1642 |
|
doi>10.1145/1873951.1874309 |
|
Full text: PDF
|
|
While there have been research efforts in organizing large-scale web videos into clusters or topics, efficient browsing of web video topics remains a challenging problem not yet addressed. The related issues include how to efficiently browse and track the evolution of topics and eventually locate the videos of interest. In this demo paper, we introduce a novel interface for visualizing video topics as evolution trajectories. The trajectory visualization is capable of highlighting milestone events and depicting topical hotness over time. The interface also allows multi-level browsing from topics to events and then to videos, so that search exploration can be conducted more efficiently to locate videos of interest. In addition, recommendation of topics according to three hotness measures (content-hot, evolution-hot and potential-hot) can be easily supported by our system. A user study on three months of YouTube videos using our interface demonstrates the efficiency of our system in browsing web videos.
|
|
|
Adding haptic feature to YouTube |
| |
Md. Abdur Rahman,
Abdulmajeed Alkhaldi,
Jongeun Cha,
Abdulmotaleb El Saddik
|
|
Pages: 1643-1646 |
|
doi>10.1145/1873951.1874310 |
|
Full text: PDF
|
|
In this paper, we present a web-based framework in which users can annotate tactile feeling onto a YouTube video and experience that tactile feeling by wearing a tactile device while watching/annotating the video. The tactile device is embedded into a wearable garment, a haptic jacket and a haptic arm band in this paper, and has a rectangular layout like a video screen. Therefore, the tactile information is represented as a sequence of rectangular arrays with time stamps and stored in XML format. Each element of the array represents a tactile intensity, a magnitude of actuation. In the framework we provide a web-based authoring tool to add tactile feeling while navigating a video and setting tactile intensity on a timeline. We also introduce a web browser in which a tactile device driver is embedded to activate the tactile device based on the annotated tactile information.
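As a sketch of the data layout described above, the snippet below serializes one time-stamped frame of a small tactile array to XML; the element and attribute names are hypothetical stand-ins, not the paper's actual schema:

import xml.etree.ElementTree as ET

def tactile_frame_to_xml(timestamp_ms, intensities):
    # intensities: 2-D list of actuation magnitudes (one per actuator),
    # laid out as a rectangular array mirroring the video screen.
    frame = ET.Element("TactileFrame", timestamp=str(timestamp_ms))
    for row in intensities:
        ET.SubElement(frame, "Row").text = " ".join(str(v) for v in row)
    return ET.tostring(frame, encoding="unicode")

# e.g., a 2x4 actuator array annotated at t = 1500 ms
print(tactile_frame_to_xml(1500, [[0, 3, 3, 0], [1, 5, 5, 1]]))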
|
|
|
Assisted news reading with automated illustration |
| |
Diogo Delgado,
Joao Magalhaes,
Nuno Correia
|
|
Pages: 1647-1650 |
|
doi>10.1145/1873951.1874311 |
|
Full text: PDF
|
|
We have all had the problem of forgetting what we read just a few sentences before. This stems from a problem of attention and is more common among children and the elderly: people feel either bored or distracted by something more interesting. This paper proposes an application to help people read news by illustrating the news story. The application provides mechanisms to (1) select the best illustration for each scene and (2) select the set of illustrations that best improves the story sequence. The application proposed in this technical demo aims at improving the user's attention when reading news articles. The application implements several information processing techniques to generate an audio-visual presentation of the text news article.
|
|
|
MediaPick: tangible semantic media retrieval system |
| |
Gianpaolo D'Amico,
Andrea Ferracani,
Lea Landucci,
Matteo Mancini,
Daniele Pezzatini,
Nicola Torpei
|
|
Pages: 1651-1654 |
|
doi>10.1145/1873951.1874312 |
|
Full text: PDF
|
|
This paper addresses the design and development of MediaPick [1], an interactive multi-touch system for semantic search of multimedia contents. Our solution provides an intuitive, easy-to-use way to select concepts organized according to an ontological structure and retrieve the related contents. Users are then able to examine and organize such results through semantic or subjective criteria. As a use case, we considered professionals who work with huge multimedia archives (journalists, archivists, editors, etc.).
|
|
|
Effects of environmental colour on mood: a wearable LifeColour capture device |
| |
Aiden R. Doherty,
Philip Kelly,
Brendan O'Flynn,
Padraig Curran,
Alan F. Smeaton,
Cian O'Mathuna,
Noel E. O'Connor
|
|
Pages: 1655-1658 |
|
doi>10.1145/1873951.1874313 |
|
Full text: PDF
|
|
Colour is everywhere in our daily lives and impacts things like our mood, yet we rarely take notice of it. One method of capturing and analysing the predominant colours that we encounter is through visual lifelogging devices such as the SenseCam. However, an issue with these devices is the privacy concern of capturing image-level detail. Therefore, in this work we demonstrate a hardware prototype wearable camera that captures only one pixel, the dominant colour prevalent in front of the user, thus circumventing the privacy concerns raised in relation to lifelogging. To assess whether capturing the dominant colour would be sufficient, we report on a simulation carried out on 1.2 million SenseCam images captured by a group of 20 individuals. We compare the dominant colours that different groups of people are exposed to and show that useful inferences can be made from this data. We believe our prototype may be valuable in future experiments to capture colour data correlated with an individual's mood.
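A minimal sketch of how a frame could be reduced to a single dominant-colour "pixel", using a coarse RGB histogram; the binning and the choice of statistic are assumptions for illustration, not the prototype's hardware logic:

import numpy as np

def dominant_colour(frame, bins=8):
    # frame: H x W x 3 uint8 RGB image. Quantize each channel into
    # `bins` levels and return the centre of the most populated cell.
    step = 256 // bins
    q = (frame // step).reshape(-1, 3)                   # coarse colour cells
    cells, counts = np.unique(q, axis=0, return_counts=True)
    top = cells[np.argmax(counts)]                       # most frequent cell
    return tuple(int(v) * step + step // 2 for v in top)  # cell centre, RGB

frame = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
print(dominant_colour(frame))   # e.g., (144, 48, 208)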
|
|
|
DEMONSTRATION SESSION: Demo - D5 track |
| |
Paul Natsev
|
|
|
|
|
Mobile video browsing and retrieval with the OVIDIUS platform |
| |
Andrei Bursuc,
Titus Zaharia,
Françoise Prêteux
|
|
Pages: 1659-1662 |
|
doi>10.1145/1873951.1874315 |
|
Full text: PDF
|
|
This paper describes a mobile video browsing and retrieval approach based on the so-called OVIDIUS (On-line VIDeo Indexing Universal System) platform. In contrast with traditional and commercial video retrieval platforms, where video content is treated in a more or less monolithic manner (i.e., with global descriptions associated with the whole document), the proposed approach makes it possible to browse and access video content on a finer, per-segment basis. The hierarchical metadata structure exploits the MPEG-7 approach for structural description of video content. The MPEG-7 description schemes have here been enriched with both semantic and content-based metadata. The developed approach demonstrates its pertinence within a multi-terminal context, and in particular for video access from mobile devices. The platform was recently (February 2010) validated within the framework of the Médi@TIC French national project.
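As an illustration of per-segment description, the sketch below emits a heavily simplified, MPEG-7-style temporal decomposition; the structure echoes the standard's VideoSegment/MediaTime elements, but the exact OVIDIUS description schemes are not reproduced here:

import xml.etree.ElementTree as ET

def describe_segments(segments):
    # segments: list of (start, duration, free_text) tuples, e.g.
    # [("T00:00:00", "PT30S", "opening titles"), ...]
    video = ET.Element("Video")
    decomp = ET.SubElement(video, "TemporalDecomposition")
    for start, duration, text in segments:
        seg = ET.SubElement(decomp, "VideoSegment")
        time = ET.SubElement(seg, "MediaTime")
        ET.SubElement(time, "MediaTimePoint").text = start
        ET.SubElement(time, "MediaDuration").text = duration
        ET.SubElement(seg, "TextAnnotation").text = text
    return ET.tostring(video, encoding="unicode")

print(describe_segments([("T00:00:00", "PT30S", "opening titles")]))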
|
|
|
Serious games for health: personalized exergames |
| |
Stefan Göbel,
Sandro Hardy,
Viktor Wendel,
Florian Mehm,
Ralf Steinmetz
|
|
Pages: 1663-1666 |
|
doi>10.1145/1873951.1874316 |
|
Full text: PDF
|
|
In this paper, we describe a set of personalized exergames which combine methods and concepts of serious games, adaptation and personalization, authoring, and sensor technologies. Compared to existing systems, the set of games not only keeps track of the user's vital state, but also directly integrates vital parameters into the gameplay and supports training and motivation for sustainable physical activity in a playful manner.
|
|
|
A GPU-accelerated face annotation system for smartphones |
| |
Yi-Chu Wang,
Sydney Pang,
Kwang-Ting Cheng
|
|
Pages: 1667-1668 |
|
doi>10.1145/1873951.1874317 |
|
Full text: PDF
|
|
Face annotation makes it easy to share and manage digital photos and videos. While state-of-the-art face recognition algorithms can achieve high accuracy to support automatic face annotation, their implementations on an embedded platform cannot achieve real-time performance due to the demanding computational requirements. However, the availability of an embedded GPU in most smartphones offers the opportunity to use it as an accelerator for the face recognition task. In this demonstration, we show that, with acceleration achieved by the embedded low-power GPU, a real-time face annotation system can be realized on an existing off-the-shelf smartphone.
|
|
|
Crew: cross-modal resource searching by exploiting wikipedia |
| |
Chen Liu,
Beng Chin Ooi,
Anthony K.H. Tung,
Dongxiang Zhang
|
|
Pages: 1669-1672 |
|
doi>10.1145/1873951.1874318 |
|
Full text: PDF
|
|
In Web 2.0, users have generated and shared massive amounts of resources in various media formats, such as news, blogs, audio, photos and videos. The abundance and diversity of the resources call for better integration to improve their accessibility. A straightforward approach is to link the resources via tags so that resources from different modalities sharing the same tag can be connected in a graph structure. This naturally motivates a new kind of information retrieval system, named cross-modal resource search, in which, given a query object from any modality, all the related resources from other modalities can be retrieved in a convenient manner. However, due to tag homonymy and synonymy, such an approach returns results of low quality, because resources with the same tag that are not semantically related will be directly connected as well. In this paper, we propose to build the resource graph and perform query processing by exploiting Wikipedia. We construct a concept middleware between the layer of tags and the layer of resources to fully capture the semantic meaning of the resources. Such a cross-modal search system based on Wikipedia, named Crew, is built and demonstrates promising search results.
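A minimal sketch of the concept-middleware idea, assuming each resource has already been mapped from raw tags to disambiguated concepts; the resources and concept names are invented for illustration and do not reflect Crew's actual Wikipedia pipeline:

# Illustrative only: resources connect through a concept layer instead of
# raw tags, so homonymous tags no longer join unrelated items.
RESOURCE_CONCEPTS = {
    "photo_42": {"Apple Inc."},      # a product shot
    "blog_17":  {"Apple Inc."},      # a tech post
    "photo_99": {"Apple (fruit)"},   # an orchard photo, same raw tag "apple"
}

def cross_modal_neighbours(query_resource):
    # Return resources sharing at least one concept with the query.
    concepts = RESOURCE_CONCEPTS[query_resource]
    return [r for r, cs in RESOURCE_CONCEPTS.items()
            if r != query_resource and cs & concepts]

print(cross_modal_neighbours("photo_42"))   # ['blog_17'], not 'photo_99'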
|
|
|
Construction of image retrieval systems focused on user knowledge interaction |
| |
Tomoko Kajiyama,
Shin'ichi Satoh
|
|
Pages: 1673-1676 |
|
doi>10.1145/1873951.1874319 |
|
Full text: PDF
|
|
Our objective was to apply different kinds of databases to our proposed graphical search interface and to verify its effectiveness with respect to the user's knowledge structure during search, since it allows users to easily turn retrieved information into usable knowledge. The interface, named Concentric Ring View, is designed for multi-faceted metadata. Its design concept is "result-oriented", meaning that users continue to search by evaluating the retrieved results. We constructed four image retrieval systems, with images of web-page designs, flowers, insects, and countries. We selected the databases along two dimensions: the gap between images and information features, and the amount of general knowledge needed to understand the values of the information features. To help users understand the relationship between retrieved results and attribute values, we added functions tailored to the features of each database, e.g., preparing images focused on search keys, or mapping results onto a meaningful area such as a world map. We confirmed that the interface bridges these gaps by materializing user knowledge from abstract images, and that users learned with our system and modified their knowledge structures even without general knowledge. The interface can be used not only as a retrieval system but also as an educational system.
|
|
|
Visualization of concurrent tones in music with colours |
| |
Peter Ciuha,
Bojan Klemenc,
Franc Solina
|
|
Pages: 1677-1680 |
|
doi>10.1145/1873951.1874320 |
|
Full text: PDF
|
|
Visualizing music in a meaningful and intuitive way is a challenge. Our aim is to visualize music by interconnecting similar aspects of music and visual perception. We focus on visualizing harmonic relationships between tones and colours. Related existing visualizations map tones or keys into a discrete set of colours. As concurrent (simultaneous) tones are not perceived as entirely separate, but also as a whole, we present a novel method for visualizing a group of concurrent tones (limited to the pitches of the 12-tone chromatic scale) with one colour for the whole group. The basis for the calculation of colour is the assignment of the key-spanning circle of thirds to the colour wheel. The resulting colour is not limited to a discrete set of colours: similar tones, chords and keys have a similar colour hue, while dissonance and consonance are represented by low and high colour saturation, respectively. The proposed method is demonstrated as part of our prototype music visualization system using an extended 3-dimensional piano roll notation.
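A minimal sketch of the mapping idea: pitch classes are placed around a colour wheel, hue is taken from the angle of their vector sum, and saturation from the sum's magnitude, so consonant clusters come out saturated while dissonant spreads come out washed out. The fifths-based placement below is a stand-in for the paper's key-spanning circle of thirds, and the scaling is an assumption:

import colorsys
import math

def tones_to_colour(pitch_classes):
    # Place each pitch class (0-11) on a circle; circle-of-fifths order
    # (pc * 7 mod 12) stands in for the paper's circle-of-thirds layout.
    # Assumes a non-empty list of pitch classes.
    angles = [2 * math.pi * ((pc * 7) % 12) / 12 for pc in pitch_classes]
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    hue = (math.atan2(y, x) / (2 * math.pi)) % 1.0   # angle -> hue
    sat = math.hypot(x, y) / len(pitch_classes)      # coherence -> saturation
    return colorsys.hsv_to_rgb(hue, sat, 1.0)

print(tones_to_colour([0, 4, 7]))   # C major triad -> one fairly saturated colour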
|
|
|
Changing characters' point of view in interactive storytelling |
| |
Fred Charles,
Julie Porteous,
Marc Cavazza
|
|
Pages: 1681-1684 |
|
doi>10.1145/1873951.1874321 |
|
Full text: PDF
|
|
Virtual characters are at the epicentre of Interactive Storytelling systems, and in recent years multiple AI planning approaches have been described to specify their autonomous behaviour. This demonstrator provides an overview of our novel approach to the definition of virtual characters, aimed at achieving a balance between character autonomy and global plot structure and based on the notion of a character's Point of View. Additionally, the demonstrator offers the active spectator the ability to discover the story from the perspective of a number of different characters. We present our fully-implemented Interactive Narrative based on Shakespeare's Merchant of Venice. The system, which features a novel AI planning approach to story generation, can generate very different stories depending on the Point of View adopted, and supports dynamic modification of the story world, which results in different story consequences.
|
|
|
Speeding up mobile multimedia applications |
| |
Jiang Gao
|
|
Pages: 1685-1688 |
|
doi>10.1145/1873951.1874322 |
|
Full text: PDF
|
|
Mobile devices are becoming ubiquitous multimedia computing platforms. However, due to the limited computational power of these devices, a good mobile application requires far more consideration in algorithm design and optimization than for desktop systems. This demo shows two multimedia applications on a mobile phone: mobile visual search and panorama. The innovative components of the system include hybrid tracking and visual matching, optimal region classification and feature selection for faster and more reliable image-related applications, and an optimized image processing pipeline for multimedia applications on a mobile phone.
|
|
|
A multimedia approach to visualize and interact with large scale mobile LiDAR data |
| |
James D. Lynch,
Xin Chen,
Roger B. Hui
|
|
Pages: 1689-1692 |
|
doi>10.1145/1873951.1874323 |
|
Full text: PDF
|
|
This paper presents a multimedia visualization tool for large-scale mobile LiDAR, panoramic imagery, high-resolution view-targeted camera imagery, and GPS/IMU geo-location. A first-of-its-kind system joins all sensor data, providing a powerful tool for evaluation, validation, analysis and annotation. The viewer computes the necessary real-time structures on the fly, which enables the user to view and annotate the entire dataset interactively. A viewpoint-based multimedia integration of the vehicle centerline map, video imagery, and LiDAR is displayed for navigation. The viewer automatically determines when to page data sets and which type of data to display. The user is able to measure and tag real-world objects in 3D. A typical graphics card is capable of interactively presenting half-terabyte data sets by intelligently controlling the timing and volume of data loading.
|
|
|
Automatic skin enhancement with visible and near-infrared image fusion |
| |
Sabine Süsstrunk,
Clément Fredembach,
Daniel Tamburrino
|
|
Pages: 1693-1696 |
|
doi>10.1145/1873951.1874324 |
|
Full text: PDF
|
|
Skin tones, in portraits in particular, are of critical importance in photography and video, but a number of factors, such as pigmentation irregularities (e.g., moles, freckles), irritation, roughness, or wrinkles, can reduce their appeal. Moreover, such "defects" are oftentimes enhanced by scene lighting conditions. Starting with the observations that melanin and hemoglobin, the key components of skin color, have little absorption in the near-infrared (NIR) part of the spectrum, and that the depth of light penetration in the epidermis is proportional to the incident light's wavelength, we show that near-infrared images provide information that can be used to automatically smooth skin tones in a physically realistic manner. Specifically, we developed a prototype camera system that captures a pair of visible/near-infrared images and separates both of them into base and detail layers (akin to a low/high frequency decomposition) with the fast bilateral filter. Smooth and realistic output images are obtained by fusing the base layer of the visible image with the near-infrared detail layer. The proposed method delivers consistently good results across various skin types. The prototype system is currently in use at the Swiss Camera Museum in Vevey, Switzerland, where visitors can take their pictures and e-mail themselves the results. In the process, we are collecting the users' preference for either the "original" (visible) image or the "enhanced" (visible and NIR fused) image. The system has been deployed for three months. Preliminary statistics indicate that a large majority (79%) prefer the enhanced image.
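A minimal sketch of the decomposition-and-fusion step, assuming a registered visible/NIR image pair and using OpenCV's standard bilateral filter as a stand-in for the fast bilateral filter; the parameter values are guesses, not the authors' settings:

import cv2
import numpy as np

def fuse_visible_nir(visible_bgr, nir_gray):
    # Work on luminance only and keep chrominance from the visible image
    # (a simplification of the pipeline described above).
    ycrcb = cv2.cvtColor(visible_bgr, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    y = ycrcb[:, :, 0].copy()
    nir = nir_gray.astype(np.float32)
    # Base/detail decomposition: base = large-scale structure,
    # detail = residual high frequencies (skin texture).
    y_base = cv2.bilateralFilter(y, 9, 25, 9)
    nir_base = cv2.bilateralFilter(nir, 9, 25, 9)
    nir_detail = nir - nir_base
    # Fuse the visible base layer with the NIR detail layer.
    ycrcb[:, :, 0] = np.clip(y_base + nir_detail, 0, 255)
    return cv2.cvtColor(ycrcb.astype(np.uint8), cv2.COLOR_YCrCb2BGR)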
|
|
|
SESSION: Doctoral symposium - DS1 track |
| |
Susanne Boll,
Carlo Colombo
|
|
|
|
|
Free-hand sketch based image and video retrieval |
| |
Rui Hu
|
|
Pages: 1697-1698 |
|
doi>10.1145/1873951.1874326 |
|
Full text: PDF
|
|
We present an overview of our work to date on sketch-based retrieval of images and video. We present a fast technique for extracting motion trajectories from videos and a Viterbi matching approach for retrieving video clips using free-hand sketched queries. For sketch-based image retrieval, we introduce a depiction-invariant image descriptor, Gradient-Field HOG (GF-HOG), that encapsulates local spatial structure in the sketch and facilitates efficient codebook-based image retrieval driven by free-hand sketched queries. We further enhance the system with semantic information by incorporating the user's search context.
|
|
|
Automatic and manual processes in end-user multimedia authoring tools: where is the balance? |
| |
Rodrigo Laiola Guimarães
|
|
Pages: 1699-1700 |
|
doi>10.1145/1873951.1874327 |
|
Full text: PDF
|
|
This thesis aims to analyze, model, and develop a framework for next-generation multimedia authoring tools targeted at end-users. In particular, I concentrate on the combination of automatic and manual processes for the realization of such a framework. My contributions are realized in the context of a pan-European project called Together Anywhere, Together Anytime (TA2), more specifically in a community-sharing environment in which users can combine video assets contributed by other community members to form personalized mini-stories that can be shared within their (possibly restricted) social groups. The expected outcome of my thesis work is to contribute to the design of authoring and sharing tools that better fit end-users' needs.
|
|
|
Analysis and classification of conversational interactions |
| |
Anna Pesarin
|
|
Pages: 1701-1702 |
|
doi>10.1145/1873951.1874328 |
|
Full text: PDF
|
|
|
|
|
Flashboost: design of flash memory buffer cache mechanism for video-on-demand |
| |
Moonkyung Ryu
|
|
Pages: 1703-1704 |
|
doi>10.1145/1873951.1874329 |
|
Full text: PDF
|
|
A magnetic disk is a serious bottleneck that limits the scalability of a video server due to its head-seek overhead. For a video server, Interval Caching is a state-of-the-art caching mechanism that addresses this problem by utilizing RAM as a buffer cache to serve more video streams. Flash memory SSDs (Solid State Drives) are a new class of storage device with very different traits from older storage devices like RAM or disks. The objective of this research is to investigate the applicability and potential impact of flash memory SSDs for a video server. Moreover, I will propose a novel buffer cache mechanism that fully exploits the characteristics of flash memory SSDs to improve the scalability of a video server.
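For context, here is a minimal sketch of the baseline Interval Caching policy the thesis builds on: consecutive streams of the same video form intervals, and the smallest intervals are cached first so that each cached byte serves as many following streams as possible. The data layout and greedy selection below are illustrative assumptions, not the thesis's mechanism:

# streams: dict video_id -> list of current playback positions (seconds)
# cache_bytes: buffer cache capacity; bitrate: bytes per second of video
def select_cached_intervals(streams, cache_bytes, bitrate):
    candidates = []
    for vid, positions in streams.items():
        ordered = sorted(positions, reverse=True)   # furthest-ahead stream first
        for leader, follower in zip(ordered, ordered[1:]):
            gap = leader - follower                 # seconds between the pair
            candidates.append((gap * bitrate, vid, follower, leader))
    candidates.sort()                               # smallest intervals first
    chosen, used = [], 0
    for size, vid, lo, hi in candidates:
        if used + size <= cache_bytes:              # cache interval if it fits
            chosen.append((vid, lo, hi))
            used += size
    return chosen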
|
|
|
Interoperable and unified multimedia retrieval in distributed and heterogeneous environments |
| |
Florian Stegmaier
|
|
Pages: 1705-1706 |
|
doi>10.1145/1873951.1874330 |
|
Full text: PDF
|
|
In this abstract, the research topics of my doctoral thesis are introduced. These emerged within the THESEUS research program, in which I work as a third-party funded researcher. The overall aim of my work is to provide unified and interoperable multimedia retrieval in distributed and highly heterogeneous environments. In particular, I am focusing on multimedia databases, multimedia metadata formats, semantic web technologies, as well as international standardization (MPEG/JPEG and W3C).
|
|
|
SESSION: Discussion room - DR1 track |
| |
Ramesh Jain
|
|
|
|
|
Towards a universal detector by mining concepts with small semantic gaps |
| |
Jiashi Feng,
Yan-tao Zheng,
Shuicheng Yan
|
|
Pages: 1707-1710 |
|
doi>10.1145/1873951.1874332 |
|
Full text: PDF
|
|
Can we have a universal detector that could recognize unseen objects with no training exemplars available? Such a detector is highly desirable, as there are hundreds of thousands of object concepts in the human vocabulary but few available labeled image examples. In this study, we attempt to build such a universal detector to predict concepts in the absence of training data. First, by considering both semantic relatedness and visual variance, we mine a set of realistic small-semantic-gap (SSG) concepts from a large-scale image corpus. Detectors of these concepts can deliver reasonably satisfactory recognition accuracies. From these distinctive visual models, we then leverage semantic ontology knowledge and co-occurrence statistics of concepts to extend visual recognition to unseen concepts. To the best of our knowledge, this work presents the first research attempt to substantiate the measurement of the semantic gap for a large number of concepts and to leverage visually learnable concepts to predict those with no training images available. Tests on the NUS-WIDE dataset demonstrate that the selected concepts with small semantic gaps can be modeled well and that the prediction of unseen concepts delivers promising results, with accuracy comparable to preliminary training-based methods.
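A minimal sketch of the zero-shot idea, assuming an unseen concept is scored by propagating the outputs of trained SSG detectors through concept-relatedness weights; the weighting scheme is a plausible reading of the abstract, not the paper's exact formulation:

# ssg_detectors: dict concept -> callable returning P(concept | image)
# relatedness: dict concept -> weight in [0, 1] w.r.t. the unseen concept
#              (e.g., from ontology distance or co-occurrence statistics)
def score_unseen(image, ssg_detectors, relatedness):
    # Weighted average of trained detector outputs, weighted by how
    # related each SSG concept is to the unseen target concept.
    num = sum(relatedness[c] * det(image) for c, det in ssg_detectors.items())
    den = sum(relatedness[c] for c in ssg_detectors) or 1.0
    return num / den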
|
|
|
Intelligent query: open another door to 3d object retrieval |
| |
Yue Gao,
Meng Wang,
Jialie Shen,
Qionghai Dai,
Naiyao Zhang
|
|
Pages: 1711-1714 |
|
doi>10.1145/1873951.1874333 |
|
Full text: PDF
|
|
The increasing number of available 3D objects makes efficient retrieval technology highly desirable. Extensive research has been dedicated to view-based 3D object retrieval because of the advantages of 2D views for 3D object content representation. In this paradigm, retrieval is typically accomplished based on a set of different views of the query object, and research focuses on 3D object representation, matching and indexing. In this work, we present another aspect of 3D object retrieval: the intelligent query. Intelligent query includes query selection, query description and combination, and assistive query. We show how this scheme is ideally suited to the 3D object retrieval problem. We conduct experiments on the National Taiwan University 3D Model database, and the results demonstrate that our approach can improve retrieval performance. Finally, we give insight into the future of the intelligent query for 3D object retrieval.
|
|
|
Interactive storytelling via video content recombination |
| |
Julie Porteous,
Sergio Benini,
Luca Canini,
Fred Charles,
Marc Cavazza,
Riccardo Leonardi
|
|
Pages: 1715-1718 |
|
doi>10.1145/1873951.1874334 |
|
Full text: PDF
|
|
In this paper we present a prototype of video-based storytelling that is able to generate multiple story variants from a baseline video. The video content for the system is generated by an adaptation of forefront video summarisation techniques that decompose the video into a number of Logical Story Units (LSUs) representing sequences of contiguous and interconnected shots sharing a common semantic thread. Alternative storylines are generated using AI planning techniques, and these are used to direct the combination of elementary LSUs for output. We report early results from experiments with the prototype in which the reordering of video shots on the basis of their high-level semantics produces trailers giving the illusion of different storylines.
|
|
|
PANEL SESSION: Panel - PA1 |
| |
Ed Chang,
Tat-Seng Chua
|
|
|
|
|
The use of non-conventional methods for content analysis and understanding: panel overview |
| |
Nicu Sebe,
Qi Tian
|
|
Pages: 1719-1720 |
|
doi>10.1145/1873951.1874336 |
|
Full text: PDF
|
|
This panel will enable the participants to understand key concepts, state-of-the-art techniques, and open issues in content analysis and understanding that make use of non-conventional methods. As such we will cover aspects such as (1) eye gaze for multimodal interaction and content analysis; (2) multimodal interaction for affective retrieval and in affective interfaces: approaches to multimedia content analysis and interaction that use multiple channels of information, new interaction paradigms and physiological signals; (3) the use of brain signals for advanced brain-computer interaction and interfaces; and (4) applications: traditional and emerging application areas.
|
|
|
PANEL SESSION: Panel - PA2 |
| |
Ed Chang,
Tat-Seng Chua
|
|
|
|
|
All things mobile: the present and future of mobile phone computing |
| |
Daniel Gatica-Perez
|
|
Pages: 1721-1722 |
|
doi>10.1145/1873951.1874338 |
|
Full text: PDF
|
|
This is the summary of the panel All Things Mobile: The Present and Future of Mobile Phone Computing.
|
|
|
PANEL SESSION: Panel - PA3 |
| |
Ed Chang,
Tat-Seng Chua
|
|
|
|
|
"Disputatio" on the Use of Ontologies in Multimedia |
| |
Simone Santini,
Amarnath Gupta
|
|
Pages: 1723-1728 |
|
doi>10.1145/1873951.1874340 |
|
Full text: PDF
|
|
|
|
|
WORKSHOP SESSION: Workshop overviews track |
|
|
|
|
eHeritage 2010: 2nd ACM workshop on eHeritage and digital art preservation |
| |
Olga Pereira Bellon,
Ilan Shimshoni,
Matteo Dellepiane
|
|
Pages: 1729-1730 |
|
doi>10.1145/1873951.1874342 |
|
Full text: PDF
|
|
|
|
|
ACM workshop on 3d object retrieval: 3DOR'10 chair's welcome |
| |
Mohamed Daoudi,
Michela Spagnuolo,
Remco Veltkamp
|
|
Pages: 1731-1732 |
|
doi>10.1145/1873951.1874343 |
|
Full text: PDF
|
|
3D media has emerged rapidly as a new type of content within the multimedia domain. The recent acceleration of 3D content production, witnessed across all fields up to user-generated content, is causing a huge amount of traffic and data to be stored and transmitted using Internet technologies. Recent advances in 3D acquisition and 3D graphics rendering technologies have boosted the creation of 3D model archives for several application domains. These include archaeology and cultural heritage, computer-assisted design (CAD), medicine and bioinformatics, 3D face recognition and security, entertainment and serious gaming, and spatial data and 3D city management. Search engines will soon become a key interaction tool for engaging with this data deluge, and 3D content-based retrieval methods will be crucial in the development of effective 3D search engines: visual media are meant to be seen and should be searched accordingly. 3D content-based retrieval is attracting researchers from different fields: computer vision, computer graphics, machine learning, human-computer interaction, and the semantic web.

Since 2008, a series of workshops specifically devoted to the topic has been run under the auspices of the Eurographics association. The first EG 3D Object Retrieval (3DOR) workshop took place in Crete in April 2008, followed by 3DOR'09 in Munich in March 2009 and 3DOR'10 in Norrköping in May 2010. The response of the community in all these years has been encouraging in terms of the number of submissions and the attendance rate. Due to the co-location of the 3DOR workshop with the Eurographics conference, these events primarily addressed the computer graphics community. Now, the co-location with ACM Multimedia 2010, the worldwide premier multimedia conference, gives us the opportunity to meet the multimedia community and further promote a cross-fertilization ground that will hopefully stimulate further discussions on the next steps in this important research area.

The response to the call for participation was a success: even though it was scheduled shortly after the EG 3DOR'10 workshop, ACM 3DOR'10 received 24 full paper submissions on various topics related to 3D retrieval, ranging from new indexing methods for generic 3D models to context-specific methods, such as face recognition and molecular data analysis. Out of the 24 submissions received, 7 contributions were accepted as oral papers (an acceptance rate of 30%) and 7 as poster papers. The ACM 3DOR'10 workshop will feature a one-day technical programme, with the presentation of the full papers and a poster session. The invited talk given by Prof. Anuj Srivastava on Elastic Riemannian Frameworks and Statistical Tools for Shape Analysis complements the programme.

The 3D Object Retrieval workshops have gathered, and continue to gather, great interest in the research community, and there are several people we would like to thank for keeping this interest alive: first of all, we would like to acknowledge and thank Ioannis Pratikakis (Democritus University of Thrace, Greece) and Theoharis Theoharis (University of Athens, Greece) for having started the 3DOR workshop series; Alberto Del Bimbo, for the encouragement to bring 3DOR closer to ACM Multimedia 2010; the ACM 3DOR'10 PC members and reviewers for their efforts and commitment; and all the authors of the submitted papers, who are demonstrating the importance of the topic. We would like to thank the Institut TELECOM for its financial support. We look forward to the next event on 3D Object Retrieval.
|
|
|
MML 2010: international workshop on machine learning and music |
| |
Rafael Ramirez,
Darrell Conklin,
Christina Anagnostopoulou,
José M. Iñesta
|
|
Pages: 1733-1734 |
|
doi>10.1145/1873951.1874344 |
|
Full text: PDF
|
|
MML 2010, the International Workshop on Machine Learning and Music, continues a series of workshops related to artificial intelligence and machine learning in music. In this short article the Programme Chairs summarize the content of the workshop.
|
|
|
ACM workshop on mobile video delivery |
| |
Mainak Chatterjee,
Samrat Ganguly
|
|
Pages: 1735-1736 |
|
doi>10.1145/1873951.1874345 |
|
Full text: PDF
|
|
|
|
|
WSM'10: 2nd ACM workshop on social media |
| |
Susanne Boll,
Steven C.H. Hoi,
Roelof van Zwol,
Jiebo Luo
|
|
Pages: 1737-1738 |
|
doi>10.1145/1873951.1874346 |
|
Full text: PDF
|
|
The ACM SIGMM International Workshop on Social Media (WSM'10) is the second workshop held in conjunction with the ACM International Multimedia Conference (MM'10), in Firenze, Italy, 2010. This workshop provides a forum for researchers and practitioners from all over the world to share information on their latest investigations in social media analysis, exploration, search, mining, and emerging new social media applications.
|
|
|
Modeling, detecting, and processing events in multimedia |
| |
Ansgar Scherp,
Ramesh Jain,
Mohan Kankanhalli,
Vasileios Mezaris
|
|
Pages: 1739-1740 |
|
doi>10.1145/1873951.1874347 |
|
Full text: PDF
|
|
|
|
|
Second ACM international workshop on multimedia in forensics, security and intelligence (MiFor 2010) |
| |
Sebastiano Battiato,
Sabu Emmanuel,
Adrian Ulges,
Marcel Worring
|
|
Pages: 1741-1742 |
|
doi>10.1145/1873951.1874348 |
|
Full text: PDF
|
|
This paper introduces the context of the workshop and the associated papers.
|
|
|
The second ACM international workshop on multimedia technologies for distance learning (MTDL 2010) |
| |
Timothy K. Shih,
Rynson Lau,
Nadia Magnenat-Thalmann,
Marc Spaniol,
Baltasar Fernández-Manjón
|
|
Pages: 1743-1744 |
|
doi>10.1145/1873951.1874349 |
|
Full text: PDF
|
|
The MTDL 2010 workshop, in its second edition, aims to continue contributing to and evaluating the impact of multimedia technologies on e-Learning. This workshop is held in conjunction with the ACM Multimedia 2010 Conference in Firenze (Italy). As a cover paper for this workshop, we briefly summarize important issues to be addressed in e-learning in the first section, followed by a discussion of the important issues addressed in the 6 papers accepted to the workshop (out of 14 submissions).
|
|
|
ACM multimedia 2010 workshop on 3D video processing |
| |
Oliver Schreer,
Adrian Hilton,
Emanuele Trucco
|
|
Pages: 1745-1746 |
|
doi>10.1145/1873951.1874350 |
|
Full text: PDF
|
|
Research on 3D video processing has gained a tremendous amount of momentum due to advances in video communications, broadcasting and entertainment technology (e.g., animation blockbusters like Avatar and Up). There is an increasing need for reliable technologies capable of visualizing 3D content from viewpoints decided by the user; the 2010 football World Cup in South Africa made very evident the need to replay crucial football footage from new viewpoints to decide whether the ball has or has not crossed the goal line. Remote videoconferencing prototypes are introducing a sense of presence into large- and small-scale (PC-based) systems alike by manipulating single and multiple video sequences to improve eye contact and place participants in convincing virtual spaces. All this, and more, is pushing the introduction of 3D services and the development of high-quality 3D displays in a future that is drawing nearer and nearer.
|
|
|
Multimedia content with a speech track: ACM multimedia 2010 workshop on searching spontaneous conversational speech |
| |
Martha Larson,
Roeland Ordelman,
Florian Metze,
Wessel Kraaij,
Franciska de Jong
|
|
Pages: 1747-1748 |
|
doi>10.1145/1873951.1874351 |
|
Full text: PDF
|
|
|
|
|
First ACM international workshop on analysis and retrieval of tracked events and motion in imagery streams (ARTEMIS 2010) |
| |
Anastasios Doulamis,
Jordi Gonzàlez
|
|
Pages: 1749-1750 |
|
doi>10.1145/1873951.1874352 |
|
Full text: PDF
|
|
The advancement of novel capabilities for video understanding increases the cross-fertilization between multiple computer vision and pattern recognition research topics. ARTEMIS 2010 provides a forum for discussing a holistic view of the interpretation and description of human behaviors in multimedia content such as sports, news, documentaries, movies and surveillance footage.
|
|
|
3rd international workshop on automated information extraction in media production |
| |
Alberto Messina,
Robbie De Sutter,
Jean-Pierre Evain,
Masanori Sano,
Gerald Friedland
|
|
Pages: 1751-1752 |
|
doi>10.1145/1873951.1874353 |
|
Full text: PDF
|
|
The third Workshop on Automated Information Extraction in Media Production (AIEMPro10) aims at fostering the exchange of ideas and practices between leading experts in research and leading actors in the media community, in order to catalyze the migration towards new ways of producing media content, aided by the large-scale introduction of tools for automated multimedia analysis and understanding. Furthermore, the workshop helps researchers better understand some of the real-life key requirements that would enable their scientific developments to come into wider adoption.
|
|
|
Pervasive video analysis: workshop overview |
| |
Hamid Aghajan,
Marco Cristani,
Vittorio Murino,
Nicu Sebe
|
|
Pages: 1753-1754 |
|
doi>10.1145/1873951.1874354 |
|
Full text: PDF
|
|
This workshop aims at tackling the novel challenging scenarios in pervasive video analysis, which require not only addressing specific problems (e.g., tracking, recognition) in a single view, but also dealing with a set of distributed observations, possibly integrated with subjective mobile video streams. The accepted papers cover a wide range of subjects, from the joint analysis of video sequences taken from fixed-location and mobile cameras, to situation awareness and understanding.
|
|
|
ACM workshop on mobile cloud media computing |
| |
Xian-Sheng Hua,
Gang Hua,
Chang Wen Chen
|
|
Pages: 1755-1756 |
|
doi>10.1145/1873951.1874355 |
|
Full text: PDF
|
|
Smart mobile devices such as camera phones are typically carried by people all the time. These devices are true "multimedia" devices that acquire, process, transmit and present text, image, video and audio data. However, due to limitations in hardware and networking, multimedia applications and systems have not been adequately supported on mobile devices. With recent developments in mobile hardware, wireless networks, and cloud computing, it is now prime time to realize intelligent, mobile-device-centered multimedia applications with the support of a cloud computing platform. The focus of this workshop is on exploring the challenges and opportunities of intelligent multimedia technologies, applications and systems on mobile devices, especially when a media cloud computing platform can be appropriately leveraged.
|
|
|
ACM workshop on advanced video streaming techniques for peer-to-peer networks and social networking |
| |
Gabriella Olmo,
Christian Timmerer,
Pascal Frossard,
Keith Mitchell
|
|
Pages: 1757-1758 |
|
doi>10.1145/1873951.1874356 |
|
Full text: PDF
|
|
This paper provides a summary and overview of the ACM workshop on advanced video streaming techniques for peer-to-peer networks and social networking.
|
|
|
3rd international workshop on affective interaction in natural environments (AFFINE) |
| |
Ginevra Castellano,
Kostas Karpouzis,
Jean-Claude Martin,
Louis-Philippe Morency,
Christopher Peters,
Laurel D. Riek
|
|
Pages: 1759-1760 |
|
doi>10.1145/1873951.1874357 |
|
Full text: PDF
|
|
The 3rd International Workshop on Affective Interaction in Natural Environments, AFFINE, follows a number of successful AFFINE workshops and events commencing in 2008. A key aim of AFFINE is the identification and investigation of significant open issues in real-time, affect-aware applications 'in the wild', especially in embodied interaction, for example with robots or virtual agents. AFFINE seeks to bring together researchers working on the real-time interpretation of user behaviour with those who are concerned with social robot and virtual agent interaction frameworks.
|
|
|
ACM international workshop on social, adaptive and personalized multimedia interaction and access (SAPMIA 2010) |
| |
David Vallet,
Naeem Ramzan,
Martin Halvey,
Charalampos Z. Patrikakis
|
|
Pages: 1761-1762 |
|
doi>10.1145/1873951.1874358 |
|
Full text: PDF
|
|
In an effort to address and overcome some of the open issues that hinder effective access to and interaction with multimedia content, this workshop will bring together individuals from a number of research communities, including but not limited to Multimedia Distribution and Access, Social Network Analysis, Multimedia Content Analysis, and User Modelling, Adaptation and Personalization. It is our belief that a synergistic approach involving these areas of work can exceed their individual potentials, leading to improved access, understanding, and retrieval of multimedia content. The main objective of this workshop is to provide a forum to disseminate work that explicitly exploits the synergy between multimedia content analysis, personalisation, next-generation networking, and the community aspects of social networks. We believe that this integration could result in robust, personalized multimedia services, providing users with an improved multimedia experience.
|
|
|
Overview of ACM international workshop on connected multimedia |
| |
Zhongfei (Mark) Zhang,
Zhengyou Zhang,
Ramesh Jain,
Yueting Zhuang
|
|
Pages: 1763-1764 |
|
doi>10.1145/1873951.1874359 |
|
Full text: PDF
|
|
Following the first international workshop on connected multimedia, held in Hangzhou, China, in October 2009 and jointly sponsored by the US National Science Foundation and Zhejiang University of China, this is the first ACM International Workshop on Connected Multimedia, held in conjunction with the ACM International Conference on Multimedia in Florence, Italy, in October 2010. In this workshop overview, we first define what we mean by connected multimedia, and then briefly overview the program of the workshop.
|
|
|
MM'10 workshop summary for SSPW: ACM workshop on social signal processing 2010 |
| |
Maja Pantic,
Alessandro Vinciarelli,
Alex Pentland
|
|
Pages: 1765-1766 |
|
doi>10.1145/1873951.1874360 |
|
Full text: PDF
|
|
The Workshop on Social Signal Processing (SSPW) is the yearly event of the Social Signal Processing Network (EU-FP7 SSPNet project). This year's workshop programme consists of 4 premium keynote talks by Jeff Cohn, Alex Pentland, Justine Cassell, and Toyoaki Nishida, an oral session with 4 presentations, a poster session with 7 posters, and a panel session whose panelists will be the keynote speakers and the workshop organizers.
|
|
|
ACM workshop on surreal media and virtual cloning |
| |
Ebroul Izquierdo,
Yang Cai,
Qianni Zhang,
Manuel García-Herranz
|
|
Pages: 1767-1768 |
|
doi>10.1145/1873951.1874361 |
|
Full text: PDF
|
|
This paper gives an overview of ACM Multimedia 2010 Workshop on Surreal Media and Virtual Cloning, including research work towards the creation of surreal media and realistic 3D virtual environments where virtual humans and objects can interact remotely. The primary objective is to discuss key research issues related to the generation of surreal media and 3D cooperative virtual worlds. We expect that the one-day program will bring together research groups from related fields and explore research problems, potential applications and collaborative opportunities.
|
|
|
ACM international workshop on very-large-scale multimedia corpus, mining and retrieval (VLS-MCMR'10) |
| |
Benoit Huet,
Tat-Seng Chua,
Alexander Hauptmann
|
|
Pages: 1769-1770 |
|
doi>10.1145/1873951.1874362 |
|
Full text: PDF
|
|
The purpose of this workshop is to bring together researchers interested in the construction and analysis of very-large-scale multimedia corpora, as well as the methodologies to mine and retrieve information from them. The workshop will provide a forum to consolidate key issues related to research on very-large-scale multimedia datasets, such as the construction of datasets, the creation of ground truth, and the sharing and extension of such resources in terms of ground truth, features, algorithms, tools, etc. The workshop will also discuss and formulate an action plan towards these goals.
|
|
|
TUTORIAL SESSION: Tutorials track |
|
|
|
|
Processing web-scale multimedia data |
| |
Malcolm Slaney,
Edward Y. Chang
|
|
Pages: 1771-1772 |
|
doi>10.1145/1873951.1874364 |
|
Full text: PDF
|
|
The Internet brings us access to multimedia databases with billions of data instances. The massive amount of data available to researchers and application developers brings both opportunities and challenges. In particular, massive amounts of data make data-driven approaches feasible but, at the same time, demand scalable algorithms. In this tutorial we present a range of algorithms and approaches that make it easier to scale our work to Internet-sized collections of multimedia data. The tutorial will start by providing attendees an overview of, and pointers to, the tools that will allow them to scale their work to massive datasets. The tutorial discusses the theoretical and practical problems with large data, applications where large amounts of data are important to consider, types of algorithms that are practical with such large datasets, and examples of implementation techniques that make these algorithms practical. Many real-world examples and results illustrate the tutorial.
|
|
|
Advances in multimedia retrieval, part i: frontiers in multimedia search |
| |
Alan Hanjalic,
Martha Larson
|
|
Pages: 1773-1774 |
|
doi>10.1145/1873951.1874365 |
|
Full text: PDF
|
|
|
|
|
Video search engines: advances in multimedia retrieval, part ii |
| |
Cees G.M. Snoek,
Arnold W.M. Smeulders
|
|
Pages: 1775-1776 |
|
doi>10.1145/1873951.1874366 |
|
Full text: PDF
|
|
In this tutorial, we focus on the challenges in video search, present methods for achieving state-of-the-art performance, and indicate how to obtain improvements in the near future. Moreover, we give an overview of the latest developments and future trends in the field on the basis of the TRECVID competition - the leading competition for video search engines, run by NIST - in which we have achieved consistent top performance over the years, including the 2008 and 2009 editions.
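As background for readers new to the area, the sketch below illustrates the bag-of-visual-words concept detector that was the workhorse of TRECVID-era video search engines. It is our own hedged example with toy data, not code from the tutorial, and assumes local descriptors (e.g., SIFT) have already been extracted from keyframes.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import LinearSVC

    def bovw_histogram(descriptors, codebook):
        """Quantize one keyframe's local descriptors against the visual
        codebook and return an L1-normalized word-frequency histogram."""
        words = codebook.predict(descriptors)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    # Toy stand-ins for real local descriptors (one 200 x 128 array per keyframe).
    rng = np.random.default_rng(0)
    keyframes = [rng.standard_normal((200, 128)) for _ in range(40)]
    labels = np.array([i % 2 for i in range(40)])  # toy concept present/absent

    codebook = KMeans(n_clusters=64, n_init=3, random_state=0)
    codebook.fit(np.vstack(keyframes))             # learn the visual vocabulary

    X = np.array([bovw_histogram(d, codebook) for d in keyframes])
    detector = LinearSVC().fit(X, labels)          # one detector per concept
    scores = detector.decision_function(X)         # rank shots by concept score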
|
|
|
Understanding multimedia content using web scale social media data |
| |
Dong Xu,
Lei Zhang,
Jiebo Luo
|
|
Pages: 1777-1778 |
|
doi>10.1145/1873951.1874367 |
|
Full text: PDF
|
|
Nowadays, increasingly rich and massive social media data (text, images, audio, video, blogs, and so on) are being posted to the web, including social networking websites (e.g., MySpace, Facebook), photo and video sharing websites (e.g., Flickr, YouTube), and photo forums (e.g., Photosig.com and Photo.net). Recently, researchers from multiple disciplines have proposed data-driven approaches to multimedia content understanding that leverage this virtually unlimited supply of web images and videos together with their rich contextual information (e.g., tags, comments, categories, titles, and metadata). In this three-hour tutorial, we plan to introduce the important general concepts and themes of this timely topic. We will also review and summarize recent multimedia content analysis methods that use web-scale social media data, and present insight into the challenges and future directions in this area. Moreover, we will also show extensive demos of image annotation and retrieval using rich social media data.
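One representative data-driven technique in this space is neighbor voting: the tags of an unlabeled image are predicted from the tags carried by its visually nearest neighbors in a large pool of tagged web images. The sketch below is a hedged illustration with invented features and tags, not the presenters' code.

    import numpy as np
    from collections import Counter

    def annotate_by_neighbors(query_feat, pool_feats, pool_tags, k=5, n_tags=3):
        """Propagate the most frequent tags among the query's k visually
        nearest neighbors in the tagged pool."""
        dists = np.linalg.norm(pool_feats - query_feat, axis=1)
        neighbors = np.argsort(dists)[:k]
        votes = Counter(tag for i in neighbors for tag in pool_tags[i])
        return [tag for tag, _ in votes.most_common(n_tags)]

    # Toy pool: visual features of web images plus their user-supplied tags.
    rng = np.random.default_rng(0)
    pool_feats = rng.standard_normal((1000, 64))
    vocab = ["beach", "sunset", "city", "dog", "portrait"]
    pool_tags = [list(rng.choice(vocab, size=2, replace=False)) for _ in range(1000)]

    print(annotate_by_neighbors(rng.standard_normal(64), pool_feats, pool_tags))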
|
|
|
Mobile video streaming in modern wireless networks |
| |
Mohamed Hefeeda,
Cheng-Hsin Hsu
|
|
Pages: 1779-1780 |
|
doi>10.1145/1873951.1874368 |
|
Full text: PDF
|
|
More and more users watch videos streamed over wireless networks on mobile devices, and they demand more content at better quality. For example, market forecasts suggest that mobile video streaming, such as mobile TV, will catch up with gaming and music and become the most popular application on mobile devices. In this tutorial, we will present different approaches for delivering multimedia content over various wireless networks to large numbers of mobile users. We will study and analyze the main research problems in modern wireless networks that need to be addressed in order to enable efficient mobile video services. The tutorial will cover common research problems in wireless networks such as HSDPA, MBMS, WiMAX, LTE, DVB-H, MediaFLO, and ATSC M/H. After giving the preliminaries of the considered wireless network standards, we will focus on important research problems and present their solutions in detail. Finally, we will discuss open problems and future research directions in mobile video. The tutorial will be composed of five parts, which are briefly described in Sections 1-5.
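As a hedged illustration of one recurring building block in mobile streaming clients (not material from this tutorial), the sketch below shows throughput-based bitrate adaptation: the client exponentially smooths recent bandwidth measurements and picks the highest encoding that fits under a safety margin. All constants are invented.

    BITRATES_KBPS = [150, 300, 600, 1200, 2400]   # available encodings

    def pick_bitrate(throughput_samples_kbps, safety=0.8, alpha=0.3):
        """Exponentially smooth recent throughput measurements, then choose
        the largest bitrate below a safety fraction of the estimate."""
        estimate = throughput_samples_kbps[0]
        for s in throughput_samples_kbps[1:]:
            estimate = alpha * s + (1 - alpha) * estimate
        usable = safety * estimate
        feasible = [r for r in BITRATES_KBPS if r <= usable]
        return feasible[-1] if feasible else BITRATES_KBPS[0]

    print(pick_bitrate([800, 650, 900, 700]))     # -> 600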
|
|
|
Immersive future media technologies: from 3D video to sensory experiences |
| |
Christian Timmerer,
Karsten Müller
|
|
Pages: 1781-1782 |
|
doi>10.1145/1873951.1874369 |
|
Full text: PDF
|
|
In this tutorial we present immersive future media technologies ranging from 3D video to sensory experiences. The former covers stereo and multi-view video technologies, whereas the latter aims at stimulating senses other than vision and audition, enabling advanced user experiences through sensory effects.
|
|
|
Modeling human behavior with mobile phones |
| |
Daniel Gatica-Perez
|
|
Pages: 1783-1784 |
|
doi>10.1145/1873951.1874370 |
|
Full text: PDF
|
|
In just a few years, mobile phones have emerged as the ultimate multimedia device. This is the summary of a proposed tutorial on Modeling Human Behavior with Mobile Phones, which aims to present the scientific and technological state of the art in mobile phone-based modeling of large-scale human behavior from a coherent perspective, and hopes to motivate further work in this domain by the multimedia research community.
|
|
|
Human-centered multimedia systems: tutorial overview |
| |
Nicu Sebe,
Alejandro Jaimes,
Hamid Aghajan
|
|
Pages: 1785-1786 |
|
doi>10.1145/1873951.1874371 |
|
Full text: PDF
|
|
This tutorial will focus on technical analysis and interaction techniques formulated from the perspective of key human factors in a user-centered approach to developing multimedia systems. The tutorial will take a holistic view of the research issues and applications of human-centered systems, focusing on four main areas: (1) multimodal interaction: visual (body, gaze, gesture); (2) image indexing and retrieval: user behavior, context modeling, cultural issues, and machine learning for user-centric approaches; (3) multimedia data: conceptual analysis at different levels (feature, cognitive, and affective); and (4) sources of contextual information and case studies in multi-camera networks. This full-day tutorial will consist of two parts: the first half will consist of presentations by the instructors, and the second of practical workgroup activities.
|
|
|
Designing and optimizing large-scale multimedia mining applications in distributed processing environments |
| |
Deepak S. Turaga,
Mihaela van der Schaar
|
|
Pages: 1787-1788 |
|
doi>10.1145/1873951.1874372 |
|
Full text: PDF
|
|
In this tutorial, we will present the fundamental principles of large-scale adaptive multimedia stream mining, describe the state of the art in systems and algorithms, and include recent theoretical and experimental results. We will also discuss how we can construct different cooperative and non-cooperative games to model, analyze, optimize, and shape these applications in different system or connectivity scenarios and under various constraints.
|