|
|
Future of home media |
| |
Kazumasa Enami
|
|
Pages: 1-1 |
|
doi>10.1145/1101149.1101150 |
|
|
|
In the future, the television set at home will be a key device for providing integrated information services through broadcasting, communication and storage media. In this environment, users will be able to receive any type of information, e.g., HDTV programs, in real time and whenever they want. Moreover, the advent of advanced mobile terminals means that this new environment will eventually expand to anywhere outside the home. In this speech, I will introduce these future services and some of the technologies being developed by the NHK Science and Technical Research Laboratories (STRL) to achieve them.
|
|
|
SESSION: Content 1: news video processing |
|
|
|
|
Tracking news stories across different sources |
| |
Yun Zhai,
Mubarak Shah
|
|
Pages: 2-10 |
|
doi>10.1145/1101149.1101152 |
|
|
|
Information linkage is becoming more and more important in this digital age. In this paper, we propose a concept tracking method, which links news stories on the same topic across multiple sources. The semantic linkage between the news stories is reflected in the combination of their visual content and their spoken language content. Visually, each news story is represented by a set of key-frames with or without detected faces. The facial key-frames are linked based on the analysis of the extended facial regions, and the non-facial key-frames are correlated using global affine matching. The language similarity is expressed in terms of the normalized text similarity between the stories' keywords. The output of the story linking is further used in a story ranking task, which indicates the interest level of the stories. The proposed semantic linking framework and the story ranking method have been tested on a set of 60 hours of open-benchmark TRECVID video data, and very satisfactory results for both tasks have been obtained.
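As a rough illustration of the "normalized text similarity between the stories' keywords" mentioned in the abstract above, the following Python sketch scores two keyword lists with cosine similarity. The abstract does not say which normalized measure the authors use, so cosine similarity over keyword counts is an assumption here, and the function name and sample keywords are hypothetical.

```python
import math
from collections import Counter


def keyword_similarity(keywords_a, keywords_b):
    """Cosine similarity between two keyword lists (assumed measure), in [0, 1]."""
    counts_a, counts_b = Counter(keywords_a), Counter(keywords_b)
    # Dot product over the keywords the two stories share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    # Normalizing by vector length keeps long and short stories comparable.
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical keyword sets for three news stories.
story_1 = ["election", "senate", "vote", "budget"]
story_2 = ["senate", "vote", "budget", "tax"]
story_3 = ["hurricane", "coast", "evacuation"]

print(keyword_similarity(story_1, story_2))
print(keyword_similarity(story_1, story_3))
```

In this sketch, stories that share most of their keywords score close to 1 and stories with no keywords in common score 0; a linking threshold between those extremes would have to be tuned on real data.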
|
|
|
Topic transition detection using hierarchical hidden Markov and semi-Markov models |
| |
Dinh Q. Phung,
T. V. Duong,
S. Venkatesh,
Hung H. Bui
|
|
Pages: 11-20 |
|
doi>10.1145/1101149.1101153 |
|
|
|
In this paper we introduce a probabilistic framework to exploit hierarchy, structure sharing and duration information for topic transition detection in videos. Our probabilistic detection framework is a combination of a shot classification step and a detection phase using hierarchical probabilistic models. We consider two models in this paper: the extended Hierarchical Hidden Markov Model (HHMM) and the Coxian Switching Hidden semi-Markov Model (S-HSMM), because they allow the natural decomposition of semantics in videos, including shared structures, to be modeled directly, and thus enable efficient inference and reduce the sample complexity in learning. Additionally, the S-HSMM allows duration information to be incorporated; consequently, the modeling of long-term dependencies in videos is enriched through both hierarchical and duration modeling. Furthermore, the use of the Coxian distribution in the S-HSMM makes it tractable to deal with long sequences in video. Our experiments with the proposed framework on twelve educational and training videos show that both models outperform the baseline cases (flat HMM and HSMM) and the performance reported in earlier work on topic detection. The superior performance of the S-HSMM over the HHMM verifies our belief that duration information is an important factor in video content modeling.
|
|
|
Joint visual-text modeling for automatic retrieval of multimedia documents |
| |
G. Iyengar,
P. Duygulu,
S. Feng,
P. Ircing,
S. P. Khudanpur,
D. Klakow,
M. R. Krause,
R. Manmatha,
H. J. Nock,
D. Petkova,
B. Pytlik,
P. Virga
|
|
Pages: 21-30 |
|
doi>10.1145/1101149.1101154 |
|
|
|
In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework where individual components are developed to model different relationships between documents and queries and then combined into a joint retrieval framework. In state-of-the-art systems, a late combination between two independent systems, one analyzing just the text part of such documents, and the other analyzing the visual part without leveraging any knowledge acquired in the text processing, is the norm. Such systems rarely exceed the performance of any single modality (i.e. text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate over 14% improvement in IR performance over the best reported text-only baseline and rank among the best results reported on this corpus.
|
|
|
Multiple instance learning for labeling faces in broadcasting news video |
| |
Jun Yang,
Rong Yan,
Alexander G. Hauptmann
|
|
Pages: 31-40 |
|
doi>10.1145/1101149.1101155 |
|
|
|
Labeling faces in news video with their names is an interesting research problem which was previously solved using supervised methods that demand significant user effort in labeling training data. In this paper, we investigate a more challenging setting of the problem where there is no complete information on data labels. Specifically, by exploiting the uniqueness of a face's name, we formulate the problem as a special multi-instance learning (MIL) problem, namely the exclusive MIL or eMIL problem, so that it can be tackled by a model trained with partial labeling information in the form of the anonymity judgment of faces, which requires less user effort to collect. We propose two discriminative probabilistic learning methods named Exclusive Density (ED) and Iterative ED for eMIL problems. Experiments on the face labeling problem show that the performance of the proposed approaches is superior to that of traditional MIL algorithms and close to the performance achieved by supervised methods trained with complete data labels.
|
|
|
SESSION: Applications 1: media fusion for communication and presentation |
|
|
|
|
Complementing your TV-viewing by web content automatically-transformed into TV-program-type content |
| |
Akiyo Nadamoto,
Katsumi Tanaka
|
|
Pages: 41-50 |
|
doi>10.1145/1101149.1101157 |
|
|
|
Despite much talk about the fusion of broadcasting and the Internet, no technology has been established for fusing web and TV program content. In this paper, we propose ways to transform web content into TV-program-type content as a first step towards the fusion of these media. Our transformation method is based on two criteria: the transmitted information and the dialogue among character agents. The method deals with both an audio component and a visual component. By combining these techniques, we can transform web content into various forms of TV-program-type content depending on the user's aims. We present three prototype systems: u-Pav, which reads out the entire text of web content and presents image animation; Web2TV, which reads out the entire text of web content and presents character agent animation; and Web2Talkshow, which presents keyword-based dialogue and character agent animation. These prototype systems enable users to watch web content in the same way they watch a TV program.
|
|
|
Augmented segmentation and visualization for presentation videos |
| |
Alexander Haubold,
John R. Kender
|
|
Pages: 51-60 |
|
doi>10.1145/1101149.1101158 |
|
|
|
We investigate methods of segmenting, visualizing, and indexing presentation videos by both audio and visual data. The audio track is segmented by speaker, and augmented with key phrases which are extracted using an Automatic Speech Recognizer (ASR). The video track is segmented by visual dissimilarities and changes in speaker gesturing, and augmented by representative key frames. An interactive user interface combines visual representations of audio, video, text, and key frames, and allows the user to navigate presentation videos. User studies with 176 students of varying knowledge were conducted on 7.5 hours of student presentation video (32 presentations). Tasks included searching for various portions of presentations, both known and unknown to students, and summarizing presentations given the annotations. The results are favorable towards the video summaries and the interface, suggesting responses that are 20% faster compared to having access to the actual video. Accuracy of responses remained the same on average. Follow-up surveys present a number of suggestions for improving the interface, such as the incorporation of automatic speaker clustering and identification, and the display of an abstract topological view of the presentation. Surveys also show alternative contexts in which students would like to use the tool in the classroom environment.
|
|
|
Exploring media correlation and synchronization for navigated hypermedia documents |
| |
Kuo-Yu Liu,
Herng-Yow Chen
|
|
Pages: 61-70 |
|
doi>10.1145/1101149.1101159 |
|
|
|
This paper explores media correlation and media synchronization in a composite multimedia document, the so-called navigated hypermedia document in our language learning system, to facilitate multimedia authoring, presentation, and access. Two levels of media correlation in the temporal, spatial, and content domains are investigated: syntactic-level correlation and semantic-level correlation. We devise a capturing mechanism to record all the media streams and the relations between them, including voice and event streams, for replaying the lecture in a form as close as possible to the original classroom experience. The syntactic-level correlation is based on specific timestamps of the media stream and is used to reconstruct the recorded lecture for synchronized presentation. Furthermore, to integrate media objects with specific segments within the media stream, some computed synchronization processes are required to discover the semantic content of the media. The proposed computed synchronization techniques are addressed, including a speech-event binding process for the temporal domain, tele-pointer (i.e. cursor) movement interpolation and adaptable handwriting presentation for the spatial domain, and erasing handling for the content domain. Experimental results show that in the speech-event binding process 74% of speech access entries for accessible visualized events are found. The acceptable rate of human perception of tele-pointer movement is higher than 85% if the time interval is selected carefully. Finally, the accuracy of erasing handling for content removal is about 71%.
|
|
|
Designing a large-scale video chat application |
| |
Jeremiah Scholl,
Peter Parnes,
John D. McCarthy,
Angela Sasse
|
|
Pages: 71-80 |
|
doi>10.1145/1101149.1101160 |
|
|
|
Studies of video conferencing systems generally focus on scenarios where users communicate using an audio channel. However, text chat serves users in a wide variety of contexts, and is commonly included in multimedia conferencing systems as a complement to the audio channel. This paper introduces a prototype application which integrates video and text communication, and describes a formative evaluation of the prototype with 53 users in a social setting. We focus the evaluation on bandwidth and view navigation requirements in order to determine how to better serve users with video chat, and discuss how the findings from this evaluation can inform the design of future video chat applications. Bandwidth requirements are evaluated through user perceptions of video delivered using three different bandwidth schemes. For view navigation, we examine a system that automatically switches the video focus to the current "chatter", instead of requiring users to navigate manually to find the video stream they are interested in viewing.
|
|
|
SESSION: Brave new topics 1: multimedia challenges for planetary scale applications |
|
|
|
|
IrisNet: an internet-scale architecture for multimedia sensors |
| |
Jason Campbell,
Phillip B. Gibbons,
Suman Nath,
Padmanabhan Pillai,
Srinivasan Seshan,
Rahul Sukthankar
|
|
Pages: 81-88 |
|
doi>10.1145/1101149.1101162 |
|
|
|
Most current sensor network research explores the use of extremely simple sensors on small devices called motes and focuses on overcoming the resource constraints of these devices. In contrast, our research explores the challenges of multimedia sensors and is motivated by the fact that multimedia devices, such as cameras, are rapidly becoming inexpensive, yet their use in a sensor network presents a number of unique challenges. For example, the data rates involved with multimedia sensors are orders of magnitude greater than those for sensor motes, and this data cannot easily be processed by traditional sensor network techniques that focus on scalar data. In addition, the richness of the data generated by multimedia sensors makes them useful for a wide variety of applications. This paper presents an overview of IRISNET, a sensor network architecture that enables the creation of a planetary-scale infrastructure of multimedia sensors that can be shared by a large number of applications. To ensure the efficient collection of sensor readings, IRISNET enables the application-specific processing of sensor feeds on the significant computation resources that are typically attached to multimedia sensors. IRISNET enables the storage of sensor readings close to their source by providing a convenient and extensible distributed XML database infrastructure. Finally, IRISNET provides a number of multimedia processing primitives that enable the effective processing of sensor feeds in-network and at-sensor.
|
|
|
The multimedia challenges raised by pervasive games |
| |
Mauricio Capra,
Milena Radenkovic,
Steve Benford,
Leif Oppermann,
Adam Drozd,
Martin Flintham
|
|
Pages: 89-95 |
|
doi>10.1145/1101149.1101163 |
|
|
|
Pervasive gaming is a new form of multimedia entertainment that extends the traditional computer gaming experience out into the real world. Through a combination of personal devices, positioning systems and other multimedia sensors, combined with wireless networking, a pervasive game can respond to a player's movements and context and enable them to communicate with a game server and other players. We review recent examples of pervasive games in order to explain their distinctive characteristics as multimedia applications. We then consider the challenge of scaling pervasive games to include potentially very large numbers of players. We propose a new approach based upon a campaign model in which individuals, local groups and experts draw on a combination of pervasive games, online services and broadcasting to take part in national or even global events. We discuss the challenges that this raises for further research.
|
|
|
PLASMA: a PLAnetary scale monitoring architecture |
| |
Demet Aksoy
|
|
Pages: 96-102 |
|
doi>10.1145/1101149.1101164 |
|
|
|
While sensor networks continue to attract significant interest in various research communities, high impact applications still have a long list of challenges to be addressed. An individual sensor system can provide important observations within a local area. However, local observations alone are not sufficient for some applications that require planetary scale coverage. Monitoring volcanic activity, nuclear disasters, magnetic field changes, migration patterns of species, and pandemic disease spread patterns are some examples of such applications. These applications require a close interaction between different sensor networks with in-situ and remotely sensed observations. In this paper we describe our PLASMA (PLAnetary Scale Monitoring Architecture) project to motivate the challenges that need to be addressed at such scale. These include approximations in spatiotemporal attributes due to resource constraints and also multi-attribute visualization to enable a real-time user interface to the system.
|
|
|
Gates of global perception: forensic graphics for evidence presentation |
| |
A. M. Burton,
D. Schofield,
L. M. Goodwin
|
|
Pages: 103-111 |
|
doi>10.1145/1101149.1101165 |
|
|
|
The admissibility of the inevitably increasing amount of digital evidence to the world's courtrooms may be one of the keys to the preservation of global justice. Digital evidence can take many forms; this paper concentrates on both graphical evidence presentation technologies currently in use (such as forensic animations and interactive environments) and potential future applications (e.g. the introduction of more pervasive computer devices). Technologies utilising Computer Graphics (CG) and Virtual Reality (VR) for evidence presentation can have great persuasive powers. These can be perceived as a benefit in increasing the understanding of complicated technical information by a general audience, or as a threat to justice that introduces potential bias and prejudice. This paper describes some cases where CG and VR evidence has previously been admitted to courtrooms. It goes on to discuss the various factors affecting the admissibility of current digital evidence forms on a global scale and concludes by introducing new technologies which may have worldwide potential in the field of forensic evidence presentation.
|
|
|
SESSION: Content 2: image clustering |
|
|
|
|
Web image clustering by consistent utilization of visual features and surrounding texts |
| |
Bin Gao,
Tie-Yan Liu,
Tao Qin,
Xin Zheng,
Qian-Sheng Cheng,
Wei-Ying Ma
|
|
Pages: 112-121 |
|
doi>10.1145/1101149.1101167 |
|
|
|
Image clustering, an important technology for image processing, has been actively researched for a long period of time. Especially in recent years, with the explosive growth of the Web, image clustering has become a critical technology to help users digest the large amount of online visual information. However, as far as we know, many previous works on image clustering only used either low-level visual features or surrounding texts, but rarely exploited these two kinds of information in the same framework. To tackle this problem, we proposed a novel method named consistent bipartite graph co-partitioning in this paper, which can cluster Web images based on the consistent fusion of the information contained in both low-level features and surrounding texts. In particular, we formulated it as a constrained multi-objective optimization problem, which can be efficiently solved by semi-definite programming (SDP). Experiments on a real-world Web image collection showed that our proposed method outperformed the methods based only on low-level features or surrounding texts.
|
|
|
Iteratively clustering web images based on link and attribute reinforcements |
| |
Xin-Jing Wang,
Wei-Ying Ma,
Lei Zhang,
Xing Li
|
|
Pages: 122-131 |
|
doi>10.1145/1101149.1101168 |
|
|
|
Image clustering is an important research topic which contributes to a wide range of applications. Traditional image clustering approaches are based on image content features only, while content features alone can hardly describe the semantics of the images. In the context of the Web, images are no longer assumed homogeneous and "flat" distributed but are richly structured. There are two kinds of reinforcements embedded in such data: 1) the reinforcement between attributes of different data types (intra-type links reinforcements); and 2) the reinforcement between object attributes and the inter-type links (inter-type links reinforcements). Unfortunately, most of the previous works addressing relational data failed to fully explore the reinforcements. In this paper, we propose a reinforcement clustering framework to tackle this problem. It reinforces images and texts' attributes via inter-type links and inversely uses these attributes to update these links. The iterative reinforcing nature of this framework promises the discovery of the semantic structure of images, which is the basis of image clustering. Experimental results show the effectiveness of our proposed framework.
|
|
|
Image clustering with tensor representation |
| |
Xiaofei He,
Deng Cai,
Haifeng Liu,
Jiawei Han
|
|
Pages: 132-140 |
|
doi>10.1145/1101149.1101169 |
|
|
|
We consider the problem of image representation and clustering. Traditionally, an $n_1 \times n_2$ image is represented by a vector in the Euclidean space $\mathbb{R}^{n_1 \times n_2}$. Some learning algorithms are then applied to these vectors in such a high dimensional space for dimensionality reduction, classification, and clustering. However, an image is intrinsically a matrix, or a second-order tensor. The vector representation of the images ignores the spatial relationships between the pixels in an image. In this paper, we introduce a tensor framework for image analysis. We represent the images as points in the tensor space $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$, which is a tensor product of two vector spaces. Based on the tensor representation, we propose a novel image representation and clustering algorithm which explicitly considers the manifold structure of the tensor space. By preserving the local structure of the data manifold, we can obtain a tensor subspace which is optimal for data representation in the sense of local isometry. We call this the TensorImage approach. A traditional clustering algorithm such as k-means is then applied in the tensor subspace. Our algorithm shares many of the data representation and clustering properties of other techniques such as Locality Preserving Projections, Laplacian Eigenmaps, and spectral clustering, yet our algorithm is much more computationally efficient. Experimental results show the efficiency and effectiveness of our algorithm.
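To make the contrast between the vector and tensor representations in the abstract above more concrete, the following Python sketch compares the number of parameters in a linear projection of the flattened image with the number in a two-sided projection of the form U^T X V applied to the image matrix. This is not the TensorImage algorithm itself (the abstract does not give its objective or how the projection is learned); the two-sided form, the function names, and the sizes are assumptions chosen for illustration.

```python
def vector_projection_params(n1, n2, d):
    """Parameter count of a linear map from the flattened image space R^(n1*n2) to R^d."""
    return n1 * n2 * d


def bilinear_projection_params(n1, n2, l1, l2):
    """Parameter count of a two-sided map Y = U^T X V, with U of size n1 x l1 and V of size n2 x l2."""
    return n1 * l1 + n2 * l2


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def bilinear_project(X, U, V):
    """Project the image matrix X (n1 x n2) to an l1 x l2 matrix via U^T X V."""
    return matmul(matmul(transpose(U), X), V)


# Hypothetical 3 x 3 "image" and hand-picked projection matrices.
X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
U = [[1], [0], [0]]            # keeps the first row
V = [[1, 0], [0, 1], [0, 0]]   # keeps the first two columns

print(bilinear_project(X, U, V))
print(vector_projection_params(32, 32, 16), bilinear_projection_params(32, 32, 4, 4))
```

Under these assumed sizes, the two-sided projection of a 32 x 32 image to a 4 x 4 matrix has far fewer parameters than a projection of the flattened image to 16 dimensions, which is one plausible reading of why a tensor formulation can be cheaper to compute.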
|
|
|
Building and tracking hierarchical geographical & temporal partitions for image collection management on mobile devices |
| |
A. Pigeau,
M. Gelgon
|
|
Pages: 141-150 |
|
doi>10.1145/1101149.1101170 |
|
|
|
Usage of mobile devices (phones, digital cameras) raises the need for organizing large personal image collections. In accordance with studies on user needs, we propose a statistical criterion and an associated optimization technique, relying on geo-temporal image metadata, for building and tracking a hierarchical structure on the image collection. In a mixture model framework, particularities of the application and typical data sets are taken into account in the design of the scheme (incrementality, ability to cope with non-Gaussian data, with both small and large samples). Results are reported on real data sets.
|
|
|
SESSION: Systems 1: multi-camera systems |
|
|
|
|
Critical video quality for distributed automated video surveillance |
| |
Pavel Korshunov,
Wei Tsang Ooi
|
|
Pages: 151-160 |
|
doi>10.1145/1101149.1101172 |
|
|
|
Large-scale distributed video surveillance systems pose new scalability challenges. Due to the large number of video sources in such systems, the amount of bandwidth required to transmit video streams for monitoring often strains the capability of the network. On the other hand, large-scale surveillance systems often rely on computer vision algorithms to automate surveillance tasks. We observe that these surveillance tasks present an opportunity for a trade-off between the accuracy of the tasks and the bit rate of the video being sent. This paper shows that there exists a sweet spot, which we term critical video quality, that can be used to reduce video bit rate without significantly affecting the accuracy of the surveillance tasks. We demonstrate this point by running extensive experiments on standard face detection and face tracking algorithms. Our experiments show that face detection works equally well even if the quality of compression is significantly reduced, and face tracking still works even if the frame rate is reduced to 6 frames per second. We further develop a prototype video surveillance system to demonstrate this idea. Our evaluation shows that we can achieve up to a 29 times reduction in video bit rate when detecting faces and a 16 times reduction when tracking faces. This paper also proposes a formal rate-accuracy optimization framework which can be used to determine appropriate encoding parameters in distributed video surveillance systems that are subject to either bandwidth constraints or accuracy constraints.
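The rate-accuracy trade-off described in the abstract above can be sketched as a selection over measured encoding settings: either minimize bit rate subject to an accuracy floor, or maximize accuracy subject to a bandwidth cap. The abstract does not give the paper's formal framework, so the Python sketch below is only one simple reading of it; the `Setting` type, the function names, and every number in `PROFILE` are hypothetical and not taken from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Setting(NamedTuple):
    quality: int         # hypothetical compression quality level
    fps: int             # frame rate
    bitrate_kbps: float  # resulting video bit rate
    accuracy: float      # measured task accuracy in [0, 1]


def min_bitrate_setting(settings: Sequence[Setting], min_accuracy: float) -> Optional[Setting]:
    """Cheapest setting whose accuracy meets the accuracy constraint, or None."""
    feasible = [s for s in settings if s.accuracy >= min_accuracy]
    return min(feasible, key=lambda s: s.bitrate_kbps) if feasible else None


def max_accuracy_setting(settings: Sequence[Setting], max_bitrate_kbps: float) -> Optional[Setting]:
    """Most accurate setting that fits within the bandwidth constraint, or None."""
    feasible = [s for s in settings if s.bitrate_kbps <= max_bitrate_kbps]
    return max(feasible, key=lambda s: s.accuracy) if feasible else None


# Made-up measurements: accuracy stays nearly flat until quality gets very low.
PROFILE = [
    Setting(quality=90, fps=30, bitrate_kbps=2000.0, accuracy=0.95),
    Setting(quality=50, fps=30, bitrate_kbps=600.0, accuracy=0.94),
    Setting(quality=20, fps=30, bitrate_kbps=150.0, accuracy=0.93),
    Setting(quality=5, fps=30, bitrate_kbps=60.0, accuracy=0.70),
]

print(min_bitrate_setting(PROFILE, min_accuracy=0.90))
print(max_accuracy_setting(PROFILE, max_bitrate_kbps=700.0))
```

In this made-up profile, the setting returned under the accuracy constraint plays the role of the "critical video quality": the lowest bit rate before accuracy drops sharply.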
|
|
|
A real-time interactive multi-view video system |
| |
Jian-Guang Lou,
Hua Cai,
Jiang Li
|
|
Pages: 161-170 |
|
doi>10.1145/1101149.1101173 |
|
|
|
With the rapid development of electronic and computing technology, multi-view video has recently attracted extensive interest due to its greatly enhanced viewing experience. In this paper, we present the system architecture for real-time capturing, processing, and interactive delivery of multi-view video. Unlike previous systems that mainly focus on multi-view video capturing, our system is designed to provide multi-view video service with a high degree of interactivity in real time, which is still challenging in the current state of the technology. The proposed architecture tackles many practical problems in system calibration, object tracking, video compression, interactive delivery, etc. With the proposed system, users can interactively select their desired viewing directions and enjoy many exciting visual experiences, such as view switching, frozen moment and view sweeping, in real time and with great freedom.
|
|
|
MedSMan: a streaming data management system over live multimedia |
| |
Bin Liu,
Amarnath Gupta,
Ramesh Jain
|
|
Pages: 171-180 |
|
doi>10.1145/1101149.1101174 |
|
|
|
Querying live media streams is a challenging problem that is becoming an essential requirement in a growing number of applications. Research in multimedia information systems has addressed and made good progress in dealing with archived data. Meanwhile, research in stream databases has received significant attention for querying alphanumeric symbolic streams. The lack of a unifying data model capable of representing multimedia data and providing reasonable abstractions for querying live multimedia streams poses the challenge of how to make the best use of data in video and other sensor networks for various applications including video surveillance, live conferencing and Eventweb. This paper presents a system that enables direct capture of media streams from sensors and automatically generates meaningful feature streams that can be queried by a data stream processor. The system provides an effective combination of extensible digital processing techniques and general data stream management research.
|
|
|
SESSION: Interactive arts 1: interfaces for audio and music creation |
|
|
|
|
"fl Huge UId streams": fountains that are keyboards with nozzle spray as keys that give rich tactile feedback and are more expressive and more fun than plastic keys |
| |
Steve Mann
|
|
Pages: 181-190 |
|
doi>10.1145/1101149.1101176 |
|
Full text: PDF
|
|
"flUId" is a system for fluid-based tactile user interfaces with an array of fluid streams that work like the keys on a keyboard, but that can also provide a much richer and more expressive form of input by virtue of the infinitely diverse ways in which each fluid jet can be pressed, hit, restricted, or otherwise manipulated by a user. Additionally, if desired, flUId can provide tactile feedback by dynamically modulating the pressure of the fluid spray, so that the keyboard is actually bi-directional (i.e. is both an input and an output device). A 104-jet version can be used as a fun and tactile "QWERTY..." style keyboard. More importantly, however, flUId can also be used for applications, such as musical instruments, where its more expressive multi-dimensional input capabilities can be put to full use. One such instrument, the "FUNtain", is a hollow tubular object with a row of holes in it. It is played much like one would play a tin flute or recorder, by covering up the holes to restrict fluid flow. The FUNtain's fluid-based "keys" embody features of a keyboard instrument (piano or organ) as well as features of the tin flute, to create a hybrid water-pipe organflute ("waterpipe florgan") instrument. This gives rise to a fun new way of playing music by successively blocking water jets in a fountain, while sitting in a hot tub, or while frolicking in a pool or lake. Other examples of fluid-user-interface systems that were invented, designed, and built by the author, to enable direct interaction with fluids, as input media, are also discussed. Some of the input devices will work with either air or water, to provide the benefits of richly expressive input and dynamic tactile feedback in settings where use of wet fluid is inappropriate. FUNtains that use no computer or electricity are also presented as wholly acoustic musical instruments. Some of these embody "back to basics" postcyborg/undigital multimedia design elements such as fire, water, and air.
|
|
|
Facilitating collective musical creativity |
| |
Atau Tanaka,
Nao Tokui,
Ali Momeni
|
|
Pages: 191-198 |
|
doi>10.1145/1101149.1101177 |
|
Full text: PDF
|
|
We present two projects that facilitate collective music creativity over networks. One system is a participative social music system on mobile devices. The other is a collaborative music mixing environment that adheres to the Creative Commons license [1]. We discuss how network and community infrastructures affect the creative musical process, and the implications for artists creating new content for these formats. The projects described are real-world examples of collaborative systems as musical works.
|
|
|
MobiLenin combining a multi-track music video, personal mobile phones and a public display into multi-user interactive entertainment |
| |
Jürgen Scheible,
Timo Ojala
|
|
Pages: 199-208 |
|
doi>10.1145/1101149.1101178 |
|
Full text: PDF
|
|
This paper introduces a novel and creative approach for coupling multimedia art with a non-conventional distributed human-computer interface for multi-user interactive entertainment. The proposed MobiLenin system allows a group of people to interact simultaneously with a multi-track music video shown on a large public display using their personal mobile phones, effectively empowering the group with the joint authorship of the video. The system is realized with a client-server architecture which includes server-driven real-time control of the client UI to guarantee ease of use and a lottery mechanism as an incentive for interaction. Our analysis of the findings of an empirical user evaluation conducted in a true environment of use shows that the MobiLenin system is successful, addressing many of the challenges identified in the literature. The proposed system offers a new form of interactive entertainment for pubs and other public places, and the underlying architecture provides a framework for realizing similar installations with different types of multimedia content.
|
|
|
DEMONSTRATION SESSION: Technical demonstration 1: media understanding and browsing |
|
|
|
|
PhotoRouter: destination-centric mobile media messaging |
| |
Shane Ahern,
Simon King,
Hong Qu,
Marc Davis
|
|
Pages: 209-210 |
|
doi>10.1145/1101149.1101180 |
|
Full text: PDF
|
|
The number of people using cameraphones is growing by tens of millions every month. Yet the majority of cameraphone users have difficulty transferring photos off their phone and sharing them with others. PhotoRouter is a software application for cameraphones that makes the photo sharing process destination-centric by allowing users to focus on who the photo should go to, not how it needs to get there. Attempting to produce an application which meets user needs better than current, technology-centric cameraphone photo sharing applications, we designed PhotoRouter. In this paper we describe PhotoRouter's user interface innovations that we will show in our technical demonstration.
|
|
|
Content-based music audio recommendation |
| |
Pedro Cano,
Markus Koppenberger,
Nicolas Wack
|
|
Pages: 211-212 |
|
doi>10.1145/1101149.1101181 |
|
Full text: PDF
|
|
We present the MusicSurfer, a metadata-free system for the interaction with massive collections of music. MusicSurfer automatically extracts descriptions related to instrumentation, rhythm and harmony from music audio signals. Together with efficient similarity metrics, the descriptions allow navigation of multimillion track music collections in a flexible and efficient way without the need for metadata or human ratings.
|
|
|
MediaMetro: browsing multimedia document collections with a 3D city metaphor |
| |
Patrick Chiu,
Andreas Girgensohn,
Surapong Lertsithichai,
Wolf Polak,
Frank Shipman
|
|
Pages: 213-214 |
|
doi>10.1145/1101149.1101182 |
|
Full text: PDF
|
|
The MediaMetro application provides an interactive 3D visualization of multimedia document collections using a city metaphor. The directories are mapped to city layouts using algorithms similar to treemaps. Each multimedia document is represented by a building and visual summaries of the different constituent media types are rendered onto the sides of the building. From videos, Manga storyboards with keyframe images are created and shown on the façade; from slides and text, thumbnail images are produced and subsampled for display on the building sides. The images resemble windows on a building and can be selected for media playback. To support more facile navigation between high overviews and low detail views, a novel swooping technique was developed that combines altitude and tilt changes with zeroing in on a target.
|
|
|
mCLOVER: mobile content-based leaf image retrieval system |
| |
Suckchul Kim,
Yoonsik Tak,
Yunyoung Nam,
Eenjun Hwang
|
|
Pages: 215-216 |
|
doi>10.1145/1101149.1101183 |
|
Full text: PDF
|
|
This demonstration presents a content-based leaf image retrieval system that supports wired/wireless access. For example, if we want to know about a plant that we encounter in a mountain or field, we might look it up in an illustrated book. However, the search takes a long time because of the lack of appropriate indexing or search clues and the large number of similar plants. In order to solve this problem, we developed a content-based leaf image retrieval system called mCLOVER that supports both wired and wireless access and includes a set of novel features for easy querying and efficient retrieval.
|
|
|
Video2Cartoon: generating 3D cartoon from broadcast soccer video |
| |
Dawei Liang,
Yang Liu,
Qingming Huang,
Guangyu Zhu,
Shuqiang Jiang,
Zhebin Zhang,
Wen Gao
|
|
Pages: 217-218 |
|
doi>10.1145/1101149.1101184 |
|
Full text: PDF
|
|
In this demonstration, a prototype system for generating a 3D cartoon from broadcast soccer video is proposed. This system takes advantage of computer vision (CV) and computer graphics (CG) techniques to provide users with a new experience that cannot be obtained from the original video. First, it uses CV techniques to obtain the 3D positions of the players and ball. Then, CG techniques are applied to model the playfield, players, and ball. Finally, a 3D cartoon is generated. Our system allows users to watch the game from any point of view using a 3D viewer based on OpenGL.
|
|
|
Online face detection and user authentication |
| |
Caroline Mallauran,
Jean-Luc Dugelay,
Florent Perronnin,
Christophe Garcia
|
|
Pages: 219-220 |
|
doi>10.1145/1101149.1101185 |
|
Full text: PDF
|
|
The ability to verify automatically and with great accuracy the identity of a person has become crucial in everyday life. Biometrics is an emerging topic in the field of signal processing. Our research on biometrics aims at developing a complete framework useful to control access. This technical demo shows the latest image processing techniques for face detection developed at France Telecom and for face recognition developed at Eurécom. Using only one computer and one standard webcam, our biometric system detects the user's face and the recognition algorithm uses this image to enable access to a resource, a service or a location.
|
|
|
Intention-based home video browsing |
| |
Tao Mei,
Xian-Sheng Hua
|
|
Pages: 221-222 |
|
doi>10.1145/1101149.1101186 |
|
Full text: PDF
|
|
This demonstration presents an efficient home video browsing system from a novel viewpoint -- capture intention. We extend our previous work to build up a comprehensive scheme to mine the capture intention, and based on this scheme, propose a novel home video browsing system. Such an intention-based system assists both camcorder users and viewers to experience their personal home videos by providing two better manners of browsing -- by intention thumbnails and by intention curves. The user study indicates that our system offers users a more efficient way to search for a given clip in a video than an attention-based scheme.
|
|
|
Photo LOI: browsing multi-user photo collections |
| |
Rahul Nair,
Nick Reid,
Marc Davis
|
|
Pages: 223-224 |
|
doi>10.1145/1101149.1101187 |
|
Full text: PDF
|
|
The number of digital photographs is growing beyond the abilities of individuals to easily manage and understand their own photo collections. Photo LOI (Level of Interest) is a technique that filters, aggregates, and visualizes photographs taken by multiple users who shared temporal, spatial, and/or social context at the point of photo capture. Photo LOI enables groups of photographers to see and manipulate visualizations of their photographic activities over time and social space in order to help cluster and select photos, and enables researchers to study contextual patterns in the phototaking habits of different users and groups of users. In this paper we give a brief overview of Photo LOI's features and describe some of its applications.
|
|
|
MediaMill: exploring news video archives based on learned semantics |
| |
Cees G. M. Snoek,
Marcel Worring,
Jan van Gemert,
Jan-Mark Geusebroek,
Dennis Koelma,
Giang P. Nguyen,
Ork de Rooij,
Frank Seinstra
|
|
Pages: 225-226 |
|
doi>10.1145/1101149.1101188 |
|
Full text: PDF
|
|
In this technical demonstration we showcase the MediaMill system, a search engine that facilitates access to news video archives at a semantic level. The core of the system is an unprecedented lexicon of 100 automatically detected semantic concepts. Based on this lexicon we demonstrate how users can obtain highly relevant retrieval results using query-by-concept. In addition, we show how the lexicon of concepts can be exploited for novel applications using advanced semantic visualizations. Several aspects of the MediaMill system are evaluated as part of our TRECVID 2005 efforts.
|
|
|
A repeated video clip identification system |
| |
Xianfeng Yang,
Ping Xue,
Qi Tian
|
|
Pages: 227-228 |
|
doi>10.1145/1101149.1101189 |
|
Full text: PDF
|
|
Identifying short repeated video clips, such as news program logos, station logos, TV commercials, etc., from broadcasting video databases or streams is important for video content indexing, personalization as well as monitoring. In this demo system we present the following two functions: 1) automatically identifying variable-length unknown repeated clips and known reference clips from video collections or streams; 2) visualizing the temporal distribution of repeated short video clips and analyzing video structure based on them. Experiments have been conducted on 12 hours of CNN and ABC news videos, and excellent results have been achieved in both short repeated video clip identification and news video structure analysis.
|
|
|
SESSION: Best student papers |
|
|
|
|
SensEye: a multi-tier camera sensor network |
| |
Purushottam Kulkarni,
Deepak Ganesan,
Prashant Shenoy,
Qifeng Lu
|
|
Pages: 229-238 |
|
doi>10.1145/1101149.1101191 |
|
Full text: PDF
|
|
This paper argues that a camera sensor network containing heterogeneous elements provides numerous benefits over traditional homogeneous sensor networks. We present the design and implementation of SensEye, a multi-tier network of heterogeneous wireless nodes and cameras. To demonstrate its benefits, we implement a surveillance application using SensEye comprising three tasks: object detection, recognition and tracking. We propose novel mechanisms for low-power low-latency detection, low-latency wakeups, efficient recognition and tracking. Our techniques show that a multi-tier sensor network can reconcile the traditionally conflicting systems goals of latency and energy-efficiency. An experimental evaluation of our prototype shows that, when compared to a single-tier prototype, our multi-tier SensEye can achieve an order of magnitude reduction in energy usage while providing comparable surveillance accuracy.
|
|
|
Physics-motivated features for distinguishing photographic images and computer graphics |
| |
Tian-Tsong Ng,
Shih-Fu Chang,
Jessie Hsu,
Lexing Xie,
Mao-Pei Tsui
|
|
Pages: 239-248 |
|
doi>10.1145/1101149.1101192 |
|
Full text: PDF
|
|
The increasing photorealism of computer graphics has made computer graphics a convincing form of image forgery. Therefore, classifying photographic images and photorealistic computer graphics has become an important problem for image forgery detection. In this paper, we propose a new geometry-based image model, motivated by the physical image generation process, to tackle the above-mentioned problem. The proposed model reveals certain physical differences between the two image categories, such as the gamma correction in photographic images and the sharp structures in computer graphics. For the problem of image forgery detection, we propose two levels of image authenticity definition, i.e., imaging-process authenticity and scene authenticity, and analyze our technique against these definitions. Such a definition is important for making the concept of image authenticity computable. Apart from offering physical insights, our technique with a classification accuracy of 83.5% outperforms those in the prior work, i.e., wavelet features at 80.3% and cartoon features at 71.0%. We also consider a recapturing attack scenario and propose a counter-attack measure. In addition, we constructed a publicly available benchmark dataset with images of diverse content and computer graphics of high photorealism.
|
|
|
Semantic manifold learning for image retrieval |
| |
Yen-Yu Lin,
Tyng-Luh Liu,
Hwann-Tzong Chen
|
|
Pages: 249-258 |
|
doi>10.1145/1101149.1101193 |
|
Full text: PDF
|
|
Learning the user's semantics for CBIR involves two different sources of information: the similarity relations entailed by the content-based features, and the relevance relations specified in the feedback. Given that, we propose an augmented relation embedding (ARE) to map the image space into a semantic manifold that faithfully grasps the user's preferences. Besides ARE, we also look into the issues of selecting a good feature set for improving the retrieval performance. With these two efforts we have established a system that yields far better results than those previously reported. Overall, our approach can be characterized by three key properties: 1) The framework uses one relational graph to describe the similarity relations, and the other two to encode the relevant/irrelevant relations indicated in the feedback. 2) With the relational graphs so defined, learning a semantic manifold can be transformed into solving a constrained optimization problem, and is reduced to the ARE algorithm accounting for both the representation and the classification points of view. 3) An image representation based on augmented features is introduced to couple with the ARE learning. The use of these features is significant in capturing the semantics concerning different scales of image regions. We conclude with experimental results and comparisons to demonstrate the effectiveness of our method.
|
|
|
DEMONSTRATION SESSION: Video demonstrations and visions |
|
|
|
|
How speech/text alignment benefits web-based learning |
| |
Sheng-Wei Li,
Hao-Tung Lin,
Herng-Yow Chen
|
|
Pages: 259-260 |
|
doi>10.1145/1101149.1101195 |
|
Full text: PDF
|
|
This demonstration presents an integrated web-based synchronized scenario for many-to-one cross-media correlations between speech (an EFL, English as Foreign Language, lecture with free-style lecturing behaviors) and the corresponding textual content. The analysis/presentation of the temporal correlations enables vivid web-based language learning through the interactive functions: browsing speech via content, word-by-word pointer guidance, synchronized scrolling/highlighting, and listening training mode. We regularly analyze and repackage the multimedia content of VoA (Voice of America) [1], ICRT (International Community Radio Taipei) [2], and Online Lectures in our University [3]. Subjective experiments indicate that this repackaged synchronized speech/text content facilitates learning for EFL learners.
|
|
|
My digital photos: where and when? |
| |
Neil O'Hare,
Cathal Gurrin,
Hyowon Lee,
Noel Murphy,
Alan F. Smeaton,
Gareth J.F. Jones
|
|
Pages: 261-262 |
|
doi>10.1145/1101149.1101196 |
|
Full text: PDF
|
|
In recent years digital cameras have seen an enormous rise in popularity, leading to a huge increase in the quantity of digital photos being taken. This brings with it the challenge of organising these large collections. We present work which organises personal digital photo collections based on date/time and GPS location, which we believe will become a key organisational methodology over the next few years as consumer digital cameras evolve to incorporate GPS and as cameras in mobile phones spread further. The accompanying video illustrates the results of our research into digital photo management tools; it contains a series of screen and user interactions highlighting how a user utilises the tools we are developing to manage a personal archive of digital photos.
|
|
|
Post-bit: embodied video contents on tiny stickies |
| |
Takashi Matsumoto,
Tony Dunnigan,
Maribeth Back
|
|
Pages: 263-264 |
|
doi>10.1145/1101149.1101197 |
|
Full text: PDF
|
|
Post-Bit is a small e-paper device modeled after paper Post-Its®. We explored and designed interfaces to handle multi-media contents with paper-like manipulations using this e-paper device. The functions of each Post-Bit combined the affordance of physical tiny sticky memos and digital handling of information. At this stage of the design, we have prototyped two features of the interface: connecting computer-based workspaces and physical workspaces (using a function called Drop-Beyond-Drag), and tangible and tactile operation of multi-media contents. In this paper, we present the integrated design and functionality of the Post-Bit system's main components as shown in the video scenario.
|
|
|
Natural video browsing |
| |
Cai-Zhi Zhu,
Tao Mei,
Xian-Sheng Hua
|
|
Pages: 265-266 |
|
doi>10.1145/1101149.1101198 |
|
Full text: PDF
|
|
In this demonstration, we show a novel system, Video Booklet, which enables natural personal video browsing and searching. First, representative thumbnails of video segments are selected and reshaped by a set of pre-trained personalized shape templates, and then printed out on a real booklet. When we want to watch the segment indicated by a certain thumbnail in the booklet, we are able to use camera phones or similar devices to capture the corresponding thumbnail, and send it to a computer via wireless network. Thereafter, the target thumbnail is accurately located by a Self-Trained Active Shape Model algorithm, and then the distortion of the captured image is corrected. Finally the Video Booklet system will automatically find the most similar thumbnail to the corrected one and begin to play the corresponding segment in the video library for us. Thereby, Video Booklet builds a seamless bridge between digital videos and analog albums.
|
|
|
MMM2: mobile media metadata for media sharing |
| |
Marc Davis,
John Canny,
Nancy Van House,
Nathan Good,
Simon King,
Rahul Nair,
Carrie Burgener,
Bruce Rinehart,
Rachel Strickland,
Guy Campbell,
Scott Fisher,
Nick Reid
|
|
Pages: 267-268 |
|
doi>10.1145/1101149.1101199 |
|
Full text: PDF
|
|
As cameraphones become the dominant platform for consumer multimedia capture worldwide, multimedia researchers are faced both with the challenge of how to help users manage the billions of photographs they are collectively producing and the opportunity to leverage cameraphones' ability to automatically capture temporal, spatial, and social contextual metadata to help manage consumer multimedia content. In our Mobile Media Metadata 2 (MMM2) prototype, we apply collaborative filtering techniques to automatically gathered contextual metadata to infer the likely sharing recipients for photos captured on cameraphones. We show that while current cameraphone sharing interfaces are fraught with difficulty, it is possible to use a context-aware approach to make the sharing of cameraphone photos simpler and more satisfying for users. Based on our analysis of the relative contributions of different cameraphone sensors to predicting the likely recipients for photos, we find that, for our user population, the temporal context of photo capture is highly predictive of photo sharing behavior.
|
|
|
Media gallery TV: view and shop your photos on interactive digital television |
| |
Sabine Thieme,
Ansgar Scherp,
Melanie Albrecht,
Susanne Boll
|
|
Pages: 269-270 |
|
doi>10.1145/1101149.1101200 |
|
Full text: PDF
|
|
In this paper, we present the Media Gallery, an MHP-based interactive multimedia application on digital TV. This application allows customers to view and order their digital photos and to order physical prints and fun products from these digital photos directly from the TV. The Media Gallery opens a new distribution channel and market opportunity for the photo finisher and gives customers a platform to comfortably view and order their digital images directly on their TV.
|
|
|
POSTER SESSION: Poster 1: systems track |
|
|
|
|
Content-adaptive transmission of reconstructed soccer goal events over low bandwidth networks |
| |
Qing Tang,
Irena Koprinska,
Jesse S. Jin
|
|
Pages: 271-274 |
|
doi>10.1145/1101149.1101202 |
|
Full text: PDF
|
|
This paper presents a content-adaptive system for streaming reconstructed soccer goal events over networks with bandwidth limited to 1.5Mbps or below. The reconstruction module analyzes a soccer video to produce a corresponding panoramic field model with localized motion trajectories, providing tactical analysis that helps users better comprehend the goal events. The transmission module provides 3 schemes for different bandwidth conditions (150Kbps-1.5Mbps), (56-150Kbps), and (<56Kbps). Each scheme can choose dynamically the video content, viewing quality and where the reconstruction happens based on the network conditions and computational power of connected devices. We demonstrate the effectiveness and scalability of our system in experimental results of adaptive transmission of two soccer goal events under three typical bandwidth conditions.
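The three bandwidth ranges named in the abstract can be read as a simple tier lookup. The following is a minimal sketch, not the authors' implementation: the function name `select_scheme`, the tier labels, and the handling of the exact boundary values (56 and 150 Kbps) and of bandwidth above 1.5 Mbps are all assumptions made for illustration.

```python
def select_scheme(bandwidth_kbps: float) -> str:
    """Map a measured bandwidth (in Kbps) to one of three transmission tiers.

    The tier ranges come from the abstract; the labels and the treatment of
    boundary values are hypothetical.
    """
    if bandwidth_kbps < 0:
        raise ValueError("bandwidth must be non-negative")
    if bandwidth_kbps < 56:
        return "low"      # below 56 Kbps
    if bandwidth_kbps < 150:
        return "medium"   # 56 to 150 Kbps
    # 150 Kbps to 1.5 Mbps; faster links are assumed to fall back to this tier
    return "high"


if __name__ == "__main__":
    for kbps in (40, 100, 1000):
        print(kbps, select_scheme(kbps))
```

In the system described, each tier would additionally choose the video content, viewing quality, and where reconstruction happens; those decisions depend on details the abstract does not give and are left out here.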
|
|
|
A new selection method for H.264 based fine granular scalable video coding |
| |
Won-Hyuck Yoo,
Jihun Cha,
Won-Sik Jeong,
Kyuheon Kim,
Gwang Hoon Park
|
|
Pages: 275-278 |
|
doi>10.1145/1101149.1101203 |
|
Full text: PDF
|
|
In this paper, we introduce a new selection method for H.264 based Fine Granular Scalable video coding. It uses the temporal-prediction data inside the enhancement layer only when those data can significantly reduce temporal redundancy, thereby improving the overall coding efficiency by minimizing drift errors. Simulation results show that the proposed scheme has 1 to 3 dB better coding efficiency than the H.264-based FGS coding scheme.
|
|
|
JADE: jabber-based authoring in distributed environments |
| |
Andrew Roczniak,
Abdulmotaleb El Saddik
|
|
Pages: 279-282 |
|
doi>10.1145/1101149.1101204 |
|
Full text: PDF
|
|
We present our initial results in developing a framework for collaborative multimedia authoring tools. This research is motivated by the lack of tools that take into account consumers' quality of experience. By mapping factors that have an impact on the quality of experience into requirements, we are developing a framework for tools that allow retrieval and manipulation of multimedia objects, and collaborative authoring of multimedia documents based on the Jabber set of protocols.
|
|
|
Streaming with causality: a practical approach |
| |
Cezar Pleşca,
Romulus Grigoraş,
Philippe Quéinnec,
Gérard Padiou
|
|
Pages: 283-286 |
|
doi>10.1145/1101149.1101205 |
|
Full text: PDF
|
|
Highly interactive collaborative streaming applications express the need for causality. Solutions exist, but we argue that more work needs to be done, especially from a perceptual point of view. The key question is: given the current state of the Internet and the perceptual tolerance of causal desynchronization, does causality make any difference? This paper proposes a practical answer to this question by comparing different solutions. We support this comparison by producing video results for a live streaming scenario on an experimental platform. Further, this paper proposes a novel approach for handling causality in multimedia and shows that it can perform better than Δ-causality, usually considered the best solution.
|
|
|
A peer-to-peer network for live media streaming using a push-pull approach |
| |
Meng Zhang,
Jian-Guang Luo,
Li Zhao,
Shi-Qiang Yang
|
|
Pages: 287-290 |
|
doi>10.1145/1101149.1101206 |
|
Full text: PDF
|
|
In this paper, we present an unstructured peer-to-peer network called GridMedia for live media streaming that employs a push-pull approach. Each node in GridMedia randomly selects its neighbors in the overlay and uses a push-pull method to fetch data from the neighbors. The pull mode in the unstructured overlay, which is inherently robust, works well with the high churn rate of P2P environments, while the push mode efficiently reduces the accumulated latency observed at user nodes. A practical system based on this framework has been developed. The performance evaluation of our system, deployed on PlanetLab [8], demonstrates that the push-pull method in GridMedia achieves good quality even at a high group change rate. Furthermore, our system was adopted by CCTV to broadcast the Gala Evening for Spring Festival 2005 through the Internet, and it attracted more than 500,000 users all over the world that night, with a peak of 15,239 concurrent users.
|
|
|
Power-aware bandwidth and stereo-image scalable audio decoding |
| |
Wendong Huang,
Ye Wang,
Samarjit Chakraborty
|
|
Pages: 291-294 |
|
doi>10.1145/1101149.1101207 |
|
Full text: PDF
|
|
We propose a new workload-scalable audio decoding scheme that would enable users to control the tradeoff between playback quality and power consumption in battery-powered portable audio players. Our objective is to give users control at the decoder side, similar to the Long Play (LP) recording mode at the encoder side in many media recording devices. The main contribution of this paper is a proposal for a Bandwidth and Stereo-image Scalable (BSS) decoding scheme for single-layer audio formats such as MP3. The proposed scheme is based on an analysis of the perceptual relevance of different audio components in the compressed bitstream. The bandwidth and stereo-image scalability directly translates into scalability in terms of the computational workload generated by the decoder. This can be exploited by a voltage/frequency scalable processor to save energy and prolong the battery life.
|
|
|
TrustStream: a novel secure and scalable media streaming architecture |
| |
Hao Yin,
Chuang Lin,
Feng Qiu,
Xuening Liu,
Dapeng Wu
|
|
Pages: 295-298 |
|
doi>10.1145/1101149.1101208 |
|
Full text: PDF
|
|
Streaming media over networks has gained renewed interest recently due to the emergence of IP-TV and mobile TV. The success of commercial media streaming systems critically depends on two important capabilities, namely, 1) scalability in distributing media content to diverse clients, and 2) security management of the media and the systems. However, existing media streaming systems such as content distribution networks (CDN) and Peer-to-Peer (P2P) networks lack either security or scalability. In this paper, we propose a novel secure and scalable media streaming architecture, called TrustStream. Our architecture combines the best features of CDN and P2P networks to achieve unprecedented security, scalability, and a certain quality of service simultaneously. Our experimental results demonstrate the advantages of TrustStream.
|
|
|
Using offline bitstream analysis for power-aware video decoding in portable devices |
| |
Yicheng Huang,
Samarjit Chakraborty,
Ye Wang
|
|
Pages: 299-302 |
|
doi>10.1145/1101149.1101209 |
|
Full text: PDF
|
|
Dynamic voltage/frequency scheduling algorithms for multimedia applications have recently been a subject of intensive research. Many of these algorithms use control-theoretic feedback techniques to predict the future execution demand of an application based on the demand in the recent past. Such techniques suffer from two major disadvantages: (i) they are computationally expensive, and (ii) it is difficult to give performance or quality-of-service guarantees based on these techniques (since the predictions can occasionally turn out to be incorrect). To address these shortcomings, in this paper we propose a completely new approach for dynamic voltage and frequency scaling. Our technique is based on an offline bitstream analysis of multimedia files. Based on this analysis, we insert metadata information describing the computational demand that will be generated when decoding the file. Such bitstream analysis and metadata insertion can be done when the multimedia file is being downloaded into a portable device from a desktop computer. In this paper we illustrate this technique using the MPEG-2 decoder application. We show that the amount of metadata that needs to be inserted is a very small fraction of the total size of the video clip and it can lead to significant energy savings. The metadata inserted will typically consist of the frequency value at which the processor needs to be run at different points in time during the decoding process. Lastly, in contrast to runtime prediction-based techniques, our scheme can be used to provide performance and quality-of-service guarantees and at the same time avoids any runtime computation overhead.
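The idea of precomputing frequency metadata offline can be sketched as follows. This is a minimal illustration, not the authors' method: the per-segment cycle counts, the fixed per-segment deadline, the list of available frequency levels, and the name `frequency_schedule` are all hypothetical. For each segment it picks the lowest available frequency that still finishes decoding before the deadline.

```python
def frequency_schedule(cycles_per_segment, deadline_s, levels_hz):
    """Return one frequency per segment: the lowest level meeting the deadline.

    cycles_per_segment: decoding cost of each segment, assumed known from an
    offline analysis pass. deadline_s: time budget per segment (hypothetical).
    levels_hz: frequencies the processor supports (hypothetical).
    """
    levels = sorted(levels_hz)
    schedule = []
    for cycles in cycles_per_segment:
        required_hz = cycles / deadline_s
        choice = next((f for f in levels if f >= required_hz), None)
        if choice is None:
            raise ValueError("no available frequency meets the deadline")
        schedule.append(choice)
    return schedule


# Three segments decoded at 25 segments per second on a three-level processor.
levels = [100_000_000, 200_000_000, 400_000_000]
print(frequency_schedule([2_000_000, 9_000_000, 5_000_000], 1 / 25, levels))
```

Because the schedule is computed before playback, the decoder only has to read one value per segment at run time, which is the property the abstract contrasts with runtime prediction.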
|
|
|
Supporting multi-party voice-over-IP services with peer-to-peer stream processing |
| |
Xiaohui Gu,
Zhen Wen,
Philip S. Yu,
Zon-Yin Shae
|
|
Pages: 303-306 |
|
doi>10.1145/1101149.1101210 |
|
Full text: PDF
|
|
Multi-party voice-over-IP (MVoIP) services provide economical and natural group communication mechanisms for many emerging applications such as on-line gaming, distance collaboration, and tele-immersion. In this paper, we present a novel peer-to-peer (P2P) stream processing system called peerTalk to provide resource-efficient and failure-resilient MVoIP services. Unlike previous work, our solution is fully distributed and self-organizing without requiring specialized servers or IP multicast support. In particular, we decouple the stream processing in MVoIP services into two phases: (1) an aggregation phase that mixes audio streams from active speakers into a single stream; and (2) a distribution phase that distributes the mixed audio stream to all listeners. The decoupled model allows us to optimize and adapt the P2P stream mixing and distribution processes separately. Specifically, we can adaptively spread the stream mixing workload among resource-constrained peer hosts according to current speaking activities. We have implemented a prototype of the peerTalk system and conducted experiments in real-world wide-area networks. The results show that peerTalk can achieve lower resource contention and better service quality than previous common solutions.
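The two decoupled phases can be pictured with a toy example. This sketch is an assumption-laden illustration rather than the peerTalk design: it runs in one process, represents audio frames as lists of 16-bit integer samples, mixes by summing with clipping, and uses the hypothetical names `mix_streams` and `distribute`.

```python
def mix_streams(frames):
    """Aggregation phase: sum sample-aligned frames from the active speakers.

    Sums are clipped to the signed 16-bit range (an illustrative choice).
    """
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("frames must have equal length")
    return [max(-32768, min(32767, sum(samples))) for samples in zip(*frames)]


def distribute(mixed, listeners):
    """Distribution phase: hand a copy of the single mixed frame to each listener."""
    return {listener: list(mixed) for listener in listeners}


mixed = mix_streams([[100, -200, 30000], [50, 100, 10000]])
print(distribute(mixed, ["alice", "bob"]))
```

Keeping the two functions separate mirrors the point made in the abstract: the mixing step and the delivery step can be placed and tuned independently.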
|
|
|
rStream: resilient peer-to-peer streaming with rateless codes |
| |
Chuan Wu,
Baochun Li
|
|
Pages: 307-310 |
|
doi>10.1145/1101149.1101211 |
|
Full text: PDF
|
|
The inherent instability and unreliability of peer-to-peer networks introduce several fundamental engineering challenges to multimedia streaming over peer-to-peer networks. First, multimedia streaming sessions need to be resilient to the volatile network dynamics in peer-to-peer networks. Second, they need to take full advantage of the existing bandwidth capacities, by minimizing the delivery of redundant content during streaming. In this paper, we propose to use a recent coding technique, referred to as rateless codes, to code the multimedia bitstreams before they are transmitted over peer-to-peer links. The use of rateless codes eliminates the requirements of content reconciliation, as well as the risks of delivering redundant content over the network. It also helps the streaming sessions to adapt to volatile network dynamics. Our preliminary simulation results demonstrate the validity and effectiveness of our new contribution, as compared to traditional solutions with or without erasure codes.
|
|
|
Impact of incentive mechanisms on quality of experience |
| |
Andrew Roczniak,
Abdulmotaleb El Saddik
|
|
Pages: 311-314 |
|
doi>10.1145/1101149.1101212 |
|
Full text: PDF
|
|
Since entities participating in P2P networks are usually autonomous and therefore free to decide on their level of participation, mechanisms to resolve conflicts between individual and collective rationality are needed. How can implementations of such mechanisms be compared? This paper introduces a qualitative reference framework, highlighting essential elements and major design decisions in any implementation of incentive mechanisms. In the context of multimedia applications built on top of P2P architectures, the reference framework can be used in assessing the impact on the quality of experience (QoE) when incentive mechanisms are included.
|
|
|
POSTER SESSION: Poster 2: applications track |
|
|
|
|
ClickRemoval: interactive pinpoint image object removal |
| |
Frank Nielsen,
Richard Nock
|
|
Pages: 315-318 |
|
doi>10.1145/1101149.1101214 |
|
Full text: PDF
|
|
In this paper, we explore the problem of deleting objects in still pictures. We present an interactive system based on an intuitive user-friendly interface for removing undesirable objects in digital pictures. To erase an object in an image, a user indicates which object is to be removed by simply pinpointing it with the mouse cursor. As the mouse cursor rolls over the image, the border of the currently implicitly selected object is highlighted, providing visual feedback. In cases where the computer-segmented area does not match the user's perception of the object, users can further provide a few inside/outside object cues by clicking on a small number of object or nonobject pixels. A small number of such cues is generally enough to reach a correct matching, even for complex textured images. Afterwards, the user removes the object by clicking the left mouse button, and a hole-filling technique is initiated to generate a seamless background portion. Our image manipulation system consists of two components: (i) fully automatic or partially user-steered image segmentation based on an improved fast statistical region-growing segmentation, and (ii) texture synthesis or image inpainting of irregularly shaped hole regions. Experiments on a variety of photographs display the ability of the system to handle complex scenes with highly textured objects.
|
|
|
Motion picture inpainting on aged films |
| |
Timothy K. Shih,
Rong-Chi Chang,
Yu-Ping Chen
|
|
Pages: 319-322 |
|
doi>10.1145/1101149.1101215 |
|
Full text: PDF
|
|
Video inpainting uses spatial-temporal information to repair defects such as spikes and lines on aged films. We propose a series of new algorithms based on adjustable thresholds to repair different varieties of aged films. The main contribution is an automatic spike and dirt detection mechanism. We prove that once an appropriate threshold is decided by the author, most damage in an aged video clip can be detected. In addition, the repairing procedure first estimates temporal information and obtains replacement blocks among several frames. Spatial information is then used to repair damage that cannot be fixed by temporal information due to fast motion. The results are visually pleasing, with most defects removed.
|
|
|
Implementation of a mobile MPEG-21 peer |
| |
Shane Lauf,
Ian Burnett
|
|
Pages: 323-326 |
|
doi>10.1145/1101149.1101216 |
|
Full text: PDF
|
|
The MPEG-21 Multimedia Framework aims to realize interoperable access to content across heterogeneous networks and devices. Within the Framework, the concept of Digital Items is introduced as a structured digital representation for multimedia. To demonstrate the applicability of MPEG-21 to seamless multimedia interactions on limited platforms, the authors have produced an implementation of MPEG-21 for a mobile device, in Java 2 Micro Edition (J2ME). This paper examines the design and implementation of the Mobile MPEG-21 Peer, including a specialized architecture and processing mechanisms specific to the J2ME platform.
|
|
|
Exploiting self-adaptive posture-based focus estimation for lecture video editing |
| |
Feng Wang,
Chong-Wah Ngo,
Ting-Chuen Pong
|
|
Pages: 327-330 |
|
doi>10.1145/1101149.1101217 |
|
Full text: PDF
|
|
Head pose plays a special role in estimating a presenter's focus and actions for lecture video editing. This paper presents an efficient and robust head pose estimation algorithm to cope with the new challenges arising in the content management of lecture videos. These challenges include speed requirements, low video quality, varied presenting styles and complex settings in modern classrooms. Our algorithm is based on a robust hierarchical representation of skin color clustering and a set of pose templates that are automatically trained. Contextual information is also considered to refine pose estimation. Most importantly, we propose an online learning approach to deal with different presenting styles, which has not been addressed before. We show that the proposed approach can significantly improve the performance of pose estimation. In addition, we describe how posture is used in focus estimation for lecture video editing by integrating it with gesture.
|
|
|
IMAGINATION: a robust image-based CAPTCHA generation system |
| |
Ritendra Datta,
Jia Li,
James Z. Wang
|
|
Pages: 331-334 |
|
doi>10.1145/1101149.1101218 |
|
Full text: PDF
|
|
We propose IMAGINATION (IMAge Generation for INternet AuthenticaTION), a system for the generation of attack-resistant, user-friendly, image-based CAPTCHAs. In our system, we produce controlled distortions on randomly chosen images and present them to the user for annotation from a given list of words. The distortions are performed in a way that satisfies the incongruous requirements of low perceptual degradation and high resistance to attack by content-based image retrieval systems. Word choices are carefully generated to avoid ambiguity as well as to avoid attacks based on the choices themselves. Preliminary results demonstrate the attack-resistance and user-friendliness of our system compared to text-based CAPTCHAs.
|
|
|
[hid] toolkit: a unified framework for instrument design |
| |
Hans-Christoph Steiner
|
|
Pages: 335-338 |
|
doi>10.1145/1101149.1101219 |
|
Full text: PDF
|
|
The [hid] toolkit is a set of software objects for designing gestural instruments. All too frequently, computer performers are tied to the keyboard/mouse/monitor model, narrowly constraining the range of possible gestures. A multitude of off-the-shelf input devices are readily available, making it easy to utilize a broader range of gestures. Human Interface Devices (HIDs) such as joysticks, tablets, and gamepads are cheap and can be good musical controllers, with some even providing haptic feedback. The [hid] toolkit provides a unified, consistent framework for getting gestural data from these devices, controlling the feedback, and mapping this data to the desired output. The [hid] toolkit is built in Pd, which provides an ideal platform for this work, combining the ability to synthesize and control audio, video, and other media. The addition of easy access to gestural data allows for rapid prototyping. A usable environment also makes computer music instrument design accessible to novices.
|
|
|
Hierarchical voting classification scheme for improving visual sign language recognition |
| |
Liang-Guo Zhang,
Xilin Chen,
Chunli Wang,
Wen Gao
|
|
Pages: 339-342 |
|
doi>10.1145/1101149.1101220 |
|
Full text: PDF
|
|
As one of the important research areas of multimodal interaction, sign language recognition (SLR) has attracted increasing interest. In SLR, especially with medium or large vocabularies, it is usually difficult or impractical to collect enough training data. Thus, how to improve recognition with limited training samples is a significant issue. In this paper, we propose a simple but effective hierarchical voting classification (HVC) scheme for improving visual SLR, which makes efficient use of limited training data. The key idea of the HVC scheme is similar to, but not the same as, the Bagging technique. First, it constructs several training sets from the original training set in a combinatorial fashion to generate the corresponding ensemble of continuous hidden Markov models (CHMMs). Then, it determines the ensemble output by an appropriate local voting strategy. Finally, it obtains the final recognition result by global voting. Experimental results show that the HVC scheme outperforms the conventional single CHMM approach in terms of recognition accuracy on limited training data.
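The local-then-global voting structure can be sketched in a few lines. This is an illustration under stated assumptions: each inner list holds the labels predicted by one group of classifiers, both levels use plain majority voting, and the names `majority` and `hierarchical_vote` are hypothetical. The abstract does not specify the actual voting rules or how the classifier groups are formed.

```python
from collections import Counter


def majority(labels):
    """Return the most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def hierarchical_vote(groups):
    """Vote locally inside each group, then vote globally over the group winners."""
    local_winners = [majority(group) for group in groups]
    return majority(local_winners)


# Three groups of classifier outputs for one sign sample.
groups = [["A", "A", "B"], ["B", "B", "A"], ["A", "C", "A"]]
print(hierarchical_vote(groups))
```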
|
|
|
Real time advertisement insertion in baseball video based on advertisement effect |
| |
Yiqun Li,
Kong Wah Wan,
Xin Yan,
Changsheng Xu
|
|
Pages: 343-346 |
|
doi>10.1145/1101149.1101221 |
|
Full text: PDF
|
|
In this paper, we propose a novel method to detect baseball video scenes for commercial advertisement insertion. The method uses criteria based on better advertisement effect to generate a set of rules. From these rules, the proper timing (starting time and ending time) and location are identified to insert the advertisement in the broadcast baseball video automatically. The proper timing is detected based on the consistent existence of simple background objects in the video for a period of time. The proper location is identified by less-informative-region detection in the video. The rules ensure that the critical information of the video is not blocked by the advertisement and that the advertisement is always stable, clear and viewable on the stationary background of the video. The experimental results are encouraging.
|
|
|
Providing on-demand sports video to mobile devices |
| |
Qingshan Liu,
Zhigang Hua,
Cunxun Zang,
Xiaofeng Tong,
Hanqing Lu
|
|
Pages: 347-350 |
|
doi>10.1145/1101149.1101222 |
|
Full text: PDF
|
|
This paper introduces a system for providing on-demand sports video to mobile devices, which has two main contributions. First, we construct an infrastructure for extracting and delivering the highlights instead of the whole sports videos to mobile clients, which can significantly reduce the bandwidth consumption. Second, we design an advanced UI for the mobile clients to effectively browse and interact with the video highlights. To validate the practicality and effectiveness of this system, we conduct experiments on several real soccer videos. The results demonstrate that bandwidth consumption can be reduced by more than 65%. Moreover, the initial user study results show that mobile users can interact effectively with the interface to seek or navigate sports videos.
|
|
|
An adaptive edge detection based colorization algorithm and its applications |
| |
Yi-Chin Huang,
Yi-Shin Tung,
Jun-Cheng Chen,
Sung-Wen Wang,
Ja-Ling Wu
|
|
Pages: 351-354 |
|
doi>10.1145/1101149.1101223 |
|
Full text: PDF
|
|
Colorization is a computer-assisted process for adding colors to grayscale images or movies. It can be viewed as a process for assigning a three-dimensional color vector (YUV or RGB) to each pixel of a grayscale image. In previous works, given some color hints, the resultant chrominance value varies linearly with the luminance. However, existing methods may introduce obvious color bleeding, especially around region boundaries. Extra human assistance is then needed to fix these artifacts, which limits their practicality. To address this issue, we introduce a general and fast colorization methodology aided by an adaptive edge detection scheme. By extracting reliable edge information, the proposed approach can prevent the colorization process from bleeding over object boundaries. Next, we investigate integrating the proposed fast colorization scheme into a scribble-based colorization system, a modified color transferring system, and a novel chrominance coding approach. In our experiments, each system exhibits obvious improvement compared with the corresponding previous work.
|
|
|
Recognition of hands-free speech and hand pointing action for conversational TV |
| |
Yasuo Ariki,
Tetsuya Takiguchi,
Atsushi Sako
|
|
Pages: 355-358 |
|
doi>10.1145/1101149.1101224 |
|
Full text: PDF
|
|
In this paper, we propose the structure and components of a conversational television set (TV) that we can ask anything about the broadcast content and that returns the information of interest. The conversational TV is composed of two types of processing: back-end processing and front-end processing. In back-end processing, broadcast content is analyzed using speech and video recognition techniques, and both the metadata and the structure are extracted. In front-end processing, human speech and hand actions are recognized to understand the user's intention. We show some applications being developed for this conversational TV with multi-modal interaction, such as word explanation, human information retrieval, and event retrieval in soccer and baseball game videos with contextual awareness.
|
|
|
A corpus-based singing voice synthesis system for Mandarin Chinese |
| |
Cheng-Yuan Lin,
Tzu-Ying Lin,
J.-S. Roger Jang
|
|
Pages: 359-362 |
|
doi>10.1145/1101149.1101225 |
|
Full text: PDF
|
|
In this paper, we introduce the design and implementation of a corpus-based singing voice synthesis (SVS) system for Mandarin Chinese. We propose design rules for three corpora for singing voice synthesis. We then define two distance functions and apply the Viterbi search algorithm to identify the optimal combinations of synthesis units from the three corpora. For better performance, several sound effects are combined with the synthesized outputs. Finally, we conduct a listening experiment to demonstrate the feasibility of this system.
|
|
|
Dynamic shot suggestion filtering for home video based on user performance |
| |
Brett Adams,
Svetha Venkatesh
|
|
Pages: 363-366 |
|
doi>10.1145/1101149.1101226 |
|
Full text: PDF
|
|
This paper presents novel additions to our existing amateur media creation framework. The framework provides at-capture guidance to enable home movie makers to realize their aesthetic and narrative goals, as well as automation of post-production editing. A common problem with the amateur filming context is its contingent nature, which often results in the failure to gain footage vital to the user's goals, even with at-capture software embedding. Accordingly, we have modelled minimizing the difference between target and captured footage at a given time during filming as a probability distribution divergence problem. We apply two policies of feedback to the user on their performance: passive communication via a suggestion desirability measure, and active filtering of undesirable suggestions. We demonstrate the framework using each policy with a simulation of various user and filming situations, with promising results.
|
|
|
Creating MAGIC: system for generating learning object metadata for instructional content |
| |
Ying Li,
Chitra Dorai,
Robert Farrell
|
|
Pages: 367-370 |
|
doi>10.1145/1101149.1101227 |
|
Full text: PDF
|
|
This paper presents our latest work on building a system called MAGIC (Metadata Automated Generation for Instructional Content) that will automatically identify segments and generate critical metadata conforming with the SCORM (Sharable Content Object Reference Model) standard for instructional content. Various content analytics engines are utilized to automatically generate key metadata, which include audiovisual analysis modules that recognize semantic sound categories and identify narrators and informative text segments; text analysis modules that extract title, keywords and summary from text documents; and a text categorizer that classifies a document according to a pre-generated taxonomy. With MAGIC, instructional content developers can generate and edit SCORM metadata to richly describe their content asset for use in distributed learning applications. Experimental results obtained from collections of real data from targeted user communities will be presented.
|
|
|
Cooking navi: assistant for daily cooking in kitchen |
| |
Reiko Hamada,
Jun Okabe,
Ichiro Ide,
Shin'ichi Satoh,
Shuichi Sakai,
Hidehiko Tanaka
|
|
Pages: 371-374 |
|
doi>10.1145/1101149.1101228 |
|
Full text: PDF
|
|
We are developing a cooking navigation system, which helps even a novice user to cook several recipes in parallel without failure, while further improving an advanced user's skill. To realize this, the system optimizes the cooking procedure considering the following restrictions: (1) duration of cooking, (2) accuracy of cooking, and (3) learning effect, by providing appropriate instructions to users at the right time, making full use of multimedia information. Users should be able to cook perfectly and comfortably just by following the text, video and audio provided by the system. According to the result of a preliminary experiment, all users, from novices to experienced cooks, could finish two dishes in parallel while enjoying the cooking very much. The result of a questionnaire shows the effectiveness of the multimedia navigation that we propose.
|
|
|
Personal media sharing and authoring on the web |
| |
Xian-Sheng Hua,
Shipeng Li
|
|
Pages: 375-378 |
|
doi>10.1145/1101149.1101229 |
|
Full text: PDF
|
|
In this paper, we propose a novel system working on the Web for personal media sharing and authoring. Three primary technologies enable this end-to-end system: scalable video coding, intelligent multimedia content analysis, and template-based media authoring. The scalable video codec tackles the issue of huge data transmission, multimedia content analysis facilitates automatic video editing, and the template-based media authoring scheme further reduces the workload of media sharing and authoring. Experiments and a demo system in a real Internet environment show that this novel system is effective.
|
|
|
Automatic generating detail-on-demand hypervideo using MPEG-7 and SMIL |
| |
Tina T. Zhou,
Tom Gedeon,
Jess S. Jin
|
|
Pages: 379-382 |
|
doi>10.1145/1101149.1101230 |
|
Full text: PDF
|
|
Detail-on-demand hypervideo will provide a powerful mechanism to allow viewers to see additional information about video segments through hyperlinks. A large number of tools are devoted to the identification of selectable video objects and the synchronization mechanisms for linking additional information to selectable video objects. We focus here on the automatic generation of additional information and the integration of the additional information with its corresponding selectable video object. We demonstrate a method using the Multimedia Content Description Interface defined in MPEG-7 and the Synchronized Multimedia Integration Language (SMIL) to automatically generate detail-on-demand hypervideos.
|
|
|
Office blogger |
| |
Berna Erol,
Jonathan J. Hull
|
|
Pages: 383-386 |
|
doi>10.1145/1101149.1101231 |
|
Full text: PDF
|
|
The Office Blogger (OBlog) is an experimental prototype of a multimedia appliance that allows an office worker to easily record events, conversations, meetings, pictures and documents, and helps create blog entries from that data. OBlog employs a novel image classification algorithm and automatic post processing to make the information about images more accessible to users. The OBlog system uses an "always-on" design that makes it convenient to use, and it is packaged as a stand-alone system to remove any computational burden from a user's PC. Results show that the image classification algorithm achieves over 90% recall on five image categories commonly encountered by office workers.
|
|
|
Haptic: the new biometrics-embedded media to recognizing and quantifying human patterns |
| |
Mauricio Orozco Trujillo,
Ismail Shakra,
Abdulmotaleb El Saddik
|
|
Pages: 387-390 |
|
doi>10.1145/1101149.1101232 |
|
Full text: PDF
|
|
Authentication for the purposes of security has taken giant strides since the introduction of Biometrics to help identify people by their behavioral and physiological features. From organizations and corporations to educational institutes, electronic resources, and even crime scenes, Biometrics offers a wide application scope to detect fraud attempts. This paper proposes a research path to achieve the task of authenticating users who are working in a haptic-based environment. The field of Biometrics can be divided into two main classes of human features. Birth-given characteristics like fingerprints and facial features cannot be developed or altered by humans. Behavioral characteristics such as hand signature and voice fall into the second class [1]. The work presented in this paper pursues the latter class and specifically studies how a person reacts to using daily devices or tools. The hypothesis that motivated this work is that people's habits in handling devices can be exploited to detect identity.
|
|
|
Attention region selection with information from professional digital camera |
| |
Song Liu,
Liang-Tien Chia,
Deepu Rajan
|
|
Pages: 391-394 |
|
doi>10.1145/1101149.1101233 |
|
Full text: PDF
|
|
Attentive region extraction is a challenging issue for semantic interpretation of image and video content. Successful attentive region extraction greatly facilitates image classification, adaptation, compression and retrieval. Different from the traditional visual attention detection models, we propose a new attentive region extraction method based on the out-of-focus blurring (OFB) technique used by professional photographers. Firstly, we combine metadata in Exchangeable Image File Format (EXIF) with visual features to quickly select professional photographs from an image database. After that, an algorithm is implemented to automatically extract the attentive region from these photographs. This algorithm measures the saliency of individual pixels based on the edge distribution of the images. The experimental results on OFB images show that our approach is able to overcome the contrast map selection problem of traditional visual attention methods and extract the attentive region using OFB information. The attentive region generated by our algorithm has a shape and size similar to the subject of the photograph, which is useful information for searching and retrieving semantically meaningful high-level objects.
|
|
|
POSTER SESSION: Poster 3: content track |
|
|
|
|
Automatic video annotation using ontologies extended with visual information |
| |
Marco Bertini,
Alberto Del Bimbo,
Carlo Torniai
|
|
Pages: 395-398 |
|
doi>10.1145/1101149.1101235 |
|
Full text: PDF
|
|
Classifying video elements according to some pre-defined ontology of the video content domain is a typical way to perform video annotation. Ontologies are defined by establishing relationships between linguistic terms that specify domain concepts at different abstraction levels. However, although linguistic terms are appropriate to distinguish event and object categories, they are inadequate when they must describe specific patterns of events or video entities. Instead, in these cases, pattern specifications can be better expressed through visual prototypes that capture the essence of the event or entity. Therefore, pictorially enriched ontologies, which include both visual and linguistic concepts, can be useful to support video annotation up to the level of detail of pattern specification. This paper presents pictorially enriched ontologies and discusses a solution for their implementation for the soccer video domain. An unsupervised clustering method is proposed in order to create the enriched ontologies by defining visual prototypes representing specific patterns of highlights and adding them as visual concepts to the ontology. An algorithm that uses pictorially enriched ontologies to perform automatic soccer video annotation is proposed and results for typical highlights are presented. Annotation is performed by associating occurrences of events, or entities, with higher level concepts by checking their proximity to visual concepts that are hierarchically linked to higher level semantics.
|
|
|
Early versus late fusion in semantic video analysis |
| |
Cees G. M. Snoek,
Marcel Worring,
Arnold W. M. Smeulders
|
|
Pages: 399-402 |
|
doi>10.1145/1101149.1101236 |
|
Full text: PDF
|
|
Semantic analysis of multimodal video aims to index segments of interest at a conceptual level. In reaching this goal, it requires an analysis of several information streams. At some point in the analysis these streams need to be fused. In this paper, we consider two classes of fusion schemes, namely early fusion and late fusion. The former fuses modalities in feature space, the latter fuses modalities in semantic space. We show by experiment on 184 hours of broadcast video data and for 20 semantic concepts, that late fusion tends to give slightly better performance for most concepts. However, for those concepts where early fusion performs better the difference is more significant.
|
|
|
Detecting group activities using rigidity of formation |
| |
Saad M. Khan,
Mubarak Shah
|
|
Pages: 403-406 |
|
doi>10.1145/1101149.1101237 |
|
Full text: PDF
|
|
Most work in human activity recognition is limited to relatively simple behaviors like sitting down, standing up or other dramatic posture changes. Very little has been achieved in detecting more complicated behaviors, especially those characterized by the collective participation of several individuals. In this work we present a novel approach to recognizing the class of activities characterized by their rigidity of formation, for example parades of people, airplane flight formations or herds of animals. The central idea is to model the entire group as a collective rather than focusing on each individual separately. We model the formation as a 3D polygon with each corner representing a participating entity. Tracks from the entities are treated as tracks of feature points on the 3D polygon. Based on the rank of the track matrix we can determine if the 3D polygon under consideration behaves rigidly or undergoes non-rigid deformation. Our method is invariant to camera motion and does not require an a priori model or a training phase.
|
|
|
A probabilistic template-based approach to discovering repetitive patterns in broadcast videos |
| |
Peng Wang,
Zhi-Qiang Liu,
Shi-Qiang Yang
|
|
Pages: 407-410 |
|
doi>10.1145/1101149.1101238 |
|
Full text: PDF
|
|
There are usually repetitive sub-segments in broadcast videos, which may be associated with high-level concepts or events, e.g., news footage, repeated scores in basketball. Unsupervised mining techniques provide generic solutions to discovering such temporal patterns in various video genres, which are currently the subject of great interest to researchers working on multimedia content analysis. In this paper, we propose a novel approach to automatically detecting repetitive patterns in a video stream. In this approach, a video stream is first transformed to a symbol sequence via the spectral clustering algorithm. After computing the transition probabilities of any two symbols in temporal evolution, we produce a set of probabilistic templates to characterize the patterns of potential interest. Finally, we verify each probabilistic template by measuring the similarities between the video sub-segments and the template. Evaluations on various sports videos show promising results.
|
|
|
Learning with non-metric proximity matrices |
| |
Gang Wu,
Edward Y. Chang,
Zhihua Zhang
|
|
Pages: 411-414 |
|
doi>10.1145/1101149.1101239 |
|
Full text: PDF
|
|
Many emerging applications formulate non-metric proximity matrices (non-positive semidefinite), and hence cannot fit into the framework of kernel machines. A popular approach to this problem is to transform the spectrum of the similarity matrix so as to generate a positive semidefinite kernel matrix. In this paper, we explore four representative transformation methods: denoise, flip, diffusion, and shift. Theoretically, we discuss a generalization problem where the test data are not available during transformation, and thus propose an efficient algorithm to address the problem of updating the cross-similarity matrix between test and training data. Extensive experiments have been conducted to evaluate the performance of these methods on several real-world (dis)similarity matrices with semantic meanings.
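The abstract names the four spectrum transformations but does not give their formulas. As a rough illustration only, the sketch below applies the definitions commonly used for these names to a list of eigenvalues: clipping negatives to zero (denoise), taking absolute values (flip), exponentiating (diffusion), and adding the magnitude of the smallest eigenvalue (shift). These definitions, the function names, and the `beta` parameter are assumptions, not taken from the paper.

```python
import math

# Each function maps the eigenvalues of a symmetric similarity matrix to
# non-negative values, so that the rebuilt matrix is positive semidefinite.
# The exact definitions are assumed (commonly used forms), not from the paper.

def denoise(eigs):
    """Set negative eigenvalues to zero."""
    return [max(0.0, lam) for lam in eigs]

def flip(eigs):
    """Replace each eigenvalue by its absolute value."""
    return [abs(lam) for lam in eigs]

def diffusion(eigs, beta=1.0):
    """Exponentiate each eigenvalue; beta is a hypothetical scale parameter."""
    return [math.exp(beta * lam) for lam in eigs]

def shift(eigs):
    """Add the magnitude of the most negative eigenvalue to every eigenvalue."""
    eta = max(0.0, -min(eigs))
    return [lam + eta for lam in eigs]

eigs = [2.0, 0.5, -1.0]  # toy spectrum with one negative eigenvalue
print(denoise(eigs), flip(eigs), shift(eigs))
```

Given an eigendecomposition of the similarity matrix, the transformed matrix would be rebuilt from the original eigenvectors and the transformed eigenvalues.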
|
|
|
Semantic feedback for interactive image retrieval |
| |
Changbo Yang,
Ming Dong,
Farshad Fotouhi
|
|
Pages: 415-418 |
|
doi>10.1145/1101149.1101240 |
|
Full text: PDF
|
|
In this paper we present a semantic image retrieval system with an integrated feedback mechanism. In our system, we propose a novel feedback solution for semantic retrieval: semantic feedback, which allows our system to interact with users directly at the semantic level. The learning process of the semantic feedback substantially improves the image retrieval performance of the proposed system. We demonstrate the effectiveness of our approach with experiments using 5,000 images from the Corel database.
|
|
|
Image region entropy: a measure of "visualness" of web images associated with one concept |
| |
Keiji Yanai,
Kobus Barnard
|
|
Pages: 419-422 |
|
doi>10.1145/1101149.1101241 |
|
Full text: PDF
|
|
We propose a new method to measure the "visualness" of concepts, that is, to what extent concepts have visual characteristics. Knowing which concepts have visually discriminative power is important for image annotation, especially automatic image annotation by an image recognition system, since not all concepts are related to visual contents. Our method performs probabilistic region selection for images which are labeled as concept "X" or "non-X", and computes an entropy measure which represents the "visualness" of concepts. In the experiments, we collected about forty thousand images from the World-Wide Web using the Google Image Search for 150 concepts. We examined which concepts are suitable for annotation of image contents.
|
|
|
To learn representativeness of video frames |
| |
Hong-Wen Kang,
Xian-Sheng Hua
|
|
Pages: 423-426 |
|
doi>10.1145/1101149.1101242 |
|
Full text: PDF
|
|
With the rapid explosion of video data, compact representation of videos is becoming more and more desirable for efficient browsing and communication, which has led to a number of research works on video summarization in recent years. Among these works, summaries based on a set of still frames are frequently studied and applied due to their high compactness. However, the representativeness of the selected frames, which are taken as the compact representation of the video or video segment, has not been well studied. It is observed that frame representativeness is highly related to the following elements: image quality, user attention measure, visual details, and displaying duration. It is also observed that users have a similar tendency in selecting the most representative frame for a certain video segment. In this paper, we developed a method to examine and evaluate the representativeness of video frames based on learning users' perceptive evaluations.
|
|
|
Affect-based indexing and retrieval of films |
| |
Ching Hau Chan,
Gareth J. F. Jones
|
|
Pages: 427-430 |
|
doi>10.1145/1101149.1101243 |
|
Full text: PDF
|
|
Digital multimedia systems are creating many new opportunities for rapid access to content archives. In order to explore these collections using search applications, the content must be annotated with significant features. An important and often overlooked aspect of human interpretation of multimedia data is the affective dimension. Affective labels of content can be extracted automatically from within multimedia data streams. These can then be used for content-based retrieval and browsing. In this study affective features extracted from multimedia audio content are mapped onto a set of keywords with predetermined emotional interpretations. These labels are then used to demonstrate affect-based retrieval on a range of feature films.
|
|
|
Toward emergent representations for video |
| |
Ryan Shaw,
Marc Davis
|
|
Pages: 431-434 |
|
doi>10.1145/1101149.1101244 |
|
Full text: PDF
|
|
Advanced systems for finding, using, sharing, and remixing video require high-level representations of video content. A number of researchers have taken top-down, analytic approaches to the specification of representation structures for video. The resulting schemes, while showing the potential of high-level representations for aiding the retrieval and resequencing of video, have generally proved too complex for mainstream use. In this paper, we propose a bottom-up, emergent approach to developing video representation structures by examining retrieval requests and annotations made by a community of video remixers. Our initial research has found a useful degree of convergence between user-generated indexing terms and query terms, with the salient exception of descriptions of characters' corporeal characteristics.
|
|
|
Region based image annotation through multiple-instance learning |
| |
Changbo Yang,
Ming Dong,
Farshad Fotouhi
|
|
Pages: 435-438 |
|
doi>10.1145/1101149.1101245 |
|
Full text: PDF
|
|
In an annotated image database, keywords are usually associated with images instead of individual regions, which poses a major challenge for any region based image annotation algorithm. In this paper, we propose to learn the correspondence between image regions and keywords through Multiple-Instance Learning (MIL). After a representative image region has been learned for a given keyword, we consider image annotation as a problem of image classification, in which each keyword is treated as a distinct class label. The classification problem is then addressed using the Bayesian framework. The proposed image annotation method is evaluated on an image database with 5,000 images.
|
|
|
Spatio-temporal quality assessment for home videos |
| |
Tao Mei,
Cai-Zhi Zhu,
He-Qin Zhou,
Xian-Sheng Hua
|
|
Pages: 439-442 |
|
doi>10.1145/1101149.1101246 |
|
Full text: PDF
|
|
Compared with the video programs taken by professionals, home videos always have low-quality content resulting from a lack of professional capture skills. In this paper, we present a novel spatio-temporal quality assessment scheme in terms of low-level content features for home videos. In contrast to existing frame-level-based quality assessment approaches, a type of temporal segment of video, the sub-shot, is selected as the basic unit for quality assessment. A set of spatio-temporal artifacts, regarded as the key factors affecting the overall perceived quality (i.e. unstableness, jerkiness, infidelity, blurring, brightness and orientation), are mined from each sub-shot based on the particular characteristics of home videos. The relationship between the overall quality metric and these factors is exploited by three different methods, including user study, factor fusion, and a learning-based scheme. To validate the proposed scheme, we present a scalable quality-based home video summarization system, aiming at achieving the best quality while simultaneously preserving the most informative content. A comparison user study between this system and the attention model based video skimming approach demonstrated the effectiveness of the proposed quality assessment scheme.
|
|
|
Light weight MP3 watermarking method for mobile terminals |
| |
Koichi Takagi,
Shigeyuki Sakazawa
|
|
Pages: 443-446 |
|
doi>10.1145/1101149.1101247 |
|
Full text: PDF
|
|
This paper proposes an MP3 watermarking method that is applicable to a mobile terminal with limited computational resources. Considering that the embedded information is copyright information and metadata, which should be extracted before playing back, the watermark detection process should be executed quickly. However, conventional methods cannot detect a digital watermark at high speed. Thus, this paper proposes that scalefactor values in MP3 data be altered so as not to spoil audio quality. Evaluation tests show that the proposed method is capable of embedding 3 bits/frame of information and detecting it at very high speed without degrading audio quality. Finally, this paper describes an application example of our proposal for authentication with a digital signature.
|
|
|
GPU-assisted decoding of video samples represented in the YCoCg-R color space |
| |
Wesley De Neve,
Dieter Van Rijsselbergen,
Charles Hollemeersch,
Jan De Cock,
Stijn Notebaert,
Rik Van de Walle
|
|
Pages: 447-450 |
|
doi>10.1145/1101149.1101248 |
|
Full text: PDF
|
|
Although pixel shaders were designed for the creation of programmable rendering effects, they can also be used as generic processing units for vector data. In this paper, attention is paid to an implementation of the YCoCg-R to RGB color space transform, as defined in the H.264/AVC Fidelity Range Extensions, by making use of pixel shaders. Our results show that a significant speedup can be achieved by relying on the processing power of the GPU, relative to the CPU. To be more specific, high definition video (1080p), represented in the YCoCg-R color space, could be decoded to RGB at 30 Hz on a PC with an AMD Athlon XP 2800+ CPU, an AGP bus and an NVIDIA GeForce 6800 graphics card, an effort that could not be realized in real-time by the CPU.
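For readers unfamiliar with the transform, the following is a minimal CPU-side sketch of the reversible YCoCg-R lifting steps in plain Python. It is not the paper's pixel-shader implementation; it only illustrates the per-pixel integer arithmetic that such a shader would have to reproduce. The function names are hypothetical.

```python
def rgb_to_ycocg_r(r, g, b):
    """Forward YCoCg-R lifting steps (integer, exactly reversible)."""
    co = r - b
    t = b + (co >> 1)   # >> is an arithmetic (floor) shift, also for negatives
    cg = g - t
    y = t + (cg >> 1)
    return y, co, cg

def ycocg_r_to_rgb(y, co, cg):
    """Inverse YCoCg-R lifting steps: undo the forward steps in reverse order."""
    t = y - (cg >> 1)
    g = cg + t
    b = t - (co >> 1)
    r = b + co
    return r, g, b

# Round trip for a pure red 8-bit pixel.
y, co, cg = rgb_to_ycocg_r(255, 0, 0)
print((y, co, cg), ycocg_r_to_rgb(y, co, cg))
```

Because every step is an integer add, subtract or shift, the round trip is lossless; the chroma components Co and Cg need one extra bit of range compared with the RGB inputs.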
|
|
|
Learning an image-word embedding for image auto-annotation on the nonlinear latent space |
| |
Wei Liu,
Xiaoou Tang
|
|
Pages: 451-454 |
|
doi>10.1145/1101149.1101249 |
|
Full text: PDF
|
|
Latent Semantic Analysis (LSA) has shown encouraging performance for the problem of unsupervised automatic image annotation. LSA conducts annotation by keyword propagation on a linear Latent Space, which accounts for the underlying semantic structure of word and image features. In this paper, we formulate a more general nonlinear model, called the Nonlinear Latent Space model, to reveal the latent variables of word and visual features more precisely. Instead of the basic propagation strategy, we present a novel inference strategy for image annotation via Image-Word Embedding (IWE). IWE simultaneously embeds images and words and captures the dependencies between them from a probabilistic viewpoint. Experiments show that IWE-based annotation on the nonlinear latent space outperforms previous unsupervised annotation methods.
|
|
|
Exciting event detection in broadcast soccer video with mid-level description and incremental learning |
| |
Qixiang Ye,
Qingming Huang,
Wen Gao,
Shuqiang Jiang
|
|
Pages: 455-458 |
|
doi>10.1145/1101149.1101250 |
|
Full text: PDF
|
|
In this paper, we propose a method for exciting event detection in broadcast soccer video with mid-level description and SVM-based incremental learning. In the method, video frames are first classified and grouped into views in terms of low-level playfield features. Mid-level descriptions, including view label, motion descriptor and shot descriptor, are then extracted to represent the characteristics of a view. By using the fixed temporal structure of views, SVM classification models are constructed to detect exciting events in a soccer match. In the view classification and event detection procedures, an SVM-based incremental learning method is explored to improve the extensibility of view classification and event detection. Experiments on real soccer video programs demonstrate encouraging results.
|
|
|
A method for retrieving music data with different bit rates using MPEG-4 TwinVQ audio compression |
| |
Michihiro Kobayakawa,
Mamoru Hoshi,
Kensuke Onishi
|
|
Pages: 459-462 |
|
doi>10.1145/1101149.1101251 |
|
Full text: PDF
|
|
The present paper describes a method for indexing a piece of music using the TwinVQ (Transform-domain Weighted Interleave Vector Quantization) audio compression (MPEG-4 audio standard). First, we present a framework for indexing a piece of music based on the autocorrelation coefficients computed in the encoding step of TwinVQ audio compression. Second, we propose a new music feature that is robust with respect to bit rate, based on the wavelet transform and on the fact that the $i$-th autocorrelation coefficient with bit rate $B_1$ of a piece of music computed in the encoding step of TwinVQ audio compression can approximate the $j$-th autocorrelation coefficient with bit rate $B_2$ of the piece of music, where $i = \left\lfloor \frac{B_1}{B_2} j \right\rfloor$. Finally, we perform retrieval experiments on 1,023 pieces of polyphonic music with different bit rates (8 kbps, 12 kbps, 16 kbps, 20 kbps, 24 kbps, 28 kbps, 32 kbps, 36 kbps, 40 kbps, and 44 kbps). The experimental results indicate that the proposed music feature for indexing has excellent retrieval performance for queries of various bit rates.
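The index relation stated in the abstract, i = floor((B1 / B2) * j), can be illustrated with a few lines of Python. This is only a sketch of the index arithmetic, not of the TwinVQ encoder or the proposed feature; the function name and the use of integer kbps values are assumptions made for the example.

```python
def matching_coefficient_index(j, b1_kbps, b2_kbps):
    """Return i = floor((B1 / B2) * j) using exact integer arithmetic.

    i is the index of the autocorrelation coefficient at bit rate B1 that
    approximates the j-th coefficient at bit rate B2.
    """
    return (b1_kbps * j) // b2_kbps

# The 3rd coefficient at 8 kbps corresponds to the 6th coefficient at 16 kbps.
print(matching_coefficient_index(3, 16, 8))
# The 5th coefficient at 12 kbps corresponds to the 3rd coefficient at 8 kbps.
print(matching_coefficient_index(5, 8, 12))
```

Multiplying before dividing keeps the computation in integers and avoids floating-point rounding at exact multiples.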
|
|
|
Sound source location cue coding system for compact representation of multi-channel audio |
| |
Inseon Jang,
Jeongil Seo,
Seungkwon Beack,
Kyeongok Kang,
Han-gil Moon
|
|
Pages: 463-466 |
|
doi>10.1145/1101149.1101252 |
|
Full text: PDF
|
|
Binaural cue coding (BCC) has been introduced for compact representation of multi-channel audio. It exploits binaural cue parameters for capturing the spatial image of multi-channel audio. Recently, it has been standardized within MPEG under the name "MPEG Surround." In this paper, we propose a sound source location cue coding (SSLCC) system for compressing multi-channel audio to suit narrow-bandwidth transmission environments. To improve the compression ability of the conventional BCC, the SSLCC system utilizes the virtual source location information (VSLI) as a spatial cue parameter instead of the inter-channel level difference (ICLD) of the BCC system. The SSLCC system also adopts enhanced pre/post processing algorithms to improve perceptual sound quality. Objective and subjective assessment results show that the proposed SSLCC system achieves better performance than the conventional BCC system.
|
|
|
Semantic knowledge extraction and annotation for web images |
| |
Zhigang Hua,
Xiang-Jun Wang,
Qingshan Liu,
Hanqing Lu
|
|
Pages: 467-470 |
|
doi>10.1145/1101149.1101253 |
|
Full text: PDF
|
|
Nowadays, images have become widely available on the World Wide Web (WWW). It is essential to develop effective ways of managing and retrieving such abundant images. Advantageously, compared to traditional images where very little information is provided, web images contain plentiful context data. This paper introduces a system that can automatically acquire semantic knowledge for web image annotation. By using a page layout analysis method that can precisely assign context to web images, we developed efficient algorithms to extract semantic knowledge for web images, such as description, people, temporal and geographic information. To validate the practicality and efficiency of this system, we applied it to about 6,500 images crawled from the Web. Experiments demonstrated that our approach could achieve satisfactory results.
|
|
|
On indexing of 3D scenes using MPEG-7 |
| |
Ioan Marius Bilasco,
Jérôme Gensel,
Marlène Villanova-Oliver,
Hervé Martin
|
|
Pages: 471-474 |
|
doi>10.1145/1101149.1101254 |
|
Full text: PDF
|
|
The evolving capacities of desktop computers and the emergence of the X3D standard offer a new boost to the 3D domain. Giving sense to 3D content becomes a major issue, especially for reusing such content extracted from existing 3D scenes. In this paper, we address this issue by proposing a generic semantic annotation model for 3D called 3DSEAM (3D SEmantics Annotation Model). 3DSEAM aims at indexing 3D content considering visual, geometric and semantic aspects. 3DSEAM is instantiated using MPEG-7 extended with 3D specific locators. These locators link the visual, geometric and semantic features of 3D content to the corresponding X3D fragment.
|
|
|
Natural language processing of lyrics |
| |
Jose P. G. Mahedero,
Álvaro Martínez,
Pedro Cano,
Markus Koppenberger,
Fabien Gouyon
|
|
Pages: 475-478 |
|
doi>10.1145/1101149.1101255 |
|
Full text: PDF
|
|
We report experiments on the use of standard natural language processing (NLP) tools for the analysis of music lyrics. A significant amount of music audio has lyrics. Lyrics encode an important part of the semantics of a song; therefore, their analysis complements that of acoustic and cultural metadata and is fundamental for the development of complete music information retrieval systems. Moreover, a textual analysis of a song can generate ground truth data that can be used to validate results from purely acoustic methods. Preliminary results on language identification, structure extraction, categorization and similarity searches suggest that much can be gained from the analysis of lyrics.
|
|
|
Building a visual ontology for video retrieval |
| |
L. Hollink,
M. Worring,
A. Th. Schreiber
|
|
Pages: 479-482 |
|
doi>10.1145/1101149.1101256 |
|
Full text: PDF
|
|
To ensure access to growing video collections, annotation is becoming more and more important. Using background knowledge in the form of ontologies or thesauri is a way to facilitate annotation in a broad domain. Current ontologies are not suitable for (semi-) automatic annotation of visual resources as they contain little visual information about the concepts they describe. We investigate how an ontology that does contain visual information can facilitate annotation in a broad domain and identify requirements that a visual ontology has to meet. Based on these requirements, we create a visual ontology out of two existing knowledge corpora (WordNet and MPEG-7) by creating links between visual and general concepts. We test the performance of the ontology on 40 shots of news video, and discuss the added value of each visual property.
|
|
|
Towards context-aware face recognition |
| |
Marc Davis,
Michael Smith,
John Canny,
Nathan Good,
Simon King,
Rajkumar Janakiraman
|
|
Pages: 483-486 |
|
doi>10.1145/1101149.1101257 |
|
Full text: PDF
|
|
In this paper, we focus on the use of context-aware, collaborative filtering, machine-learning techniques that leverage automatically sensed and inferred contextual metadata together with computer vision analysis of image content to make accurate predictions about the human subjects depicted in cameraphone photos. We apply Sparse-Factor Analysis (SFA) to both the contextual metadata gathered in the MMM2 system and the results of PCA (Principal Components Analysis) of the photo content to achieve a 60% face recognition accuracy of people depicted in our cameraphone photos, which is 40% better than media analysis alone. In short, we use context-aware media analysis to solve the face recognition problem for cameraphone photos.
|
|
|
A novel framework for SVM-based image retrieval on large databases |
| |
Lei Wang,
Xuchun Li,
Ping Xue,
Kap Luk Chan
|
|
Pages: 487-490 |
|
doi>10.1145/1101149.1101258 |
|
Full text: PDF
|
|
In this paper, a novel framework is proposed to deliver fast, robust, and generally applicable SVM-based image retrieval for large databases. A quick test scheme is developed, and on-line kernel learning is employed to realize it after analyzing the relationship between them. Then an upper bound on maximum test scope is derived to speed up testing further. General applicability is also well maintained because this framework does not need a kernel function and index structure to be pre-defined. Taking advantage of this framework, a more sophisticated SVM can be used to improve retrieval performance while keeping response time short. Experimental results on large image databases verify the effectiveness and efficiency of the proposed framework.
|
|
|
Automatic image orientation determination with natural image statistics |
| |
Siwei Lyu
|
|
Pages: 491-494 |
|
doi>10.1145/1101149.1101259 |
|
Full text: PDF
|
|
In this paper, we propose a new method for automatically determining image orientations. This method is based on a set of natural image statistics collected from a multi-scale multi-orientation image decomposition (e.g., wavelets). From these statistics, a two-stage hierarchical classification with multiple binary SVM classifiers is employed to determine image orientation. The proposed method is evaluated and compared to existing methods with experiments performed on 18040 natural images, where it showed promising performance.
|
|
|
Determining structure in continuously recorded videos |
| |
Yun Zhai,
Mubarak Shah
|
|
Pages: 495-498 |
|
doi>10.1145/1101149.1101260 |
|
Full text: PDF
|
|
In this paper, we present a scene detection framework for continuously recorded videos. Conventional temporal scene segmentation methods work for videos composed of discrete shots, where shot boundaries are clearly defined. The proposed method detects scene segments by the spectral clustering technique and fuzzy analysis. The detected scenes are represented by the corresponding representative feature values of the feature clusters, rather than abrupt temporal boundaries. The feature clusters are generated using the spectral clustering technique. The video units have fuzzy memberships in the feature clusters, which are generated using the hyperbolic tangent fuzzy function. The final output is collected from the candidate scenes from all clusters. The proposed method has been tested on several video sequences, and very promising results have been obtained.
|
|
|
Two-scale image retrieval with significant meta-information feedback |
| |
Jia Li
|
|
Pages: 499-502 |
|
doi>10.1145/1101149.1101261 |
|
Full text: PDF
|
|
A two-scale image retrieval system is developed to provide efficient search in large-scale databases as well as flexibility for users to incorporate subjective preferences during retrieval. A new clustering method is developed for images each characterized by a varying number of weighted feature vectors. Furthermore, significant meta-information is mined within every cluster. A scanning mode of retrieval is created using cluster centers, which serve as a low scale version of a database in contrast to original images. In particular, users are presented with representative images of highly ranked clusters along with prominent meta-information. This retrieval approach enables users to quickly examine a large and diverse portion of a database surrounding a query and to learn about hidden connections between visual patterns and non-imagery types of data. The clusters formed also facilitate fast search in the case of individual image-based retrieval by filtering out images whose cluster centers are far from the query. The two-scale retrieval system has been implemented on a fine art painting database. Advantages of the system have been demonstrated by quantitative evaluation of the retrieval performance.
|
|
|
A multiview video transcoder |
| |
Baochun Bai,
Janelle Harms
|
|
Pages: 503-506 |
|
doi>10.1145/1101149.1101262 |
|
Full text: PDF
|
|
Video transcoding can convert a compressed video from one format to another. In this paper, we propose a novel multiview video transcoder, which is used for bit-rate scaling of multiple compressed synchronized video streams. Unlike the traditional joint transcoder for independent multiple program transcoding, the multiview video transcoder has the unique task of decorrelating spatial redundancies among video streams. A fast disparity estimation algorithm with a GOP-based disparity search window is put forward and evaluated. The proposed multiview video transcoder can be used to design a cost-effective video acquisition and transmission system for emerging 3D video applications.
|
|
|
Emotion-based music recommendation by association discovery from film music |
| |
Fang-Fei Kuo,
Meng-Fen Chiang,
Man-Kwan Shan,
Suh-Yin Lee
|
|
Pages: 507-510 |
|
doi>10.1145/1101149.1101263 |
|
Full text: PDF
|
|
With the growth of digital music, the development of music recommendation is helpful for users. Existing recommendation approaches are based on users' music preferences. However, recommending music according to emotion is sometimes needed. In this paper, we propose a novel model for emotion-based music recommendation, which is based on association discovery from film music. We investigated music feature extraction and modified the affinity graph for association discovery between emotions and music features. Experimental results show that the proposed approach achieves 85% accuracy on average.
|
|
|
An improved QTCQ wavelet image coding method using DCT and coefficient reorganization |
| |
Li Chen,
Jia Wang
|
|
Pages: 511-514 |
|
doi>10.1145/1101149.1101264 |
|
Full text: PDF
|
|
An improved quadtree classification and TCQ (QTCQ) wavelet image compression method is proposed in this paper. The method applies small block DCT to coefficients in the high frequency subbands and reorders them before quantizing and coding. Experimental results show that the proposed method outperforms QTCQ and other state-of-the-art wavelet coding methods. For example, for the Barbara image at 0.5 bit/pixel, this algorithm outperforms QTCQ, SPIHT, SLCCA and JPEG2000 by 0.4dB, 1.04dB, 0.54dB and 0.17dB in PSNR, respectively. It is also observed that this algorithm works extremely well for textured images like fingerprint images.
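The gains quoted in this abstract are differences in peak signal-to-noise ratio (PSNR). As a reminder of what that metric measures, here is a minimal sketch of the standard PSNR formula for 8-bit images, 10·log10(peak² / MSE). The function name `psnr` and the flat-list image representation are illustrative assumptions and are not taken from the paper.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.4dB at a fixed bit rate corresponds to an MSE lower by a factor of about 10^(-0.04) ≈ 0.91.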
|
|
|
Automatic identification of digital video based on shot-level sequence matching |
| |
Jian Zhou,
Xiao-Ping Zhang
|
|
Pages: 515-518 |
|
doi>10.1145/1101149.1101265 |
|
Full text: PDF
|
|
Locating a video clip in large collections is very important for retrieval applications, especially for digital rights management. In this paper, we present a novel technique for automatic identification of digital video. This new algorithm is based on dynamic programming that fully uses the temporal dimension to measure the similarity between two video sequences. A normalized chromaticity histogram, which is illumination-invariant, is used as a feature. Dynamic programming is applied at the shot level to find the optimal nonlinear mapping between video sequences. Two new normalized distance measures are presented for video sequence matching. One measure is based on the normalization of the optimal path found by dynamic programming. The other measure combines both the visual features and the temporal information. Experimental results show that the shot-level approach is robust to frame rate conversion, color correction, and compression. The proposed distance measures are suitable for variable-length comparisons.
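The idea of aligning two shot sequences with dynamic programming and normalizing the result by the length of the optimal path can be illustrated with a generic dynamic time warping sketch. This is not the authors' algorithm: the L1 histogram distance, the three-way step pattern, and the names `l1_distance` and `normalized_alignment_cost` are assumptions chosen for illustration, and each shot is assumed to be summarized by a single histogram.

```python
def l1_distance(hist_a, hist_b):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def normalized_alignment_cost(shots_a, shots_b, dist=l1_distance):
    """Cost of the optimal nonlinear alignment, divided by the path length."""
    n, m = len(shots_a), len(shots_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one shot")
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning the first i shots of A
    # with the first j shots of B; steps[i][j]: length of that path.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev_cost, prev_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),  # match
                (cost[i - 1][j], steps[i - 1][j]),          # shot of B repeated
                (cost[i][j - 1], steps[i][j - 1]),          # shot of A repeated
            )
            cost[i][j] = prev_cost + dist(shots_a[i - 1], shots_b[j - 1])
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]
```

Dividing by the path length keeps the score comparable between sequence pairs of different lengths, which is the property the abstract highlights for variable-length comparisons.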
|
|
|
Highlight ranking for sports video browsing |
| |
Xiaofeng Tong,
Qingshan Liu,
Yifan Zhang,
Hanqing Lu
|
|
Pages: 519-522 |
|
doi>10.1145/1101149.1101266 |
|
Full text: PDF
|
|
Sports video has been extensively studied for its wide viewership and tremendous commercial potential. Many studies have focused on highlight extraction for summarizing a lengthy video. In this paper, we present an advanced highlight analysis system for sports video browsing, which addresses highlight evaluation and ranking in addition to highlight detection. First, we use replay detection to efficiently localize the highlights. Then, incorporating domain-specific knowledge, we adopt several significant cues to evaluate the importance of the highlights with support vector regression. Finally, the highlights are ranked in descending order of importance value. The ranking results can provide a hierarchical video browsing and customized content delivery scheme. Initial experimental results on soccer videos show encouraging performance compared with human subjective evaluation.
|
|
|
SSF fingerprint for image authentication: an incidental distortion resistant scheme |
| |
Sheng Tang,
Jin-Tao Li,
Yong-Dong Zhang
|
|
Pages: 523-526 |
|
doi>10.1145/1101149.1101267 |
|
Full text: PDF
|
|
We propose a novel method for image authentication which can distinguish incidental manipulations from malicious ones. The authentication fingerprint is based on the Hotelling's T-square statistic (HTS) via Principal Component Analysis (PCA) of block DCT coefficients. HTS values of all blocks construct a unique and stable "block-edge image", i.e., Structural and Statistical Fingerprint (SSF). The SSF is short, tolerates content-preserving modifications while remaining sensitive to content-changing modifications, and can locate tampered blocks easily. Furthermore, we use the Fisher criterion to obtain an optimal threshold for distinguishing manipulations. The security of the SSF is also achieved by encryption of the DCT coefficients with chaotic sequences. Experiments show that the proposed method is effective for authentication.
|
|
|
Validating cardiac echo diagnosis through video similarity |
| |
Tanveer Syeda-Mahmood,
Dulce Ponceleon,
Jing Yang
|
|
Pages: 527-530 |
|
doi>10.1145/1101149.1101268 |
|
Full text: PDF
|
|
Video data is increasingly being used in medical diagnosis. Due to the quality of the video and the complexities of the underlying motion captured, it is difficult for an inexperienced physician/radiologist to describe motion abnormalities in a crisp way, leading to possible errors in diagnosis. In this paper, we present a method of capturing video similarity and its use for diagnosis verification during decision support. Specifically, we describe the motion information in videos using average velocity curves. Second-order motion statistics are extracted from average velocity curves and serve as features for computing video similarity. Given a new video sample already labeled with a diagnosis, a neighborhood of similar videos is assembled from the training set and their diagnosis labels are used to verify the diagnosis.
|
|
|
Tracking users' capture intention: a novel complementary view for home video content analysis |
| |
Tao Mei,
Xian-Sheng Hua,
He-Qin Zhou
|
|
Pages: 531-534 |
|
doi>10.1145/1101149.1101269 |
|
Full text: PDF
|
|
In this paper, we present a novel view of home video content analysis, which aims at tracking the capture intention of camcorder users. Based on the study of intention mechanism in psychology, a set of domain-specific capture intention concepts are defined. A comprehensive and extensible scheme consisting of video structuring, intention oriented feature analysis, as well as intention unit segmentation and classification is proposed to mine the users' capture intention. Experiments were carried out on home video sequences totaling 90 hours, taken by 16 persons over the past 20 years. Both the user study and objective evaluations indicate that our proposed intention-based approach is an effective complement to existing home video content analysis schemes.
|
|
|
Evaluation of subjective video quality of mobile devices |
| |
Satu Jumisko-Pyykkö,
Jukka Häkkinen
|
|
Pages: 535-538 |
|
doi>10.1145/1101149.1101270 |
|
Full text: PDF
|
|
Subjectively perceived video quality is a critical factor when adopting new mobile video applications. When video is used in mobile networks, the most important requirements are related to low bitrates, framerates and the screen size of the mobile device. In two tests we investigated the effects of codecs and combinations of audio and video streams with low bitrates and different contents on the perceived video quality of mobile devices. The first test showed that the codec H.264 produced the most satisfying video quality, but the quality was not high enough for the presentation of textual information. In the second test, the audio-video ratio 32/128kbps was found to be the most pleasant, but there were content dependent variations.
|
|
|
A unified shot boundary detection framework based on graph partition model |
| |
Jinhui Yuan,
Jianmin Li,
Fuzong Lin,
Bo Zhang
|
|
Pages: 539-542 |
|
doi>10.1145/1101149.1101271 |
|
Full text: PDF
|
|
In this paper, we propose a unified shot boundary detection framework by extending the previous work on the graph partition model with temporal constraints. To detect both abrupt transitions (CUTs) and gradual transitions (GTs, excluding fade out/in) in a unified way, we incorporate temporal multi-resolution analysis into the model. Furthermore, instead of an ad-hoc thresholding scheme, we construct a novel kind of feature to characterize shot transitions and employ a support vector machine (SVM) with an active learning strategy to classify boundaries and non-boundaries. Extensive experiments have been carried out on the TRECVID benchmark platform. The experimental results show that the proposed framework outperforms some others and achieves satisfactory results.
|
|
|
Part-based shape retrieval |
| |
Mirela Tanase,
Remco C. Veltkamp
|
|
Pages: 543-546 |
|
doi>10.1145/1101149.1101272 |
|
Full text: PDF
|
|
This paper introduces a measure for computing the dissimilarity between multiple polylines and a polygon based on the turning function, and describes a part-based retrieval system using that dissimilarity measure. This dissimilarity can be efficiently computed in time O(kmn log mn), where m denotes the number of vertices in the polygon, and n is the total number of vertices in the k polylines that are matched against the polygon. This dissimilarity measure identifies similarities even when a significant portion of one shape is different from the other, for example because the shape is articulated, or because of occlusion or distortion. The effectiveness of the dissimilarity measure is demonstrated in a part-based shape retrieval system. Quantitative experimental verification is performed with a known ground-truth, the MPEG-7 Core Experiment test set, in a comparison with the Curvature Scale Space method, and a global turning angle function method.
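The turning function on which this measure is built maps normalized arc length along a shape's boundary to the cumulative tangent angle. The sketch below computes that step function for a single closed polygon only; it does not implement the paper's polyline-to-polygon dissimilarity or its O(kmn log mn) matching. The function name `turning_function` and the list-of-tuples representation are assumptions, and the polygon is assumed to have no zero-length edges.

```python
import math


def turning_function(vertices):
    """Return [(s, theta)] for a closed polygon.

    Each pair says that the edge starting at normalized arc length s
    has cumulative tangent angle theta (radians).
    """
    n = len(vertices)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")
    edges = []
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        edges.append((x1 - x0, y1 - y0))
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    perimeter = sum(lengths)
    theta = math.atan2(edges[0][1], edges[0][0])  # direction of the first edge
    s = 0.0
    steps = [(s, theta)]
    for i in range(1, n):
        (px, py), (qx, qy) = edges[i - 1], edges[i]
        # Signed turn between consecutive edges, in (-pi, pi].
        turn = math.atan2(px * qy - py * qx, px * qx + py * qy)
        theta += turn
        s += lengths[i - 1] / perimeter
        steps.append((s, theta))
    return steps
```

For a counterclockwise unit square starting along the x-axis, the function rises by π/2 at each quarter of the perimeter. Accumulating signed turns, rather than taking `atan2` of each edge directly, keeps the angle from wrapping around at ±π.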
|
|
|
Co-active intelligence for image retrieval |
| |
Mark Truran,
James Goulding,
Helen Ashman
|
|
Pages: 547-550 |
|
doi>10.1145/1101149.1101273 |
|
Full text: PDF
|
|
Lexical ambiguity in query-based image retrieval is an immemorial problem which has seemingly resisted all countermeasures. In this paper we introduce a methodology that expresses the users of a system and their navigational behaviour as the paramount resource for resolving query term ambiguity. Mass user consensus is modelled within a multi-dimensional feature space and evaluated through cluster analysis. This technique resolves query term ambiguity in a wholly democratic and dynamic fashion, in contrast to the brittle centralised models of contemporary word sense classification systems. The simple approach contained herein leads to several interesting emergent properties.
|
|
|
POSTER SESSION: Art poster session |
|
|
|
|
Face to face: a media-art using a face detection system and its exhibition |
| |
Yasuto Nakanishi
|
|
Pages: 551-554 |
|
doi>10.1145/1101149.1101275 |
|
Full text: PDF
|
|
"Face to face" is a media-art that only takes pictures of a profile or a blurring face, etc. those might be thought as failure pictures generally. Its theme is sameness and difference between camera and mirror, and it aims to offer an experience that ...
"Face to face" is a media-art that only takes pictures of a profile or a blurring face, etc. those might be thought as failure pictures generally. Its theme is sameness and difference between camera and mirror, and it aims to offer an experience that people come across oneself whom he/she might not know after he/she sees oneself whom oneself usually see in a mirror. We had an exhibition at NTT ICC (Inter-Communication Center) in Tokyo. In seeing stored images, people seemed to find various activities thorough interacting with this work. expand
|
|
|
The dancing genome project: generation of a human-computer choreography using a genetic algorithm |
| |
François-Joseph Lapointe,
Martine Époque
|
|
Pages: 555-558 |
|
doi>10.1145/1101149.1101276 |
|
Full text: PDF
|
|
In this paper, we present an interactive genetic algorithm for the generation of human-computer choreography, using motion capture technology. First, we introduce the four steps of the algorithm to (1) define a movement vocabulary, (2) initialize movement sequences, (3) generate mutants, and (4) select mutant sequences to create a choreography. Then, we show how this approach is implemented in real time to create interaction among dancers. Finally, we run simulations to assess the convergence rate of the algorithm, before generating a simple duet for actual and virtual dancers.
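The four steps listed in this abstract can be sketched as a small elitist genetic algorithm. This is only an illustration under stated assumptions: the abstract does not give the mutation operators or selection criteria, so single-point substitution is used as a stand-in mutation, and the `score` callback is a hypothetical substitute for the interactive human selection. The name `evolve_choreography` and all parameters are likewise invented for this example.

```python
import random


def evolve_choreography(vocabulary, length, pool_size, generations, score, rng):
    """Evolve a movement sequence; `score` stands in for interactive selection."""
    # Step 1: the movement vocabulary is supplied by the caller.
    # Step 2: initialize random movement sequences.
    pool = [[rng.choice(vocabulary) for _ in range(length)]
            for _ in range(pool_size)]
    for _ in range(generations):
        # Step 3: generate one mutant per sequence by replacing a single move.
        mutants = []
        for sequence in pool:
            mutant = list(sequence)
            mutant[rng.randrange(length)] = rng.choice(vocabulary)
            mutants.append(mutant)
        # Step 4: keep the highest-scoring sequences among parents and mutants.
        pool = sorted(pool + mutants, key=score, reverse=True)[:pool_size]
    return max(pool, key=score)
```

Because parents compete with their mutants in step 4, the best score in the pool never decreases from one generation to the next, which makes convergence simulations of the kind the abstract mentions straightforward to run.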
|
|
|
The "control of fear": an interactive art experiencing and presenting system with multimodal sensors and media |
| |
Chin Chih Yang,
Lipin Liu,
Jacy Chen
|
|
Pages: 559-562 |
|
doi>10.1145/1101149.1101277 |
|
Full text: PDF
|
|
The "Control of Fear" project is an interactive art exhibition that gives the general public an opportunity to experience what might happen to them if their lives were suddenly altered by an unforeseen and unpredictable catastrophic event. The project uses a variety of emerging technologies to simulate unpredictable catastrophic events and detect participants' behavior. Technologies used include multimodal sensors, 3D holographic projection, 360 degree panoramic view video, speech recognition, intelligent interaction response, and synchronous control of multiple multimedia programs.
|
|
|
MusicStory: a personalized music video creator |
| |
David A. Shamma,
Bryan Pardo,
Kristian J. Hammond
|
|
Pages: 563-566 |
|
doi>10.1145/1101149.1101278 |
|
Full text: PDF
|
|
In this paper, we describe MusicStory, a system that automatically creates videos to accompany music with lyrics. MusicStory uses common search engines, photo-sharing websites, and simple analysis of the dynamics and tempo of the music to create personalized photo-narratives. Video pacing and content are based on the content of the song and the structure of the image repositories selected. The image associations MusicStory presents amplify the emotional experience by externalizing the imagery in song lyrics with the content found within a social network. The resulting work juxtaposes the meanings inherent in the social network with those in the song.
|
|
|
Impossible geographies of belonging |
| |
Petra Gemeinboeck
|
|
Pages: 567-570 |
|
doi>10.1145/1101149.1101279 |
|
Full text: PDF
|
|
The paper discusses the boundary between virtual and physical spaces as it is constituted and perforated in a series of installation works by the author. It focuses on the recently completed interactive installation, Impossible Geographies 01: Memory, in which the memory of a space becomes the metaphoric articulation of a virtual seepage into the physical present. Impossible Geographies 02: Urban Fiction, currently being developed, will expand the underlying idea of computated negotiations of geographies and will embed these evolving dialogues in the urban fabric. The series aims to move the contact surface of the virtual further into the physical and to investigate issues of belonging and unbelonging to a space that is naturally inhabited by the participants.
|
|
|
The SINE WAVE ORCHESTRA stay |
| |
Kazuhiro Jo,
Ken Furudate,
Daisuke Ishida,
Mizuki Noguchi
|
|
Pages: 571-573 |
|
doi>10.1145/1101149.1101280 |
|
Full text: PDF
|
|
This is a report of creative and technical considerations in building a participatory sound performance The SINE WAVE ORCHESTRA stay. In this performance, the participants one by one leave their own sine wave in the performance space. These sine waves form a mutually interfering sound space while the sound field of a room changes during the performance.
|
|
|
Seven mile boots: implications of an everyday interface |
| |
Martin Pichlmair
|
|
Pages: 574-577 |
|
doi>10.1145/1101149.1101281 |
|
Full text: PDF
|
|
With seven-league boots through the Internet - when you take a stroll through the physical world in this wireless LAN footwear, you might meet people who happen to be spending some time in a chat-room. Their virtual conversations are made audible as spoken text coming out of the boots. The piece Seven Mile Boots augments reality by introducing a new layer of communication to an everyday task: spoken chat. Virtual crowds are met equipped with an intentionally basic interface.
|
|
|
A new approach to interactive performance systems |
| |
Hüseyin Kuşcu,
B. Tevfik Akgün
|
|
Pages: 578-581 |
|
doi>10.1145/1101149.1101282 |
|
Full text: PDF
|
|
In this paper, we describe the basic principles that an interactive performance system has to satisfy, and we present a new solution that realizes these principles for dance performance with a distributable, hybrid approach of time and states. Within the framework of this research, we introduce a modular, distributed timer-state machine hybrid system that can be used for virtual performances and that lets choreographers edit the available audiovisual linear interaction choices.
|
|
|
Mulholland drive: a movie with no image |
| |
D. Scott Hessels
|
|
Pages: 582-585 |
|
doi>10.1145/1101149.1101283 |
|
Full text: PDF
|
|
Three media artists, Martin Bonadeo, Michael Chu, and D. Scott Hessels, drove Los Angeles' famous Mulholland Drive with five types of sensors--measuring the car's tilt, direction, altitude, speed, and engine sound. The captured data of the mountain road was loaded into a computer and a 3-dimensional model was created. This model was used computationally to control two robotic lights in a room filled with fog. Two 100-foot beams of light and the processed sound of the engine recreated the topology of the road as a new form of visual experience and sculpture-cinema without image. The artwork is designed as a contemporary category of Land Art sculpture, where new media sensors now offer increased abilities to read the environment and allow its forces to generate art.
|
|
|
An adaptation framework for new media artworks |
| |
Anis Ouali,
Brigitte Kerhervé,
Paul Landon
|
|
Pages: 586-589 |
|
doi>10.1145/1101149.1101284 |
|
Full text: PDF
|
|
In this paper, we are interested in adaptation mechanisms for the design, creation and experimentation of adaptive and interactive new media artworks. Through a concrete case study, we propose an adaptation framework that combines semantic and physical adaptation and that can be specialized to the specific needs of various new media artists. This adaptation framework is supported by an adaptation engine, the kernel of the software architecture we are currently building. We have validated our adaptation framework through the implementation of a prototype of the adaptation engine. This prototype integrates the management of various types of metadata and allows a representation of adaptation scenarios as policies of the form event-condition-action. We present how we used our adaptation engine to reconstruct and experiment with the adaptation model of The Man of the Crowd, an existing adaptive new media installation, where the artist introduces a semantic adaptation of the video content displayed on four screens, according to the relative position of the viewer in the artistic installation.
|
|
|
Man in |e|space.mov / motion analysis in 3D space |
| |
Wolf Ka
|
|
Pages: 590-593 |
|
doi>10.1145/1101149.1101285 |
|
Full text: PDF
|
|
The article documents the theoretical and aesthetical basis of the interactive dance performance "man in |e|space.mov". The text discusses the abstraction of the human body in this performance by an interactive costume of light whose motion is analyzed by a 3D motion-rendering programme, which assembles and recombines the captured frames in real time in electronic 3D space. The result is a juxtaposition of three visions of the body that make up the representation: the eye of the spectator, the camera and the 3D camera view, three visions constituting the contemporary body. Furthermore, the text questions the dislocation of the sensual body in physical space to the reading of the body as data in the matrix of virtual space in performing arts. In order to investigate the meaning of the aesthetics of abstraction and the dislocation of the human body in performance art and motion analysis, the article puts 'man in |e|space.mov' in perspective to historical references of the 20th century, in particular to the work of the physiologist and pioneer of cinema J. E. Marey and the Bauhaus artist O. Schlemmer.
|
|
|
Organum: individual presence through collaborative play |
| |
Greg Niemeyer,
Dan Perkel,
Ryan Shaw,
Jane McGonigal
|
|
Pages: 594-597 |
|
doi>10.1145/1101149.1101286 |
|
Full text: PDF
|
|
Organum Playtest is an interactive installation in which three players collaboratively navigate through a model of the human voice box, using their voices as a joystick. By asking players to solve collaborative maze puzzles through cross-functional control, voice interaction and non-verbal communication, Organum Playtest generates novel relationships between individuals, groups and audiences. The game shows how individuals can interact with abstract data forms collectively and perform distinctly on several layers of interaction. The researchers refer to this process as polyvalent performance. Players perform as individuals interacting with graphics, as individuals interacting with a group, and as a group interacting with an audience, thus achieving a "tangible sense of beneficial (...) collaboration" [6].
|
|
|
SESSION: Plenary papers |
|
|
|
|
Learning the semantics of multimedia queries and concepts from a small number of examples |
| |
Apostol (Paul) Natsev,
Milind R. Naphade,
Jelena Tešić
|
|
Pages: 598-607 |
|
doi>10.1145/1101149.1101288 |
|
Full text: PDF
|
|
In this paper we unify two supposedly distinct tasks in multimedia retrieval. One task involves answering queries with a few examples. The other involves learning models for semantic concepts, also with a few examples. In our view these two tasks are identical, the only difference being the number of examples that are available for training. Once we adopt this unified view, we apply identical techniques for solving both problems and evaluate the performance using the NIST TRECVID benchmark evaluation data [15]. We propose a combination hypothesis of two complementary classes of techniques, a nearest neighbor model using only positive examples and a discriminative support vector machine model using both positive and negative examples. In the case of queries, where negative examples are rarely provided to seed the search, we create pseudo-negative samples. We then combine the ranked lists generated by evaluating the test database using both methods, to create a final ranked list of retrieved multimedia items. We evaluate this approach for rare concept and query topic modeling using the NIST TRECVID video corpus. In both tasks we find that applying the combination hypothesis across both modeling techniques and a variety of features results in enhanced performance over any of the baseline models, as well as in improved robustness with respect to training examples and visual features. In particular, we observe an improvement of 6% for rare concept detection and 17% for the search task.
|
|
|
An object-based video coding framework for video sequences obtained from static cameras |
| |
Asaad Hakeem,
Khurram Shafique,
Mubarak Shah
|
|
Pages: 608-617 |
|
doi>10.1145/1101149.1101289 |
|
Full text: PDF
|
|
This paper presents a novel object-based video coding framework for videos obtained from a static camera. As opposed to most existing methods, the proposed method does not require explicit 2D or 3D models of objects and hence is general enough to cater for varying types of objects in the scene. The proposed system detects and tracks objects in the scene and learns the appearance model of each object online using incremental principal component analysis (IPCA). Each object is then coded using the coefficients of the most significant principal components of its learned appearance space. Because an object moves smoothly between a limited number of poses, usually only a few significant principal components account for most of the variance in the object's appearance space, and therefore only a small number of coefficients are required to code the object. The rigid component of the object's motion is coded in terms of its affine parameters. The framework is applied to compressing videos in surveillance and video phone domains. The proposed method is evaluated on videos containing a variety of scenarios such as multiple objects undergoing occlusion, splitting, merging, entering and exiting, as well as a changing background. Results on standard MPEG-7 videos are also presented. For all the videos, the proposed method displays higher Peak Signal to Noise Ratio (PSNR) compared to MPEG-2 and MPEG-4 methods, and provides comparable or better compression.
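The abstract above reports quality as Peak Signal to Noise Ratio. As a reminder of what that metric measures, here is a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE). The function name `psnr`, the flat pixel lists, and the 8-bit peak value of 255 are assumptions made for illustration; the paper's own evaluation code is not shown in this listing.

```python
import math

def psnr(original, decoded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    max_value is the largest possible pixel value (255 assumes 8-bit samples).
    """
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the decoded samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the decoded frame is closer to the original, which is why the comparison against MPEG-2 and MPEG-4 is stated in terms of higher PSNR.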
|
|
|
SEVA: sensor-enhanced video annotation |
| |
Xiaotao Liu,
Mark Corner,
Prashant Shenoy
|
|
Pages: 618-627 |
|
doi>10.1145/1101149.1101290 |
|
Full text: PDF
|
|
In this paper, we study how a sensor-rich world can be exploited by digital recording devices such as cameras and camcorders to improve a user's ability to search through a large repository of image and video files. We design and implement a digital recording system that records identities and locations of objects (as advertised by their sensors) along with visual images (as recorded by a camera). The process, which we refer to as sensor-enhanced video annotation (SEVA), combines a series of correlation, interpolation, and extrapolation techniques. It produces a tagged stream that later can be used to efficiently search for videos or frames containing particular objects or people. We present detailed experiments with a prototype of our system using both stationary and mobile objects as well as GPS and ultrasound. Our experiments show that: (i) SEVA has zero error rates for static objects, except very close to the boundary of the viewable area; (ii) for moving objects or a moving camera, SEVA only misses objects leaving or entering the viewable area by 1-2 frames; (iii) SEVA can scale to 10 fast moving objects using current sensor technology; and (iv) SEVA runs online using relatively inexpensive hardware.
|
|
|
SESSION: Content 3: audio and security |
|
|
|
|
Unsupervised content discovery in composite audio |
| |
Rui Cai,
Lie Lu,
Alan Hanjalic
|
|
Pages: 628-637 |
|
doi>10.1145/1101149.1101292 |
|
Full text: PDF
|
|
Automatically extracting semantic content from audio streams can be helpful in many multimedia applications. Motivated by the known limitations of traditional supervised approaches to content extraction, which are hard to generalize and require suitable training data, we propose in this paper an unsupervised approach to discover and categorize semantic content in a composite audio stream. In our approach, we first employ spectral clustering to discover natural semantic sound clusters in the analyzed data stream (e.g. speech, music, noise, applause, speech mixed with music, etc.). These clusters are referred to as audio elements. Based on the obtained set of audio elements, the key audio elements, which are most prominent in characterizing the content of input audio data, are selected and used to detect potential boundaries of semantic audio segments denoted as auditory scenes. Finally, the auditory scenes are categorized in terms of the audio elements appearing therein. Categorization is inferred from the relations between audio elements and auditory scenes by using the information-theoretic co-clustering scheme. Evaluations of the proposed approach performed on 4 hours of diverse audio data indicate that promising results can be achieved, both regarding audio element discovery and auditory scene categorization.
|
|
|
Multimodal content-based structure analysis of karaoke music |
| |
Yongwei Zhu,
Kai Chen,
Qibin Sun
|
|
Pages: 638-647 |
|
doi>10.1145/1101149.1101293 |
|
Full text: PDF
|
|
This paper presents a novel approach for content-based analysis of karaoke music, which utilizes multimodal contents including synchronized lyrics text from the video channel and original singing audio as well as accompaniment audio in the two audio channels. We propose a novel video text extraction technique to accurately segment the bitmaps of lyrics text from the video frames and track the time of its color changes that are synchronized to the music. A technique that characterizes the original singing voice by analyzing the volume balance between the two audio channels is also proposed. A novel music structure analysis method using lyrics text and audio content is then proposed to precisely identify the verses and choruses of a song, and segment the lyrics into singing phrases. Experimental results based on 20 karaoke music titles in different languages show that our proposed video text extraction technique can detect and segment the lyrics text with accuracy higher than 90%, and that the proposed multimodal approach to music structure analysis performs better than previous methods that are based only on audio content analysis.
|
|
|
A unified framework for resolving ambiguity in copy detection |
| |
Sujoy Roy,
Ee-Chien Chang,
K. Natarajan
|
|
Pages: 648-655 |
|
doi>10.1145/1101149.1101294 |
|
Full text: PDF
|
|
Copy detection is an important component of digital rights management and can be implemented using a retrieval-based approach. Under this approach, a query image, suspected to be a copy, is compared against all the images in the owner database. The comparison is done based on a distance metric in feature space. The performance of such a system depends on the mutual separation of the feature representation of the images in the database. In this paper we propose a framework that increases this mutual separation by literally shifting them away from each other. The idea of modifying the features derives its inspiration from the field of watermarking. It is also important to make sure that the semantics of the images do not change after modification. Thus the focus of this paper is on how to modify the images in the database, so that the mutual separation between the images in feature space is above a certain threshold and the distortion induced is minimized. This problem can be formulated as a non-convex optimization problem which is difficult to solve. We propose a restriction of the problem and solve it using second-order cone programming. We present a practical implementation of our framework, named RAM, which uses AFMT as the feature representation. We conduct experiments to test the performance of RAM.
|
|
|
Accurate repeat finding and object skipping using fingerprints |
| |
Cormac Herley
|
|
Pages: 656-665 |
|
doi>10.1145/1101149.1101295 |
|
Full text: PDF
|
|
This paper introduces a novel and very accurate segmentation algorithm. It is very efficient and consumes less than 10% of CPU on a simple desktop PC to segment a stream in real-time. It operates on an audio stream, or on the audio portion of an audio-visual stream. It is very accurate: it accurately detects the positions and durations of objects on an over-the-air broadcast television signal, and songs on both FM and internet radio stations (as checked against labeled ground truth streams). The algorithm does not require any prior information or training. We detail the system design and present results of segmenting broadcast streams.
|
|
|
PANEL SESSION: Panel |
|
|
|
|
What is the state of our community? |
| |
Yong Rui,
Ramesh Jain,
Nicolas D. Georganas,
HongJiang Zhang,
Klara Nahrstedt,
John Smith,
Mohan Kankanhalli
|
|
Pages: 666-668 |
|
doi>10.1145/1101149.1101297 |
|
Full text: PDF
|
|
|
|
|
SESSION: Brave new topics 2: affective multimodal human-computer interaction |
|
|
|
|
Affective multimodal human-computer interaction |
| |
Maja Pantic,
Nicu Sebe,
Jeffrey F. Cohn,
Thomas Huang
|
|
Pages: 669-676 |
|
doi>10.1145/1101149.1101299 |
|
Full text: PDF
|
|
Social and emotional intelligence are aspects of human intelligence that have been argued to be better predictors than IQ for measuring aspects of success in life, especially in social interactions, learning, and adapting to what is important. When it comes to machines, not all of them will need such skills. Yet if machines such as computers, broadcast systems, and cars are to adapt to their users and anticipate their wishes, they must be endowed with the ability to recognize users' affective states. This article discusses the components of human affect, how they might be integrated into computers, and how far we are from realizing affective multimodal human-computer interaction.
|
|
|
Multimodal affect recognition in learning environments |
| |
Ashish Kapoor,
Rosalind W. Picard
|
|
Pages: 677-682 |
|
doi>10.1145/1101149.1101300 |
|
Full text: PDF
|
|
We propose a multi-sensor affect recognition system and evaluate it on the challenging task of classifying interest (or disinterest) in children trying to solve an educational puzzle on the computer. The multimodal sensory information from facial expressions and postural shifts of the learner is combined with information about the learner's activity on the computer. We propose a unified approach, based on a mixture of Gaussian Processes, for achieving sensor fusion under the problematic conditions of missing channels and noisy labels. This approach generates separate class labels corresponding to each individual modality. The final classification is based upon a hidden random variable, which probabilistically combines the sensors. The multimodal Gaussian Process approach achieves accuracy of over 86%, significantly outperforming classification using the individual modalities, and several other combination schemes.
|
|
|
Multimodal expressive embodied conversational agents |
| |
Catherine Pelachaud
|
|
Pages: 683-689 |
|
doi>10.1145/1101149.1101301 |
|
Full text: PDF
|
|
In this paper we present our work toward the creation of a multimodal expressive Embodied Conversational Agent (ECA). Our agent, called Greta, exhibits nonverbal behaviors synchronized with speech. We are using the taxonomy of communicative functions developed by Isabella Poggi [22] to specify the behavior of the agent. Based on this taxonomy, a representation language, the Affective Presentation Markup Language (APML), has been defined to drive the animation of the agent [4]. Lately, we have been working on creating an agent with individual characteristics rather than a generic agent. We have concentrated on the behavior specification for an individual agent. In particular, we have defined a set of parameters to change the expressivity of the agent's behaviors. Six parameters have been defined and implemented to encode gesture and face expressivity. We have performed perceptual studies of our expressivity model.
|
|
|
Socially aware media |
| |
Alex (Sandy) Pentland
|
|
Pages: 690-695 |
|
doi>10.1145/1101149.1101302 |
|
Full text: PDF
|
|
Face-to-face communication conveys social context as well as words, and it is this social signaling that allows new information to be smoothly integrated into a shared, group-wide understanding. By building machines that understand social signaling and social context we can begin to make communication tools that keep remote users 'in the loop,' and can dramatically improve collective decision making.
|
|
|
SESSION: Content 4: image analysis and retrieval |
|
|
|
|
Coevolutionary feature synthesized EM algorithm for image retrieval |
| |
Rui Li,
Bir Bhanu,
Anlei Dong
|
|
Pages: 696-705 |
|
doi>10.1145/1101149.1101304 |
|
Full text: PDF
|
|
As a commonly used unsupervised learning algorithm in Content-Based Image Retrieval (CBIR), the Expectation-Maximization (EM) algorithm has several limitations, especially in high dimensional feature spaces where the data are limited and the computational cost varies exponentially with the number of feature dimensions. Moreover, the convergence is guaranteed only at a local maximum. In this paper, we propose a unified framework of a novel learning approach, namely Coevolutionary Feature Synthesized Expectation-Maximization (CFS-EM), to achieve satisfactory learning in spite of these difficulties. CFS-EM is a hybrid of coevolutionary genetic programming (CGP) and the EM algorithm. The advantages of CFS-EM are: 1) it synthesizes low-dimensional features based on the CGP algorithm, which yields a near optimal nonlinear transformation and classification precision comparable to kernel methods such as the support vector machine (SVM); 2) the explicitness of the feature transformation is especially suitable for image retrieval because the images can be searched in the synthesized low-dimensional space, while kernel-based methods have to perform the classification computation in the original high-dimensional space; 3) the unlabeled data can be boosted with the help of the class distribution learning using the CGP feature synthesis approach. Experimental results show that CFS-EM outperforms pure EM and CGP alone, and is comparable to SVM in terms of classification. It is computationally more efficient than SVM in the query phase. Moreover, it is highly likely to jump out of a local maximum and so provide near optimal results and a better estimation of parameters.
|
|
|
Image annotations by combining multiple evidence & WordNet |
| |
Yohan Jin,
Latifur Khan,
Lei Wang,
Mamoun Awad
|
|
Pages: 706-715 |
|
doi>10.1145/1101149.1101305 |
|
Full text: PDF
|
|
The development of technology generates huge amounts of non-textual information, such as images. An efficient image annotation and retrieval system is highly desired. Clustering algorithms make it possible to represent visual features of images with finite symbols. Based on this, many statistical models, which analyze correspondence between visual features and words and discover hidden semantics, have been published. These models improve the annotation and retrieval of large image databases. However, the current state of the art, including our previous work, produces too many irrelevant keywords for images during annotation. In this paper, we propose a novel approach that augments the classical model with a generic knowledge base, WordNet. Our approach strives to prune irrelevant keywords by using WordNet. To identify irrelevant keywords, we investigate various semantic similarity measures between keywords and finally fuse the outcomes of all these measures to make a final decision using Dempster-Shafer evidence combination. We have implemented various models that link visual tokens with keywords based on the WordNet knowledge base, and evaluated their performance using precision and recall on a benchmark dataset. The results show that by augmenting the classical model with the knowledge base we can improve annotation accuracy by removing irrelevant keywords.
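The abstract above fuses several similarity measures with Dempster-Shafer evidence combination. A minimal sketch of Dempster's rule of combination for two mass functions follows. The function name `combine`, the dictionary representation keyed by frozensets, and the two-hypothesis frame used in the usage note are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions.

    Each mass function maps a frozenset of hypotheses to its belief mass;
    the masses in each input are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict; the masses cannot be combined")
    # Renormalise by the non-conflicting mass.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}
```

For example, with a frame of {"relevant", "irrelevant"}, one measure might assign 0.6 to "relevant" and 0.4 to the whole frame (ignorance), while a second assigns 0.5, 0.3 and 0.2 to "relevant", "irrelevant" and the whole frame. Calling `combine` on the two yields a single mass function whose values again sum to 1; more than two measures can be folded in by repeated calls.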
|
|
|
Robust subspace analysis for detecting visual attention regions in images |
| |
Yiqun Hu,
Deepu Rajan,
Liang-Tien Chia
|
|
Pages: 716-724 |
|
doi>10.1145/1101149.1101306 |
|
Full text: PDF
|
|
Detecting visually attentive regions of an image is a challenging but useful issue in many multimedia applications. In this paper, we describe a method to extract visual attentive regions in images using subspace estimation and analysis techniques. The image is represented in a 2D space using polar transformation of its features so that each region in the image lies in a 1D linear subspace. A new subspace estimation algorithm based on Generalized Principal Component Analysis (GPCA) is proposed. The robustness of subspace estimation is improved by using weighted least square approximation, where weights are calculated from the distribution of K nearest neighbors to reduce the sensitivity to outliers. Then a new region attention measure is defined to calculate the visual attention of each region by considering both feature contrast and geometric properties of the regions. Experiments show that the method is effective and overcomes the scale dependency of other methods. Compared with existing visual attention detection methods, it directly measures the global visual contrast at the region level as opposed to pixel level contrast and can correctly extract the attentive region.
|
|
|
Formulating context-dependent similarity functions |
| |
Gang Wu,
Edward Y. Chang,
Navneet Panda
|
|
Pages: 725-734 |
|
doi>10.1145/1101149.1101307 |
|
Full text: PDF
|
|
Tasks of information retrieval depend on a good distance function for measuring similarity between data instances. The most effective distance function must be formulated in a context-dependent (also application-, data-, and user-dependent) way. In this paper, we present a novel method, which learns a distance function by capturing the nonlinear relationships among contextual information provided by the application, data, or user. We show that through a process called the "kernel trick," such nonlinear relationships can be learned efficiently in a projected space. In addition to using the kernel trick, we propose two algorithms to further enhance efficiency and effectiveness of function learning. For efficiency, we propose an SMO-like solver to achieve O(N^2) learning performance. For effectiveness, we propose using unsupervised learning in an innovative way to address the challenge of lack of labeled data (contextual information). Theoretically, we substantiate that our method is both sound and optimal. Empirically, we demonstrate that our method is effective and useful.
|
|
|
SESSION: Applications 2: automated multimedia authoring |
|
|
|
|
Automatic generation of personalized music sports video |
| |
Jinjun Wang,
Changsheng Xu,
Engsiong Chng,
Lingyu Duan,
Kongwah Wan,
Qi Tian
|
|
Pages: 735-744 |
|
doi>10.1145/1101149.1101309 |
|
Full text: PDF
|
|
In this paper, we propose a novel automatic approach for personalized music sports video generation. Two research challenges, semantic sports video content selection and automatic video composition, are addressed. For the first challenge, we propose to use multi-modal (audio, video and text) feature analysis and alignment to detect the semantics of events in sports video. For the second challenge, we propose video-centric and music-centric music video composition schemes to automatically generate personalized music sports video based on the user's preference. The experimental results and user evaluations are promising and show that the music sports videos generated by our system are comparable to manually generated ones. The proposed approach greatly facilitates automatic music sports video generation for both professionals and amateurs.
|
|
|
Automated rich presentation of a semantic topic |
| |
Lie Lu,
Zhiwei Li
|
|
Pages: 745-753 |
|
doi>10.1145/1101149.1101310 |
|
Full text: PDF
|
|
To have a rich presentation of a topic, it is not only expected that a large amount of relevant multimodal information, including images, text, audio and video, can be extracted; it is also important to organize and summarize the related information, and provide users with a concise and informative storyboard about the target topic. This helps users quickly grasp and better understand the content of a topic. In this paper, we present a novel approach to automatically generating a rich presentation of a given semantic topic. In our proposed approach, the related multimodal information of a given topic is first extracted from available multimedia databases or websites. Since each topic usually contains multiple events, a text-based event clustering algorithm is then performed with a generative model. Other media information, such as representative images and possibly available video clips and flashes (interactive animations), is associated with each related event. A storyboard of the target topic is thus generated by integrating each event and its corresponding multimodal information. Finally, to make the storyboard more expressive and attractive, incidental music is chosen as background and is aligned with the storyboard. A user study indicates that the presented system works quite well on our testing examples.
|
|
|
SESSION: Interactive arts 2: performance, play, and appreciation |
|
|
|
|
Situated event bootstrapping and capture guidance for automated home movie authoring |
| |
Brett Adams,
Svetha Venkatesh
|
|
Pages: 754-763 |
|
doi>10.1145/1101149.1101312 |
|
Full text: PDF
|
|
This paper describes a novel interactive media authoring framework, MediaTE, that enables amateurs to create videos of higher narrative or aesthetic quality with a completely mobile lifecycle. A novel event bootstrapping dialog is used to derive shot suggestions that yield both targeted footage and annotation enabling an automatic Computational Media Aesthetics-aware editing phase, the manual performance of which is typically a barrier to the amateur. This facilitates a move away from requiring a prior conception of the events or locale being filmed, in the form of a template, to at-capture bootstrapping of this information. Metadata gathered as part of the critical path of media creation also has implications for the longevity and reuse of captured media assets. Results of an evaluation performed on both the usability and delivered media aspects of the system are discussed, which highlight the tenability of the proposed framework and the quality of the produced media.
|
|
|
An ambient intelligence platform for physical play |
| |
Ron Wakkary,
Marek Hatala,
Robb Lovell,
Milena Droumeva
|
|
Pages: 764-773 |
|
doi>10.1145/1101149.1101313 |
|
Full text: PDF
|
|
This paper describes an ambient intelligence prototype known as socio-ec(h)o. socio-ec(h)o explores the design and implementation of a system for sensing and display, user modeling, and interaction models based on a game structure. The game structure includes word puzzles, levels, body states, goals and game skills. Body states are body movements and positions that players must discover in order to complete a level and in turn represent a learned game skill. The paper provides an overview of background concepts and related research. We describe the prototype and game structure, provide a technical description of the prototype and discuss technical issues related to sensing, reasoning and display. The paper contributes by providing a method for constructing group parameters from individual parameters with real-time motion capture data; and a model for mapping the trajectory of participants' actions in order to determine an intensity level used to manage the experience flow of the game and its representation in audio and visual display. We conclude with a discussion of known and outstanding technical issues, and future research.
|
|
|
Generating dance verbs and assisting computer choreography |
| |
Chi-Min Hsieh,
Annie Luciani
|
|
Pages: 774-782 |
|
doi>10.1145/1101149.1101314 |
|
Full text: PDF
|
|
As quoted in the philosophy of contemporary dance: «Understanding the directions for a Free Dance performer stems mainly from the qualities and energy of the movement rather than from spatial criteria», a lot of emphasis is currently put on generating computed dance movement from dynamics and energy, which is totally different from producing movement by kinematics-based gestures in sequence. We argue that it is an ideal interactive medium to connect computer and choreographer. In this paper, we present a set of dynamic models according to dance verbs: "to rebound", "to jump", "to flip", "to wave", etc., served by physically based particle modeling based on Newton's law. Among them, the user has high-level motion control to modify the quality of such dynamically generated movement, for example, light/strong, free/bound, sudden/sustained, etc. These dynamic models are hence well suited to produce spontaneous motion that looks natural and plausible. To sum up, we propose a methodology, focusing on the birth, the growth and the death of cause, which include mystic anticipation, inner propagation and virtual momentum exchange. Our methodology exhibits energetic succession and connects well the dance, physics and computer. It is convincingly a well-suited direction for computer-aided choreography.
|
|
|
POEtic-cubes: acquisition of new qualia through apperception using a bio-inspired electronic tissue |
| |
Raquel Paricio,
J. Manuel Moreno
|
|
Pages: 783-789 |
|
doi>10.1145/1101149.1101315 |
|
Full text: PDF
|
|
In this paper we shall present the research process towards an artistic installation, called POEtic-Cubes, that consists of nine autonomous robots monitored by the POEtic electronic tissue (bio-inspired hardware with adaptive features). The main goal of our research is to approach the processes involved in the acquisition of new qualia, and the installation is conceived for this purpose. The installation mimics the cellular behavior during mitosis and the genotype-to-phenotype mapping processes, which are in charge of managing the development and the adaptation of an individual to the environment. We relate this adaptation principle, a fundamental step in the evolutionary growth of the species, to the evolutionary process of human consciousness, and we concentrate on its apperception mechanism that is used to generate new qualia.
|
|
|
DEMONSTRATION SESSION: Technical demonstration 2: media authoring and processing |
|
|
|
|
MMM2: mobile media metadata for photo sharing |
| |
Shane Ahern,
Simon King,
Marc Davis
|
|
Pages: 790-791 |
|
doi>10.1145/1101149.1101317 |
|
Full text: PDF
|
|
Though cameraphones are rapidly becoming the dominant platform for consumer digital photography, users still face difficulties in transferring, managing, and sharing photos captured with cameraphones. The Mobile Media Metadata 2 (MMM2) system removes the difficulty in transferring photos from the device by providing an automatic upload capability and uses metadata about the context in which a photo was captured to simplify photo management and streamline the sharing process. In our MMM2 system, we have leveraged collaborative filtering techniques to infer the likely sharing recipients for photos based on contextual metadata, which allows the system to accurately guess likely share recipients for a photo and present them to the photographer at the time of capture.
|
|
|
LazyCut: content-aware template-based video authoring |
| |
Xian-Sheng Hua,
Zengzhi Wang,
Shipeng Li
|
|
Pages: 792-793 |
|
doi>10.1145/1101149.1101318 |
|
Full text: PDF
|
|
Though there are many commercial video authoring tools available today, video authoring remains a tedious and extremely time-consuming task that often requires trained professional skills. To tackle this issue, this demonstration presents a novel end-to-end system, called LazyCut, which enables fast, flexible and personalized video authoring and sharing. LazyCut provides a semi-automatic video authoring and sharing system that significantly reduces users' efforts in video editing while preserving sufficient flexibility and personalization.
|
|
|
Simulated virtual market place by using voiscape communication medium |
| |
Yasusi Kanada
|
|
Pages: 794-795 |
|
doi>10.1145/1101149.1101319 |
|
Full text: PDF
|
|
We are developing a new voice communication medium called voiscape. Voiscape enables natural and seamless bi-directional voice communication by using sound to create a virtual sound room. In a sound room, people can feel others' direction and distance expressed by spatial sounds with reverberations, and they can move freely by using a map of the room. Voiscape enables multi-voice conversations. In a virtual market place that will be realized by voiscape, people can not only buy goods or information but also enjoy talking with merchants and people there. In this demo, a voiscape prototype called VPII is used for realizing such an environment. Unfortunately, because prerecorded voices are used in this demo, the participants cannot talk with merchants. However, the participants can talk to each other with small end-to-end latency (less than 200 ms) and will feel the atmosphere of the virtual market place. Prerecorded people and merchants talk to each other in English, Japanese and Chinese in parallel and with crossovers, and participants can virtually walk among them and can selectively listen to one voice or hear multiple voices at once.
|
|
|
Perceptual media compression for multiple viewers with feedback delay |
| |
Oleg Komogortsev,
Javed Khan
|
|
Pages: 796-797 |
|
doi>10.1145/1101149.1101320 |
|
Full text: PDF
|
|
Human eyes have limited perception capabilities; for example, only 2 degrees of our 140 degree vision field provide the highest quality of perception. Due to this fact, the idea of perceptual focus emerged to allow visual content to be changed so that only the part of the visual field where a human gaze is directed is encoded with high quality. The image quality in the periphery can be reduced without a viewer noticing it. This compression approach allows a significant decrease in the number of bits required for image encoding, and in the case of 3D image rendering, it decreases the computational burden. A number of previous researchers have investigated the topic of perceptual focus but only for a single viewer. In our research we investigate a dynamically changing multi-viewer scenario. In this type of scenario a number of people are watching the same visual content at the same time. Each person has his/her own perceptual focus area which changes over time. The visual content is sent through a network with a fixed delay/lag, which poses an additional challenge to the whole scheme. The goal of our work was to investigate and develop a method of multi-viewer perceptual focus zone adaptation for real-time perceptual media compression and transmission. In our research we also look into the impact that such a method can have on transmission bandwidth and computational burden reduction.
|
|
|
MobiCon: integrated capture, annotation, and sharing of video clips with mobile phones |
| |
Janne Lahti,
Utz Westermann,
Marko Palola,
Johannes Peltola,
Elena Vildjiounaite
|
|
Pages: 798-799 |
|
doi>10.1145/1101149.1101321 |
|
Full text: PDF
|
|
This paper presents MobiCon, a video production tool for mobile camera phones. MobiCon integrates video clip capture with context-aware, personalized clip annotation (supporting automatic annotation suggestions based on context data and efficient manual annotation with user-specific ontologies and keywords) and clip sharing secured by digital rights management techniques. Thus, MobiCon allows users to inexpensively create metadata-annotated video clips for a better management of their clip collections and keeps them in control of the clips they share.
|
|
|
Media processing workflow design and execution with ARIA |
| |
Lina Peng,
Gisik Kwon,
K. Selçuk Candan,
Kyung Ryu,
Karam Chatha,
Hari Sundaram,
Yinpeng Chen
|
|
Pages: 800-801 |
|
doi>10.1145/1101149.1101322 |
|
Full text: PDF
|
|
Recently, we introduced a novel ARchitecture for Interactive Arts (ARIA) middleware that processes, filters, and fuses sensory inputs and actuates responses in real-time while providing various Quality of Service (QoS) guarantees. The objective of ARIA is to incorporate real-time, sensed, and archived media and audience responses into live performances, on demand. An ARIA media workflow graph describes how the data sensed through media capture devices will be processed and what audio-visual responses will be actuated. Thus, each data object streamed between ARIA processing components is subject to transformations, as described by a media workflow graph. The media capture and processing components, such as media filters and fusion operators, are programmable and adaptable; i.e., the delay, size, frequency, and quality/precision characteristics of individual operators can be controlled via a number of parameters. In [1, 4, 5], we developed static and dynamic optimization algorithms which maximize the quality of the actuated responses and minimize the corresponding delay and resource usage. In this demonstration, we present the ARIA GUI and the underlying kernel. More specifically, we describe how to design a media processing workflow, with adaptive operators, using the ARIA GUI and how to use the various optimization and adaptation alternatives provided by the ARIA kernel to execute media processing workflows.
|
|
|
Context-driven smart authoring of multimedia content with xSMART |
| |
Ansgar Scherp,
Susanne Boll
|
|
Pages: 802-803 |
|
doi>10.1145/1101149.1101323 |
|
Full text: PDF
|
|
In recent years, many highly sophisticated multimedia authoring tools have been developed. To date, however, these systems' integration of the targeted user context is limited. With our Context-aware Smart Multimedia Authoring Tool (xSMART) we developed a semi-automatic authoring tool that integrates the targeted user context into the different authoring steps and exploits this context to guide the author through the content authoring process. The design of xSMART allows it to be extended and customized to the requirements of a specific domain by domain-specific wizards. These wizards realize the user interface that best meets the domain-specific requirements and effectively supports the domain experts in creating their content targeted at a specific user context.
|
|
|
Video inpainting and restoration techniques |
| |
Rong-Chi Chang,
Louis H. Lin,
Chia-Ton Tian,
Timothy K. Shih
|
|
Pages: 804-805 |
|
doi>10.1145/1101149.1101324 |
|
Full text: PDF
|
|
Aged films may contain defects such as spikes or dirt, as well as long vertical defect lines. These defects were produced in film development or due to improper maintenance of films. We present a series of algorithms which can detect and restore defects. In addition, the restoration technique is used with a motion tracking mechanism. Objects can be removed and holes can be inpainted. We aim to demonstrate three tools for 1) spike detection and restoration, 2) long vertical line detection and repair, and 3) object removal and inpainting. Results of restored video clips show that our mechanisms are practical with good inpainted video quality.
|
|
|
Reading SCORM compliant multimedia courses using heterogeneous pervasive devices |
| |
Te-Hua Wang,
Hsuan-Pu Chang,
Yun-Long Sie,
Kun-Han Chan,
Mon-Tin Tzou,
Timothy K. Shih
|
|
Pages: 806-807 |
|
doi>10.1145/1101149.1101325 |
|
Full text: PDF
|
|
The Sharable Content Object Reference Model (SCORM) provides some important representation for distance learning content and the learning behavior. In general, SCORM-Compliant learning content can be viewed via Web browsers. In this paper, we built an environment which allows people to read SCORM-Compliant course materials on hardcopy paper with a pen-like OCR device. A computer, a PDA, or a cellular phone can be used in conjunction with the pen device for multimedia presentations. Our project is called the Hard SCORM. Consequently, users can read textbooks in a traditional manner while their reading behavior is incorporated with the SCORM specification.
|
|
|
An automated end-to-end lecture capturing and broadcasting system |
| |
Cha Zhang,
Jim Crawford,
Yong Rui,
Li-wei He
|
|
Pages: 808-809 |
|
doi>10.1145/1101149.1101326 |
|
Full text: PDF
|
|
We present a complete end-to-end system that is fully automated and supports capturing, broadcasting, viewing, archiving and search. Specifically, we describe a system architecture that minimizes the pre- and post-production time, and a fully automated lecture capturing system called iCam2, which synchronously captures all the contents of the lecture, including audio, video and visual aids. As no staff is needed during the capturing and broadcasting process, the operation cost of our system is negligible. The system has been used on a daily basis for more than 4 years, during which 467 lectures were captured with 17,000+ online viewers.
|
|
|
SESSION: Content 5: video abstraction |
|
|
|
|
Scenario based dynamic video abstractions using graph matching |
| |
JeongKyu Lee,
JungHwan Oh,
Sae Hwang
|
|
Pages: 810-819 |
|
doi>10.1145/1101149.1101328 |
|
Full text: PDF
|
|
In this paper, we present scenario based dynamic video abstractions using graph matching. Our approach has two main components: multi-level scenario generations and dynamic video abstractions. Multi-level scenarios are generated by a graph-based video segmentation and a hierarchy of the segments. Dynamic video abstractions are accomplished by accessing the generated hierarchy level by level. The first step in the proposed approach is to segment a video into shots using a Region Adjacency Graph (RAG). A RAG expresses spatial relationships among segmented regions of a frame. To measure the similarity between two consecutive RAGs, we propose a new similarity measure, called Graph Similarity Measure (GSM). Next, we construct a tree structure called a scene tree based on the correlation between the detected shots. The correlation is computed by the GSM since it considers the relations between the detected shots properly. Multi-level scenarios which provide various levels of video abstractions are generated using the constructed scene tree. We provide two types of abstraction using multi-level scenarios: multi-level highlights and multi-length summarizations. Multi-level highlights are made up of entire shots in each scenario level. To summarize a video in various lengths, we select key frames by considering temporal relationships among RAGs computed by the GSM. We have developed a system, called Automatic Video Analysis System (AVAS), by integrating the proposed techniques to show their effectiveness. The experimental results show that the proposed techniques are promising.
|
|
|
Evaluation of video summarization for a large number of cameras in ubiquitous home |
| |
Gamhewage C. de Silva,
Toshihiko Yamasaki,
Kiyoharu Aizawa
|
|
Pages: 820-828 |
|
doi>10.1145/1101149.1101329 |
|
Full text: PDF
|
|
A system for video summarization in a ubiquitous environment is presented. Data from pressure-based floor sensors are clustered to segment footsteps of different persons. Video handover has been implemented to retrieve a continuous video showing a person moving in the environment. Several methods for extracting key frames from the resulting video sequences have been implemented, and evaluated by experiments. It was found that most of the key frames the human subjects desire to see could be retrieved using an adaptive algorithm based on camera changes and the number of footsteps within the view of the same camera. The system consists of a graphical user interface that can be used to retrieve video summaries interactively using simple queries.
|
|
|
SESSION: Systems 2: mobility and video |
|
|
|
|
Can small be beautiful?: assessing image resolution requirements for mobile TV |
| |
Hendrik Knoche,
John D. McCarthy,
M. Angela Sasse
|
|
Pages: 829-838 |
|
doi>10.1145/1101149.1101331 |
|
Full text: PDF
|
|
Mobile TV services are now being offered in several countries, but for cost reasons, most of these services offer material directly recoded for mobile consumption (i.e. without additional editing). The experiment reported in this paper aims to assess the image resolution and bitrate requirements for displaying this type of material on mobile devices. The study, with 128 participants, examined responses to four different image resolutions, seven video encoding bitrates, two audio bitrates and four content types. The results show that acceptability is significantly lower for images smaller than 168x126, regardless of content type. The effect is more pronounced when bandwidth is abundant, and is due to important detail being lost in the smaller screens. In contrast to previous studies, participants are more likely to rate image quality as unacceptable when the audio quality is high.
|
|
|
Chameleon: application level power management with performance isolation |
| |
Xiaotao Liu,
Prashant Shenoy,
Mark Corner
|
|
Pages: 839-848 |
|
doi>10.1145/1101149.1101332 |
|
Full text: PDF
|
|
In this paper, we present Chameleon, an application-level power management approach for reducing energy consumption in mobile processors. Our approach exports the entire responsibility of power management decisions to the application level. We propose an operating system interface that can be used by applications to achieve energy savings. We consider three classes of applications (soft real-time, interactive and batch) and design user-level power management strategies for representative applications such as a movie player, a word processor, a web browser, and a batch compiler. We also design a user-level power manager based on GraceOS using Chameleon. We implement our approach in the Linux kernel running on a Sony Transmeta laptop. Our experiments show that, compared to the traditional system-wide CPU voltage scaling approaches, Chameleon can achieve 32-50% energy savings while delivering comparable or better performance to applications. Further, Chameleon imposes small overheads and is very effective at scheduling concurrent applications with diverse energy needs.
|
|
|
SESSION: Open source software competition |
|
|
|
|
OpenVIDIA: parallel GPU computer vision |
| |
James Fung,
Steve Mann
|
|
Pages: 849-852 |
|
doi>10.1145/1101149.1101334 |
|
Full text: PDF
|
|
Graphics and vision are approximate inverses of each other: ordinarily Graphics Processing Units (GPUs) are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we propose using GPUs in approximately the reverse way: to assist in "converting pictures into numbers" (i.e. computer vision). The OpenVIDIA project uses single or multiple graphics cards to accelerate image analysis and computer vision. It is a library and API aimed at providing a graphics hardware accelerated processing framework for image processing and computer vision. OpenVIDIA explores the creation of a parallel computer architecture consisting of multiple Graphics Processing Units (GPUs) built entirely from commodity hardware. OpenVIDIA uses multiple Graphics Processing Units in parallel to operate as a general-purpose parallel computer architecture. It provides a simple API which implements some common computer vision algorithms. Many components can be used immediately and, because the project is Open Source, the code is intended to serve as templates and examples for how similar algorithms are mapped onto graphics hardware. Implemented are image processing techniques (Canny edge detection, filtering), image feature handling (identifying and matching features) and image registration, to name a few.
|
|
|
SESSION: Content 6: multimodal processing |
|
|
|
|
Generation of views of TV content using TV viewers' perspectives expressed in live chats on the web |
| |
Hisashi Miyamori,
Satoshi Nakamura,
Katsumi Tanaka
|
|
Pages: 853-861 |
|
doi>10.1145/1101149.1101336 |
|
Full text: PDF
|
|
We propose a method of generating views of TV programs based on viewers' perspectives expressed in live chats on the Web. Important scenes in a program and responses by particular viewers can be extracted efficiently by statistically computing and/or recognizing live chat data obtained in sync with the broadcast content. We show that by using the computed results, views can be generated that indicate the momentum of reactions by viewers and scenes of interest to particular viewers whose preferences are similar to those of the viewer, etc. This is a new way of viewing TV content from various perspectives.
|
|
|
Graph based multi-modality learning |
| |
Hanghang Tong,
Jingrui He,
Mingjing Li,
Changshui Zhang,
Wei-Ying Ma
|
|
Pages: 862-871 |
|
doi>10.1145/1101149.1101337 |
|
Full text: PDF
|
|
To better understand the content of multimedia, much research effort has been devoted to learning from multi-modal features. In this paper, the problem is studied from a graph point of view: each kind of feature from one modality is represented as one independent graph; and the learning task is formulated as inferring from the constraints in every graph as well as supervision information (if available). For semi-supervised learning, two different fusion schemes, namely linear form and sequential form, are proposed. Each scheme is derived from an optimization point of view, and further justified from two sides: similarity propagation and Bayesian interpretation. By doing so, we reveal the regular optimization nature, transductive learning nature as well as prior fusion nature of the proposed schemes, respectively. Moreover, the proposed method can be easily extended to unsupervised learning, including clustering and embedding. Systematic experimental results validate the effectiveness of the proposed method.
|
|
|
Multimodal metadata fusion using causal strength |
| |
Yi Wu,
Edward Y. Chang,
Belle L. Tseng
|
|
Pages: 872-881 |
|
doi>10.1145/1101149.1101338 |
|
Full text: PDF
|
|
We propose a probabilistic framework that uses influence diagrams to fuse metadata of multiple modalities for photo annotation. We fuse contextual information (location, time, and camera parameters), visual content (holistic and local perceptual features), and semantic ontology in a synergistic way. We use causal strengths to encode causalities between variables, and between variables and semantic labels. Through analytical and empirical studies, we demonstrate that our fusion approach can achieve high-quality photo annotation and good interpretability, substantially better than traditional methods.
|
|
|
Automatic discovery of query-class-dependent models for multimodal search |
| |
Lyndon S. Kennedy,
Apostol (Paul) Natsev,
Shih-Fu Chang
|
|
Pages: 882-891 |
|
doi>10.1145/1101149.1101339 |
|
Full text: PDF
|
|
We develop a framework for the automatic discovery of query classes for query-class-dependent search models in multimodal retrieval. The framework automatically discovers useful query classes by clustering queries in a training set according to the performance of various unimodal search methods, yielding classes of queries which have similar fusion strategies for the combination of unimodal components for multimodal search. We further combine these performance features with the semantic features of the queries during clustering in order to make discovered classes meaningful. The inclusion of the semantic space also makes it possible to choose the correct class for new, unseen queries, which have unknown performance space features. We evaluate the system against the TRECVID 2004 automatic video search task and find that the automatically discovered query classes give an improvement of 18% in MAP over hand-defined query classes used in previous works. We also find that some hand-defined query classes, such as "Named Person" and "Sports" do, indeed, have similarities in search method performance and are useful for query-class-dependent multimodal search, while other hand-defined classes, such as "Named Object" and "General Object" do not have consistent search method performance and should be split apart or replaced with other classes. The proposed framework is general and can be applied to any new domain without expert domain knowledge.
|
|
|
SESSION: Applications 3: tools for multimedia analysis and retrieval |
|
|
|
|
A web-based system for collaborative annotation of large image and video collections: an evaluation and user study |
| |
Timo Volkmer,
John R. Smith,
Apostol (Paul) Natsev
|
|
Pages: 892-901 |
|
doi>10.1145/1101149.1101341 |
|
Full text: PDF
|
|
Annotated collections of images and videos are a necessary basis for the successful development of multimedia retrieval systems. The underlying models of such systems rely heavily on the quality and availability of large training collections. The annotation of large collections, however, is a time-consuming and error-prone task as it has to be performed by human annotators. In this paper we present the IBM Efficient Video Annotation (EVA) system, a server-based tool for semantic concept annotation of large video and image collections. It is optimised for collaborative annotation and includes features such as workload sharing and support in conducting inter-annotator analysis. We discuss initial results of an ongoing user evaluation of this system. The results are based on data collected during the 2005 TRECVID Annotation Forum, where more than 100 annotators used the system.
|
|
|
Putting active learning into multimedia applications: dynamic definition and refinement of concept classifiers |
| |
Ming-yu Chen,
Michael Christel,
Alexander Hauptmann,
Howard Wactlar
|
|
Pages: 902-911 |
|
doi>10.1145/1101149.1101342 |
|
Full text: PDF
|
|
The authors developed an extensible system for video exploitation that puts the user in control to better accommodate novel situations and source material. Visually dense displays of thumbnail imagery in storyboard views are used for shot-based video exploration and retrieval. The user can identify a need for a class of audiovisual detection, adeptly and fluently supply training material for that class, and iteratively evaluate and improve the resulting automatic classification produced via multiple modality active learning and SVM. By iteratively reviewing the output of the classifier and updating the positive and negative training samples with less effort than typical for relevance feedback systems, the user can play an active role in directing the classification process while still needing to truth only a very small percentage of the multimedia data set. Examples are given illustrating the iterative creation of a classifier for a concept of interest to be included in subsequent investigations, and for a concept typically deemed irrelevant to be weeded out in follow-up queries. Filtering and browsing tools making use of existing and iteratively added concepts put the user further in control of the multimedia browsing and retrieval process.
|
|
|
Automatic measurement of quality metrics for colonoscopy videos |
| |
Sae Hwang,
JungHwan Oh,
JeongKyu Lee,
Yu Cao,
Wallapak Tavanapong,
Danyu Liu,
Johnny Wong,
Piet C. de Groen
|
|
Pages: 912-921 |
|
doi>10.1145/1101149.1101343 |
|
Full text: PDF
|
|
Colonoscopy is the accepted screening method for detection of colorectal cancer or its precursor lesions, colorectal polyps. Indeed, colonoscopy has contributed to a decline in the number of colorectal cancer related deaths. However, not all cancers or large polyps are detected at the time of colonoscopy, and methods to investigate why this occurs are needed. We present a new computer-based method that allows automated measurement of a number of metrics that likely reflect the quality of the colonoscopic procedure. The method is based on analysis of a digitized video file created during colonoscopy, and produces information regarding insertion time, withdrawal time, images at the time of maximal intubation, the time and ratio of clear versus blurred or non-informative images, and a first estimate of effort performed by the endoscopist. As these metrics can be obtained automatically, our method allows future quality control in the day-to-day medical practice setting on a large scale. In addition, our method can be adapted to other healthcare procedures. Last but not least, our method may be useful to assess progress during colonoscopy training, or as part of endoscopic skills assessment evaluations.
|
|
|
SESSION: Interactive arts 3: interaction in social and virtual environments |
|
|
|
|
Censor chair: exploring censorship and social presence through psychophysiological sensing |
| |
Eric Aley,
Trina Cooper,
Ross Graeber,
Andruid Kerne,
Kyle Overby,
Zachary O. Toups
|
|
Pages: 922-929 |
|
doi>10.1145/1101149.1101345 |
|
Full text: PDF
|
|
In this paper, we describe Censor Chair, an art installation that creates a shared experience addressing forms of censorship including self-censorship, censorship of a group upon an individual, visual and auditory censorship in digital media, and censorship in society. We are taking a playful position in considering relationships between censorship and sensors that monitor physiology. Censor Chair makes use of a galvanic skin response (GSR) sensor, live video feeds, and a barcode reader to drive the presentation of a digital media library.
|
|
|
Tensegric mobile controlled by pseudo forces |
| |
Kazuya G. Kobayashi,
Taro Ichizawa,
Koichi Nakano,
Katsutoshi Ootsubo
|
|
Pages: 930-936 |
|
doi>10.1145/1101149.1101346 |
|
Full text: PDF
|
|
A tensegric mobile in virtual 3D space is introduced. The input model is a triangular mesh B-rep designed by an artist, which is allowed to have an arbitrary topology. The tensegric structure is automatically generated from the mesh model as a deforming object. After the object is set up above a virtual floor, it repeats movements: rotating, falling, flattening, and rebounding. A user can control its behavior by "pseudo forces": attraction and repulsion forces, stiffness property, and gravity. The stiffness and gravity can also be randomly changed. By using our system, an artist can present continuous movements of an arbitrary 3D shape as a "mobile" of abstractionism.
|
|
|
Echology: an interactive spatial sound and video artwork |
| |
Meghan Deutscher,
Reynald Hoskinson,
Sachiyo Takashashi,
Sidney Fels
|
|
Pages: 937-945 |
|
doi>10.1145/1101149.1101347 |
|
Full text: PDF
|
|
We present a novel way of manipulating a spatial soundscape, one that encourages collaboration and exploration. Through a table-top display surrounded by speakers and lights, participants are invited to engage in peaceful play with Beluga whales shown through a live web camera feed from the Vancouver Aquarium in Canada. Eight softly glowing buttons and a simple interface encourage collaboration with others who are also enjoying the swirling Beluga sounds overhead.
|
|
|
SESSION: Systems 3: searching and streaming |
|
|
|
|
PRISM: indexing multi-dimensional data in P2P networks using reference vectors |
| |
O. D. Sahin,
A. Gulbeden,
F. Emekci,
D. Agrawal,
A. El Abbadi
|
|
Pages: 946-955 |
|
doi>10.1145/1101149.1101349 |
|
Full text: PDF
|
|
Peer-to-peer (P2P) systems research has gained considerable attention recently with the increasing popularity of file sharing applications. Since these applications are used for sharing huge amounts of data, it is very important to efficiently locate the data of interest in such systems. However, these systems usually do not provide efficient search techniques. Existing systems offer only keyword search functionality through a centralized index or by query flooding. In this paper, we propose a scheme based on reference vectors for sharing multi-dimensional data in P2P systems. This scheme effectively supports a larger set of query operations (such as k-NN queries and content-based similarity search) than current systems, which generally support only exact key lookups and keyword searches. The basic idea is to store multiple replicas of an object's index at different peers based on the distances between the object's feature vector and the reference vectors. Later, when a query is posed, the system identifies the peers that are likely to store the index information about relevant objects using reference vectors. Thus the system is able to return accurate results by contacting a small fraction of the participating peers.
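The basic idea in this abstract, placing index replicas according to distances to reference vectors and then contacting only the matching peers at query time, can be sketched as follows. This is a simplified illustration, not the PRISM protocol itself: the one-peer-per-reference mapping, the Euclidean distance, the replica count, and the names `ReferenceIndex`, `publish`, and `knn` are all assumptions made for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_references(vector, references, k):
    """Indices of the k reference vectors nearest to `vector`."""
    order = sorted(range(len(references)), key=lambda i: distance(vector, references[i]))
    return order[:k]


class ReferenceIndex:
    """Toy index: one hypothetical peer per reference vector."""

    def __init__(self, references, replicas=2):
        self.references = references
        self.replicas = replicas
        self.peers = {i: [] for i in range(len(references))}

    def publish(self, object_id, vector):
        # Store a replica of the index entry at the peers of the nearest references.
        for peer in closest_references(vector, self.references, self.replicas):
            self.peers[peer].append((object_id, vector))

    def knn(self, query, k):
        # Contact only the peers whose references are nearest to the query.
        candidates = {}
        for peer in closest_references(query, self.references, self.replicas):
            for object_id, vector in self.peers[peer]:
                candidates[object_id] = distance(query, vector)
        return sorted(candidates, key=candidates.get)[:k]


index = ReferenceIndex([(0, 0), (10, 0), (0, 10), (10, 10)])
for object_id, vector in [("a", (1, 1)), ("b", (9, 1)), ("c", (1, 9)), ("d", (2, 2))]:
    index.publish(object_id, vector)
nearest = index.knn((0.5, 0.5), 2)
print(nearest)
```

Only two of the four toy peers are consulted for the query, which mirrors the abstract's point that a small fraction of peers can be enough to answer a similarity query.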
|
|
|
Supporting multimedia streaming between mobile peers with link availability prediction |
| |
Min Qin,
Roger Zimmermann,
Leslie S. Liu
|
|
Pages: 956-965 |
|
doi>10.1145/1101149.1101350 |
|
Full text: PDF
|
|
Numerous types of mobile devices are now popular with end users, who increasingly use them to carry multimedia content on the go. As wireless connectivity is integrated with more handhelds, streaming multimedia content among mobile peers is becoming a popular application. One of the main challenges in mobile streaming is the requirement that the link must be continuously available for a period of time to enable uninterrupted data transmission and a smooth media performance. Hence, an accurate prediction of future link availability is very desirable and allows, for example, the selection of the most stable link when a multimedia object is available from multiple peers. In this paper, we present a novel iterative algorithm for predicting continuous link availability between two mobile ad-hoc peers. Our method can function without the support of GPS equipment. By a rough estimation of the distance between two peers, our approach is able to accurately predict link availability over a short period of time. Experimental results show that our algorithm can accurately estimate the future link status with an error margin lower than 7%. To demonstrate the feasibility of our approach we have integrated our link prediction algorithm into MStream: a pioneering mobile audio streaming application. Simulation results show that our link availability model can reduce the number of link breaks and achieve smooth streaming experiences among mobile peers.
|
|
|
Scalable media streaming to interactive users |
| |
Marcus Rocha,
Marcelo Maia,
Ítalo Cunha,
Jussara Almeida,
Sérgio Campos
|
|
Pages: 966-975 |
|
doi>10.1145/1101149.1101351 |
|
Full text: PDF
|
|
Recently, a number of scalable stream sharing protocols have been proposed with the promise of great reductions in the server and network bandwidth required for delivering popular media content. Although the scalability of these protocols has been evaluated mostly for sequential user accesses, a high degree of interactivity has been observed in the accesses to several real media servers. Moreover, some studies have indicated that user interactivity can severely penalize the scalability of stream sharing protocols. This paper investigates alternative mechanisms for scalable streaming to interactive users. We first identify a set of workload aspects that determine the scalability of classes of streaming protocols. Using real workloads and a new interactive media workload generator, we build a rich set of realistic synthetic workloads. We evaluate Bandwidth Skimming and Patching, two state-of-the-art streaming protocols, covering, with our workloads, a larger region of the design space than previous work. Finally, we propose and evaluate five optimizations to Bandwidth Skimming, the more scalable of the two protocols. Our best optimization reduces the average server bandwidth required for interactive workloads by up to 54% for unlimited client buffers, and 29% if buffers are constrained to 25% of media size.
|
|
|
SESSION: Applications 4: interactive multimedia systems |
|
|
|
|
Digital violin tutor: an integrated system for beginning violin learners |
| |
Jun Yin,
Ye Wang,
David Hsu
|
|
Pages: 976-985 |
|
doi>10.1145/1101149.1101353 |
|
Full text: PDF
|
|
Prompt feedback is essential for beginning violin learners; however, most amateur learners can only meet with teachers and receive feedback once or twice a week. To help such learners, we have attempted an initial design of Digital Violin Tutor (DVT), an integrated system that provides the much-needed feedback when human teachers are not available. DVT combines violin audio transcription with visualization. Our transcription method is fast, accurate, and robust against noise for violin audio recorded in home environments. The visualization is designed to be intuitive and easily understandable by people with little music knowledge. The different visualization modalities (video, 2D fingerboard animation, and 3D avatar animation) help learners to practice and learn more effectively. The entire system has been implemented with off-the-shelf hardware and shown to be practical in home environments. In our user study, the system received a very positive evaluation.
|
|
|
Pervasive views: area exploration and guidance using extended image media |
| |
Jiang Yu Zheng,
Xiaolong Wang
|
|
Pages: 986-995 |
|
doi>10.1145/1101149.1101354 |
|
Full text: PDF
|
|
This work achieves full registration of scenes in a large area and creates visual indexes for visualization in a digital city. We explore effective mapping, indexing, and display of scenes so that an area becomes "visible". Users can virtually navigate a city on the Internet and obtain real guidance with a PDA. Extended images such as route panoramas, scene tunnels, panoramic views and spherical views are acquired in an urban area and associated with geospatial locations. A 3D LIDAR elevation map is used to generate a scanning plan based on visibility, image properties, and importance of scenes. Scanning scenes along streets and at spots of interest allows for compact and complete visual data collection. To access city information, visual indexes from scenes to spaces are created pervasively for flexible space exploration and transition. To visualize a space seamlessly in a large view frame and synchronize scenes with the virtual movement in the map, we stream image data on the Internet. An engine is developed for continuous space traversing, accessing spatial information, and transiting between spaces through visual links. A real urban area has been modeled to verify the effectiveness of such a system.
|
|
|
A flexible system for creating music while interacting with the computer |
| |
Zeljko Obrenovic
|
|
Pages: 996-1004 |
|
doi>10.1145/1101149.1101355 |
|
Full text: PDF
|
|
Music is a very important part of our lives. People enjoy listening to music, and many of us find a special pleasure in creating music. Computers have further extended many aspects of our musical experience. Listening to, recording, and creating music is now easier and more accessible to various users. On the other hand, various computing applications exploit music in order to better support the interaction with users. However, listening to music is generally a passive experience. Although we may change many parameters, the music we listen to generally does not reflect our response, or does so very roughly. In this paper we present a flexible framework that enables active creation of instrumental music based on the implicit dynamics and content of human-computer interaction. Our approach is application independent, and it provides a mapping of musical features to the abstraction of user interaction. This mapping is based on analysis of the dynamics and content of the human-computer interaction. In contrast to most existing interactive music composition tools, which require explicit interaction with the system, we have provided a more flexible solution that implicitly maps user interaction parameters to the various musical features.
|
|
|
SESSION: Brave new topics 3: advanced methods for medical image retrieval & applications |
|
|
|
|
Data grid for large-scale medical image archive and analysis |
| |
H. K. Huang,
Aifeng Zhang,
Brent Liu,
Zheng Zhou,
Jorge Documet,
Nelson King,
L. W. C. Chan
|
|
Pages: 1005-1013 |
|
doi>10.1145/1101149.1101357 |
|
Full text: PDF
|
|
Storage and retrieval technology for large-scale medical image systems has matured significantly during the past ten years, but many implementations still lack cost-effective backup and recovery solutions. As an example, a PACS (Picture Archiving and Communication System) in a general medical center requires about 40 Terabytes of storage capacity for seven years. Although many healthcare centers rely on PACS for 24/7 clinical operation, current PACS lacks affordable fault-tolerant storage strategies for archive, backup, and disaster recovery. Existing solutions are difficult to administer, and recovery after a disaster is often time consuming. For this reason, PACS still encounters unexpected downtime for hours or days, which could cripple daily clinical service and research operations. Grid Computing represents the latest and most exciting technology to evolve from the familiar realm of parallel, peer-to-peer, and client-server models that can address the problem of fault-tolerant storage for backup and recovery of medical images. We have researched and developed a novel Data Grid testbed involving several federated PAC systems based on grid computing architecture. By integrating grid architecture into the PACS DICOM (Digital Imaging and Communications in Medicine) environment, a PACS, in addition to using its own storage device, also uses a federated Data Grid composed of several PAC systems for off-site backup archive. In case its own storage fails, the PACS can retrieve its image data from the Data Grid in a timely and seamless manner. The design reflects the Globus Toolkit 3.0 five-layer architecture of grid computing: Fabric, Resource, Connectivity, Collective, and Application Layers.
The testbed consists of three federated PAC systems: the Fault-Tolerant PACS archive server at the Image Processing and Informatics Laboratory, the clinical PACS at Saint John's Health Center, and the clinical PACS at the Healthcare Consultation Center II, USC Health Science Campus. In the testbed, we also implement computational services in the Data Grid for image analysis and data mining. The federated PAC systems can use this resource by sharing image data and computational services available in the Data Grid for image analysis and data mining applications. In the paper, we first review PACS and its clinical operation, followed by the description of the Data Grid architecture in the testbed. Different scenarios of using the DICOM store and query/retrieve functions of the laboratory model to demonstrate the fault-tolerance features of the Data Grid are illustrated. The status of current clinical implementation of the Data Grid is reported. An example of using the digital hand atlas for bone age assessment of children is presented to describe the concept of computational services in the Data Grid.
|
|
|
Evaluation axes for medical image retrieval systems: the imageCLEF experience |
| |
Henning Müller,
Paul Clough,
William Hersh,
Thomas Deselaers,
Thomas Lehmann,
Antoine Geissbuhler
|
|
Pages: 1014-1022 |
|
doi>10.1145/1101149.1101358 |
|
Full text: PDF
|
|
Content-based image retrieval in the medical domain is an extremely hot topic in medical imaging as it promises to help better manage the large amount of medical images being produced. Applications are mainly expected in the field of medical teaching files and for research projects, where performance issues and speed are less critical than in the field of diagnostic aid. The final goal with the most impact will be the use as a diagnostic aid in a real-world clinical setting. Other applications of image retrieval and image classification can be the automatic annotation of images with basic concepts or the control of DICOM header information. ImageCLEF is part of the Cross Language Evaluation Forum (CLEF). Since 2004, a medical image retrieval task has been added. The goal is to create databases of a realistic and useful size and also query topics that are based on real-world needs in the medical domain but still correspond to the limited capabilities of purely visual retrieval at the moment. The goal is also to direct the research onto real applications and towards real clinical problems, to give researchers who are not directly linked to medical facilities a possibility to work on the interesting problem of medical image retrieval based on real data sets and problems. The missing link between computer science research departments and clinical routine is one of the biggest problems that becomes evident when reading much of the current literature on medical image retrieval. Most databases are extremely small, the treated problems are often far from clinical reality, and there is no integration of the prototypes into a hospital infrastructure.
Only a few retrieval articles specifically mention problems related to the DICOM format (Digital Imaging and Communications in Medicine) and the sheer amount of data that needs to be treated in an image archive (> 30,000 images per day in the Geneva radiology). This article develops the various axes that can be taken into account for medical image retrieval system evaluation. First, the axes are developed based on current challenges and experiences from ImageCLEF. Then, the resources developed for ImageCLEF are listed and finally, the application of the axes is explained to show the bases of the ImageCLEFmed evaluation campaign. This article concentrates only on the medical retrieval tasks; the non-medical tasks are mentioned only briefly.
|
|
|
MultiPRE: a novel framework with multiple parallel retrieval engines for content-based image retrieval |
| |
Wei Xiong,
Bo Qiu,
Qi Tian,
Changsheng Xu,
S. H. Ong,
Kelvin Foong,
Jean-Pierre Chevallet
|
|
Pages: 1023-1032 |
|
doi>10.1145/1101149.1101359 |
|
Full text: PDF
|
|
We propose a novel framework for content-based image retrieval with multiple parallel retrieval engines (MultiPRE) to achieve higher retrieval performance. Visual features, including both low-level features, such as color, texture and region features, and middle-level structure features, such as blob representation of objects, are used to capture geometrical and statistical characteristics of images. Both clustering analysis and discrimination analysis are used as similarity measures in multiple retrieval engines, which are based on principal component analysis (PCA) and support vector machines (SVM), respectively. Finally, outputs of these engines are fused to determine ranking lists of retrieved images for given retrieval topics. The proposed framework has been evaluated based on the 26 image query topics over the CasImage database with over 9000 medical images used in ImageCLEF 2004, an international research effort for content-based image retrieval performance benchmark. Experiments show that the proposed framework achieved significantly better performance in terms of both the mean and the variance of average precision than the best run reported in ImageCLEF 2004.
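The final step named in this abstract, fusing the outputs of several retrieval engines into one ranking, might look like the sketch below. The abstract does not state the fusion rule, so the min-max normalization, the weighted sum, the engine weights, and the names `min_max` and `fuse` are assumptions; the scores are made-up illustration values, not results from the paper.

```python
def min_max(scores):
    """Rescale one engine's scores to [0, 1] so engines become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {image: (s - lo) / span if span else 0.0 for image, s in scores.items()}


def fuse(engine_scores, weights):
    """Weighted sum of normalized scores; returns image ids, best match first."""
    fused = {}
    for scores, w in zip(engine_scores, weights):
        for image, s in min_max(scores).items():
            fused[image] = fused.get(image, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of two engines for one query topic.
pca_engine = {"img1": 0.9, "img2": 0.5, "img3": 0.1}
svm_engine = {"img1": 2.0, "img2": 3.0, "img3": -1.0}
ranking = fuse([pca_engine, svm_engine], [0.5, 0.5])
print(ranking)
```

Normalizing before summing matters here because the two engines score on different scales; without it, the engine with the wider score range would dominate the fused ranking.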
|
|
|
SESSION: Doctoral symposium 1 |
|
|
|
|
Ontology-driven content search for personalized education |
| |
Apple W. P. Fok
|
|
Pages: 1033-1034 |
|
doi>10.1145/1101149.1101361 |
|
Full text: PDF
|
|
Striving towards our education vision, Personalized Education (PE), a Personalized Education System (PES) framework has been proposed [3] to exploit the vast amount of multimedia learning content on the Web. PEOnto, a fundamental component of PE, is composed of multiple education ontologies to support communications among personalized education agents that provide a variety of PE services. Our research in PEOnto focuses on investigating techniques and a computation framework for supporting ontology-driven search and retrieval of multimedia learning content to meet various learning objectives.
|
|
|
Content-based video indexing for sports applications using integrated multi-modal approach |
| |
Dian Tjondronegoro,
Yi-Ping Phoebe Chen,
Binh Pham
|
|
Pages: 1035-1036 |
|
doi>10.1145/1101149.1101362 |
|
Full text: PDF
|
|
To sustain the ongoing rapid growth of video information, there is an emerging demand for a sophisticated content-based video indexing system. However, current video indexing solutions are still immature and lack any standard. This doctoral work consists of research based on an integrated multi-modal approach for sports video indexing and retrieval. By combining specific features extractable from multiple audio-visual modalities, generic structure and specific events can be detected and classified. During browsing and retrieval, users will benefit from the integration of high-level semantic and some descriptive mid-level features such as whistle and close-up view of player(s).
|
|
|
Designing time-based interactions with multimedia |
| |
Eric Lee
|
|
Pages: 1037-1038 |
|
doi>10.1145/1101149.1101363 |
|
Full text: PDF
|
|
The current model of time in multimedia frameworks poses particular problems when designing multimedia systems with time-based interaction. We propose to expand and extend an existing distinction between semantic time and real time from music and film theory to multimedia systems design. The semantic time concept also forms the foundation of a new software framework for multimedia systems that we are building, which, unlike most existing frameworks, includes mechanisms for time-based effects such as time-stretching.
|
|
|
Estimating illumination parameters in real space with application to image relighting |
| |
Feng Xie,
Linmi Tao
|
|
Pages: 1039-1040 |
|
doi>10.1145/1101149.1101364 |
|
Full text: PDF
|
|
|
|
|
Game state and event distribution using proxy technology and application layer multicast |
| |
Knut-Helge Vik
|
|
Pages: 1041-1042 |
|
doi>10.1145/1101149.1101365 |
|
Full text: PDF
|
|
|
|
|
SESSION: Doctoral symposium 2 |
|
|
|
|
Multimodal analysis of recorded video for e-learning |
| |
Thomas Martin,
Alain Boucher,
Jean-Marc Ogier
|
|
Pages: 1043-1044 |
|
doi>10.1145/1101149.1101367 |
|
Full text: PDF
|
|
In this paper, we present a model for multimodal content analysis. We distinguish between media and modality, which helps us to define and characterize three inter-modal relations. We then apply this model to the analysis of recorded courses for e-learning. Different useful relations between modalities are explained and detailed for this application.
|
|
|
Enhancing quality of service by exploiting delay tolerance in multimedia applications |
| |
Saraswathi Krithivasan,
Sridhar Iyer
|
|
Pages: 1045-1046 |
|
doi>10.1145/1101149.1101368 |
|
Full text: PDF
|
|
|
|
|
Threading stories and generating topic structures in news videos across different sources |
| |
Xiao Wu
|
|
Pages: 1047-1048 |
|
doi>10.1145/1101149.1101369 |
|
Full text: PDF
|
|
News videos delivered from different sources constitute a huge volume of daily information. These videos, overall, form a huge collection of news stories that are intertwined with various novel and old topic themes. To date, it remains a challenging task to automatically extract a concise view of news stories according to topic themes. This doctoral thesis studies the issues in story dependency threading and topical auto-documentary in news stories. Initially, a co-clustering algorithm is proposed to perform news story clustering by exploiting the duality between stories and multi-modal concepts. Then, novelty and redundancy detection is performed to capture the relationship among stories of a topic. To facilitate fast navigation of news topics, a novel topic structure is then proposed to chain the dependencies of stories. A main thread is extracted to highlight the important aspects of a theme. A news video editing optimization algorithm can be directly applied to automatically select suitable video and speech contents from the original video source to create an edited video documentary.
|
|
|
uPen: laser-based, personalized, multi-user interaction on large displays |
| |
Xiaojun Bi,
Yuanchun Shi,
Xiaojie Chen,
PeiFeng Xiang
|
|
Pages: 1049-1050 |
|
doi>10.1145/1101149.1101370 |
|
Full text: PDF
|
|
We present the uPen, a laser pointer combined with a contact-pushed switch, three press buttons and a wireless communication module. This novel interaction device allows users to interact on large displays at a distance or directly on the surface with the full functionality of a mouse. Onboard software enables the uPen system to identify different users and provide personalized services to them, such as associating users with corresponding privileges and giving access to each participant's private content (e.g., home pages, personal calendars). Additionally, with our two-step association method, the uPen system has the ability to distinguish strokes of different uPens working simultaneously and support multi-user simultaneous interaction. A prototype system has been implemented in our Smart Classroom [1], and user studies show the benefit of using it.
|
|
|
SESSION: ACM multimedia art exhibition |
|
|
|
|
ACM multimedia interactive art program: an introduction to the presence/absence exhibition |
| |
Alejandro Jaimes,
Andrew Senior,
Wolfgang Muench
|
|
Pages: 1051-1052 |
|
doi>10.1145/1101149.1101372 |
|
Full text: PDF
|
|
The second ACM Multimedia Art program followed the successful formula used in ACM MM 2005, consisting of a session of long papers, a selection of posters and an art exhibition of multimedia works displayed at a gallery for a period encompassing the conference duration. "Presence/Absence" was selected as the central theme for the exhibition. In this paper, we discuss our motivations in organizing an art program at ACM MM, the exhibition theme, the works selected, and their potential impact in the technical community.
|
|
|
The bomar gene: fictiobiography, digiart, hypertext |
| |
Jason Nelson
|
|
Pages: 1053-1054 |
|
doi>10.1145/1101149.1101373 |
|
Full text: PDF
|
|
The Bomar Gene [6] is a new media, digital fiction hybrid that explores the speculative concept that within us, the codes governing our bodies, is a single unique gene. This speculative gene gives each person an individualized ability, a singular talent. The work's nine sections chronicle through ficto-biographies how these abilities separate/isolate us from our cultural/physical landscape and yet reside us within those spaces the gene impacts. With each story and the accompanying interactive elements, the project explores how these genetically derived abilities consequently adjust our internal and external geographies. Through game, video, sonic and interactive interfaces, the ways these genes both locate and dislocate the characters are recreated/translated into aesthetic hypertexts. Our genetics build personal layers and our cultural response adds dimension. The Bomar Gene utilizes the layering of meanings, fiction over code over user-invited exploration, to situate the user within the characters' lives and genes. And the result is a space within spaces, the story of how our genetics affect the larger spaces around us. This project isn't as much about the science of genetics as it is about human attributes and how those talents and internal deformities reconfigure our relationship with "where we are".
|
|
|
Non_sensor |
| |
Raquel Rennó,
Rafael Marchetti,
Gonzague Defos du Rau
|
|
Pages: 1055-1056 |
|
doi>10.1145/1101149.1101374 |
|
Full text: PDF
|
|
The paper presents Non_sensor, a digital art project that makes use of a Polhemus motion tracking system to create an electromagnetic field which is disturbed by metallic objects that are manipulated by the visitors in the installation. This disturbance is interpreted by the system as movement, and presented in the interface as lines that create trajectories according to the distortions. Visitors can therefore design different graphic patterns in the interface by playing with objects.
|
|
|
Playas: homeland mirage |
| |
Jack Stenner,
Andruid Kerne,
Yauger Williams
|
|
Pages: 1057-1058 |
|
doi>10.1145/1101149.1101375 |
|
Full text: PDF
|
|
This paper describes an interactive installation that addresses issues of presence and absence by creating a virtualized representation of the abandoned town, Playas, New Mexico. This town is slated for conversion into an anti-terrorism training facility by New Mexico Tech University in conjunction with the United States Department of Homeland Security. Using the metaphor of the mirage, the work functions as a critique of our understanding of "reality."
|
|
|
'Ere be dragons: an interactive artwork |
| |
Stephen Boyd Davis,
Magnus Moar,
John Cox,
Chris Riddoch,
Karl Cooke,
Rachel Jacobs,
Matt Watkins,
Richard Hull,
Tom Melamed
|
|
Pages: 1059-1060 |
|
doi>10.1145/1101149.1101376 |
|
Full text: PDF
|
|
The paper introduces a pervasive digital artwork which harnesses live heart-rate and GPS data to create a novel experience on a Pocket PC. The aims of the project, the technologies employed and the results of a preliminary trial are briefly described.
|
|
|
Tastes like... |
| |
Miha Ciglar
|
|
Pages: 1061-1062 |
|
doi>10.1145/1101149.1101377 |
|
Full text: PDF
|
|
"Tastes Like..." (a composition for two monitors, mixing board and human body) is an interactive audiovisual work implemented without computers or common sound/picture synthesis and processing techniques, using only low-tech analogue equipment. It can be perceived as a new instrument, a product of the creative misuse and combination of the objects mentioned in the subtitle. The monitors, mixing board and the player's body are connected in a specific constellation, which causes audiovisual feedback, activated only through the presence of the artist and controlled by the location of his body inside the two-dimensional electrodynamic field created by the two monitors positioned at a right angle.
|
|
|
Immersing ME: the disappearing digitized presence |
| |
Yu-Chuan Tseng,
Chia-Hsiang Lee
|
|
Pages: 1063-1064 |
|
doi>10.1145/1101149.1101378 |
|
Full text: PDF
|
|
Artists try to represent existence and indicate the meaning of presence by a variety of practices. However, when a figure is created by bits and composes hyperreal information, the existence information has become fragments of fake existence. Presence has become a form of absence. Absence has become a form of presence. The work 'Immersing ME' is an interactive artwork inviting participants to perceive the process of dissolving from presence to absence.
|
|
|
Art exhibition: impossible geographies 01 |
| |
Petra Gemeinboeck,
Mary Agnes Krell
|
|
Pages: 1065-1066 |
|
doi>10.1145/1101149.1101379 |
|
Full text: PDF
|
|
Impossible Geographies 01: Memory is an interactive installation in which memory becomes the metaphor for the fluid boundaries between the physical and the virtual. It dynamically traces visitors' actions and mixes them in unexpected ways with memories captured by the physical space. Throughout the exhibition, those memories of visitors and actions seep into the present environment, creating a virtually woven fabric of events that grows and evolves over time. This paper discusses the experience of (temporarily) inhabiting a space that senses, captures and remembers.
|
|
|
Vanishing point |
| |
Mauricio Arango
|
|
Pages: 1067-1068 |
|
doi>10.1145/1101149.1101380 |
|
Full text: PDF
|
|
Vanishing Point is a presentation of the world as it responds to international newspaper coverage - not a measure of what the world is, but of what is most newsworthy. Consequently, countries that receive less media coverage gradually disappear from view. It consists of an interactive world map connected to a database fed by international news sources, and exists both in the form of a website (http://low-fi.org.uk/vanishingpoint) and as a physical gallery installation. The goal of this piece is to decipher the world that news media reconfigures and to observe if media coverage, or lack thereof, is creating a new cartography.
|
|
|
Interactions: an interactive multimedia installation |
| |
David Birchfield
|
|
Pages: 1069-1070 |
|
doi>10.1145/1101149.1101381 |
|
Full text: PDF
|
|
Interactions is an interactive multimedia installation designed and realized by the author. The installation utilizes two neural network artist agents that act as virtual artists to manipulate a body of images, texts, and sounds collected from the internet as directed by audience participants. The piece addresses issues including competition in the arts, machine learning in media, the role of popular acceptance in art, and the relationship between raw materials and style in creating media.
|
|
|
Body degree zero |
| |
Alan Dunning,
Paul Woodrow,
Morley Hollenberg
|
|
Pages: 1071-1072 |
|
doi>10.1145/1101149.1101382 |
|
Full text: PDF
|
|
The Einstein's Brain Project is a collaborative group of artists and scientists who have been working together for the past 9 years. A central aim of the group is the visualization of the biological state of the body through the fabrication of environments, simulations and installations. The Project has developed numerous installations using analog-to-digital interfaces to direct the output of the human body to virtual environments that are constantly being altered through feedback from a participant's biological body. The core of the Einstein's Brain Project is a discursive space that engages with ideas about the constructed body in the world and its digital cybernetic and post-human forms. This paper describes the form and context of the performance Body Degree Zero.
|
|
|
SmallConnection: designing of tangible communication media over networks |
| |
Hideaki Ogawa,
Noriaki Ando,
Satoshi Onodera
|
|
Pages: 1073-1074 |
|
doi>10.1145/1101149.1101383 |
|
Full text: PDF
|
|
The concept of "SmallConnection (abbr. SC)" is creating easy-to-operate tangible media for communication over networks. Focusing on the scenario where two intimate people live in distant places, we developed communication media that can be handled like tools, and can convey faint information such as light, wind and touch through the use of robot technology. The goal of this project is to propose and prototype new media design for communication between two people. Through three working products derived from SC, we hope to widely propose a future communication design using multimedia.
|
|
|
The king has... |
| |
Krister Olsson,
Takashi Kawashima
|
|
Pages: 1075-1076 |
|
doi>10.1145/1101149.1101384 |
|
Full text: PDF
|
|
The installation The King Has... grew out of a desire to explore the ways in which people react to knowing the secrets of others, and, if anonymity were guaranteed, the kinds of secrets people would choose to make public. Secrets were gathered from individuals via SMS messaging, printed on wood panels, and mounted at two installation sites, creating ever-growing fields of text.
|
|
|
diorama table |
| |
Keiko Takahashi,
Shinji Sasada
|
|
Pages: 1077-1078 |
|
doi>10.1145/1101149.1101385 |
|
Full text: PDF
|
|
"diorama table" is an interactive table installation. People place physical objects on the table, and projected elements such as trains, cars, houses, and trees appear and interact with the physical objects.
|
|
|
"KODAMA": mischievous echoes |
| |
Hisako Kroiden Yamakawa
|
|
Pages: 1079-1080 |
|
doi>10.1145/1101149.1101386 |
|
Full text: PDF
|
|
I created "KODAMA" to demonstrate my sensation of solidified human voices in conversation. "KODAMA" is an interactive installation. The "KODAMA" are tree fairies that live in the forest who listen to human voices and mimic their sounds. They are visually depicted as bubbles or pockets of air that move around a projection of the forest. Their movement on screen is controlled by the movement of the audience detected by motion sensors. The audience's voices are captured and re-played by the "KODAMA".
|
|
|
Seven mile boots |
| |
Martin Pichlmair
|
|
Pages: 1081-1081 |
|
doi>10.1145/1101149.1101387 |
|
Full text: PDF
|
|
With seven-league boots through the Internet - when you take a stroll through the physical world in this wireless LAN footwear, you just might meet people who happen to be spending some time in a chat-room. Their virtual conversations are made audible as spoken text coming out of the boots.
|
|
|
Tangible weather channel |
| |
Yu-Cheng Hsu
|
|
Pages: 1082-1083 |
|
doi>10.1145/1101149.1101388 |
|
Full text: PDF
|
|
Tangible Weather Channel is an interactive sculptural apparatus that enables the participant to type in the remote location of a loved one and interprets its real-time weather information as a way of creating an emotional connection. Rather than employing traditional graphical representation, Tangible Weather Channel renders weather information into a multi-sensory experience by using natural elements such as water, air and sound. By materializing weather dynamics on intimate sites to mediate what occurs in another place, Tangible Weather Channel encourages the participant to establish links with his or her experiential memories of a specific place and to create a sense of closeness to a loved one via touch and contemplation. It investigates the experiential and performative aspects of information representation, and interrelationship among material, meaning, memory and perception. The system can be built upon to make more elaborate installations and this is the first iteration of a work which can be greatly expanded.
|