Ieee multi media april june 2016 by Vietnam School Tours

Contents | Zoom in | Zoom out

For navigation instructions please click here

Search Issue | Next Page

APRILâ&#x20AC;&#x201C;JUNE 2016

http://www.computer.org

Contents | Zoom in | Zoom out

For navigation instructions please click here

Search Issue | Next Page

M q M q

M q

MqM q THE WORLDâ&#x20AC;&#x2122;S NEWSSTANDÂ®

SEARCH, ANNOTATE, UNDERLINE, VIEW VIDEOS, CHANGE TEXT SIZE, DEFINE

READ YOUR FAVORITE PUBLICATIONS YOUR WAY Now, your IEEE Computer Society technical publications arenâ&#x20AC;&#x2122;t just the most informative and state-of-the-art PDJD]LQHV DQG MRXUQDOV LQ WKH Æ® HOG Å&#x153; WKH\Å&#x17E;UH DOVR WKH most exciting, interactive, and customizable to your reading preferences. The new myCS format for all IEEE Computer Society digital publications is:

<RXÅ&#x17E;YH *RW WR 6HH ,W 7R UHDOO\ DSSUHFLDWH WKH YDVW GLÆ¬ HUHQFH LQ UHDGLQJ HQMR\PHQW WKDW P\&6 UHSUHVHQWV \RX QHHG WR VHH D YLGHR GHPRQVWUDWLRQ DQG WKHQ WU\ RXW WKH LQWHUDFWLYLW\ IRU \RXUVHOI Just go to www.computer.org/mycs-info

â&#x20AC;¢ Mobile friendly. /RRNV JUHDW RQ DQ\ GHYLFH Å&#x153; PRELOH WDEOHW ODSWRS RU GHVNWRS â&#x20AC;¢ Customizable. :KDWHYHU \RXU H UHDGHU OHWV \RX GR \RX FDQ GR RQ P\&6 &KDQJH WKH SDJH FRORU WH[W VL]H RU OD\RXW HYHQ XVH DQQRWDWLRQV RU DQ LQWHJUDWHG GLFWLRQDU\ Å&#x153; LWÅ&#x17E;V XS WR \RX â&#x20AC;¢ Adaptive. 'HVLJQHG VSHFLÆ® FDOO\ IRU GLJLWDO GHOLYHU\ DQG UHDGDELOLW\ â&#x20AC;¢ Personal. 6DYH DOO \RXU LVVXHV DQG VHDUFK RU UHWULHYH WKHP TXLFNO\ RQ \RXU SHUVRQDO P\&6 VLWH

M q M q

M q

MqM q THE WORLDâ&#x20AC;&#x2122;S NEWSSTANDÂ®

Contents | Zoom in | Zoom out

For navigation instructions please click here

Search Issue | Next Page

http://www.computer.org APRILâ&#x20AC;&#x201C;JUNE 2016

_____________________

Contents | Zoom in | Zoom out

For navigation instructions please click here

Search Issue | Next Page

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

April–June 2016

Vol. 23, No. 2

Published by the IEEE Computer Society in cooperation with the IEEE Communications Society and IEEE Signal Processing Society

Ubiquitous Multimedia 12

Guest Editors’ Introduction Ubiquitous Multimedia: Emerging Research on Multimedia Computing Yonghong Tian, Min Chen, and Leonel Sousa

Nonlocal In-Loop Filter: The Way Toward Next-Generation Video Coding? Siwei Ma, Xinfeng Zhang, Jian Zhang, Chuanmin Jia, Shiqi Wang, and Wen Gao

A Novel Semi-Supervised Dimensionality Reduction Framework Xin Guo, Yun Tie, Lin Qi, and Ling Guan

Multimodal Ensemble Fusion for Disambiguation and Retrieval Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, Dihong Gong, and Chunsheng Victor Fang

Planogram Compliance Checking Based on Detection of Recurring Patterns Song Liu, Wanqing Li, Stephen Davis, Christian Ritz, and Hongda Tian

www.computer.org/multimedia _____________________ Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE MultiMedia does not necessarily constitute endorsement by the IEEE or the IEEE Computer Society. All submissions are subject to editing for style, clarity, and length. IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html. Reuse Rights and Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of IEEE-copyrighted material on their own Web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version that has been revised by the author to incorporate review suggestions, but not the published version with copyediting, proofreading, and formatting added by IEEE. For more information, please go to: www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or ______________ pubs-permissions@ieee.org. Copyright c 2016 IEEE. All rights reserved.

Abstracting and Library Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. Circulation: IEEE MultiMedia (ISSN 1070-986X) is published quarterly by the IEEE Computer Society. IEEE Headquarters: Three Park Ave., 17th Floor, New York, NY 10016-5997. IEEE Computer Society Publications Office: 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1264; +1 714 821 8380. IEEE Computer Society Headquarters: 2001 L St., Ste. 700, Washington, DC 20036. Subscribe to IEEE MultiMedia by visiting www.computer.org/multimedia. Postmaster: Send address changes and undelivered copies to IEEE MultiMedia, IEEE, Membership Processing Dept., 445 Hoes Lane, Piscataway, NJ 08855, USA. Periodicals Postage is paid at New York, NY, and at additional mailing sites. Canadian GST #125634188. Canada Post International Publications Mail Product (Canadian Distribution) Sales Agreement #0487848. Canada Post Publications Mail Agreement Number 40013885. Return undeliverable Canadian addresses to P.O. Box 122, Niagara Falls, ON L2E 6S8. Printed in USA.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Feature Articles 64

An Image Encryption Algorithm Based on Autoblocking and Electrocardiography Guodong Ye and Xiaoling Huang

Extended Guided Filtering for Depth Map Upsampling Kai-Lung Hua, Kai-Han Lo, and Yu-Chiang Frank Wang

Managing Intellectual Property in a Music Fruition Environment

Adriano Barat e , Goffredo Haus, Luca A. Ludovico, and Paolo Perlasca

Departments 2

EIC’s Message Yong Rui Understanding Multimedia

Cover Image: Peter Nagy 4

Research Projects Tilman Dingler, Passant El Agroudy, Huy Viet Le, Albrecht Schmidt, Evangelos Niforatos, Agon Bexheti, and Marc Langheinrich Multimedia Memory Cues for Augmenting Human Memory

________

Scientific Conferences Qiong Liu The BAMMF Series in Silicon Valley Call for Papers: Advancing Multimedia Distribution, p. 27 Advertising Index, p. 41 IEEE CS Information, p. 52 Call for Papers: New Signals in Multimedia Systems, p. 53

ISSN 1070-986X Editor in Chief Yong Rui

Microsoft Research

Associate Editors in Chief Susanne Boll Alan Hanjalic Wenjun Zeng Editorial Board Ian Burnett Andrea Cavallaro Shu-Ching Chen Touradj Ebrahimi Farshad Fotouhi Gerald Friedland Winston Hsu Gang Hua Benoit Huet Hayley Hung Aisling Kelliher Shiwen Mao Cees G.M. Snoek Rong Yan

University of Oldenburg, Germany Delft University of Technology University of Missouri-Columbia RMIT University Queen Mary University of London Florida International University Swiss Federal Institute of Technology Wayne State University University of California, Berkeley National Taiwan University Stevens Institute of Technology Eurecom Technical University of Delft Virginia Tech Auburn University University of Amsterdam Snapchat

Advisory Board Forouzan Golshani William Grosky Ramesh Jain Sethuraman Panchanathan John R. Smith

Calif. State Univ., Long Beach University of Michigan University of California, Irvine Arizona State University IBM

Editorial Management Shani Murray Editorial Product Lead Cathy Martin Senior Manager, Editorial Services Robin Baldwin Manager, Editorial Services Brian Brannon Assoc. Mgr., Peer Review & Periodical Admin. Hilda Carman

Director, Products and Services Senior Business Development Manager Senior Advertising Coordinator

Evan Butterfield Sandra Brown Marian Anderson

Magazine Operations Committee Forrest Shull (chair), Brian Blake, Maria Ebling, Lieven Eeckhout, Miguel Encarnacao, Nathan Ensmenger, Sumi Helal, San Murugesan, Yong Rui, Ahmad-Reza Sadeghi, Diomidis Spinellis, George K. Thiruvathukal, Mazin Yousif, Daniel Zeng Publications Board David S. Ebert (VP for Publications), Alfredo Benso, Irena Bojanova, Greg Byrd, Min Chen, Robert Dupuis, Niklas Elmqvist, Davide Falessi, William Ribarsky, Forrest Shull, Melanie Tory Submissions: Send to https://mc.manuscriptcentral.com/ cs-ieee ____ (Manuscript Central). Please check to see if you have an account by using the Check for Existing Account button. If you don’t have an account, please sign up. Submit proposals for special issues to John R. Smith (jsmith@us.ibm.com). ____________ All submissions are subject to editing for style, clarity, and length.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

EIC’s Message Understanding Multimedia

T Yong Rui Microsoft Research

he word artiﬁcial intelligence was coined in 1956 at Dartmouth College by a group of AI pioneers, so 2016 is its 60th anniversary. Since the very beginning of AI, researchers have dreamed of letting machines see, hear, feel, and understand the outside world. The sensory data of this outside world relates directly to the ﬁeld of multimedia—it includes images and videos for machines to see, audios for machines to hear, and haptics for machines to feel. In the past several months, I have been discussing some of these topics with colleagues at Microsoft Research, as well as with researchers from academia and industry, and we’re all amazed by how much AI and multimedia technologies have advanced in the past 60 years.

Visual Media Understanding Early on, visual media understanding mostly took place at the processing level, with image de-noising and line and corner extraction. This level of understanding corresponded with lowlevel human vision, but it was an important building block in moving toward high-level visual media understanding. Next, technology moved beyond understanding lines and corners and started analyzing color, texture, and shapes. The early literature on computer vision and multimedia contained rich research on developing various color spaces, such as RGB and YUV; testing different texture metrics, such as wavelet-based metrics; and inventing many shape descriptors and invariants, such as Fourier-based metrics. In the mid 1990s, color, texture, and shape features and similarity measures dominated the content-based image retrieval (CBIR) research area. Then, in the late 1990s, researchers started to better appreciate the semantic gap between the low-level features and the high-level semantics. Consequently, they put more effort into visual media classiﬁcation and tagging, and now an image of a red apple is more likely to be classiﬁed along with other apples than with some other, unrelated red round object.

1070-986X/16/$33.00 c 2016 IEEE

Thanks to these improvements in classiﬁcation and tagging, commercial search engines started returning meaningful results when users searched by example, but researchers still weren’t satisﬁed. They wanted to move to the next level.

The Video2Text Problem Given an image, can a computer generate a human understandable sentence? This is the socalled image2text problem. In 2014, systems from both academia and industry started to generate reasonable text results (see, for example, results from the 2015 Captioning Challenge at http://mscoco.org/dataset/#captions-leader ________________________________ -board). In fact, some of the generated sentences ____ were quite good. In parallel, other researchers were attacking the video2text problem, including researchers from Stanford, UT Austin, Berkeley, SUNY-Buffalo, and Microsoft Research. In general, there are two approaches to addressing the video2text problem. The ﬁrst approach uses a language template-based model, which predicts the best subject-verb-object and then generates a sentence using the template. The other approach, which currently provides better results, is to use recurrent neural network (RNN)-based models. The various RNN models differ in terms of the video features/representation used, the convolutional neural network structure chosen, and the objective function components (for example, relevance versus coherence) deﬁned.

hen it comes to visual media understanding, we’ve been continually progressing for the past 60 years, moving from line/corner extraction, to color/texture/shape analysis, to classiﬁcation and tagging, and now to image/video2text exploration. In the short term, I see computers generating paragraphs (versus sentences) from multiple images (versus a single image). In the long term, we’re marching toward understanding the much more

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Editorial Board Changes I want to take this opportunity to thank Anthony Vetro (editor of the Industry and Standards department) and ChiaWen Lin (our representative from the Signal Processing Society) for finishing their current terms and retiring from the editorial board. It would be impossible for us to do our job without their tremendous help! At the same time, I welcome Andrea Cavallaro as the new editor representing the Signal Processing Society and handling related papers and Touradj Ebrahimi as the new department editor for Industry and Standards. Andrea Cavallaro is a professor of multimedia signal processing and the director of the Centre for Intelligent Sensing at Queen Mary University of London, UK. His research interests include camera networks, target detection and tracking, behavior and identity recognition, privacy, and perceptual semantics in multimedia. Cavallaro received his

complicated direct and implied meanings of images and videos. Happy researching!

PhD in electrical engineering from the Swiss Federal Institute of Technology (EPFL), Lausanne. He is an elected member of the IEEE Signal Processing Society and is a member of the Image, Video, and Multidimensional Signal Processing Technical Committee and chair of its Awards committee. Contact him at ________________ a.cavallaro@qmul.ac.uk. Touradj Ebrahimi is a professor at the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland, where he heads the Multimedia Signal Processing Group. His research interests include still, moving, and 3D image processing and coding; visual information security (rights protection, watermarking, authentication, data integrity, steganography); new media; and human computer interfaces (smart vision, brain computer interface). Ebrahimi received his PhD in electrical engineering from EPFL. Contact him at _________________ touradj.ebrahimi@epfl.ch.

and social and urban computing. Contact him at yongrui@microsoft.com. _______________

Yong Rui is the assistant managing director at Microsoft Research Asia, leading research efforts in the areas of multimedia search, knowledge mining,

_______________ _________

April–June 2016

_____________________________ ____________

____________________________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Research Projects

Susanne Boll University of Oldenburg, Germany

Multimedia Memory Cues for Augmenting Human Memory Tilman Dingler, Passant El Agroudy, Huy Viet Le, and Albrecht Schmidt University of Stuttgart, Germany Evangelos Niforatos, Agon Bexheti, and Marc Langheinrich della Svizzera Universita Italiana (USI), Lugano, Switzerland

uman memory has long been used as an important tool in helping people effectively perform daily tasks. We write down information that we don’t want to forget, or tie a knot into a handkerchief to remember an important event. Today’s technology offers many replacements for these tried and tested tools, such as electronic phone books, diaries with automated alarms, and even location-based reminders. Lifelogging—“a phenomenon whereby people can digitally record their own daily lives in varying amounts of detail”1—offers a powerful new set of tools to augment our memories. In particular, the prospect of capturing a continuous stream of images or videos from both a ﬁrst-person perspective and various third-person perspectives promises an unprecedented level of rich multimedia content. Such content could disclose a signiﬁcant amount of detail, given the right set of analysis tools. Having comprehensive recordings of our lives would make it possible, at least in principle, to search such an electronic diary for any kind of information that might have been forgotten or simply overlooked: “What was the name of the new colleague that I met yesterday?” or “Where did I last see my keys?” In the context of the EU-funded Recall project (http://recall-fet.eu), we also look into the ____________ use of such multimedia data to augment human memory—but in a conceptually different fashion. Instead of seeking to offer users an index that can be searched at any time, thereby diminishing the importance of their own memory, we seek to create a system that will measurably improve each user’s own memory. Instead of asking yourself (that is, your electronic diary) for the name of the new colleague during your next encounter (which could be awkward as you wait for the diary to pull up the name), Recall users would have already trained their own memory to simply remember the colleague’s name.

1070-986X/16/$31.00 c 2016 IEEE

Here, we present the core research ideas of Recall, outlining the particular challenges of such an approach for multimedia research and summarizing the project’s initial results. Our overall approach is to collect multimedia lifelog data and contextual information through a range of capture devices, process the captured data to create appropriate memory cues for later playback, and apply theories from psychology to develop tools and applications for memory augmentation (see Table 1).

Memory Cues A system that aims to improve the user’s own memory must be able to properly select, process, and present “memory cues.” A memory cue is simply something that helps us remember— it is a snippet of information that helps us access a memory.2 Figure 1 gives an overview of contextual information sources that produce these cues. Almost anything can work as a memory cue: a piece of driftwood might remind us of family vacations at the beach, an old song might remind us of our ﬁrst high school dance, or the smell of beeswax might remind us of a childhood Christmas. Multimedia—audio, pictures, video, and so on—is thus of particular interest. It holds a signiﬁcant amount of information that can offer rich triggers for memory recollection. Furthermore, given today’s technology, multimedia memory cues are relatively easy to capture. Recall uses memory cues to stimulate pathways in a user’s memory that will reinforce the ability to retrieve certain information when needed in the future. To be useful, memory cues thus don’t have to actually contain all of the information needed. For example, a picture of a particular whiteboard drawing might not be detailed enough to show the individual labels, yet seeing a picture of the overall situation might be enough for a user to vividly remember not only

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 1. Summary of Recall studies to identify efficient cues for triggering memories. System layer

Research probe

Data-capture approach

Recall-supporting cues

Capture memory cues

On-body camera position: How does the position affect the quality

Automatic fixed-interval capture

Videos and photos from cameras (head-worn cameras offer better

and perception of captured photos?

autobiographical cues, while chestworn cameras are more stable) Faces (most relevant cues)

PulseCam app: How can we capture

Pulse-rate-triggered capture

only important photos using biophysical data? MGOK app: How can we enhance the quality of memory cues by captur-

Smaller number of captured photos for important activities

Limit number of pictures per day

Smaller number of captured photos for important activities

ing more significant moments? Extract

Summarization of lifelog image

Automatic fixed-interval

Summarized-video requirements: no

memory cues

collection: What are the guidelines

capture

more than three minutes; include

to produce video summaries from such collections?

people, objects, or actions; and present in the same chronological order

Summarization of desktop activity screenshots: How can we

Reading-triggered capture (using a commercial eye

reduce the volume of screenshots

tracker)

Smaller number of captured screenshots for important activities

without affecting recall quality? LISA prototype: How can we create

Auto-sync with third-party

Aggregated dashboard of projected

a holistic and interactive solution for reflecting on daily activities?

services

visualizations and speech (location, pictures, fitness data, and calendar

EmoSnaps app: How can we enhance emotional recall of past

Capture at predefined moments (such as when a

experiences using visual cues?

device is unlocked)

Re-Live the Moment app: How

Running-triggered capture

Time-lapse video: route-captured

can we use personalized multimedia

using a music playlist (continu-

photos and personal running music

cues to foster positive behavioral change in running?

ous capturing) and route photos (automatic, fixed-interval

playlist

ja vu concept: How can we De exploit priming to display ambient

Search for relevant information in third-party services

events) Present memory cues

Selfies (facial expressions)

capture) Visualizations of proactive information chunks about future situations

information about future situations to make them familiar?

Memory Capture Near-continuous collection of memory cues (lifelogging) has become possible through a number of available technologies. Lifelog cam-

eras, such as Microsoft’s SenseCam (http:// ____ research.microsoft.com/en-us/projects/sensecam) or the Narrative Clip (http://getnarrative.com), let users capture the day in images. Every 10 to 120 seconds, these devices take a picture, culminating in a time-lapse sequence of images that can span days, weeks, or months. Additional audio and video footage can be collected through other user-worn capture devices and through cameras and microphones placed in the environment. All this data makes up lifelogs, but the quality of the footage often is volatile.

April–June 2016

the diagram itself but also the discussion surrounding its creation. Similarly, an image with the face of a new colleague, together with the ﬁrst letter of her name, might be enough to trigger our own recollection of the full name, and thus priming us to retrieve the full name when we meet the new colleague again.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Research Projects

Memory cues

Human memory

Actual experience Figure 1. Contextual information sources that produce cues for supporting one’s ability to recall a past experience of a future event.

IEEE MultiMedia

Here, device positioning matters. We compared the body position where such cameras are normally attached—head or chest—and its effect on image quality and user perception.3 We equipped 30 participants with cameras on their foreheads and chests and later asked them about their perception of the images collected. Additionally, we applied a set of standard image-processing algorithms to classify images, including sharpness ﬁlters and face and hand detection. We learned that the chest-worn devices produced more stable and less motion-blurred

(a)

images, through which feature detection by image processing algorithms worked better. Head-worn video cameras, on the other hand, captured more important autobiographical cues than chest-worn devices. Here, faces were shown to be most relevant for recall. Beyond visual data, there is a number of different data types capable of enhancing lifelogs. Combining different data streams can form a more comprehensive picture: accelerometers, blood pressure, or galvanic skin response sensors, for example, output physiological data that can be used to assess the signiﬁcance of images taken. Smartphones and watches often have some of these sensors already included. They further allow the collection of context information, such as time, location, or activities. We proposed using biophysical data to distinguish between highly important and rather irrelevant moments, subsequently driving image capture. As such, we developed the PulseCam (see Figure 2a), an Android Wear and mobile app that takes the user’s pulse rate to capture images of greater importance.4 Eventually, merged with third-party data sources— such as calendar entries, email communication, or social network activities—lifelogs can be enriched with a holistic picture of a person’s on- and ofﬂine activities throughout the day. Most of this data can be collected implicitly—that is, without the user having to manually trigger the recording (by taking a picture, for example). Explicit recording, on the other hand, would imply the conscious act of recording a memory, as would a manual posting on

(b)

Figure 2. Screenshots of Recall probes for capturing memory cues. (a) The PulseCam prototype hardware. On the left arm, the user has an LG G smart watch for continuous heart rate capture. On the right arm, a Nexus S smartphone is attached to capture the pictures. (b) A screenshot of the My Good Old Kodak

¼ch.usi.inf.recall.myoldkodak&hl¼en). The application (https://play.google.com/store/apps/details?id ____________________________________________________ number of remaining photos in the day is displayed in the lower left corner.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Twitter, Facebook, or Instagram. Explicit capture tends to indicate the significance of a particular moment for a person, whereas implicit capture helps us ensure we don’t miss key events. A combination of both capture modes leads to a richer lifelog, from which more significant memory cues might be drawn. To produce a holistic record, data coming from all these different sources must be properly time-synchronized, unless additional algorithms can infer temporal co-location from overlapping information (for example, a smartphone camera and chestmounted camera showing two different viewpoints of the same scene) and thus post-hoc synchronize two or more streams. To contrast how implicit and explicit captures influence our original (uncued) recollection of an event, and how well they can serve as memory cues, we developed the My Good Old Kodak (MGOK) mobile app. MGOK is a mobile camera application that artificially limits the amount of pictures that can be taken, resembling the classic film cameras (see Figure 2b).5 We are currently analyzing data from a large trial that we ran with almost 100 students, snapping away for a day with a chest-worn lifelogging camera (implicit, unbounded), a “normal” smartphone camera app (explicit, unbounded), our MGOK app (explicit, bounded), or no camera at all. We hypothesize that the imposed capture limitation will result in moments of higher significance being captured, potentially leading to pictures that better serve as memory cues.

Memory Cue Extraction

Summarizing Large Datasets Images, videos, and speech streams are a rich pool of information, because they capture experiences along with contextual cues such as locations or emotions in great detail. However, they require signiﬁcant time to be moderated and viewed, creating the need for efﬁcient automatic summarization techniques. For example, a single Narrative Clip that takes a picture every 30 seconds will produce approximately 1,500 pictures per day. On the other hand, the variety of digital services that we use on a daily basis produces a heterogeneous pool of data. This makes it hard to gain deeper insights or derive more general patterns. By merging sensor data and extracting meaning, we can derive holistic and meaningful insights. Summarizing a large image collection. To inform and automatically generate lifelog summaries, we conducted a set of user studies to elicit design guidelines for video summaries.6 We instructed 16 participants to create video summaries from their own lifelogging images and compared the results to nonsummary review techniques, such as using time lapses and reviews through an image browser. The three techniques were equally effective, but participants preferred the experience of their own video summaries. However, such manual processing isn’t always possible, especially when considering the large amounts of data collected in just one day. Insights from the preceding study lead us to the following set of guidelines, which we used to build a system for automating the creation of video summaries. First, video summaries should not exceed three minutes, because most users don’t want to spend an exhaustive amount of time reviewing lifelogging activities. Second, images featuring combinations of people, places, objects, or actions are reportedly the most effective memory cues. These can be further enhanced by adding metadata, which improves the user’s understanding of the image’s context. Finally, presenting images in a chronological order provides additional support (chronological, contextual, and inferential) for the reconstruction of memories. In particular, this affects

April–June 2016

Memory cues are stimuli hints that trigger the recall of a past experience or future event.2 Cues can be presented, for example, on peripheral displays throughout the user’s home or on a personal device, such as a smartphone. They are meant to trigger episodic (remembering past events) and prospective (remembering planned future events) memory recall. By frequently encountering certain cues, they can improve people’s ability to recall a relevant memory and its details over time. Such an effect could potentially persist without the need for further technological support. Thus, we currently focus on two main directions: summarizing large datasets of images, and merging and summarizing heterogeneous data resources. One of the main objectives is to ﬁnd high-quality cues for efﬁciently triggering memories and recall. Compared to traditional summaries, an effective memory cue is minimalistic by

itself but allows a wide range of associations to be made. It acts as a trigger to your own memory with all the richness that the memory entails.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Research Projects

(a)

(b)

Figure 3. Pictures of the LISA prototype used to extract memory cues: (a) The prototype hardware, composed of a projector and a set of speakers. (b) The summary of memory cues presented as a dashboard projection and audio summary upon waking up.

activities with greater movement (such as sports activities, walking, or social events) because such activities require multiple images to cover additional details not well represented by a single image. Summarizing desktop screenshots. To create a holistic lifelog of people’s daily activities, we also need to look at people’s technology usage across the day. Therefore, we investigated how to capture people’s PC usage, as represented by their activities on their computer desktop. We focused on automatic screenshots, which— when triggered in a regular time interval—produce a large amount of images. This led to an investigation of how to minimize the sheer volume of snapshots taken by an automatic desktop logger in a work environment.7 We compared three triggers for such snapshots: a ﬁxed-time interval (two minutes) and two techniques informed by eye-tracking data—whenever the user’s eye gaze focused on an application window, or whenever a reading activity was registered. Reading detection turned out to signiﬁcantly reduce the amount of images taken while still capturing relevant activities.

IEEE MultiMedia

Merging and Summarizing Heterogeneous Data Sources There is a wide range of personal data streams accessible not only through capture devices but also through Web APIs and interaction logs. We created a projection system called Life Intelligence Software Assistant (LISA) in the form of a bedside device that provides a morning briefing, combining data from the past day with

upcoming events (see Figure 3). Using visual projection and speech, it presents information from different data sources: locations visited, ﬁtness stats, images taken, and calendar events. In a pilot study, we found a mixture of speech and projection to be preferable to either of them alone. In a series of domestic deployments, we are currently investigating the effectiveness of different cues, display locations, and use cases.

Presentation: Apps and Concepts The capture of effective memory cues is essential to enable recall. A single effective cue can produce a great amount of details in memory, such that a comprehensive media capture of that same memory becomes redundant. To investigate the efficiency of certain memory cues, we created and deployed a series of presentation prototypes that allowed us to test how replaying captured cues would actually help participants remember prior experiences. EmoSnaps Initially, we investigated how visual cues can enhance emotional recall in the form of selfies. As such, we developed a mobile app called EmoSnaps (see Figure 4a). It unobtrusively captures pictures of the user’s facial expression at predefined moments (such as when the user unlocks his or her smartphone).8 Participants correctly identified the emotion captured on their selfies during a past moment solely by revisiting the selfie taken. Surprisingly, participants managed to identify older pictures better than newer ones. We

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

(a)

(b)

(c)

Figure 4. Screen shots of Recall research problems for presenting memory cues. (a) EmoSnaps—the user is asked to recall his or her emotion using a past selfie (https://play.google.com/store/apps/details?id ¼com.nifo.emosnaps&hl¼en). (b) Re-Live the Moment—the ______________________________________________ user wears a chest-mounted smarphone. He starts running with his favorite music and tracking app. After the run, the user can check his “personal music clip.” (c) De ja vu—a conceptual prototype of context-aware peripheral displays showing information about relevant people and locations.

attribute this phenomenon to the probable conflict between recognizing emotion through facial expressions with recalling emotion from contextual information derived from the background of the picture. We believe this conflict becomes less prevalent as more time elapses since capture. We further used the selfie cue in additional studies as a successful metric to evaluate user satisfaction with mobile phone applications,8 and we even proposed unobtrusively measuring drivers’ user experience during their commutes.9

ja Vu De Human memory isn’t restricted to simply recalling the past (episodic memory); it also relates to remembering events that are scheduled to occur in the future. In fact, human memory relies on prospective memory for remembering upcoming events. To tap into the potential of prospective memory, we envisioned exploiting the concept vu (see Figure 4c) and displaying inforof d eja mation about upcoming events and situations with the goal of making future situations appear somewhat familiar.11 New situations naturally create a sense of excitement or anxiety. However, using peripheral displays in people’s homes to present small information chunks that possibly have future relevance for a person can help lower potential anxiety caused by the uncertainty of the unknown. We investigated whether people can learn incidentally and without conscious effort about new environments and other people. By providing visual information, such systems create a sense of d eja vu at the point when people will be facing a new situation.

April–June 2016

Re-Live the Moment We investigated whether using visual as well as audible memory cues (pictures and music) captured during a beneﬁciary activity (such as running) could be used to facilitate the formation of positive habits (such as exercising often). In fact, contemporary psychology has shown that people are more likely to form a habit if they are reminded of previously positive experiences during habit formation. Based on this theory, we developed Re-Live the Moment (see Figure 2b), a mobile application that captures pictures and records the music that a user was listening to while on a run to create a personal exercise “music clip.”10 These clips act as video memory cues that can later be watched to remind runners of the positive feelings exhibited during their run, thus encouraging them to continue exercising. We tested our prototype in a pilot study with ﬁve participants who reported that they enjoyed reviewing the multimedia presentation (the personal music clip) after the run, and some even shared the resulting clip with friends. We assume

that positive feelings exhibited during the review might lead to more exercising, but a larger deployment, and for a longer duration, is required to ascertain the presence and signiﬁcance of such an effect.

Assessing the Tools In the ﬁnal stages of the Recall project, we’re undertaking a series of trials to assess the effectiveness of our memory augmentation tools.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Research Projects

We have identified three domains: domestic, workplace, and campus. In a domestic setting, we use devices and displays to present memory cues in accordance to people’s regular routines. We attach displays in the periphery of their homes to create stimulating environments and display personal content with the goal of supporting people’s cognition and memory. In a campus scenario, we focus on the scheduling and presentation of personalized media on public displays and other ambient displays across a university campus. By deriving memory cues from lecture material, we target content at specific individuals and groups. Finally, the work scenario involves a series of augmented meeting rooms that give people access to captured moments, both from other meeting participants and from the installed infrastructure (such as cameras). Captured moments are augmented with topics inferred from an automated topic analysis system and played back to attendees on peripheral displays (such as on laptop and smartphone lock screens, or on tablets installed in offices that serve as picture frames). The goal is to help users better remember the meeting progression and outcome to prepare for the next meeting.

Challenges and Lessons Learned There are still signiﬁcant challenges that must be addressed before we can fully realize the potential of multimedia to augment human memory recall. Modern lifelogging devices, for example, can certainly capture a tremendous amount of information, yet this information is often irrelevant to the actual experience we strive to capture. Although a variety of approaches can be taken to separate important moments from mundane ones during capture, there is clearly much to be done.

IEEE MultiMedia

Capturing Meaningful Data Memory cues can be highly effective when evoking ﬁne-grained details about an experience, or completely obsolete when they can’t be placed into a context. Obtaining meaningful cues from a multimedia capture relies heavily on both the raw data quality and the legitimacy of the conclusions we draw from them. This is especially true for implicitly collected data, where inferential conclusions might be ambiguous. The ﬁnal quality of a multimedia capture thus can’t be judged simply by the data itself— for example, the picture quality or its contents

(though lower boundaries, such as dark or blurry images, do exist). Memories and their corresponding cues are highly personal. A supporting system therefore must learn the user’s preferred types of cues and subjective relevance of a capture. For some, a blurry image of scribbled meeting notes might be enough to recall the meeting’s content, while others might need to see the faces of those present to evoke a meaningful memory of the event. The use of physiological sensing might offer some insight into which cues hold the most potential for a user. Dealing with Technology Constraints There are constraints on what can be tracked, especially when it comes to physiological and also psychological or emotional aspects. Despite the progress made regarding tracking physical data, tracking mental activities is inherently difficult to do without additional hardware, such as eye trackers or Electroencephalography (EEG) devices—both of which are (still) highly obtrusive to use. Furthermore, certain mental states are difficult to infer reliably, such as attention levels, emotions, or stress. A much more straightforward technical barrier is today’s often low capture quality. Although storage will continue to expand, the sheer volume of what we can capture might nevertheless tax effective local processing, requiring extensive offline processing that might eventually become cost effective with technological advances. Addressing Privacy Implications Although the continuous capture of (potential) multimedia memory cues might be a boon to human memory, it might also represent the bane of an Orwellian nightmare come true. The strong social backlash that many wearers of Google Glass experienced12 is a potent reminder of the potentially underlying incompatibilities between those who capture and those who are captured. In previous work,13 we enumerated the key privacy issues of memory augmentation technology—issues that span a wide range of areas, from data security (secure memory storage, ensuring the integrity of captured memories) to data management (sharing memories with others) to bystander privacy (controlling and communicating capturing in public). We recently started work on creating an architecture that both enables the seamless sharing of

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

captured multimedia data within colocated groups (for example, an impromptu work meeting or a chat over coffee) and features tangible objects to easily communicate and control what gets recorded and who can access the data.

ltimately, Recall aims to lay the scientiﬁc foundations for a new technology ecosystem that can transform how humans remember to measurably and signiﬁcantly improve functional capabilities while maintaining individual control. Our work in Recall has only begun to scratch the surface of this exciting new application area. MM

Conf. Automotive User Interfaces and Interactive Vehicular Applications, 2015, pp. 118–123. 10. A. Bexheti et al., “Re-Live the Moment: Visualizing Run Experiences to Motivate Future Exercises,” Proc. 17th Int’l Conf. Human-Computer Interaction with Mobile Devices and Services Adjunct, 2015, pp. 986–993. j 11. A. Schmidt et al., “De a Vu—Technologies that Make New Situations Look Familiar: Position Paper,” Proc. 2014 ACM Int’l Joint Conf. Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 1389–1396. 12. B. Bergstein, “The Meaning of the Google Glass Backlash,” MIT Technology Rev., 14 Mar. 2013; www.technologyreview.com/s/512541/the -meaning-of-the-google-glass-backlash. _______________________

Acknowledgment Project Recall is funded by the European Union in FP7 under grant number 612933.

13. N. Davies et al., “Security and Privacy Implications of Pervasive Memory Augmentation,” IEEE Pervasive Computing, vol. 14, no. 1, 2015, pp. 44–53. Tilman Dingler is a researcher at the University of

References

Stuttgart, Germany. Contact him at tilman.dingler@ __________

1. C. Gurrin, A.F. Smeaton, and A.R. Doherty, “Lifelogging: Personal Big Data,” Foundations and Trends in Information Retrieval, vol. 8, no. 1, 2014, pp. 1–125. 2. A. Baddeley et al., Memory, Psychology Press,

vis.uni-stuttgart.de. ____________ Passant El Agroudy is a researcher at the University of Stuttgart, Germany. Contact her at ________ passant.el.agroudy@vis.uni-stuttgart.de. _________________

2009. 3. K. Wolf et al., “Effects of Camera Position and Media Type on Lifelogging Images,” Proc. 14th Int’l Conf. Mobile and Ubiquitous Multimedia, 2015, pp. 234–244. 4. E. Niforatos et al., “PulseCam: Biophysically Driven Life Logging,” Proc. 17th Int’l Conf. Human-Computer Interaction with Mobile Devices and Services

Huy Viet Le is a researcher at the University of Stuttgart, Germany. Contact him at _________ huy.le@vis.unistuttgart.de. ________ Albrecht Schmidt is a professor of human-computer interaction at the University of Stuttgart. Contact him at _______________________ albrecht.schmidt@vis.uni-stuttgart.de.

Adjunct, 2015, pp. 1002–1009. 5. E. Niforatos, M. Langheinrich, and A. Bexheti, “My Good Old Kodak: Understanding the Impact of Having Only 24 Pictures to Take,” Proc. 2014 ACM Int’l Joint Conf. Pervasive and Ubiquitous Computing: Adjunct Publication, 2014, pp. 1355–1360. 6. H.V. Le et al., “Impact of Video Summary Viewing

Evangelos Niforatos is a researcher at Universita della Svizzera Italiana (USI), Lugano, Switzerland. Contact him at _________________ evangelos.niforatos@usi.ch. della SvizAgon Bexheti is a researcher at Universita zera Italiana (USI), Lugano, Switzerland. Contact

on Episodic Memory Recall: Design Guidelines for Video Summarizations,” Proc. 34rd Ann. ACM Conf.

him at _____________ agon.bexheti@usi.ch.

Human Factors in Computing Systems, 2016.

Marc Langheinrich is an associate professor at the della Svizzera Italiana (USI), where he Universita

7. T. Dingler et al., “Reading-Based Screenshot Summaries for Supporting Awareness of Desktop Activities,” Proc. 7th Augmented Human Int’l Conf.,

works on privacy and usability in pervasive computing systems. Contact him at langheinrich@ieee.org. ______________

April–June 2016

2016, pp. 27:1–27:5. 8. E. Niforatos and E. Karapanos, “EmoSnaps: A Mobile Application for Emotion Recall from Facial Expressions,” Personal and Ubiquitous Computing, vol. 19, no. 2, 2015, pp. 425–444. 9. E. Niforatos et al., “eMotion: Retrospective In-Car User Experience Evaluation,” Adjunct Proc. 7th Int’l

_______________ _________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Guest Editors’ Introduction

Ubiquitous Multimedia: Emerging Research on Multimedia Computing Yonghong Tian Peking University Min Chen University of Washington Bothell Leonel Sousa Universidade de Lisboa, Portugal

ultimedia is ubiquitous in our daily lives. In recent years, many multimedia applications and services have been developed and deployed, including mobile audio/ video streaming, mobile shopping, and remote video surveillance, letting people access rich multimedia content anytime, anywhere, and using different access networks and computing devices. It is anticipated that ubiquitous multimedia applications and services will change the way we operate and interact with the world. In this sense, digital multimedia can be viewed as representing a fundamental shift in how we store, transmit, and consume information. An immediate consequence of ubiquitous multimedia applications and services has been

1070-986X/16/$33.00 c 2016 IEEE

the explosive growth of multimedia data. Today, photos and videos can be easily captured by any handheld device—such as a mobile phone, an iPad, or an iWatch—and then automatically pushed to various online sharing services (such as Flickr and YouTube) and social networks (such as Facebook and WeChat). On average, Facebook receives more than 350 million new photos each day, while 300 hours of video are uploaded to YouTube every minute.1 Furthermore, Cisco predicts that video will account for 80 percent of all consumer Internet traffic in 2019, up from 64 percent in 2014.2 Such explosive growth of multimedia data will definitely lead to the emergence of the so-called “big data deluge.”3 The wide-ranging applications and big data of ubiquitous multimedia present both unprecedented challenges and unique opportunities for multimedia computing research, which were the focus of the 2015 IEEE International Symposium on Multimedia (ISM 2015) held in Miami from 14–16 December 2015. Over the past decade, ISM has established itself as a renowned international forum for researchers and practitioners to exchange ideas, connect with colleagues, and advance the state of the art and practice of multimedia computing, as well as to identify emerging research topics and define the future of this cross-disciplinary field. The ISM 2015 call for papers redefined “multimedia computing” as “one of the computing fields that is generally concerned with presentation, integration, and computation of one or more ubiquitous media, such as text, image, graphics, audio, video, social data, and data collected from various sensors, etc., using computing techniques.” Approximately 45 high-quality papers were accepted for ISM 2015, providing novel ideas, new results, and state-of-the-art techniques in the field of ubiquitous multimedia computing. Following this successful event, we aim to provide with this special issue another forum for the researchers of the top symposium papers to further present their research results, potentially increasing the papers’ impact on the community.

In this Issue This special issue is the second successful collaboration between IEEE MultiMedia and ISM, which facilitates the publication of the extended versions of the top symposium papers through a fast-track review and publication process. From a total of seven invited submissions, we selected

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

four representative articles that investigate the emerging multimedia technology to address the challenges in ubiquitous multimedia data and applications. Next-Generation Video Coding Technology Although the video compression ratio has doubled in each of the last three decades, it is far behind the growth speed of multimedia data. In addition, this gap is expected to grow even bigger over the next several years, presenting an unprecedented challenge for high-efficiency video coding technology.4 In the article, “Nonlocal In-Loop Filter: The Way Toward Next-Generation Video Coding?”, Siwei Ma, Xinfeng Zhang, Jian Zhang, Chuanmin Jia, Shiqi Wang, and Wen Gao journey through the design philosophy of in-loop filtering, an essential coding tool in H.264/AVC and High Efficiency Video Coding (HEVC), and then present their vision of next-generation (higher-efficiency) video coding technology. Toward this end, they explore the performance of in-loop filters for HEVC with image local and nonlocal correlations. In their method, a nonlocal similarity-based loop filter (NLSLF) is incorporated into the HEVC standard by simultaneously enforcing the intrinsic local sparsity and the nonlocal self-similarity of each frame in the video sequence. Then, a reconstructed video frame from the previous stage is first divided into overlapped image patches, which are subsequently classified into different groups based on their similarities. Since these image patches in the same group have similar structures, they can be represented sparsely in the unit of a group instead of a block. The compression artifacts can be reduced by hardand soft-thresholding the singular values of image patches group-by-group, based on the sparse property of similar image patches. Experimental results show that such an in-loop filter design can significantly improve the compression performance of HEVC, providing a new possible direction for improving compression efficiency.

is how to effectively and efficiently process, analyze, and mine the numerous amounts of multimedia big data.

significantly increases the complexity of learning algorithms. One approach for simplification is to assume that the data of interest is on an embedded nonlinear manifold within the higher-dimensional space. In practice, when a data set contains multiple classes, and the structures of the classes are different, a single manifold assumption can hardly guarantee the best performance. To address this problem, Xin Guo, Yun Tie, Lin Qi, and Ling Guan propose a framework of semisupervised dimensionality reduction for multimanifold learning in their article, “A Novel Semi-Supervised Dimensionality Reduction Framework.” Technologically, the framework consists of three components: sparse manifold clustering to group unlabeled samples, cluster label predication to calculate the manifold-to-manifold distance, and graph construction to discover both the geometrical and discriminant structure of the data manifold. Experimental results verify the effectiveness of this multimanifold learning framework. Multimodal Machine Learning In “Multimodal Ensemble Fusion for Disambiguation and Retrieval,” Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, Dihong Gong, and Chunsheng Victor Fang address the machine-learning problem on multimedia data from another perspective—namely, from that of multimodal machine learning. Toward this end, they first explain why multimodal fusion works by analyzing the correlative and complementary relations among different modalities. By making use of these two properties, multimodal machine learning could achieve higher quality than single-modality approaches. Following this idea, the authors design a multimodal ensemble fusion model with different

April–June 2016

Learning over Multimedia Big Data For ubiquitous multimedia data, another important challenge is how to effectively and efﬁciently process, analyze, and mine the numerous amounts of multimedia big data. Machine learning is widely recognized as an effective tool to cope with this challenge. However, the very high dimensionality of features for multimedia data

An important challenge

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

ensemble approaches for word sense disambiguation and information retrieval, which combines the results of text-only processing and image-only processing to achieve better quality. Experimental results on the University of Illinois at Urbana–Champaign Image Sense Discrimination (UIUC-ISD) dataset and the Google-MM dataset demonstrated the effectiveness of the proposed model. Unsupervised Recurring-Pattern Detection Ubiquitous multimedia computing can also be explored in many new applications and services. In the article, “Planogram Compliance Checking Based on Detection of Recurring Patterns,” Song Liu, Wanqing Li, Stephen Davis, Christian Ritz, and Hongda Tian present a recurring-pattern-detection method for automatic planogram compliance checking, which is referred to as the verification process used by the company headquarters to verify whether each chain store follows the planograms created by headquarters to regulate how products should be placed on shelves. In their method, product layout is extracted from an input image by means of unsupervised recurring-pattern detection and matched via graph matching with the expected product layout specified by a planogram to measure the level of compliance. A divide-andconquer strategy is employed to improve the speed. Specifically, the input image is divided into several regions based on the planogram. Recurring patterns are detected in each region and are merged together to estimate the product layout. Experimental results on real data verify the efficacy of the proposed method.

Future Directions Nevertheless, many technical challenges are yet to be addressed in the ﬁeld of ubiquitous multimedia computing. We thus envision several future research directions in this ﬁeld that are worthy of attention from the multimedia community.

IEEE MultiMedia

Ultra-High Efficiency Compression Considering that the volume of multimedia big data approximately doubles every two years,5 multimedia compression and coding technologies are far from where they need to be. To achieve much higher and even ultra-high coding efﬁciency, one potential solution is to introduce vision-based mechanisms and models into the coding framework and develop vision-based coding theories and technologies to substitute

the traditional signal-processing-based coding framework. Moreover, it is also highly desirable to develop joint compression and coding technology for multiple media data, such as video, audio, and virtual reality data, which is essential to some attractive ubiquitous multimedia applications in unmanned aerial vehicles, self-driving cars, and augmented-reality products. Brain-Like Multimedia Intelligence For machine learning research, ubiquitous multimedia indicates both a major challenge and an important opportunity. On the one hand, larger-scale multimedia data available nowadays can lead toward advanced machine learning techniques, such as deep learning.1 On the other hand, new machine learning models and algorithms are still in urgent need of ways of efficiently processing, analyzing, and mining ubiquitous multimedia data. For example, in recent years, deep learning has shown overwhelmingly superior performance compared to traditional machine learning methods based on hand-craft features, such as scale-invariant feature transform (SIFT) in image classification. However, there is still much room for improvement when applied to video content analysis—for example, to recognize actions or detect abnormalities. One promising direction is to develop new intelligence algorithms by structurally and functionally simulating a human brain, leading to brain-like computation.6 This new intelligence paradigm will likely open another door for ubiquitous multimedia computing and applications. Benchmark Data for Ubiquitous Multimedia Computing The dataset is a core component of research and development in all scientific fields. More recently, the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset has become the largest public multimedia collection ever released, with 100 million media objects.1 This dataset enables large-scale unsupervised learning, semisupervised learning, and learning with noisy data to address questions across many fields—from computer vision to social computing. However, YFCC100M consists of photos and videos only from Flickr, making it remarkably different from many ubiquitous multimedia data, such as surveillance video. Even so, the availability of such large-scale benchmark data might shift the way in which we cope with the long-standing challenges in

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

ubiquitous multimedia computing, leading to important breakthroughs.

Yonghong Tian is a full professor with the Coopera-

Attractive Applications and Services The challenges and opportunities highlighted in this field will further foster some interesting developments in ubiquitous multimedia applications and services. Some examples include unmanned aerial vehicles and augmented-reality applications. Looking into the future, ubiquitous multimedia applications will bring new opportunities and driving forces to the research in the related fields. From the history of information technology, we can see that typically a major breakthrough will happen every 10 to 15 years, which will in turn foster new applications, markets, business models and industrial fields. Clearly, ubiquitous multimedia technology should take the responsibility of leading such a change in the next few years. MM

University, Beijing, China. His research interests

tive Medianet Innovation Center, School of Electronics Engineering and Computer Science, Peking include machine learning, computer vision, and multimedia big data. Tian received his PhD in computer science from the Institute of Computing Technology, Chinese Academy of Sciences. He is currently an Associate Editor of IEEE Transactions on Multimedia and the International Journal of Multimedia Data Engineering and Management (IJMDEM), and a Young Associate Editor of Frontiers of Computer Science. He is a senior member of IEEE and a member of ACM. Contact him at yhtian@pku.edu.cn. ____________ Min Chen is an assistant professor in the Computing and Software Systems Division, School of STEM, at the University of Washington Bothell. Her research interests include distributed multimedia database systems, data mining and their applications on reallife problems, and interdisciplinary projects. Chen received her PhD in computing and information sci-

Acknowledgments We thank all the authors and reviewers for their efforts on this special issue under a very tight schedule. In addition, we thank EIC Yong Rui and AEIC Wenjun Zeng for giving us the opportunity to organize this special issue. Yonghong Tian is partially supported by grants from the National Natural Science Foundation of China under contract No. 61390515 and No. 61425025.

ences from Florida International University, Miami. She is the treasurer for the IEEE Computer Society Technical Committee on Multimedia Computing, and she’s on the IEEE ICME Steering Committee. She was the lead program chair for 2015 IEEE International Symposium on Multimedia. Contact her at minchen2@uw.edu. ____________ Leonel Sousa is a full professor in the Electrical and Computer Engineering Department at Instituto Superior Tecnico (IST), Universidade de Lisboa (UL), Lisbon, Portugal. He is also a senior researcher with

References

the R&D Instituto de Engenharia de Sistemas e Com-

1. B. Thomee et al., “YFCC100M: The New Data in Multimedia Research,” Comm. ACM, vol. 59, no. 2, pp. 64–73. 2. Cisco Visual Networking Index: Forecast and Methodology, white paper, Cisco, May 2015; www.cisco. com/c/en/us/solutions/collateral/service-provider/ ip-ngn-ip-next-generation-network/white_________________________ paper c11-481360.html. ______________ 3. “Dealing with Data: Challenge and Opportunities,” Science, vol. 331, no. 6018, 2011, pp. 692–693. 4. W. Gao et al., “IEEE 1857 Standard Empowering Smart Video Surveillance Systems,” IEEE Intelligent Systems, vol. 29, no. 5, 2014, pp. 30–39.

computing, and signal processing systems. Sousa received his PhD in electrical and computer engineering from IST, UL, Portugal. He is currently an Associate Editor of IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Access, Springer Journal of Real-Time Image Processing (JRTIP), and IET Electronics Letters, and Editor in Chief of the Eurasip Journal on Embedded Systems (JES). He is Fellow of the Institution of Engineering and Technology (IET), a Distinguished Scientist of ACM, and a senior member of IEEE. Contact him at ________________ leonel.seabra@gmail.com.

April–June 2016

5. J. Gantz and D. Reinsel, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest

putadores (INESC-ID). His research interests include VLSI and computer architectures, high performance

Growth in the Far East, tech. report, Int’l Data Corp. (IDC), 2012; www.emc.com/collateral/analyst -reports/idc-digital-universe-united-states.pdf. ___________________________ 6. J. Hawkins and S. Blakeslee, On Intelligence, Henry Holt and Company, 2004.

_______________ _________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Ubiquitous Multimedia

Nonlocal In-Loop Filter: The Way Toward NextGeneration Video Coding? Siwei Ma Peking University Xinfeng Zhang Nanyang Technological University Jian Zhang, Chuanmin Jia, Shiqi Wang, and Wen Gao Peking University

deblocking filter,2 Sample Adaptive Offset (SAO),3 and Adaptive Loop Filter (ALF)4—and eventually adopted the first two. However, these in-loop filters only take advantage of the image’s local correlations, which limits their performance. Here, we explore the performance of in-loop filters for HEVC by taking advantage of both local and nonlocal correlations in images. We incorporate a nonlocal similarity-based loop filter (NLSLF) into the HEVC standard by simultaneously enforcing the intrinsic local sparsity and nonlocal self-similarity of each frame in the video sequence. For a reconstructed video frame from a previous stage, we first divide it into overlapped image patches and subsequently classify them into different groups based on their similarities. Because these image patches in the same group have similar structures, they can be represented sparsely in a group unit rather than a block unit.5 We can then reduce the compression artifacts by thresholding the singular values of image patches group by group, based on the sparse property of similar image patches. We also explore two kinds of thresholding methods— hard and soft thresholding—and their related adaptive threshold determination methods. Our extensive experiments on HEVC common test sequences demonstrate that the nonlocal similarity-based in-loop filter significantly improves the compression performance of HEVC, achieving up to an 8.1 percent bitrate savings.

In-Loop Filtering

Existing in-loop filters rely only on an image’s local correlations, largely ignoring nonlocal similarities. The proposed approach uses group-based sparse representation to jointly exploit local and nonlocal selfsimilarities, laying a novel and meaningful groundwork for in-loop filter design.

igh Efficiency Video Coding (HEVC)1 is the latest video coding standard jointly developed by the International Telecommunication Union–Telecommunication (ITU-T) Video Coding Experts Group (VCEG) and Moving Picture Experts Group (MPEG). Compared to H.264/AVC, HEVC claims to potentially achieve a more than 50 percent coding gain. The in-loop filtering is an important video coding module for improving compression performance by reducing compression artifacts and providing a high-quality reference for subsequent video frames. During the development of HEVC, researchers intensively investigated the performance of three kinds of in-loop filters—the

1070-986X/16/$33.00 c 2016 IEEE

The deblocking filter was the first adopted inloop filter in H.264/AVC to reduce the blocking artifacts caused by coarse quantization and motion compensated prediction.6 Figure 1 shows a typical example of the block boundary with the blocking artifact. H.264/AVC defines a set of low pass filters with different filtering strengths that are applied to 4 4 block boundaries. H.264/AVC has five levels of filtering strength, and the filter strength for each block boundary is jointly determined by the quantization parameters, correlations of samples on both side of block boundaries, and the prediction modes (intra- and interprediction). The deblocking filter in HEVC is similar to that in H.264/AVC. However, in HEVC, it’s applied only to 8 8 block boundaries, which are the boundaries of coding units (CU), prediction units (PU), or transform units (TU). Due to

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Figure 1. A one-dimensional example of the block boundary with the blocking artifact. Here, fpi g and fqi g are pixels in neighboring blocks.

a a

a c

c b

Figure 2. Four 1D directional patterns for edge offset sample classification. The samples in the positions, a, b and c, are used for comparison.

C0 C1

C1 C0 Figure 3. Adaptive loop filter (ALF) shape in HM7.0 (each square corresponds to a sample). The notations, c1 ; c2 ; …; c9 are the filter coefficients.

regions must be merged at the encoder side based on rate-distortion optimization (RDO), which makes neighboring regions share the same filters to achieve a good tradeoff between the filter performance and overheads. One of us (Zhang) and colleagues proposed reusing the filter coefficients and regions division in the previous encoded frame to reduce overheads.7 Stephan Wenger and his colleagues proposed placing the filter coefficient parameters in a picture-level header called the Adaptation Parameter Set (APS), which

April–June 2016

HEVC’s improved prediction accuracy, only three filtering strengths are used, thus reducing complexity compared to H.264/AVC. SAO is a completely new in-loop filter adopted in HEVC. In contrast to the deblocking filter, which reconstructs only the samples on block boundaries, SAO processes all samples. Because the sizes of coding, prediction, and transform units have been largely extended compared with previous coding standards— that is, the coding unit has been extended from 8 8 to 64 64, the prediction unit from 4 4 to 64 64, and the transform unit from 4 4 to 32 32—the compression artifacts inside the coding blocks can no longer be compensated by the deblocking filter. Therefore, SAO is applied to all samples reconstructed from the deblocking filter by adding an offset to each sample to reduce the distortion. SAO has proven to be a powerful tool to reduce ringing and contouring artifacts. To adapt the image content, SAO first divides a reconstructed picture into different regions and then derives an optimal offset for each region by minimizing the distortion between the original and reconstructed samples. SAO can use different offsets sample by sample in a region, depending on the sample classification strategy. In HEVC, two SAO types were adopted: edge offset and band offset. For the edge offset, the sample classification is based on comparing the current and the neighboring samples according to four one-dimensional neighboring patterns (see Figure 2). For the band offset, the sample classification is based on sample values, and the sample value range is equally divided into 32 bands. These offset values and region indices are signaled in the bitstream, which can impose a relatively large overhead. ALF is a Wiener-based adaptive filter; its coefficients are derived by minimizing the mean square errors between original and reconstructed samples. Numerous recent efforts have been dedicated to developing high-efficiency and lowcomplexity ALF approaches. In HEVC reference software HM7.0, the filter shape of ALF is a combination of a 9 7-tap cross shape and a 3 3tap rectangular shape, as Figure 3 illustrates. Therefore, only correlations within a local patch are used to reduce the compression artifacts. To adapt the properties of an input frame, up to 16 filters are derived for different regions of the luminance component. Such high adaptability also creates a large overhead, which should be signaled in the bitstream. Therefore, these

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Related Work in Nonlocal Image Filters In existing video coding standards, in-loop filters focus only on the local correlation within image patches without fully considering nonlocal similarities. However, in image restoration and denoising fields, researchers have proposed many methods based on image nonlocal similarities.1–5 Antoni Buades and his colleagues proposed the famous nonlocal means filter (NLM) to remove different kinds of noise by predicting each pixel with a weighted average of nonlocal pixels, where the weights are determined by the similarity of image patches located at the source and target coordinates.1 The well-known block-matching and 3D filtering (BM3D) denoising filter stacks nonlocal similar image patches into 3D matrices and removes noise by shrinking coefficients of 3D transform of similar image patches based on the image-sparse prior model.2 Other research used the nonlocal similar image patches to suppress compression artifacts, which is achieved by adaptively combining the pixels restored by the NLM filter and reconstructed pixels according to the reliability of NLM prediction and quantization noise in the transform domain.3–5 In other work, the authors use a group of nonlocal similar image patches to construct image-sparse representation, which can be further applied to image deblurring, denoising, and inpainting.6–8 Although these nonlocal methods significantly improve the quality of restored images, all of them are treated as post-processing filters and thus don’t fully exploit the compression information. Masaaki Matsumura and his colleagues first introduced the NLM filter to compensate for the shortcomings of HEVC with only image-local prior models; to improve the coding performance, they used delicately designed patch shapes, search window shapes, and optimizing filter on/off control modules.9,10 Finally, Qinglong Han and his colleagues also employed nonlocal similar image patches in a quadtree-based Kuan’s filter to suppress compression artifacts; the pixels restored by the NLM filter and the reconstructed pixels are adaptively combined according to the variance of image signals and quantization noise.11 However, the weights in these filters are difficult to determine, leading to limited coding performance improvement.

References 1. A. Buades, B. Coll, and J. M. Morel, “A Non-Local Algorithm for Image Denoising,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition (CVPR), vol. 2, 2005, pp. 60–65. 2. K. Dabov et al., “Image De-Noising by Sparse 3D TransformDomain Collaborative Filtering,” IEEE Trans. Image Processing, vol. 16, no. 8, 2007, pp. 2080–2095. 3. X. Zhang et al., “Reducing Blocking Artifacts in Compressed Images via Transform-Domain Non-local Coefficients Estimation,” Proc. IEEE Int’l Conf. Multimedia and Expo (ICME), 2012, pp. 836–841. 4. X. Zhang et al., “Compression Artifact Reduction by OverlappedBlock Transform Coefficient Estimation with Block Similarity,” IEEE Trans. Image Processing, vol. 22, no. 12, 2013, pp. 4613–4626. 5. X. Zhang et al., “Artifact Reduction of Compressed Video via ThreeDimensional Adaptive Estimation of Transform Coefficients,” Proc. IEEE Int’l Conf. Image Processing (ICIP), 2014, pp. 4567–4571. 6. J. Zhang et al., “Image Restoration Using Joint Statistical Modeling in a Space-Transform Domain,” IEEE Trans. Circuits and Systems for Video Technology, vol. 24, no. 6, 2014, pp. 915–928. 7. X. Zhang et al., “Compression Noise Estimation and Reduction via Patch Clustering,” Proc. Asia-Pacific Signal and Information Processing Assoc. Ann. Summit and Conf., vol. 16, no. 19, 2015, pp. 715–718. 8. J. Zhang, D. Zhao, and W. Gao, “Group-Based Sparse Representation for Image Restoration,” IEEE Trans. Image Processing, vol. 23, no. 8, 2014, pp. 3336–3351. 9. M. Matsumura et al., “In-Loop Filter Based on Non-local Means Filter,” Joint Collaborative Team on Video Coding (JCTVC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVCE206, 2011; ____________________________ http://phenix.int-evry.fr/jct/doc end user/documents/5 Geneva/wg11. ______________ 10. M. Matsumura, S. Takamura, and A. Shimizu, “Largest Coding Unit Based Framework for Non-Local Means Filter,” Proc. AsiaPacific Signal Information Processing Assoc. Ann. Summit and Conference (APSIPA ASC), 2012 Asia-Pacific, Dec. 2012, pp. 1–4. 11. Q. Han et al., “Quadtree-Based Non-Local Kuans Filtering in Video Compression,” J. Visual Communication and Image Representation, vol. 25, no. 5, 2014, pp. 1044–1055.

makes in-loop ﬁlter parameters reuse more ﬂexible with APS indices.8

IEEE MultiMedia

The Nonlocal Similarity-Based In-Loop Filter In addition to image local-correlation-based ﬁlters, many nonlocal-correlation-based ﬁlters have been proposed in the literature (see the “Related Work in Nonlocal Image Filters” sidebar). In our previous work,5 we formulated a new sparse representation model in terms of a

group of similar image patches. Our group-based sparse representation (GSR) model can exploit the local sparsity and the nonlocal self-similarity of natural images simultaneously in a uniﬁed framework. Here, we describe how the NLSLF is designed in stages based on the GSR model. Patch Grouping The basic idea of GSR is to adaptively sparsify the natural image in the domain of a group. Thus, we ﬁrst show how to construct a group.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

xi ∈

Extracting

XGi ∈

Matching

Bs × K

Stacking

Group ﬁltering and reconstruction K Figure 4. Framework of the nonlocal similarity-based loop filter (NLSLF). The high-quality image is reconstructed via patch grouping, group filtering, and reconstruction.

In fact, each group is represented by a matrix, which is composed of nonlocal patches with similar structures. For a video frame, I, we first divide it into S overlapped image patches with pffiffiffiffiffi pffiffiffiffiffi the size of Bs Bs . Each patch is reorganized into a vector, xk , k ¼ 1; 2; :::; S, as illustrated in Figure 4. For every image patch, we find K nearest neighbors according to the Euclidean distance between different image patches, dðxi ; xj Þ ¼ kxi xj k22 :

(1)

These K similar image patches are stacked into a matrix of size Bs K; XGi ¼ xGi ;1 ; xGi ;2 ; :::; xGi ;K : (2) Here, XGi contains all the image patches with similar structures, which we call a group.

k¼1

where !Gi ¼ ½!Gi ;1 ; !Gi ;2 ; :::; !Gi ;M is a column vector, RGi ¼ diagð!Gi Þ is a diagonal matrix with the elements of c Gi as its main diagonal, and uGi ;k vTGi ;k are the columns of UGi and VGi ,

Y ¼ X þ N;

(4)

where N is the compression noise and X and Y (without any subscript) represent the original and reconstructed frames, respectively. To derive the sparse representation parameters, we apply thresholding, which is a widely used operation for coefﬁcients with sparse property in image denoising problems. We apply two kinds of the thresholding methods—hard and soft thresholding— to the singular values in !Gi , which is composed of singular values of matrix Y, ðhÞ

aGi ¼ hardð!Gi ;s Þ and

(5)

ðsÞ aGi

(6)

¼ softð!Gi ;s Þ;

where the hard and soft thresholding are deﬁned as hard ðx; sÞ ¼ signðxÞ absðxÞ s1 (7) soft ðx; sÞ ¼ signðxÞ max absðxÞ s1; 0 : (8)

April–June 2016

Group Filtering and Reconstruction Because the image patches in the same group are very similar, they can be represented sparsely. For each group, we apply singular value decomposition (SVD) and get image sparse representation, M X XGi ¼ UGi RGi VTGi ¼ !Gi ;k uGi ;k vTGi ;k ; (3)

respectively. M is the maximum dimension of matrix XGi . The matrix composed of the corresponding compressed video frame is formulated as

Here, stands for the element-wise product of two vectors, sign( ) is the function extracting the sign of every element of a vector, 1 is an allones vector, and s denotes the threshold. After achieving the shrunken singular values, the ^ is given by restored group of image patches x

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Figure 5. The relationship between

ðQP ¼ 27; 32; 38; 45Þ, which are further converted to the quantization step sizes (Qsteps), as Figure 5 shows. We can infer that different sequences with the same quantization parameter or Qstep have similar optimal values of rn , implying that rn is closely related with the quantization parameter or Qstep. Inspired by this, we estimate the optimal value of rn directly from the Qstep by curve ﬁtting using the following empirical formulation,

16 14

Qstep and the

The linear function

12 10 σ

standard deviation of compression noise. can fits well with their relationship.

8 6 Best (BasketballDrive) Best (FourPeople) Estimated

4 2

^¼ x

M X

60 Qstep

100

aGi ; k ðuGi ;k v TGi ;k Þ:

r ¼ a Qstep þ b;

(10)

where the Qstep can be easily derived from the quantization parameter based on the following relationship in HEVC:

120

Qstep ¼ 2

(9)

k¼1

ðQP 4Þ : 6

(11)

Table 1 shows the parameters ða; bÞ for different coding configurations. Based on the filtering performance, we further use the size and number of similar image patches in one group as a scale factor pffiffiffiffi s ¼ rn ðBs þ KÞ; (12)

Because these image patches are overlap extracted, we simply take the average of the overlapped samples as the final filtered values. Threshold Estimation Based on the above discussion, we determine the filtering strength by the thresholding-level parameter s in Equations 5 and 6. However, given that various video content is compressed with different quantization parameters, this is a nontrivial problem that has not been well resolved. In essence, the optimal threshold is closely related with the standard deviation of noise denoted as rn , and larger thresholds correspond to higher rn values. In video coding, the compression noise is mainly caused by quantizing the transform coefficients. Therefore, we can use quantization steps to determine the standard deviation of the compression noise and a scale factor to adapt different prediction modes, including intra- and interprediction. For hard thresholding, the optimal values of rn are derived experimentally based on the sequences BasketballDrive and FourPeople, compressed with different quantization parameters

where rn is the standard deviation of compression noise for the whole image, which is estimated based on Equation 10. For soft thresholding, based on the ﬁltering performance, we take the optimal threshold formulation for generalized Gaussian signals, s¼

cr2n ; rx

(13)

where rx is the standard deviation of original signals that can be estimated by r2x ¼ r2y r2n :

(14)

Because the variance of compression noise, rn , is derived at the encoder side, we quantize it into the nearest integer range,9 which is signaled with 4 bits and transmitted in the bitstream. Therefore, 12 bits are encoded in total for one frame with three color components— for example, YUV. The two thresholds for both

IEEE MultiMedia

Table 1. The coefficient for estimating r for all configurations. All intra coding

Low delay B coding

Random access coding

0.13000

0.7100

0.10450

0.4870

0.10450

0.4870

0.06623

0.8617

0.03771

0.8833

0.03771

0.8833

0.06623

0.8617

0.03771

0.8833

0.03771

0.8833

Color component

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

hard and soft thresholding operations increase with the standard deviation of compression noise, which implies that the frames with more noise should be filtered with higher strength. Furthermore, the thresholds decrease with the standard deviation of signals, which can avoid over-smoothing for smooth areas. Filtering On/Off Control To ensure that the NLSLF consistently leads to distortion reduction, we introduce on/off control flags for frame and largest coding unit (LCU) levels, which should be signaled in the bitstream. Specifically, regarding the framelevel on/off control, three flags—Filtered Y, Filtered U, and Filtered V—are designed for the corresponding color components Y, U, and V, respectively. When the distortions of the filtered image decrease, the corresponding flag signals as true, indicating that the image color component is finally filtered. For the on/off control at the LCU level, each LCU needs only one flag Filterd LCU[i] to indicate the on/off filtering for the luminance component of the corresponding LCU. In the picture header syntax structure, three bits are encoded to signal frame-level control flags for each color component, respectively. We place the syntax elements of the LCU-level control flags in coding tree unit parts, using only one bit for each LCU.

Experimental Results and Analysis

Class A: 2560 1600,

Class B: 1920 1080,

Class C: 832 480,

thresholding operations increase with the standard deviation of compression noise, which implies that the frames with more noise should be filtered with higher strength. Class D: 416 240, and Class E is 1280 720. Class F contains screen videos with three different resolutions: 1280 720, 1024 768, and 832 480. We tested four typical quantization parameters—22, 27, 32, and 37—and three common coding configurations: all intra coding (AI), low delay B (LDB) coding, and random access (RA) coding. Along with the increase of K and Bs , the computational complexity increases rapidly, while the filtering performance might decrease for some sequences because dissimilar structures are more likely to be included. Therefore, in our experiments, the size of image patches is set to Bs ¼ 6, and the number of nearest neighbors for each image patch is set to K ¼ 30 for all the sequences. For each frame, we extract image patches every five pixels according to the raster scanning order, which makes the image patches overlap. First, we treat the HM12.0 with and without ALF as anchors. The overall coding performances of NLSLF-S and NLSLF-H with only frame-level control are shown in Tables 2–5. Both of the two thresholding filters with nonlocal image patches achieve significant bitrate savings compared to HM12.0 without ALF. NLSLF-S achieves 3.2 percent, 3.1 percent, and 4.0 percent bitrate savings on average for the AI, LDB, and RA configurations, respectively. Moreover, NLSLF-H achieves 4.1 percent, 3.3 percent, and 4.4 percent bitrate savings on

April–June 2016

In our experiments, we implement the nonlocal similarity-based in-loop filter in the HEVC reference software, HM12.0. We denote the hard-threshold filtering (with the threshold in Equation 12) as NLSLF-H, and the soft-threshold filtering (with the threshold in Equation 13) as NLSLF-S. To better analyze the performance of the nonlocal similarity-based in-loop filter, we further integrate the ALF from HM3.0 into HM12.0 (in which the ALF tool has been removed) and compare the nonlocal similaritybased in-loop filter with ALF. The test video sequences in our experiments are widely used in HEVC common test conditions. There are 20 test sequences that are classified into six categories (Classes A–F). The resolutions for the first five categories are as follows:

Hard and soft

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 2. Performance of the nonlocal similarity-based loop filter with soft thresholding (NLSLF-S) on HM12.0 with adaptive loop filtering (ALF) turned off. All intra coding (%) Sequences Class A

Low delay B (LDB) coding (%)

Random access coding (%)

4.3

4.0

3.9

3.5

3.3

2.3

4.8

6.1

5.7

Class B

2.9

3.3

4.0

3.0

4.2

4.3

5.5

4.7

Class C

2.8

4.6

6.2

1.6

3.4

5.4

2.1

5.1

6.5

Class D

2.0

4.5

5.5

1.3

2.4

2.5

1.6

3.5

4.4

Class E

5.8

5.3

4.4

7.9

10.0

9.5

9.8

9.4

8.6

Class F

2.5

3.1

3.4

1.7

2.8

3.3

2.2

4.4

4.7

Overall

3.4

4.1

4.6

3.2

4.4

4.5

4.1

5.6

5.8

Table 3. Performance of the nonlocal similarity-based loop filter with soft thresholding (NLSLF-S) on HM12.0 with adaptive loop filtering (ALF) turned on. All intra coding (%)

Low delay B (LDB) coding (%)

Random access coding (%)

Class A

1.8

2.3

2.4

1.0

3.9

2.5

2.2

5.2

5.0

Class B

1.8

2.1

3.0

1.8

3.9

4.7

2.6

5.0

5.2

Class C

2.7

3.5

4.5

1.7

4.4

5.9

2.2

5.6

6.4

Class D

1.9

2.8

3.7

1.7

2.2

3.2

1.8

3.7

4.6

Class E

3.9

2.8

2.1

6.1

7.5

6.0

7.4

7.3

6.2

Class F

2.4

2.9

3.2

1.9

3.6

3.9

2.0

4.2

4.5

Overall

2.4

2.7

3.2

2.4

4.2

4.4

3.0

5.1

5.3

Sequences

Table 4. Performance of the nonlocal similarity-based loop filter with hard thresholding (NLSLF-H) on HM12.0 with adaptive loop filtering (ALF) turned off. All intra coding (%)

Low delay B (LDB) coding (%)

Random access coding (%)

Class A

4.9

3.0

3.5

3.1

1.2

1.4

4.2

3.1

2.8

Class B

3.2

2.2

3.9

3.2

3.5

3.7

4.3

3.9

3.8

Class C

3.6

4.9

6.9

1.9

3.4

4.8

2.5

4.2

5.9

Class D

3.1

4.4

5.9

1.5

2.5

2.8

2.1

3.4

Class E

7.1

8.5

8.9

7.4

9.5

10.5

10.0

11.4

12.1

Class F

3.5

4.4

5.0

2.4

2.8

3.6

3.0

5.0

5.4

Overall

4.2

4.6

5.7

3.3

3.8

4.5

4.3

5.2

5.6

Sequences

IEEE MultiMedia

average for the all intra, LDB, and random access conﬁgurations, respectively, compared to HM12.0 without ALF. When the nonlocal similarity-based in-loop ﬁlters are combined with ALF, NLSLF-S achieves approximately 2.6 percent, 2.6 percent, and 3.2 percent bitrate savings for all intra, LDB, and random access coding, respectively, and NLSLF-H achieves

approximately 3.1 percent, 2.8 percent, and 3.4 percent bitrate savings for all intro, LDB, and random access coding, respectively, compared with HM12.0 with ALF. Although the NLSLF improvements are not as signiﬁcant as those achieved without ALF, they can still further improve the performance of HEVC with ALF. This veriﬁes that nonlocal

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 5. Performance of the nonlocal similarity-based loop filter with hard thresholding (NLSLF-H) on HM12.0 with adaptive loop filtering (ALF) turned on. All intra coding (%)

Low delay B (LDB) coding (%)

Random access coding (%)

Class A

2.1

1.4

1.8

1.0

1.6

1.3

1.7

2.3

2.1

Class B

1.9

1.0

2.5

2.1

2.9

3.3

2.6

3.0

3.8

Class C

3.1

2.6

5.0

2.0

4.0

5.1

2.2

4.3

5.9

Class D

2.6

1.6

3.1

1.6

2.5

3.0

1.9

3.6

3.8

Class E

4.9

4.5

3.9

5.5

5.6

7.5

6.8

Class F

3.1

4.3

5.0

2.8

3.5

3.6

2.9

4.7

5.3

Overall

2.9

2.6

3.5

2.5

3.3

3.7

3.1

4.2

4.6

Sequences

Table 6. Performance of the nonlocal similarity-based loop filter with soft thresholding (NLSLF-S) with largest coding unit (LCU) level control for each sequence. All intra coding (%) Sequences Class A Class B

Class C

Class D

Class E

Class F

Overall

Low delay B (LDB) coding (%)

Random access coding (%)

Traffic

2.0

2.4

2.3

1.9

1.5

2.9

3.9

3.2

PeopleOnStreet

2.4

2.7

2.4

2.8

5.2

3.4

2.5

5.8

6.1

Kimono

1.9

1.0

1.8

3.0

4.3

4.4

1.5

2.8

4.1

ParkScene

0.6

0.5

0.9

1.4

0.5

1.3

0.4

0.1

Cactus

2.4

1.5

4.5

4.1

2.3

4.9

4.3

6.8

7.3

BasketballDrive

1.9

4.7

5.2

2.5

9.1

8.5

2.3

8.0

6.9

BQTerrace

2.8

2.5

2.7

4.6

2.5

4.9

7.2

4.4

5.6

BasketballDrill

4.3

7.0

8.6

3.1

10.2

11.9

3.3

11.8

13.0

BQMall

4.2

3.8

4.0

4.7

4.3

4.5

4.4

5.4

5.0

PartyScene

0.9

1.3

1.8

1.4

0.9

1.5

1.8

0.1

0.2

RaceHorsesC

1.3

1.8

3.6

2.7

3.1

7.6

2.6

3.6

7.3

BasketballPass

3.4

4.5

4.7

2.4

4.0

3.6

2.0

5.2

4.6

BQSquare

1.7

0.9

2.6

1.5

1.0

0.4

2.4

0.8

1.9

BlowingBubbles

1.1

2.9

3.6

1.9

2.7

0.5

2.2

3.7

4.1

RaceHorses

2.1

3.3

4.4

3.3

1.0

5.6

2.7

4.6

7.2

FourPeople

3.2

2.5

1.7

4.8

5.6

4.5

5.6

5.2

4.7 5.8

Johnny

4.9

3.0

1.7

6.7

7.7

5.3

8.1

6.8

KristenAndSara

3.6

2.6

2.7

5.2

5.0

4.4

6.0

7.4

5.2

BasketballDrillText

4.4

6.7

7.8

3.3

8.2

8.5

3.7

10.3

10.8

ChinaSpeed

1.7

2.5

2.9

2.1

3.1

2.3

4.6

4.4

SlideEditing

1.9

0.5

0.8

2.1

0.2

0.4

2.1

0.5

0.8

SlideShow

1.4

1.5

1.4

0.8

3.2

1.4

0.0

0.7

0.9

2.5

2.7

3.1

3.7

3.9

3.3

4.8

5.0

NLSLF-S in our experiments, soft thresholding outperforms hard thresholding for some sequences, such as for Class E in the LDB coding conﬁguration and Class A in the LDB and random access coding conﬁgurations. Table 6 shows the detailed results of NLSLF-S with LCU-level control for each sequence.

April–June 2016

similarity offers more beneﬁts for compression artifact reduction than local similarity alone. Because hard- and soft-thresholding operations are suitable for signals with different distributions, they show different coding gains on different sequences. Although NLSLF-H achieves better performance for most sequences than

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Johnny (ALF off)

KristenAndSara (ALF off)

PSNR

41 40

HEVC Anchor NLSLF-S

38 500

(a)

1,500 1,000 Bitrate (Kbit/s)

37 HEVC Anchor NLSLF-S

2,000

36 0

39 38

37 0

FourPeople (ALF off)

PSNR

Peak signal to noise ratio (PSNR)

500

(b)

1,000 1,500 Bitrate (Kbit/s)

HEVC Anchor NLSLF-S

2,000

35 0

500

(c)

1,000 1500 2,000 2,500 Bitrate (Kbit/s)

Figure 6. The rate-distortion performance of the nonlocal similarity-based loop filter with soft thresholding (NLSLF-S) compared with HEVC with the adaptive in-loop filter turned off. The test involved three sequences: (a) Johnny, (b) KristenAndSara, and (c) FourPeople. All three sequences are compressed by HEVC RA coding.

Johnny (ALF off)

KristenAndSara (ALF off)

PSNR

41 40

HEVC Anchor NLSLF-H

(a)

500

1,000 1,500 Bitrate (Kbit/s)

2,000

37 HEVC Anchor NLSLF-H

37 36 0

(b)

39 38

37 0

FourPeople (ALF off)

PSNR

500

1,000 1,500 Bitrate (Kbit/s)

2,000

HEVC Anchor NLSLF-H

36 35 0

(c)

500

1,000 1,500 2,000 2,500 Bitrate (Kbit/s)

Figure 7. The rate-distortion performance of the nonlocal similarity-based loop filter with hard thresholding (NLSLF-H) compared with HEVC with the adaptive in- loop filter turned off. The test involved three sequences: (a) Johnny, (b) KristenAndSara, and (c) FourPeople. All three sequences are compressed by HEVC RA coding.

IEEE MultiMedia

Although LCU-level control increases overheads, it can also improve coding efficiency by avoiding the over-smoothing case. Further, this shows that room still exists for improving the filtering efficiency by designing more reasonable thresholds for group-based sparse coefficients. Figures 6 and 7 illustrate the rate-distortion curves of NLSF and HEVC without ALF for the sequences Johnny, KristenAndSara, and FourPeople, which are compressed at different quantization parameters under the random access configuration. As the figures show, the coding performance is significantly improved in a wide bit range with the nonlocal similarity-based inloop filters. We further compare the visual quality of the decoded video frames with different in-loop filters in Figure 8. The deblocking filter removes only the blocking artifacts, and it is difficult to

reduce other artifacts, such as the ringing artifacts around the coat’s stripes in the Johnny image. Although SAO can process all the reconstructed samples, its performance is constrained by the large overheads, such that blurring edges still exist. The nonlocal similarity-based filters can efficiently remove different kinds of compression artifacts, as well as recover destroyed structures by utilizing nonlocal similar image patches, such as recovering most of the lines in Johnny’s coat. Although NLSLF achieves significant improvement for video coding, it also introduces many computational burdens, especially due to SVD. Compared with HM12.0 encoding, NLSLFH’s encoding time increase is 133 percent, 30 percent, and 33 percent for all intra, LDB, and random access coding, respectively. This also proposes new challenges for loop filter research

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Figure 8. Visual quality comparison for the Johnny sequence when the adaptive inloop filter (ALF) is off: (a) images reconstructed with HEVC Anchor; (b) Images reconstructed with the nonlocal similaritybased loop filter with soft thresholding (NLSLF-S), and (c) images reconstructed with NLSLF with hard thresholding (NLSLF-H).

(a)

(b)

on image nonlocal correlations, which we plan to explore in our future work.

which obtains predictions from a relatively large range compared to intraprediction, leading to signiﬁcant performance improvement. However, to the maximum extent, only a unique pair of patches can be employed, such as one image patch in unidirectional predictions and two image patches in bidirectional predictions. This signiﬁcantly limits the prediction technique’s potential, as the number of similar image patches can be further extended to fully exploit the spatial and temporal redundancies. With the new technological advances in hardware and software, we could have foreseen the arrival and maturity of these nonlocalbased coding techniques. We also believe that the nonlocal-based video coding technology described in this article—or similar technologies developed based on it—could play an important role in the future of video standardization. MM

April–June 2016

he novelty in our approach lies in adopting the nonlocal model in the in-loop filtering process, which leads to reconstructed frames with higher fidelity. To estimate the noise level, we examined different kinds of thresholding operations, confirming that the nonlocal strategy can significantly improve the coding efficiency. This offers new opportunities for in-loop filter research with nonlocal prior models. It also opens up new space for future exploration in nonlocal-inspired high-efficiency video compression. Apart from in-loop filtering, the nonlocal information can motivate the design of other key modules in video compression as well. Traditional video coding technologies focus mainly on reducing the local redundancies by intraprediction with limited neighboring samples. This interprediction can be regarded as a simplified version of nonlocal prediction,

(c)

Acknowledgments This special issue is a collaboration between the 2015 IEEE International Symposium on

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Multimedia (ISM 2015) and IEEE MultiMedia. This article is an extended version of “Non-Local Structure-Based Filter for Video Coding,” presented at ISM 2015. This work was supported in part by the National Natural Science Foundation of China (grants 61322106, 61572047, and 61571017) and the National Basic Research Program of China 973 Program (grant 2015CB351800).

doctorate work at the University of Southern California. Contact him at ____________ fswma@pku.edu.cn. Xinfeng Zhang, the corresponding author for this article, is a research fellow at Nanyang Technological University, Singapore. His research interests include image and video processing, and image and video compression. Zhang received a PhD in computer science from the Institute of Computing Technology,

References

Chinese Academy of Sciences, Beijing. Contact him at _____________ xfzhang@ntu.edu.sg.

1. G. Sullivan et al., “Overview of the High Efficiency Video Coding (HEVC) Standard,” IEEE Trans. Circuits and Systems for Video Technology, vol. 22, no. 12, 2012, pp. 1649–1668. 2. A. Norkin et al., “HEVC De-blocking Filter,” IEEE Trans. Circuits and Systems for Video Technology, vol. 22, no. 12, 2012, pp. 1746–1754. 3. C.-M. Fu et al., “Sample Adaptive Offset in the HEVC Standard,” IEEE Trans. Circuits and Systems for Video Technology, vol. 22, no. 12, 2012, pp. 1755–1764. 4. C.-Y. Tsai et al., “Adaptive Loop Filtering for Video Coding,” IEEE J. Selected Topics in Signal Processing, vol. 7, no. 6, 2013, pp. 934–945. 5. J. Zhang, D. Zhao, and W. Gao, “Group-Based Sparse Representation for Image Restoration,” IEEE Trans. Image Processing, vol. 23, no. 8, 2014, pp. 3336–3351. 6. P. List et al., “Adaptive Deblocking Filter,” IEEE Trans. Circuits and Systems for Video Technology, vol. 13, no. 7, 2003, pp. 614–619. 7. X. Zhang et al., “Adaptive Loop Filter with Temporal Prediction,” Proc. Picture Coding Symposium (PCS), 2012, pp. 437–440. 8. S. Wenger et al., “Adaptation Parameter Set (APS),” Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/

Jian Zhang is a postdoctoral fellow at National Engineering Laboratory for Video Technology (NELVT), Peking University, Beijing. His research interests include image/video coding and processing, compressive sensing, sparse representation, and dictionary learning. Zhang received a PhD in computer science from the Harbin Institute of Technology, China. He was the recipient of the Best Paper Award at the 2011 IEEE Visual Communication and Image Processing. Contact him at _______________ jian.zhang@pku.edu.cn. Chuanmin Jia is a doctoral student at the Institute of Digital Media, EECS, Peking University. His research interests include image processing and video compression. Jia received his BS in computer science from Beijing University of Posts and Telecommunications. Contact him at __________ cmjia@pku.edu. cn. __ Shiqi Wang is a postdoc fellow in the Department of Electrical and Computer Engineering, University of Waterloo, Canada. His research interests include video compression and image video quality assessment. Wang received a PhD in computer application technology from Peking University. Contact him at sqwang@pku.edu.cn. _____________

SC29/WG11, JCTVC-F747, 2011; __________ http://phenix.intend user/documents/6 Torino/ evry.fr/jct/doc ___________________________ wg11. ___

Wen Gao is a professor of computer science at the Institute of Digital Media, EECS, Peking University.

9. K. Dabov et al., “Image De-noising by Sparse 3D

His research interests include image processing,

Transform-Domain Collaborative Filtering,” IEEE Trans. Image Processing, vol. 16, no. 8, 2007, pp.

video coding and communication, pattern recognition, multimedia information retrieval, multimodal

2080–2095.

interfaces, and bioinformatics. Gao received a PhD in electronics engineering from the University of Tokyo. Contact him at ____________ wgaog@pku.edu.cn.

Siwei Ma is an associate professor at the Institute of

IEEE MultiMedia

Digital Media, School of Electronic Engineering and Computer Science (EECS), Peking University, Beijing. His research interests include image and video coding, video processing, video streaming, and transmission. Ma received a PhD in computer science from Institute of Computing Technology, Chinese Academy of Sciences, Beijing, and did post-

_______________ _________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

________________________ ___________

____ _______________________

_____________

__________ _________

_______________

____________________

_________________________________________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Ubiquitous Multimedia

A Novel SemiSupervised Dimensionality Reduction Framework Xin Guo, Yun Tie, and Lin Qi Zhengzhou University Ling Guan Ryerson University

A novel framework of semisupervised dimensionality reduction for multimanifold learning aims to address the issue of label insufficiency under the multimanifold assumption. Experimental results verify the advantages and effectiveness of this new framework.

anifold learning has been an active research topic in computer vision and pattern recognition for many years. Classical methods—such as locally linear embedding (LLE),1 Isomap,2 and Laplacian Eigenmap (LE)3—address the issue of data representation using the model of a single manifold in an unsupervised manner (see the “Manifold Learning Methods” sidebar for more detailed information and other related work in this area). However, many manifold learning methods1–6 assume that all the samples in a dataset reside on a single manifold. For multiclass recognition tasks, this assumption brings about two problems: ﬁrst, because true distribution of the data isn’t considered, the data can’t be properly modeled by a single manifold, leading to suboptimal classiﬁcation accuracy. Second, the reduced dimensions must be the same for all classes, because there is only one projection matrix under this assumption. To address these problems, Rui Xiao and his colleagues proposed using the assumption that

1070-986X/16/$33.00 c 2016 IEEE

different classes reside on different manifolds,7 which leads to very good performance compared to methods using the single manifold assumption. However, Xiao’s work did not take label information into account, so this can be viewed as an extension of locality preserving projections (LPP; see the “Manifold Learning Methods” sidebar). Recently, Jiwen Lu and his colleagues8,9 introduced discriminative multimanifold analysis (DMMA) to address the issue of single sample face recognition. DMMA was further extended in other work,10 adding sparse optimization to the framework. Although supervised learning algorithms generally outperform unsupervised learning algorithms, the collection of labeled training data in supervised learning requires expensive human labor and can be very time consuming. In addition, it is much easier to obtain unlabeled data. To jointly use the information embedded in the huge amount of unlabeled data and the relatively limited amount of labeled data for better classiﬁcation, several semisupervised methods have been proposed in recent years (see the “Semisupervised Methods” sidebar for more information).11–14 Despite the success of the existing semisupervised methods, to the best of our knowledge, they are all under the single manifold assumption. Motivated by the merits of multimanifold assumption, we extend semisupervised manifold learning to the framework under the multimanifold assumption. We then solve this problem by clustering the unlabeled samples using sparse manifold clustering, predicting the cluster label by calculating the manifold-tomanifold distance, and constructing three kinds of graphs to discover both the geometrical and discriminant structure of the data manifold.

The Proposed Framework In the semisupervised problem, we have only partially labeled samples in the dataset S ¼ fX;Yg, where X and Y denote data and label matrices, respectively. We assume there are l labeled points and u unlabeled points. Then, the observed input points can be written as X ¼ ðXL ; XU Þ, where XL ¼ ðX1 ; X2 ; …; Xl Þ and XU ¼ ðxlþ1 ; xlþ2 ; …; xlþu Þ. Each point x 2 <d is a d-dimensional vector. The label matrix is Y ¼ YL ¼ ðy1 ; y2 ; …; yl Þ. Let N denote the number of total classes, with each class containing l=N labeled samples. The goal of semisupervised learning dimensionality reduction is to use the information of both labeled and unlabeled samples to map

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Manifold Learning Methods 1

While locally linear embedding (LLE) and Laplacian Eigenmap (LE)2 focus on preserving local structures, Isomap3 attempts to preserve the geodesic distance between samples. However, due to the implicitness of the nonlinear maps, they are not directly applicable to new test samples, suffering from the so-called “out-of-sample problem.” This limits the application of these methods to classification tasks, leading to the development of more advanced algorithms.4 In particular, the method of locality preserving projections (LPP)4 provides mapping for the whole data space. Because the mapping is linear, it avoids the out-ofsample problem and can directly apply to recognition tasks. To fully use class label information, which is important for recognition tasks, Wei Zhang and his colleagues5 proposed discriminant neighborhood embedding (DNE), which can embed classification by not only forming a compact submanifold for the same class but also by widening gaps among submanifolds corresponding to different classes. However, because DNE simply assigns þ1 and 1 to intraclass and interclass neighbors, respectively, it could not well preserve the data’s local and geometrical structure information. Consequently, Jianping Gou and Zhang Yi6 introduce locality-based DNE (LDNE), which considers both the “locality” in LPP and the “discrimination” in DNE in an integrated modeling environment. Shuicheng Yan and his colleagues7 demonstrate that several dimension reduction algorithms (such as principle component analysis (PCA),8 linear discriminant analysis (LDA),9 Isomap, LLE, and LE) can be unified within a proposed graph-embedding framework, in which the desired statistical or geometric data properties are encoded as graph relationships. Tianhao Zhang and his colleagues further reformulated several dimension reduction algorithms into a unified patch alignment framework.10 Recently, maximal linear embedding (MLE)11 was also proposed to align local models into a global coordinate space.

1. S.T. Roweis and L.K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” J. Science, vol. 290, no. 5500, 2000, pp. 2323–2326. 2. M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” Proc. Advances in Neural Information Processing Systems 14 (NIPS), 2001, pp. 585–591. 3. J.B. Tenenbaum, V. De Silva, and J.C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” J. Science, vol. 290, no. 5500, 2000, pp. 2319–2323. 4. X. He et al., “Face Recognition Using Laplacianfaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, 2005, pp. 328–340. 5. W. Zhang et al., “Discriminant Neighborhood Embedding for Classification,” Pattern Recognition, vol. 39, no. 11, 2006, pp. 2240–2243. 6. J. Gou and Z. Yi, “Locality-Based Discriminant Neighborhood Embedding,” The Computer J., vol. 56, no. 9, 2013, pp. 1063–1082. 7. S. Yan et al., “Graph Embedding and Extensions: A General Framework for Dimensionality Reduction,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2007, vol. 29, no. 1, pp. 40–51. 8. M. Turk and A.P. Pentland, “Face Recognition Using Eigenfaces,” Proc. Computer Vision and Pattern Recognition (CVPR), 1991, pp. 586–591. 9. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, 1997, pp. 711–720. 10. T. Zhang et al., “Patch Alignment for Dimensionality Reduction,” IEEE Trans. Knowledge and Data Engineering, vol. 21, no. 9, 2009, pp. 1299–1313. 11. R. Wang et al., “Maximal Linear Embedding for Dimensionality Reduction,” IEEE Trans. Pattern Analysis and Machine Intelligence, 2011, vol. 33, no. 9, pp. 1776–1792.

which are labeled 1 to manifold 1, 2 to manifold 2, and so on. The second step is to construct a neighborhood graph with detailed information, while the third step is to obtain the projection matrix of each manifold through optimization. Distributing Samples to Different Manifolds As mentioned before, labeled samples can be easily allocated to different manifolds, because the samples belonging to the same class are more likely to reside on the same manifold. We can thus assign labeled samples based on the label information.

April–June 2016

ðx 2 <d Þ7!ðz 2 <d0 Þ, where d0 < d. In addition to dimensionality reduction, we aim to minimize the intramanifold variance and maximize the intermanifold separability in the embedded space, so that the manifold margins are maximized and can be separated by a simple classiﬁer. The proposed semisupervised framework is a three-step process, illustrated in Figure 1. Assuming the set of instances contains both labeled and unlabeled data, the ﬁrst step is to allocate the instances to the corresponding manifolds. The labeled instances are distributed into different manifolds based on the label information—in other words, we allocate the instances

References

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Semisupervised Methods Deng Cai and his colleagues extend linear discriminant analysis (LDA) to semisupervised discriminant analysis (SDA).1 Fadi Dornaika and Youssof El Traboulsi combine flexible graph-based manifold embedding with semisupervised discriminant embedding, extended to kernel version, for semisupervised dimension reduction.2 Yangqiu Song and his colleagues propose a unified framework for semisupervised dimensionality reduction,3 under which several classical methods, such as principle component analysis (PCA), LDA, locality preserving projections (LPP), and their corresponding kernel versions can be treated as special cases. Recently, Quanxue Gao and his colleagues presented a novel semisupervised method, named stable semisupervised discriminant learning (SSDL).4 By building a proper objective function, SSDL learns the intrinsic structure that

characterizes both the similarity and diversity of data and then incorporates this structure representation into LDA.

References 1. D. Cai, X. He, and J. Han, “Semi-Supervised Discriminant Analysis,” IEEE 11th Int’l Conf. Computer Vision (ICCV), 2007, pp. 1–7. 2. F. Dornaika and Y. El Traboulsi, “Learning Flexible GraphBased Semi-Supervised Embedding,” IEEE Trans. Cybernetics, vol. 46, no. 1, 2016, pp. 206–218. 3. Y. Song et al., “A Unified Framework for Semi-Supervised Dimensionality Reduction,” Pattern Recognition, vol. 41, no. 9, 2008, pp. 2789–2799. 4. Q. Gao et al., “A Novel Semi-Supervised Learning for Face Recognition,” Neurocomputing, Mar. 2015, pp. 69–76.

N Manifolds . . . Labeled

Manifold i Label i = 1, . . . , N information . .

Data set . . . Unlabeled

Sparse manifold clustering

Minimum manifold distance

N New manifolds .. . Manifold i Cluster i i = 1, . . . , N i = 1, . . . , N .. .

Graph construct

Wi , i = 1, . . . , N

Cluster i i = 1, . . . , N . . . N Clusters

Figure 1. The proposed semisupervised framework. It comprises a three-step process: allocating instances to the corresponding manifolds, constructing a neighborhood graph with detailed information, and obtaining the projection matrix of each manifold through optimization.

IEEE MultiMedia

For the unlabeled samples, we carry out the distribution task in two steps: adopt sparse manifold clustering to generate unlabeled sample clusters and merge them with the corresponding manifold with labeled instances. We propose a sparse manifold clustering method to assign the unlabeled samples to different clusters. The majority of traditional clustering methods first find the neighborhood based on distance and then assign the weights between pairwise points, but with our method, the introduced clustering method finds both the neighbors and weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold.

For each data point xi , consider the smallest ball that contains the k3 nearest neighbors of xi from the manifold M, and let the neighborhood Ni be the set of all the data points in the smallest ball excluding xi . In general, Ni contains points from M as well as other manifolds. We assume that there exists e 0, such that the nonzero coefﬁcients xij of the sparsest solution of X e; x ðx x Þ ij j i 2 j2Ni (1) X xij ¼ 1 subject to j2Ni

correspond to k3 neighbors of xi from M. In other words, the optimization problem has multiple solutions. We solve this problem

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

increases. To solve this problem, we relax it to the following weighted ‘1 -optimization program,

X2 X4

X8 X9

X10 X3

Figure 2. For x1 , the smallest neighborhood contains points from M1 , M2 , and M3 (red, blue, and black, respectively). However, only the samples from M1 span a one-dimensional subspace around x1 .

kXi xi k2 e; 1T xi ¼ 1;

(2)

where Xi ¼ ½x1 xi =kx1 xi k2 ; …; xN xi = 2 kxN xi k 2 <d Ni 1 , which can be deduced by normalizing the vector fxj xi gj6¼i. The optimization problem (Equation 2) could be solved efficiently using convex programming tools and is known to prefer a sparse solution.16 Ideally, the solution to Equation 2 corresponds to subspace-sparse representations of the data points, which we can use next to infer the clustering of the data. Then we can build a similarity graph by connecting data points with a sparse coefficient and apply a spectral clustering method to achieve data clustering. Through the sparse manifold clustering, we can distribute the unlabeled instances to different manifolds to effectively handle multiple manifolds; even when two manifolds are close to each other, we set up experiments on synthetic data to demonstrate the effectiveness. Constructing a Graph of Merged Samples Although the unlabeled samples are divided into different clusters, and the distribution of each cluster is likely to be restricted on a special manifold, merging the unlabeled cluster with the labeled cluster is still a difficult task. The problem can be tackled by calculating the manifold–manifold distance. We use an average of the reconstruction error7–9 as the manifold-tomanifold distance (excluding errors from certain works17,18) to find the nearest subspace–subspace distance to approximate the manifold– manifold distance. As a result, the minimum distance was selected, because the two clusters are more likely to come from the same class. To minimize the intramanifold variance and maximize the intermanifold separability in the embedded space, we built a weighted undirected graph G ¼ ðV; EÞ, where V denotes the set of vertexes that correspond to all the labeled samples, and E denotes the edges that connect pairwise samples with weights. To minimize the intramanifold variance, the samples of the same manifold should be compacted. Unlike traditional supervised methods, there are two kinds of similarity graphs considered—a must-link graph and a high-probability must-link graph. “Must-link” refers to the pairwise samples, both of which have label information and belong to the same class, while “high-probability

April–June 2016

under the assumption that the sparsest solutions are more likely to come from the same manifold (see Figure 2). There are three manifolds in Figure 2, M1 , M2 , and M3 , represented by red, blue, and black circles, respectively. x1 , x2 , and x3 come from manifold M1 ; x4 , x5 , and x6 come from the manifold M2 ; and x7 , x8 , x9 , and x10 come from manifold M3 . We also assume that the distance between x1 and x2 , and between x1 and x3 is farther than between x1 and the other points. One solution to Equation 1 is x2 , x3 , while x4 , x5 , x6 corresponds to another sparse solution. In addition, x7 , x8 , x9 , and x10 is also one of the optimal sparse solutions. Among all the possible solutions with two data points, x2 and x3 is the closest. Although there are other solutions with three points or more that are closer, such as x4 , x5 , and x6 , they require a linear combination of more than two data points, indicating they are not the sparsest solution for ‘0 norm.15 We adopt the assumption16 that the sparsest solution will come from the same manifold. Under this assumption, the pair of x2 and x3 is the optimal solution to this problem. Then, we can regard x2 and x3 as x1 ’s neighbors and connect them with edges to realize the goal of connecting each point to other points from the same manifold. If the neighborhood Ni is known and is relatively small, one can search for the minimum number of points that satisfy Equation 1. However, Ni is usually not known before, and searching for a few data points in Ni that satisfy Equation 1 becomes computationally challenging, because the size of the neighborhood

min kxi k1 ; subject to

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q

THE WORLD’S NEWSSTAND®

Xi6

Xi2

Xi1 Xi

Xi8

Xi3

Xi6 Bij

Xi4

Xi5

(a)

Xi7

Xi2

Aij

Xi1

Xi Xi8

Xi6

(b)

Xi9

Xi5

Xi3

Manifold margin

Cir

Xi4

Xi9

Xi3 Xi1 Xi2 Xi X i7 Xi8 Xi9

Xi7

(c)

Xi4

Xi5

(d)

(e)

Figure 3. This example shows how intramanifold and intermanifold neighbors work. (a) There are three labeled intramanifold neighbors ðxi1 ; xi2 ; xi3 Þ, three labeled intermanifold neighbors ðxi4 ; xi5 ; xi6 Þ, and three unlabeled neighbors ðxi7 ; xi8 ; xi9 Þ. The points with the same color and shape are from the same class. The blue triangle points denote labeled points from different classes and yellow rectangle points denote unlabeled points. (b) The intramanifold graph connects the intramanifold neighbors. (c) The intermanifold graph connects intermanifold neighbors. (d) This graph connects unlabeled neighbors. (e) Our goal—the manifold margin is maximized.

must-link” refers to the pairwise samples, at least one of which lacks the label information, although both came from the merged manifold, which has the same label as the other sample. To maximize the intermanifold distance, we simultaneously maximize the margins among the pairwise samples, which are must-not-link samples that label information but belong to different classes. Figure 3 presents an example to show how these intramanifold and intermanifold neighbors work. Obtaining the Projection Matrix Let W ¼ ½W1 ; W2 ; …; WN denote the projection matrix. For each sample xi in XL , we linearly parameterize zi ¼ Wc xi , where Wc 2 <d d0 is the projection matrix for the cth manifold and c ¼ f1; 2; ::: Ng. Thus, zi is a d0 -dimensional vector in the embedded space. To achieve the goal illustrated in Figure 3e, the points in Figure 3b and Figure 3d should be compacted as close as possible, while the points in Figure 3c should be separated as far as possible. Such an objective can be cast in the following optimization problem: W ¼ arg max J1 ðWxÞ lJ2 ðWxÞ gJ3 ðWxÞ;

J3 ðW1 ; W2 ; …; WN Þ ¼

IEEE MultiMedia

c¼1

i¼1 p¼1

kWTc xci

J2 ðW1 ; W2 ; …; WN Þ ¼

N X

l=N X k2 X

c¼1

i¼1 q¼1

i¼1 r¼1

( Acip

WTc xcr k2 Ccir

; ð6Þ

k1 c exp ð kxci xcp k2 =r2 Þ; if xcp 2 Nint er ðxi Þ

otherwise (7)

( Bciq

k2 ðxci Þ exp ð kxci xcq k2 =r2 Þ; if xcq 2 Nintra

0; (

(3) Cir ¼

J1 ðW1 ; W2 ; …; WN Þ ¼

c¼1

kWTc xci

otherwise (8)

where

l=N X k1 X

l=N X k3 X

with xcp representing the pth k1 -nearest labeled intermanifold neighbors of xci for the cth class; xcq representing the qth k2 -nearest labeled intermanifold neighbors of xci for the cth class; xcr representing the rth k3 -nearest unlabeled neighbors for the predict image set; and Acip , Bciq , and Ccir being affinity matrices to characterize the similarity between xi and xip , xi and xiq , and xi and xir , respectively. As we know, the affinity matrices Acip , Bciq , and c Cir are costs penalizing an embedded distance between two neighboring points, and they play an important role in keeping the intrinsic structure of the original data set in the embedded space. Explicitly, Acip ,Bciq , and Ccir are defined as

W1 ; W2 ; …; WN s:t: WTc Wc ¼ I

N X

WTc xcp k2 Acip

k3 exp ð kxci xcr k2 =r2 Þ; if xcr 2 Nintra ðxci Þ

otherwise (9)

ð4Þ

kWTc xci WTc xcq k2 Bciq ð5Þ

The main difference between Equations 8 and 9 is that the neighborhood samples are chosen from different clusters. Equation 8 selects from the must-link set, which refers to the pairwise samples that both have labels and belong to the same class, while Equation 9 selects from the

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

high-probability must-link set, which refers to the pairwise samples with at most one having label information. In particular, the unlabeled cluster merges with the labeled sample subset when the two datasets are a minimum distance apart and thus are likely to come from the same class. The objective function for each manifold denoted as J1 ðWc Þ, J2 ðWc Þ, J3 ðWc Þ, is as follows, respectively, where c ¼ 1; 2; ::: N: J1 ðWc Þ

X kWTc xci WTc xcip k2 Acip

ðHc1 lHc2 gHc3 ÞWc ¼ kWc :

(17)

Let fwc1 ; wc2 ; …; wcd g be the eigenvectors corresponding to the d0 largest eigenvalues kcj , ðj ¼ 1; …; d0 Þ, ordered in such a way that kc1 kc2 … kcd0 . The optimal feature dimension d0 is determined the same as in Jiwen Lu, Yap-Peng Tan, and Gang Wang’s work.8 Then, Wc ¼ ½wc1 ; wc2 ; …; wcd0 is the projection matrix. Figure 4 summarizes the main procedure of the algorithm.

l=N X k1 X ¼ trace WTc ðxci xcip Þ i¼1 p¼1

ðxci xcip ÞT Acip Wc

J2 ðWc Þ

¼ traceðWTc Hc1 Wc Þ ð10Þ X X ¼ kWTc xci WTc xciq k2 Bciq iq

l=N X k2 X ¼ trace WTc ðxci xciq Þ i¼1 q¼1

ðxci

xciq ÞT Bciq Wc

¼ traceðWTc Hc2 Wc Þ X kWTc xci WTc xcir k2 Ccir J3 ðWc Þ ¼

ð11Þ

¼ trace

WTc

l=N X k3 X

ðxci

xcir Þðxci

xcir ÞT Ccir Wc

i¼1 r¼1 ¼ traceðWTc Hc3 Wc Þ

where

ð12Þ

l=N X k X 1

Hc1

ðxci xcip Þðxci xcip ÞT Acip

i¼1

(13)

p¼1

l=N X k X 2

Hc2

ðxci xciq Þðxci xciq ÞT Bciq

i¼1 l=N

Hc3 ¼

(14)

q¼1 k3

ðxci xcir Þðxci xcir ÞT Ccir :

i¼1

(15)

r¼1

Then, the objective function (Equation 3) can be rewritten as N X Wc ¼ arg max J1 ðWc Þ lJ2 ðWc Þ gJ3 ðWc Þ W1 ; W2 ; …; WN c¼1 s:t: WTc Wc ¼ I

N X trace WTc ðHc1 lHc2 gHc3 ÞWc : c¼1

(16) Having derived Hc1 ,Hc2 , and Hc3 , the bases of Wc are obtained by solving the following eigenvalue equation:

Discussion Here, we show that the proposed framework can generalize various existing supervised, unsupervised, and semisupervised manifold learners. In the case of the supervised manifold learning method, all the data points are labeled, which means the objective function based on unlabeled examples is nonexistent. Consequently, the objective function of Equation 3 can be rewritten as

April–June 2016

¼ arg max

Classification We classify new samples after learning the intrinsic feature of each class’s manifold. Given that we have obtained multiple manifolds fX1 ; X2 ; …; Xc g and their associated mapping functions fW1 ; W2 ; …; Wc g, an intuitive way to classify a new testing sample is to measure its similarities (or distances). However, the conventional Euclidean distance is not directly available, because the testing data points reside on different manifolds of different dimensionalities. Therefore, we propose using a reconstruction-error-based criterion for classification on these multiple manifolds. By using such a measurement, after calculating the reconstruction error of the sample on each of the manifold, we assign the new sample to the manifold on which the minimal reconstruction error is reached. Take the k-th manifold as an example. The input data point x0 is first projected onto the manifold as y 0 ¼ WTk x0 . The set of its neighbors on the manifold, denoted by Nk, is then found using either E-neighborhood or k-nearest neighbors. Its reconstruction error on the kth expression manifold is finally calculated by Equation 18, presented in Figure 4. Once the reconstruction errors of the new sample on all the manifolds are obtained, we can determine the sample’s class label. Figure 5 illustrates the entire procedure.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Input x1, x2, …, xN, and value of the parameters k1, k2, k3 and μ, η, σ, iteration number T, and convergence error ε. Output projection matrix W1, W2, …, WN Step 1 (Initialization): 0 Wi ∈ R d × d 0 , i = 1, 2…N.

Step 2 (Similarity calculation): c

For each sample xi , construct three afﬁnity matrices Aip, Biq, Cir as shown in Equations 4, 5, and 6, respectively. Step 3 (Local optimization): For r = 1, 2… T, repeat c

3.1. Calculate H1, H2, H3, which are demonstrated in Equations 13, 14, and 15, respectively. c

3.2. Solve the eigenvalues [W1, W2, . . . , Wd0] and eigenvalues (λ1, λ2, . . . , λd0) of H1 – μH2 – ηH3 by generalized Eigen-decomposition. r

3.3. Update

xi = Wi (Wi )T xi

. c c c 3.4. For each update sample xi , recalculate H1, H2, H3, respectively. r

3.5. If r > 2 and Wi – Wir–1 < ε, go to step 4. End Step 4 (Output projection matrices): r

Output projection matrices Wi = Wi , i = 1, 2 . . . N. 2

errk (x0) =

min y0k – ∑ aikyik a ∈N k i

(18)

yi k ∈ Nk

s.t. ∑ aik = 1. i

Figure 4. The proposed algorithm. The algorithm is part of the proposed framework for obtaining multiple projection matrices for multimanifold learning.

Error 1 Manifold 1 Minimum error

Error 2

Manifold 2

Error N

IEEE MultiMedia

Testing set

Manifold 2 . . .

Manifold N Figure 5. For any sample in the testing set, we first find its k-nearest neighbors in Manifold 1, Manifold 2, …, Manifold N, then we calculate the reconstruction errors using Equation 4 for each manifold, separately. Assuming we obtain the minimum reconstruction error for Manifold 2, then the testing sample belongs to the same class as Manifold 2.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Trefoil-knots embedded in R100

1 0 –1 2

6 1

4 2

0 –1 –2

0 –2

(a)

3 2.5 2 1.5 1 0.5 0 –0.5 –1 –1.5 –2 –6 –5 –4 –3 –2 –1 0 (b)

3 2.5 2 1.5 1 0.5 0 –0.5 –1 –1.5 –2 –6 –5 –4 –3 –2 –1 0 (c)

Figure 6. We compared the proposed framework with the ‘1 -graph on two synthetic trefoil-knots: (a) the original manifolds in 3D, (b) clustering by the introduced method, and (c) clustering by the ‘1 -graph.

Wc ¼

N X

J1 ðWc Þ lJ2 ðWc Þ :

arg max W1 ; W2 ; …; WN s:t: WTc Wc ¼ I

c¼1

(19) Obviously, the objective function is the same with DMMA if the coefﬁcient l is equal to 1. It is thus straightforward to show that DMMA is a special case of the proposed framework. Furthermore, under the single manifold learning assumption, Equation 19 is further reduced to W ¼ arg max J1 ðWÞ J2 ðWÞ WT W

J1 ðWÞ ¼

kxi xj k Aij J2 ðWÞ

kxi xj k Bij ;

(20)

which is the objective function of locality-based discriminant neighborhood embedding.13 In the case of unsupervised manifold learning, all the samples are unlabeled, so the original objective function is reduced to Wi ¼ arg min kxi xj k2 Sij ;

(21)

W1 ;W2 :::Wc WT i Wi

Ratthachat Chatpatanasiri and Boonserm Kijsirikul present a comprehensive description of Equation 22 and conclude that many existing semisupervised methods can be included in the framework.19

Experiments We set up experiments on two groups of datasets, synthetic and real. The former attempts to verify the advantage of a graph construction based on sparse manifold clustering, while the latter demonstrates the superior performance of the proposed framework compared with state-of-the-art algorithms. Synthetic Data To evaluate the performance of a graph construction based on sparse manifold clustering, we considered a challenging case in which two manifolds were close to each other. As this method was used only in J3 , all the data points were unlabeled for the synthetic data. It has been demonstrated elsewhere20 that, compared with the graph constructed by k-nearest neighbors or e-ball methods, the ‘1 -graph has better performance. We compared the proposed framework with the ‘1 -graph on two synthetic trefoil-knots, shown in Figure 6a—the left knot is represented by dark blue, blue, and cyan, while the right knot is represented by green, yellow, and red, embedded in <100 and corrupted with small

April–June 2016

where Sij is deﬁned the same as Aij , when xi is one of the k-nearest neighbors of xj , and characterizes the similarity of two nearest points. Otherwise, Sij is equal to zero. Apparently, Equation 21 is the objective function of Xiao’s work,7 an extension of LPP. Consequently, LPP can be regarded a special case of the proposed framework under the single manifold learning assumption in the unsupervised framework. In the case of the semisupervised framework, Equation 3 is reduced to

W ¼ arg max J1 ðWÞ lJ2 ðWÞ gJ3 ðWÞ W ¼ arg min lJ2 ðWÞ J1 ðWÞ þ gJ3 ðWÞ W ð22Þ ¼ arg min J l ðWÞ þ J u ðWÞ :

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Gaussian white noise. After constructing the similarity graph based on sparse manifold and ‘1 -graph, the clustering results are shown in Figure 6b and Figure 6c, respectively. The aim of J3 is to find the unlabeled neighbor points from the same manifold. In other words, each cluster represents the corresponding manifold and has no points from other manifolds. It is observed that, in Figure 6b, all the data points are correctly classified, although the two manifolds are close to each other, demonstrating that the introduced sparse manifold clustering effectively groups together the points from the same manifold. However, in Figure 6c, only the points far away from each other can be correctly separated, resulting in low quality in finding the neighborhood points in the same manifold. The experimental results show that the introduced similarity graph construction performs well, especially when two classes correspond to two manifolds close to each other.

IEEE MultiMedia

Real Data for Face Recognition To verify the effectiveness of the proposed framework on real data, we studied two cases— face and handwritten digital recognition—in which semisupervised learning was natural and necessary. For face recognition, we chose two public and popular databases, the CMU Pose, Illumination, and Expression (PIE)21 database and the AR database (created by Aleix Martinez and Robert Benavente).22 The CMU PIE database contains 41,368 images of 68 individuals. Images of each person were taken under 43 different illumination conditions, and with 13 different poses and four different expressions. In the experiments, we selected images from the “pose 27” gallery, which includes 49 samples per person except for the 38th subject, which had 36 samples. For the training set, we randomly select 30 images per person and used the remaining images for testing. Moreover, each image was manually cropped to 64 64 pixels. The AR database displays more than 4,000 color face images from 126 people, including frontal views of faces with different facial expressions, lighting conditions, and occlusions. The original resolution of these images is 165 120. We used a subset of AR, where each person has 14 different images without occlusion. Ten images per person were randomly selected as the training set, and the other four images were used as the testing set. For computational convenience, all the images were resized to 32 32 pixels.

In each experiment, the first p images of each person in the training set were selected as the labeled images, and the remaining samples were regarded as the unlabeled images. In our semisupervised setting, the available training set contained both labeled and unlabeled examples, and the testing set was not available during the training phase. For fair comparison, all the samples in the training set, labeled and unlabeled, were included for the term of unsupervised setting, whereas only labeled images were used for supervised methods to find the projection vectors. We compared the proposed framework with the common unsupervised dimensional reduction method, including principal component analysis (PCA)23 and LPP; the general supervised dimensional reduction methods, including linear discriminant analysis (LDA),24 regularized LDA (RLDA), multiple maximum scatter difference (MMSD), and DMMA; and several state-of-the-art semisupervised methods, including semisupervised LDA (SSLDA),13 semisupervised maximum margin criterion (SSMMC),13 semisupervised discriminant analysis (SDA),11 and semisupervised discriminant learning (SSDL).14 l; g are set to 1 and 0.1; and k1 ; k2 ; k3 are set to 3, 5, 15, respectively, in the tests. To compare the performance under the same condition, k1 and k2 are also set to 3 and 5 in DMMA. We adopt k-NN classifier except for DMMA and the proposed framework. Because the margin between two manifolds can’t be calculated by Euclidean distance, we employ the reconstruction-based method described earlier as the classifier. We repeated all experiments 10 times. We first fixed the labeled number in each class p equal to 6 for the CMU PIE database and to 4 for the AR database, and then we tested the reduced features with different numbers of dimensionality. Figure 7 illustrates the results, showing that the proposed method outperforms other methods, not only because it exploits information of both labeled and unlabeled samples, but also because it uses multiple manifold assumption to distribute different classes to different manifolds, which is more reasonable and flexible than choosing different dimensions for different classes. Then, we fixed the reduced dimensionality to the optimal value and varied the labeled numbers in each class. We present the results in Tables 1 and 2, which clearly show the following:

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q

0.9

0.8

0.8 Accuracy

Accuracy

THE WORLD’S NEWSSTAND®

0.7

0.7 0.6

0.6 0.5 0.5 10 (a)

30 40 50 Dimension of features PCA LPP

10 20 30 40 50 60 70 80 90 100 110 Dimension of features

60 (b) MMSD DMMA

LDA RLDA

SDA SSLDA

SSMMC SSDL

Proposed

Figure 7. Experimental results: (a) Recognition accuracy of different methods versus dimension of features on the CMU-PIE database when p ¼ 6. (b) Recognition accuracy of different methods versus dimension of features on the AR database when p ¼ 4.

Table 1. The relationship between the number of labeled samples in each class and the recognition accuracy for the CMU PIE database. (Bold font indicates the best performance.) Number of labeled samples Method

PCA

0.5966

LPP

0.6470

LDA

0.6929

0.7033

0.7544

0.7882

0.7796

0.7929

RLDA

0.8316

0.8553

0.8662

0.8588

0.8669

0.8700

MMSD

0.8776

0.8824

0.8912

0.8993

0.9012

0.9113

DMMA

0.9112

0.9223

0.9332

0.9446

0.9538

SDA

0.6630

0.6799

0.7011

0.7122

0.7244

0.7396

SSLDA

0.6534

0.6877

0.7122

0.7233

0.7385

0.7399

SSMMC

0.6221

0.6339

0.6557

0.6776

0.6880

0.6889

SSDL

0.6831

0.6977

0.7233

0.7447

0.7496

0.7500

Proposed framework

0.9344

0.9441

0.9502

0.9537

0.9513

0.9542

In general, the proposed framework remarkably outperforms the other methods when the number of labeled samples is low. But the advantage is reduced with the increase of labeled samples, because the supervised method can extract more discriminant information compared to semisupervised methods.

There are no changes for unsupervised methods, because they adopt all the training data points and have no relationship with labeled samples.

Note that DMMA performs slightly better than the proposed framework when the number of labeled samples is larger than seven in each class for the AR database (Table 2). The reason is that there are only 14 images in each class in this database. When the number of labeled samples gets closer to the number of total samples, more label information is presented. As a result, supervised methods start gaining the upper hand compared with semisupervised methods. However, when there is a lack of labeled samples (< 8), which is the focus of this study, the proposed method outperformed all the supervised methods.

April–June 2016

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 2. The relationship between number of labeled samples in each class and recognition accuracy for the AR database. (Bold font indicates the best performance.) Number of labeled samples 4

PCA

0.6750

LPP

0.7234

LDA

0.8014

0.8217

0.8237

0.8344

0.8566

0.8587

0.8611

RLDA

0.9396

0.9412

0.9423

0.9436

0.9422

0.9513

0.9566

MMSD

0.9679

0.9598

0.9700

0.9732

0.9749

0.9800

0.9786

DMMA

0.9712

0.9734

0.9769

0.9788

0.9832

0.9844

0.9837

SDA

0.7336

0.7654

0.7732

0.7816

0.7899

0.7884

0.7913

SSLDA

0.7408

0.7758

0.7839

0.7912

0.8033

0.8122

0.8117

SSMMC

0.7123

0.7552

0.7544

0.7623

0.7749

0.7812

0.7998

SSDL

0.7413

0.7842

0.7936

0.8055

0.8122

0.8337

0.8345

Proposed framework

0.9799

0.9812

0.9813

0.9817

0.9822

0.9816

0.9823

0.9

0.8

0.9

0.7

0.8

0.6

0.7

Accuracy

Method

0.5 0.4

0.6 0.5

0.3

0.4

0.2

0.3

0.1 (a)

3 4 Dimension of features PCA LPP

LDA RLDA

0.2

1 1.5

(b) MMSD DMMA

SDA SSLDA

2.5 3 3.5 4 4.5 Dimension of features

SSMMC SSDL

5.5

Proposed

Figure 8. Results for real data for handwritten digit recognition: (a) Recognition accuracy of different methods versus dimension of features on the USPS database when p ¼ 50. (b) Recognition accuracy of different methods versus dimension of features on the AR database when p ¼ 100.

IEEE MultiMedia

For multimanifold learning, the number of projection matrices is the same as the number of classes. For the AR database, there are 120 classes and thus 120 projection matrices. The original dimension is 1,024, and the reduced dimensionalities for the 120 classes range from 20 to 560. So, the proposed framework provides high recognition accuracy, as shown in Table 2 and Figure 7, with signiﬁcant dimensionality reduction. Similar results are observed for CMPPIE (Table 1), where there are 68 classes and 68 projection matrices.

Real Data for Handwritten Digit Recognition We performed additional experiments on handwritten digit recognition on two well-known public databases: the US Postal Service (USPS) and MNIST (from the National Institute of Standards and Technology) databases. The USPS database contains grayscale handwritten digit images scanned from envelopes. All the samples are of size 16 16. The original training set contains 7,291 images and the testing set contains 2,007 images. We selected a subset of six digits—0, 1, 2, 3, 4, and 5—in our experiments,

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 3. The relationship between number of labeled samples in each class and recognition accuracy for the USPS database. (Bold font indicates the best performance.) Number of labeled samples Method

100

150

200

250

300

350

400

450

500

PCA

0.5717

LPP

0.6800

LDA

0.2377

0.4011

0.4533

0.4772

0.4788

0.4812

0.4844

0.4815

0.4830

0.4833

RLDA

0.4122

0.4587

0.4899

0.5883

0.5991

0.6077

0.6122

0.6178

0.6211

0.6230

MMSD

0.4876

0.5327

0.6019

0.6328

0.6430

0.6527

0.6632

0.6722

0.6876

0.6933

DMMA

0.5322

0.6011

0.6578

0.7233

0.7566

0.7788

0.7963

0.8122

0.8244

0.8369

SDA

0.6900

0.6955

0.6970

0.7000

0.7083

0.7111

0.7156

0.7200

0.7302

0.7399

SSLDA

0.7112

0.7223

0.7334

0.7355

0.7337

0.7446

0.7559

0.7568

0.7669

0.7770

SSMMC

0.6755

0.6804

0.6999

0.7011

0.7000

0.7066

0.7112

0.7223

0.7227

0.7266

SSDL

0.7299

0.7387

0.7447

0.7556

0.7669

0.7680

0.7790

0.7799

0.7833

0.7899

Proposed framework

0.8134

0.8299

0.8330

0.8447

0.8512

0.8539

0.8610

0.8677

0.8764

0.8890

Table 4. The relationship between number of labeled samples in each class and recognition accuracy for the MNIST database. (Bold font indicates the best performance.) Number of labeled samples Method

100

200

300

400

500

600

700

800

900

1,000

0.7508

LPP

0.8196

LDA

0.7446

0.7662

0.7833

0.7729

0.7812

0.7944

0.8254

0.8366

0.8427

0.8527

RLDA

0.7842

0.8055

0.8233

0.8441

0.8557

0.8660

0.8711

0.8729

0.8788

0.8813

MMSD

0.8221

0.8344

0.8579

0.8667

0.8834

0.8992

0.9013

0.9066

0.9112

0.9200

DMMA

0.8556

0.8611

0.8677

0.8764

0.8912

0.9013

0.9122

0.9234

0.9335

0.9542

SDA

0.8334

0.8441

0.8339

0.8550

0.8566

0.8611

0.8678

0.8721

0.8832

0.8799

PCA

SSLDA

0.8442

0.8533

0.8547

0.8612

0.8633

0.8678

0.8754

0.8862

0.8913

0.8996

SSMMC

0.8016

0.8117

0.8224

0.8227

0.8368

0.8413

0.8556

0.8677

0.8754

0.8713

SSDL

0.8532

0.8598

0.8671

0.8732

0.8864

0.8859

0.8912

0.8937

0.8998

0.9012

Proposed framework

0.9122

0.9233

0.9344

0.9357

0.9368

0.9412

0.9422

0.9456

0.9503

0.9522

The parameters setting described earlier was again used here, and the experiments were divided into two parts. First, the labeled numbers were ﬁxed to 50 for USPS and 100 for MNIST. We attempted to use the experiments to explore the relationship between dimensionality and recognition accuracy. Figure 8 shows the results. First, note that the proposed framework again performed better on both databases. Furthermore, note that, compared with face recognition, LDA had no advantage over PCA, because the number of labeled samples in each class is relatively small—only 5 percent of the training set. For the handwritten database, it is

April–June 2016

and each class contained 1,100 images. We then randomly selected 1,000 samples from each class as a subset for the training set with p samples from each class as the labeled ones. The remaining images were used as the testing set. The MNIST has a training set of 60,000 example digits and a test set of 10,000 examples. All the digits in the dataset have been sizenormalized and centered to 28 28 gray-level images. We also selected six classes—0, 1, 2, 3, 4, and 5—and each class had 2,000 samples in the experiments. For each class, 1,600 samples were randomly selected with p images of each class as the labeled ones; the other 400 samples formed the testing set.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

difﬁcult for LDA to exploit the discriminative information. At the same time, the semisupervised methods showed apparent advantages, demonstrating that such methods provide a good solution when the label information is severely lacking. Tables 3 and 4 show the results of gradually increasing the number of labeled samples to exploit the relationship between the number of the labeled samples and recognition accuracy. Similar to the results in Tables 1 and 2, Tables 3 and 4 further demonstrates the effectiveness of the proposed framework, especially when the label information is in critical shortage. For the supervised methods, when only 50 labeled samples are available, it is far from sufﬁcient for extracting enough discriminant information for supervised methods. In the proposed framework, except for 50 labeled samples, 950 unlabeled samples also play an important role, as their labels are being predicted in the process. Therefore, these unlabeled samples can be regarded as labeled to some degree, owing to the relatively accuracy of the prediction. With the increase of labeled training samples, the performance of supervised methods is remarkably improved.

he proposed framework can ﬂexibly choose the manifold and perform optimal dimensionality reduction, because data in one class lies on a single corresponding manifold. In addition, the introduced graph construction method performs well when dealing with multiple manifolds, even if two manifolds are close to each other. However, the framework currently can’t represent methods with cost functions that are nonlinear with respect to distances among the samples. In the future, we plan to study extension of these algorithms to handle nonlinear semisupervised learning problems. MM

Acknowledgments

IEEE MultiMedia

This special issue is a collaboration between the 2015 IEEE International Symposium on Multimedia (ISM 2015) and IEEE MultiMedia. This article is an extended version of “A Novel Semi-Supervised Dimensionality Reduction Framework for Multimanifold Learning,” presented at ISM 2015. This work is partially supported by the National Natural Science Foundation of China (NSFC, no. 61071211), the State Key Program of NSFC (no. 61331201), the Key International Collaboration Program

of NSFC (no. 61210005), and the Canada Research Chair Program.

References 1. S.T. Roweis and L.K. Saul, “Nonlinear Dimensionality Reduction by Locally Linear Embedding,” J. Science, vol. 290, no. 5500, 2000, pp. 2323–2326. 2. J.B. Tenenbaum, V. De Silva, and J.C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” J. Science, vol. 290, no. 5500, 2000, pp. 2319–2323. 3. M. Belkin and P. Niyogi, “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering,” Proc. Advances in Neural Information Processing Systems 14 (NIPS), 2001, pp. 585–591. 4. X. He et al., “Face Recognition Using Laplacianfaces,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, 2005, pp. 328–340. 5. W. Zhang et al., “Discriminant Neighborhood Embedding for Classification,” Pattern Recognition, vol. 39, no. 11, 2006, pp. 2240–2243. 6. J. Gou and Z. Yi, “Locality-Based Discriminant Neighborhood Embedding,” The Computer J., vol. 56, no. 9, 2013, pp. 1063–1082. 7. R. Xiao et al., “Facial Expression Recognition on Multiple Manifolds,” Pattern Recognition, vol. 44, no. 1, 2011, pp. 107–116. 8. J. Lu, Y.-P. Tan, and G. Wang, “Discriminative Multimanifold Analysis for Face Recognition from a Single Training Sample Per Person,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 1. 2013, pp. 39–51. 9. H. Yan et al., “Multi-Feature Multi-Manifold Learning for Single-Sample Face Recognition,” Neurocomputing, Nov. 2014, pp. 134–143. 10. P. Zhang et al., “Sparse Discriminative Multi-Manifold Embedding for One-Sample Face Identification,” Pattern Recognition, Apr. 2016, pp. 249–259. 11. D. Cai, X. He; and J. Han, “Semi-Supervised Discriminant Analysis,” IEEE 11th Int’l Conf. Computer Vision (ICCV), 2007, pp. 1–7. 12. F. Dornaika and Y. El Traboulsi, “Learning Flexible Graph-Based Semi-Supervised Embedding,” IEEE Trans. Cybernetics, vol. 46, no. 1, 2016, pp. 208–218. 13. Y. Song et al., “A Unified Framework for SemiSupervised Dimensionality Reduction,” Pattern Recognition, vol. 41, no. 9, 2008, pp. 2789–2799. 14. Q. Gao et al., “A Novel Semi-Supervised Learning for Face Recognition,” Neurocomputing, Mar. 2015, pp. 69–76. 15. E. Amaldi and V. Kann. “On the Approximability of Minimizing Nonzero Variables or Unsatisfied

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Relations in Linear Systems,” Theoretical Computer Science, vol. 209, no. 1, 1998, pp. 237–260. 16. E. Elhamifar and R. Vidal, “Sparse Manifold Clustering and Embedding,” Proc. Advances in Neural Information Processing Systems, 2011, pp. 55–63. 17. R. Wang and X. Chen, “Manifold Discriminant Analysis,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009, pp. 429–436. 18. R. Wang et al., “Manifold-Manifold Distance with Application to Face Recognition Based on Image Set,” IEEE Conf. Computer Vision and Pattern Recognition (CFPR), 2008, pp. 1–8. 19. R. Chatpatanasiri and B. Kijsirikul, “A Unified Semi-

Beijing Institute of Technology. Contact him at ielqi@zzu.edu.cn. __________ Ling Guan is a professor and a Tier I Canada Research Chair in the Department of Electrical and Computer Engineering at Ryerson University, Toronto, Canada. His research interests include multimedia signal processing, human-centered computing, and machine leaning. Guan received a PhD in electrical engineering from the University of British Columbia, Vancouver, Canada. He is a Fellow of the Engineering Institute of Canada and also a Fellow of IEEE. Contact him at _____________ lguan@ee.ryerson.ca.

Supervised Dimensionality Reduction Framework for Manifold Learning,” Neurocomputing, vol. 73, no. 10, 2010, pp. 1631–1640. 20. S. Yan and H. Wang, “Semi-Supervised Learning by Sparse Representation,” Proc. SIAM Int’l Conf. Data Mining (SDM), 2009, pp. 792–801.

_______________ ________

21. T. Sim, S. Baker, and M. Bsat, “The CMU Pose, Illumination, and Expression (PIE) Database,” Proc. Fifth IEEE Int’l Conf. Automatic Face and Gesture Recognition, 2002, pp. 46–51. 22. A.M. Martinez, The AR Face Database, tech. report #24, Computer Version Center, 1998. 23. M. Turk and A.P. Pentland, “Face Recognition Using Eigenfaces,” Proc. Computer Vision and Pattern Recognition (CVPR), 1991, pp. 586–591. 24. P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 7, 1997, pp. 711–720.

________________ ________________

Xin Guo is currently working toward her PhD at Zhengzhou University, China. Her research interests include multimedia signal processing, pattern recognition, and image processing. Guo received an MS in communication and information system from

________________ ______________

Zhengzhou University. Contact her at ________ guoxin1988 0806@163.com. __________ Yun Tie is a professor in the School of Information ________________

and Engineering at Zhengzhou University. His research interests include multimedia systems design, digital image processing, and pattern recog-

______________

nition. Tie received his PhD in electrical engineering from Ryerson University, Toronto, Canada. Contact him at ___________ ytie@ee.ryerson.ca. Lin Qi is a professor and the vice dean of the School of Information Engineering at Zhengzhou Univer-

________________ _______________ ________________

sity. His research interests include digital image processing and pattern recognition. Qi received a PhD in communication and information systems from the

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Ubiquitous Multimedia

Multimodal Ensemble Fusion for Disambiguation and Retrieval Yang Peng, Xiaofeng Zhou, Daisy Zhe Wang, Ishan Patwa, and Dihong Gong University of Florida Chunsheng Victor Fang Pivotal Inc.

fusion is more flexible in terms of feature representations and learning algorithms for different modalities. Ensemble fusion is also more scalable in terms of modalities.1 Here, we study multimodal fusion from a deeper perspective and present our ensemble fusion model, designed for disambiguation and retrieval. In particular, we demonstrate the effectiveness of our model through word sense disambiguation (WSD) and information retrieval (IR) tasks. We employ several existing algorithms and models for WSD and IR in our ensemble fusion model, including the unsupervised Yarowsky algorithm2 for text disambiguation and the inverted indexing algorithm for indexing and searching (see http://lucene.apache.org/ solr). Current multimodal fusion approaches for ___ disambiguation and retrieval mostly focus on early fusion, developing a unified representation model for multiple modalities and then employing existing learning methods on the unified representation.3–7 Yet most related work doesn’t explain why the multimodal fusion approaches work, and few projects have leveraged ensemble fusion for disambiguation and IR tasks. We propose a multimodal ensemble fusion model to combine the results of text- and imageonly processing (disambiguation or retrieval) to achieve better quality. Our model is designed to capture the complementary relation and correlative relation between images and text. Different ensemble approaches, including linear rule fusion, maximum rule fusion, and logistic regression, are used to combine the results from methods using single-modality data.

Related Work The proposed multimodal ensemble fusion model captures complementary and correlative relations between two modalities. Experimental results show that it outperforms approaches using only a single modality for word sense disambiguation and information retrieval.

iven the abundance of multimedia data on the Internet, researchers in the multimedia analysis community1 are starting to develop multimodal machine learning models to integrate data of multiple modalities— including text, images, audio, and videos—for multimedia analysis tasks, such as event detection. There are two major fusion schemes:1 early fusion and late fusion. The former, which fuses information at the feature level, is the most widely used strategy. The latter, also known as ensemble fusion, fuses multiple modalities in the semantic space at the decision level. Early fusion can use the correlation between multiple features from different modalities at an early stage, while ensemble

1070-986X/16/$33.00 c 2016 IEEE

Here, we offer a brief overview of WSD, IR, and multimodal fusion, and discuss existing efforts in each area. Word Sense Disambiguation Words in natural languages tend to have multiple meanings or senses—for example, the word crane might refer to a type of bird or a type of machine. WSD addresses the problem of determining which word sense is used in a sentence. It was ﬁrst formulated as a distinct computational task during the early days of machine translation in the 1940s, making it one of the oldest problems in computational linguistics. Different kinds of methods have been introduced to solve WSD,1,8 including supervised approaches, unsupervised approaches, and knowledge-based approaches. Most of the existing approaches exploit only

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

textual information; limited research efforts have been conducted on multimodal data for WSD.6,7 For supervised approaches, many supervised statistical algorithms have been employed for WSD,8,9 including decision list, decision tree, naive Bayesian, neural networks, and support vector machine (SVM) algorithms. However, it is unrealistic to manually label a very large collection of textual data, which is the major limitation of the supervised approaches. Unsupervised approaches,8,9 on the other hand, do not require a large labeled dataset, which enables them to overcome the knowledge acquisition bottleneck—that is, the lack of large data collections with manual annotations. However, unsupervised approaches have a major disadvantage in that they don’t exploit any knowledge inventory or dictionary of realworld senses. Knowledge-based methods, which use knowledge resources (such as dictionaries and ontologies), provide a better tradeoff between disambiguation accuracy and computational costs than supervised and unsupervised methods.

Multimodal Fusion As already noted, the two major fusion schemes are early fusion, which fuses information at the feature level, and ensemble fusion, which fuses multiple modalities in the semantic space at the decision level.1 In the machine learning community, with deep learning gaining popularity in recent years, there have been efforts to exploit deep learning for multimodal learning.4,5 Jiquan Ngiam and colleagues proposed the bimodal deep Boltzmann machine and the bimodal deep auto-encoder to fuse features of multiple modalities for multimodal fusion and cross-modal learning.4 Nitish Srivastava and Ruslan Salakhutdinov also employed the deep Boltzmann machine to fuse features of images and text.5 For WSD, there have been several research projects that have used images and text to improve disambiguation accuracy.6,7 Wesley May and colleagues combined the image space and text space directly and applied a modiﬁed version of Yarowsky algorithm2 to the combined space to solve WSD.6 But this naive combination of two spaces might not capture the deep or complex correlations between the image space and text space, which might lead to poor accuracy. Kate Saenko and colleagues assumed the features of one modality are independent of sense, given the other modality. They then used Latent Dirichlet allocation to model the probability distributions of senses, given images and text separately, combining these two distributions using a sum rule.7 Although the linear rule in our model and the sum rule in Saenko’s work might look similar, the ideologies and motivations behind them are quite different. The goal of the sum rule is to model the joint probability distribution of senses given both images and text under the

April–June 2016

Information Retrieval IR obtains information relevant to a query from a collection of documents (usually textual documents), and it involves many research topics, including document representation models, similarity metrics, indexing, relevance feedback, and reranking. The bag-of-words model is commonly used to represent textual documents in IR and natural language processing. In this model, a textual document or sentence is represented as a bag or a set of its words in an orderless and grammar-free manner. The frequency vectors or occurrence vectors of words are treated as features in this model. Image retrieval is the search for desired images from an image dataset according to queries from users.10 Content-based image retrieval (CBIR), which emerged in 1990s, is a special case in which the queries are images, and the search process is based on the visual content of images rather than on textual captions or image labels. Image retrieval borrows many existing algorithms and technologies from IR. For CBIR, the most popular approach uses the bag-of-visual-words model11 with local features, such as scale-invariant feature transform (SIFT)12 features for representing images. Similar to the bag-of-words model, the bag-of-visual-words model is designed to represent images as a frequency or as occurrence vec-

tors of “visual words.” The extracted local features from images are quantized into histograms of visual words, which can be used to further represent each image. Visual words are generated ofﬂine by clustering from local features of images.11 Thus, IR techniques can be easily borrowed and applied to the CBIR task, and the model has been proven effective and efﬁcient.13,14 In our model, we use one of the most important indexing algorithms, the inverted indexing algorithm, to index and search images and text. For each word, the inverted index stores the list of documents in which the word appears. Inverted indexing can provide a fast full-text document search, which is why it has been widely applied in the document IR community.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

(a)

(b) Figure 1. Examples selected from the University of Illinois at Urbana–Champaign Image Sense Discrimination (UIUC-ISD) dataset (http://vision. cs.uiuc.edu/isd) for the keyword “bass” __________

with two

different meanings: (a) “fish of Florida: Rock Sea Bass” and (b) “L.A. Kidwell musical instruments—product (bass 006).”

IEEE MultiMedia

independence assumption, while our goal of the linear rule approach is to capture the complementary relationship between images and text in the ensemble fusion framework, where text processing and image processing are conducted ﬁrst and then the linear rule is used to combine the results to achieve higher quality. For IR, Nikhil Rasiwasia and colleagues proposed several state-of-the-art approaches to achieve cross-modal IR.3 The ﬁrst approach was correlation matching, which aimed to map the different feature spaces for images and text to the same feature space based on correlation analysis of these two spaces. The second approach was semantic matching, which represented images and text with the same semantic concepts using multiclass logistic regression. Yi Wu and colleagues proposed the super kernel fusion method to combine multimodal features optimally for image categorization.15 Qiang Zhu and colleagues preprocessed embedded text in images to get weighted distance and combined the distance with visual cues to further classify images.16 Eric Bruno and colleagues proposed the preference-

based representation to completely abstract multimedia content for efficient processing.17 Shikui Wei and colleagues proposed a cross-referencebased fusion strategy for video search, which used an ensemble fusion technique that hierarchically combined ranked results from different modalities.18 Their strategy can be viewed as a special discrete case of the linear rule in our model. Fusion techniques have been used in other research areas. For example, Longzhi Yang and colleagues have proposed risk analysis approaches for chemical toxicity assessment on multiple limited and uncertain data sources.19,20 However, their approaches are not directly applicable to our applications, because our work focuses on fusing information from deterministic multimodal data. Furthermore, our ensemble fusion model can capture the complementary relation between text and images, which has been ignored in most previous work related to multimodal disambiguation and retrieval. Such previous research, mostly focused on using an early fusion scheme to develop unified representation models from text and images, uses classification techniques on top of the unified representation models to solve different tasks.

Why Multimodal Fusion Works Here, we explain in detail the correlative and complementary relations among multiple modalities. To simplify the scenario, we discuss only two modalities—images and text. We also explain how our ensemble fusion model can capture the complementary relation between images and text to achieve higher quality than single-modality approaches. We will mostly use examples from WSD to explain the concepts, though the correlative and complementary relations extend to many other applications. Correlative Relation The correlative relation between text and images means images and textual sentences of the same documents tend to contain semantic information describing the same objects or concepts. For example, the image and textual sentence in Figure 1a both refer to the “bass” as a ﬁsh, while the image and sentence in Figure 1b both describe “bass” as an instrument. Because images and text for the same documents have this semantic correlative relation, they tend to be correlated in the feature space as well. Thus it is possible to conduct correlation analysis on textual and visual features to

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

construct a unified feature space to represent multimodal documents. Previous research exploits the correlative relation to develop a unified representation model for multimodal documents,3–5 although most of them did not identify this relation explicitly. Because a “semantic gap” exists between semantic concepts and image features,10 the performance of these approaches using the correlative relation is highly dependent on the image features, textual features, and the correlative analysis methods, as well as on the nature of the data (for example, whether the correlative relation exists in the majority of the documents inside the dataset). In the ensemble fusion scheme, images and text also display certain correlations at the decision level. For example, some images and textual sentences are classified as the same senses correctly in the experiments for WSD. But the ensemble fusion scheme obviously can’t exploit the correlation of images and text at the feature level.

(b) Figure 2. Examples selected from UIUC-ISD dataset (http://vision.cs.uiuc.edu/isd) for the sense “bass (fish)”: (a) “portfolio 2”and (b) “lake fork fishing guide, bass guide—guarantee bass catch.”

documents. On the other hand, the image disambiguation using SVM classiﬁcation has lower precision but higher recall, because it can disambiguate all the unseen documents but with lower conﬁdence. Text retrieval and image retrieval share a similar complementary relationship. After using inverted indexing to index textual data, the text retrieval has high precision but low recall due to the sparse representation of short textual sentences. However, image retrieval has high recall but low precision, due to its dense and noisy representation of images. This observation motivated us to propose our ensemble fusion model to combine the results of text and image processing. Complementary and correlative relations can both be leveraged in multimodal processing tasks, such as WSD, to achieve high accuracy. They usually co-exist inside the same datasets, even though they are often presented in different documents. These two relations reveal the potential of using multimodal fusion to achieve higher quality than single-modality approaches, because multimodal data can either provide additional information or emphasize the same semantic information.

April–June 2016

Complementary Relation Images and text are complementary to each other, because they contain different semantic information. For example, in the WSD case, textual sentences contain more useful and relevant information for disambiguation in some documents, while images contain more useful information in other documents. For example, in Figure 2a, the sentence “portfolio 2” contains little information to disambiguate senses for “bass,” while the image depicts the “bass fish” object. In Figure 2b, the image is rather complex and shows a lot of foreground objects, including a person, a boat, a fish, the lake, and trees, while the textual sentence contains cues which can be directly used to disambiguate, such as “fishing,” “lake” and “catch.” Image processing (disambiguation or retrieval) and text processing (disambiguation or retrieval) are also complementary. For some documents, text processing generates correct results, while for others, image processing generates correct results. The reasons are twofold: first, the semantic information in images and text are complementary to each other; second, text processing usually has high precision but low recall, while image processing has low precision but high recall. In WSD, the Yarowsky algorithm we use to disambiguate textual sentences offers high confidence in its disambiguation results, but it frequently fails to disambiguate many unseen

(a)

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

score f

Probabilistic ensemble fusion (linear rule, max rule, logistic regression) score t

Text processing (Yarowsky, Bag-of-Words)

score i

Image processing (SVM, Bag-of-Visual-Words)

Largemouth bass ﬁshing tips

Figure 3. The ensemble fusion model. In our model, text and image processing are conducted on text and images separately, and a fusion algorithm then combines the results.

Algorithm In our ensemble fusion model, text and image processing are conducted on text and images separately, and a fusion algorithm then combines the results. For disambiguation, the results from text disambiguation and image disambiguation are senses with conﬁdence scores. For retrieval, the results from text retrieval and image retrieval are similarity scores between documents and queries.

IEEE MultiMedia

Ensemble Fusion Model In the ensemble fusion model, images and text are first processed separately to provide decision-level results. Then, the results are combined using different approaches—including the linear rule, the maximum rule, and logistic regression classification—to generate the final results. Let’s use score to denote the results from text and image processing. For disambiguation, score refers to the confidence scores ðc1 ; c2 ; :::; cn ÞT of senses ðs1 ; s2 ; :::; sn ÞT . For retrieval, score refers to the similarity score of a document to the query document. The process of our ensemble fusion model is shown in Figure 3. Let’s simplify the scenario for WSD: say, for one keyword w with two senses s1 and s2 , and a document d with one image i and a textual sentence t, the image classifier generates ðs1 ; ci1 Þ and ðs2 ; ci2 Þ, and the text classifier generates ðs1 ; ct1 Þ and ðs2 ; ct2 Þ, where ci1 ; ci2 ; ct1 and ct2 denote the confidence scores of senses s1 and s2 , generated by image disambiguation and text disambiguation, respectively. Confidence scores are normalized into a [0, 1] interval. The sense with

the higher confidence score between s1 and s2 is used as the disambiguated sense annotation for the word w. Let’s also formulate the retrieval problem: say for a document d with one image i and a textual sentence t in the data collection, the image retrieval generates similarity score scorei and the text retrieval returns similarity score scoret . Our ensemble model is simple but powerful. The experimental results, presented later, demonstrate its effectiveness. In addition, the model can be viewed as a general framework for multimodal fusion that lets users come up with new fusion approaches to combine the results from text and image processing, or create new text- and image-processing methods. It also can be expanded to more modalities, such as audio and video. Ensemble Approaches We proposed rule- and classification-based approaches to combine the results from image and text processing. There are two rule-based approaches: linear rule fusion and maximum rule fusion. Logistic regression is employed as a classification-based fusion approach in our model. Linear rule Linear rule fusion uses a weight k to combine the scores from image and text processing. For disambiguation, the fused confidence scores for s1 and s2 are cf 1 ¼ k ci1 þ ð1 kÞ ct1 ;

(1)

cf 2 ¼ k ci2 þ ð1 kÞ ct2 ; and

(2)

k ¼ Accuracyi = ðAccuracyi þ Accuracyt Þ:

(3)

k is calculated by dividing the accuracy of image disambiguation by the sum of accuracy of text and image disambiguation on the validation datasets. For retrieval, the fused similarity score for d is scoref ¼ k scorei þ ð1 kÞ scoret k ¼ APi =ðAPi þ APt Þ:

(4) (5)

k is calculated by dividing the AP (average precision) of image retrieval by the sum of the AP of text and image retrieval on the training queries. Maximum rule The maximum rule selects the highest conﬁdence score or similarity score from text and image processing. For disambiguation, the maximum rule chooses the sense s with the

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

highest confidence score c from ðs1 ; ci1 Þ; ðs2 ; ci2 Þ; ðs1 ; ct1 Þ; and ðs2 ; ct2 Þ. For example, with ðs1 ; 0:45Þ and ðs2 ; 0:55Þ from image classification and ðs1 ; 0:91Þ and ðs2 ; 0:09Þ from text classification, we choose s1 as the output sense for the document d according to the maximum rule, because the text classification outputs the highest confidence score 0.91 for sense s1 . For retrieval, the maximum rule simply chooses the larger score from scorei and scoret as the final score scoref . Logistic regression For logistic regression, confidence scores and similarity scores are used as features to train the logistic regression classifier. For disambiguation, confidence scores from two modalities, ci1 ; ci2 ; ct1 ; and ct2 , are used to train the logistic regression classifier on the validation datasets. For retrieval, similarity scores, returned by training queries, are used to train the logistic regression classifier to determine if a document is relevant or similar to the query. Then, the logistic regression classifier is used to classify the documents to get the final results. The confidence scores of the logistic regression are used as the final confidence scores for WSD or the final similarity scores for IR. Logistic regression is chosen for its nonlinear transformation of the confidence scores or similarity scores compared to rule-based approaches. Applications Here, we present individual approaches and implementations of the ensemble fusion model for disambiguation and retrieval.

Retrieval In our implementation, Solr (http:// ____ lucene.apache.org/solr), a Web server based on ________________ Lucene, is deployed to handle indexing and searching for textual data and image data with inverted indexing and term frequency-inverse document frequency (tf-idf) weighting (a numerical weight often used in IR and text mining to evaluate how important a word is to a document in a corpus). The textual sentences are represented using the bag-of-words model and the images are represented using the bagof-visual-words model.11 Both text and images are represented as vectors, which makes it straightforward for Solr to index and search them. The cosine similarity with tf-idf weighting on the word vectors or visual word vectors is used as the similarity metric for documents (images and textual sentences). The cosine similarity scores between documents and a query document are used to rank the documents by Solr. Given a query document, the sentence and image are transformed into their proper representation vectors and then searched by Solr separately. Solr returns the ranked lists of documents with similarity scores for both text and image retrieval. For text retrieval, the bagof-words model is used to represent textual sentences. For image retrieval, LIRE, a Java library for image processing, is used to extract SIFT features. The bag-of-visual-words model is

April–June 2016

Disambiguation For text disambiguation, the iterative Yarowsky algorithm2 starts with a small set of seed rules to disambiguate senses and a large untagged corpus. In each iteration, the algorithm first applies known rules to untagged samples and learns a set of new rules from newly tagged samples. This process is repeated until all training samples are tagged, and the learned rules are arranged in descending order of confidence scores, which are determined by the numbers of samples supporting the rules. When given an unseen testing sample, the algorithm returns the first rule matching the testing sample in the ordered list and the confidence score of the matched rule. For image disambiguation, we use SIFT12 to extract local features and the bag-of-visualwords model11 to represent images. Then, an

SVM classifier is trained on the bag-of-visualwords vectors to classify images. The SVM model is a supervised classification model, the goal of which is to construct a set of hyperplanes in the high-dimensional feature space with the intention of maximizing the margins between different hyperplanes.21 Both the image and text disambiguation generate sense annotations, along with confidence scores for testing samples. For text disambiguation, we wrote the Yarowsky algorithm2 implementation in Cþþ and implemented the pseudo probability distribution over the Yarowsky classifier using Python. For image disambiguation, we used OpenCV to extract SIFT features from images, the K-Means from Python scikit-learn to generate visual words, and the multiclass SVM implementation from Python scikit-learn to disambiguate images. The ensemble fusion model uses the logistic regression implementation with L2 regularization from Python scikit-learn.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 1. The accuracy of image-only disambiguation, text-only disambiguation, linear-rule fusion, maximum-rule fusion, and logistic regression fusion on the UIUC-ISD dataset for WSD. (Bold font indicates the best results.) Image

Text

Linear rule

Maximum rule

Bass

0.565

0.365

0.871

Crane

0.642

0.333

0.800

0.808

0.775

Squash

0.754

0.188

0.768

0.754

implemented in Java using distributed k-means algorithms. Solr provides indexing and searching for both images and text. We use the logistic regression implementation with ridge regularization from Weka for fusion.

Experiments We ran experiments on the University of Illinois at Urbana–Champaign Image Sense Discrimination (UIUC-ISD) dataset (http://vision. cs.uiuc.edu/isd) __________ and the Google-MM dataset to test the performance of the three fusion approaches used in our ensemble fusion model.

IEEE MultiMedia

Datasets We used the multimodal UIUC-ISD dataset to test the accuracy of the text-only disambiguation (Yarowsky algorithm), the image-only disambiguation (the SVM classifier), and the three fusion approaches in our ensemble fusion model for WSD. There are three keywords (“bass,” “crane,” and “squash”) in the dataset. For each keyword, we selected two core senses. There are 1,691 documents for “bass,” 1,194 documents for “crane,” and 673 documents for “squash.” We constructed training, validation, and testing datasets for each keyword. We used the training dataset to train the image and text classifiers. We used the validation dataset to train the logistic regression classifier and selected the linear weight k based on the accuracy of the image disambiguation and text disambiguation on the validation dataset. We used the testing dataset to evaluate the fusion algorithms and to demonstrate that, by using multimodal fusion, we could get higher disambiguation accuracy compared to the accuracy achieved by methods using a single modality. We used the Google-MM dataset to evaluate the retrieval quality of image-only retrieval, text-only retrieval, and the three fusion approaches in our ensemble fusion model for IR. We crawled 2,209 multimodal documents

Logistic regression

using Google Images with 20 object categories (including airplane, cat, and dog) and 14 landmarks (including Big Ben, Eiffel Tower, and the Taj Mahal). Each document comprised one title and one image. For each category or landmark, we prepared one query for training and one query for testing, with each query containing a few keywords and one image. For each training or testing query, the ground truth results were provided for retrieval quality evaluation. Results The results for WSD and IR demonstrated that these three fusion approaches achieve higher quality than the text-only and image-only methods. Word sense disambiguation Table 1 presents the experimental results for WSD on the UIUCISD dataset. From the table, the accuracy of the three fusion methods is much higher than the image-only and text-only methods on “bass” and “crane.” For “bass,” the ensemble approaches improved the accuracy from 0.565 to 0.871. For “crane,” the maximum rule approach improved the accuracy from 0.642 to 0.808. For “squash,” because the accuracy of text-only disambiguation is low (0.188), we could not get much additional information from the text-only disambiguation. Therefore, the accuracy of the three fusion approaches for “squash” is quite similar to the image-only classiﬁcation. Information retrieval Table 2 presents the experimental results for IR on the Google-MM dataset. The retrieval quality is measured by the mean average precision (MAP) of all 34 testing queries. As Table 2 shows, all the three fusion approaches achieve higher MAP than imageonly and text-only retrieval. While image-only retrieval achieved 0.125 MAP and text-only retrieval 0.761 MAP, linear-rule fusion achieved 0.802 MAP, maximum-rule fusion achieved

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 2. Retrieval quality (mean average precision) of image-only retrieval, text-only retrieval, early fusion, linear-rule fusion, maximum-rule fusion, and logistic regression on the Google-MM dataset for IR. (Bold font indicates the best result.) Image

Text

Early fusion

Linear rule

Maximum rule

Logistic regression

0.125

0.761

0.187

0.802

0.788

0.798

0.788 MAP, and logistic regression reached 0.798 MAP. For naive early fusion—where we combine text words and visual words directly, as introduced elsewhere6—MAP was 0.187, slightly higher than image-only MAP. The reasons for the signiﬁcantly lower image-only MAP are as follows:

1.2 1 Image Text

0.6

Linear-Rule

Most of the searched images have noisy backgrounds or incomplete coverage of the object.

0.4

Max-Rule

We use only the bag-of-visual-words model and cosine distance to calculate the similarity score, without complex techniques, because our focus is on the ensemble fusion part.

MAP

0.8

Log-Reg

0.2 Early-Fusion

Naive early fusion had low MAP for two reasons:

The images and textual sentences usually were not quite correlated.

The image feature space had more dimensions than the text feature space, so the text features didn’t have a strong impact on the retrieval results.

7 9 11 13 15 17 19 21 23 25 27 29 31 33 Query

Figure 4. Information retrieval: per-query detailed result. The three fusion models had similar performance results, while naive early fusion of text and visual words had low mean average precision (MAP) for all 34 queries.

Analysis Here, we discuss how our ensemble fusion model captures the correlative and complementary relations between images and text to achieve higher quality compared to singlemodality approaches. We also compare the differences between early fusion and ensemble fusion models. Correlation As noted earlier, images and text display a certain level of correlation at the decision level. For WSD, if image processing and text processing generate the same sense annotation for one document, linear-rule, maximumrule, and logistic-regression fusion will usually generate the same sense annotation as image processing and text processing for this document, according to our experimental results. For IR, if image retrieval and text retrieval generate high similarity scores for one document, then linear-rule, maximum-rule, and logisticregression fusion would generate high similarity scores for this document as well, according to our experimental results. Thus, although our ensemble fusion model can’t capture the correlation between images and text at the feature

April–June 2016

Figure 4 shows the detailed per-query image result for IR. We can see that the three fusion models had similar performance results, while naive early fusion of text and visual words had low MAP for all 34 queries. By combining the results of image-only and text-only processing under an ensemble fusion framework, we can achieve higher performance results compared to methods using only a single modality. In cases where image processing and text processing are reliable to some extent, such as “bass” and “crane” in Table 2, the fusion model can improve the performance to a great extent. Even in cases where one of the single-modality methods has very poor performance, the fusion model can still generate results as good as or even slightly better than the best results from any single-modality processing methods, such as “squash” in Table 1.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 3. The coverage, average precision, and average recall of different approaches on WSD for the keyword “bass.” Coverage refers to the percentage of the documents that each approach can effectively disambiguate. (Bold font indicates the best results.) Image

Text

Linear rule

Maximum rule

Logistic regression

Coverage

1.000

0.376

1.000

Average precision

0.522

0.857

0.862

0.859

Average recall

0.636

0.297

0.884

0.893

level, it can capture the correlation at the decision level.

IEEE MultiMedia

Complementation Although our ensemble fusion model can capture the correlation between images and text at the decision level, this is not the main reason that we can improve quality, because, in this case, our model just generates consistent results as image-only and text-only processing. However, the ability to capture the complementary relation between image-only and text-only processing helps our model generate better results than image-only and text-only processing. The average precision and average recall of image-only processing, text-only processing, and the three ensemble fusion approaches on WSD for keyword “bass” are shown in Table 3, illustrating the complementary relation between image and text processing. As Table 3 shows, text processing usually has high precision but low recall. For example, the Yarowsky classifier works well when the testing sentences contain patterns that have been discovered in training datasets. It can generate high confidence scores for the correct senses in most cases—for example, ðs1 ; 1:0Þ and ðs2 ; 0:0Þ or ðs1 ; 0:95Þ and ðs2 ; 0:05Þ, with s1 usually being the correct sense. However, for the sentences that do not contain known patterns, the Yarowsky classifier would fail to disambiguate between two senses and output ðs1 ; 0:0Þ and ðs2 ; 0:0Þ. Similar to text disambiguation, text retrieval also has high precision and low recall, because inverted indexing works well for textual sentences that contain query keywords. For those sentences that do not contain query keywords, text retrieval can’t return them as relevant results, thus causing recall to drop. On the other hand, image processing has high recall but low precision (see Table 3). For disambiguation, the image SVM classification can disambiguate all images, but it is less accurate due to the noisy image data and image rep-

resentation. Image disambiguation thus generates less conﬁdent results—for example, ðs1 ; 0:55Þ and ðs2 ; 0:45Þ or ðs1 ; 0:60Þ and ðs2 ; 0:40Þ, with s1 possibly being a wrong label. Image retrieval also generates lower similarity scores for documents than text retrieval because of the noisy representation of images. Also, because each image might contain hundreds or thousands of local features, image representation is more dense, so image retrieval has better recall than text retrieval. Consequently, for documents in which the text processing works, the results of the three fusion approaches in our ensemble fusion model would be consistent with text processing, because the text processing outputs results in very high conﬁdence scores or similarity scores. For other documents, in which the text processing fails, the results of the three approaches in the ensemble fusion model would be consistent with image processing, because text processing returns no useful information for these documents. Therefore, our ensemble fusion model can increase both precision and recall by taking advantage of both text processing and image processing while avoiding their drawbacks. Early vs. ensemble fusion. Early fusion can capture the correlative relation between images and text at the feature level, while ensemble fusion can capture the complementary relation at the decision level. Whether we should use early fusion or ensemble fusion depends on the nature of the multimodal datasets. In our multimodal datasets, the images and textual sentences are mostly complementary, which corroborates the fact that our ensemble fusion model can achieve better quality than image-only and text-only approaches. On the other hand, the correlative relation between images and text is not commonly found in the documents, which explains why the naive early fusion fails to improve retrieval quality. Because the early fusion approaches use correlation

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

analysis methods to fuse features from different modalities—which aim to maximize the correlation effect between images and text in the combined feature space—they are not expected to achieve very good results on the datasets we used.

8. R. Navigli, “Word Sense Disambiguation: A Survey,” ACM Computing Surveys, vol. 41, no. 2, 2009, article no. 10. 9. E. Agirre and P. Edmonds, Word Sense Disambiguation, Algorithms and Applications, Springer 2007. 10. R. Datta et al., “Image Retrieval: Ideas, Influences, and Trends of the New Age,” ACM Computing Sur-

he next steps for us are to employ sophisticated algorithms to capture both the correlation and complementation among multiple modalities, prepare large-scale multimodal datasets, and improve the performance of fusion models on various tasks, including WSD and IR. MM

Acknowledgments This special issue is a collaboration between the 2015 IEEE International Symposium on Multimedia (ISM 2015) and IEEE MultiMedia. This article is an extended version of “Probabilistic Ensemble Fusion for Multimodal Word Sense Disambiguation,” presented at ISM 2015. This work was partially supported by DARPA under FA8750-12-2-0348 and a generous gift from Pivotal. We also thank Yang Chen for his suggestions and discussions.

References

veys (CSUR), vol. 40, no. 2, 2008, article no. 5. 11. J. Sivic and A. Zisserman, “Video Google: A Text Retrieval Approach to Object Matching in Videos,” Proc. Ninth IEEE Int’l Conf. Computer Vision (ICCV), 2003, pp. 1470–1477. 12. D.G. Lowe and G. David, “Object Recognition from Local Scale Invariant Features,” Proc. Seventh IEEE Int’l Conf. Computer Vision (ICCV), 1999, pp. 1150–1157. 13. J. Yang et al., “Evaluating Bag-of-Visual-Words Representations in Scene Classification,” Proc. Int’l Workshop on Multimedia Information Retrieval (MIR), 2007, pp. 197–206. 14. J. Philbin et al., “Object Retrieval with Large Vocabularies and Fast Spatial Matching,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007; doi: 10.1109/CVPR.2007.383172. 15. Y. Wu et al., “Optimal Multimodal Fusion for Multimedia Data Analysis,” Proc. 12th Ann. ACM Int’l Conf. Multimedia, 2004, pp. 572–579. 16. Q. Zhu, M.C. Yeh, and K.T. Cheng, “Multimodal Fusion Using Learned Text Concepts for Image Categorization,” Proc. ACM Int’l Conf. Multimedia, 2006, pp. 211–220.

1. P.K. Atrey et al., “Multimodal Fusion for Multimedia Analysis: A Survey,” Multimedia Systems, vol. 16, no. 6, 2010, pp. 345–379. 2. D. Yarowsky, “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods,” Proc. 33rd Ann. Meeting Assoc. for Computational Linguistics (ACL), 1995, pp. 189–196. 3. N. Rasiwasia et al., “A New Approach to CrossModal Multimedia Retrieval,” Proc. 18th ACM Int’l Conf. Multimedia, 2010, pp. 251–260. 4. J. Ngiam et al., “Multimodal Deep Learning,” Proc. Int’l Conf. Machine Learning (ICML), 2011, pp. 689–696. 5. N. Srivastava and R. Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines,” J. Machine Learning Research, vol. 15, 2014, pp. 2949–2980.

dia Information Retrieval,” Proc. Int’l Workshop on Multimedia Information Retrieval (MIR), 2007, pp. 71–78. 18. S. Wei et al., “Multimodal Fusion for Video Search Reranking,” IEEE Trans. Knowledge and Data Engineering, vol. 22, no. 8, 2010, pp. 1191–1199. 19. L. Yang and D. Neagu, “Toxicity Risk Assessment from Heterogeneous Uncertain Data with Possibility-Probability Distribution,” IEEE Int’l Conf. Fuzzy Systems (FUZZ), 2013; doi: 10.1109/FUZZIEEE.2013.6622304. 20. L. Yang et al., “Towards a Fuzzy Expert System on Toxicological Data Quality Assessment,” Molecular Informatics, vol. 32, no. 1, 2013, pp. 65–78. 21. C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning, vol. 20, no. 3, 1995,

April–June 2016

6. W. May et al., “Unsupervised Disambiguation of Image Captions,” Proc. Int’l Workshop on Semantic

17. E. Bruno, J. Kludas, and S. Marchand-Maillet, “Combining Multimodal Preferences for Multime-

pp. 273–297.

Evaluation (SemEval), 2012, pp. 85–89. 7. K. Saenko and T. Darrell, “Filtering Abstract Senses from Image Search Results,” Proc. Advances in Neu-

Yang Peng is a PhD student in the Department of

ral Information Processing Systems 22 (NIPS), 2009,

Computer and Information Science and Engineering (CISE) at the University of Florida. His research inter-

pp. 1589–1597.

ests include data science, big data, and multimodal

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

machine learning. Peng received his BS in computer science from Nanjing University. Contact him at

Patwa received his MS in computer engineering from the University of Florida. Contact him at ipawa@ _____

ypeng@cise.ufl.edu. ____________

cise.ufl.edu. _______

Xiaofeng Zhou is a PhD student in the Department

Dihong Gong is a PhD student in the Department of

of Computer and Information Science and Engineer-

Computer and Information Science and Engineering

ing at the University of Florida. His research interests include data science, big data, and probabilistic

at the University of Florida. His research interests include computer vision and machine learning.

knowledge bases. Zhou received his BS in computer

Gong received his BE in electrical engineering from

science from the University of Science and Technology of China. Contact him at ______________ xiaofeng@cise.ufl.edu.

the University of Science and Technology of China. Contact him at dhong@cise.ufl.edu. ____________

Daisy Zhe Wang is an assistant professor in the

Chunsheng Victor Fang is the lead data scientist at

Department of Computer and Information Science and Engineering at the University of Florida, where

Awake Networks. This work was done when he was affiliated with Pivotal Inc. His research interests

she is the director of the Data Science Research Lab.

include innovating artificial intelligence in data sci-

Her research interests include probabilistic knowledge bases, large-scale inference engines, query-

ence and helping large enterprises revolutionize IT data analytics solutions for cybersecurity. Fang

driven interactive machine learning, and crowd-

received his PhD in computer science from the Uni-

assisted machine learning. Wang received her PhD from the EECS Department at the University of Cali-

versity of Cincinnati. Contact him at vicfcs@ _____ gmail.com.

fornia, Berkeley. Contact her at _____________ daisyw@cise.ufl.edu. Ishan Patwa is a software engineer at Microsoft. This work was done when he was a master’s student in the Department of Computer and Information Science and Engineering at the University of Florida.

_______________ _________

__________ _____________ _________

_________

IEEE MultiMedia

___________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

_________________ _______________

_______ ____________________

___________________________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Ubiquitous Multimedia

Planogram Compliance Checking Based on Detection of Recurring Patterns Song Liu, Wanqing Li, Stephen Davis, Christian Ritz, and Hongda Tian University of Wollongong

Rather than relying on product template images for training, this novel method for automatic planogram compliance checking in retail chains extracts the product layout from an input image using unsupervised recurring pattern detection.

o launch promotions and facilitate the best customer experience in retail chains, planograms regulate how products should be placed on shelves. These planograms are usually created by a company’s headquarters and distributed to its chain stores so that store managers can place products on shelves accordingly. The company’s headquarters often likes to verify whether each chain store has properly followed the planograms, and this veriﬁcation process is referred to as planogram compliance checking. Figure 1 shows an example of a compliant product layout. Conventional planogram compliance checking is conducted visually and manually, which is laborious and prone to human error. As a result, many retail chains have begun seeking ways to automate this process.

1070-986X/16/$33.00 c 2016 IEEE

Technologies based on computer vision have been explored for automatic planogram compliance checking (see the sidebar). In particular, the problem has been considered as a typical object detection problem in which products are detected and localized by matching input images of shelves against given template images of the products. Compliance checking is then performed by comparing the detected product layout with the pre-speciﬁed planogram. However, this approach requires up-to-date product template images, which are often unavailable. In addition, it’s subject to the quality of images, lighting conditions, viewpoints of images, and image pattern variations due to seasonal promotions that vendors regularly carry out. For promotional purposes, multiple instances of a product are usually displayed consistently on a shelf. These multiple instances within the input image form similar yet nonidentical visual objects and can be referred to as a recurring visual pattern or recurring pattern.1 By detecting the recurring pattern, the product instances that form the pattern can be localized. The layout of the shelf can then be estimated by detecting all the recurring patterns and locating all the instances in each pattern. Next, the estimated layout can be compared with the expected product layout speciﬁed by a planogram to measure the level of compliance. Because detection of the recurring patterns doesn’t require template images for training, the compliance checking doesn’t require any template images of the products on the shelf. This novel method for automatic planogram compliance checking, which doesn’t require product template images, mainly consists of

estimating the product layout from the results of recurring pattern detection;

comparing the estimated product layout with a planogram using spectral graph matching (for compliance checking);2 and

using a divide-and-conquer strategy to improve the speed.

This article extends earlier work, presented at the International Symposium on Multimedia, to include additional justiﬁcation of the proposed algorithm, a comparison with a template-based method, and a reﬁnement of the compliance checking using the product images automatically extracted from input images through recurring pattern detection.

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Related Work in Compliance Checking and Pattern Detection Our method realizes automatic compliance checking with the help of recurring patterns detection. Therefore, we briefly review some representative methods for both of them.

Automatic Planogram Compliance Checking Conventional automatic methods for planogram compliance checking involve extracting product layout information based on well-established object detection and recognition algorithms, which usually require template images as training samples. Adrien Auclair and his colleagues presented a product detection system from input images by matching with existing templates using scaleinvariant feature transform (SIFT) vectors.1 Other researchers proposed a real-time online product detection tool using speeded up robust features (SURF) and optical flow; it also depends on high-quality training data.2 Another study focused on product logo detection using spatial pyramid mining.3 A recent method carried out by Gul Varol and Ridvan S. Kuzu4 used a cascade object detection framework and support vector machine (SVM) to detect and recognize cigarette packages on shelves; this, too, requires template images for training. Despite such progress in estimating product layout information using object detection and recognition, most methods require either strong or weak supervision for object modeling. Although some unsupervised approaches based on latent topic models have been proposed,5,6 they still require images for learning.

Recurring Pattern Detection Multiple instances or objects of the same product on a shelf share a similar visual appearance. In particular, objects that share similar groups of visual words can be formulated as

Proposed Method

Planogram XML Parser and Region Partition A planogram is created by company headquarters to indicate how and where speciﬁc products should be placed on shelves. Therefore, the

pairwise visual word matching, which matches pairs of visual words across all objects;12 pairwise visual object matching, which matches feature point correspondences between a pair of objects;8,9 and pairwise visual word-object matching, which matches visual words and objects simultaneously.13

Some research has explored unsupervised detection/ segmentation of two objects in two images.10,14 Junsong Yuan and Ying Wu7 detected object pairs from a single image or an image pair using spatial random partitioning. Minsu Cho and his colleagues achieved the same goal by solving a correspondence association problem via Markov chain Monte Carlo (MCMC) exploration.11 As for pairwise object matching-based methods for detecting multiple recurring patterns, Hairong Liu and Shuicheng Yan8 employed graph matching to detect recurring patterns between two images. Agglomerative clustering15 and MCMC association9 were adopted by Cho and his colleagues to deal with multiple object matching. Jizhou Gao and his colleagues used a pairwise visual word matching approach to detect recurring patterns.12 Jingchen Liu and Yanxi Liu13 discovered recurring patterns from one image by optimizing a pairwise visual word-object joint

layout information stored in a planogram can be regarded as the expected layout of the corresponding input image. Moreover, such layout information can be used to divide the input image into regions corresponding to different types of products. In the proposed method, an input planogram is stored in XML format and must be parsed to retrieve related formation for region partition and compliance checking. A parsed planogram for a particular shelf contains the following information:

April–June 2016

In our proposed method, an input image is ﬁrst partitioned into regions based on the information parsed from a planogram. Repeated products are detected in each region and then merged together to estimate product layout. Finally, the estimated product layout is compared against the expected product layout speciﬁed in the planogram for compliance checking. Figure 2 shows the block diagram of the proposed method, which we now describe in more detail.

recurring patterns. By detecting those recurring patterns using an unsupervised object-level matching method, product layout could be extracted without requiring template images of the products on the shelf. In the literature, the process of detecting recurring patterns is referred to variously as common visual pattern discovery,7,8 co-recognition/segmentation of objects,9–11 and high-order structural semantics learning.12 There are three typical approaches for recurring pattern detection:

number of rows in the shelf, number of products in each row, and each product type.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

assignment problem using greedy randomized adaptive search procedure (GRASP).16 Both visual words and objects are considered in pairwise visual word-object joint assignment, which could yield results with higher detection accuracy than methods in which only visual word matching or object matching is performed. As a result, we adopted pairwise visual word-object matching13 for recurring pattern detection in the proposed method. However, solving such a joint assignment problem is computational expensive, especially in detecting recurring patterns with many visual objects. To improve the speed, a divide-and-conquer strategy is proposed to partition the image into regions to control the number of visual objects in each region.

7. J. Yuan and Y. Wu, “Spatial Random Partition for Common Visual Pattern Discovery,” Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2007, pp. 1–8. 8. H. Liu and S. Yan, “Common Visual Pattern Discovery via Spatially Coherent Correspondences,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1609–1616. 9. M. Cho, Y.M. Shin, and K.M. Lee, “Unsupervised Detection and Segmentation of Identical Objects,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1617–1624. 10. C. Rother et al., “Cosegmentation of Image Pairs by Histogram Matching—Incorporating a Global Constraint into MRFs,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2006, pp. 993–1000.

References

11. M. Cho, Y.M. Shin, and K.M. Lee, “Co-recognition of Image

1. A. Auclair, L.D. Cohen, and N. Vincent, “How to Use SIFT Vectors to Analyze an Image with Database Templates,” Adaptive Multimedia Retrieval: Retrieval, User, and Semantics, LNCS 4918, 2008, pp. 224–236. 2. T. Winlock, E. Christiansen, and S. Belongie, “Toward RealTime Grocery Detection for the Visually Impaired,” Computer Vision and Pattern Recognition Workshops (CVPRW), 2010, pp. 49–56. 3. J. Kleban, X. Xie, and W.-Y. Ma, “Spatial Pyramid Mining for Logo Detection in Natural Scenes,” Proc. Int’l Conf. Multimedia and Expo, 2008, pp. 1077–1080. 4. G. Varol and R.S. Kuzu, “Toward Retail Product Recognition on Grocery Shelves,” Proc. Int’l Conf. Graphics and Image Processing (SPIE), 2015; doi:10.1117/12.2179127. 5. J. Sivic et al., “Discovering Object Categories in Image Collections,” Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2005, pp. 370–377. 6. L. Karlinsky et al., “Unsupervised Classification and Part Localization by Consistency Amplification,” Proc. 10th European Conf. Computer Vision, LNCS 5305, 2008, pp. 321–335.

Pairs by Data-Driven Monte Carlo Image Exploration,” Proc. 10th European Conference on Computer Vision, LNCS 5305, 2008, pp. 144–157. 12. J. Gao et al., “Unsupervised Learning of High-Order Structural Semantics from Images,” Proc. Int’l Conf. Computer Vision, 2009, pp. 2122–2129. 13. J. Liu and Y. Liu, “Grasp Recurring Patterns from a Single View,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2003–2010. 14. A. Toshev, J. Shi, and K. Daniilidis, “Image Matching via Saliency Region Correspondences,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8. 15. M. Cho, J. Lee, and J. Lee, “Feature Correspondence and Deformable Object Matching via Agglomerative Correspondence Clustering,” Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2009, pp. 1280–1287. 16. T.A. Feo and M.G. Resende, “Greedy Randomized Adaptive Search Procedures,” J. Global Optimization, vol. 6, no. 2, 1995, pp. 109–133.

IEEE MultiMedia

Figure 2c shows a planogram in which the shelf, every row in the shelf, and every product could be represented as 2D boxes. Due to the shape of a regular retail shelf, a rectangle (the gray box in Figure 2c) is used to represent the whole shelf. Then, the shelf is vertically divided into several identical rows (yellow boxes in Figure 2c) according to the number of rows in the shelf. Finally, each row is horizontally divided into boxes according to the number of products that are placed in that row. As a result, products can be represented by boxes in each row. Given the estimated boxes for each product, product position and layout are described using a set of 2D points. A 2D coordinate system is created by considering the top left corner of the

shelf box as the origin (0, 0) and bottom-right corner as (1, 1). The expected product layout then can be represented by all the center points of product boxes in this coordinate system, which are denoted as PointSet planogram (Figure 2d): PointSet planogram ¼ fP1 ; …; PM g and Pi ¼fpi1 ; pi2 ; …; pim g; where Pi is the set of points that correspond to the ith type of product speciﬁed in the planogram (that is, the points in the same color in Figure 2d). pii0 is a point with the 2D coordinate, ðxpii0 ; ypii0 Þ, where xpii0 ; ypii0 2 ½0; 1 . To estimate regions, all product boxes are projected onto the input image. For each type

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Be Natural nut bars

Be Natural four

Be Natural Be Natural Be Natural brail bars berry brail bars honey brail bars fruit

Be Natural cashew, almond, hazelnut and coconut

Be Natural pink lady apple and ﬂame raisin

Be Natural manuka honey, spice clusters and ﬂakes

Be Natural 5 whole grain ﬂakes

Figure 1. An example of planogram compliance. (a) A planogram specified by the company headquarters. (b) A store shelf with a product layout that complies with the planogram.

Region partition

Recurring pattern detection

Recurring pattern merging

Compliance checking (0, 0)

(0, 0)

(1, 0)

Image

(f)

(g) (j) (0, 1)

(e)

(a)

(h)

(1, 1)

Compliance result

(i)

Planogram XML parser (0, 0)

Planogram <planogram> <shelf> ......

(1, 1)

</shelf> </planogram>

(b)

(c)

(d)

(k)

Figure 2. The block diagram of the proposed method for planogram compliance checking. (a) The input image. (b) The corresponding planogram in XML. (c) The parsed planogram shown in 2D boxes. (d) 2D points representing the expected product layout ðPointSet planogram Þ. (e) The product boxes and an estimated region projected on the input image. (f) A region for one type of product. (g) A detected recurring pattern shown in visual words and objects. (h) A detected recurring pattern shown in circular regions. (i) A bounding box with merged recurring patterns. (j) 2D points representing the detected product layout ðPointSet detected Þ. (k) Searching for the optimal matches using a graph matching and greedy algorithm.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

of product, the regions covered by the product boxes of that type are grouped into a rectangular search region. The search region is then extended by a margin to allow for differences in the product locations on the shelf compared to the planogram speciﬁcation. An example of this area is the red box in Figure 2e. Considering the speed of recurring pattern detection and the known minimum size of products listed in the planogram that are to be displayed, every region will be limited to include no more than 25 product instances. If there are more than 25 instances of the same type of product within a region, the region for this product type will be further divided to meet the criteria above. The choice of the maximum number of product instances in a search region is to balance the time spent on the recurring pattern detection and merging of the detected patterns. Here, we empirically chose that number to be 25.

IEEE MultiMedia

Recurring Pattern Detection and Merging The recurring pattern detection aims to estimate a product layout by finding and locating near-identical visual objects from a single input image. After region partition, the original input image is divided into several regions—at least one for each type of product. Therefore, recurring patterns are detected in each region and later merged together for layout estimation. We adopted an existing method that realizes recurring pattern discovery by solving a simultaneous visual word-object assignment problem.1 In the method, a recurring pattern is defined as a 2D feature-assignment matrix in which each row corresponds to a visual word and each column corresponds to a visual object. Detected feature points are populated in the matrix to ensure feature points in each row/visual word (that is, points of the same color in Figure 2g) share strong appearance similarity, while layouts formed by points in every column/visual object (that is, the connected black lines in Figure 2g) share strong geometric similarity. An energy function is defined to achieve both appearance and geometric consistencies. This joint assignment problem is NP-hard and is thus optimized by a greedy randomized adaptive search procedure (GRASP) with matrix operations, called local moves, which are specifically designed for convergence purposes. In these local moves, operations are performed to expand and maintain the 2D feature-assignment matrix, and these operations include add-

ing or deleting a row or column and modifying entries. In each iteration of GRASP, local moves are applied stochastically to explore a variety of local optima. Each optimized assignment matrix can be regarded as a detected recurring pattern. As a result, for each input region, a set of multiple candidate recurring patterns could be detected. Circular regions are calculated to represent a recurring pattern in which each detected visual object is covered by a circle (Figure 2h). Therefore, we can write detected recurring patterns as follows: CandidatePatterns ¼ fPattern1 ; …; Patternn g and Patterni ¼ fðxi1 ; yi1 ; ri1 Þ; …; ðxin ; yin ; rin Þg:

In a recurring pattern Patterni , each circular region is represented by a center with a 2D coordinate and a radius. The center is calculated as the mean position of the visual object’s feature points. The radius is the mean value of width and height of the bounding box covering all feature points assigned to the visual object. Assuming Patterns and Patternt are detected from different regions that actually belong to the same product type, whether Patterns and Patternt will be merged depends on the coverage of their circular regions. If a circular region from Patterns overlaps with another circular region from Patternt , these two regions will be combined into one. Recent improvements have made the existing method capable of extracting image patches for all detected visual objects from the input image.1 These patches can serve as product images to reﬁne compliance checking results (we present more details on this later). Compliance Checking To match with the expected product layout PointSet planogram from the planogram, another group of point sets representing the detected product layout PointSet detected (Figure 2j) must be constructed by processing the detected recurring patterns. First, all the visual object centers of each pattern are regarded as 2D points. Then, a minimum bounding box is calculated to cover all points. A 2D coordinate system is created by considering the top left corner of this bounding box as the origin (0, 0) and the bottom right corner as the coordinate (1, 1). 2D coordinates are calculated for all the detected visual object centers in such a coordinate system (Figure 2i). The object center points from the jth recurring

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

pattern in this 2D coordinate system are denoted as point set Rj . PointSet detected can be constructed as follows: PointSet detected ¼ fR1 ; …; RN g and Rj ¼ frj1 ; rj2 ; …; rjn g; where rjj is a point with the 2D coordinate ðxrjj0 ; yrjj0 Þ; xrjj0 ; yrjj0 2 ½0; 1 ;and jPointSet detected j ¼ N is the number of detected products. The height and width of the bounding box are denoted as Heightbox and Widthbox , respectively. Match graphs. Given the point set P from PointSet planogram , containing m points, and R from PointSet detected , containing n points, a matching between the points in P and those in R is performed by solving a spectral graph matching problem.2 The afﬁnity matrix U for graph matching is created as follows by considering geometric layout relations between any pair of assignments ða; bÞ, where a ¼ ðpi ; ri ; Þ and b ¼ ðpj ; rj ; Þ; pi ; pj ; 僆 p and ri0 ; rj00 ; 2 R. ! D2dh D2dv Uða; bÞ ¼ exp ; ddh ddv Ddh ¼ max ðdis hðpi ; ri0 ; Þ; dis hðpj ; rj0 ; ÞÞ; and Ddv ¼ max ðdis vðpi ; ri0 ; Þ; dis vðpj ; rj0 ; ÞÞ; whereddh and ddv are weight parameters. In the experiments,ddh ¼ðWidthbox =ðHeightbox þWidthbox ÞÞ2 and ddv ¼ðHeightbox =ðHeightbox þ Widthbox ÞÞ2 , and dis hðÞ and dis vðÞ return the horizontal and vertical distance, respectively. U 2 Rk k is a sparse symmetric and positive matrix, where k ¼ m n. The matching problem is to ﬁnd a cluster C of assignments ðpi ; ri0 ; Þ that could maximize P the score S ¼ a;b2C Uða; bÞ regarding the undirected weighted graph represented by U. This cluster C can be described by an indicator vector x, such that xðaÞ ¼ 1 if a 2 C, and xðaÞ ¼ 0 otherwise. The total intercluster score can be rewritten as X S¼ Uða; bÞ ¼ xT Ux: a;b2C The optimal solution x is the binary vector that could maximize the score x ¼ argmaxðxT UxÞ:

Experimental Results To perform an experimental validation of the proposed method, we collected a dataset from a supermarket chain; all images were captured using a first generation iPad Mini (5 megapixel camera) because images from popular mobile devices are easy to obtain. Additionally, we implemented the proposed method in Cþþ on a PC with a 3.4 GHz Intel Core i7 CPU. The test cases in the real dataset are product size, product quantity, and feature quality—the latter of which can directly affect the accuracy of compliance checking. Products with poor feature quality will lead to insufficient visual words for finding recurring patterns, which in turn results in false detection. Product quantity refers to the number of instances that belong to the same type of product. This number can greatly affect the speed of compliance checking, as the processing time of recurring pattern detection increases dramatically as product quantity increases. Based on these characteristics of the test cases, we conducted experiments to evaluate

April–June 2016

x , which will maximize the score xT Ux, is the principal eigenvector of U. A greedy algorithm is then applied to the principal eigenvector of U making it the binary indicator vector x. P The marching score S ¼ a;b2C Uða; bÞ ¼ xT Ux is calculated and stored for compliance checking.

Check compliance. M N possible matching cases can be generated by spectral graph matching. The matching scores of these cases form an M N matrix that indicates matching confidences between any expected product layout Pi from PointSet planogram and the detected product layout Rj from PointSet detected . Therefore, a greedy algorithm is performed again on this matrix to identify the optimal matches from PointSet planogram to PointSet detected (Figure 2k). The greedy algorithm first accepts the match with the maximum matching score and then rejects all other matches that are conflicting with the accepted one. The process repeats until all scores in the matching matrix are either accepted or rejected. After the greedy matching, every type of product specified in the planogram will finally be matched with one unique and optimal recurring pattern. Each product from these optimal recurring patterns will be marked as compliant (true positive) if it can be matched with one product from the planogram. A detected product that can’t be matched with any products from the planogram will be marked as not compliant (false positive). Moreover, if a product from the planogram can’t be matched with any products from recurring patterns, the position of this product will be marked as empty, which is another instance of being not compliant.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 1. Characteristics of selected samples. Selected sample

Product size

Product quantity

Feature quality

Toilet Paper

Big

Small

Rich

Heater

Big/medium

Medium/small

Medium

Coke

Big

Medium/small

Rich

Cereal

Medium

Small

Rich

Shampoo

Small

Medium/small

Poor

Tissue

Small

Large

Medium

Chocolate

Small

Large

Poor

Table 2. Accuracies for different product sizes. Compliance accuracy (%) Product size

Template-based method

Proposed method

Big

94.60

95.90

Medium

89.82

90.61

Small

40.34

84.44

Table 3. Accuracies for different product quantities. Compliance accuracy (%) Product quantity

Template-based method

Proposed method

Large

51.69

87.57

Medium

79.43

87.85

Small

91.25

92.94

Table 4. Accuracies for different feature qualities. Compliance accuracy (%) Feature quality

Template-based method

Proposed method

Rich

95.71

96.03

Medium

82.49

91.77

Poor

55.95

81.24

both the effectiveness and speed of the proposed method. Due to space limitations, we present here results on selected samples with different characteristics (see Table 1).

IEEE MultiMedia

Effectiveness For each test case, an image and its corresponding planogram XML ﬁle served as the inputs. For each product type, we compared a number of matched products Nmatched from graph matching and a number of expected products from the planogram Nexp ected . The compliance accuracy for one product type is calculated as

1 jNmatched Nexp ected j=Nexp ected . The compliance accuracy of a test case is calculated by averaging compliance accuracies over all product types in the case. For testing purposes, planogram XML ﬁles are created manually to match the product layout in the input images. Therefore, the compliance accuracy for each test case could also indicate the accuracy of the algorithm. For comparison purposes, we implemented a template-based method for planogram compliance checking as a baseline. The baseline follows a conventional idea of using product template images for training and detection. Speciﬁcally, for each input image, the algorithm must be trained using corresponding product template images, which consists of detecting scale-invariant feature transform (SIFT) keypoints and extracting SIFT descriptors. Then, the input image is divided into several nonoverlapping segments and a brute-force matching is carried out on each segment until all product templates are exhausted. A match is found if the number of matched descriptors between the input and template images exceeds a threshold. Product template images in our experiments are cropped from the images taken by a high-resolution camera. The overall accuracy achieved by the proposed method was 90.53 percent, while the template-based method’s accuracy was 71.84 percent. Tables 2–4 show the accuracies of both methods with respect to product size, quantity, and feature quality. The experimental results show that the proposed method is effective for planogram compliance checking. Compared with the template-based method, the proposed method achieves higher compliance checking accuracies, especially when dealing with products that are of small size, large quantity, or poor feature quality. Although the accuracy of the template-based method could be improved for some cases with small products of large quantities using the planogram as a prior knowledge of product location, it couldn’t achieve the same accuracy as our method. As expected, accuracies dropped for test cases with small product size, large product quantity, or poor feature quality. Obviously, lower feature quality will lead to worse compliance-checking accuracy. As for small product size, the main reason for lower accuracy is that smaller products captured in images tend to possess limited texture features. Moreover, smaller products tend to be packed in

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

(a)

(b) Figure 3. Detected products on selected samples. (a) Results on front-view samples for the template-based method (left) and the proposed method (right). Samples include Coke, cereal, tissue, and chocolate. (b) The proposed method’s results on non-front-view samples, including toilet paper, heaters, and shampoo.

percent for our proposed method over the entire set of test images. The proposed method is also effective in dealing with non-front-view test cases, whereas the template-based method isn’t suitable, as most product templates are front-view images only. Thus, variations arising from different viewpoints will further degrade that method’s accuracy. Figure 3a shows results on selected frontview samples, in which the detected products

April–June 2016

large quantities on the shelf, which contributes to the decreased accuracy. In general, the false positive rates were very low for both methods, and false positives tend to happen for the same category of products when their packages look similar, but they are classiﬁed as different products—such as tissues and shampoos from the same vendor. On average, the false positives were less than 1 percent for the template-based method and less than 3

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 5. Average processing time for different product quantities. Average processing time (seconds) Product quantity

Template-based method

Proposed method

Large

274

478

Medium

166

359

Small

147

127

Table 6. Processing time of the proposed method using different numbers of regions. (The sample was 39 products of the same type.)

Number of regions

Expected number of products in each region

Processing time (seconds)

489

195

136

(a)

(b) Figure 4. Results on product image extraction and compliance checking refinement. (a) Generating product image from recurring patterns. (b) Using the extracted product image to redetect products (the red box indicates the region of recurring pattern detection).

are marked in the input images. For the template-based method, detected products are highlighted using green bounding boxes, with the labels indicating the different product types detected. For the proposed method, detected products of the same type are labeled using circles of the same color. Figure 3b shows our method’s results on selected non-front-view samples. Speed We assessed the proposed method’s speed using the average time required to process one test case. Table 5 shows results on the speed for both methods. In general, the proposed method requires more time than the templatebased method, especially in the cases with a large product quantity. For the proposed method, most computation is actually spent on detecting recurring patterns. The computational complexity of the recurring pattern detection algorithm is Oðn3 Þ, where n is the number of visual objects (in our case, the product quantity). Therefore, the adopted divide-and-conquer strategy that partitions the image into regions to control the product quantity in each region can effectively reduce the overall CPU time, especially in cases involving many products. To validate the improvement on speed brought by the divide-and-conquer method, we carried out further experiments on the cases with large product quantities. In these experiments, the number of regions was adjusted; Table 6 shows the processing time compared with the number of regions. As these results show, region partitioning can improve the average speed by more than 70 percent without compromising the compliance checking accuracy. Product Image Extraction Our method is also capable of extracting product images. Based on the graph matching results, each type of product can be linked to a unique recurring pattern. In this recurring pattern, each visual object can be regarded as a product instance and represented by a set of feature points (see Figure 2g). A bounding box that covers all these feature points could well represent this product (see Figure 4a). To ﬁnd out the most suitable rectangular region that could represent a particular product type, we consider the product instance in a detected recurring pattern that possesses the most

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

feature points to be most representative. The rectangular region of this product instance is then selected as the product image of its product type. Figure 4a shows example results on selecting product images from recurring patterns. The extracted product images could be useful in many ways. For instance, an image could be used to detect products that were missed during recurring pattern detection, which would further improve the compliance checking accuracy. Some products might not be detected due to an unsuitable region partition. However, missing products might still be found by matching extracted product images within the overall input image. Figure 4b shows an example, in which one heater sitting on top of two other heaters isn’t detected, as it falls outside the partitioned region. The missing heater is picked up using the extracted product image and a template-matching algorithm.

Constraints,” Proc. IEEE Int’l Conf. Computer Vision (ICCV), 2005, pp. 1482–1489. Song Liu is a research associate at the Advanced Multimedia Research Lab and the School of Computing and Information Technology, University of Wollongong, Australia. His research interests include computer vision, pattern recognition, and 3D reconstruction. Liu received an MSc in computer science from the University of Wollongong. He is a member of IEEE. Contact him at ___________ sl796@uow.edu.au. Wanqing Li is an associate professor and co-director of the Advanced Multimedia Research Lab at the University of Wollongong, Australia. His research interests include computer vision, multimedia signal processing, and medical image analysis. Li received a PhD in electronic engineering from The University of Western Australia. He is a senior member of IEEE. Contact him at wanqing@uow.edu.au. ______________ Stephen Davis is a researcher at the University of Wollongong, funded by the Smart Services Collabo-

o the best of our knowledge, there’s no existing automatic method for planogram compliance checking without using template images. Our method detects products effectively and efﬁciently by merging detected recurring patterns from divided regions of the input image. Compared with a template-based method, our proposed method is much more effective, especially when processing low-quality images. Our method is challenged, however, by deformable packages such as chips. This will be our focus for future research. MM

Acknowledgment This special issue is a collaboration between the 2015 IEEE International Symposium on Multimedia (ISM 2015) and IEEE MultiMedia. This article is an extended version of “Planogram Compliance Checking Using Recurring Patterns,” presented at ISM 2015. Also, this work was partially supported by Smart Services Collaborative Research Centre (CRC) Australia.

rative Research Centre (CRC). His research interests include multimedia delivery, multimedia semantics and quality of experience, and social networking and collaboration. Davis has a PhD in computer engineering from the University of Wollongong, Australia. Contact him at stdavis@uow.edu.au. ____________ Christian Ritz is an associate professor in the School of Electrical, Computer and Telecommunications Engineering at the University of Wollongong. His research interests include spatial audio signal processing, multichannel speech signal processing, and multimedia quality of experience. Ritz received a PhD in electronic engineering from the University of Wollongong. He is a senior member of IEEE. Contact him at ___________ critz@uow.edu.au. Hongda Tian is an associate research fellow in the School of Computing and Information Technology at the University of Wollongong. His research interests include image and video processing, computer vision, pattern recognition, and machine learning. Tian received a PhD in computer science from the University of Wollongong. He is a student member of IEEE. Contact him at ht615@uow.edu.au. ____________

April–June 2016

References 1. J. Liu and Y. Liu, “Grasp Recurring Patterns from a Single View,” Proc. Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2003–2010. 2. M. Leordeanu and M. Hebert, “A Spectral Technique for Correspondence Problems Using Pairwise

_______________ _________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Feature: Image Security

An Image Encryption Algorithm Based on Autoblocking and Electrocardiography Guodong Ye and Xiaoling Huang Guangdong Ocean University

A novel image encryption algorithm uses electrocardiography to generate initial keys and an autoblocking method to remove the need for manual assignment. It has proven complex, strong, and flexible enough for practical applications.

n daily life, people communicate through media in the form of images, audio, and video. Digital image information, in particular, can be widely accessed through the Internet and via wireless networks. Images differ from text due to their bulky data capacity, redundancy, and strong correlation between adjacent pixels. Consequently, protection for images is inherently different from that for text, implying that traditional text cryptosystems might not be applicable to images.1 With the rapid advancement of network technologies, we need highly secure algorithms to safeguard digital images and protect private and unauthorized images from being illegally visited, copied, or modiﬁed. Unlike traditional text cryptosystems, chaotic systems have attracted increasing attention due

1070-986X/16/$33.00 c 2016 IEEE

to their many useful characteristics,2 such as sensitivity to initial conditions, ergodicity, and inherent control parameters. To date, a variety of such algorithms have been proposed. Noticeably, most of them adopted the classical permutationplus-diffusion architecture. For example, Anil Kumar and M.K. Ghose suggested an image encryption algorithm using the standard map with a linear feedback shift register, which by nature is an extended permutation-plus-diffusion method.3 Hongjun Liu and Xingyuan Wang used the Piecewise Linear Chaotic Map system to permute a plain-image at the bit level and then used the Chen system to confuse and diffuse the RGB components of the permuted image.4 Yang Tang and his colleagues employed the tent map to permute pixel positions in a plain-image and used a delayed coupled map lattice to diffuse the permuted image.5 As a result, they obtain cipherimages under traditional encryption structures.6,7 Under the traditional permutation-plus-diffusion structure,8,9 a common way to encrypt an image is to perform the permutation operation first and then the diffusion operation. As Wei Zhang and his colleagues point out,1 if an algorithm separates the two stages, the security of the algorithm will depend only on diffusion. In other words, permutation is not needed in the first place, so this step is a waste of work. Notice that insecure encryption schemes are also proposed. For example, Chengqing Li and his colleagues reevaluated the security of the image encryption scheme presented by Congxu Zhu10 and found that the scheme could be broken by a known-plaintext attack.11 Another example is that by analyzing the algorithm proposed by Vinod Patidar, N.K. Pareek, and K.K. Sud,12 researchers13 found that using only one pair of plaintext and ciphertext is sufficient to break the cryptosystem, because the generated keystream is independent of the plain-image. Motivated by these observations, we propose a novel chaos-based image encryption algorithm in which we use an electrocardiography (ECG) signal to generate the initial keys. The encryption algorithm can implement autoblocking for the image matrix, which is dependent on some designed control parameters. In this scheme, the keystream generated is related to the plain-image, so it can effectively resist all kinds of differential attacks.

The Proposed Method Here, we describe in detail the main structure of the proposed image encryption algorithm.

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

1 0.9 Long-time values of x

Lyapunov exponent

0.5 0 −0.5 −1 −1.5 −2 −2.5 2.5

0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1

2.7

2.9

3.1

(a)

3.3 u

3.5

3.7

0 2.8

3.9 4

3.2

(b)

3.4 u

3.6

3.8

0.9

0.8

0.7 0.6 0.5 0.4 0.3

0.2

0.1

0 (a)

Iterated value y (k )

Iterated value x (k )

Figure 1. A logistic map: (a) the Lyapunov exponent and (b) bifurcation.

100

150 k

200

250

300

(b)

100

150 k

200

250

300

Figure 2. Behavior of the generalized Arnold map: plotting the iterated values of (a) the x-coordinates and (b) the y-coordinates.

Logistic Map The logistic map is a simple, widely used function deﬁned by xn ¼ uxn 1 ð1 xn 1 Þ;

(1)

where u 2 ðo; 4 is a control parameter. It is chaotic1 when u 2 ½3:57; 4 , as shown in Figure 1. When we let u 2 ½3:9; 4 , we can obtain better chaotic properties.8 Before applying the chaotic sequence generated by iterating the logistic map onto a plain-image, we perform a conversion using xn ¼ floorðxn 1014 Þ mod 256. Here, the function floorðxÞ rounds the number x to the nearest integer toward negative inﬁnity, and mod is the modulus function.

This map is chaotic for all a > 0 and b > 0, because the largest Lyapunov exponent

pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi a2 b2 þ 4abÞ=2 > 1.9 Figure 2

shows the chaotic behavior of the generalized Arnold map. The same conversion we described for the logistic map should also be performed on the iterated values to obtain integer values between 0 and 255. ECG Signals ECG is typically deﬁned as a simple, noninvasive procedure that cyclically reports the successive atrial depolarization/repolarization and ventricular depolarization/repolarization that occur in each heartbeat (see Figure 3a).14 No precise mathematical model exists for cardiac electrical activity due to the complexity of the human body’s biological system. Moreover, ECG signals are signiﬁcantly varied for different persons and even for the same person at different times. Figures 3b and 3c show the ECG signals for the same person at two different times. So, ECG signals cannot be copied or simulated precisely.

April–June 2016

Generalized Arnold Map The generalized Arnold map, with two integer control parameters a and b, is deﬁned by xnþ1 1 a xn ¼ mod1: (2) b 1 þ ab ynþ1 yn

k ¼ 1 þ ðab þ

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q

1.5

0.5

Voltage

THE WORLD’S NEWSSTAND®

0.5

−0.5

−1

(a)

50 100 150 200 250 300 350 400 Sampling point (b)

−1

50 100 150 200 250 300 350 400 (c) Sampling point

−1

50 100 150 200 250 300 350 400 Sampling point

Figure 3. ECG signals for (a) person A, (b) person B, and (c) person B at a different time.

value for the logistic map (see Equation 1), and jkj represents the absolute value of k.

Table 1. Lyapunov exponent values (k) for different ECGs. Person

Person C before the shift (Figure 4a)

0.05281413777783

Person C after the shift (Figure 4b)

0.01751099064177

Person D before the shift (Figure 4c)

0.02968008559044

Person D after the shift (Figure 4d)

þ0.00461212897750

Wolf Algorithm for Initial Conditions Ching-Kun Chen and his colleagues have proposed a method for generating a secret key from an ECG signal.15 They calculate the largest Lyapunov exponent k to extract ECG features using the Wolf algorithm. The algorithm ﬁrst calculates the distance of two nearby points in the phase space of the signal, denoting it as s0 . Then it computes again the new distance s1 when these two points are moved a short distance in the phase space. If s1 is too large, one of these two points will be kept and the other will be replaced by a new one in the same orbit. Finally, the largest Lyapunov exponent k is obtained using the following equation after q iterations: 1 k¼ tq t0

q X

k¼1

s1 ðtk Þ ; tk ¼ k D ; (3) s0 ðtk 1 Þ

IEEE MultiMedia

where D represents the sampling period. Table 1 shows the largest Lyapunov exponents for two individuals before and after a one-sampling shift in ECG (see Figure 4). These data points indicate the high sensitivity of the ECG signal. The mathematical model described by Equation 4 generates three initial conditions for the generalized Arnold map and the logistic map: 8 < x0 ¼ jkj y ¼ jkj 105 floorðjkj 105 Þ; : 0 x0 ¼ jkj 108 floorðjkj 108 Þ ð4Þ where x0 and y0 are initial values for the generalized Arnold map (see Equation 2), x0 is the initial

Autoblocking Method Suppose that a plain-image has size M N, and the block number’s size is p1 p2 . Here, p1 and p2 should not be too big or too small, since too big a size will result in a small block and too small will increase computation. Table 2 shows nine cases of block sizes for a 256 256 plain-image. Here, we consider only images of size 256 256 without loss of generality; other cases can be discussed similarly. Using Equation 5, we can get p1 and p2 for the block numbers in Table 2 with indexes q1 and q2 , produced by the logistic map ðq1 ; q2 2 f1; 2; 3gÞ: 8 < q1 ¼ mod floorðx x1 1014 Þ; 3 ; ð5Þ : q2 ¼ mod floorðx x2 1014 Þ; 3 where x1 and x2 are control parameters. As a result, autoblocking can be implemented. For example, if q1 ¼ 2 and q2 ¼ 1, then the corresponding block number will be ð32; 16Þ ¼ ðp1 ; p2 Þ, as shown in Table 2, with block size 8 16. Clearly, the autoblocking method depends on the outputs of the logistic map and the initial conditions from the ECG signals. Encryption and Decryption Processes With initial conditions x0 and y0 given by Equation 4, we can get iterated values fx0 ; y0 ; x1 ; y1 ; x2 ; y2 ; g from the system in Equation 2. Assuming that the size of a sub-block is p1 p2 ðM ¼ r1 p1 ; N ¼ r2 p2 Þ and using a starting control parameter r, we can collect a set of values of length MN: fxðrþ1Þ ; yðrþ1Þ ; xðrþ2Þ ; yðrþ2Þ ; g, which is then converted into a pseudorandom matrix D of size M N. Like the division of the plain-image, we divide matrix D into r1 r2 blocks, with each block size being p1 p2 .

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

u1; v ¼ x u2: r¼x

(7)

Additionally, the constant block C0 with size ;x ; g p1 p2 can be obtained from fx X0 1 Ar1 r2 mod with control parameter u3 ¼ 256 þ 1. The keystream thus generated is indeed dependent on the plain-image. In the decryption process, we can recover the plain-image from the cipher-image by performing the inverse operation of Equation 6, as follows: 8 Ar r ¼ Cr1 r2 Cr1 r2 1 > > > 12 > < Ar1 r2 1 ¼ Cr1 r2 1 vDr1 r2 1 Cr1 r2 2 ⯗ ⯗ ⯗ ⯗ : (8) > > ¼ C2 vD2 C1 A2 > > : A1 ¼ C1 vD1 C0

Simulation Examples

1.5

−1

Voltage

1 0.5

0.5 0

−0.5 0

200

400 600 800 1000 1200 (b) Sampling point

(a)

−1

200

400 600 800 1000 1200 Sampling point

−0.8 400 600 800 1000 1200 0 Sampling point (d)

200

400 600 800 1000 1200 Sampling point

0.6

0.4

0.2

0 –0.2

–0.4

−0.6

−0.8

200

(c)

Figure 4. ECG signals for (a) person C, (b) person C with a one-sampling shift, (c) person D, and (d) person D with a one-sampling shift.

Table 2. Blocking for size 256 3 256. 0

(8, 8)

(8, 16)

(8, 32)

(16, 8)

(16, 16)

(16, 32)

(32, 8)

(32, 16)

(32, 32)

tic map), we obtained the corresponding cipherimage using the proposed encryption algorithm with three rounds of iteration (see Figure 5c). Figure 5d shows another cipher-image with a onesampling shift in the ECG of Figure 5b. Given an ECG signal, executing the whole algorithm took 0.0468 seconds; this demonstrates that implementing encryption using the proposed algorithm is fast.

Key Space and Sensitivity We generated the secret keys using 1,000 samples from an ECG signal and without control parameters such as u1 or u2. It is well known that copying or simulating an ECG signal is difﬁcult due to the complexity of the biological system. Of course, the number of sampling points can be adjusted according to the security requirement. Also, an ECG acquisition system15 can be built to collect actual ECG signals if one does not want to use the online database (the

April–June 2016

We performed simulations with Matlab 7.0 on a Notebook. We randomly chose an image of size 256 256, which we call Lena, for testing (see Figure 5a). As Figure 5b shows, we used an ECG signal from the Physionet16 online database as our initial condition. After setting the control parameters a ¼ 1; b ¼ 1; x1 ¼ 50; x2 ¼ 50; u ¼ 3:999; u1 ¼ 20; and u2 ¼ 15, together with two chaotic maps (the generalized Arnold map and the logis-

Voltage

Here, Ci and Ci 1 denote the current and former cipher-image blocks, respectively; C0 is a constant block; v is a new control parameter; and Ai and Di are the ith blocks of the plain-image and the pseudorandom matrix, respectively. Notice that, before performing diffusion, the classical encryption architecture should include a permutation operation in the ﬁrst stage. On the other hand, the two processes of permutation and diffusion will become independent when the plain-image is a homogeneous one with identical pixels.1 As a result, the security of the whole algorithm will rely only on diffusion. Based on these analyses, the proposed method considers only the diffusion process. 0, by iterating the logistic With initial key x 0; x 1 ; g. map, we obtain a set of values fx The parameters r and v can be designed with control parameters u1 and u2 as follows:

−0.5

Voltage

We consider and implement only the diffusion operation, as follows: 8 C1 ¼ A1 þ vD1 þ C0 > > > > ¼ A2 þ vD2 þ C1 < C2 ⯗ ⯗ ⯗ ⯗ (6) > > > Cr1 r2 1 ¼ Ar1 r2 1 þ vDr1 r2 1 þ Cr1 r2 2 > : Cr1 r2 ¼ Ar1 r2 þ Cr1 r2 1

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

2 1.5 1 0.5 0 −0.5 −1 0 (a)

200

400

600

800

1000

1200

(b)

(c)

(d)

(g)

(h)

2.5 2 1.5 1 0.5 0 −0.5

(e)

200

400

600

800

1000

1200 (f)

Figure 5. The Lena test image: (a) the plain-image; (b) the ECG of person A; (c) the cipher-image using (b); (d) the cipher-image with a one-sampling shift in (b); (e) the ECG of person B; (f) the cipher-image using (e); (g) decryption with a one-sampling shift in (b); and (h) decryption with a one-sampling shift in (e).

IEEE MultiMedia

ECG is like a one-time keypad—different people will have different ECGs, so the keys will not be used twice). Thus, the key space is sufﬁciently large to resist brute-force attack. Besides having a large key space, an ideal encryption algorithm should also be sensitive to every key. Figure 5f is the cipher-image using the ECG signal in Figure 5e. However, with only a one-sampling shift in the ECG signal, we would obtain incorrect decrypted images from the cipher-images in Figures 5c and 5f, as shown in ﬁgures 5g and 5h, respectively. Therefore, one cannot correctly recover the plainimage if there is any tiny change in the keys. However, the original plain-image can be recovered by using the corresponding ECG signal, as shown in Figure 5a. Histograms The distribution of gray values for an image can be displayed in a histogram, by which image information can be detected and analyzed.

Therefore, the histogram of the cipher-image should be changed so that it differs from that of the plain-image. By doing so, a statistical attack will be infeasible. Figure 6 shows the histograms corresponding to a Cameraman image and the cipherimage of the Cameraman. Here, we chose the Cameraman image for testing, taken from Matlab 7.0 (www.mathworks.com/help/images/ examples/deblurring-images-using-the-blind________________________________ deconvolution-algorithm.html). Clearly, the _____________________ histogram of the cipher-image is fairly uniform and differs significantly from that of the plainimage. Thus, by using the proposed cryptosystem, the cipher-image will not supply any useful information related to the plain-image. Differential Attack To measure the influence of a one-bit change from the plain-image to the cipher-image, two performance measures17 are usually adopted: the unified average changing intensity (UACI)

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

and the number-of-pixels change rate (NPCR), both as percentages. If the values of UACI and NPCR can reach 33.3 and 99.5 percent, respectively, the designed encryption method is considered able to resist differential attacks. NPCR and UACI are deﬁned as follows:

NPCR ¼

Dði; jÞ

M N

100%

(9)

(a)

2X 3 1 jC1 ði; jÞ C2 ði; jÞj UACI ¼ 4 5 100%; M N 255 i;j

(10) where C1 and C2 are two cipher-images with only a one-bit change in the same plain-image; Dði; jÞ ¼ 0 if C1 ði; jÞ ¼ C2 ði; jÞ, otherwise Dði; jÞ ¼ 1. We again chose the Lena image for testing, with results listed in Table 3. They indicate that a one-bit change in the plain-image results in a totally different cipher-image. Correlation Coefficients Strong correlation between two adjacent pixels usually exists in a meaningful plain-image. A good encryption algorithm should reduce the correlation coefficients to near zero.18 By randomly selecting 2,500 pairs of adjacent pixels in horizontal, vertical, and diagonal directions from the plain-image and the corresponding cipher-image, we tested the proposed algorithm according to Equation 11, as follows: covðx; yÞ rxy ¼ pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ; DðxÞDðyÞ

1 N

600 500 400 300 200 100 0 0

100

150

200

250

(b)

100

150

200

250

Figure 6. The histogram test for the plain-image Cameraman and its cipherimage using our method: (a) the histogram of the Cameraman plain-image, and (b) the histogram of the cipher-image. (See the Cameraman image at www.mathworks.com/help/images/examples/deblurring-images-using-the_______________________________________________ blind-deconvolution-algorithm.html.) _______________________

Table 3. Unified average changing intensity (UACI) and the number-of-pixels change rate (NPCR) values (%) for the Lena image. 2 rounds

3 rounds

4 rounds

5 rounds

UACI

32.255

33.456

33.507

33.546

NPCR

99.413

99.576

99.634

99.561

Table 4. Correlation coefficients. Direction

Horizontally

Vertically

Diagonally

Plain-image

0.94875

0.96394

0.94085

Cipher-image

0.02507

0.02071

0.00796

(11)

where covðx; yÞ ¼

1000 900 800 700 600 500 400 300 200 100 0

N X

xi EðxÞ yi EðyÞ

i¼1

8 N > X > 2 > 1 > > > DðxÞ ¼ xi EðxÞ > > N < i¼1 and N > X > > > 1 > > EðxÞ ¼ xi > > N : i¼1

April–June 2016

and where xi and yi represent the gray values of two adjacent pixels in the image. The results, shown in Table 4, show that the cipher-image’s correlation coefﬁcients are around zero. As a helpful visualization, Figure 7 plots the correlation of the Lena image along the diagonal direction.

Known-Plaintext and Chosen-Plaintext Attacks To frustrate known-plaintext and chosen-plaintext attacks, a feedback mechanism from the plaintext should be established to change the keystream with respect to different plainimages.6 In the proposed method, the control paramP eter u3 ¼ ð Ar1 r2 Þmod256 þ 1 is set. As a result, the corresponding keystreams will not stay the same when encrypting different plain-images. Therefore, known-plaintext and chosen-plaintext attacks are infeasible for the proposed encryption algorithm. Randomness We employed the NIST 800-22 statistical test suite to test the randomness of the cipherimage. We chose a meteorogram image of size 256 256 at random for testing, as shown in

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

300 Gray value of image pixels at location (x+1, y+1)

Gray value of image pixels at location (x+1, y+1)

250 200 150 100 50 0

(a)

250 200 150 100 50

50 100 150 200 250 Gray value of image pixels at location (x, y) (b)

50 100 150 200 250 Gray value of image pixels at location (x, y)

Figure 7. Correlation in the diagonal direction of (a) the plain-image of Lena, and (b) the cipher-image of Lena.

Table 5. Randomness test for the cipher-image in Figure 8b. Test items

P-value

Results

Frequency

0.759756

Pass

Block frequency

0.319084

Pass

Cumulative sums

0.213309

Pass

Runs

0.595549

Pass

Longest run

0.867792

Pass

Rank

0.739918

Pass

Fast Fourier transform

0.224821

Pass

Nonoverlapping template

0.001628

Pass

Overlapping template

0.554420

Pass

Universal

0.678686

Pass

Approximate entropy

0.319084

Pass

Random excursions

0.055361

Pass

Random excursions variant

0.002559

Pass

Serial

0.534146

Pass

Linear complexity

0.191687

Pass

(a)

(b)

Figure 8. Meteorogram of (a) a plain-image, and (b) a cipher-image. (Source: NASA; used with permission.)

Figure 8a. Table 5 shows the results of this randomness test for the corresponding cipherimage in Figure 8b. The data show that the proposed algorithm satisﬁes the security requirement and passes the randomness test. Therefore, the proposed method has a strong capability to resist statistical analysis. We did not consider in this work how the ECG signals can be acquired directly from a live human body. A new simple handheld device developed by Ching-Kun Chen and his colleagues15 will be able to provide real-time live ECG signals and, hence, the one-time keys. In future work, we will continue to study a simpler ECG signal acquisition system to build an ECG biophysical one-time pad. MM

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Acknowledgments This work was fully supported by the National Natural Science Foundations of China (No. 11301091, No. 11526057), the Natural Science Foundation of Guangdong Province of China (No. 2015A030313614), the Project of Enhancing School with Innovation of Guangdong Ocean University of China (No. Q14217), the Science & Technology Planning Projects of Zhanjiang City of China (No. 2015B01051, No. 2015B01098), and the Program for Scientific Research Start-up Funds of Guangdong Ocean University.

References

Mechanism,” Optics Comm., vol. 284, no. 22, 2011, pp. 5290–5298. 10. C. Zhu, “A Novel Image Encryption Scheme Based on Improved Hyperchaotic Sequences,” Optics Comm., vol. 285, no. 1, 2012, pp. 29–37. 11. C.Q. Li et al., “Breaking a Novel Image Encryption Scheme Based on Improved Hyperchaotic Sequences,” Nonlinear Dynamics, vol. 73, no. 3, 2013, pp. 2083–2089. 12. V. Patidar, N.K. Pareek, and K.K. Sud, “A New Substitution Diffusion Based Image Cipher Using Chaotic Standard and Logistic Maps,” Comm. Nonlinear Science and Numerical Simulation, vol. 14, no. 7, 2009, pp. 3056–3075. 13. R. Rhouma, E. Solak, and S. Belghith, “Cryptanalysis of a New Substitution-Diffusion

1. W. Zhang et al., “An Image Encryption Scheme Using Reverse 2-Dimensional Chaotic Map and Dependent Diffusion,” Comm. Nonlinear Science and Numerical Simulation, vol. 18, 2013, pp. 2066–2080. 2. J. Fridrich, “Symmetric Ciphers Based on TwoDimensional Chaotic Maps,” Int’l J. Bifurcation and Chaos, vol. 8, no. 6, 1998, pp. 1259–1284. 3. A. Kumar and M.K. Ghose, “Extended Substitution-Diffusion Based Image Cipher Using Chaotic Standard Map,” Comm. Nonlinear Science and Numerical Simulation, vol. 16, no. 1, 2011, pp. 372–382. 4. H.J. Liu and X.Y. Wang, “Color Image Encryption Using Spatial Bit-Level Permutation and HighDimension Chaotic System,” Optics Comm., vol. 284, 2011, pp. 3895–3903. 5. Y. Tang, Z.D. Wang, and J.A. Fang, “Image Encryption Using Chaotic Coupled Map Lattices with Time-Varying Delays,” Comm. Nonlinear Science and Numerical Simulation, vol. 15, no. 9, 2010, pp. 2456–2468. 6. Y.S. Zhang and D. Xiao, “An Image Encryption Scheme Based on Rotation Matrix Bit-Level Permu-

Based Image Cipher,” Comm. Nonlinear Science and Numerical Simulation, vol. 15, no. 7, 2010, pp. 1887–1892. 14. M. Mangia et al., “Rakeness-Based Approach to Compressed Sensing of ECGs,” 2011 IEEE Biomedical Circuits and Systems Conf. (BioCAS), 2011, pp. 424–427. 15. C.K. Chen et al., “Personalized Information Encryption Using ECG Signals with Chaotic Functions,” Information Sciences, vol. 193, June 2012, pp. 125–140. 16. A.L. Goldberger et al., “Physiobank, Physiotoolkit, and Physionet: Components of a New Research Resource for Complex Physiologic Signals,” Circulation, vol. 101, vol. 23, 2000, pp. e215–e220. 17. B. Norouzi and S. Mirzakuchaki, “A Fast Color Image Encryption Algorithm Based on Hyper-Chaotic Systems,” Nonlinear Dynamics, vol. 78, no. 2, 2014, pp. 995–1015. 18. Z. Eslami and A. Bakhshandeh, “An Improvement over an Image Encryption Method Based on Total Shuffling,” Optics Comm., vol. 286, 2013, pp. 51–55. Guodong Ye, the corresponding author, is an associ-

tation and Block Diffusion,” Comm. Nonlinear Sci-

ate professor of information security in the College of Science at Guangdong Ocean University, China. His

ence and Numerical Simulation, vol. 19, no. 1, 2014, pp. 74–82.

research interests include image encryption, image

7. Y.S. Zhang et al., “A Novel Image Encryption

quality assessment, and numerical simulation. Ye received his PhD in electronic engineering from City University of Hong Kong. Contact him at guodongye _______

tem of Partial Differential Equations,” Signal Processing: Image Communication, vol. 28, no. 3, 2013,

@hotmail.com or _______________ guodongye@gmail.com. _________

pp. 292–300.

Xiaoling Huang is an associate professor of cryptog-

8. Y. Wang et al., “A Chaos-Based Image Encryption

raphy in the College of Science at Guangdong Ocean

Algorithm with Variable Control Parameters,” Chaos, Solitons and Fractals, vol. 41, no. 4, 2009,

University, China. Her research interests include cryptography, information security, and mathemati-

pp. 1773–1783.

cal models. Huang received her MS in mathematics

9. R.S. Ye, “A Novel Chaos-Based Image Encryption Scheme with an Efficient Permutation-Diffusion

April–June 2016

Scheme Based on a Linear Hyperbolic Chaotic Sys-

from Shantou University, China. Contact her at xyxhuang@hotmail.com. _______________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Feature: Depth Sensing

Extended Guided Filtering for Depth Map Upsampling Kai-Lung Hua and Kai-Han Lo National Taiwan University of Science and Technology Yu-Chiang Frank Wang Academia Sinica

This extended guided filtering approach for depth map upsampling outperforms other state-of-the-art approaches by using a high-resolution color image as a guide and applying an onion-peeling filtering procedure that exploits local gradient information of depth images.

he emerging topic of 3D scene analysis has attracted the attention of computer vision and image processing researchers. Depth sensing is among the important tasks researchers must address to better understand the structure of a scene. In recent years, passive stereo matching has been used to compute depth information through multiple cameras, but it is unreliable for textureless or occluded image regions. Active laser range scanners can provide dense and precise depth data, but their use is typically limited to static scenes because they only measure a single point at a time. These limitations have been overcome by time-of-ﬂight (ToF) cameras, which measure the round-trip travel time of infrared signals between a ToF camera and the observed object or surface of interest. However, due to the intrinsic physical constraints of ToF cameras, using such sensors suffers from low image resolution. Nevertheless, accurate high-resolution depth mapping is in demand for many applications. For example, ﬁne-grained gesture recognition and semantic scene analysis would improve per-

1070-986X/16/$33.00 c 2016 IEEE

formance when leveraging high-quality RGB-D (red, green, blue plus depth) data. Moreover, researchers have reported that for 3D TV, the higher the spatial resolution of the depth map is, the higher the video quality and depth perception are.1 Based on the examples given, it is obvious that researchers desire techniques for excellent depth map upsampling. RGB-D cameras such as Microsoft Kinect offer synchronized information of color and depth images. Because the built-in RGB and range sensors are placed side by side, registration of color-depth image pairs can be achieved via homographic warping or multiview camera calibration techniques (see Figure 1).2,3 Therefore, several depth map upsampling algorithms have been proposed that use registered highresolution color images for reference. In general, these approaches fall into two classes: learning-based and filtering-based. For learning-based approaches, James Diebel and Sebastian Thrun approached depth map upsampling by solving a multilabeling optimization problem via Markov random fields (MRFs).4 In other recent work,5–10 improved depth map upsampling estimates were achieved by solving MRFs with additional constraints on depth discontinuities. Learning-based approaches in general require higher computational loads, so their use in practical scenarios is limited. On the other hand, filtering-based approaches, such as bilateral filtering11 and its extensions, have been employed to solve this task, with the goal of enhancing depth image resolution while preserving edge information.2,3,12–14 Recently, Kaiming He and his colleagues15 proposed guided filtering (GF), which is more efficient and effective near edges when compared to bilateral filtering. Although promising edge-preserving results have been reported,15 color texture-copying artifacts cannot easily be addressed in depth map upsampling due to possible inconsistency between color and depth variations. Here, we propose an extended GF algorithm for depth map upsampling.

A Brief Review of Guided Filtering GF15 is an edge-preserving smoothing filter. While it functions like bilateral filtering,11 GF shows improved performance near image edges. The filtering output depends on the content of the guidance image, which can be either the input image itself or another relevant image. The formulation of GF is derived from the local î ¼ ak Ii þ bk ; 8i 2 xk , linear transformation p

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

CCD camera

Depth camera

Upsampled depth map

LR depth map

HR color image

Figure 1. Using a registered high-resolution color image for depth map upsampling. Registration of a colordepth image pair could be achieved via homographic warping or camera calibration.

^ is the filter output, i is the pixel index, I where p is the guidance image, and ðak ; bk Þ are constant linear coefficients in a local reference window xk centered at pixel k. The solution of these linear coefficients is defined to minimize the squared ^ and the input p: difference X between p 2 2 Eðak ; bk Þ ¼ ½ða I i2xk k i þ bk pi Þ þ eak , where 僆 is a regularization parameter. It has been proven elsewhere15 that the filter output can be approximated Xby the weighted î average function p j2uðiÞ Wp;q ði; jÞ qj , where qj is a neighboring pixel indexed by j adjacent to pi . For simplicity, we will use ðp; qÞ as notations for ðpi ; qj Þ in the rest of this article. Thus, the weight can be expressed as Wp;q ¼

ðIp lI ÞðIq lI Þ 1 þ r2I þ e jxj2 k:ðp;qÞ2x 1

(1) where lI and r2I are the mean and variance of I in xk , respectively. Equation 1 is the kernel of GF for weighting assignments, normalized by jxj (that is, the number of pixels in xk ). Computed from the ﬁltering kernel, the weight is much smaller than 1 when p and q are located at different sides of an edge in the guidance image. In contrast, when we focus on homogeneous regions, both p and q are close to the patch mean; hence, the corresponding weight is approximately equal to 1 as the behavior of lowpass ﬁltering. Although the edge-preserving

smoothing property can be achieved, texturecopying artifacts might occur in the output when certain inconsistency exists between the guidance image and the target image. To address this issue for depth map upsampling, we propose an extended GF algorithm.

Our Proposed Method Figure 2 shows an overview of our proposed extended GF framework. We first obtain the initial upsampled depth map through bicubic interpolation. For this initially estimated highresolution depth map, we derive a mask that identifies the unreliable regions for later filtering processes. As detailed later, the goal of extended GF is to exploit both the structures in the guidance high-resolution color image and local gradient information of the depth map to predict the final depth output. Onion-Peel Filtering for Depth Edges Compared to color images, depth maps possess the property that the pixel values of the same object surface in 3D are homogeneous. Based on this observation, we first upsample the input low-resolution depth map using bicubic interpolation to obtain an initial high-resolution estimation. Although the smoothness of the estimated high-resolution depth pixel values of the same surface can be preserved, it is inevitable that details such as depth edges will be blurred. Therefore, it is necessary to identify

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q

THE WORLD’S NEWSSTAND®

Input image pair HR color image

HR intensity image

LR depth map

Interpolated HR depth map

Extended guided ﬁltering

Gradient image (horizontal)

Upsampled depth map

(vertical)

Figure 2. Flowchart of our proposed extended guided filtering for depth map upsampling. Our goal is to exploit both the structures in the guidance high-resolution color image and local gradient information in the depth map to predict the final depth output.

the depth discontinuity for such detailed regions, so that the predicted depth values in these unreliable regions can be reﬁned accordingly. For each pixel p of the interpolated depth map, we compute its depth range information as the difference between the maximum and minimum depth values within a reference win-

Initialization before depth edge reﬁnement

dow, containing a set of neighboring pixels centered at p. To identify if this window corresponds to the depth discontinuity/unreliable region, we use Pedge ¼ 1 to denote pixels whose depth range is above a predefined threshold s (or Pedge ¼ 0 otherwise). Next, we use the term Punreliable as a binary flag guiding the following filtering process.

Filtering in onion-peel skin order

Output depth map

Figure 3. Example of the onion-peeling filtering process for a patch containing a diagonal edge in the depth map. The numbers for each pixel denote the filtering orders from reliable toward unreliable regions.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

GV,1

GV,2

(a)

(b)

(c)

Figure 4. Example of the second-order gradient of a depth patch containing a horizontal edge. The yellow line indicates the expected boundary of the depth edge. (a) Adjacent pixels p and q are located across a blurred edge in the interpolated depth map; (b) a firstorder vertical gradient map; and (c) a second-order vertical gradient map.

First, the values of Punreliable are initialized as the values of Pedge on all pixel coordinates. Then filtering proceeds for every unreliable pixel by progressive scanning from left to right and top to bottom, supported by reliable pixels under the filtering mask. During the scanning procedure, pixels with at least one reliable eight-connected neighbor are filtered first, then the corresponding coordinates are marked. Once all

the unreliable pixels are scanned, the values of Punreliable of the marked coordinates are updated to 0, changing the roles of particular pixels into reliable ones. Hence, the corresponding ﬁltered outputs are viewed as observed values to support the remaining unreliable pixels in the next scan. As illustrated in Figure 3, the entire ﬂow is an onion-peel process in which we predict depth values for pixels in unreliable regions from

Higher weight p

(a)

(d)

(b)

(c)

(e)

Lower weight

(f)

Figure 5. Illustration of our extended filter kernel in a local reference window centered at p: (a) an interpolated depth map, (b) a second-order vertical gradient in depth, (c) weights obtained from the depth gradient, (d) the corresponding intensity image, (e) weights obtained from the intensity of the guidance color image, and (f) weights integrating (c) and (e). Note that the pink line depicts the horizontal edge in depth.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 1. Performance comparisons of Middlebury images with upsampling factors 8 and 4 in terms of bad pixel rate. Note that for images Venus and Teddy, the masks of non-occlusion regions (nonocc.) and discontinuity regions (disc.) were provided by the Middlebury website (http://vision.middlebury.edu/stereo), ______________________ and we compute the BP% in these regions. (Bold indicates best performance.) Venus

Teddy

Bad pixel rate (BP%) of 8X

Nonocc.*

All†

Disc.‡

Nonocc.

All

Disc.

2.10

2.44

7.52

15.95

15.99

31.59

Guided Filtering15

1.13

1.44

15.79

11.75

12.52

42.05

Joint Bilateral Upsampling13

1.10

1.61

12.44

11.75

12.68

36.40

Diebel and colleagues

0.98

1.46

12.51

11.45

12.37

36.04

§Lu and colleagues7

NAFDU

0.98

1.36

7.29

13.73

14.93

31.58

§Kim and colleagues5

0.60

0.74

5.83

9.80

10.47

25.63

Jung12

0.99

1.46

11.16

9.62

10.65

29.88

§Lo and colleagues6

0.39

0.49

4.74

9.49

10.77

26.39

§Xie and colleagues9

0.76

0.65

5.61

6.78

7.86

17.91

Extended GF

0.20

0.32

2.79

7.92

8.22

25.66

§Diebel and colleagues4

0.85

1.16

3.93

7.46

8.17

18.02

Guided Filtering15

0.47

0.63

6.58

6.66

7.19

24.56

Joint Bilateral Upsampling13

0.40

0.72

5.62

5.75

6.61

20.08

0.41

0.69

5.74

5.64

6.48

19.75

0.24

0.31

3.27

5.14

5.60

14.47

0.17

0.30

2.31

5.33

6.20

16.86

Jung12

0.27

0.59

3.76

4.77

5.65

15.99

0.12

0.16

1.67

3.35

3.69

9.59

BP% of 4X

NAFDU3 Lu and colleagues7 Kim and colleagues5 Lo and colleagues6

Xie and colleagues

Extended GF

0.23

0.45

3.22

1.68

2.45

5.84

0.09

0.15

1.19

3.81

4.26

12.50

Nonocc.: pre-defined non-occlusion regions All: all regions, entire image ‡ Disc. : pre-defined discontinuity regions § A learning-based approach k NAFDU: Noise-Aware Filter for Real-Time Depth Upsampling †

outside inward in a concentric-layer order; we terminate the ﬁltering when all pixels are marked as reliable. Extended Guided Filtering For each pixel in the ﬁltering process, the goal of GF is to assign a lower weight to its neighboring pixels if they are at a different side of an image edge. When applying this property to the task of depth map upsampling, the registered high-resolution color image is the guidance for the weighting assignment, explicitly revealing the assumption that depth discontinuities often co-occur with color edges. However, when inconsistency exists between the color image and its depth map, this assump-

tion would fail and lead to texture-copying artifacts. In particular, it occurs when two adjacent pixels in the same depth plane have different colors, or two adjacent pixels with similar color have distinct depth variations. The former is not an issue for us, because it corresponds to the smooth depth regions in the initial upsampled depth map. As for the latter case, we propose an extended kernel for extended GF as follows:

ðIp lI ÞðIq lI Þ 1 X Wp;q ¼ 1þ N k:ðp;qÞ2x r2I þ e k

1 þ DðGp ; Gq Þ MðqÞ; ð2Þ where N is the normalization factor. The term in the ﬁrst parenthesis is the kernel of

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Baby1

Dolls

Midd2

Moebius

Reindeer

Living1

Store1

Kitchen1

8.01

13.83

5.06

17.42

4.39

8.00

9.33

6.94

6.26

4.05

4.42

10.05

5.94

4.99

4.33

5.47

3.03

12.51

4.38

13.16

7.08

6.84

5.83

3.42

4.89

9.71

3.76

11.16

6.70

6.90

5.77

3.75

3.65

17.44

3.74

17.26

8.21

6.17

5.25

3.15

5.21

7.63

3.59

9.15

4.58

6.04

5.97

3.56

4.55

11.67

3.91

12.83

6.12

5.43

4.70

2.78

3.18

12.25

3.01

10.09

5.39

5.76

4.26

3.10

2.20

6.29

2.16

5.12

3.85

4.43

4.30

2.58

1.86

5.16

2.40

7.20

3.79

5.50

4.21

2.54

5.08

6.64

2.96

9.58

3.95

4.81

3.09

3.45

2.25

4.48

2.57

5.67

3.24

2.22

2.73

1.34

2.36

5.55

1.95

6.48

3.36

3.37

3.60

2.30

2.26

4.96

1.87

5.53

3.30

2.74

2.84

1.96

2.17

6.69

3.00

6.26

3.48

2.34

2.56

2.32

3.48

4.37

2.92

5.46

2.14

2.37

2.46

2.24

1.84

5.04

2.10

6.03

2.91

2.87

2.86

2.00

1.50

3.45

1.80

4.33

1.97

1.94

2.39

1.45

All

1.10

2.56

0.97

2.80

2.21

1.85

2.31

1.33

0.85

2.08

0.80

3.02

1.80

2.07

2.18

1.27

the original GF, as introduced in Equation 1. The second term, which we will explain in more detail in the following paragraph, is the proposed extended kernel that we employ to alleviate texture-copying artifacts in depth discontinuity regions. The last term, MðqÞ, is the binary mask that indicates whether neighboring pixel q is reliable. In other words, MðqÞ ¼ 1 Punreliable ðqÞ. We now explain how the proposed kernel for extended GF takes depth information for robust weight estimation. To exploit this information, we ﬁrst calculate gradients for each pixel at coordinate ðm; nÞ in the bicubic interpolated depth map S along the vertical axis:

GV;1 ðm; nÞ ¼ 1=2jSðm þ 1; nÞ Sðm 1; nÞj: (3) These operations reveal the magnitudes of local depth variations in the vertical direction. As illustrated in Figure 4a, if two adjacent pixels were located across a blurred depth edge, it can be seen in Figure 4b that they would hold similar ﬁrst-order gradient values even if they belonged to different depth planes. Thus, we further compute the second-order gradient as GV;2 ðm; nÞ ¼ 1=2 fGV;1 ðm þ 1; nÞ GV;1 ðm 1; nÞg:

(4)

Similarly, we also compute GH;1 and GH;2 for the horizontal aspect. In Equation 4, the signs

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 2. Performance comparisons of Middlebury images with upsampling factors 8 and 4 in terms of mean squared error. (Bold indicates best performance.) Venus

Teddy

Mean squared error of 8X

Nonocc.

All

Disc.

Nonocc.

*Diebel and colleagues4

17.71

21.69

87.55

143.66

Guided Filtering15

4.87

6.05

63.99

56.14

61.99

205.36

Joint Bilateral Upsampling13

7.63

11.19

85.97

94.78

108.38

327.87

†

6.93

10.35

86.54

94.22

107.78

327.18

*Lu and colleagues7

18.79

20.38

95.94

31.82

38.51

76.40

*Kim and colleagues5

8.34

8.90

46.92

44.50

53.29

142.27

Jung12

8.32

12.39

93.10

72.85

87.04

251.77

*Lo and colleagues6

13.97

15.04

85.76

48.29

55.70

143.12

*Xie and colleagues9

8.00

10.51

76.80

14.11

21.46

50.83

Extended GF

2.55

3.33

29.42

21.04

26.53

70.93

*Diebel and colleagues4

9.55

12.69

49.98

40.63

51.95

135.42

Guided Filtering15

2.12

2.78

28.22

38.06

44.75

140.64

Joint Bilateral Upsampling13

2.90

5.04

36.27

61.03

72.47

224.36

†

2.77

4.79

36.41

60.78

72.17

223.87

*Lu and colleagues7

6.34

5.43

39.54

16.97

20.98

44.56

*Kim and colleagues5

2.83

3.59

19.24

40.22

51.04

141.75

Jung12

2.42

4.72

30.08

49.74

60.84

179.52

*Lo and colleagues6

2.16

2.54

18.67

26.07

32.40

80.02

2.29

4.61

30.33

6.89

12.00

24.60

1.26

1.71

15.33

15.82

20.39

55.76

NAFDU3

All 52.49

Disc. 121.47

Mean squared error of 4X

NAFDU3

*Xie and colleagues Extended GF GF *A

learning-based approach Noise-Aware Filter for Real-Time Depth Upsampling

†NAFDU:

of the second-order gradient between adjacent pixels across a depth edge would be inverse, which can be viewed as an indicator for depth discontinuities, as illustrated in Figure 4c. Based on this observation, we derive the term DðGp ; Gq Þ in Equation 2 as follows: DðGp ; Gq Þ ¼ ( min

V;2 ðGpV;2 lG ÞðGV;2 lV;2 q G Þ 2 ðrV;2 G Þ þe H;2 ðGpH;2 lG ÞðGqH;2 lH;2 G Þ 2 ðrH;2 G Þ þe

; )

Experiments ; ð5Þ

where the gradient difference between p and q V;2 is computed with respect to lG and lH;2 G , the means of the second-order gradients within the 2 H;2 2 reference window xk. ðrV;2 G Þ and ðrG Þ are the corresponding variances. We perform the minimum operation in Equation 5 to examine if pixel q locates at the same depth plane as p. If so, a positive value

IEEE MultiMedia

will be derived for DðGp ; Gq Þ, and thus a higher weight will be determined for ﬁltering. On the other hand, as illustrated in Figure 5, if q locates at a different depth plane than p, we would obtain a negative value for DðGp ; Gq Þ, which would result in a lower ﬁlter weight. Therefore, our proposed kernel for extended GF lets us better preserve sharp edges while suppressing texture-copying artifacts in the output depth map.

To evaluate the performance of our proposed method, we consider images from both the Middlebury stereo dataset16 and the NYU depth dataset v2.17 While the Middlebury stereo dataset provides standard test images, the NYU depth dataset17 offers real-world color-depth image pairs that are captured by Kinect sensors. The resolutions of test images in the Middlebury and NYU datasets are approximately 300 400 and 480 640 pixels, respectively. We downsample the depth map into lower-resolution ones, and

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Baby1

Dolls

Midd2

Moebius

Reindeer

Living1

Store1

Kitchen1

97.25

23.16

66.01

38.51

28.91

19.65

70.83

38.47

46.27

47.75

27.99

29.76

68.34

19.92

19.05

42.47

16.88

27.26

46.74

35.64

110.20

22.94

17.77

12.83

40.10

23.64

43.11

32.17

107.08

23.07

17.70

12.13

53.05

30.71

54.26

48.40

110.31

30.09

22.25

15.77

44.74

14.59

49.74

20.79

43.41

32.91

23.31

23.66

42.83

26.32

56.71

33.71

110.74

20.32

14.67

12.03

52.52

27.38

53.62

35.34

98.76

28.05

18.63

25.01

36.41

57.87

53.62

30.97

93.32

16.16

15.74

12.97

23.40

14.61

26.98

24.11

65.88

22.50

17.06

11.41

48.66

14.70

42.19

27.72

40.95

38.14

28.46

23.37

17.08

9.33

18.28

14.86

37.07

15.40

13.36

7.22

20.99

17.18

25.31

23.99

51.36

13.28

10.39

7.89

20.34

16.35

24.32

22.58

50.62

12.05

9.37

7.35

33.34

14.02

35.73

21.88

54.25

26.37

15.59

13.76

23.91

13.27

31.08

14.84

29.94

24.09

15.06

11.51

21.23

14.76

27.62

21.92

51.18

11.73

8.03

7.53

27.45

11.52

24.49

17.46

39.58

17.34

11.69

8.67

All

17.06

16.86

21.92

18.81

56.03

6.72

9.02

5.64

13.84

6.39

13.32

9.13

45.83

12.53

8.75

5.45

use different upsampling factors for performing depth map upsampling. We empirically use the same parameter settings, x ¼ 3 3 and 2 ¼ 0:3, for all images in the ﬁlter kernel. For identiﬁcation of depth edges, we set s ¼ 10 and the reference window size as ð2½log2 K þ 1Þ ð2½log2 K þ 1Þ, where K is the upsampling factor. We apply the bad pixel rate (BP%) along with the mean squared error (MSE) as the evaluation metrics for assessing performance. We calculate the BP% by scaling the output into a particular range in terms of depths and determining the percentage of pixels that differ from the ground-truth ones by an error threshold of 1.16 We consider nine state-of-the-art approaches for comprehensive comparison and report the BP% and MSE of different images (Venus, Teddy, Baby1, Dolls, Midd2, Moebius, Reindeer, Living1, Store1, and Kitchen1) in Tables 1 and 2. We use “*” to denote a learning-based method, which generally requires a much higher compu-

tational cost (see Table 3). Tables 1 and 2 show that our proposed method, extended GE, outperforms the other nine state-of-the-art methods in terms of both BP% and MSE. Note that we use bold font to indicate the algorithm achieving the best performance. For visual performance, we show example images with regions of interest in Figures 6 and 7. Figures 6c–j and 7c–j show that texture-copying artifacts were produced in the depth maps that comprise inconsistent color and depth variation. Our result in Figures 6k and 7k was robust to such effects and able to preserve sharp edges, so it was closest to the ground truth. To evaluate the performance on noisy images, we added Gaussian noise (zero mean and variance of 10 and 20) to the depth images of Venus, Baby1, and Kitchen1. The upper part of Tables 4 and 5 show that most approaches, except for two high-computational-cost learning-based ones,6,7 are not robust to noise and suffer degradation in performance. However, if we employ the BM3D

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q

THE WORLD’S NEWSSTAND®

Table 3. Time consumption (seconds) for depth map upsampling with a factor of 8, running on Matlab (CPU: Intel Core i5-2400 3.1 Ghz). Algorithm

Venus

Teddy

Kitchen1

*Diebel and colleagues

189.7

339.0

608.4

Guided Filtering15

4.2

4.5

Joint Bilateral Upsampling13

9.3

9.2

10.4

†

12.6

12.7

13.5

*Lu and colleagues7

112.2

100.9

285.6

*Kim and colleagues5

266.6

524.5

979.9

Jung12

644.5

743.4

1942.5

*Lo and colleagues6

172.5

194.0

639.9

NAFDU3

*Xie and colleagues

88.3

221.3

582.1

Extended GF

14.8

46.7

79.4

A learning-based approach NAFDU: Noise-Aware Filter for Real-Time Depth Upsampling

†

250 200

Venus

150 100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

0 250 200 150

Teddy

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

0 250 200 150

Baby1

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

0 250 200 150

Dolls

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

Figure 6. Example depth map upsampling results of Venus, Teddy, Baby1, and Dolls (from top to bottom) with two regions of interest: (a) a ground-truth depth map and (b) selected color image regions, with the associated depth upsampling results of (c) James Diebel and colleagues,4 (d) Guided Filtering,15 (e) Joint Bilateral Upsampling,13 (f) Jiangbo Lu and colleagues,7 (g) Dae-Young Kim and colleagues,5 (h) Seung-Won Jung,12 (i) Kai-Han Lo and colleagues,6 (j) Jun Xie and colleagues,9 (k) our method, extended GF, and (l) the ground-truth depth map region.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

250 200

Moebius

150 100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

0 250 200 150

Reindeer

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

250 200 150

Store1

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

250 200 150

Kitchen1

100 50

(a)

(b)

(c)

(d)

(e)

(f)

(g)

(h)

(i)

(j)

(k)

(l)

Figure 7. Example depth map upsampling results of Moebius, Reindeer, Store1, and Kitchen1 (from top to bottom) with two regions of interest: (a) a ground-truth depth map and (b) selected color image regions, with the associated depth upsampling results of (c) James Diebel and colleagues,4 (d) Guided Filtering,15 (e) Joint Bilateral Upsampling,13 (f) Jiangbo Lu and colleagues,7 (g) Dae-Young Kim and colleagues,5 (h) Seung-Won Jung,12 (i) Kai-Han Lo and colleagues,6 (j) Jun Xie and colleagues,9 (k) our method, extended GF, and (l) the ground-truth depth map region.

denoising algorithm18 on the noisy images as the preprocessing step, as shown in the lower part of Tables 4 and 5, the performance of the proposed method outperforms all of the other methods we’ve presented, including those of K.-H. Lo and his colleagues6 and J. Lu and his colleagues.7

References 1. G. Nur et al., “Impact of Depth Map Spatial Resolution on 3D Video Quality and Depth Perception,” Proc. IEEE 3DTV Conf.: The True Vision—Capture, Transmission and Display of 3D Video, 2010, pp. 1–4. 2. Q. Yang et al., “Spatial-Depth Super Resolution for

April–June 2016

e presented a novel filtering-based approach for depth map upsampling. Our extended GF successfully alleviates texture copying artifacts. In the future, we plan to design a systematic method to automatically determine the values of the parameters (such as threshold and sliding window size) based on the input color depth image pair. Besides, for depth regions with a very fine structure, the current onion-peel filtering would fail due to depth propagation from inaccurate depth regions (see

Figure 6k), so in the future, we aim to further develop a structure-aware onion-peel ﬁlter that can exploit color edges as guidance to intelligently assign better ﬁltering order. MM

Range Images,” Proc. IEEE Int’l Conf. Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8. 3. D. Chan et al., “A Noise-Aware Filter for Real-Time Depth Upsampling,” Proc. ECCV Workshop Multicamera and Multi-modal Sensor Fusion Algorithms and Applications, 2008, pp. 1–12.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 4. Performance comparisons with Gaussian noise (zero mean and variance of 10) added. The upper and lower parts of the table report the bad pixel rates (%) without and with BM3D denoising preprocessing, respectively. Venus

Baby1

Dolls

Without BM3D preprocessing

Nonocc.

All

Disc.

All

*Diebel and colleagues4

18.42

19.11

24.54

23.60

51.07

Guided Filtering15

25.16

25.73

33.97

37.70

59.27

Joint Bilateral Upsampling13

51.62

51.97

52.43

52.65

66.33

†

51.75

52.24

56.69

54.07

65.92

*Lu and colleagues7

NAFDU

4.24

5.20

14.26

11.25

40.85

*Kim and colleagues5

43.57

43.47

42.76

14.72

57.51

Jung12

46.87

47.30

53.81

49.41

64.49

*Lo and colleagues6

7.22

7.87

19.68

12.06

49.79

*Xie and colleagues9

18.21

20.14

23.31

15.23

47.40

Extended GF

37.40

47.11

40.62

54.35

*Diebel and colleagues4

0.90

1.21

5.05

3.78

13.22

Guided Filtering15

0.67

0.88

9.37

3.00

9.52

Joint Bilateral Upsampling13

0.53

0.92

7.43

3.00

11.51

†

0.62

0.96

8.72

2.86

9.80

*Lu and colleagues7

0.29

0.36

4.00

2.80

13.58

*Kim and colleagues5

0.31

0.44

4.09

2.88

10.20

Jung12

0.37

0.73

5.10

2.85

10.25

*Lo and colleagues6

0.24

0.30

3.28

1.81

9.92

With BM3D preprocessing

NAFDU3

*Xie and colleagues

0.36

0.60

4.63

1.94

9.46

Extended GF

0.20

0.32

2.75

1.74

8.87

A learning-based approach NAFDU: Noise-Aware Filter for Real-Time Depth Upsampling

†

4. J. Diebel and S. Thrun, “An Application of Markov Random Fields to Range Sensing,” Proc. Conf. Neural Information Processing Systems (NIPS), 2005, pp. 291–298. 5. D. Kim and K. Yoon, “High Quality Depth Map UpSampling Robust to Edge Noise of Range Sensors,” Proc. IEEE Int’l Conf. Image Processing (ICIP), 2012, pp. 553–556. 6. K.-H. Lo, K.-L. Hua, and Y.-C.F. Wang, “Depth Map Super-Resolution via Markov Random Fields without Texture-Copying Artifacts,” Proc. IEEE Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP),

IEEE MultiMedia

2013, pp. 1414–1418. 7. J. Lu et al., “A Revisit to MRF-Based Depth Map

Conf. Image Processing (ICIP), 2014, pp. 3773–3777. 10. J. Li et al., “Similarity-Aware Patchwork Assembly for Depth Image Super-Resolution,” Proc. 2014 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3374–3381. 11. C. Tomasi and R. Manduchi, “Bilateral Filtering for Gray and Color Images,” Proc. IEEE Int’l Conf. Computer Vision (ICCV), 1998, pp. 839–846. 12. S.-W. Jung, “Enhancement of Image and Depth Map Using Adaptive Joint Trilateral Filter,” IEEE Trans. Circuits and Systems for Video Technology, vol. 23, no. 2, 2013, pp. 258–269.

Super-Resolution and Enhancement,” Proc. IEEE

13. J. Kopf et al., “Joint Bilateral Upsampling,” ACM

Int’l Conf. Acoustics, Speech, and Signal Processing (ICASSP), 2011, pp. 985–988.

Trans. Graphics, 2007, vol. 26, no. 3, pp. 96:1–96:5. 14. M. Liu, O. Tuzel, and Y. Taguchi, “Joint Geodesic

8. J. Park et al., “High-Quality Depth Map Upsampling and CompletionforRGB-D Cameras,” IEEE Trans. Image Processing, vol. 23, no. 12, 2014, pp. 5559–5572. 9. J. Xie, R.S. Feris, and M.-T. Sun, “Edge Guided Single Depth Image Super Resolution,” Proc. IEEE Int’l

Upsampling of Depth Images,” Proc. 2013 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 169–176. 15. K. He, J. Sun, and X. Tang, “Guided Image Filtering,” IEEE Trans. Pattern Analysis and

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Table 5. Performance comparisons with Gaussian noise (zero mean and variance of 20) added. The upper and lower parts of the table report the bad pixel rate without and with BM3D denoising preprocessing, respectively. Venus

Baby1

Dolls

Without BM3D preprocessing

Nonocc.

All

Disc.

All

*Diebel and colleagues4

24.00

24.57

32.60

32.97

58.77

Guided Filtering15

39.37

39.68

43.98

41.61

64.13

Joint Bilateral Upsampling13

54.81

54.94

54.87

55.59

68.17

†

54.28

54.55

57.35

56.32

67.62

*Lu and colleagues7

NAFDU

46.19

46.60

53.69

44.33

49.64

*Kim and colleagues5

49.75

49.74

49.54

20.86

63.52

Jung12

54.34

54.43

55.93

56.38

67.61

*Lo and colleagues6

56.84

57.13

64.81

53.73

55.33

*Xie and colleagues9

44.31

47.10

51.42

48.25

52.65

Extended GF

57.43

57.38

63.47

50.79

59.62

*Diebel and colleagues4

0.99

1.29

6.24

8.55

18.55

Guided Filtering15

0.96

1.19

13.35

4.10

14.07

Joint Bilateral Upsampling13

0.77

1.22

10.67

3.74

14.93

**NAFDU3

0.83

1.21

11.57

3.54

14.86

*Lu and colleagues7

0.32

0.40

4.39

3.13

13.97

*Kim and colleagues5

0.46

0.60

5.72

3.67

12.25

Jung12

0.58

0.94

7.97

3.20

15.02

*Lo and colleagues6

0.26

0.33

3.35

2.29

15.43

With BM3D preprocessing

*Xie and colleagues9

0.36

0.63

4.82

2.33

12.56

Extended GF

0.28

0.44

3.95

2.02

11.91

A learning-based approach NAFDU: Noise-Aware Filter for Real-Time Depth Upsampling

†

Machine Intelligence, vol. 35, no. 6, 2013, pp.

Computer Engineering at Purdue University. Con-

1397–1409.

tact him at ______________ hua@mail.ntust.edu.tw.

16. D. Scharstein and R. Szeliski, “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms,” Int’l J. Computer Vision, vol. 47, no. 1-3, 2001, pp. 7–42; “Middlebury Stereo Vision Page,” http://vision.middlebury.edu/stereo/. 17. N. Silberman et al., “Indoor Segmentation and Support Inference from RGBD Images,” Proc. 12th European Conf. Computer Vision (ECCV), 2012, pp. 746–760. 18. K. Dabov et al., “Image Denoising by Sparse 3D Transform-Domain Collaborative Filtering,” IEEE Trans. Image Processing, vol. 16, no. 8, 2007, pp. 2080–2095.

Kai-Han Lo is a research assistant in the Research Center for Information Technology Innovation’s Multimedia and Machine Learning Lab at Academia Sinica, Taipei. His current research interests include digital image processing, computer vision, and multimedia. Lo received an MS in computer science and information engineering from National Taiwan University of Science and Technology. Contact him at d10215001@mail.ntust.edu.tw. ___________________ Yu-Chiang Frank Wang is an associate research fel-

Engineering at the National Taiwan University of Science and Technology. His current research interests include digital image and video processing,

low and deputy director of the Research Center for

April–June 2016

Kai-Lung Hua is an associate professor in the Department of Computer Science and Information

Information Technology Innovation at Academia Sinica, Taipei. His research interests include computer vision, machine learning, and image processing. Wang

computer vision, and multimedia networking. Hua

received his PhD in electrical and computer engineering at Carnegie Mellon University. Contact him at

received his PhD from the School of Electrical and

ycwang@citi.sinica.edu.tw. _______________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Feature: Digital Rights Management

Managing Intellectual Property in a Music Fruition Environment Adriano Barate`, Goffredo Haus, Luca A. Ludovico, and Paolo Perlasca University of Milan

An innovative architecture lets users select multiple media streams on the fly in a fully synchronized environment. This work proposes an approach to encode contents and to build advanced multimodal interfaces where intellectual property is protected.

n the music production and distribution field, the availability of music as data files, rather than as physical objects, has created a real revolution. It has significantly reduced distribution costs and increased music availability; it has also improved audio quality and fostered the creation of innovative portable devices. Consequently, even the paradigm of music listening has completely changed. Unfortunately, modern online distribution has introduced some problems as well, which are sometimes exacerbated by copyright owners. For instance, for a decade after the birth of the MPEG Audio Layer-3 (MP3) format, music copyright owners didn’t license their music for electronic distribution, and digital audio collections materialized in the form of illegal peer-topeer services. According to a survey conducted by Ipsos-Reid, during that decade, the availability of “free” online music resulted in more than 60 million Americans over the age of 12 engaging in copyright infringement to get digital

1070-986X/16/$33.00 c 2016 IEEE

files.1 It wasn’t until 2001 that the music industry finally began licensing online content. In 2003, Apple’s iTunes service was released, thus providing a comprehensive source for music download. A full history of unauthorized music distribution is available elsewhere.2 Needless to say, the matter of music copyright in the digital era is complex. Music copyright involves multiple works, multiple rights, and multiple intermediaries.3 For example, a recorded performance of a song embodies two separately copyrighted works—namely, musical composition and sound recording—each enjoying a different and complex set of rights and administered by separate groups of corporate intermediaries. If we also consider other music-related materials (such as lyrics, scores, and photos), other rights come into play, making intellectual property (IP) even harder to manage. The involved actors include music publishers, performers, and major recording companies. Dealing with complex objects composed of different media types—each potentially presenting specific user-tailored grants—can be challenging. Here, we propose an ad-hoc architecture to manage IP in the multimedia field. As a case study for our proposed architecture, we use IEEE 1599, an international XML-based standard that aims to comprehensively describe music content. IEEE 1599 supports the representation of heterogeneous music aspects within a single document, with some descriptions hard coded in XML, while other are descriptions are referenced through XML encoding—that is, the data is external (the XML contains links to identify and retrieve the data from media files). This enables a rich set of available scenarios but is a challenge from the IP viewpoint, as many rights holders and rights types might be involved in a single complex entity. IEEE 1599 also introduces a new type of right to be protected: synchronization rights.

Music File Formats The music field uses many open and proprietary file formats. For example, MakeMusic Finale and Avid Sibelius are two of the market leaders for digital score editing. Both applications support the specification of copyright, but this information is simply plain text superimposed onto printed scores rather than a software license. Open file formats include the Music Encoding Initiative (MEI)4 and the Open Score Format

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

(OSF).5 MEI strives to create a semantically rich model for music notation. It is based on open standards and is platform independent, using XML technologies to develop comprehensive and permanent international archives of notated music as a basis for editions, analysis, performances, and other forms of research. Even if external digital objects (such as inline graphics, illustrations, and ﬁgures) are supported, rights management is presented in the plain-text content of ad hoc XML elements. OSF is a distribution, interchange, and archive ﬁle format for digital scores, with its core represented by MusicXML 2.0. Such an initiative provides a package for combining digital scores with other media assets—such as HTML, video, audio, and MIDI—into a single distribution. A structured metadata format based on the Dublin Core (dublincore.org) __________ is used for describing the content of packages and their relationships with other content, and packages and their contents can be digitally signed. However, as with MEI, OSF allows users to encode copyright information but not to embed a license.

IEEE 1599

April–June 2016

IEEE 1599 is an internationally recognized IEEE standard sponsored by the IEEE Computer Society Standards Activity Board and designed by the Technical Committee on Computer Generated Music (IEEE CS TC on CGM). IEEE 1599 adopts XML to describe a music piece in all its aspects.6 The format’s innovative contribution is providing a comprehensive description of music and music-related materials within a single framework. The symbolic score—represented here as a sequence of music symbols—is only one of the many descriptions that can be provided for a piece. For instance, all the graphical and audio instances (scores and performances) available for a given music composition are further descriptions, as are its text elements (catalogue metadata, lyrics, and so on), still images (such as photos and playbills), and moving images (such as video clips and movies with a soundtrack). Comprehensiveness in music description is realized in IEEE 1599 through a multilayer environment. The XML format provides a set of rules to create strongly structured documents. As we now describe, IEEE 1599 implements this characteristic by arranging music and musicrelated contents within layers.

IEEE 1599 Layers The standard’s data structure consists of the general, logic, structural, notational, performance, and audio layers. The general layer primarily contains catalog metadata about the piece; examples are the work title, its author(s), release date, and music genre. In general terms, this kind of metadata is publicly available and doesn’t need to be protected. Another goal of the general layer is to link those external digital objects that refer to the particular piece but aren’t directly related to its music contents. Examples here include stage pictures, playbills, album covers, and so on. In this case, copyright issues could emerge, because such objects are typically products of the human intellect to be credited and protected. The logic layer represents the original score in terms of symbols. Usually, the score is described by translating typical Common Western Notation information into XML code. After the expiration of authors’ rights, this process can be performed without copyright infringement. Another key role played by the logic layer is marking music events through a unique ID, so that all layers can refer to them without ambiguity. The data structure containing unique music-event IDs is called the spine (see Figure 1). The structural layer makes it possible to identify music objects and make their mutual relationships explicit. A typical application encodes the results from musicological analysis at different degrees of abstraction. For instance, in fugues or other contrapuntal compositions, it’s possible to mark the subject and its occurrences within the work; for pop/rock songs, the speciﬁc verse-chorus structure can emerge; and so on. As with any other intellectual achievement, music analyses about harmony, piece structure, and so on might be subject to IP. The notational layer contains [0…l] representations of the score in a graphical form—namely, digital images. This kind of representation, represented in external digital objects, substantially differs from the symbolic description provided by the logic layer and encoded in XML. In this case, we’re dealing with autograph and printed versions of the score with potential copyright restrictions. As we describe later, with respect to external digital objects, layers contain basic information to locate them and map between spine IDs and the occurrence of the corresponding music event in the media. The performance layer contains [0…m] computer-driven performances of the current music

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

IEEE 1599-2008 Document Structural

General Logic Logically organized symbols (LOS)

Spine

Notational

Performance

Audio

Notational ﬁles

Audio ﬁles

Performance ﬁles

Figure 1. The multilayer structure of an IEEE 1599 document. In the IEEE 1599 format, each layer describes one of the aspects a music piece is made of, and all layers are kept together by a common data structure known as the spine.

IEEE MultiMedia

piece, encoded as external files and potentially linked event-by-event to the spine. Among the supported formats, the most popular are MIDI and MPEG. Finally, the audio layer contains [0…n] audio/ video tracks. Similar to the notational and performance layers, digital contents are external to the XML encoding and might be subject to copyright. Linking external objects offers several advantages. The (typically verbose) XML file doesn’t contain huge multimedia descriptions; in any case, commonly accepted formats can be used to encode multimedia, with no need to translate into XML or convert to another file format. In addition, digital contents can be geographically distributed and physically reside on different servers around the Web. Synchronizing Music Events An added value in IEEE 1599 is its ability to include heterogeneous materials that refer to a given music piece inside a single XML docu-

ment, along with its ability to let spatiotemporal relationships emerge among such materials. Thanks to the spine and its unique marking, music events can be described in different layers (such as a chord’s graphical aspect and audio performance), as well as multiple times within a single layer (such as many different music performances of the same event). Consequently, the IEEE 1599 multilayer environment presents two complementary synchronization modes. Interlayer synchronization takes place among contents described in different layers. By definition, different layers store heterogeneous information to allow the enjoyment of heterogeneous music contents simultaneously in a synchronized way. Applications involving multimedia and multimodal fruition, such as score following, karaoke, didactic products, and multimedia presentations, can be realized through this mode of synchronization. Intralayer synchronization occurs among the contents of a single layer, which are, by definition, homogeneous. This mode lets users jump from one instance to another (of the same type) in real time, without losing synchronization. Coupling these two categories of synchronization makes it possible to design and implement frameworks that allow new kinds of interaction with media contents and novel music-experience models. For further details about the format, please refer to the official IEEE documentation and a recent book covering specific heterogeneous aspects of the standard.7 Preparing suitable materials—in particular, recognizing music events and their synchronization—can be viewed as an additional intellectual achievement requiring adequate protection. In this context, we can define a new professional role, the synchronization producer, who should have a deep knowledge of both the media contents to be synchronized and the format.

Intellectual Property in IEEE 1599 As we’ve noted, an IEEE 1599 ﬁle is an XML document that carries logic and multimedia descriptions referable to a single music piece. Contents can be disposed according to a sixlayer structure, in which—from a theoretical viewpoint—all kinds of information are equally important in the creation of a comprehensive description of the piece. The inter- and intralayer relationship concepts allow the simultaneous presence of

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

heterogeneous and homogeneous descriptions respectively, within a fully synchronized environment. However, if we analyze the standard documentation in detail, asymmetric behaviors emerge. First, some layers contain text-based logic descriptions of the piece encoded in XML format, whereas other layers refer to external binary ﬁles. Examples of the former are the general, logic, and structural layers. On the contrary, notational, performance, and audio layers contain links to external ﬁles, each one encoded in the most suitable format. For them, the XML description of each music event consists of

a reference to the spine that links the current description to a unique event ID; and

a way to locate the occurrence of the event within the current media using the most suitable measurement unit (such as seconds or frames for audio, and pixels or centimeters for images).

Consequently, the way synchronization is represented in IEEE 1599 differs from media type to media type, and it might even change from one media instance to another inside the same layer. Finally, let’s consider some typical scenarios for the user’s fruition models. Because an IEEE 1599 environment is potentially rich in heterogeneous media and descriptions, from the viewpoint of content access, different axes can be involved: the type and number of digital objects;

subparts of digital objects, defined either in terms of absolute measurement units (such as seconds for audio/video) or music entities (such as the first n measures or only the lead vocals);

encoding quality, which depends on file formats, compression algorithms and settings, sampling, and so on;

availability over time, which can be either unlimited or specified in terms of a fixed number of views or expiration date; and

the type of rights involved, such as play/display and synchronization rights.

At this point, an example is called for. Let’s consider an IEEE 1599 document containing n score versions, each composed of mi pages with

User A can play and display all digital materials in high quality and with no time limits, but without synchronization rights—that is, the user can scroll all scores pages and listen to all audio tracks without the score-following function.

User B has both play/display and synchronization rights, but with a limited access to high-quality digital objects, such as only one of the n score versions and three of the k audio/video tracks. Regarding the other materials, access is granted only to lowquality previews (such as a thumbnail of the first score page and the first 30 seconds of a 96 Kbps MP3).

User C is granted a free trial of the premium service for a limited period of time and only on the first 20 measures of the piece. The listed fruition models can be implemented through an ad-hoc IEEE 1599 client– server framework, such as the one we describe later. However, we must ﬁrst ﬁnd a suitable language to express licenses in XML and thereby comply with IEEE 1599.

Rights Expression Languages IP protects the creations of human intellect. The main purpose of IP is to preserve the interests of intellectual creators, giving them a relevant set of rights to protect the exploitation of their creations. “Intellectual property” is a broad locution that embraces both copyright and industrial property. Copyright includes literary and artistic works, such as movies, poems, and music works, whereas industrial property takes into account inventions (patents), trademarks, industrial designs, and geographic indications of source. The copyright was born as a “right to copy” and has its origins in the possibility of reproduction and the need to regulate the exercise of that right. Commercial aspects, as conﬁrmed by historical evidences, become the key element of all subsequent legislation. Johann Gutenberg’s invention of mechanical movable type printing in 1440 can be considered a milestone for the copyright origin. This invention produced several important consequences; among them, the reduction of

April–June 2016

i 僆 [1…n] and k audio/video objects. Through a license framework, different grants can be assigned to users. For instance:

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

book production costs and the increase in the number and dissemination of printed books are the most important. From that moment on, content became accessible to a greater number of people, and therefore it became extremely important to regulate the rights related to the creation, distribution, and use of such content. Similarly, the evolution of information and communication technology, particularly in recent years, has changed the classical way of understanding the activities of content creation, distribution, and consumption. Contents become digital and immaterial and can be stored, transmitted, reproduced, and transferred with virtually no effort and relatively low costs. As a result of this cultural revolution, we now think about (and interact with) virtual objects in a way that substantially differs from material objects. For example, digital music can be provided to consumers via downloading or streaming services, and in the latter case, the concept of possession is replaced with one of service subscription. A further consequence is that unauthorized content copies—virtually identical to the originals—can be easily created and distributed, preventing right holders from collecting their proﬁts. To prevent copyright infringements, digital rights management (DRM) systems were developed. According to the deﬁnition of the National Institute of Science and Technology, a DRM system is “a system of information technology components and services along with corresponding law, policies and business models which strive to distribute and control intellectual property and its rights.” DRM systems are responsible for— and must ensure—two main functions:

management of the rights, assets, parties, and licenses; and

enforcement of the terms and conditions determined by right holders and expressed within licenses.

IEEE MultiMedia

To achieve these goals, different techniques belonging to multiple scientific disciplines are needed. Cryptography, watermarking, and fingerprinting techniques are usually adopted to ensure either secure content identification, packaging, and distribution or copyright-infringement tracking and monitoring. For instance, ondemand music services like Spotify or Google Play Music encrypt their streams and downloaded music to be enjoyed once disconnected from the service.

A key component in a generic DRM system architecture is represented by the Rights Expression Language (REL),8,9 which is used to express the terms and conditions—that is, the licenses— that govern how content is distributed and then enjoyed by users. The basic building blocks of RELs are rights (to copy, move, and so on); conditions (sometimes specialized in permissions, constraints, and requirements); resources (such as contents); and parties (right holders, users, and so on). Through RELs, it’s possible to define complex usage rules. The expressiveness of an REL is directly proportional to both the level of granularity that you can obtain in defining the rules and the number of business models that can be exploited. This aspect is particularly important for handling heterogeneous digital objects that must be managed through articulated usage rules. (We offer use-case scenarios for this later.) Several XML-based RELs have been proposed in the literature, including the creative commons REL (ccREL),10 Open Digital Rights Language (ODRL),11 Open Mobile Alliance (OMA) DRM,12 eXtensible Rights Markup Language (XRML),13 and MPEG-21 REL.14 These proposals differ mainly in the way digital licenses, terms, and conditions are expressed and in the scope and granularity of each aspect of the license specification and management process. The semantics of RELs must be adequately formalized to express unambiguously the meaning of their terms and expressions. Rights data dictionaries are normally used by XML-based RELs for this purpose.15 A correct semantic representation and analysis is a fundamental aspect that can’t be addressed systematically and automatically, because license terms and conditions can have ambiguous meanings. Web ontologies are a useful tool for expressing concepts and concept relationships, capturing the different nuances typical of natural languages that must be formalized to facilitate automation.16 Interoperability and extensibility are relevant DRM aspects. XML-based RELs are

flexible enough to express terms, conditions, and expressions with the required level of granularity;

extensible, thus guaranteeing the possibility of representing new terms, conditions, and expression for future uses; and

interoperable, supporting both the translation of terms, conditions, and expressions

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

among different languages and the achievement of an effective content portability among different devices. We now introduce the key characteristics of two important RELs—MPEG-21 REL and ODRL—both of which are suitable for modeling licenses in the IEEE 1599 framework. MPEG-21 REL The MPEG-21 REL is based on XrML.13 Both languages permit the specification of a license that lets right holders allow users to exercise specific rights on the digital resources to be protected. Part 5 of the MPEG-21 standard specifies the REL’s syntax and semantics, whereas Part 6—the MPEG-21 Rights Data Dictionary (RDD)—defines a set of terms to support the MPEG-21 REL. According to the MPEG REL data model, a license consists of a sequence of grants and an issuer that identifies the party who has issued the license itself. A grant is the basic building block of an MPEG-21 license and structurally consists of

a principal, representing the subject to whom the grant is issued;

a right, denoting which action the principal is authorized to exercise;

a resource to which the right is applied; and

a condition, representing the terms, conditions, and obligations governing the exercise of the right.

To exercise a given action on a given resource, a user must hold a license containing a grant that speciﬁes the right to exercise the required action on the required resource. Extensibility is ensured by extensions that provide a simple mechanism to add new elements to address the requirements of a new application domain. The MPEG-21 REL standard speciﬁcation consists of a set of three XML schemas: The core schema defines the basic elements of the language.

The standard extension adds terms and conditions to restrict the use of the content.

The multimedia extension lets users express specific rights, conditions, and metadata for digital works.

ODRL The ODRL provides the semantics to deﬁne DRM terms, conditions, and expressions to express access control policies for DRM. ODRL is based on an extensible model that includes a number of core entities and their relationships. The three basic core entities are the

assets, representing any content being licensed;

rights, expressing the potential actions that can be performed or prohibited; and

parties, representing both users and right holders. Rights include permissions, which can contain constraints, requirements, and conditions. Permissions represent the actual usages or activities allowed over the assets, whose exercise is limited by constraints and requirements representing conditions to be verified. Finally, conditions are exceptions used to make permissions expire and to require a renegotiation. The representation of offers and agreements is a core aspect of ODRL. Offers can be created according to different business models for assets; they represent the proposals to exercise specific actions over the assets for which the proponent possesses the required IP rights. These proposals must be accepted to become valid. Agreements represent the transformation of an offer into a proper license and implicitly express the acceptance of the license terms and conditions by users. ODRL 1.1 has been published by the W3C and is supported by the OMA. The latest ODRL proposal (v. 2.0) makes the concept of access control policy explicit by specifying a model in which the central entity is the policy. An ODRL policy specifies which actions are admitted on a specific asset. An action, to be executed, must be explicitly permitted through the specification of permissions, which can be constrained and might require prerequisites for the proper exercise of the related actions. Constraint and

April–June 2016

A solution compliant with the MPEG-21 ISO/IEC standard is represented by the Axmedis project,17 which was launched by a consortium of content producers, aggregators, distributors, and information technology companies and research groups with the goal of developing a framework and tools to control and manage complex digital content.

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Prohibition

Type

Obligation Number of

End-user Publisher

Permission Spatial Constrain

Rights holder

Category Fee

Party Temporal

Distributor

Possesso property

Use/create Limit Possess/receive

Content

REL license

Certiﬁcation

Prerequisito right

Over Right Issue

View

Distribution

Render Edit

Print Play

Transformation Loan

Transport Extract

Embed Move

Copy

Figure 2. The Rights Expression Language (REL) license mind map.

duty elements are used for this purpose. Finally, the prohibition elements limit the scope of the actions granted by permissions. Figures 2 and 3 show two mind maps representing the basic concepts of the REL license and content.

A New Architecture for the IEEE 1599 Framework

IEEE MultiMedia

Because IEEE 1599 is an international standard, several Web players have been developed in the context of international collaborations and scientiﬁc projects. For example, Pearson Italy is embedding a player inside its online music education tools to make them interactive through the integration of IEEE 1599 support. Similarly, Bach-Archiv Leipzig published webpages with IEEE 1599 encodings of relevant compositions by J.S. Bach. All of these customized versions have come from a general-purpose music player, developed in the framework of the Enhanced Music Interactive Platform for Internet User project (http://emipiu.di.unimi.it). _________________ Features and technological details about this release are discussed elsewhere.18 As we noted earlier, given the kind of materials involved and the authentication policies

adopted by institutions, the issue of DRM has yet to be addressed. As a result of our recent research, the IEEE 1599 Web framework must be rethought to support DRM and user profiles. Here, we present two different architecture proposals (see Figures 4 and 5) that enable a set of services and technologies to both govern the authorized use of digital resources and manage the synchronization among them according to the IEEE 1599 format. We first introduce the set of services and the right sequence of service requests required to enable a complete IEEE 1599 synchronization experience. The basic idea that both proposals are based on is delivering, in an encrypted format, only those data that the final user is authorized to access. In keeping with our earlier discussion, we first tested a condition in which it’s possible for users to exercise the play/display directly on digital resources. This aspect isn’t trivial, because users could be fully authorized, unauthorized, or partially authorized (to access part of a digital resource, such as the first 30 seconds of audio, or only a low-quality version of it). The mere play/display right doesn’t guarantee the enjoyment of a complete IEEE 1599

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Unprotected content

Fingerprint

Watermark Secure container

Protected content

Repository

Protection

Constraints

Creation

Metadata Management

Admissible actions

Rights

Validation

Economic transactions Content Trading Distribution

Creation

Use

Exercise wrt granted permissions

Tracking

Aggregation

Metadata

Workﬂow

Identiﬁcation

Figure 3. Content mind map. This figure shows, through a mind map, the basic concepts related to content in the context of digital rights management.

The License Server is in charge of managing the licenses, and it responds to the licenseaccess requests posed by the IEEE 1599 server to satisfy the client’s service requests.

The Content Repository Server is in charge of managing the resources, and answers to the access/composition requests posed by the IEEE 1599 server to satisfy the client’s service requests.

The IEEE 1599 Repository Server is in charge of managing the IEEE 1599 data files, and it responds to the access/composition requests posed by the IEEE 1599 server to generate the final IEEE 1599 document according to the client’s service request and the user’s licenses.

The IEEE 1599 Server is in charge of the authentication and service-request evaluation processes; to perform its tasks, it must obtain the necessary resources from the other servers, evaluate them, and generate the required resources to be sent to the Client. The actor who selects the type of materials to send to the Client is the IEEE 1599 Server. The Client must query the IEEE 1599 Server, which retrieves the user licenses from the License Server and the suitable IEEE 1599 data from the IEEE 1599 Repository Server. In our proposed architectures, the IEEE 1599 Repository Server returns a complete IEEE 1599 document in Figure 4; in Figure 5, it returns an IEEE 1599 skeleton to be completed later. After their evaluation,

April–June 2016

synchronization experience. In fact, a key aspect to consider is the ability to exercise a new right—called the synchronization right— that makes synchronization possible among the various digital resources embedded in an IEEE 1599 document. As we mentioned earlier, to enjoy a full IEEE 1599 experience, both play/ display and synchronization rights are required. The two proposals we mentioned differ primarily in the way IEEE 1599 documents are stored on the server and processed for distribution to the client. In the ﬁrst case, a complete IEEE 1599 document is parsed and those subparts without user-access authorization are removed. In the second case, the generation of the IEEE 1599 ﬁnal representation is built incrementally by adding to an IEEE 1599 skeleton only the subparts the user is authorized to access. For the sake of clarity, we won’t present the external tracking and payment services, thus focusing only on generation services compatible with user’s requests and licenses. The architecture proposals in Figures 4 and 5 contain the following components:

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Client

Content Repository Server

License Server

IEEE 1599 Server

IEEE 1599 Repository Server

by adding to the IEEE 1599 skeleton the subparts that govern the synchronization of the data the user is authorized to access.

Finally, an encrypted package is generated and sent to the Client. Note that image content, synchronization data, and metadata can be transferred “as a whole” from the IEEE 1599 Server to the Client, whereas for audio and video contents, encrypted streaming seems to be the best solution.

response service request ID user license

license evaluation

Possible Scenarios IEEE 1599 request IEEE 1599

IEEE 1599 evaluation content request wrt IEEE 1599 and license info

Finally, we introduce scenarios that consider complex and heterogeneous combinations of materials, licenses, and user proﬁles. From the viewpoint of digital content, for a given music piece, an IEEE 1599 document can embed and/ or link to the following:

one logic description of the score (chords, rests, articulation signs, and so on) that no longer imply IP, because all authors (such as composers and librettists) died at least 70 years ago;

catalog entries (such as the title of the piece, the title of the opera, the name of the authors, and so on) that are publicly available and don’t require protection;

the autograph manuscript, owned by an historical archive, is accessible for free at a low quality (because it belongs to national cultural heritage), but it requires a specific license to be reproduced in a professional context;

an early version of the printed score, which is no longer covered by copyright;

a recent critical and philological review of the manuscript, for which the musicologist and publisher’s rights must be protected;

commercially available tracks that can be previewed for 30 seconds at a low quality and purchased in their full-duration, highquality version from a music store; and

old recordings of the piece originally contained within analog media (vinyl records, magnetic audio tapes, and so on), and recently digitized and restored by users “of good will” who released them under the Creative Commons license.

content

new IEEE 1599 with license info

encrypted package encrypted package

Figure 4. An example of request-response exchanges in the proposed IEEE 1599 client-server architecture.

IEEE MultiMedia

in both cases, the IEEE 1599 Server retrieves the content from the Content Repository Server according to the information speciﬁed with respect to the IEEE 1599 data and the user’s licenses obtained in the previous steps. At this point, in the case shown in Figure 5, the IEEE 1599 Server requests to the IEEE 1599 Repository Server the subparts needed to complete a suitable IEEE 1599 representation. Then, the IEEE 1599 Server builds a new IEEE 1599 document containing both synchronization and the license information governing the access to the content parts the user is authorized to access. Depending on the considered case, this new document is built in one of two ways:

by removing from a complete IEEE 1599 document the subparts related to the data that the user can’t access, or

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Although this scenario might seem complex, the heterogeneity of terms and conditions is fully supported by the REL environments cited earlier. Now, let’s focus on users, who obviously represent key actors within the framework. Access to digital contents can significantly differ depending a user’s interests and aims. For example, an occasional listener could be pushed toward an integrated fruition of music contents merely by enjoying a low-quality and time-limited version of nonfree digital objects, such as compressed JPEG files with an invasive graphical watermarking that contain only the initial pages or 30 seconds of music in compressed MP3 format for each available audio track. In such a case, the service is free and the economic gain is zero, but the idea is to solicit the interest of potential future paying users. Another fruition model might be a selective one, in which particular users are interested in full access to specific digital objects, such as a particular score edition or a relevant historical recording. In this case, synchronization among contents could be not strictly required: an orchestra conductor doesn’t need automatic score following; similarly, an expert of music handwriting isn’t interested in the corresponding audio tracks. Finally, a professional user— such as a music student or a media producer— might want full, high-quality access to all contents. Even access to digital contents over time and space should be user-tailored. An institution, such as a conservatory or a university library, could sign an n-years subscription for all of its staff members and students to obtain a no-limit local access to all digital contents. Young people could have a discount to get play/display/synchronize rights over a subset of digital contents, while music students could be encouraged to buy pieces concerning the repertoire they play through ad-hoc cost policies. All such cases can be managed through a suitable use of the framework we have described in this work.

Content Repository Server

IEEE 1599 Repository Server

license evaluation IEEE 1599 structure request IEEE 1599 structure

IEEE 1599 structure evaluation content request wrt IEEE 1599 structure and license info content

IEEE 1599 subpart request IEEE 1599 subpart

new IEEE 1599 with license info

encrypted package encrypted package

Figure 5. An alternative IEEE 1599 client-server architecture. The bold text represents differences between this architecture and that presented in Figure 4.

related to multimedia as well, such as multiplelanguages movie subtitling and song lyrics synchronization. MM

April–June 2016

EEE 1599 has been an international standard since 2008, yet this article marks the ﬁrst discussion of the DRM issue. Currently, none of the available Web-based frameworks supports license information with this granularity, thus the client-server interactions we discuss here are novel in several ways. In our opinion, synchronization rights will become increasingly important in other ﬁelds

License Server

IEEE 1599 Server

Client

References 1. M.A. Super, Why You Might Steal from Jay-Z: An Examination of File-Sharing Using Techniques of Neutralization Theory, ProQuest, 2008. 2. M. Fagin, F. Pasquale, and K. Weatherall, “Beyond Napster: Using Antitrust Law to Advance and

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Enhance Online Music Distribution,” Boston Univ. J. Science and Technology Law, vol. 8, no. 2, 2002, pp. 451–573. € 3. W.J. Cardi, “Uber-middleman: Reshaping the Broken Landscape of Music Copyright,” Iowa Law Review, vol. 92, 2006, pp. 835–890. 4. P. Roland, “The Music Encoding Initiative (MEI),” Proc. 1st Int’l Conf. Musical Applications Using XML, 2002, pp. 55–59. 5. R. Kainhofer, “An Extensive MusicXML 2.0 Test

18. S. Baldan et al., “Music-Related Media Contents Synchronized over the Web: The IEEE 1599 Initiative,” Proc. 1st Web Audio Conference (WAC), 2015; http://wac.ircam.fr/pdf/demo/wac15 submission _____________________________ 2.pdf. ____ Adriano Barate` is a researcher for the IEEE Standards Association PAR1599 project, focusing on structural aspects of the standard. His research interests include music description through XML, music petri

Suite,” Proc. 7th Int’l Symp. Computer Music Modelling and Retrieval (CMMR), 2010; http://reinhold.

nets, and music programming. Barate` received a PhD in computer science from the University of Milan.

kainhofer.com/Papers/Kainho-

Contact him at barate@di.unimi.it. ___________

fer MusicXML Testsuite CMMR2010.pdf. ________________________ 6. D.L. Baggi and G.M. Haus, “IEEE 1599: Music Encoding and Interaction,” Computer, vol. 42, no. 3, 2009, pp. 84–87. 7. D.L. Baggi and G.M. Haus, Music Navigation with Symbols and Layers: Toward Content Browsing with IEEE 1599 XML Encoding, John Wiley & Sons, 2013. 8. S. Guth, “Rights Expression Languages,” Digital Rights Management, LNCS 2770, Springer Verlag, 2003, pp. 101–112. 9. E. Rodriguez, S. Llorente, and J. Delgado, “Use of Rights Expression Languages for Protecting Multimedia Information,” Proc. 4th Int’l Conf. Web Delivering of Music, 2004, pp. 70–77. 10. H. Abelson et al., “ccREL: The Creative Commons Rights Expression Language,” tech. rep., 2008, https://wiki.creativecommons.org/images/d/d6/ Ccrel-1.0.pdf. ________ 11. R. Iannella, “Open Digital Rights Management,” World Wide Web Consortium (W3C) DRM Workshop, 2001; www.w3.org/2000/12/drm-ws/pp/iprsys________________________ tems-iannella.html. ___________

Goffredo Haus is a professor of computer science at the University of Milan, where he is director of the Computer Science Department and founded the Musical Informatics bachelor’s and master degrees. His research interests include multimedia and human–computer interaction for music and cultural heritage. He was Official Reporter of the IEEE 1599 standard and chair of the IEEE Technical Committee on Computer Generated Music. Haus has a master’s degree in physics from the University of Milan. Contact him at ___________ haus@di.unimi.it. Luca A. Ludovico is an assistant professor in the Computer Science Department at the University of Milan. His research interests include formalization and encoding of symbolic music, multimedia, and cultural heritage. Ludovico has a PhD in computer engineering from Politecnico di Milano. He was a member of the IEEE Technical Committee on Computer Generated Music and is currently part of the W3C Community on Music Notation. Contact him at ludovico@di.unimi.it. _____________

12. W. Buhse and J. van der Meer, “The Open Mobile Alliance Digital Rights Management,” Signal Processing Magazine, vol. 24, no. 1, 2007, pp. 140–143. 13. X. Wang et al., “XrML—eXtensible Rights Markup Language,” Proc. 2002 ACM Workshop on XML Security, 2002, pp. 71–79. 14. I.S. Burnett et al., The MPEG-21 Book, Wiley Online Library, 2006. 15. X. Wang et al., “The MPEG-21 Rights Expression Language and Rights Data Dictionary,” IEEE Trans.

Paolo Perlasca is an assistant professor in the Computer Science Department at the University of Milan. His research interests include data security, access control models and security policies, and intellectual property management and protection, with a focus on digital rights management and license specification and management. Perlasca has a PhD in computer science from the University of Milan. Contact him at perlasca@di.unimi.it. _____________

Multimedia, vol. 7, no. 3, 2005, pp. 408–417. 16. R. Garcia et al., “Formalising ODRL Semantics

IEEE MultiMedia

Using Web Ontologies,” Proc. 2nd Int’l ODRL Workshop, 2005, pp. 1–10. 17. P. Bellini et al., “AXMEDIS Tool Core for MPEG-21 Authoring/Playing,” Proc. 1st Int’l Conf. Automated Production of Cross Media Content for Multi-Channel Distribution (AXMEDIS2005), 2005; doi:10.1109/ AXMEDIS.2005.10.

_______________ ________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Scientific Conferences

Susanne Boll University of Oldenburg, Germany

The BAMMF Series in Silicon Valley

ilicon Valley is home to many of the world’s largest technology corporations, as well as thousands of small startups. Despite the development of other high-tech economic centers throughout the US and around the world, Silicon Valley continues to be a leading hub for high-tech innovation and development, in part because most of its companies and universities are within 20 miles of each other. Given the high concentration of multimedia researchers in Silicon Valley, and the high demand for information exchange, I was able to work with a team of researchers from various companies and organizations to start the Bay Area Multimedia Forum (BAMMF) series back in November 2013.

Uniting the Community Fifteen years ago, I came to Silicon Valley to join the Fuji-Xerox Palo Alto Laboratory (FXPAL), located in the Stanford Research Park. Before starting BAMMF, I enjoyed focusing on my own research in the lab. Then, in September 2013, FXPAL’s chief technology ofﬁcer, Lynn Wilcox, told me to focus less on my own research and start spending more time helping other researchers in FXPAL and the community. Given the many research directions of those at FXPAL and, more generally, in the multimedia community, I had a hard time knowing where to start. Around the same time, Henry Tang of Apple, Tong Zhang of Intel, Shanchan Wu and Jian Fan of HP, and Bee Liew of FXPAL asked me if I wanted to organize a conference. I thought it was a good idea, because it would encourage researchers in the community to share their talent and expertise—and trust me, there are many multimedia researchers in the San Francisco Bay Area. Yet at the time, many of these researchers had joined startups or product teams at large companies, so they didn’t have time to travel to academic conferences. However, they wanted to meet with peers to exchange ideas, and they wanted to attend fre-

1070-986X/16/$33.00/ c 2016 IEEE

quent but short local meetings for technology updates. In addition, university professors wanted a forum for exposing their ideas to industrial researchers. They also wanted to learn about real problems in industry that needed solving to guide their future research. To ﬁt these professors’ tight schedules, again, short but frequent meetings were preferable. I started thinking about how I missed the old PARC Forum (Palo Alto Research Center) series, and how I didn’t like traveling to a different country to hear local researchers’ talks. So, I worked with Tang, Zhang, Wu, Fan, and Liew to develop a forum series. We discussed the idea with the ACM Special Interest Group on Multimedia (SIGMM), IEEE Technical Committee on Multimedia Computing (TCMC), IEEE Technical Committee on Semantic Computing (TCSEM), and IEEE International Conference on Multimedia and Expo (ICME) steering committee. Leaders of these societies liked the idea and wanted to support the series. Encouraged by the academic and industrial support, Tang, Wu, Fan, and I worked with Tong Zhang of Intel and Bee Liew of FXPAL to start the BAMMF series in collaboration with ACM SIGMM, IEEE TCMC, IEEE TCSEM, and FXPAL.

Qiong Liu FX Palo Alto Laboratory

BAMMF 2013 For the ﬁrst BAMMF, we borrowed the FXPAL conference room and invited four speakers: Bernd Girod of Stanford University, Junfeng He of Facebook, Haohong Wang of TCL Research America, and Scott Carter of FXPAL. Tang, Zhang, Wu, and Fan helped me with food preparation, signs, table arrangements, and more. Liew created the website and taught us many social network tricks for publicizing the event, and my colleague Tony Dunnigan designed a BAMMF logo for us. The weekend before this ﬁrst BAMMF, my colleagues John Boreczky and John Doherty worked long hours to prepare the FXPAL conference

Published by the IEEE Computer Society

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

Scientific Conferences

Figure 1. Professor Bernd Girod from Stanford University, giving a talk during the first Bay Area Multimedia Forum Series (BAMMF) in 2013.

Figure 2. Yangqing Jia from Google giving a talk on the Caffe deep learning framework during the 5th BAMMF.

Figure 3. Dinner with some speakers after the 7th BAMMF. From left to right: Yan-Ying Chen, Bo Begole, Zhengyou Zhang, Susie Wee, Mia, John Apostolopoulos, Qiong Liu (myself), and Oliver Brdiczka.

room. Doherty also volunteered to record the video series for the BAMMF series. When we tried to figure out gifts for those first BAMMF keynote speakers, FXPAL chairman Larry Rowe donated two bottles of Greyscale wine from his winery. With help from colleagues and peers in the community, the first BAMMF started with nearly 60 registered attendees from many different companies and universities (see Figure 1).

offered us its auditorium, free of charge. The second and third BAMMF events took place at the PARC George E. Pake Auditorium. To attract more high-quality speakers, attendees, and supporting funds, we started the San Francisco Bay Area SIGMM Chapter, which I currently chair. From November 2013 to March 2016, we hosted nine BAMMF events, with 45 speakers and over 1,300 registered attendees. All 45 talks have been recorded (by Doherty and staff at the Stanford Center for Professional Development) and posted on the BAMMF website. Talks in our BAMMF events have covered a wide range of topics, from augmented reality to user experience analytics; from social media to Google StreetView; from immersive conferencing to IoT; and from data mining to deep learning (see Figure 2), the Deep Speech system, and intelligent agents. To facilitate discussions among speakers, we organize dinners after the BAMMF events. Because we have a rough theme for each event, and speakers for each event work on related research topics, many speakers enjoy the afterevent dinner (see Figure 3). Also, to make our events more interesting, we started periodically changing the BAMMF venue. In addition to the FXPAL conference room and PARC auditorium, we’ve used the Stanford Gates Building, HP Executive Brieﬁng Center, Prysm Theater, and Huawei Silicon Valley Auditorium. Furthermore, to ensure the content isn’t limited by the forum organizers, we invited six guest organizers to help us invite speakers, prepare questions and panels, and so on.

ith great connections to nearby companies, universities, and ACM/IEEE societies, the BAMMF community plans to host the 25th ACM Multimedia conference in Silicon Valley (see www.acmmm.org/2017). For this ________________ quarter-century celebration, we invited the ﬁrst ACM Multimedia General Chair J.J. GarciaLuna-Aceves from PARC to be our honorary chair. This upcoming international conference will be an even larger platform to connect researchers in academia and industry. If people in the multimedia community would like to give talks at our forums or help us organize more events, please contact us by visiting www.bammf.org. MM

The Series Grows Starting from the second BAMMF, the number of registrations grew beyond our conference room capacity. So, we contacted PARC, which

Qiong Liu is a Principal Scientist at FX (Fuji-Xerox) Palo Alto Laboratory. Contact him at liu@fxpal.com. _________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

__________________________________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®

___________________ _________________________

M q M q

M q

MqM q THE WORLD’S NEWSSTAND®