CUbRIK Pipelines for Relevance Feedback


R1 PIPELINES FOR RELEVANCE FEEDBACK Human-enhanced time-aware multimedia search

CUbRIK Project IST-287704 Deliverable D7.1 WP7

Deliverable Version 1.0 - February 2013 Document ref.: cubrik.D71.UNITN.WP7.V1.0


Programme Name:
IST Project Number: 287704
Project Title: CUbRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INN, HOM, CVCE, EIPCM
Document Number: cubrik.D71.UNITN.WP7.V1.0
Work-Package: WP7
Deliverable Type: Document
Contractual Date of Delivery: 28 February 2013
Actual Date of Delivery: 28 February 2013
Title of Document: R1 Pipelines for Relevance Feedback
Author(s): Anastasios Drosou (CERTH/ITI), Ilias Kalamaras (CERTH/ITI), Dimitrios Tzovaras (CERTH/ITI), Otto Chrons (MICT), Markus Brenner (QMUL), Martha Larson (TUD), Babak Loni (TUD), Mark Melenhorst (TUD), Raynor Vliegendhart (TUD), Uladzimir Kharkevich (UNITN), Maria Menendez (UNITN), Anca-Livia Radu (UNITN)
Approval of this report: Executive Board
Summary of this report: Description of different pipelines and components for feedback acquisition and processing
History:
Keyword List: Pipeline, relevance feedback, implicit feedback, explicit feedback, user's feedback, crowdsourcing, H-Demo, V-App, accessibility
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.



Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.



Table of Contents

EXECUTIVE SUMMARY
1. INTRODUCTION
2. MEDIA ENTITY ANNOTATION H-DEMO
   2.1 MEDIA HARVESTING PIPELINE DESCRIPTION
      2.1.1 Context
      2.1.2 Dataset
      2.1.3 Use case: Media harvesting for entities
   2.2 CROWDSOURCING TASK IMPLEMENTATION
      2.2.1 Crowdsourcing task publication
      2.2.2 Crowdsourcing results gathering
   2.3 COMPUTER-HUMAN HYBRID APPROACH EXPERIMENTS
      2.3.1 Pilot study
      2.3.2 Main study
3. CROSS-DOMAIN RECOGNITION IN CONSUMER PHOTO COLLECTIONS H-DEMO
   3.1 RELEVANCE FEEDBACK H-DEMO DESCRIPTION
      3.1.1 Context
      3.1.2 Dataset
      3.1.3 Computation techniques
   3.2 RELEVANCE FEEDBACK H-DEMO IMPLEMENTATION
4. LIKELINES H-DEMO
   4.1 RELEVANCE FEEDBACK PIPELINE DESCRIPTION
      4.1.1 Context
      4.1.2 Dataset
      4.1.3 Implicit playback behaviour
   4.2 RELEVANCE FEEDBACK PIPELINE IMPLEMENTATION
5. ACCESSIBILITY AWARE RELEVANCE FEEDBACK H-DEMO
   5.1 ACCESSIBILITY AWARE RELEVANCE FEEDBACK PIPELINE DESCRIPTION
      5.1.1 Context
      5.1.2 Dataset
      5.1.3 Accessibility related features
      5.1.4 Accessibility relevance feedback module overview
      5.1.5 Existing approaches for data visualization
      5.1.6 Pipeline for relevance feedback
   5.2 ACCESSIBILITY AWARE RELEVANCE FEEDBACK PIPELINE IMPLEMENTATION
      5.2.1 Profile Registration
      5.2.2 Visual Feature Extraction
      5.2.3 Accessibility rank estimation based on user profile
      5.2.4 Relevance feedback
      5.2.5 Profile update - fine-tuning
   5.3 ACCESSIBILITY AWARE RELEVANCE FEEDBACK USER INTERFACE PIPELINE
6. SME INNOVATION V-APP: FASHION TREND
   6.1 FASHION-FOCUSED DATASET AND GROUND TRUTH COLLECTION
      6.1.1 Context
      6.1.2 Dataset
      6.1.3 Ground truth generation
7. CONCLUSION
8. REFERENCES
9. APPENDIX 1
10. APPENDIX 2

Figures

Figure 1. Overview of the Media Entity Annotation H-Demo
Figure 2. Sequence diagram for the Harvesting Media pipeline
Figure 3. SMILA pipeline for automatic components in the Media Harvesting pipeline
Figure 4. Example of initial set of retrieved images
Figure 5. Representative and diverse pictures retrieved by media analysis
Figure 6. Representative and diverse pictures refined by the crowd
Figure 7: A web-based interface enables users to retrieve and browse results
Figure 8: Overview of the people recognition framework
Figure 9: Exemplary photos of the Gallagher Dataset
Figure 10: Traditional nearest-neighbour matching compared to our graph-based approach
Figure 11: Workflow on how relevance feedback is incorporated
Figure 12: Sequence diagram showing relevance feedback actions
Figure 13: Overview of the LikeLines H-Demo
Figure 14: Interaction between the LikeLines player and server components
Figure 15: Possible vision based impairments
Figure 16: Block diagram of the pipelines/pipelets used/developed within T7.4
Figure 17: The ImpairmentProfile attribute in the User Taxonomy Model
Figure 18: The AccessibilityAnnotation sub-class in the Content Description Model
Figure 19: Representation of the retrieved results and the user profile
Figure 20: Evaluation of accessibility ranking scores for the results
Figure 21: Representation of the user selection in the impairment space
Figure 22: Representation of the user disability profile update in the impairment space
Figure 23: Set of effective solutions to the multimodal placement problem
Figure 24: Estimator of the contrast quality factor of an image
Figure 25: Histogram of monochromatic image
Figure 26: Cumulative histogram of a monochromatic image
Figure 27: Cumulative histogram of a monochromatic image



Executive Summary

This deliverable contains a detailed description of the different pipelines and components for feedback acquisition and processing included in CUbRIK. In particular, most of the pipelines describe users' feedback acquisition and processing in the context of different H-Demos, as described in deliverable D2.2. The deliverable also includes the results of ground truth generation using crowdsourcing for one of the V-Apps datasets.



1. Introduction

CUbRIK aims to improve the precision and relevance of multimedia search by combining computer-based processing power with human and social intelligence. The development of pipelines for feedback acquisition and processing enables this approach by creating components that collect feedback from people and connecting them with computer-based components. Furthermore, the pipelines for feedback acquisition and processing allow investigating the implications of introducing people's feedback into the search process. In order to explore diverse approaches for feedback acquisition, this document does not provide a single general-purpose reusable pipeline; instead, it contains several pipelines to be reused in different CUbRIK contexts. The users' feedback used in the pipelines presented in this document can be classified into the following categories, depending on what it is used for:

• Relevance feedback for creating additional training data. Results are used for retrofitting the search algorithm and thus improving future operations; without this feedback the algorithm does not learn from past results (e.g., the Cross-domain recognition in consumer photo collections H-Demo).

• Crowd feedback for enhancing the results, which is not used for retrofitting the search algorithm. In this case, feedback from the crowd is used for progressively enriching or refining the harvested media or metadata and therefore improving media finding for the end user (e.g., the Media Entity Annotation H-Demo).

However, introducing people's feedback in the search process does not ensure complete accuracy of results. People can make mistakes or might try to fool the system in order to achieve a higher goal, as in the case of paid crowdsourcing [2][3]. Furthermore, crowdsourcing tasks might request feedback on subjective variables, subject to individual variations depending on people's individual, cultural, or social characteristics. These facts motivate the need to investigate crowdsourcing-related issues. The following sections describe four H-Demos, aimed at experimenting with advanced capabilities in the context of feedback acquisition and processing, and implementations/user studies, aimed at ultimately improving the quality of the V-Apps. The following sections are organized and summarized as follows:

• Section 2 describes the work on the Media Entity Annotation H-Demo. This H-Demo is aimed at harvesting representative images of named entities. The section describes the design of a hybrid computer-human approach in the media-harvesting pipeline and elaborates on the results of several user studies using this approach. These user studies provide information on underexplored issues of crowdsourcing, such as the reliability of crowdsourced results. In general, given a set of named entities referring to any context (e.g., History of Europe), this H-Demo enables the enhancement of textual metadata with semantically related images.

• Section 3 relates to the Cross-domain recognition in consumer photo collections H-Demo, which is aimed at identifying people in a photo collection. This H-Demo differs from traditional people recognition approaches because it integrates contextual information such as time, location, and social semantics. The graph-based recognition component developed within this H-Demo can be especially interesting in contexts where physical co-occurrence can be used as a source of information (e.g., the History of Europe App).

• Section 4 presents the LikeLines H-Demo, which aims at identifying the most interesting/relevant fragments in a video and visualizing them for further consumption. This is an innovative concept that could be used in contexts which benefit from spontaneous and implicit crowd-based feedback, as in the case of trend analysis of people's preferences in the SME Fashion App.

• Section 5 contains the description of the Accessibility aware relevance feedback H-Demo, which aims at enhancing the accessibility of media content. The work reported in this section elaborates on how to estimate the level of accessibility and usability for a specific group of users (i.e. a confidence factor) and accordingly adjust the displayed media content. Universal accessibility becomes especially relevant in applications that target a large and unknown user population, as do most crowdsourcing tasks and the SME Fashion App.

• Section 6 presents the collection of a Fashion-focused Creative Commons Dataset and the annotation of the collected media using feedback from the crowd. This dataset has two purposes: first, it served as a concrete means to investigate the kinds of feedback that can be collected from the worker population of a commercial crowdsourcing platform; second, it served as an initial Creative Commons licensed dataset useful for testing components potentially relevant for the SME Fashion App. The dataset creation was only required in the context of the SME Fashion App, as the History of Europe V-App uses an existing specialized dataset.

Finally, Section 7 provides the conclusion and future work in the context of the V-Apps.



2. Media Entity Annotation H-Demo

2.1 Media harvesting pipeline description

2.1.1 Context

CUbRIK's entity repository contains real-world named entities and facts about these entities. The Media harvesting pipeline presented in this deliverable is part of the Media Entity Annotation H-Demo and is used to harvest representative images for the named entities stored in the repository. In this pipeline, images are automatically harvested using social media search engines and refined using a computer-human hybrid approach. Social search engines are a great media source; however, the set of harvested images can be inaccurate or ambiguous. For example, querying "Le due torri" (The Two Towers) in a search engine retrieves pictures of the Italian monument "Le due Torri" in Bologna as well as images of the "Le due torri" movie (i.e. "The Lord of the Rings: The Two Towers"). In order to discard inaccurate images, the pipeline presented in this deliverable includes a computer-human hybrid approach for media refinement. This approach consists of an automatic media analysis task, which retrieves a set of potentially representative and diverse pictures for each monument [1], and a crowd task, which refines the set of automatically retrieved images. After the automatic media analysis and the crowd task, entities and related pictures are automatically imported into CUbRIK's entity repository. This deliverable presents a case study of the Media harvesting pipeline as part of the Media Entity Annotation H-Demo (Figure 1). The case study uses a dataset of Italian monuments created within CUbRIK for testing purposes (described in the Dataset section). However, the design of this pipeline can be applied to any other CUbRIK context in which media related to entities need to be harvested. For example, it can be used in the History of Europe (HoE) V-App. The HoE V-App uses content of an existing specialized digital library that focuses on the historical development of Europe (for further details on the HoE V-App refer to D2.2). The Media Entity Annotation H-Demo can enable the population of the entity repository with entities related to people, locations, and events appearing in the HoE digital library and, in particular, the media harvesting pipeline can contribute to the enrichment/refinement of the related media.

Figure 1. Overview of the Media Entity Annotation H-Demo



2.1.2 Dataset

In order to investigate the feasibility and related issues for developing the pipeline, we defined a case study on Italian monuments. The "Italian Monuments" dataset consists of various metadata for 105 Italian monuments. The metadata for each of the 105 monuments contain the following information:

• The name of the monument, in English and Italian.
• The type of the monument (e.g. "church" or "palace").
• The location of the monument, as text (e.g. "Bologna") and as geo-coordinates.
• A link to the Wikipedia page of the monument.

As part of the execution process of the Media harvesting pipeline, pictures were gathered using popular image search engines (i.e. Picasa, Flickr and Panoramio). For crawling, the monument's name and geo-coordinates were used. In total, a set of images containing more than 25,000 pictures related to the 105 Italian monuments was collected. For each image, the following information is available:

• The name of the image file.
• The URL of the image.
• Text comments about the image.
• Text tags describing the image.
• The person who uploaded the image.
• The time the image was captured.

Media harvesting was done as part of WP4. Further details on crawling components can be found in Section 2.3 of the deliverable D4.1.

2.1.3 Use case: Media harvesting for entities

The sequence of tasks in the pipeline is:

• The administrator provides a set of entity names and a template for the crowdsourcing task is created
• The list of entities with associated metadata is extracted from the entity repository
• Media content and metadata are automatically harvested from popular social media sharing web sites, e.g. Panoramio, Picasa, and Flickr
• For every entity, a small number (e.g., 10) of representative images that depict diverse aspects of the entity are automatically selected from the harvested images. The set of pictures is ranked by estimated level of representativeness
• A crowdsourcing task is created and instantiated using entity metadata and related pictures. The goal of the crowdsourcing task is to refine the related pictures
• Crowdsourcing results are retrieved and aggregated. Aggregation is done, e.g., using majority voting
• The resulting pictures with metadata are imported into the entity repository



Figure 2. Sequence diagram for the Harvesting Media pipeline

The automatic harvesting from popular social media sharing web sites, automatic selection of representative and diverse pictures, and extraction and import in the entity repository have been developed within WP4. WP7 deals with people’s feedback for refining the automatic selection of pictures.

2.2 Crowdsourcing task implementation

The first version of the Media harvesting H-Demo was delivered as part of D4.1. In D4.1, a set of SMILA pipelets and a SMILA pipeline were used for triggering the media harvesting in order to populate the entity repository with multimedia content and update the stored entities with new content.

Figure 3. SMILA pipeline for automatic components in the Media Harvesting pipeline



As part of D7.1, a hybrid computer-human step has been introduced between the media crawler and the entity update. This computer-human hybrid step consists of an automatic media analysis task, which retrieves a set of potentially representative and diverse pictures for each monument [1] and has been developed as part of WP4, and a crowdsourcing task, which refines the set of automatically retrieved images using people's feedback. The crowdsourcing task was implemented using Crowdflower (crowdflower.com). Crowdflower is a platform where requesters can create and publish jobs to more than 30 crowdsourcing channels using a public API. The crowdsourcing jobs are developed using the Crowdflower Markup Language (CML), which allows embedding custom code (e.g., HTML, CSS, JavaScript) to handle special cases. Crowdsourcing jobs are instantiated using a CSV file; instances of jobs are called units. The API also allows setting publishing variables such as the locations from which the job can be accessed, the number of units per page, the number of judgments per unit, and the price per task. Results are stored in a CSV file which contains, in addition to the answers to the proposed question, information on the worker's location, the worker's ID, the crowdsourcing channel, and the task's acceptance and submission times. Low-quality answers are an important issue in crowdsourcing. A substantial part of the existing research on crowdsourcing deals with the development of strategies for quality control and identification of dysfunctional workers [2][3]. Crowdflower recommends that requesters create gold units for quality control. Gold units are unambiguous questions for which requesters provide an answer. They can be created and edited using the public API. If workers fail to answer at least 4 gold units correctly or their accuracy falls below 70%, their results are flagged as untrusted and not included in the final set of results. If too many workers fail at answering the proposed gold units, Crowdflower automatically pauses the job and sends a notification. The implementation code for the Media Harvesting pipeline can be found at https://89.97.237.243:443/svn/CUBRIK/Demos/MediaEntityAnnotation_UNITN_CERTH_LUH

2.2.1 Crowdsourcing task publication

The steps followed for the creation and publication of the crowdsourcing task are as follows:

• A set of entity metadata and related pictures are provided
• Metadata and links to related pictures are stored in a data file
• The data file is uploaded to Crowdflower. If gold units are used in the crowdsourcing task, they are included in the data file
• The template for the crowdsourcing task provided at the beginning of the pipeline is automatically instantiated using the data file
• The publishing variables are defined and the job is published. Variables indicate job settings such as the number of judgments per unit, the number of units per page, and the payment per unit.
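The publication steps above could be scripted against the CrowdFlower REST API. The Python sketch below is only an illustration: the base URL, endpoint paths and `job[...]` field names are assumptions based on the public CrowdFlower documentation of the time, not the actual CUbRIK implementation.

```python
import requests

API_KEY = "YOUR_CROWDFLOWER_API_KEY"       # hypothetical credential
BASE = "https://api.crowdflower.com/v1"    # assumed base URL of the public API

def create_job(title, cml_markup):
    """Create an empty job with a title and the CML form describing the task."""
    resp = requests.post(f"{BASE}/jobs.json",
                         params={"key": API_KEY},
                         data={"job[title]": title, "job[cml]": cml_markup})
    resp.raise_for_status()
    return resp.json()["id"]

def upload_units(job_id, csv_path):
    """Upload the CSV data file whose rows instantiate the task template (units)."""
    with open(csv_path, "rb") as f:
        resp = requests.put(f"{BASE}/jobs/{job_id}/upload.json",
                            params={"key": API_KEY},
                            headers={"Content-Type": "text/csv"},
                            data=f.read())
    resp.raise_for_status()

def set_publishing_options(job_id, judgments_per_unit=5, units_per_page=4, cents_per_page=10):
    """Set publishing variables such as judgments per unit, units per page and payment.
    Field names are assumed from the public API documentation."""
    resp = requests.put(f"{BASE}/jobs/{job_id}.json",
                        params={"key": API_KEY},
                        data={"job[judgments_per_unit]": judgments_per_unit,
                              "job[units_per_assignment]": units_per_page,
                              "job[payment_cents]": cents_per_page})
    resp.raise_for_status()
```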

2.2.2 Crowdsourcing results gathering

The steps followed for gathering the results of the crowdsourcing task are as follows:

• Crowdflower notifies that the job has been finished
• Crowdsourcing results are downloaded from Crowdflower's server
• Judgments are analysed and aggregated. Judgments' aggregation selects a unique judgment for the final set of results. Judgments can be aggregated using different mechanisms, such as majority voting
• Final results are stored in a data file containing metadata and links to the selected pictures.
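As an illustration of the aggregation step, the following minimal Python sketch performs majority voting over the per-unit judgments of a downloaded results file. The column names `_unit_id`, `_trusted` and `answer` are assumptions about the CSV layout, not the exact CUbRIK field names.

```python
import csv
from collections import Counter, defaultdict

def majority_vote(results_csv):
    """Aggregate per-unit judgments by majority voting, ignoring untrusted judgments."""
    votes = defaultdict(Counter)
    with open(results_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("_trusted", "true").lower() == "false":
                continue  # skip judgments flagged as untrusted by the gold-unit check
            votes[row["_unit_id"]][row["answer"]] += 1
    # pick the most frequent answer per unit (ties resolved arbitrarily here)
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}

# Example usage: aggregated = majority_vote("crowdflower_results.csv")
```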



2.3 Computer-human hybrid approach experiments

A crowd-sourcing study was performed in order to investigate the feasibility of the Media harvesting pipeline and, in particular, of the hybrid approach for media refinement. First, an exploratory pilot study investigated issues related to task design and kind of pictures retrieved by the media analysis component. The pilot study was performed using a small crowd consisting of a group of researchers working at a local university. The results of the pilot study informed the design of a main crowdsourcing study which involved Crowdflower workers. The main study explored the technical feasibility of the task using Crowdflower and investigated the reliability of crowdsourcing workers.

2.3.1 Pilot study

Method

The study required people to assess the level of representativeness and diversity of a set of images related to Italian monuments. The study consisted of two tasks. The first task required people to annotate pictures as representative or non-representative of a given monument. People were asked to annotate as representative those pictures which "show, partially or entirely, the outside part of the monument. Pictures containing people are accepted if the outside of the monument, or part of the monument, is clearly depicted". For reference, an example image of the monument was provided. The example image was selected by the authors of the study and represented a prototypical image of the monument. The second task required people to group the pictures previously annotated as representative into clusters of similar images. Similar pictures belonging to the same cluster were described as "depicting the monument from a similar perspective and sharing the same light condition (e.g., day or night light)". All participants were requested to annotate 701 pictures related to the 105 Italian monuments for representativeness. Annotations from all participants were aggregated using majority voting. For each monument, researchers annotated the selected set of representative pictures for diversity. Reliability of the results was calculated using Kappa statistics. Kappa statistics measure the level of agreement among annotators discarding the agreement expected by chance. Values of Kappa vary from 1 to -1: values from 0 to 1 indicate agreement above chance, a Kappa value of 0 indicates agreement equal to chance, and values from 0 to -1 indicate agreement worse than chance [4]. For this study we used a Kappa scale previously used in other crowdsourcing studies [5], where agreement among annotators can be slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–0.99).

Participants

The task was distributed among researchers of a local university in Italy. All researchers were familiar with image annotation and had been living in Italy for at least one year. Participants received a coupon as a reward.

Results

In total, 21 researchers (15 male, mean age = 31.6 years) participated in the study. They provided 14,721 annotations for representativeness (701 pictures and 21 annotators per picture). Results indicated that 48.14% of the potentially representative images retrieved by the media analysis component were annotated as representative. The percentage of pictures annotated as representative was lower than the percentage of representative pictures indicated by the ground truth [1]. Results of the pilot study indicate 88.53% completeness. Reliability analysis was performed using Kappa statistics. Kappa achieved a value of 0.44 when comparing all annotations for representativeness in the entire set of images. Low reliability values might indicate an issue in task design (e.g., the task was not clear to participants, images were not correctly displayed), or subjectiveness in the measured variable. In order to obtain further insight, the task was redesigned and two additional variables
were introduced for annotation. Task redesign is described in the next section. Further description of the pilot study and detailed results can be found in [1].
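The agreement figures reported above can be computed with Fleiss' kappa, which generalises Cohen's kappa to more than two annotators. A minimal sketch using statsmodels; the binary labels and the small rating matrix are illustrative only, not the pilot-study data:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j] = label given by annotator j to picture i
# (0 = non-representative, 1 = representative); toy data for illustration
ratings = np.array([
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
])

# aggregate_raters turns the subject-by-rater matrix into subject-by-category counts
table, _categories = aggregate_raters(ratings)
print("Fleiss' kappa:", fleiss_kappa(table, method="fleiss"))
```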

2.3.2 Main study

In this study, the set of pictures to be refined by the crowd was different from the one used in the pilot study. The difference resulted from two modifications made to the media analysis component. Firstly, the number of images returned by the media analysis step was increased; the aim was to allow the human part of the pipeline to decide on a larger number of possible candidates. Secondly, the underlying visual content of the images was described using different types of image features: the set of images used in the pilot study was retrieved using color and feature descriptors, while in the main study the set of images was retrieved using the classic color moments descriptor [1].

Task design

Considering the results of the pilot study, the representativeness and diversity tasks were redesigned. Questions were reformulated and several examples were provided. Furthermore, the monument's Wikipedia entry was embedded into the representativeness task for reference, instead of a single example image as in the pilot study. Additionally, two new variables were introduced for annotation to investigate the effect of task subjectiveness. Thus, the main study collected information on representativeness and on two new variables: relevance and scenario. The variable relevance indicated whether the picture contains, partially or entirely, a given Italian monument. Scenario indicated whether the monument was located in an indoor or outdoor scenario. The difference between relevance and representativeness lies in the fact that "being relevant" is a more objective concept which indicates the presence or absence of the monument, or part of it, while "being representative" is a more subjective concept that might depend on variables such as visual context, personal perception, and previous experience. Figure x exemplifies this difference: the example picture on the left contains the Duomo in Milano; the target picture on the right also contains the Duomo in Milano, but it might not be considered representative of it. The study was divided into two different crowdsourcing tasks. Task 1 addressed the representativeness, relevance and scenario of the pictures provided by the media analysis step. From this task, a subset of representative pictures was extracted and used to test their diversity in Task 2. Task 1 assessed individual pictures and Task 2 assessed groups of pictures per monument.

Results

In total, 228 contributors judged 5,377 units in Task 1. Averaged results indicate that 81% of the images depict an outdoor scenario, 63% are relevant and 57% are representative of a given monument. As the definition of representativeness used in the media analysis implies that a picture is representative if it depicts an outdoor scenario, distributions were also calculated considering just the images annotated as outdoor. In this case, results show that the percentage of relevant images is 66% and the percentage of representative pictures is 60%. Reliability analysis was calculated using Kappa statistics. Results indicate that Kappa achieves a value of 0.7 for the variable scenario, 0.44 for relevance, and 0.32 for representativeness. In total, 62 contributors participated in 262 units in Task 2. Results indicate that 48% of the monuments contain all representative and diverse images, some 73% contain at least 75% representative and diverse images, and 90% contain at least 50% representative and diverse images.
Media analysis results refinement

Comparing the set of images refined by the crowd with the images automatically selected by the media analysis indicates an improvement in the quality of the results. Figure 4, Figure 5 and Figure 6 exemplify the refinement process from the original set of retrieved images to the set of representative and diverse images selected by the crowd. In general, the results suggest that crowd-sourcing tends to increase the diversity among the already representative pictures, while for the less relevant results corresponding diversity already exists among
images and the crowd-sourcing analysis tends to increase images’ relevance. Further description of the main study and detailed results can be found in [6].

Figure 4. Example of Initial set of retrieved images

Figure 5. Representative and diverse pictures retrieved by media analysis

Figure 6. Representative and diverse pictures refined by the crowd

The Media harvesting pipeline addresses the problem of enforcing representativeness and diversity in entity harvesting by introducing the human in the loop. The media analysis selects representative images that depict diverse aspects of the entities. However, results indicate that the media analysis alone can only reach up to 60% precision. Thus, a final refinement step is included where the crowd improves the representativeness and diversity of the set of pictures. Although crowdsourcing provides promising results, several issues need further investigation. For example, existing studies do not address the external validity of crowdsourcing results and the reliability of different kinds of annotators. Also, current aggregation methods in crowdsourcing tend to assume that the most popular answer is correct. In this pipeline we would like to explore methods that consider the task and the user's profile in the aggregation of results. Future work on the implementation of the pipeline will also focus on the automation and orchestration of the human and computer-based tasks.



3. Cross-domain recognition in consumer photo collections H-Demo

3.1 Relevance feedback H-Demo description

This H-Demo investigates the development of a framework to recognize people in photo collections, incorporating relevance feedback and contextual information with the aim of improving recognition performance. Our focus is on recognizing people primarily based on their faces. We devise and research a framework that detects faces and then discriminates among these faces. Instead of a traditional classifier-based approach, we choose a graphical model together with a distance-based face description method. We use the graphical model primarily to more flexibly incorporate context like time, demographics and social semantics. Most of the processing is performed offline; in other words, retrieval only implies looking up results that were computed beforehand. One reason for this is that the recognition within a single photo considers, and at the same time depends on, information from other photos. Another reason is to incorporate relevance feedback that might not be gained immediately.

3.1.1 Context

The first step when using the people recognition framework is to import photos into the framework's local repository. Then, each imported photo is processed: faces are detected, features are extracted, and the intermediate output is stored for future re-use. Thereafter, any detected faces are identified by discriminating them against other faces in the local repository, and the final recognition results are stored. We present the results in an interactive website (illustrated in Figure 7) where users can browse through galleries of photos, people, and their associated faces. For example, by clicking on a person, all (face) appearances of that person within the local repository (the imported photos) are retrieved by the people recognition framework and displayed. Moreover, users can also retrieve by name if people are associated with names (either explicitly by users or implicitly by importing some additional photos that are already annotated, e.g. an archive).

Figure 7: A web-based interface enables users to retrieve and browse results

In order to enable relevance feedback, the interface allows users to provide explicit feedback on irrelevant results. Users can click on a cross icon placed next to each face to remove the image from the set of related results. A server backend registers the irrelevant results selected by users and forwards the information to the recognition framework. The recognition framework can use this information to improve the current results and then re-present them to the user. Alternatively, the framework can store the relevance feedback information to improve future recognition tasks.


The following diagram provides an overview of the overall framework.

[Figure 8 block diagram labels: Photo Import; People Recognition (Detection, Extraction, Recognition); Users playing a game; Implicit Feedback (WP3); Relevance Feedback; Presentation; Database]

Figure 8: Overview of the people recognition framework

Lastly, we intend to later incorporate social semantics as additional contextual information. For example, we envision computing a social co-occurrence metric based on the gained recognition results. On the one hand, such social information could help improve recognition accuracy by feeding the gained social information back into the recognition process itself. On the other hand, we could provide and present such social information directly to external components or to the user; for example, to tell or illustrate which identified person is depicted alongside certain other persons over a certain timespan. As such, the latter use case relates to the History of Europe V-App.

3.1.2 Dataset

The framework is evaluated and demonstrated using datasets that resemble consumer photo collections. Such photo collections mainly depict people and are usually rich in contextual information. The proposed recognition framework does not, however, rely on normalized face shots; thus, photos taken in uncontrolled environments are acceptable. Generally, the recognition framework is best suited for photo collections depicting only a few individual people, as is the case with family photo collections. Photos depicting people that frequently appear together are of special interest for researching social semantics. For evaluation, the dataset should also contain a ground truth for each depicted person, or face, in every photo. In particular, the ground truth should provide a class label and annotations relating to face markings (e.g., eye-centers). To date, we are using the Gallagher Collection Person Dataset that is publicly available at chenlab.ece.cornell.edu/people/Andy/GallagherDataset.html. It contains 589 family photos, all of which were shot in an uncontrolled environment with a typical consumer camera. Many of the photos show the main subjects, a couple and their children, in a broad variety of settings and scenes both indoors and outdoors (see the following examples). The dataset depicts 32 different individual people over a period of roughly two years. In total, there are 931 face appearances.



Figure 9: Exemplary photos of the Gallagher Dataset

3.1.3 Computation techniques

The technical contribution of the people recognition framework is the detection of people (i.e. faces), the extraction and pre-processing of features, and the identification of people in photo collections integrating additional contextual cues and relevance feedback. We focus predominantly on faces as a means of detecting and recognizing people. Other contextual cues like time or social semantics are incorporated to further increase recognition performance. A key characteristic of the researched graph-based recognition framework (that we detail next) compared to traditional classifier-based approaches is the support of constraints; in particular, the exclusivity or uniqueness constraint wherein multiple faces in a photo cannot relate to the same individual person. Such a constraint is especially useful for photos that depict groups of people, and can notably increase recognition accuracy. Other social semantics that are conceivable are, for example, the co-occurrence of people.

Face Detection and Basic Recognition

We decided to utilize the seminal work of Viola and Jones included in the OpenCV package to detect faces. Their detection framework builds upon Haar-like features and an AdaBoost-like learning technique. The face recognition technique we introduce next provides some leeway for minor misalignment. Thus, the only normalization we perform is scaling the patches identified as faces to a common size and converting them to a gray-scale representation. Compared to holistic face recognition approaches that typically require training, we turn to a feature-based method using histograms of Local Binary Patterns. The feature-based method allows us to directly compute face descriptors and, subsequently, to compare these with each other based on a distance measure (e.g. utilizing Chi-square statistics). In order to actually recognize faces, the most straightforward approach is then nearest-neighbor matching against a set of known face descriptors.

Graph-based Recognition

In order to further improve the recognition of people (e.g., by incorporating constraints), a graphical model is used. Such models are factored representations of probability distributions where nodes represent random variables and edges probabilistic relationships. In particular, we choose a pairwise Conditional (Markov) Random Field (CRF). In our proposed approach, people's appearances (e.g., as represented by their faces) correspond to the nodes in such an undirected graphical model. We set up one single graph with nodes relating to a testing and a training set, signified by samples Tr and Te in Figure 10, and where we condition on the observed training samples with known class labels. The states of the nodes reflect the people's identities. We use the graph's unary node potentials to express how likely people's appearances belong to particular individuals (in the training set). We base the unary term on a face similarity function conveying the distances among faces. In particular, we encode each state using a nearest-neighbor approach among corresponding training samples. As there is one unary potential for each node, there is one pairwise potential for each edge
(connecting two nodes) in a pairwise CRF. They allow us to model the combinations of the states two connected nodes can take, and thus, to encourage a spatial smoothness among neighboring nodes in terms of their states. They also allow us to enforce an exclusivity or uniqueness constraint that no individual person can appear more than once in any given photo. The larger the spatial smoothness, the more we encourage neighboring nodes to take on the same state, therefore leading to fewer but larger clusters. Apart from damping noise, we also use this effect to model the observation that photo collections are often organized into smaller groups of consecutive photos (events), which are usually limited to only a few individuals. Note that for the smoothness and constraint to be effective, we need to establish edges reflecting direct dependencies among nodes; in the latter case, simply among all nodes that share the same photos. In general, however, we only connect nodes representing the most similar appearances of people with each other, for our aim of a sparse but effective graph representation. In other words, we connect each node with its closest matches among the combined testing and training sets. Ultimately, we wish to infer the states of the random variables, as in a discrete model. One approach is to compute the individual marginal distributions, where the largest marginal values would reflect the world states that are individually most probable. However, it is also possible to find the maximum a posteriori (MAP) solution of such a model. To deal with possible cycles (loops) in our graph, we employ Loopy Belief Propagation as our method for inference. Our aim for the proposed graph-based approach is to consider all people's appearances (along with any other contextual cues) within an entire dataset simultaneously. As a result, we perform recognition over the entire dataset (and thus inference over the entire graph) and not for a singular appearance of a person. When adding new photos to a dataset, an incremental (perhaps approximated) solution might be possible, but for simplicity and lack of space we restrict ourselves to always performing inference over the entire graph. Note, however, that we can efficiently store and re-use all information (e.g. any extracted features) except the graph structure.

Figure 10: Traditional nearest-neighbour matching compared to our graph-based approach
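The basic recognition path described above (Viola-Jones detection, LBP histograms compared via a Chi-square distance, nearest-neighbour matching) could be prototyped along the following lines. This is a simplified sketch using OpenCV and scikit-image rather than the project's actual implementation; in particular, it uses a single global LBP histogram instead of the usual spatial grid of histograms.

```python
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

# Viola-Jones detector shipped with OpenCV (Haar-like features, AdaBoost-like learning)
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_descriptor(image_bgr, box, size=(100, 100)):
    """Crop a detected face, normalise it to gray-scale and a common size, and
    describe it with a histogram of uniform Local Binary Patterns."""
    x, y, w, h = box
    gray = cv2.cvtColor(image_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=np.arange(0, 11), density=True)
    return hist

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two normalised histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def nearest_neighbour(query, gallery):
    """gallery: list of (label, descriptor); return the label of the closest face."""
    return min(gallery, key=lambda item: chi_square(query, item[1]))[0]

# Detection example:
# boxes = face_cascade.detectMultiScale(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY), 1.1, 5)
```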

Incorporating Relevance Feedback

As described in the previous section, each graph node (face appearance) is associated with a unary term that translates face similarities to states (labels/individuals). In order to incorporate relevance feedback (workflow in Figure 11), for instance when a user clicks a cross icon to indicate that a particular face presented to him is incorrectly recognized as another, wrong person, we override the unary term corresponding to the face for which the user provides feedback. In particular, we set the node's likelihood of being the other, wrong person to zero. Likewise, we can also zero out all likelihoods but one if a user wishes to validate a recognition result. Since each node is interconnected with other nodes, the feedback information will propagate and influence the overall recognition result. Note that we store all feedback information so that we can re-apply it in subsequent recognition tasks. Thus, we expect the overall recognition performance to increase with every further piece of feedback that users provide over time.
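In terms of node potentials, the feedback handling described above amounts to zeroing out entries of a unary potential table before inference is re-run. A minimal numpy sketch of that bookkeeping (potential values and person indices are illustrative; the actual inference via Loopy Belief Propagation is not shown):

```python
import numpy as np

def reject_identity(unary, node, person):
    """User marked the face at `node` as NOT being `person`:
    zero the corresponding state likelihood and renormalise."""
    unary = unary.copy()
    unary[node, person] = 0.0
    unary[node] /= unary[node].sum()
    return unary

def confirm_identity(unary, node, person):
    """User validated the recognition: zero out all likelihoods but one."""
    unary = unary.copy()
    unary[node, :] = 0.0
    unary[node, person] = 1.0
    return unary

# unary[i, k] ~ likelihood that face node i belongs to individual k (toy values)
unary = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
unary = reject_identity(unary, node=0, person=0)   # "this face is not person 0"
# ...then re-run inference (e.g. Loopy Belief Propagation) over the whole graph
```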



3.2 Relevance feedback H-Demo implementation

The people recognition framework consists of the following modules (Figure 11):

• Detector Module: detects specific parts like faces within a given photo
• Feature Extractor Module: extracts features from detected parts
• Recognition Module: performs people recognition offline and stores the results
• Feedback Module: leverages explicit feedback to further improve recognition performance
• Presentation Module: looks up pre-computed results and displays them

Figure 11: Workflow on how relevance feedback is incorporated

The Feedback module is not a standalone module. Its functionality is exposed by the Recognition module and the Presentation module (Figure 12).

Figure 12: Sequence diagram showing relevance feedback actions

Most of our research and work to date has been on the Detector, Feature Extractor and Recognition modules (thus providing the main functionality of identifying people in photos). Currently, these three modules are being brought together and integrated into an automated horizontal demo (the web-based demo detailed in Section 3.1.1). Our goal is to eliminate any manual steps that usually exist in prototyped research work and experiments. The focus of current and future research and work is on the remaining Feedback and Presentation modules. Note that the Feedback module depends on the Presentation module. We provide a REST-based API to expose the functionality of our people recognition framework.


The graphical user interface of our H-Demo is web-based and also communicates with the recognition framework via this REST API. In other words, the API separates our business logic (the recognition framework) into a backend (running as a service) and a frontend (the web-based graphical user interface). Cross-domain recognition in consumer photo collections belongs to R2 (M18), as specified in the release plan described in Deliverable 9.2 (where it is referred to as People Identification). The implementation code for the Cross-domain recognition in consumer photo collections H-Demo can be found at https://89.97.237.243:443/svn/CUBRIK/Demos/PeopleRecognition_QMUL.



4. LikeLines H-Demo

4.1 Relevance feedback pipeline description

4.1.1 Context

User interactions with multimedia items in a collection can be used to improve the representation of those items within the information retrieval system's index. The LikeLines H-Demo collects users' interactions to identify the most interesting/relevant fragments in a video. The collected user interactions can be implicit (e.g., play, pause, rewind) or explicit (i.e., explicitly liking particular time points in the video).

Figure 13: Overview of the LikeLines H-Demo

Likelines can be used in contexts which benefit from spontaneous and implicit crowd-based feedback, as in the case of trend analysis of people’s preferences in the SME Fashion V-App. For example, user's playback behaviour can signal which fragments of a fashion-related video contain the most interesting or appealing clothing items. LikeLines allows extraction of individual keyframes containing these clothing items, which can be passed to the SME Fashion App for further processing.

4.1.2 Dataset

The LikeLines components in this pipeline require a timecode-aware video dataset. This dataset contains user-contributed deep links mined from user comments. To power this inference, large quantities of user interactions are needed in order to understand how each type of contribution needs to be interpreted. Therefore, LikeLines is designed to target a large collection of existing videos on the Web and to be incorporable into any Web site. By being an open component, it can be deployed beyond conventional lab settings and have a greater reach.

4.1.3 Implicit playback behaviour

In order to perform multimedia retrieval at the fragment level, it is necessary to know what the interesting parts of a multimedia item are. The LikeLines component focuses on finding the interesting bits in a video by capturing the user’s playback behaviour. It is based on the assumption that a user only wants to watch the parts of a video that are interesting, possibly multiple times, and skips those parts that are uninteresting. By recording which parts are watched and which parts are skipped by each user through playback events (such as “play”, “pause”, and “seek”), LikeLines infers what is considered by viewers to be interesting and visualizes this as a heat map (“like line”) below the video. This heat map can then be used by users to jump directly to a point in the video. The core of the H-Demo is the LikeLines player and server components that serve the purpose of capturing the user interactions.



4.2 Relevance feedback pipeline implementation

This section originally appeared partly in: R. Vliegendhart, M. Larson, and A. Hanjalic. LikeLines: Collecting timecode-level feedback for web videos through user interactions. In Proceedings of the 20th ACM international conference on Multimedia. ACM, 2012.

The LikeLines system consists of two main components: a Web video player component that resides in a browser on the user's system and a server component (see Figure 13). The LikeLines multimedia player component was designed as part of WP3 - T3.2 Implicit User-Derived Information. In WP7, the components are developed and integrated into a pipeline. In LikeLines, the user directly interacts only with the player component. This component is implemented in JavaScript and uses HTML5 or Flash for video playback. User interactions such as playing and pausing the video are captured by the LikeLines player and are sent to the server component. The server component is responsible for storing and aggregating all these user interactions. The player component communicates with the server using the HTTP protocol and can make the requests listed below; the sequence of interactions is depicted in Figure 14.

a) Create a new interaction session for a video;
b) Add new interactions to an existing session; and
c) Aggregate content analysis and all sessions for a particular video to compute a heat map.

The server's reply messages to these requests are encoded in the JSON or JSONP format. The heat map is computed by representing an interaction session for an n-second video as n bins. Each bin is initially set to 0 and each interaction can contribute, possibly negatively, to a bin's value. Content analysis of a video is modelled as an interaction session as well. The heat map is then obtained by aggregating all sessions and mapping each bin's accumulated value to a colour.
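A minimal Python sketch of the bin-based aggregation just described: each session contributes to per-second bins, content analysis is treated as one more session, and the accumulated values are normalised before being mapped to a colour. The weighting of individual events (play, like, skip) is a simplifying assumption, not the exact scheme used by LikeLines.

```python
import numpy as np

def session_bins(n_seconds, played_ranges, liked_points, skipped_ranges):
    """Turn one interaction session into a vector of n per-second bins."""
    bins = np.zeros(n_seconds)
    for start, end in played_ranges:          # watched parts count positively
        bins[start:end] += 1.0
    for t in liked_points:                    # explicit likes count more
        bins[t] += 3.0
    for start, end in skipped_ranges:         # skipped parts contribute negatively
        bins[start:end] -= 0.5
    return bins

def heat_map(sessions, content_analysis=None):
    """Aggregate all sessions (content analysis modelled as a session too) and
    normalise to [0, 1] so each bin can be mapped to a colour."""
    total = np.sum(sessions, axis=0)
    if content_analysis is not None:
        total = total + content_analysis
    span = total.max() - total.min()
    return (total - total.min()) / span if span > 0 else np.zeros_like(total)

# Example for a 60-second video:
# s1 = session_bins(60, played_ranges=[(0, 30)], liked_points=[25], skipped_ranges=[(30, 60)])
# hm = heat_map([s1])
```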



[Figure 14 sequence diagram: the likelines.js player calls /createSession?videoId=<URI>&ts=<timestamp> on the LikeLines server and receives an interaction session token; within its event loop it calls /sendInteractions?token=<token>&interactions=<interactions>; /aggregate?videoId=<URI> returns the aggregation of sessions and MCA (content analysis) used to paint the heat map.]

Figure 14: Interaction between the LikeLines player and server components
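The three server calls shown in Figure 14 could be exercised from any HTTP client. The sketch below uses Python's requests against a locally running LikeLines server; the host/port, the use of GET requests and the exact encoding of the interactions parameter are assumptions for illustration, not the reference client.

```python
import json
import requests

SERVER = "http://localhost:5000"                       # assumed local LikeLines server
VIDEO = "http://www.youtube.com/watch?v=example"       # hypothetical video URI

# a) create a new interaction session for the video
token = requests.get(f"{SERVER}/createSession",
                     params={"videoId": VIDEO, "ts": 0}).text

# b) add new interactions to the existing session (encoding assumed)
interactions = [("play", 0.0), ("seek", 42.0), ("pause", 50.0)]
requests.get(f"{SERVER}/sendInteractions",
             params={"token": token, "interactions": json.dumps(interactions)})

# c) aggregate content analysis and all sessions to compute the heat map
heatmap = requests.get(f"{SERVER}/aggregate", params={"videoId": VIDEO}).json()
```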

LikeLines belongs to R2 (M18), as specified in the release plan described in Deliverable 9.2. The implementation code for the LikeLines H-Demo can be found at https://89.97.237.243:443/svn/CUBRIK/Demos/LikeLines_TUD.



5. Accessibility aware relevance feedback H-Demo

5.1 Accessibility aware relevance feedback pipeline description

5.1.1 Context

Disabled people represent a significant group in the population of the European Union (EU). The estimated number of people with disabilities is approximately 40 million persons (lowest estimate), which corresponds to nearly 11% of the EU population. Furthermore, these numbers rise when also considering people with cognitive difficulties and people in the so-called hinterland between the fully able-bodied and the classically termed disabled. Thus, designing and evaluating web pages for the disabled is becoming an increasingly important topic for a variety of reasons. In this respect, T7.4 will deliver the definition, design and creation of an accessibility-aware relevance feedback module. Impaired users will provide feedback related to the accessibility of the retrieved results. The feedback module will fine-tune the results and promote the results with a higher accessibility score (and vice versa) so as to increase the perceived usefulness and usability of the complete system for disabled people. The final outcome of the current task will be the Accessibility aware relevance feedback H-Demo, which aims to enhance the process of multimedia content harvesting in such a way that it will later be able to provide an accessibility-related indicator (i.e. confidence factor) regarding the level of accessibility and usability it offers to specific groups of users. Initially, the "Accessibility aware relevance feedback" demo will be exhibited as a standalone H-Demo, which can handle various types of multimedia data contained in web pages, while later on it will be incorporated into the "Fashion Trends" V-App. The selection of this V-App has been based mainly on usability criteria. Since the Fashion Trend V-App will be designed to address both SME and end-users' needs, the incorporation of accessibility-related factors will broaden the target group of future CUbRIK users. In particular, fashion trends do not rely only on the shape of the clothes but also on the colours used; the quality of an image (e.g. bit-depth, contrast, etc.), the proportion of colours and tones, the geometry of the illustrated objects, etc. play a significant role in how fashion content is perceived by impaired people (i.e. end users). Similarly, issues may also appear in other multimedia objects such as videos (e.g. fashion shows), where specific properties of the video may either disturb users with specific impairments or be imperceptible to them. In this way, the current pipeline aims at adding value to the delivery of query results by taking into account the personal impairment-related profile of each registered user. Moreover, in the framework of the current task, specific technological novelties will be proposed in terms of relevance feedback techniques and profile updating. Finally, improvements in the definition, extraction and analysis of specific accessibility-related features from multimedia data will be achieved, while their values will be linked to actual degrees of certain impairments.

5.1.2 Dataset

The Fashion dataset has initially been utilized for the development and testing of the accessibility-aware relevance feedback module. For this pipeline, the set of fashion-related images collected from the Flickr website was used, accounting for a total of 330,622 images. Each image is accompanied by the information and metadata listed below:

• The title of the image.
• The URL of the image in Flickr.
• URLs for small, medium, original and large versions of the image.
• Text comments about the image, along with their author and the comment date/time.
• Text tags describing the image, along with their author.
• The geo-coordinates of the image (latitude, longitude and their accuracy).
• The person who uploaded the image.
• The date/time the image was taken.
• The date/time the image was uploaded.
• The name and the type of the context where the image belongs (e.g. a collection).
• Licence information about the image.

In addition, some required accessibility-related metadata include:

• Alt attributes
• Header tags
• Absence of "hard-coded" text size.

The images of the dataset and their associated metadata (including the metadata of the corresponding fashion item) are used to create individual SMILA records, which can be used by the pipelines of accessibility-aware relevance feedback. The records are first indexed, so that accessibility scores can be assigned to them, and then the search and relevance feedback pipelines can be performed, which utilize these accessibility scores.

5.1.3 Accessibility related features


Apart from many individual studies that have been conducted over time [15][16][17], the most significant accessibility-related impairments concerning web pages and the internet, which can also be extended to multimedia data, are included and described in the Web Content Accessibility Guidelines (WCAG). The first web accessibility guideline was compiled by Gregg Vanderheiden and released in January 1995, just after the 1994 WWW II conference in Chicago (where Tim Berners-Lee first mentioned disability access in a keynote speech after seeing a pre-conference workshop on accessibility led by Mike Paciello). Over 38 different web access guidelines followed from various authors and organizations over the next few years, later unified in the series of Web accessibility guidelines published by the W3C's Web Accessibility Initiative. In particular, they consist of a set of guidelines for making content accessible, primarily for disabled users, but also for all user agents, including highly limited devices such as mobile phones. The current version, 2.0, is also an ISO standard, ISO/IEC 40500:2012. In this context, the guidelines cover vision, hearing, motor, cognitive, etc. impairment restrictions and give guidance on how to overcome them. Although task 7.4 will eventually address the majority of the aforementioned impairments, being still in the 5th month of its development it initially focuses only on vision-based disabilities. Following this, the best-known disabilities concerning eyesight and the visual perception of a coloured image are summarized in Figure 15.

Page 21

D7.1 Version 1.0


Deuteranomaly

Tritanopia

Deuteranopia

Achromatopsia

Protanomaly Tritanomaly Blue Cone Monochromacy

Figure 15: Possible vision based impairments

Moreover, a list of further features that will be extracted and combined so as to derive quality factors for the accessibility of multimodal objects is summarized in the following:
• Accessibility related descriptors/features concerning the visual objects in a web page (i.e. images and video streaming media), so as to address vision related disabilities (e.g. colour-blindness, etc.):
o the contrast of the image,
o the colour histogram,
o the colour layout,
o areas with high luminance values,
o the image resolution,
o shape descriptors of the image,
o its texture, and
o frequency related features for video transitions.
• Accessibility related descriptors/features concerning the audio objects in a web page (i.e. sounds and audio streaming media):
o frequency spectrum,
o phase related features,
o percentage of high and low frequencies,
o duration,
o mono/stereo/surround/etc.,
o bit rate,
o loudness per frequency band (dB),
o DC values,
o (P)SNR,
o etc.
• Accessibility related features concerning the text (objects) in a web page:
o Font size,
o Font colour,
o Font contrast with respect to the background,
o Text alignment,
o Indentation/Spacing,
o etc.
• Accessibility related features concerning the metadata (for image/sound/text) in a web page (a minimal sketch of automated checks for some of these items is given after this list):
o Any content in audio/visual format should also be available as a text transcript for hearing impaired users.
o Existence of the images' "alt" attribute.
o Existence of proper header tags (i.e. h1, h2, h3, etc.) that make site navigation easier for users using assistive technologies (e.g. screen readers).
o Preservation of consistency in layout, colour, and terminology for reducing the cognitive load placed on users.
o Absence of "hard-coded" text size, which would defeat the use of standard browser controls.
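As an illustration of how some of the metadata related checks listed above could be automated, the following sketch scans an HTML page for images without an "alt" attribute, counts header tags, and flags hard-coded pixel font sizes in inline styles. It relies only on the Python standard library; the function names and the exact set of checks are illustrative assumptions and not part of the CUbRIK implementation.

```python
from html.parser import HTMLParser

class AccessibilityMetadataChecker(HTMLParser):
    """Minimal sketch: collects a few WCAG-inspired metadata indicators."""

    def __init__(self):
        super().__init__()
        self.images_without_alt = 0
        self.header_tags = 0
        self.hardcoded_font_sizes = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and not attrs.get("alt"):
            self.images_without_alt += 1          # missing or empty alt text
        if tag in ("h1", "h2", "h3", "h4", "h5", "h6"):
            self.header_tags += 1                 # proper header structure
        style = attrs.get("style") or ""
        if "font-size" in style and "px" in style:
            self.hardcoded_font_sizes += 1        # hard-coded text size

def metadata_report(html_text):
    checker = AccessibilityMetadataChecker()
    checker.feed(html_text)
    return {
        "images_without_alt": checker.images_without_alt,
        "header_tags": checker.header_tags,
        "hardcoded_font_sizes": checker.hardcoded_font_sizes,
    }

if __name__ == "__main__":
    page = '<h1>Shop</h1><img src="dress.jpg"><p style="font-size:10px">tiny</p>'
    print(metadata_report(page))
    # {'images_without_alt': 1, 'header_tags': 1, 'hardcoded_font_sizes': 1}
```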

5.1.4

Accessibility relevance feedback module Overview

Hereafter follows a short description of the main building blocks that will implement the pipeline suggested in task 7.4. Table 1 shows the tasks that have been identified as essential (up to now) to be implemented within the framework of T7.4 and their degree of implementation within the first 5 months of the task. For each subtask, and for both the Java and the SMILA implementation, the table records whether the subtask has been Described, Designed, Implemented or Delivered. The subtasks are:
1. Vision related accessibility algorithms
2. Sound related accessibility algorithms
3. Dynamic vision related accessibility algorithms (i.e. from video)
4. Mapping of feature values to actual impairment degrees
5. Profile registration
6. Relevance feedback algorithm
7. Retrieval of entities/objects from dataset

Table 1: Progress of subtasks implementation from M13 to M17



Initially, the generation of user-specific profiles from information derived during the user's registration phase will be implemented, while updates of these profiles will follow according to the behaviour/selections of each user and the implicit relevance feedback. Then, during the indexing of the data, evaluation of the data in terms of accessibility and usability of the multimedia content of web pages (i.e. images, sound, text, header tags and metadata) will take place via the appropriate data analysis module, so as to provide accessibility relevance factors by executing extended multimedia analysis.

The final accessibility related evaluation of each web page will be user-specific, by adjusting to the user's profile; thus, it will address the needs and the disabilities of the profile of the logged-in user (i.e. specific disabilities will be assigned to certain attributes of the multimedia objects of the web page). In order to adapt the accessibility distance measure to match the users' expectations (disabilities), the users will be given the opportunity to provide relevance feedback information regarding the accessibility level of the web page, via a novel concept of interaction on the query results. The relevance feedback they provide will refer either to the web page as a whole or to specific multimedia objects contained in the web page. The relevance feedback referring to single multimedia objects contained in the web page will be processed autonomously and will partially (e.g. via weighting factors) contribute to the final accessibility confidence factor of the full web page.

Finally, crowdsourcing techniques are planned (although not included in the current deliverable) to be utilized for the evaluation of those attributes of multimedia data or multimedia metadata for which no automatic evaluation can be performed, or for which automatic evaluation is unacceptably expensive in terms of processing resources (e.g. useful navigation options, meaningful link text, meaningful tagging of tables, etc.).

5.1.5

Existing approaches for data visualization

Existing approaches addressing the problem of multimodal placement and other related problems, such as multimodal dimensionality reduction, are usually called multi-view learning methods, where the input objects are considered to be represented by multiple views and the goal is to discover patterns in them by utilizing all the available views. Multi-view learning methods, similarly to related single-view learning methods, are usually formulated as an optimization problem, where the final solution (placement, clustering etc.) is the one for which an appropriate objective function is optimized. Utilization of information from the multiple views during the optimization process is generally performed in one of the following ways:
• By defining an objective function which incorporates information from all views and using this function during the optimization (such as finding a multimodal space using correlations between views, as in [8], or using Multiple Kernel Learning, as in [9]).
• By using different objective functions (one for each view of the data) at each step of the optimization loop (such as using co-training, as in [10]).

In both these paradigms, information from the multiple modalities, or views, is merged in the final solution. A drawback of this merging is that some information that existed in the initial separate modalities is inevitably lost. This information could be useful to the human viewer.

5.1.6

Pipeline for relevance feedback

The general workflow for the accessibility-aware relevance feedback module is as follows: Initially, the data of the database (the Italian Monuments database) are indexed. During indexing, accessibility scores for the various supported impairments are assigned to each record, according to accessibility-relevant visual and other multimodal attributes of the image contained in each record. After indexing has finished, users can search in the database. For each user, the system maintains an, initially rough, impairment profile, which contains information about the disabilities of the user. After the submission of a query by a user, a set of relevant results is returned, ranked by their relevance to the query. These results are then re-ranked, taking into account the user impairment profile as well. The re-ranked results are then visualized in a way that is best for the specific user profile. After viewing the results, the user is able, through the application's user interface, to provide feedback about the accessibility of the presented results, e.g. by selecting the most accessible results. This feedback is then used by the system in order to fine-tune the user profile, and thus to present the results in a way that better fits the user disabilities.

From the above description, the following four pipelines are distinguished:
• Profile registration pipeline
• Indexing pipeline
• Search pipeline
• Accessibility-aware relevance feedback pipeline

The four pipelines and their corresponding interconnections are illustrated in Figure 16 and briefly described, along with their sub-components (i.e. pipelets), in the following subsections.

Figure 16: Block diagram of the pipelines/pipelets used/developed within T7.4

Profile registration pipeline
This pipeline is responsible for the registration of the user impairment profile. The user is able to provide information about his/her disabilities, by entering his/her amount of impairment for each of the supported disabilities, in a rough fuzzy scale. This information is stored in his/her profile, in order to be subsequently used and fine-tuned by the system. The user impairment profile registration pipeline contains the following pipelets:
• Profile registration: This is the pipelet responsible for registering user information other than accessibility-related information. It is considered as already implemented by other modules.
• Accessibility related profile registration: This pipelet extends the general profile registration pipelet, by adding the disability profile information of the user.

Indexing pipeline
Before any search can be performed in any search engine, an indexing step needs to be carried out, in order to construct an index of the database records, so that the subsequent searches are fast and efficient. Hereby, in order for accessibility-related information for the database records to be available, an extra step is added in the indexing pipeline. Before the records are indexed, visual and other multimodal features are extracted from them, which are relevant to the various supported disabilities. These features are stored in the index, in order to be available during the accessibility-aware relevance feedback. The indexing pipeline consists of the following pipelets:
• Multimodal feature extraction: This pipelet is responsible for taking a database record as its input and extracting accessibility-related multimodal (e.g. visual, auditory) features from the multimedia contained within it. These features are added to the record as new metadata.
• Indexing: This pipelet is responsible for pushing the input records to the indexing service and is already implemented in SMILA.

Search pipeline
The search pipeline is responsible for receiving an input query from the user and searching in the index for the most relevant results. The results are returned to the user, ordered (ranked) according to their relevance to the query. For simple text queries, the search pipeline is already implemented in SMILA; however, for content-based or multimodal queries, it needs to be modified. In any case, the internals of the search pipeline are irrelevant to the accessibility-aware relevance feedback module, since relevance feedback occurs after the results have been retrieved.

Accessibility-aware relevance feedback pipeline
This is the central pipeline of the accessibility-aware relevance feedback module. It uses the results which are retrieved from the search pipeline, as well as the user profile created by the profile registration pipeline, in order to estimate accessibility scores for the results, which are specific to the user. These scores are used in order to visualize the results in a way that best fits the user profile (e.g. showing the most accessible results first). After the results are visualized, the user is able to provide feedback about the accessibility of the results. Feedback can be provided either directly, by letting the user select among a number of different suggested visualizations, or indirectly, by letting the user select his/her preferred results and assuming that these are the most accessible ones. This feedback is then used in order to update the registered user profile. The user profile that is initially registered by the user is rather rough, since the user is usually not able to insert exact values for the amount of disability he/she has in each of the supported impairments. Thus, through the feedback-based update of the user profile, the latter is fine-tuned to best represent the actual user disabilities. The updated user profile is then used to recalculate the accessibility scores of the results, and present a better visualization to the user. The accessibility-aware relevance feedback pipeline consists of the following pipelets:
• Accessibility score estimation: This pipelet takes the results retrieved by the search pipeline and the user profile as its input and calculates accessibility scores for the results. These accessibility scores are based both on the relevance scores of the results, as returned by the search engine, and on their multimodal features, which were extracted during indexing.
• Results visualization: This pipelet uses the accessibility scores of the results in order to present them to the user in a way that best fits his/her impairments, or to let the user select among different visualizations.
• User relevance feedback: This pipelet is responsible for receiving the user feedback, either directly (by letting the user select among a number of suggested visualizations) or indirectly (by letting the user select the most accessible results).
• Profile updating: This pipelet takes the user feedback as its input and updates the user impairment profile, so that it best represents the actual user disabilities.

5.2 Accessibility aware relevance feedback pipeline implementation
For implementing the functionality indicated in Task 7.4, the development and combination of several distinct pipelines and pipelets, as illustrated in Figure 16, is required. To this end, the reader can find a thorough description of the building blocks presented in Figure 16 in the following paragraphs.


5.2.1

Profile Registration

In order to accomplish accessibility-aware analysis and the corresponding relevance feedback, the system needs to be aware of the disabilities of each user (i.e. the user's disabilities have to be registered with the system). For this purpose, an impairment profile is maintained for each user. This profile contains information about the amount of the user's disability in each of the supported impairments. Formally, if M is the number of supported disabilities, the impairment profile can be considered as a vector

u = (u_1, u_2, ..., u_M),  u_i ∈ [0, 1]

The elements of vector u take values in the range [0,1] and represent the degree of disability in each impairment. A value of 1 means that the user has the respective disability in the maximum possible degree, while a value of 0 means that the user does not have this disability at all. The results of retrieval should be rearranged such that the user is primarily presented with results that are easy to view/listen to/etc., according to his/her own disability profile. For instance, a severely colour-blind person (i.e. one having a value close to 1 for colour-blindness) should be presented first with results that are easy for him/her to see (e.g. images that do not contain red and green). Every CUbRIK end user is associated with an impairment profile, which is used and updated as the user interacts with the system. Referring to the User Taxonomy Model of deliverable D2.1 (section 1.4: "User and Social Models"), the user profile is added as an attribute of a CUbRIKEndUser, as depicted in Figure 17. Impairment is an object representing the amount of impairment of a specific user for a specific impairment type. The various impairment types, such as myopia or colour-blindness, are represented by ImpairmentType objects, which contain information about a specific impairment type.

Figure 17: The ImpairmentProfile attribute in the User Taxonomy Model
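A possible in-memory representation of the entities in Figure 17 is sketched below. The class and attribute names mirror the taxonomy (ImpairmentType, Impairment, impairment profile) but are an illustrative Python sketch rather than the actual CUbRIK data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class ImpairmentType:
    """A supported impairment type, e.g. colour-blindness or myopia."""
    name: str
    description: str = ""

@dataclass
class Impairment:
    """Degree of one impairment for one user, in [0, 1]."""
    impairment_type: ImpairmentType
    degree: float  # 0.0 = not present, 1.0 = maximum degree

@dataclass
class ImpairmentProfile:
    """The vector u = (u_1, ..., u_M) attached to a CUbRIK end user."""
    impairments: List[Impairment] = field(default_factory=list)

    def as_vector(self, types: List[ImpairmentType]) -> List[float]:
        degrees: Dict[str, float] = {i.impairment_type.name: i.degree
                                     for i in self.impairments}
        return [degrees.get(t.name, 0.0) for t in types]
```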

The user submits his/her disability profile before any retrieval is performed, so that the system can deliver personalized analysis and result presentation (i.e. a re-ranked result order). The profile submission is handled by the Subscription Manager component of the Content and User Acquisition Tier of the CUbRIK architecture (section 3.1.1 of deliverable D9.8). As mentioned in the description of the Subscription Manager, most of the user profile data are stored in existing single sign-on accounts of the users, while CUbRIK stores only the minimal amount of data needed to reference this information. Since an impairment profile is not present in such external user accounts, the minimal amount of data stored by CUbRIK should be extended to accommodate the impairment profile. Nevertheless, since the submitted profile may not be very accurate (e.g. due to the fact that the amount of disability is entered on a fuzzy scale), relevance feedback is incorporated, in order to fine-tune the user profile to fit the user's disabilities exactly and adjust the result presentation.

5.2.2

Visual Feature Extraction

The first kind of information needed by an accessibility-aware relevance feedback system is the degree of each of the user's impairments, covered by the user impairment profile described in the previous subsection. In addition, information about the retrieved results themselves is needed: the aim is to estimate the appropriateness of each result for users having a specific impairment. The suitability of a specific result (e.g. an image) for people having the various kinds of supported impairments can be formulated as an accessibility vector

a = (a_1, a_2, ..., a_M),  a_i ∈ [0, 1]

Each of the elements of vector a represents how suitable the result is for people having the respective disability. Each element takes values in the [0,1] range, with 1 meaning that the result is appropriate for people with the respective disability (i.e. it can be viewed/heard/etc. easily) and 0 meaning that the result is totally inappropriate. The vector of the accessibility scores for a specific content object is stored as a special kind of annotation for it. This is illustrated in Figure 18, which is part of the Content Description Model described in deliverable D2.1. An AccessibilityAnnotation class is added as a subclass of Annotation, holding the accessibility scores for the various impairments. Each object of class Accessibility contains an impairment type, of class ImpairmentType, and the amount of suitability of the content object for this type of impairment.

Figure 18: The AccessibilityAnnotation sub-class in the Content Description Model.

In order to calculate the accessibility vector for each result, certain visual/auditory/etc. characteristics need to be extracted from it (i.e. each result is pointing at either a multimedia entity or a set of such entities). In this first version of the current deliverable, only visual characteristics are considered. The high level visual characteristics are described in Appendix 1.

5.2.3

Accessibility rank estimation based on user profile

When a user submits a query to the search engine, a set of results is returned, which need to be properly presented to him/her, taking his/her impairments into account. In other words, results which are initially ranked according to their resemblance to the query need to be re-ranked in order to be properly presented to the specific user. Let O be the set of the N top retrieved results. Each result o_i ∈ O can be considered as a tuple

o_i = {s_i, a_i},  s_i ∈ [0, 1],  a_i ∈ [0, 1]^M



where s_i is the normalized ranking score of the result, as returned by the search engine (i.e. s_i is 1 if the result is identical to the query, while it is 0 if the result and the query are completely dissimilar), and a_i is the accessibility vector of result o_i, as calculated by the feature extraction methods of the previous subsection.

In order to re-rank the results so that impairments are taken into account, each result o_i needs to be associated with a ranking score r_i ∈ [0, 1], which incorporates both the retrieval ranking of the results and accessibility information. For the calculation of the r_i values, the following procedure is followed: The accessibility vectors a_i of the results, as well as the user impairment profile u, are considered as points in an M-dimensional space. In the following, this space will be referred to as the "impairment space". The axes of this space represent the various types of the supported impairments and are normalized to the [0,1] range. Such a space is depicted in Figure 19 where, without loss of generality and for illustration purposes, only two types of impairments (d_1 and d_2) are considered. In this figure, the results are depicted as white circles, while the user profile is depicted as a black square. Results that are close in this space are similar as far as their accessibility characteristics are concerned.

Figure 19: Representation of the retrieved results and the user profile

A first step in calculating the final ranking values r_i is to rank the results by considering just their accessibility features, without using the retrieval rankings. Let b_i ∈ [0, 1], i = 1...N, be these accessibility ranking scores. The b_i values are calculated as follows: A line is considered, starting at the center of the axes of the impairment space and passing through the user profile point. The direction of this line represents the proportions in which the user has each impairment. If, e.g., this line is close to the axis of impairment j, then the user has impairment j in a significantly larger degree than the other disabilities. The direction of the line thus determines the accessibility ranking scores of the results. Each point in the impairment space representing a result is projected on this line, as illustrated in Figure 20. The distance of the projected points from the center of the axes is their accessibility ranking score:

b_i = k_u (a_i · u) / |u|,  k_u ∈ R

where k_u is a normalization constant, used to normalize the projected point distance to the range [0,1]. k_u is different for different line directions (i.e. for different u vectors). The larger b_i is, the more appropriate result i is for the user.



By projecting the results on a line whose direction is determined by the position of the user profile in the impairment space, the results are ordered so that the most appropriate for the user are first. If the direction of the line is altered, i.e. the user has different proportions for the supported disabilities, then the projected points will have a different ordering along the line.

Figure 20: Evaluation of accessibility ranking scores for the results.

So far, only the accessibility information for the results and the user has been utilized for the computation of the ranking scores. However, the presented results have different similarities to the initial query submitted by the user, which need to be considered as well. For the incorporation of the initial retrieval ranking scores, s_i, the following reasoning is adopted: If a user does not have the supported disabilities to a large degree, i.e. if the user profile point lies close to the center of the axes of the impairment space, then more emphasis should be given to the initial retrieval rankings, rather than the accessibility rankings of the results. On the other hand, if the user has one or more disabilities to a large degree, i.e. the user profile point is far from the center of the axes, then more emphasis should be given to the accessibility rankings of the results, rather than the retrieval rankings. The final rankings of the results, taking into account both the retrieval and the accessibility ranking scores, are thus calculated as

r_i = (1 - w) s_i + w b_i,  w ∈ [0, 1]

i.e. as a weighted sum of the retrieval rankings, s_i, and the accessibility rankings, b_i. The weight w determines the trade-off between the two and is calculated as

w = c_u |u|,  c_u ∈ R

The weight w is essentially the distance of the user profile point u from the center of the axes, normalized so as to lie in the [0,1] range via the normalization constant c_u. The normalization constant is different for different directions of vector u. Overall, the position of the user profile in the impairment space determines the final ranking of the results. The direction of the user profile vector determines the accessibility ranking of the results, while its magnitude determines the trade-off between the accessibility ranking and the retrieval ranking. After the ranking scores r_i are calculated, the results can be presented to the user in a way that best fits his/her profile. This presentation could be a ranked list of results, ordered by the r_i values, or some other visualization of the results.
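To make the above procedure concrete, the following sketch computes the accessibility ranking scores b_i = k_u (a_i · u)/|u| and the combined scores r_i = (1 - w) s_i + w b_i with w = c_u |u|. It assumes plain Python lists and that the normalization constants k_u and c_u are supplied by the caller; it is only an illustration of the formulas, not the actual SMILA pipelet implementation.

```python
from math import sqrt

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    return sqrt(dot(x, x))

def rerank(results, u, k_u, c_u):
    """results: list of (s_i, a_i) tuples; u: impairment profile vector.

    Returns the results sorted by the combined score
    r_i = (1 - w) * s_i + w * b_i, highest first.
    """
    u_norm = norm(u)
    w = min(1.0, c_u * u_norm)                 # trade-off weight, clipped to [0, 1]
    scored = []
    for s_i, a_i in results:
        # accessibility ranking score: projection of a_i on the profile line
        b_i = k_u * dot(a_i, u) / u_norm if u_norm > 0 else 0.0
        r_i = (1.0 - w) * s_i + w * b_i
        scored.append((r_i, s_i, a_i))
    return sorted(scored, key=lambda t: t[0], reverse=True)

# Example with two impairment dimensions (d1, d2):
u = [0.8, 0.1]                                  # mostly the d1 impairment
results = [(0.9, [0.2, 0.9]), (0.7, [0.9, 0.3])]
print(rerank(results, u, k_u=1.0, c_u=1.0))     # the d1-friendly result comes first
```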

5.2.4

Relevance feedback

While the elements of the user profile vector are submitted by the user, they may not represent exactly his/her disability degrees for the various types of impairments. The user may not know the exact values for his/her disabilities, as expected by the system. For this reason, as mentioned in Section 5.2.1, the user profile values are submitted on a fuzzy scale. The purpose of such a fuzzy submission is that the system has a rough picture of the user impairments, as a starting point for the initial ranking and presentation of the results. In order to fine-tune the user profile and present the results in a way that best fits the actual impairments of the user, accessibility-aware relevance feedback is employed.

The feedback scheme used here is similar to existing relevance feedback schemes used in standard search engines, and is as follows: The user is presented with a set of results, ranked according to the procedure described in the previous subsection. However, due to the already mentioned inaccuracies, this result presentation may not be totally appropriate for the user (e.g. the user may not be able to view/hear/etc. some of the first results). Following the standard course of actions in any search session, the user then selects some results that are easily seen/heard/etc., in order to get more information about them (e.g. the user selects the second result in order to view the respective image in full size, or to view the respective website). Let F be the set of the L results that the user selects. This selection is an indirect form of feedback about which results are easy for him/her to see/hear/etc. Hence, the assumption is made that the results that the user selects are those which should have the highest ranking values and be presented first. Thus, the user profile needs to be moved towards the selected results. The procedure for the profile update is described in the next subsection.

5.2.5

Profile update - fine-tuning

With the relevance feedback scheme described in the previous subsection, the user selects some of the presented results, which are those that should have the highest rankings. Figure 21 depicts an example of such a selection in the impairment space. The selected results are denoted as bold circles. In this example, it can be seen that the line passing through the user profile is away from the area of the selected results, i.e. the user profile needs adjustment. The user profile is then updated by moving the profile point towards the selected result points (Figure 22). Such a movement changes the direction of the projection line (towards the d_1 disability in this example) so that it passes through the area of the selected points, thus adopting different proportions for the different types of impairments and best fitting the selection. Formally, the user profile is moved to the center of the selected result points:

u = (1/L) Σ_{i=1..L} a_i
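A minimal sketch of this centroid update is given below, assuming the selected results are available as a list of accessibility vectors. A damped update (mixing the old and the new profile) could equally be used, but that would be an assumption beyond what is described above.

```python
def update_profile(selected_accessibility_vectors):
    """Move the user profile to the center of the selected result points."""
    L = len(selected_accessibility_vectors)
    if L == 0:
        raise ValueError("at least one selected result is required")
    M = len(selected_accessibility_vectors[0])
    # component-wise mean of the selected accessibility vectors
    return [sum(a[m] for a in selected_accessibility_vectors) / L for m in range(M)]

# Example: two selected results in a 2-dimensional impairment space
print(update_profile([[0.2, 0.9], [0.4, 0.7]]))   # approximately [0.3, 0.8]
```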

Figure 21: Representation of the user selection in the impairment space



Figure 22: Representation of the user disability profile update in the impairment space

5.3 Accessibility aware relevance feedback User Interface pipeline
The method for the implementation of the accessibility aware relevance feedback User Interface (UI) in task 7.4 will be based on a novel data visualization approach. Given a set of arbitrary multimodal objects, the goal is to present them on the screen in an intuitive way, by making use of all modalities. Initially, the focus is laid solely on the proper placement of the objects, rather than on other visual characteristics (e.g. colour, size, etc.), based on the fact that differences in position are generally more accurately perceived than differences in colour or size [7]. Herein, the problem of multimodal placement is addressed as a multi-objective optimization problem, resulting in a set of effective solutions, instead of a single one, from which the user can efficiently select the most applicable one (a detailed description of the approach proposed within the current section can be found in Appendix 2). In particular, viewing multimodal placement from this different perspective allows broader possibilities. In Figure 23 (a), the grey shaded area denotes all possible solutions (i.e. the combinations of rankings of the query results) while the black line stands for the Pareto front, which includes all the best possible ranking alternatives addressing the impairments of the user. The black dot denotes the solution that optimally matches the accessibility profile of the current user. Figure 23 (b) illustrates the case in which the user picks a different solution than the proposed one (i.e. relevance feedback); his/her profile is then updated accordingly, as described in Section 5.2.3.

Figure 23: Set of effective solutions to the multimodal placement problem (panels (a) and (b))

Other novelties may also be introduced in the determination of a proper graph aesthetic measure and in incorporating cross-modal correlation information into the steps of the genetic optimization algorithm, so that the optimization is faster and more accurate. The implementation code for the Accessibility aware Relevance feedback pipeline can be found at https://89.97.237.243:443/svn/CUBRIK/Demos/AccessbilityAwareRelevanceFeedback_CERTH



6.

SME Innovation V-App. Fashion trend

6.1

Fashion-focused dataset and ground truth collection

6.1.1

Context

The Fashion-focused Creative Commons Dataset was created in the context of the SME Innovation V-App. This data set has two purposes: First, the data set served as a concrete means to investigate the kinds of feedback that can be collected from the worker population of a commercial crowdsourcing platform. Commercial crowdsourcing platforms make it possible to access a large number of users representing a diverse population. Feedback collected from this population can be important for understanding user perceptions of similarity between the clothing presented in two images or their perceptions of which clothing items belong to which style. Second, the data set served as an initial Creative Commons licensed data set useful for testing components potentially relevant for the SME Fashion App. For example, it is possible to use this data set to test technologies that are capable of detecting the presence of a human in an image. Such technologies are important building blocks contributing to "fashion specific" technologies, for example, a classifier that is able to detect whether an image is relevant to fashion and should be processed further in a fashion-specific pipeline. The Fashion-focused dataset was collected using a social search engine (i.e. Flickr) and contains sets of pictures potentially related to fashion. However, as in the case of the Media harvesting pipeline, user-generated content might be ambiguous or inaccurate. As part of the creation of a fashion-focused dataset and related ground truth, a crowdsourcing task was developed to refine and enrich the initial metadata.

6.1.2

Dataset

The fashion-focused dataset contains a mix of general images as well as a large component of images that are focused on fashion (i.e., relevant to particular clothing items or fashion accessories). The dataset contains 4810 images and related metadata collected as part of CUbRIK. Furthermore, the dataset is accompanied by a ground truth on the images' tags. The ground truth was created by two groups of contributors (i.e. crowdsourcing workers and trusted contributors). The topics for this dataset were collected from Wikipedia, trying to cover "specialty" fashion as well as general fashion categories. Using Wikipedia's index of fashion topics, 470 topics were selected (only topics which are related to categories containing the text "fashion" or "cloth"), and for each topic a query was issued to Flickr. After filtering and selecting the categories which have more than 10 Creative Commons (CC) images, 4810 images were collected. In addition to the actual images, several kinds of metadata such as title, description, comments and other available context and social features were collected.

6.1.3

Ground truth generation

The fashion-focused ground truth was generated by collecting feedback from people in two annotation tasks. The first annotation task collected feedback on the pictures' relevance to fashion (i.e., verifying that an image is fashion-related); the second annotation task collected feedback on specialty item confirmation (i.e., confirming that an image contains the fashion-related tag given by Flickr's users). Two groups of annotators participated in the ground truth generation. The first group consisted of people known by the researchers (i.e. trusted annotators); the second group were crowdsourcing workers on Amazon Mechanical Turk (AMT). The two sets of annotations represent a contribution not only because of the additional level of ground truth (i.e., the ground truth's ground truth) but also because they provide data for investigating crowdsourcing results in comparison to trusted annotations.



Task Design
The 4810 images were distributed among 1239 Human Intelligence Tasks (HITs) in AMT. Most of the HITs contained four images belonging to one of the selected fashion categories; a few contained 1, 2, or 3 images. Each HIT was assigned to three different people in each group of annotators. In total, 14,430 annotations were collected for each annotation task and group of annotators. For clarification purposes, each HIT contained a short description of the addressed fashion category together with a sample image.

Results
In order to investigate the level of agreement among the different kinds of annotators, a reliability analysis was performed using Fleiss' Kappa statistics [4]. This variant of the Kappa statistics used in the user studies in the Media Harvesting pipeline allows annotations from multiple raters. Results revealed higher agreement among trusted annotators than among AMT workers. For the annotation task on images' relevance to fashion, the agreement among AMT annotators achieved a Kappa value of .56, while the Kappa value among trusted annotators was .67. Similarly, in the second annotation task AMT workers achieved a Kappa value of .58, while trusted annotators achieved a value of .64.

This dataset has been published and hosted on the website of the MMSys conference (www.mmsys.org) in March 2013 and will be publicly available from then on. The dataset contains the actual images as well as several csv files including all metadata information and annotation data. For further information on the dataset, collection methodology, ground truth management, or envisioned applications, refer to [18]. This dataset proves useful in different applications such as image classification, segmentation and pose detection, as well as in a number of fashion related scenarios. Three specific use scenarios of this dataset have been envisioned as a result of the collaboration among WP2, WP7, and WP10 in CUbRIK. These scenarios are: "search similar images", which allows users to find fashion images similar to their own image; "play fun games", which elicits users' interest in different fashion items; and "what do I wear today", in which users upload their picture and get feedback from the community about their clothes and fashion related accessories.



7.

Conclusion

This document describes the work related to gathering and processing users' feedback in the context of H-Demos and V-Apps. It is an accompanying document of the actual implementations, which can be found in CUbRIK's svn at the following addresses:
• Media Entity Annotation H-Demo: https://89.97.237.243:443/svn/CUBRIK/Demos/MediaEntityAnnotation_UNITN_CERTH_LUH
• Cross-domain recognition in consumer photo collection H-Demo: https://89.97.237.243:443/svn/CUBRIK/Demos/PeopleRecognition_QMUL
• LikeLines: https://89.97.237.243:443/svn/CUBRIK/Demos/LikeLines_TUD
• Accessibility aware relevance feedback H-Demo: https://89.97.237.243:443/svn/CUBRIK/Demos/AccessibilityAwareRelevanceFeedback_CERTH

In the next period, the relevant H-Demos and implementations/user studies are to be integrated into the envisioned V-Apps (i.e. History of Europe and SME Fashion Trend). Furthermore, relevance feedback studies for the improvement of the V-Apps will be run. For example, we are working on the development of a crowdsourcing task which helps the face recognition component used by the HoE V-App. This crowdsourcing task will request people to indicate the minimum size of bounding boxes around faces. The face recognition component will use this information for tuning the component's input parameters and thus potentially decreasing the number of false positive face recognitions. In general, crowdsourcing is a promising approach for obtaining relevance feedback from large populations. However, the state of the art in crowdsourcing underestimates the importance of related issues such as the reliability of results, cost, and the influence of the user's profile on crowdsourcing performance. Future work on relevance feedback tasks will use the collected data to investigate these open issues and thus contribute to the overall quality of the released applications.



8.

References

[1] A.L. Radu, J. Stottinger, B. Ionescu, M. Menendez and F. Giunchiglia, "Representativeness and Diversity in Photos via Crowd-Sourced Media Analysis", Int. Workshop on Adaptive Multimedia Retrieval, Copenhagen, Denmark, 2012.
[2] C. Eickhoff and C. Vries, "Increasing cheat robustness of crowdsourcing tasks", Information Retrieval Journal, 2012.
[3] G. Kazai, J. Kamps and N. Milic-Frayling, "Worker types and personality traits in crowdsourcing relevance labels", in Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM '11), ACM, New York, NY, USA, pp. 1941-1944, 2011.
[4] J. Randolph, "Free-marginal multirater kappa: An alternative to Fleiss' fixed-marginal multirater kappa", in Joensuu University Learning and Instruction Symposium, 2005.
[5] G. Kazai, J. Kamps and N. Milic-Frayling, "An analysis of human factors and label accuracy in crowdsourcing relevance judgments", Information Retrieval Journal, Springer Netherlands, 2012.
[6] A.L. Radu, J. Stottinger, B. Ionescu, M. Menendez and F. Giunchiglia, "Enforcing quality in photo retrieval via Hybrid Media-Crowd Analysis", submitted.
[7] J. Mackinlay, "Automating the design of graphical presentations of relational information", ACM Transactions on Graphics (TOG), vol. 5, no. 2, pp. 110-141, 1986.
[8] H. Zhang and J. Weng, "Measuring multi-modality similarities via subspace learning for cross-media retrieval", in Advances in Multimedia Information Processing - PCM 2006, vol. 4261 of Lecture Notes in Computer Science, pp. 979-988, Springer Berlin / Heidelberg, 2006.
[9] Y.Y. Lin, T.L. Liu and C.S. Fuh, "Multiple kernel learning for dimensionality reduction", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 6, pp. 1147-1160, 2011.
[10] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training", in Proceedings of the Ninth International Conference on Information and Knowledge Management (CIKM '00), pp. 86-93, ACM, 2000.
[11] M. Ehrgott, Multicriteria Optimization, vol. 2, Springer Berlin, 2005.
[12] C.A.C. Coello, G.B. Lamont and D.A. Van Veldhuizen, Evolutionary Algorithms for Solving Multi-Objective Problems, vol. 5, Springer, 2007.
[13] H.C. Purchase, "Metrics for graph drawing aesthetics", Journal of Visual Languages & Computing, vol. 13, no. 5, pp. 501-516, 2002.
[14] E.M. Fine and G.S. Rubin, "Effects of Cataract and Scotoma on Visual Acuity: A Simulation Study", Optometry and Vision Science, vol. 76, no. 7, pp. 468-473, 1999.
[15] J.R. Lavery, J.M. Gibson, D.E. Shaw and A.R. Rosenthal, "Vision and visual acuity in an elderly population", Ophthal. Physiol. Opt., vol. 8, 1988.
[16] H. Hirvela, P. Koskela and L. Laatikainen, "Visual acuity and contrast sensitivity in the elderly", Acta Ophthalmologica Scandinavica, 1999.
[17] T.M.J. Fruchterman and E.M. Reingold, "Graph drawing by force-directed placement", Software: Practice and Experience, vol. 21, no. 11, pp. 1129-1164, 1991.
[18] B. Loni, M. Menendez, M. Georgescu, L. Galli, C. Massari, M. Melenhorst, M. Larson, I. Altingovde, R. Vliegendhart and D. Martinenghi, "Fashion-focused Creative Commons Social dataset", in Proceedings of the 4th Annual ACM SIGMM Conference on Multimedia Systems (MMSys 2013), Dataset track, Oslo, Norway, February 27 - March 1, 2013.



9.

Appendix 1

Percentage of Red/Green/Blue in the image
The percentage of each of the three main colour dimensions of the RGB colour space is a significant indicator for the accessibility of an image for persons with colour-blindness related disabilities. The aforementioned percentages can be estimated as shown below:

r = 100 · R / (R + G + B)
g = 100 · G / (R + G + B)
b = 100 · B / (R + G + B)

where R, G, B are the cumulative sums of the corresponding dimensions over all N pixels I(x, y) of the image I:

R = Σ_N I_R(x, y),  G = Σ_N I_G(x, y),  B = Σ_N I_B(x, y)
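Assuming the image is available as an iterable of (R, G, B) pixel tuples (image decoding, e.g. with a library such as Pillow, is left out), the percentages can be computed as in the following sketch:

```python
def rgb_percentages(pixels):
    """pixels: iterable of (R, G, B) tuples with values in 0..255.

    Returns (r, g, b) percentages of the cumulative colour sums,
    as defined above: r = 100 * R / (R + G + B), etc.
    """
    R = G = B = 0
    for pr, pg, pb in pixels:
        R += pr
        G += pg
        B += pb
    total = R + G + B
    if total == 0:                 # completely black image
        return 0.0, 0.0, 0.0
    return 100.0 * R / total, 100.0 * G / total, 100.0 * B / total

# Example: a two-pixel "image"
print(rgb_percentages([(255, 0, 0), (0, 255, 255)]))  # -> (33.33..., 33.33..., 33.33...)
```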

Initially, only the case of users suffering from protanopia (inability to perceive the R colour) will be studied. Thus, an image is graded as inaccessible on a 0-100% scale based on the percentage of the red colour in it.

Contrast ratio
In visual perception of the real world, contrast is the difference in colour and brightness between an object and the other objects within the same field of view; in an image, contrast is the difference in luminance and/or colour that makes an object distinguishable. Because the human visual system is more sensitive to contrast than to absolute luminance, we can perceive the world similarly regardless of the huge changes in illumination over the day or from place to place. The maximum contrast of an image is its contrast ratio or dynamic range. Alternatively, contrast can also be defined as the difference between the colour or shading of the printed material on a document and the background on which it is printed, for example in optical character recognition. In order to estimate the contrast ratio and to assign an accessibility factor to the corresponding image, the algorithm described in Figure 24 has been implemented.

Figure 24: Estimator of the contrast quality factor of an image

Initially the histogram and subsequently the cumulative histogram of the image are calculated as shown in Figure 25 and Figure 26.



Figure 25: Histogram of monochromatic image

Figure 26: Cumulative histogram of a monochromatic image

Then, using the least mean squares method, the coefficients c_0 and c_1 in the following equation are estimated:

f_LLS = c_0 + c_1 · I(x, y)

where f_LLS is the frequency estimated by the Linear Least Squares fit and I(x, y) is the grey level intensity of the pixel (x, y) of the monochromatic image I. Once the Root Mean Square (RMS) value is computed according to the following formula,

q_rms = sqrt( (1/N) Σ_i (hist[i] - f_LLS[i])² )

it is fed to a Fuzzy Inference System, which combines all inputs relevant to the same impairment into a common quality factor regarding the accessibility level of the image.
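The fitting step can be sketched as follows: the function fits the linear model f_LLS = c_0 + c_1·i to the histogram bins with ordinary least squares and returns the RMS deviation q_rms. Whether the plain or the cumulative histogram is passed in, and how q_rms is subsequently mapped to a quality factor by the Fuzzy Inference System, are left open here, so this is purely an illustrative sketch.

```python
def contrast_quality(hist):
    """RMS deviation of a histogram from its least-squares linear fit.

    hist: list of bin frequencies, indexed by grey level (plain or
    cumulative histogram, depending on the chosen convention).
    Returns q_rms as in the formula above; the mapping of q_rms to an
    accessibility quality factor (the Fuzzy Inference System) is omitted.
    """
    n = len(hist)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(hist) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, hist))
    c1 = sxy / sxx if sxx else 0.0            # slope
    c0 = mean_y - c1 * mean_x                 # intercept
    f_lls = [c0 + c1 * x for x in xs]
    q_rms = (sum((h - f) ** 2 for h, f in zip(hist, f_lls)) / n) ** 0.5
    return q_rms

# Example: a flat histogram deviates little from its linear fit
print(contrast_quality([10, 10, 10, 10]))     # -> 0.0
```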

Figure 27: Cumulative histogram of a monochromatic image

Relative Luminance
A core feature for the estimation of the contrast ratio is the relative luminance. Relative luminance follows the photometric definition of luminance, is defined within the Web Content Accessibility Guidelines (WCAG), and is a significant accessibility related feature for images as web content. Similarly to the photometric definition, it is related to the luminous flux density in a particular direction. The use of relative values is useful in systems where absolute reproduction is impractical. For example, in prepress for print media, the absolute luminance of light reflecting off the print depends on the illumination and therefore absolute reproduction cannot be assured. In particular, the relative brightness of any point in the image is initially normalized in the RGB colour space to 0 for the darkest black and 1 for the lightest white, as shown below:

sR = R / 255,  sG = G / 255,  sB = B / 255

A further normalization then takes place (shown here for the red channel; the green and blue channels are treated identically):

r = sR / 12.92                          if sR <= 0.03928
r = ((sR + 0.055) / 1.055)^2.4          otherwise

forming in this way the relative luminance colour space.
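A small sketch of the computation is given below; the final weighted sum uses the standard WCAG 2.0 luminance coefficients, which are not spelled out in the text above but come from the same guideline.

```python
def _linearise(channel):
    """WCAG 2.0 channel linearisation (channel already scaled to [0, 1])."""
    if channel <= 0.03928:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4

def relative_luminance(R, G, B):
    """Relative luminance of an sRGB pixel with 8-bit channels (0..255)."""
    r = _linearise(R / 255.0)
    g = _linearise(G / 255.0)
    b = _linearise(B / 255.0)
    # WCAG 2.0 weighting of the linearised channels
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

print(relative_luminance(255, 255, 255))    # -> 1.0 (lightest white)
print(relative_luminance(0, 0, 0))          # -> 0.0 (darkest black)
```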



10. Appendix 2

Formally, the problem of multimodal placement can be stated as follows: A set O of multimodal objects is considered:

O = {o_1, o_2, ..., o_N}

where N > 0 is the number of multimodal objects. Each multimodal object o_i consists of exactly M media items, one from each of M total modalities. A modality is a type of object representation, such as an image, a sound, a video etc. A media item of modality m, m = 1...M, e.g. an image, is represented by a feature vector of dimensionality specific to modality m. It should be noted at this point that, although the proposed method is generic and can be applied to any modality, herein the modalities will refer to uncorrelated accessibility quality factors (i.e. visual related accessibility factor 1, visual related accessibility factor 2, hearing related accessibility factor 1, ..., cognitive related accessibility factor 1), without loss of generality, in order to trigger the users to provide direct feedback regarding their impairment preferences. Thus, a multimodal object o_i consists of M feature vectors, representing the media items of the various modalities. Placing the multimodal objects on the screen means defining a mapping

p : O → R²

which associates each o_i with a point in the 2D space. p defines a placement of the multimodal objects on the screen and is considered to be taken from the space P of all possible mappings (placements):

p ∈ P

The problem is to find a mapping3 p, so that the structure of the dataset (the similarities and dissimilarities among the objects) is easily and instantly apparent to the human viewer. The rationale behind this proposal is the following: Let a unimodal case be considered, where the objects o_i, i = 1...N, consist of just one modality m_1. As far as this modality is concerned, the objects of the dataset have certain similarities and dissimilarities among them. The desired mapping p of the objects would be one in which these similarities and dissimilarities are apparent. Let

J_{m_1} : P → [0, 1]

be an objective function, evaluating the capability of a mapping p ∈ P for revealing the relationships among the data, as far as only modality m_1 is concerned. If J_{m_1}(p) = 1, it means that p is appropriate for revealing the data structure (i.e. it is a very good placement), while if J_{m_1}(p) = 0, p is not appropriate. In such a setting, the goal would be to find the optimal mapping that maximizes J_{m_1}(p):

p_{opt,m_1} = arg max_{p ∈ P} J_{m_1}(p)

3 In the following, the terms mapping and placement will be used interchangeably.



If now a different modality m_2 is considered, instead of m_1, then the similarities and dissimilarities among the objects are generally different than with m_1. Thus, a different placement is generally optimal in this case. This placement is the one that maximizes a different objective function,

J_{m_2} : P → [0, 1]

which is specific to modality m_2. This objective function quantifies how appropriate p is for revealing the relationships among the objects of the dataset, as far as only m_2 is concerned. For example, if m_1 is the image modality and m_2 is the sound modality, then, as far as the image modality is concerned, a proper placement p could be one where objects with similar colour in their images are close to each other. If, on the other hand, the sound modality is considered, a proper placement p could be one where objects with similar spectrum in their sounds are close to each other. These two placements are generally different, as similarity with respect to colour does not necessarily mean similarity with respect to sound spectrum. In a multimodal setting, both modalities need to be taken into account. The goal here is to find a placement which simultaneously maximizes both objective functions:

p_{opt,multimodal} = arg max_{p ∈ P} ( J_{m_1}(p), J_{m_2}(p) )

where max_{p ∈ P} ( J_{m_1}(p), J_{m_2}(p) ) means simultaneously maximizing both J_{m_1}(p) and J_{m_2}(p). If such a placement could be found, then data relationships as far as both modalities are concerned would be simultaneously apparent and all information from both modalities would be presented to the viewer. However, it is usually not possible to find such an ideal placement, because, as mentioned above, the two objective functions are maximized at different placements. Such problems of optimizing multiple conflicting objective functions are handled by the field of multi-objective optimization [11][12]. Multi-objective optimization methods, as mentioned below, result in a set of solutions, instead of a single one. These solutions are the most efficient among all possible solutions, but are mutually incomparable. They represent different trade-offs among the various objective functions. If such a set of placements is calculated, these placements are the most efficient ones that can be objectively calculated. It is then up to the users to select the ones that are best for them, according to their preferences. Viewing the problem of multimodal visualization from a multi-objective perspective seems to be reasonable. An advantage that this perspective has over existing approaches is that, by providing a set of the objectively most efficient solutions, no information regarding the relationships among the objects of the dataset is lost until the user is brought in for the final decision.

Hereby follow some preliminary notations and formulations from the field of multi-objective optimization. The notation is kept similar to the notation for multimodal placement, for better correspondence. Multi-objective optimization deals with simultaneously maximizing (without loss of generality, the optimization problem is assumed to be a maximization problem) a vector of M objective functions J(p) = (J_1(p), J_2(p), ..., J_M(p)), over a variable p, which takes values from a set P. The problem is formulated as

max_{p ∈ P} J(p)

Generally there is no single solution to this problem, as the various objective functions are generally maximized for different values of p. There are generally three ways to handle this conflict:
• Combining the multiple objectives into one objective function, according to some known user preferences, and maximizing the combined objective with a standard single-objective optimization method.
• Performing a standard single-objective optimization method, using, at each iteration step, a different objective function, taken successively from the set of objectives.
• Calculating all objectively efficient solutions and letting the users decide on the best one, according to their preferences.

The first two approaches result in a single solution. They correspond to the Multiple Kernel Learning and Co-training paradigms of multimodal learning, respectively. In this sense, multi-objective optimization can be seen as a general framework that includes existing multimodal learning approaches. Providing the user with many possible efficient solutions, instead of one, eliminates information loss and thus has an advantage over existing methods. Thus, the focus hereby is given to the third of the above approaches.

In order to calculate the most efficient solutions, there is a need to compare different objective function vectors J(p), resulting from different values of the variable p. Such comparisons are performed using the notion of Pareto dominance. An objective vector J(p_1) = (J_1(p_1), J_2(p_1), ..., J_M(p_1)), resulting from variable p_1, is said to dominate another objective vector J(p_2) = (J_1(p_2), J_2(p_2), ..., J_M(p_2)), resulting from variable p_2, if

J_m(p_1) ≥ J_m(p_2) ∀m ∈ {1, ..., M}  and  ∃k ∈ {1, ..., M} : J_k(p_1) > J_k(p_2)

Pareto dominance is denoted as J(p_1) ≻ J(p_2) (J(p_1) dominates J(p_2)). Similarly, a variable p_1 is said to dominate another variable p_2 (p_1 ≻ p_2) if J(p_1) ≻ J(p_2). If p_1 ≻ p_2, then p_1 is objectively a better solution than p_2, since at least one of the objective function values for p_1 is larger than the respective value for p_2, without any other objective function of p_1 having a smaller value than the respective one of p_2. If two objective function vectors mutually do not dominate each other, they are said to be incomparable, since there can be no objective judgment as to which is better than the other.

The goal of multi-objective optimization is to find the set of all solutions which are not dominated by any other solution, but are mutually incomparable. This set is called the Pareto Set, and the corresponding values of the objective function vectors are called the Pareto Front. Thus, the solution to the multi-objective optimization problem is the Pareto Set. The Pareto Set may be infinite, as there may be infinitely many non-dominated solutions, so the goal is usually to approximate the Pareto Set with a finite number of discrete solutions, which are representative of the whole solution set. Genetic algorithms are usually used for approximating the Pareto Set. Genetic algorithms are convenient for multi-objective optimization as they maintain a population of solutions at each iteration, instead of a single solution, as other optimization methods do. The general outline of their usage is presented below.

The problem of multimodal placement can be expressed in terms of multi-objective optimization, if a mapping p is considered as the optimization variable. Then, proper objective functions J_m, m = 1...M, need to be defined, one for each modality, evaluating the various mappings, so that the optimal mapping to be found is the one which simultaneously maximizes all objective functions. In the following, such objective functions are defined with the use of graphs, as an example of using the framework of multi-objective optimization to approach a multimodal placement problem.
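The dominance test and a brute-force extraction of a (finite) Pareto front follow directly from these definitions; the genetic-algorithm approximation mentioned above is not shown here, so this is only a small illustrative sketch.

```python
def dominates(J1, J2):
    """True if objective vector J1 Pareto-dominates J2 (maximization)."""
    return all(a >= b for a, b in zip(J1, J2)) and any(a > b for a, b in zip(J1, J2))

def pareto_front(objective_vectors):
    """Return the non-dominated (mutually incomparable) objective vectors."""
    front = []
    for candidate in objective_vectors:
        if not any(dominates(other, candidate)
                   for other in objective_vectors if other is not candidate):
            front.append(candidate)
    return front

# Example with two objectives J_1 and J_2:
vectors = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9)]
print(pareto_front(vectors))   # -> [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9)]
```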



The problem of multimodal placement can be expressed in terms of multi-objective optimization if a mapping $p$ is considered as the optimization variable. Proper objective functions $J_m$, $m = 1 \ldots M$, one for each modality, then need to be defined to evaluate the candidate mappings, so that the optimal mapping is the one which simultaneously maximizes all objective functions. In the following, such objective functions are defined with the use of graphs.

In this section, an example of using the multi-objective optimization framework to approach the multimodal placement problem is presented; multimodal placement is hereby accomplished with the use of graphs.

Let the set $O$ of multimodal objects again be considered:

$$O = \{o_1, o_2, \ldots, o_N\}.$$

Each object $o_i \in O$ consists of $M$ multimedia items, one from each modality:

$$o_i = \{o_{i,1}, o_{i,2}, \ldots, o_{i,M}\}, \quad i = 1 \ldots N.$$

Each multimedia item $o_{i,m}$ is considered to be a feature vector of dimensionality specific to modality $m$:

$$o_{i,m} \in \mathbb{R}^{k_m}, \quad i = 1 \ldots N, \ m = 1 \ldots M,$$

where $k_m$ is the dimensionality of the feature vectors of modality $m$. These feature vectors are numerical representations of specific characteristics of the actual multimedia items (characteristics relevant to the application at hand). A dissimilarity (or distance) function $D_m$ is defined between two multimedia items of modality $m$:

$$D_m : \mathbb{R}^{k_m} \times \mathbb{R}^{k_m} \rightarrow \mathbb{R}_{\ge 0}, \quad k_m \in \mathbb{N}_{>0}, \ m = 1 \ldots M,$$

where $\mathbb{R}_{\ge 0}$ is the set of non-negative real numbers and $\mathbb{N}_{>0}$ the set of positive natural numbers. This function is a measure of distance between two feature vectors of modality $m$ (of dimensionality $k_m$). For convenience, a distance function $d_m$ between two whole multimodal objects, with respect to modality $m$, is also defined as

$$d_m : O \times O \rightarrow \mathbb{R}_{\ge 0}, \quad d_m(o_1, o_2) \equiv D_m(o_{1,m}, o_{2,m}), \quad m = 1 \ldots M.$$

This is the distance between two whole multimodal objects when only their $m$-modality components are considered. It reflects the similarities and dissimilarities among the objects of the dataset according to the various modalities. For instance, if the image modality is considered, the objects are compared according to the image modality, and the distances among them are generally different than if the sound modality were considered instead.

Considering modality $m$ and the respective distance measure $d_m$, a neighbourhood graph $G_m(O, E_m)$ can be constructed for the set of multimodal objects. $G_m$ has the set of multimodal objects $O$ as its vertices, and $E_m \subseteq O \times O$ is the set of edges between vertices. An edge exists between two vertices if the distance between the respective objects is less than a threshold:

$$E_m = \{(o_i, o_j) \mid o_i, o_j \in O, \ i, j = 1 \ldots N, \ i \ne j, \ d_m(o_i, o_j) < T_m\}, \quad T_m \in \mathbb{R}_{\ge 0}, \ m = 1 \ldots M,$$

where $T_m$ is a distance threshold specific to each modality $m$. Since there are $M$ modalities, $M$ neighbourhood graphs are constructed, one for each modality. These graphs capture the neighbourhood relationships among the objects according to the various modalities. All $M$ graphs have the same set of vertices (the multimodal objects), while the set of edges is different for each modality. It is the set of edges that actually contains the neighbourhood information, so in the following the edge sets $E_m$ are considered instead of the whole graphs $G_m$.
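As an illustration of this construction, the Python sketch below builds the per-modality edge sets $E_m$ from per-modality feature vectors, a distance function and a threshold. The feature data, the choice of Euclidean distance, and the threshold values are placeholder assumptions used only for the example; they are not taken from the CUbRIK components described in this deliverable.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def euclidean(x: List[float], y: List[float]) -> float:
    """Example distance D_m between two feature vectors of the same modality."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def neighbourhood_edges(features: List[List[float]], threshold: float) -> List[Tuple[int, int]]:
    """Edge set E_m: connect objects i, j whenever d_m(o_i, o_j) < T_m."""
    return [(i, j) for i, j in combinations(range(len(features)), 2)
            if euclidean(features[i], features[j]) < threshold]


if __name__ == "__main__":
    # Hypothetical dataset: N = 4 objects, M = 2 modalities (e.g. image and sound),
    # each modality with its own feature dimensionality k_m and threshold T_m.
    modalities: Dict[str, List[List[float]]] = {
        "image": [[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.9]],
        "sound": [[1.0], [3.0], [1.2], [2.9]],
    }
    thresholds = {"image": 0.2, "sound": 0.5}
    edge_sets = {m: neighbourhood_edges(v, thresholds[m]) for m, v in modalities.items()}
    print(edge_sets)  # {'image': [(0, 1), (2, 3)], 'sound': [(0, 2), (1, 3)]}
```

Note how the two modalities yield different edge sets over the same vertices, which is exactly the situation the multi-objective placement has to reconcile.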

The edge sets $E_m$ can be used to define objective functions $J_m$ that evaluate the appropriateness of a specific object placement $p \in P$ for revealing the relationships among the data:

$$J_m : P \rightarrow [0,1], \quad m = 1 \ldots M.$$

In order to evaluate a placement $p$, the following procedure is followed. The multimodal objects are placed on the 2D plane according to placement $p$. Then each modality is considered separately: for modality $m$, the respective set of edges $E_m$ is drawn on the 2D plane as lines connecting the respective objects. The result is a drawing of the neighbourhood graph for modality $m$, whose vertices are placed according to placement $p$. The appropriateness of $p$ is evaluated in terms of how aesthetically pleasing this drawing is, i.e. how easily the graph's structure can be perceived. Measures of graph aesthetics have already been defined and used in the literature [13], e.g. measures counting the number of edge crossings in the graph. The objective functions for the multi-objective optimization problem can be defined as such aesthetic measures: for each modality $m$, a different objective function $J_m$ is defined, according to the respective set of edges $E_m$. In the following subsections, two graph aesthetic measures, taken from [13], are examined as examples of defining the objective functions $J_m$. Once the multiple objective functions are defined, they can be used in the multi-objective optimization framework, and the optimization procedure can be performed, e.g. using genetic algorithms.

Objective function based on edge crossings

A commonly used graph aesthetic measure is the number of edge crossings: a graph drawing is considered aesthetically pleasing if it contains as few edge crossings as possible. Some preliminary definitions concerning edge crossings are given first, followed by the definition of the final objective function. A line segment determined by the points $a$ and $b$, $a, b \in \mathbb{R}^2$, crosses another line segment determined by the points $c$ and $d$, $c, d \in \mathbb{R}^2$, if

$$\det([b-a \;\; c-a]) \cdot \det([b-a \;\; d-a]) < 0$$

and

$$\det([d-c \;\; a-c]) \cdot \det([d-c \;\; b-c]) < 0,$$

where the points are considered as 2D column vectors and $\det(A)$ denotes the determinant of matrix $A$. An auxiliary function $f$ can be defined, determining whether two line segments cross each other or not:

$$f : (\mathbb{R}^2)^4 \rightarrow \{0,1\}, \quad f(a, b, c, d) = \begin{cases} 1, & \text{if } \det([b-a \;\; c-a]) \cdot \det([b-a \;\; d-a]) < 0 \ \text{and} \ \det([d-c \;\; a-c]) \cdot \det([d-c \;\; b-c]) < 0, \\ 0, & \text{otherwise,} \end{cases}$$

with $a, b, c, d \in \mathbb{R}^2$. In other words, $f(a, b, c, d) = 1$ if the line segment determined by points $a$ and $b$ crosses the line segment determined by points $c$ and $d$, and $f(a, b, c, d) = 0$ if the two segments do not cross each other. Given the set of multimodal objects $O$, the set of graph edges $E_m$ and a mapping $p \in P$, the set of edge crossings $C_{m,p} \subseteq E_m \times E_m$ is defined as

$$C_{m,p} = \left\{ \big((o_i, o_j), (o_k, o_l)\big) \mid (o_i, o_j) \in E_m, \ (o_k, o_l) \in E_m, \ f\big(p(o_i), p(o_j), p(o_k), p(o_l)\big) = 1 \right\}, \quad m = 1 \ldots M, \ p \in P.$$

The number of edge crossings for modality $m$ and placement $p$, $c_{m,p}$, is the cardinality of the set $C_{m,p}$:

$$c_{m,p} = |C_{m,p}|, \quad m = 1 \ldots M, \ p \in P.$$

In order for the final objective function to take values in the $[0,1]$ interval, the number of crossings is divided by the maximum number of crossings for $E_m$, $c_{m,\max}$. The maximum number of crossings is [13]:

$$c_{m,\max} = \frac{|E_m|\,(|E_m| - 1)}{2} - \frac{1}{2} \sum_{j=1}^{N} \mathrm{degree}(o_j)\,\big(\mathrm{degree}(o_j) - 1\big), \quad o_j \in O.$$

Finally, the objective function for modality $m$ and placement $p$ is defined as

$$J_m(p) = 1 - \frac{c_{m,p}}{c_{m,\max}},$$

where the normalized number of crossings is subtracted from 1, so that $J_m(p)$ takes the value 1 when there are no crossings (i.e. a "good" placement) and 0 when the number of crossings is maximal (i.e. a "bad" placement).
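The edge-crossing objective can be sketched as follows. The Python fragment below is an illustrative sketch under assumed data structures (a placement as a list of 2D coordinates, edges as index pairs) and is not taken from the CUbRIK code base; it implements the determinant-based crossing test $f$, counts crossings over all edge pairs, and normalizes by $c_{m,\max}$ as defined above.

```python
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def det2(u: Point, v: Point) -> float:
    """Determinant of the 2x2 matrix with columns u and v."""
    return u[0] * v[1] - u[1] * v[0]


def crosses(a: Point, b: Point, c: Point, d: Point) -> bool:
    """Crossing test f(a, b, c, d): segments ab and cd properly intersect."""
    sub = lambda u, v: (u[0] - v[0], u[1] - v[1])
    return (det2(sub(b, a), sub(c, a)) * det2(sub(b, a), sub(d, a)) < 0 and
            det2(sub(d, c), sub(a, c)) * det2(sub(d, c), sub(b, c)) < 0)


def crossing_objective(placement: List[Point], edges: List[Edge]) -> float:
    """J_m(p) = 1 - c_{m,p} / c_{m,max}, computed for one modality m."""
    # c_{m,p}: number of crossing edge pairs under the given placement.
    c = sum(crosses(placement[i], placement[j], placement[k], placement[l])
            for (i, j), (k, l) in combinations(edges, 2))
    # c_{m,max}: all edge pairs minus the pairs sharing a vertex (which cannot cross).
    degree = [sum(v in e for e in edges) for v in range(len(placement))]
    c_max = len(edges) * (len(edges) - 1) / 2 - sum(d * (d - 1) for d in degree) / 2
    return 1.0 if c_max == 0 else 1.0 - c / c_max


if __name__ == "__main__":
    # Hypothetical example: 4 objects placed at the corners of the unit square,
    # with two edges forming the diagonals, which cross exactly once.
    placement = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
    edges = [(0, 1), (2, 3)]
    print(crossing_objective(placement, edges))  # 0.0: the only edge pair crosses
```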

Objective function based on potential

As an alternative, a potential-based objective function can also be used. In [17] it is stated that an aesthetically pleasing graph drawing is produced by considering the graph's vertices as repelling electric charges and its edges as attractive springs attached to pairs of vertices. Starting from a random initial placement of the vertices, this dynamical system is allowed to run until convergence. The final result is an aesthetically pleasing and easily perceivable drawing of the graph, in which vertices connected by lighter edges (smaller distances) are drawn close to each other. Based on this method, an objective function is defined for the drawing of a unimodal graph $G_m(O, E_m)$ as the potential of the mechanical system of its vertices:

$$J_m(p) = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \ne i}}^{N} \frac{q^2}{\| p_i - p_j \|} + \sum_{i,j:\,(o_i, o_j) \in E_m} k\, \| p_i - p_j \|^2,$$

where $q$ is the electric charge of the vertices, $k$ is the spring constant and $\| p_i - p_j \|$ denotes the Euclidean distance between points $i$ and $j$ in placement $p$.

The first term corresponds to the repelling electric forces, summing the magnitudes of the forces acting on each vertex over all vertices; the magnitude of the electric forces is calculated according to Coulomb's law. Similarly, the second term corresponds to the attractive spring forces, according to Hooke's law. Minimizing this potential leads to a low-energy, and thus aesthetically pleasing, placement. The same objective is also applied to the Minimum Spanning Tree of graph $G_m$, which contains a subset of the graph's edges while maintaining the graph's structure and thus produces a cleaner visualization.
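A minimal sketch of this potential computation is given below, assuming the same placement and edge-list representation as in the previous example. The charge $q$ and spring constant $k$ are arbitrary illustrative values, and the Minimum Spanning Tree variant mentioned above is not included; this is not the CUbRIK implementation itself.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Edge = Tuple[int, int]


def potential_objective(placement: List[Point], edges: List[Edge],
                        q: float = 1.0, k: float = 1.0) -> float:
    """Potential of the charge-and-spring system for one modality's graph:
    Coulomb repulsion between every pair of vertices plus Hooke attraction
    along the graph edges. Lower values correspond to nicer drawings."""
    dist = lambda a, b: math.dist(placement[a], placement[b])
    # Repulsion term: q^2 / ||p_i - p_j|| summed over all ordered pairs i != j.
    repulsion = sum(q * q / dist(i, j)
                    for i in range(len(placement))
                    for j in range(len(placement)) if i != j)
    # Attraction term: k * ||p_i - p_j||^2 summed over the graph edges.
    attraction = sum(k * dist(i, j) ** 2 for i, j in edges)
    return repulsion + attraction


if __name__ == "__main__":
    # Hypothetical placement of 3 objects with a single edge between objects 0 and 1.
    placement = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    edges = [(0, 1)]
    print(potential_objective(placement, edges))
```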

CUbRIK R1 Pipelines for Relevance Feedback

Page 45

D7.1 Version 1.0


Turn static files into dynamic content formats.

Create a flipbook
Issuu converts static files into: digital portfolios, online yearbooks, online catalogs, digital photo albums and more. Sign up and create your flipbook.