ARCHITECTURE DESIGN Human-enhanced time-aware multimedia search
CUbRIK Project IST-287704 Deliverable D9.8 WP9
Deliverable Version 1.0 – 30 September 2012 Document ref.: cubrik.D98.ATN.WP9.V1.0
Programme Name: IST
Project Number: 287704
Project Title: CUbRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INNEN, HOM, CVCE, EIPCM
Document Number: CUbRIK.D98.ATN.WP9.V1.0
Work-Package: WP2
Deliverable Type: Document
Contractual Date of Delivery: 30 September 2012
Actual Date of Delivery: 30 September 2012
Title of Document: Architecture Design
Author(s): Piero Fraternali (POLMI), Alessandro Bozzon (POLMI), Björn Decker (ATN), Mathias Otto (ATN), Vincenzo Croce (ENG), Lorenzo Eccher (ENG)
Approval of this report:
Summary of this report:
History:
Keyword List:
Availability: This report is public
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.
Disclaimer This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.
Table of Contents

EXECUTIVE SUMMARY
1. INTRODUCTION
   1.1 ARCHITECTURE DESIGN GOAL
   1.2 HIGH LEVEL VIEW OF THE CUBRIK ARCHITECTURE
   1.3 CUBRIK PIPELINES
      1.3.1 CUbRIK Pipeline Synchronization
      1.3.2 CUbRIK Pipeline Data Exchange and Storage
      1.3.3 Data persistence
2. SMILA ARCHITECTURE AND CORE COMPONENTS
   2.1 MAPPING BETWEEN THE CUBRIK ARCHITECTURE AND SMILA
   2.2 FOUNDATIONAL CONCEPTS OF SMILA
   2.3 ASYNCHRONOUS WORKFLOWS
   2.4 COMMON BEHAVIOUR OF JOBMANAGER DEFINITION APIS
   2.5 CONFIGURING SMILA SYNCHRONOUS WORKFLOWS: BPEL DESIGNER
3. DESIGN OF THE CUBRIK ARCHITECTURE
   3.1 CONTENT AND USER ACQUISITION TIER
      3.1.1 The Subscription Manager
      3.1.2 The Upload and Crawling Managers
      3.1.3 The Content and Metadata Acquisition Manager
      3.1.4 The Copyright Awareness Manager
   3.2 CONTENT PROCESSING TIER
      3.2.1 The Conflict Manager
      3.2.2 The Performer Manager
      3.2.3 The Conflict Resolution Applications
      3.2.4 The Relevance Feedback Manager/Pipelines
   3.3 QUERY AND SEARCH TIER
      3.3.1 Query Applications
      3.3.2 The Query Broker
   3.4 STORES AND INDEXES
      3.4.1 IAS as default storage
4. DESIGN OF CUBRIK COMPONENTS
   4.1 COMPONENTS INTEGRATED IN SMILA
      4.1.1 Upload Manager
      4.1.2 Crawling Manager
      4.1.3 Content and metadata acquisition manager
      4.1.4 Copyright Awareness manager
      4.1.5 Data stores
      4.1.6 CUbRIK Id Generator
      4.1.7 Content Processing Manager
      4.1.8 Relevance Feedback Pipelines and Manager
      4.1.9 Domain / Demonstrator Specific Components and Workflows
   4.2 STAND-ALONE COMPONENTS
      4.2.1 Performer Manager
      4.2.2 Conflict Manager
REFERENCES
Tables and Figures

Table 1: IAS as default data storage
Table 2: Mapping of stores to the corresponding Data Models of D2.1
Table 3: Mapping of stores to data types
Figure 1: Essential aspects of the CUbRIK architecture (from Project DoW)
Figure 2: CUbRIK Pipeline Structure
Figure 3: Example of CUbRIK Pipeline graphic representation
Figure 4: Overview of the SMILA architecture and its mapping to the CUbRIK tiers
Figure 5: Screenshot of BPEL Designer
Figure 6: Detailed CUbRIK Architecture with integrated query and search tier
Figure 7: CUbRIK PAAS and SAAS back-ends
Figure 8: Different CUbRIK Users
Figure 9: Process of authentication via OAuth (exemplary)
Figure 10: State diagram of a task
Figure 11: High-level view of the Performer Manager Architecture
Figure 12: High Level Architecture of the Conflict Manager
Figure 13: UML Class diagram of the Conflict Manager Java API
Figure 14: Sequence Diagram for Task Creation
Figure 15: Conflict Resolution Manager and an external module Interaction
Figure 16: Interaction between the Conflict Resolution Manager and an external module
Executive Summary
Deliverable D9.8 provides a comprehensive view of the CUbRIK architecture. The design of the CUbRIK architecture is an example of differential design. The architecture is based on SMILA1 as the underlying framework for supporting workflow definition and execution. The goal is to exploit SMILA capabilities to enable easy integration of data source connectors, search engines, sophisticated analysis methods and other components, gaining scalability and reliability out of the box. Therefore, the design of the CUbRIK architecture has proceeded by:
• Identifying the technical requirements of human computation enhanced multimedia processing in CUbRIK;
• Analyzing the capacities of SMILA related to CUbRIK objectives;
• Identifying gaps between SMILA and CUbRIK;
• Designing the architectural extensions needed to bridge the identified gaps;
• Designing the new components that implement advanced CUbRIK functionality on top of the basic functions of SMILA;
• Integrating those components in the SMILA framework.
Chapter 1 provides an introduction to CUbRIK concepts (such as Pipelines) and the CUbRIK architecture. Chapter 2 describes the SMILA infrastructure and its role with respect to the CUbRIK Platform; moreover, it describes in depth the underlying mechanisms exploited. Chapter 3 goes through the overall CUbRIK architecture, tier by tier, describing the main components and related functionalities. Finally, Chapter 4 details the implementation of several logical components, both integrated in SMILA and stand-alone.
1 http://www.eclipse.org/smila/
1.
Introduction
In this Section, prior to delving into technical details, we recap the essential objectives of CUbRIK and the high-level view of the platform architecture. The main goal of the Project is stated in the DoW as follows: the main goal of CUbRIK is to create an open platform for multimedia search. This goal will be achieved either by implementing components in-house or by leveraging third-party components.
Multimedia search engines today are "black-box" systems. This closed architecture makes it difficult for technology providers, application integrators, and end-users to experiment with novel approaches for multimedia content and query processing, because there is no place where one can deploy content, components, and processes, integrate them with complementary technologies, and assess the results in a real and scalable environment. The key technical principle of CUbRIK is to create a "white-box" version of a multimedia content & query processing system, by unbundling its functionality into a set of search processing Pipelines, i.e., orchestrations of open source and third-party components instantiating current algorithms for multimedia content analysis, query processing, and relevance feedback evaluation. Examples will be Pipelines for extracting metadata from media collections using the software mix that best fits application requirements, for processing multimodal queries, and for analyzing user feedback in novel ways.
CUbRIK aims at constructing an open platform for multimedia search practitioners, researchers and end-users, where different classes of contributors can meet and advance the state of the art by joining forces. Important scientific contributions will be the systematic integration of human and social computation in the design and execution of Pipelines, and the enrichment of multimedia content and query processing with temporal and spatial entities. On the business side, CUbRIK will foster an ecosystem where a multitude of actors will cooperate to implement real application scenarios that validate the platform features in real-world conditions and for vertical search domains. The CUbRIK community will bring together technology developers, software integrators, social network and crowdsourcing providers, content owners and SMEs, to promote the open search paradigm for the creation of search solutions tailored to user needs in vertical domains.
To achieve this vision, a sound technical foundation is needed. In this deliverable we present the architecture that provides this foundation. This architectural description will guide the implementation of the services provided by the CUbRIK platform.
1.1
Architecture Design Goal
The goal of the CUbRIK architecture design is to provide a light-weight platform for executing Pipelines; each Pipeline is an arbitrary mix of human and machine tasks. Pipelines use heterogeneous components, mixing open-source and proprietary software components and services realized by different organizations. The architectural concept of CUbRIK differs from the development of a monolithic do-it-all architecture (e.g., multimedia databases), in that:
• There are no technological assumptions on the nature of the multimedia data processing components that participate in a content or query processing Pipeline;
• There are no assumptions on the structure of the Pipelines that realize a piece of CUbRIK functionality. Each Pipeline is characterized by its specific workflow;
• The integration mechanism of Pipelines and components is a conceptual model of data (specified in D2.1), which expresses the Object Model of CUbRIK data. Each component is responsible for adapting itself to the CUbRIK data model, via appropriate input/output adaptors (an illustrative sketch is given at the end of this subsection);
• In principle, there are also no assumptions on the data storage technology and on the location of data. Components may use different data storage and distribution technologies and policies.
If, for example, for performance reasons, multiple components or Pipelines need to share ad hoc data structures and formats (e.g., data caches), they remain responsible for the inter-component and inter-Pipeline communication and data exchange protocols local to such a component/Pipeline pool; when they communicate with other generic components and Pipelines (e.g., for accessing input data collections, for communicating conflicts to a conflict resolution module, for interacting with general-purpose crowdsourcing platforms), they must adhere to the CUbRIK data model for data representation and to SOAP and REST for multimedia Web service interactions, possibly enhanced with MTOM2 and SOAP Messages with Attachments3.
The design of the CUbRIK architecture is an example of differential design. The architecture is based on SMILA4 as the underlying framework for supporting workflow definition and execution. The goal was to exploit SMILA capabilities to enable easy integration of data source connectors, search engines, sophisticated analysis methods and other components, gaining scalability and reliability out of the box. Therefore, the design of the CUbRIK architecture has proceeded by:
• Identifying the technical requirements of human computation enhanced multimedia processing in CUbRIK;
• Analyzing the capacities of SMILA related to CUbRIK objectives;
• Identifying gaps between SMILA and CUbRIK;
• Designing the architectural extensions needed to bridge the identified gaps;
• Designing the new components that implement advanced CUbRIK functionality on top of the basic functions of SMILA;
• Integrating those components in the SMILA framework.
2 World Wide Web Consortium (W3C). SOAP Message Transmission Optimization Mechanism. http://www.w3.org/TR/soap12-mtom/
3 World Wide Web Consortium (W3C). SOAP Messages with Attachments. http://www.w3.org/TR/SOAP-attachment
4 http://www.eclipse.org/smila/
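The adaptor principle from the list above can be made concrete with a minimal sketch. The interfaces and field names below are illustrative assumptions, not the D2.1 data model or an actual CUbRIK API; records are represented generically as maps.

    import java.util.HashMap;
    import java.util.Map;

    /** Illustrative stand-ins for the input/output adaptors described above. */
    interface CubrikInputAdaptor<T> {
        /** Convert a CUbRIK-model record into the component's native input format. */
        T fromCubrikRecord(Map<String, Object> record);
    }

    interface CubrikOutputAdaptor<T> {
        /** Convert the component's native output back into a CUbRIK-model record. */
        Map<String, Object> toCubrikRecord(T nativeResult);
    }

    /** Example: a face detector whose native result is an array of bounding boxes. */
    class FaceDetectorOutputAdaptor implements CubrikOutputAdaptor<int[][]> {
        @Override
        public Map<String, Object> toCubrikRecord(int[][] boxes) {
            Map<String, Object> record = new HashMap<>();
            record.put("annotationType", "faceRegion"); // field names are assumptions
            record.put("regionCount", boxes.length);
            return record;
        }
    }

In this arrangement the component keeps its native data format internally and only its boundary adaptors need to track changes in the CUbRIK data model.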
1.2
High level view of the CUbRIK architecture
The original concept of the CUbRIK architecture is specified in the Project DoW, as shown in Figure 1.
Figure 1: Essential aspects of the CUbRIK architecture (from Project DoW)
This schema highlights that the CUbRIK architecture relies on a framework for executing processes (aka Pipelines), consisting of collections of tasks to be executed in a distributed fashion. Each Pipeline is described by a workflow of tasks, allocated to executors. Task executors can be software components (e.g., data analysis algorithms, metadata indexing tools, search engines of different natures, result presentation modules, etc.). Tasks can also be allocated to individual human users (e.g., via a gaming interface) or to an entire community (e.g., via a crowdsourcing component). Different Pipelines are defined for the different processes of a multimedia search application: content analysis and metadata extraction, query processing, and relevance feedback processing. Pipeline descriptions are stored in a process repository; the automatic part of a Pipeline is encoded using the standard BPEL workflow language or as a SMILA Pipeline, with data exchanges across services supported by a suitable data model, in order to cope with the data-intensive nature of multimedia content processing and search.
The CUbRIK architecture is divided into four main layers or tiers:
1. External Interfaces
2. Processes
3. Components and executors
4. Platform services
Layers group the framework artefacts, essentially composed of Interfaces, CUbRIK Apps, Pipelines (including automatic and human tasks), components of different kinds, and core services. SMILA, exploited as the underlying framework supporting workflow definition and execution, essentially corresponds to the Platform Services layer. Starting from some existing services – like the execution engine, task management, and persistency & cache support – SMILA is extended in the course of the project with the goal of achieving full functional coverage of the CUbRIK Platform Services layer.
1.3
CUbRIK Pipelines
Unix-like operating systems implement a simple pipeline mechanism for interlocking program executions; specifically, a software pipeline is composed of a set of processes chained by their standard streams, so that the output of each process feeds directly as input to the next one in the chain. Another example of a software pipe is the one implemented by Yahoo! Pipes5; these Pipes implement a composition tool to aggregate, manipulate, and mash up content from around the web, allowing users to create a tailored and personalized Yahoo! homepage. Conceptually, a CUbRIK Pipeline reflects a similar arrangement: Pipelines are a mix of automatic operations (JOBs) and human activities that are chained in a sequence. This concept of Pipeline reflects the CUbRIK approach of having humans in the loop of generic search processes. As mentioned, in the CUbRIK architecture SMILA6 was adopted as the underlying framework for supporting workflow definition and execution; therefore, each CUbRIK JOB is implemented, in practice, as a SMILA workflow that orchestrates easy-to-implement components (Workers or Pipelets), while synchronous requests (e.g. search requests) are processed by workflows defined in BPEL (Pipelines).
Figure 2: CUbRIK Pipeline Structure
In this way, the CUbRIK Pipeline results in a powerful mechanism to aggregate:
• Jobs (automatic workflows): SMILA workflows;
• Human activities: Crowd-enabled CUbRIK Applications and Game With A Purpose (GWAP) CUbRIK Applications.
In more detail, a CUbRIK JOB is defined as a SMILA workflow, which is in turn constituted by an aggregation of Actions. Depending on the complexity and specific characteristics of the activity to be performed, an Action can be formalized as a Worker (a single processing component in an asynchronous workflow), a Pipelet (a reusable component in a BPEL workflow used to process data contained in records), or a Pipeline (a synchronous BPEL process, or workflow, that orchestrates Pipelets and other BPEL services, e.g. web services). The characteristic of an asynchronous workflow is that it is started and its execution is controlled by a ProcessManager. The results are not delivered to a specific recipient, but are stored for subsequent use.
5 http://pipes.yahoo.com/pipes/ 6 http://www.eclipse.org/smila/
Figure 2 depicts the overall CUbRIK Pipeline structure, emphasizing human activities and automatic activities executed over SMILA. Main-memory data sharing among Pipelines, Jobs and Workers is quite simple: the SMILA framework takes care of it. For the persistence of data, a dedicated persistent data storage system is required. An approach that is standard in SMILA is to make use of one store for content assets ('Blob store') and one store for metadata ('Record store'), as shown in Section 4.1.5 (Data stores).
1.3.1
CUbRIK Pipeline Synchronization
One of the most challenging issues in designing the CUbRIK platform and applications is coping with the different timing of automatic and human work, which demands flexible synchronization and data exchange among tasks realized in different ways and executed by different actors. Taking a closer look at the synchronization aspect, the CUbRIK Pipeline is conceived as a macro-process including micro-processes of diverse nature, articulated in different combinations. These micro-processes, represented by the two typologies listed above, are arranged according to a specific synchronization scheme: synchronous automatic processes (JOBs) are combined with Crowd-enabled and GWAP Applications, which are performed asynchronously. Figure 3 below depicts an example of a CUbRIK Pipeline:
Figure 3: Example of CUbRIK Pipeline graphic representation
• JOB (Automatic Workflow) represents a workflow fragment that is executed by the process execution engine; JOBs embed and exploit the processing components for the activities carried out by the machine.
• Crowd-enabled Applications are the first kind of application implementing the human-in-the-loop mechanism. These Applications are targeted at enabling individual and social participation in search processes; the actual crowd mechanism leverages the CrowdSearcher Framework [3], implementing distributed work solutions for multimedia search. The Framework is in charge of the design, execution and verification of tasks by a crowd of performers; in particular, it manages core aspects including, but not limited to, human task design, people-to-task matching, task assignment, task execution, executor evaluation and output aggregation.
• GWAP Applications are the other kind of application implementing the human-in-the-loop mechanism. As for Crowd-enabled Applications, a specific GWAP Framework is designed. By leveraging playing games, the Gaming Framework outsources certain process steps to humans in an entertaining way. Typical steps include labelling images to improve web searching, transcription of ancient text, and other activities requiring common sense or human experience. The basic mechanism is the training of the system to solve problems mainly related to media understanding and content interpretation. The Gaming Framework is part of the work in WP3 (Task 3.1 Games with a purpose for multimedia search). Deliverable D3.1, due at M15 (First GWAP and implicit user information techniques), deals explicitly with the issues of GWAP design and implementation; we therefore defer the description of the Gaming Framework to that specific deliverable.

1.3.2
CUbRIK Pipeline Data Exchange and Storage
A CUbRIK Pipeline is composed of different JOBs and Human Activities that are orchestrated together with respect to their synchronous or asynchronous nature. The CUbRIK infrastructure synchronizes the flow of control, and in doing so the CUbRIK Platform accomplishes the complementary function of data management. In fact, the availability of data exchange among JOBs and Human Activities is essential for thorough Pipeline execution. The Pipeline mechanism implements the concept of data sharing among operations and human activities belonging to CUbRIK search processes. Besides this, Pipeline data constitutes the basis of the triggering mechanism that puts JOBs and Human Activities "in synch". In fact, referring back to the analogy with Unix-like pipelines, one program runs after another; each program is triggered by the underlying infrastructure when the previous program has concluded and its output is made available. The output itself is fed as input to the next one. The CUbRIK synchronization process, in a similar way, makes the data produced by each JOB or Human Activity available to the next JOB or Human Activity. The data are made persistent in an agnostic way.
1.3.3
Data persistence
As an application may consist of different communicating and possibly asynchronous Pipelines, shared data need to be stored and accessed. In general there are the following options and restrictions:
• All Pipelines that run in the SMILA environment have access to a built-in blackboard where data can be exchanged.
• For all other Pipelines there has to be another way of data exchange. Passing data as an argument is not considered, as the amount of data is too large. Therefore, there has to be a storage unit which allows access to the data.
• This persistence unit has to be accessible from inside the CUbRIK framework. A JSON/REST API would fit; a minimal sketch of such access follows below.
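As an illustration of the last option, the following minimal sketch (Java, JDK 11+ HTTP client) shows how a component could store and later retrieve a shared record through such a JSON/REST persistence unit. Host, path and record layout are assumptions for illustration, not a defined CUbRIK endpoint.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Illustrative client for a hypothetical JSON/REST persistence unit. */
    public class RecordStoreClient {
        private final HttpClient http = HttpClient.newHttpClient();
        private final String baseUrl; // e.g. "http://localhost:8080/store/records" (assumed)

        public RecordStoreClient(String baseUrl) { this.baseUrl = baseUrl; }

        /** Persist a record under a CUbRIK ID so that other Pipelines can read it. */
        public int putRecord(String cubrikId, String recordJson) throws Exception {
            HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/" + cubrikId))
                    .header("Content-Type", "application/json")
                    .PUT(HttpRequest.BodyPublishers.ofString(recordJson))
                    .build();
            return http.send(req, HttpResponse.BodyHandlers.discarding()).statusCode();
        }

        /** Fetch a record previously produced by another (asynchronous) Pipeline. */
        public String getRecord(String cubrikId) throws Exception {
            HttpRequest req = HttpRequest.newBuilder(URI.create(baseUrl + "/" + cubrikId))
                    .GET().build();
            return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }
    }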
2.
SMILA architecture and core components
This section provides an introduction to the fundamental concepts behind SMILA, which have been examined in depth and contrasted with the functional and non-functional requirements of CUbRIK, in order to understand the mapping and usage of the SMILA BPEL engine, inter- and intra-Pipeline communication, and the data storage characteristics that could be put to work to support the execution of human-enhanced multimedia processes. The section describes how failsafe content processing is done using asynchronous workflows or synchronous Pipelines. To facilitate understanding of the relations between SMILA and CUbRIK, the section also provides a mapping between the SMILA architecture and the CUbRIK architecture described in this document. By following this mapping, the following advantages are realized within CUbRIK:
• Through the use of SMILA across all tiers of the CUbRIK architecture, there are fewer technical boundaries and interfaces to consider. This reduces the learning effort needed to start implementing for the CUbRIK platform.
• The processing of content and queries can rely upon the scalability and fault tolerance mechanisms implemented within SMILA. In particular, CUbRIK can scale across the tiers during performance peaks. For example, when a new datasource is crawled, more resources can be given to the crawling process, while the processing of the crawled data can be postponed to times of low load.
• Components once integrated within SMILA can be reused throughout CUbRIK. For example, an entity recognizer could be used in the content acquisition workflows as well as in the queries. This reduces development time and supports consistent system behavior.
• SMILA supports using the same components in a throughput-optimized or in a low-latency scenario, i.e., in asynchronous workflows or synchronous Pipelines.
2.1
Mapping between the CUbRIK Architecture and SMILA
Figure 4 presents an overview of the SMILA architecture (as of version 1.1). The different color highlights in this picture provide a general allocation of the parts of SMILA to the tiers of the CUbRIK architecture.
The orange highlight is the Content & User Acquisition tier. SMILA provides basic connectors and crawlers for standard data sources (web, file, database). Furthermore, it provides a push API through which external components like Content Upload Apps can upload data. The uploaded data can then be processed further, e.g. to extract metadata or to consider copyright information. Furthermore, the correct workflows for handling the content are selected (e.g. whether the content is stored within the platform for later external access or whether it is cached for internal processing purposes only).
The blue highlight is the content processing tier, where the content is analyzed in more detail. While the initial metadata extraction during data import can be considered "cheap" concerning processing time, this tier uses more sophisticated and thus performance-intensive analysis methods.
The green highlight represents the query and search tier. For each type of query, a synchronous Pipeline is implemented to allow low-latency access to the searched content. Furthermore, by using Pipelines, multimodal queries that use the search engines sequentially (e.g. for entity lookup) or in parallel can be implemented. The Pipeline mechanism allows sufficient flexibility to implement new requirements.
Figure 4: Overview of the SMILA architecture and its mapping to the CUbRIK tiers
2.2
Foundational Concepts of SMILA
SMILA is a framework for creating scalable server-side systems that process large amounts of unstructured data in order to build information processing applications such as search, linguistic analysis and information mining. SMILA's goal is to enable organisations to integrate data source connectors, search engines, sophisticated analysis methods and other components, gaining scalability and reliability out of the box. Compared to similar frameworks – with UIMA as a well-known example – SMILA was developed right from the start with a focus on scalability and fault tolerance as well as on supporting the whole data processing and retrieval lifecycle. Furthermore, the rigorous checks concerning intellectual property issues performed by the Eclipse Foundation provide an ideal basis for incorporating SMILA in commercial products.
Firstly, a general overview of SMILA and the underlying concepts is given. Secondly, how data processing workflows are defined and handled in SMILA is described. The third section presents an example of how SMILA is applied. The summary and outlook section closes this contribution.
As depicted before, SMILA provides these main parts:
• JobManager: a system for asynchronous, scalable processing of data using configurable workflows. The system is able to reliably distribute the tasks to be done on big clusters of hosts. The workflows orchestrate easy-to-implement workers that can be used to integrate application-specific processing logic.
• Crawlers: concepts and basic implementations for scalable components that extract data from data sources.
• Pipelines: a system for processing synchronous requests (e.g. search requests) by orchestrating easy-to-implement components (Pipelets) in workflows defined in BPEL.
• Storage: concepts for integrating "Big Data" storage for efficient persistence of the processed data.
Eventually, all SMILA functionality is accessible to external clients via an HTTP ReST API using JSON as the exchange data format. As an Eclipse system, SMILA is built to adhere to OSGi standards and makes heavy use of the OSGi service component model.
A SMILA system consists of two distinct parts:
• Firstly, data has to be imported into the system and processed to build a search index, extract an ontology, or whatever else should be learned from the data.
• Secondly, the learned information is used to answer retrieval requests from users or other external systems, for example search or ontology exploration requests.
In the first process, a data source is usually crawled or an external client pushes the data from the source into the SMILA system using the HTTP ReST API. Often the data consists of a large number of documents (e.g. a file system, website, or content management system). To be processed, each document is represented in SMILA by a record describing the metadata of the document (name, size, access rights, authors, keywords, etc.) and the original content of the document itself.
To process large amounts of data, SMILA must be able to distribute the work over multiple SMILA nodes (computers). Therefore the Bulkbuilder separates the incoming data into bulks of records of a configurable size and writes them to an ObjectStore. For each of these bulks, the JobManager creates tasks for workers to process such a bulk and produce other bulks with the result. When such a worker is available, it asks the TaskManager for tasks to do, does the work and finally notifies the TaskManager about the result. Workflows define which workers should process a bulk in what sequence. Whenever a worker finishes a task for a bulk successfully, the JobManager can create follow-up tasks based on such a workflow definition. In case a worker fails its task (because the process or machine crashes, or because of network problems), the JobManager can decide to re-try the task later, ensuring that the data is processed even under error conditions. The processing of the complete data set using such a workflow is called a job run, and the monitoring of the current state of such a job run is possible via the HTTP ReST API.
JobManager and TaskManager use Apache Zookeeper7 to coordinate the state of a job run and the to-do and in-progress tasks over multiple computer nodes. As a result, job processing is distributed and parallelised. To make implementing workers easier, the SMILA JobManager system contains the WorkerManager, which enables the developer to concentrate on the actual worker functionality without having to worry about getting the TaskManager and ObjectStore interaction right (a worker sketch is given below).
To extract large amounts of data from a data source, the asynchronous job framework can also be used to implement highly scalable crawlers. Crawling can be divided into several steps:
• Getting names of elements from the data source;
• Checking if the element has changed since a previous crawl run (delta check);
• Getting the content of changed or new elements;
• Pushing the element to a processing job.
These steps can be implemented as separate workers too, so that the crawl work can be parallelised and distributed quite easily. By using the JobManager to control the crawling, SMILA gains the same reliability and scalability for crawling as for processing. Implementing new crawlers is just as easy as implementing new workers.
Eventually, the final step of such asynchronous processing of a workflow will write the processed data to a target system, for example a search engine, ontology storage or a database, where it can be used to process retrieval requests that are handled by the second part of the system. Such requests come from an external client application via the HTTP ReST API. They are usually of a synchronous nature, which means that a client sends a request and waits for the result. These results are then presented to the end user and therefore should be produced quickly.

7 zookeeper.apache.org/
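The following sketch illustrates the worker model just described. The interfaces are simplified stand-ins modelled on the description above, not the actual SMILA taskworker API; real workers are plugged in as OSGi services and obtain their input via the TaskManager/ObjectStore machinery that the WorkerManager hides.

    import java.util.List;
    import java.util.Map;

    /**
     * Simplified model of the worker concept: a worker consumes a bulk of
     * records from an input slot and writes result records to an output slot.
     */
    interface TaskContext {
        List<Map<String, Object>> readInputBulk(String slotName);      // assumed helper
        void writeOutputBulk(String slotName, List<Map<String, Object>> records);
    }

    interface Worker {
        String getName();                   // the name referenced by workflow actions
        void perform(TaskContext ctx) throws Exception;
    }

    /** Example worker: annotates each record of the bulk with a language guess. */
    class LanguageGuessWorker implements Worker {
        @Override
        public String getName() { return "languageGuesser"; }

        @Override
        public void perform(TaskContext ctx) throws Exception {
            List<Map<String, Object>> records = ctx.readInputBulk("input");
            for (Map<String, Object> record : records) {
                Object text = record.get("text");
                // placeholder analysis; a real worker would invoke an NLP component
                record.put("language", text != null ? "en" : "unknown");
            }
            ctx.writeOutputBulk("output", records);
        }
    }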
In addition, SMILA provides similar flexibility for configuring the processing of such synchronous requests as for the asynchronous job processing. For this purpose, SMILA has a different workflow processor, which is based on a BPEL engine. The BPEL workflows (called Pipelines) in this processor orchestrate so-called Pipelets in order to perform the different steps needed to enrich and refine the original requests and to produce the result. Finally, it is also possible to combine both workflow variants, because there is a PipelineProcessing worker in the asynchronous system which performs a task by executing a synchronous Pipeline. It is therefore possible to implement only a Pipeline and make the functionality available in both kinds of workflows. Additionally, a PipeletProcessing worker is available, which executes just a single Pipelet. This worker avoids the overhead of the synchronous workflow processor in case one Pipelet is sufficient to process the tasks.
2.3
Asynchronous workflows
One of the core requirements of CUbRIK is the ability to intermix synchronous and asynchronous tasks in the same application. Therefore, special emphasis has been placed on the analysis of the way in which SMILA asynchronous workflows can support CUbRIK specifications, through the proper configuration of the TaskManager and of the JobManager. The JobManager controls the processing logic of asynchronous workflows in SMILA by regulating the TaskManager, which in turn generates tasks and decides which task should be processed by which worker and when.
An asynchronous workflow consists of a set of actions. Each action connects the input and output slots of workers to appropriate buckets. A bucket is a virtual container of data objects of the same type. The most common data object type in SMILA is the record bulk, which is just a concatenated sequence of records (including attachments) stored in the ObjectStore service. When a new data object arrives in a bucket connected to the input slot of a worker (usually created by a worker that has the bucket connected to its output slot), a task is created for the worker to process this object and to produce data objects with the results in the buckets connected to the output slots. Thus the workflow (consisting of actions reading from and writing to buckets) describes a data flow of the data objects through the workers. The workflow usually starts with a worker that creates data objects from data sent to a SMILA API (e.g. the Bulkbuilder creates bulks of records sent by external or internal clients), or from data which has been extracted from an external data source (e.g. a crawler worker). The workflow ends either when workers do not have output buckets, or when the output buckets are not connected to input slots of other workers. Then, all temporary data objects created during the workflow are deleted and only the data objects in buckets marked as persistent remain.
A workflow definition is usually still generic, because it does not define all the parameters needed by the workers (e.g. the name of an index to build) and buckets (e.g. the name of the store for temporary data objects) used in the actions. To execute a workflow, a job must be defined that sets all these parameters to appropriate values. Then the job can be started, which initiates a job run. As long as the job run is active, new data can be submitted to it and the JobManager will ensure that it is processed by the workflow. After receiving the finish command, the job run will not accept any new data; the job will finish by processing the already submitted data (workflow runs). Then the job can be started again and repeated several times. It is possible to monitor the job run at all times and see the amount of data being processed by a worker during a given time period, how many errors have occurred and how much work is still to be done. After the job run has finished, the monitoring data is persisted for later analysis.
Two main components are responsible for making this work: the JobManager knows the workflow and job definitions, controls the creation of initial and follow-up tasks, and accumulates the monitoring data measured with each finished task. The TaskManager knows which tasks have to be done by which worker and which tasks are currently in progress; it also delivers tasks to workers that are currently available and ensures that a task will be repeated if a worker has died while working on it. All this works in a cluster of SMILA nodes as well, so the work can be distributed easily and reliably and parallelised across all nodes. The sketch below illustrates the job lifecycle from a client's point of view.
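To make the job lifecycle concrete, the sketch below drives a job over the HTTP ReST API from a Java client. The endpoint paths, the workflow name and the JSON payload follow the general pattern of the SMILA 1.x documentation but are reproduced here as assumptions and must be checked against the installed SMILA version.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Sketch of defining, starting and finishing an asynchronous SMILA job. */
    public class JobRunSketch {
        private static final HttpClient HTTP = HttpClient.newHttpClient();
        private static final String JOBS = "http://localhost:8080/smila/jobmanager/jobs"; // assumed

        private static String post(String url, String body) throws Exception {
            HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            return HTTP.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }

        public static void main(String[] args) throws Exception {
            // 1. Define a job: bind the generic workflow to concrete parameter
            //    values (store names, index name, ...), as described above.
            String jobDef = "{\"name\":\"indexContent\",\"workflow\":\"contentProcessing\","
                    + "\"parameters\":{\"tempStore\":\"temp\",\"index\":\"cubrik\"}}";
            post(JOBS, jobDef);

            // 2. Start a job run; the response is expected to carry the run id.
            String runInfo = post(JOBS + "/indexContent", "");
            System.out.println("job run started: " + runInfo);

            // 3. While the run is active, data can be submitted to it; sending the
            //    "finish" command stops intake and completes the submitted workflow
            //    runs. Monitoring (GET on the run URL) is omitted in this sketch.
        }
    }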
2.4
Common behaviour of JobManager definition APIs
CUbRIK demands a high level of configurability of the process engine, to respond to the different functional specifications of CUbRIK applications that involve human intervention of different natures and at different stages of multimedia processing. SMILA provides APIs to read and write JobManager configuration elements (currently buckets, workflows and job definitions can be configured). These APIs have some common properties:
• Elements can be defined either in the system configuration or by using the APIs. System-defined elements cannot be changed by API calls. Therefore, when reading such system-defined elements using the API, they will contain a "readOnly" flag set to "true". Requests to update these elements will result in an error. You cannot set this flag yourself when creating elements to protect them from being overwritten; the API will remove it.
• User-defined elements, on the other hand, will contain a timestamp attribute holding the information about when an element was last changed. This can be used by modelling tools to ensure that they do not overwrite changes made by other users. You cannot set this timestamp yourself in an update request; it will be overwritten by the API. Additionally, when an update request for an element is performed successfully, the response object will also contain the timestamp attribute generated for this update action.
• Apart from the required and optional structure and content of the JobManager elements, elements can contain additional information as needed by the user. This makes it possible to add comments, descriptions, author information, etc. The read APIs show this additional information in the result objects only if invoked with "...?returnDetails=true"; otherwise the response only contains the basic information.
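The exchange below illustrates the properties just listed: reading a definition with full details and updating a user-defined element. URLs and payloads are illustrative assumptions; only the readOnly / timestamp / returnDetails behaviour is taken from the description above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Illustration of the common JobManager definition API behaviour. */
    public class DefinitionApiSketch {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            String base = "http://localhost:8080/smila/jobmanager/workflows"; // assumed

            // Read a definition including additional details (comments, authors, ...).
            // A system-defined element is expected to carry "readOnly": true.
            HttpRequest read = HttpRequest.newBuilder(
                    URI.create(base + "/contentProcessing?returnDetails=true")).GET().build();
            System.out.println(http.send(read, HttpResponse.BodyHandlers.ofString()).body());

            // Update a user-defined workflow. Neither readOnly nor timestamp is sent:
            // the API strips the former and overwrites the latter, returning the new
            // timestamp in the response object.
            String definition = "{\"name\":\"myWorkflow\"}"; // illustrative skeleton
            HttpRequest update = HttpRequest.newBuilder(URI.create(base))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(definition))
                    .build();
            System.out.println(http.send(update, HttpResponse.BodyHandlers.ofString()).body());
        }
    }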
2.5
Configuring SMILA synchronous workflows: BPEL Designer
The CUbRIK DoW also requires the construction of appropriate tools for the design and configuration of CUbRIK Pipelines. Therefore, another focus of the differential analysis of the CUbRIK and SMILA architectures has been placed on the understanding of the available BPEL editing and configuration tools. This analysis has led to the choice of the BPEL Designer8 as a starting point for the future work in WP8 on the construction of CUbRIK Pipeline configuration tools. BPEL Designer is an Eclipse project that offers support for editing WS-BPEL 2.0 processes. Like other Eclipse projects, it may be extended by using plug-ins. The SMILA project offers such plug-ins for editing SMILA-specific activities (to invoke SMILA Pipelets from a Pipeline). BPEL Designer incorporates SMILA's components/Pipelets as BPEL extension activities and thus forms SMILA's synchronous workflows. The same plug-in approach will be followed for inserting into the BPEL Designer further components needed for visually editing and customizing CUbRIK human-enhanced multimedia Pipelines.
8 http://www.eclipse.org/bpel/
Figure 5: Screenshot of BPEL Designer
3.
Design of the CUbRIK Architecture
After the introduction of the CUbRIK architecture and of the main concepts of SMILA as a support to implement CUbRIK Pipelines, this Section describes the modules of the proposed CUbRIK architecture in more detail. Figure 6 presents a refinement of the high-level CUbRIK architecture presented in the DoW document, also illustrated in Figure 1. In Figure 6 the following meanings are associated with the symbols:
• White rectangles: logical functional components
• Gray large rectangles: tiers
• Arrows: flow of data
• Hexagons: external applications
• Can symbols: stores and indexes
These elements of the CUbRIK architecture are described in the following text. Compared to earlier versions – e.g., as presented in D9.1 Integration Guidelines – the lower two tiers were merged; a justification is given in the corresponding section of the text.
Upload manager Subscription manager
Crawiling app(s)
Crawling manager
CONTENT & USER ACQUISITION TIER
Content & metadata acquisition manager
Copyright-awareness manager
Performers store (D)
Content & artifacts store (D) Annotation store (D)
Collection mgmt app
Human annotation app(s)
Content processing manager Conflict manager SMILA Pipeline engine Segmentation pipelines Feature extraction pipelines Text normalization pipelines Automatic annotation pipelines Entity extraction pipelines Conflict detection pipelines Feedback pipelines
Conflict resolution app(s)
CONTENT PROCESSING TIER
Performer manager
Relevance feedback manager
Relevance Feedback store (D) Annotation store (D)
Entities store (D)
Performers store (D)
Conflicts store (D) Query app(s)
Query Pipeline QueryBroker Broker Query
Content & artifacts stores (D) Text search engine
Content-based search engine(s)
Time search engine
Text search index (D) Feature & fingerprint store
Space search engine
Entity search engine
QUERY & SEARCH TIER
Entities store (D)
Figure 6: Detailed CUbRIK Architecture with integrated query and search tier
An alternative view that focuses on deployment describes the CUbRIK platform as a two-server architecture model. One server tier supports process control and data storage, built respectively upon the functionality offered by SMILA (process engine and Pipeline containers) and upon the Empolis Information Access System (IAS), a SMILA-compliant persistent storage service wrapping the MongoDB9 open source record store [3]. This is the so-called PAAS back-end (Platform as a Service). The application-specific parts of the back-end run on another server, called the SAAS back-end (Software as a Service). This server also provides the HTTP-based API access to applications. This deployment configuration is depicted in Figure 7.
Figure 7: CUbRIK PAAS and SAAS back-ends
3.1
Content and User Acquisition Tier
The content and user acquisition tier has a twofold purpose:
o Supporting the provision of content into the CUbRIK platform, as required by one or more Pipelines. Two major content provision modes are supported:
• By copy: content is physically ingested and stored inside an instance of the platform. This mode is viable when content access rights permit uploading or making a copy inside a CUbRIK instance;
• By reference: content is stored externally and accessed by CUbRIK Pipelines when necessary. This is the case, for example, of external semantic repositories (e.g., Entitypedia), which obviously cannot be stored internally, but need to be accessed via data transfer APIs from within a CUbRIK Pipeline.10
o Supporting the subscription of users to a CUbRIK platform instance, either for using CUbRIK to launch multimedia search applications or for joining CUbRIK as human performers of multimedia processing tasks. The acquisition of users to a CUbRIK instance can occur in two ways:
• By registration: users are offered an explicit application to register themselves to a CUbRIK instance and provide their profile data;
• By import: users can be detected and imported from external systems, including crowdsourcing platforms and social networks. This modality is respectful of the terms of usage of the user's community platform and may simply provide a reference to the user's identity (e.g., a public profile) with no associated profile data.
9 www.mongodb.org
10 In this case there has to be an internal store to handle the references (URIs).
Users' identities – profile data for registered users and simple references for users of other platforms – are normalized according to the CUbRIK data model and made available to the other subsystems and modules via suitable user data access Web services. The subsystems of the content and user acquisition tier are described in the following.
3.1.1
The Subscription Manager
The Subscription Manager handles the explicit registration of users to the platform. There are two broad classes of users: searchers and performers.
o Searchers use CUbRIK apps for searching and interacting with information; they may be exploited to get feedback on query result quality (e.g., via click-stream analysis);
o Performers execute tasks on CUbRIK applications (for example, via an external crowdsourcing platform, a gaming or Question&Answer application) to provide contributions, inspections or conflict resolutions.
Figure 8 provides a UML diagram for the two user types. The registration process is lightweight and leverages existing single sign-on platforms (e.g., login with a Facebook account via OAuth, OpenID, etc.). In this case, the profile data of a user who registers to the CUbRIK platform are kept in the system where they belong, and CUbRIK stores only the minimal amount of data needed to reference the external user.
Figure 8: Different CUbRIK Users
The registration process associates a unique CUbRIK ID with the existing account of the user. The CUbRIK Subscription Manager should be designed to be "privacy-friendly", thus using (and retaining) only the user information required for internal purposes (e.g., performance monitoring of CUbRIK task execution by performers, as needed to implement capacity-based task-to-user allocation policies), reusing native profiles and personal data on social networks as much as possible and in compliance with the data access terms of each social network platform. In Figure 9 the process of using an external authentication service by CUbRIK is depicted.
Figure 9: Process of authentication via OAuth (exemplary)
As a consequence of this design principle, CUbRIK Pipelines should not rely on personal data retention features other than the work statistics of users who subscribed to perform specific CUbRIK tasks in human-enhanced CUbRIK Pipelines. The information stored inside the Subscription Manager is then used in the following ways:
• The Performer Manager uses the information to manage the distribution of tasks to performers and to collect feedback from task performance. For the performers registered at the CUbRIK platform, the Performer Manager uses the CUbRIK ID of the user (CUbRIK.D21.POLMI.WP2.V1.0.doc, Section 1.4). The focus of the Performer Manager is on the collection and aggregation of information about the worker's performance history, whereas the Subscription Manager acts simply as a control gateway to access CUbRIK applications.
• If requested, the account information is used to crawl datasources available to the subscriber.
• It is used to provide access to the services of the CUbRIK Platform, in particular for the annotation services as well as for the search Pipelines where needed.
A minimal sketch of the resulting user data model is given below.
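The following is a minimal data-model sketch of the two user classes and of the privacy-friendly referencing described above. Field names are illustrative assumptions; the normative definitions are those of the CUbRIK data model (D2.1).

    /** Minimal sketch of the user classes handled by the Subscription Manager. */
    abstract class CubrikUser {
        final String cubrikId;            // unique CUbRIK ID assigned at registration
        final String externalAccountRef;  // reference to the external identity
                                          // (OAuth/OpenID); no profile data is copied
        CubrikUser(String cubrikId, String externalAccountRef) {
            this.cubrikId = cubrikId;
            this.externalAccountRef = externalAccountRef;
        }
    }

    /** Uses CUbRIK apps for searching; may yield implicit relevance feedback. */
    class Searcher extends CubrikUser {
        Searcher(String id, String ref) { super(id, ref); }
    }

    /** Executes tasks (crowdsourcing, gaming, Q&A); work statistics are retained. */
    class Performer extends CubrikUser {
        int solvedConflicts;              // aggregated by the Performer Manager
        double qualityScore;
        Performer(String id, String ref) { super(id, ref); }
    }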
3.1.2
The Upload and Crawling Managers
The Upload Manager and the Crawling Manager are the subsystems responsible for provisioning a CUbRIK instance with raw content (images, video, audio, text). Content elements can be added to a CUbRIK platform via upload or by scheduled crawling. After acquisition, they are subjected to normalization by means of one or more content processing Pipelines, so as to be made ready for search Pipelines. These Managers are based on the import functionality provided by SMILA. Crawling imports external metadata into CUbRIK in popular formats (e.g., Dublin Core, or a useful subset of MPEG-7). External metadata are reformatted according to the CUbRIK data model by a suitable stage of content processing. Content registration gives a content element a CUbRIK ID, and then stores:
• The content element in raw format;
• The associated crawled or manual metadata (if any);
• The associated content rights metadata (if any).
Content acquisition is independent of, and asynchronous with respect to, content processing; content processing Pipelines can be activated immediately after the acquisition terminates, or at a later stage. The Upload and Crawling Manager maintains log information on the content that has been acquired and is pending for some content processing Pipeline to be applied to it.
Content acquisition includes the extraction of metadata, including rights and license information. The Upload and Crawling Managers are built upon the existing import features provided by SMILA: the Upload Manager uses the Push API of SMILA, while the Crawling Managers use the ETL (Extract, Transform, Load) functions of SMILA. For each datasource to crawl (web, file, database), a strategy on how to crawl the datasource is provided; these strategies can be defined via several configurations in the SMILA workers. The initial processing of metadata is then performed by asynchronous workflows in SMILA, which constitute the Content and Metadata Acquisition Manager. A sketch of pushing an uploaded item into the platform is shown below.
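The sketch below shows how the Upload Manager could hand a content element to a SMILA job via the Push API. The job name, endpoint path and record fields are assumptions for illustration; the real path and record schema must be taken from the SMILA documentation and the CUbRIK data model, and binary content is referenced by URI here instead of a real attachment.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    /** Sketch of pushing an uploaded content element into a SMILA job. */
    public class UploadPushSketch {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            String url = "http://localhost:8080/smila/job/contentUpload/record"; // assumed

            // Minimal record: CUbRIK ID plus uploaded/crawled metadata. Rights
            // metadata travels with the record for the Copyright Awareness Manager.
            String record = "{\"_recordid\":\"cubrik-0042\",\"title\":\"Sample image\","
                    + "\"contentUrl\":\"http://example.org/sample.jpg\","
                    + "\"license\":\"CC-BY-NC-SA\"}";

            HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(record))
                    .build();
            System.out.println(http.send(req, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }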
3.1.3
The Content and Metadata Acquisition Manager
The Content and Metadata Acquisition Manager is an internal logical sub-system that offers services to the Upload Manager and to the Crawling Manager. It factors out the logic for normalizing the representation of content and of the associated metadata to the internal standards of CUbRIK. Metadata are aligned to the CUbRIK data model. Content is preserved in its native format and stored either by copy or by reference, depending on the copyright permissions and on the requirements of the content processing and query Pipelines. The Content and Metadata Acquisition Manager is implemented in SMILA as a set of asynchronous workflows that provide an initial processing of the crawled content. This workflow-based approach implements the plugin structure required in D9.1 – e.g., the acquisition of a new content type can be implemented simply by adding new content normalization workers to an existing workflow, or by creating a new workflow. As an example of this approach, refer to the logo detection h-demo, described in D5.1 (R1 Pipelines for multimodal content analysis & enrichment).
3.1.4
The Copyright Awareness Manager
The goal of this component is to address copyright aspects (e.g., regarding the right of reproduction and the right of communication to the public) for content approval, storage, annotation, transformation, presentation and distribution. The idea is to partially automate, using automatic annotation and rules as well as user input, the determination of whether and how content is processed and used in the platform. The aim is to maximize the availability of content for users, while at the same time ensuring respect for copyright holders, especially the rights of users that participate in crowdsourcing and content production processes in CUbRIK. The Copyright Awareness Manager is responsible for:
a) Determining the content status / content approval for the system. This is done by using relevant information (contextual and otherwise) to determine content provenance / authenticity and trust in the source / content provider;
b) Using and interpreting relevant metadata (including CC content licenses, ACAP and other relevant information) and contextual information aggregated and harmonized by the Upload and Crawling Managers and the Content and Metadata Acquisition Manager; deriving permissions on how content (and derived information / metadata) should be handled in the system; and communicating these permissions to the relevant system domains (storage, annotation, transformation, presentation, delivery), where they will be interpreted;
c) Communicating with rights holders and crowds to track, complement and modify rights, license and provenance information and to resolve possible conflicts. Rights holders are registered as CUbRIK users, so the Copyright Awareness Manager dialogues with the Subscription Manager to implement this communication.
Concerning a) and b), the Copyright Awareness Manager is implemented by (a) workers that extract the needed copyright information from the datasource or file, and (b) workflows that handle the data according to the copyright information. For example, copyrighted data might be cached inside the CUbRIK platform for internal processing only; in that case it must be accessed from the original source rather than exposed by the CUbRIK platform.
3.2
Content Processing Tier
The CUbRIK Content Processing Tier is not a monolithic subsystem, but rather a set of Pipelines (implemented using the SMILA framework) devoted to processing the content acquired by the modules of the Content Acquisition Tier, in order to make it amenable to query processing. The content processing Pipelines are independent from the content acquisition jobs. Multiple Pipelines can be applied to the same content collection and, conversely, the same content collection can be subjected to more than one processing Pipeline, e.g., to support different query scenarios. The link between acquisition and processing is loose and based on a registration and notification mechanism: a Pipeline registers for the availability of given content and is notified when new content elements become available.
The Content Processing Manager builds upon the functionality of the native SMILA workflows and Pipelines. It listens to a queue of pending content processing tasks and is responsible for starting, suspending, resuming, terminating and rescheduling a content processing task. The content processing tier processes contents at different levels of granularity:
• Collection;
• Sub-collection;
• Content element;
• Derivative content element.
A Content Processing Task has the following external states (or refinements thereof, depending on the implementation):
• PREPARING: started but not running yet
• RUNNING: running
• FINISHING: finished but not all tasks processed yet
• COMPLETING: finished, all tasks processed but job run not completed (e.g. not persisted) yet
• SUCCEEDED: successfully completed
• FAILED: failed
• CANCELING: canceled, but clean-up is not yet completed
• CANCELED: canceling done
Figure 10 shows the state transitions of a task.
Figure 10: State diagram of a task
The Pipeline engine orchestrates the actual execution of Pipelines in response to requests for content processing. It is aware of the internal status of a content processing task (which Pipelines have been executed and which ones are still to be executed), and it deals with exceptions, Pipeline monitoring and restart. Content processing must be incremental: the same content can be subject to several processing Pipelines, where each Pipeline adds a new annotation. The goal is to enable real-time search, while giving the infrastructure the time to perform full annotation. Control flow among Pipelines is expressed declaratively (e.g., as a macro-Pipeline formed of sub-Pipelines, with runtime conditions governing the flow of control). The typical output of a content processing Pipeline consists of:
• Derivative content (e.g., key frames, thumbnails, audio summaries);
• Low-level features;
• Facts, i.e., annotations (a.k.a. high-level annotations) plus confidence values;
• Entities;
• Conflicts (low-confidence facts, contradictory facts).
However, depending on the application, a content processing Pipeline may output only a subset of the abovementioned elements. The Content Processing Manager uses the existing stores – which are structured according to the data model – to store the results of a processing workflow. Afterwards, other processing workflows can use these results for further processing steps. The task states listed above are summarized in the sketch below.
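For reference, the external task states listed above can be captured as a compact enum. The transition table below is a reading of the state descriptions and of Figure 10, not code taken from SMILA.

    import java.util.EnumSet;
    import java.util.Set;

    /** External states of a Content Processing Task, with assumed transitions. */
    enum TaskState {
        PREPARING, RUNNING, FINISHING, COMPLETING, SUCCEEDED, FAILED, CANCELING, CANCELED;

        /** States reachable from this state; terminal states return an empty set. */
        Set<TaskState> successors() {
            switch (this) {
                case PREPARING:  return EnumSet.of(RUNNING, CANCELING, FAILED);
                case RUNNING:    return EnumSet.of(FINISHING, CANCELING, FAILED);
                case FINISHING:  return EnumSet.of(COMPLETING, CANCELING, FAILED);
                case COMPLETING: return EnumSet.of(SUCCEEDED, FAILED);
                case CANCELING:  return EnumSet.of(CANCELED);
                default:         return EnumSet.noneOf(TaskState.class);
            }
        }
    }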
3.2.1 The Conflict Manager
The Conflict Manager handles the set of conflicts and the assignment of conflicts to applications and performers. Conflicts can be assigned:
• To an application: in this case the application manages the allocation of the conflicts to be solved to performers. This is the typical case for GWAP apps;
• To an application-performer pair: in this case, the application routes the conflict to the selected performer. This is the typical case of Q&A apps.
The Conflict Manager is responsible for closing a conflict and storing the produced facts in the conflict store and, possibly, in the native knowledge repository (e.g., the EntityPedia semantic store). The Conflict Manager can implement a policy of escalation (e.g., re-routing a hard conflict to a more skilled performer, or to the CUbRIK admin). It also accesses the performer store in order to implement assignment policies that take into account the difficulty level of a task and the available profile data of a user (e.g., the record of task resolutions). The Conflict Manager detects conflicts in the following ways:
• During a content processing job, a metadata entry is derived with low certainty or without success. The workflow then writes the conflict, structured according to the conflict data model, into the conflict store. Thresholds will be implemented in the Pipelet/worker.
• During a content processing job, two or more contradicting metadata entries are derived. A Pipelet/worker then writes the metadata into the conflict store.
• Periodically, the content stores are analyzed for contradicting metadata that were created by different workflows, thus detecting conflicts.
The Conflict Manager receives the result of the conflict resolution from the conflict resolution apps. If the conflict was solved, the Conflict Manager writes the result back to the corresponding store. Having the Conflict Manager handle the results of conflict resolution reduces the exposure of APIs to Conflict Resolution Applications, since they do not need to know how the results are written back to the stores.
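The threshold-based detection can be illustrated with a small sketch; ConflictStore and Annotation are hypothetical placeholder types, not the actual SMILA Pipelet/worker interfaces or the D2.1 conflict data model.

// Placeholder types standing in for the real store connector and annotation model
interface ConflictStore { void writeConflict(String contentId, String reason); }
interface Annotation { double getConfidence(); String getValue(); }

public class ConflictDetector {
    private final double confidenceThreshold;
    private final ConflictStore conflictStore;

    public ConflictDetector(double confidenceThreshold, ConflictStore store) {
        this.confidenceThreshold = confidenceThreshold;
        this.conflictStore = store;
    }

    /** Case 1: an annotation was derived with low certainty or not at all. */
    public void checkCertainty(String contentId, Annotation annotation) {
        if (annotation == null) {
            conflictStore.writeConflict(contentId, "annotation missing");
        } else if (annotation.getConfidence() < confidenceThreshold) {
            conflictStore.writeConflict(contentId,
                    "low confidence: " + annotation.getConfidence());
        }
    }

    /** Case 2: two workflows produced contradicting values for the same entry. */
    public void checkContradiction(String contentId, Annotation a, Annotation b) {
        if (a != null && b != null && !a.getValue().equals(b.getValue())) {
            conflictStore.writeConflict(contentId,
                    "contradiction: " + a.getValue() + " vs. " + b.getValue());
        }
    }
}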
3.2.2 The Performer Manager
The Performer Manager is responsible for keeping statistics on performers (profile, history of solved conflicts, throughput, quality of decision, etc.). These data can be used for several purposes:
• The construction of leader boards for gaming applications;
• The design of “task to performer“ allocation policies based on task difficulty levels and the performer’s skill level;
• The design of “task to performer“ allocation policies based on demographic data (e.g., location-aware assignment policies, where the cultural background of the user may influence his or her adequacy as a task executor).
3.2.3 The Conflict Resolution Applications
The CUbRIK platform does not perform conflict resolution as part of its core Pipelines. Rather, it offers APIs for exporting conflicts and transforming them into tasks that can be performed by humans, by means of conflict resolution applications. A conflict resolution application can be:
• An application built on top of an existing crowdsourcing platform (e.g., on top of Microtask);
• A gaming application;
• A query and answer application.
The Conflict Resolution Applications report the results of conflict resolution back to the Conflict Manager.
3.2.4 The Relevance Feedback Manager/Pipelines
Some Pipelines are designed to receive feedback from the user on the results of a query. This feedback is routed to a Feedback Manager module that updates the level of trust of performers (human and automatic) in the Performer Store.
3.3 Query and Search Tier
The Query and Search Tier provides the functions to search for multimedia within the CUbRIK platform. A user or an application submits a request to a Pipeline in the Pipeline engine; this Pipeline implements the use-case-specific actions to answer the request. Consider the following use case (derived from the European history scenario presented in CUbRIK D2.2): the snapshot matcher Pipeline is given a picture by the user. The Content Search Engine(s) retrieve several pictures. Documents that reference these pictures are retrieved. Entities in these documents are recognized by a request to the entity search engine. The recognized entities are searched in the time and space search engines. The recognized metadata is then used to search for similar pictures and text (e.g., from the same period in time or location).
This example shows that the query is enriched by an interaction between different data sources. The presumably best way to handle this is to provide a specific workflow with reusable components, e.g., to access the different search engines. These components can then be combined into the specific workflow needed. Due to the high interaction between query and search, these functions were located in the same tier – rather than in separate ones, as in previous architectural pictures. The underlying technical framework supports the reuse of entire Pipelines, or even of pieces thereof (called jobs). The Search and Query Tier provides different Pipelines for different query purposes, for example:
• Keyword;
• Visual similarity (still image and video), also called visual content-based queries;
• Aural similarity, also called aural content-based queries;
• Multimodal (keyword + one similarity criterion).
Whether entities as well as spatial and temporal relationships can be considered depends on the functionality of the underlying search engines. The called Pipeline might process the request to convert it into the format needed by the search engine.
3.3.1 Query Applications
A Query App contains the front-end for issuing queries and viewing results; it normally triggers a SMILA query Pipeline, passes the user’s input to it, receives the results of the query Pipeline and formats them according to the interface specifications of the query application. Queries are serialized and submitted to a CUbRIK Pipeline (through a Web services API) according to a given query protocol; the simplest case is that of mono-modal queries expressed as a bag of keywords. More complex queries are mono-modal content similarity queries and multi-modal content similarity queries: the former require the submission of one content sample; the latter may require more than one content sample or a composite content sample (e.g., a video fragment). Results are organized according to a given result schema, serialized, and returned as responses from CUbRIK to the app. Results in a result list are sorted and chunked. Results typically consist of:
• References to matched content elements;
• Match details for each content element;
• Annotation values for the collection of results (facets).
CUbRIK applications can publish as a response to a query such information as: only content elements; content elements + associated annotations; or content elements + associated annotations + annotation values for the result list collection. Part of the functionality of this tier can be implemented using a (synchronous) query Pipeline based on SMILA. This Pipeline might reuse components used for content processing or data acquisition (e.g., the metadata of an uploaded picture or a set of low-level descriptors derived from the uploaded picture).
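For illustration, a keyword query could be submitted to a query Pipeline roughly as follows; the endpoint path and the JSON payload are assumptions made for this sketch, not the definitive CUbRIK query protocol.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class KeywordQueryClient {
    public static void main(String[] args) throws Exception {
        // Hypothetical mono-modal "bag of keywords" query payload
        String payload = "{\"query\":\"world war exhibition\",\"maxResults\":20}";
        URL url = new URL("http://localhost:8080/cubrik/pipeline/keywordSearch"); // assumed path
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        // Read the serialized result list returned by the query Pipeline
        try (Scanner in = new Scanner(con.getInputStream(), "UTF-8")) {
            while (in.hasNextLine()) {
                System.out.println(in.nextLine());
            }
        }
    }
}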
3.3.2 The Query Broker
A Query Broker is a Pipeline that issues a query to the search engines that are in charge of processing it. If needed, a Query Broker transforms the request – or parts thereof – into the format that is needed by the search engine and writes the result to the request. An example of the Pipelines for query processing is described in D6.1 (R1 Pipelines for query processing). The association of a query type to a Pipeline can be:
• Static: the query application is registered into the CUbRIK platform as the submitter of a specific class of query that is associated to a specific query Pipeline;
• Dynamic: the query Pipeline that is associated to the query has the responsibility of translating the query into the format expected by the search engine(s) and of sending the query or sub-queries to the search engine(s); unlike the static case, where the routing to the search engine is fixed, with the dynamic association the front-end Pipeline can embody application-specific business logic that uses metadata supplied with the query in order to dynamically decide the routing of the query to further Pipelines that interact with content-based or text-based search engines.
The Response Builder normalizes and fuses the responses from the search engine(s) and creates a single result list, to be returned to the query app. On the lower level of the query and search tier are the search engines that provide lookup functionality for the CUbRIK platform. There are two principal ways to use a search engine from a CUbRIK Pipeline:
• Black box: the search engine is external to the CUbRIK platform and is used to fetch content useful in the query Pipeline (e.g., calling Google Images to retrieve logo images corresponding to a brand name input by the user). In this case, the search engine is used without access to its index;
• White box: the search engine is internal to the CUbRIK platform and is used to index and query content procured by the CUbRIK content acquisition tier.
Each search engine used in white box mode can access the content and annotation store(s) to build/rebuild its indexes. Indexing is independent and asynchronous w.r.t. content processing and acquisition. Each white box search engine should listen to the content processing manager events, in order to understand when to build, re-build, extend, and update its indexes.
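The broker behaviour described above can be sketched as follows; SearchEngine and the fusion step are simplified placeholders (the real Response Builder performs normalization and ranking, not just naive duplicate elimination).

import java.util.ArrayList;
import java.util.List;

// Placeholder contract for a (black box or white box) search engine
interface SearchEngine {
    String nativeQueryFor(String cubrikQuery); // translate into the engine's format
    List<String> search(String nativeQuery);   // returns matched content IDs
}

public class QueryBroker {
    private final List<SearchEngine> engines = new ArrayList<SearchEngine>();

    public void register(SearchEngine engine) { engines.add(engine); }

    /** Routes the query to all registered engines and fuses the results
     *  into a single list (the Response Builder role, simplified). */
    public List<String> process(String cubrikQuery) {
        List<String> fused = new ArrayList<String>();
        for (SearchEngine engine : engines) {
            String translated = engine.nativeQueryFor(cubrikQuery);
            for (String id : engine.search(translated)) {
                if (!fused.contains(id)) { // naive duplicate elimination
                    fused.add(id);
                }
            }
        }
        return fused;
    }
}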
3.4 Stores and Indexes
All CUbRIK data elements that need to be referred to or generated/communicated externally to the platform must have a unique ID (e.g., contentID, factID, conflictID, userID) as described in the data model. Wherever possible, standard identification mechanisms will be used. Storage (in the general case) should be distributed; this implies that IDs should be assigned according to a distributed object identification mechanism. If needed, content will be encoded (or re-encoded) so as to be available for rendering and usage on several publication channels (e.g., iPhone/iPad). Since SMILA provides a coherent set of JSON/REST interfaces, these interfaces are used as a basis.
Within CUbRIK, an index is a special kind of storage. Compared to a store, an index contains additional functions to search for and retrieve content in an efficient way. For example, the full-text search index allows searching for documents containing a certain set of words. Another differentiation criterion between store and index is that an index is used predominantly for searching. For example, the conflicts store might have a search function that retrieves the most relevant conflicts in a certain context; however, the conflicts store is not intended to be used predominantly for searching.
3.4.1 IAS as default storage
In general, the IAS [4] is the standard content storage. It is accessible to all SMILA components, as it is nested in SMILA; for other components, a JSON/REST API can be implemented. The storage supports almost every format (such as documents, blobs, Java objects, etc.). Functionality to support fault tolerance and performance is provided, as access from inside the IAS is fast. Moreover, it provides the possibility to execute queries (which resemble SQL queries). An open item is the storage of big binaries and video streams. Table 1 lists several data types together with an evaluation of whether the IAS storage fits them.

Data type     Remark
Documents     --
XML           --
Audio         streaming supported
Video         streaming supported

Table 1: IAS as default data storage
4. Design of CUbRIK Components
This section details the overview of the CUbRIK architecture presented in the previous section. It exemplifies the implementation of several logical components presented before and gives – where relevant – an outline of the API of a component. This section also explains whether certain components run inside SMILA – the underlying implementation platform of CUbRIK – or whether they are self-operating components within the CUbRIK platform. Concerning this aspect, there is a simple technical distinction: components running in SMILA are automatically started when SMILA is started; stand-alone components (external services) call a SMILA workflow or Pipeline or are requested by a SMILA Pipelet/worker, and they are not automatically started when SMILA is started.
4.1 Components integrated in SMILA
This section describes components that are running in the context of SMILA, i.e., they are started when the SMILA Framework is started.
4.1.1 Upload Manager
The Upload Manager is implemented using the Push API of SMILA. After each upload, a commit is sent to trigger the direct processing of the content. Using this Push API, the content object and its metadata are uploaded to the CUbRIK platform and then processed by the Content and Metadata Acquisition Manager.
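As an illustration, a client could push a record as follows; the endpoint path and record schema are assumptions modelled on the SMILA JSON/REST conventions and must be checked against the SMILA manual.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class UploadClient {
    public static void push(String jobName, String recordJson) throws Exception {
        // Assumed push endpoint; see the SMILA manual for the exact path
        URL url = new URL("http://localhost:8080/smila/job/" + jobName + "/record");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream os = con.getOutputStream()) {
            os.write(recordJson.getBytes(StandardCharsets.UTF_8));
        }
        if (con.getResponseCode() >= 300) {
            throw new IllegalStateException("push failed: " + con.getResponseCode());
        }
    }

    public static void main(String[] args) throws Exception {
        // Metadata record for the uploaded content object (schema is illustrative)
        push("uploadJob", "{\"_recordid\":\"img-0001\",\"Title\":\"My picture\"}");
    }
}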
4.1.2 Crawling Manager
The Crawling Manager provides an externally invokable interface to register and crawl data sources. With the crawler implementation provided by SMILA 1.1, the following data sources can be crawled: web, RSS feeds, file system, and database (http://wiki.eclipse.org/SMILA/Manual#Importing). In SMILA 1.1, these crawl jobs have to be explicitly started via the JSON/REST API. To facilitate the usage of the Crawling Manager, the following extensions are needed:
• A mechanism to regularly trigger an update from a data source (a sketch is given below);
• A graphical user interface to configure crawling jobs.
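The first extension could, for instance, be realized by a simple scheduler that periodically calls the JSON/REST API; the job-start URL below is an assumption modelled on the SMILA job manager interface and has to be verified against the SMILA manual.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CrawlScheduler {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // An empty POST starts a run of the named job (assumed convention)
                    URL url = new URL("http://localhost:8080/smila/jobmanager/jobs/crawlWebJob");
                    HttpURLConnection con = (HttpURLConnection) url.openConnection();
                    con.setRequestMethod("POST");
                    con.setDoOutput(true);
                    con.getOutputStream().close();
                    System.out.println("job start returned " + con.getResponseCode());
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 6, TimeUnit.HOURS); // re-crawl the data source every six hours
    }
}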
4.1.3 Content and Metadata Acquisition Manager
The Content and Metadata Acquisition Manager is the initial processing step for new content in the CUbRIK platform. This process will be available as an asynchronous workflow. The component uses the metadata provided by the Upload Manager or the Crawling Manager and adds further metadata based on an analysis of the content object provided. The workflow is as follows: the initial step is splitting combined objects (e.g., ZIP files) into single content objects. For each of these content objects, a MIME type detection is performed to determine the further processing of the content object. For selected binary objects, like Microsoft Word documents, a text conversion is performed using the Aperture framework. For web pages, the boilerpipe extractor is used to extract the relevant sections. By integrating the Delta Checker into the content and metadata manager, already indexed content is identified and not forwarded to the performance-intensive content processing Pipelines. This is particularly relevant for large multimedia files.
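The initial steps (splitting combined objects and MIME type detection) can be sketched with plain JDK means; the real component uses richer detection and delegates text extraction to Aperture and boilerpipe, so the routing below is illustrative only.

import java.io.InputStream;
import java.net.URLConnection;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class AcquisitionDispatcher {
    /** Splits a ZIP stream into single content objects and routes each one. */
    public void split(InputStream zipStream) throws Exception {
        ZipInputStream zip = new ZipInputStream(zipStream);
        for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
            if (!entry.isDirectory()) {
                route(entry.getName());
            }
        }
    }

    /** Decides the further processing based on a (name-based) MIME guess. */
    public void route(String fileName) {
        String mime = URLConnection.guessContentTypeFromName(fileName);
        if (mime == null) {
            System.out.println(fileName + ": unknown type, needs content-based detection");
        } else if (mime.startsWith("text/html")) {
            System.out.println(fileName + ": extract relevant sections (boilerpipe)");
        } else if (mime.startsWith("text/")) {
            System.out.println(fileName + ": text conversion (Aperture)");
        } else {
            System.out.println(fileName + ": forward to media processing Pipelines");
        }
    }
}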
With reference to the logo detection h-demo, the Content and Metadata Acquisition Manager is designed as a SMILA job (crawlFilesystem), which performs a workflow (fileCrawling) and orchestrates four workers sequentially: fileCrawler (checks what files are already in the configured target folder), deltaChecker (checks the differences from the previously acquired content elements), fileFetcher (prepares records with "new" files), and updatePusher (prepares the bundle of files to move to the next job in the Pipeline).
4.1.4 Copyright Awareness Manager
The Copyright Awareness Manager is a logical component which ensures that copyright information – such as the license of a content object – is respected. The main part of the Copyright Awareness Manager is a module that manages a set of Pipelets that extract the license information from the submitted data object. The license can be stated directly as metadata during upload or extracted from the content object itself (e.g., in the case of MPEG files). For extensibility, the addition of a new license model simply requires the deployment and registration of another Pipelet that manages the extraction of the new license scheme. The Copyright Awareness Manager sets flags in the application content processing workflow that guide the processing and will later be stored in the content store to guide the access. Currently, the Copyright Awareness Manager sets a flag that determines whether the content will be cached or stored, i.e., whether it will be available via the CUbRIK platform.
A possible extension to the Copyright Awareness Manager is to analyze the source web page of an external content object to find information about its license. For example, a podcast file might not contain any copyright information itself, since it is stated on the web page where the podcasts can be downloaded. Until this component is implemented, external content objects should only be cached and not be provided via the CUbRIK platform, to minimize the risk of legal issues. A potential further extension to the copyright manager is to infer the resulting license of a combined content object – or to determine whether licenses are incompatible. For example, one object has a commercial license that forbids distribution, while another object is licensed under a Creative Commons Share Alike license, which demands free distribution under the same license; these two objects cannot be combined without violating one of the licenses.
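The registration mechanism can be sketched as follows; LicenseExtractor and ContentObject are illustrative types, not the actual SMILA Pipelet interfaces.

import java.util.ArrayList;
import java.util.List;

// Illustrative types, not the real SMILA record/Pipelet model
interface ContentObject { String getMimeType(); byte[] getData(); }

interface LicenseExtractor {
    boolean canHandle(ContentObject content);
    String extractLicense(ContentObject content); // e.g. "CC-BY-SA-3.0"
}

public class LicenseRegistry {
    private final List<LicenseExtractor> extractors = new ArrayList<LicenseExtractor>();

    /** Adding a new license model = registering another extractor. */
    public void register(LicenseExtractor extractor) { extractors.add(extractor); }

    /** Returns the extracted license, or null if no extractor applies. */
    public String determineLicense(ContentObject content) {
        for (LicenseExtractor e : extractors) {
            if (e.canHandle(content)) {
                return e.extractLicense(content);
            }
        }
        return null;
    }

    /** Flag consumed by the content processing workflow: an unknown license
     *  leads to cache-only handling, i.e. not available via the platform. */
    public boolean mayBeStored(ContentObject content) {
        return determineLicense(content) != null;
    }
}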
4.1.5 Data stores
The stores within the CUbRIK architecture provide a persistence layer. In particular, the Pipelines on the different tiers use it to store data that should be available across Pipeline runs; examples of such data are user annotations or the results of a crawling job. Data stores can run either inside SMILA, as native SMILA bundles, or outside the SMILA platform; in both cases, stores need to be accessed from within SMILA Pipelines/workflows. Therefore, a connector (Pipelet or worker) is needed to connect a store to a workflow/Pipeline. A store connector must support:
• Writing a content object or its metadata to a store;
• Reading the metadata of a content object;
• For indexes: searching for entries.
On a technical level, the stores provide a JSON/REST interface, at least outside the CUbRIK platform. This supports interoperability and usage via mobile devices. Furthermore, by “speaking” JSON/REST, the usage of the CUbRIK platform by developers is supported, which is important for crowdsourcing scenarios. The stores in CUbRIK that provide additional search functions are called indexes. Indexes are optimized for retrieval performance: the main purpose of an index is to find one or more objects, whereas the main purpose of a store is to provide access to stored data. The actual distinction is ambiguous, since stores might also offer functions to find objects (e.g., a user search in the performers store).
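A minimal sketch of the store connector contract listed above, with placeholder types instead of the D2.1 data model classes:

import java.util.List;
import java.util.Map;

public interface StoreConnector {
    /** Writes a content object or its metadata to the store. */
    void write(String contentId, byte[] content, Map<String, Object> metadata);

    /** Reads the metadata of a content object. */
    Map<String, Object> readMetadata(String contentId);

    /** For indexes: searches for entries matching the given query. */
    List<String> search(String query);
}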
The schema of a CUbRIK store is defined by the data models specified in D2.1 (Metadata models). Table 2 presents the mapping between the stores and the data models.

Store – Data Model
Performer Store – User Model
Content & Artifacts Store – Content Model; Content Description Model; Provenance and Rights Model
Text Index – Specialisation of the Content & Artifacts Store, thus using the same data model
Annotation Store – Content Annotation part of the Content Description Model
Conflicts Store – Conflict Resolution Model
Entities Store / Index – Entity Model; Description as part of the Content Description Model
Features Store – Content Annotations, Fingerprint & low-level features in the Content Description Model
Relevance Feedback Store – Content Description Model

Table 2: Mapping of stores to the corresponding Data Models of D2.1

Table 3 shows how the different data in CUbRIK are stored.

Data Type – Stores
Text
    Text Search Index; Record store & index
Video
    Raw files: File System
    Metadata / Annotations: Record store & index
    Segmentation data / segmentation relationships: Record store & index
    Keyframes: see the Pictures data type
    Transcripts: Record store & index
Audio
    Raw files: File System
    Metadata / Annotations: Record store & index
    Segmentation data: File System & Record store
Pictures
    Raw files: File System
    Metadata / Annotations: Record store & index
    Segmentation data: File System & Record store
    Low-level features: File System
    Transcripts: raw data and index
Entities (Locations / Space)
    External Web Service
Relevance Feedback Store
    CrowdSearch Store; Record store & index

Table 3: Mapping of stores to data types
4.1.6 CUbRIK Id Generator
Each content object within CUbRIK has a persistent unique ID. To create such an ID, a component is needed that is used throughout the whole platform. To provide fail-safe behavior, this component does not rely on a central instance that provides IDs. As the ID model, the universally unique identifier (UUID) was adopted; the UUID is an identifier standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). It is intended to uniquely identify information in distributed systems without significant central coordination. Theoretically speaking, UUIDs are not guaranteed to be unique; they should formally be described as "practically unique" rather than "guaranteed unique". In the context of the CUbRIK project, the UUID model was implemented as a globally unique identifier (GUID), a 128-bit value. This number is so large that the probability of the same number being generated randomly twice is negligible.
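With the JDK, such an identifier can be generated locally, without any central coordination; for example:

import java.util.UUID;

public class CubrikIdGenerator {
    /** Returns a new, practically unique 128-bit identifier as a string. */
    public static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(newId()); // e.g. 550e8400-e29b-41d4-a716-446655440000
    }
}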
4.1.7 Content Processing Manager
The functionality of the Content Processing Manager is natively supported by the SMILA process engine, which supports the orchestration of workflows in an asynchronous and synchronous way, as well as the communication of data between the activities of a workflow and from/to workflows and external applications. An example of the content processing Pipelines is illustrated in D5.1.
4.1.8 Relevance Feedback Pipelines and Manager
The Relevance Feedback Manager is a set of workflows, Pipelines and Pipelets that are used to submit relevance feedback to the CUbRIK platform, which stores it in the Relevance Feedback Store. It uses synchronous Pipelines, since these information objects are rather small and feedback should be submitted with low latency.
4.1.9 Domain / Demonstrator Specific Components and Workflows
This section discusses the design choices for the domain- or demonstrator-specific components and workflows. Domain- or application-specific components and workflows comprise the query Pipelines in the search and query tier, which are described in D6.1 (Query processing Pipelines). The demonstrators or applications vary according to the actual data needed and overlap in functionality only at the task level – for example, to access the content stores. Therefore, as shown in D6.1 (Query processing Pipelines), it is more efficient to combine the needed tasks specifically in Pipelines, since the Pipeline concept – together with editing tools like the BPEL editor – allows an easy combination of the needed components. Using a generic and universal approach to the declarative description of query Pipelines and their subsequent ad hoc interpretation would lead to unnecessary overhead. The Pipelines should implement a single user interaction rather than combine different user interactions in one Pipeline. The Pipelines and components in the other tiers are used for different media types or annotations and are therefore domain independent.
In general, the Pipelines in the query and search tier should use synchronous Pipelines in SMILA to allow low-latency queries, i.e., to ensure that a user gets an answer quickly. Rich multimodal queries can be constructed by creating specific Pipelines. Furthermore, federated queries can implement application-specific combination strategies for the results. Both scenarios are outlined in the following examples.
An example of a rich multimodal query is an extension to the current logo detection demonstrator: a user provides a brand name and, optionally, several logo images. The name of the brand is searched in the entity store, which retrieves further (already confirmed) brand logos. All logos are then used for logo detection in the keyframes of the videos, using the content-based search engine with the Feature and Fingerprint Store. For accessing these stores, the store connectors mentioned before are used.
An example of federated search is described in D6.1: a distributed search in picture services like Brands of the World (http://www.brandsoftheworld.com/) or Google Images and the internal picture store. The search takes the keywords to search for. For each source, a Pipelet (Query Broker) is implemented that translates the query string into the format of the source, issues the query and receives the result. A Pipelet then combines all of the results, and further Pipelets process this combined result. Duplicates could be detected by comparing the URLs or the content itself; this could then be used for the ranking and combination of the results. For example, the rank could be determined by the descending number of duplicates, while entries with the same number of duplicates could be ranked according to their source.
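The ranking strategy sketched above could be implemented as follows; the per-source results are represented simply as lists of URLs, and ties keep their first-seen (source) order because the sort is stable.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResultCombiner {
    /** Fuses per-source result lists (lists of URLs) into one list, ranked by
     *  descending duplicate count; ties keep their source order (stable sort). */
    public static List<String> combine(List<List<String>> perSourceResults) {
        final Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (List<String> source : perSourceResults) {
            for (String url : source) {
                Integer c = counts.get(url);
                counts.put(url, c == null ? 1 : c + 1);
            }
        }
        List<String> ranked = new ArrayList<String>(counts.keySet());
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return counts.get(b) - counts.get(a); // more duplicates first
            }
        });
        return ranked;
    }
}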
4.2 Stand-alone components
This section presents the parts of the CUbRIK architecture which should not be tightly integrated in SMILA but instead run as separate applications.
4.2.1 Performer Manager
The Performer Manager is the component of the CUbRIK platform devoted to the management of information related to users (CUbRIK users, CUbRIK end-users, etc.) and their social and performer profiles, as defined in the CUbRIK Data Model [5]. The Performer Manager is responsible for providing the following functionalities:
• Registration of users, including generating and providing unique user identifiers;
• Creation, verification, deletion, editing and storage of information related to the end-users, like access credentials (e.g., user name and password) and user profile information (e.g., name, surname, gender);
• Management of information related to “social profiles” and to subscriptions performed by users to “Social Groups”;
• Management of the statistics related to the performance of users as conflict resolution performers.
The Performer Manager module is tightly related to the Subscription Manager; it represents the components specific to the CUbRIK Performer. It tracks and stores user (behavior) data that is used by other modules, and it offers services for the registration and authentication of users, as well as for the management of their profile information. In detail, the Performer Manager will administer:
• Information explicitly provided by the performer, such as basic data (gender, age, language) as well as some general interests and skills;
• Social relationships and activities, according to the terms and conditions of the connected social network platforms;
• Relationships with user-generated metadata and annotations (tags, comments or bookmarks) made within the CUbRIK platform as part of GWAPs or conflict resolution tasks;
• Usage history: issued queries and click-through data from feedback routed through the Feedback Manager will be used to infer the performer’s preferences and to analyze the performer’s behavior.
Figure 11 depicts a high-level view of the Performer Manager architecture.
Figure 11: High-level view of the Performer Manager Architecture
The Performer Manager offers a single runtime module, which is in charge of interacting with the storage system to save and retrieve information about performers, annotations, and relevance feedback. The Performer Manager is designed in a plug-in fashion, and it allows the injection of:
• Analysis methods for performer profile and behavior evaluation (Performer Analysis API);
• Performer assignment logics (Performer Assignment API);
• Expert finding algorithms for performer assignment (Expert Finding API);
• Performer rating systems to organize performers in leaderboards (Performer Rating API).
The Performer Manager APIs are currently under development; they will be described in the D9.7 CUbRIK public API and SDK deliverable.
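Purely as an illustration of these plug-in points, the interfaces might take the shape below; the actual signatures will be defined in D9.7 and may differ substantially.

import java.util.List;

// Placeholder domain types for this sketch
interface Performer { String getId(); }
interface ConflictTask { String getId(); int getDifficultyLevel(); }

// Performer Analysis API: profile and behaviour evaluation methods
interface PerformerAnalysisApi {
    double evaluate(Performer performer);
}

// Performer Assignment API: "task to performer" allocation logics
interface PerformerAssignmentApi {
    Performer assign(ConflictTask task, List<Performer> candidates);
}

// Expert Finding API: expert finding algorithms for performer assignment
interface ExpertFindingApi {
    List<Performer> findExperts(ConflictTask task, int maxResults);
}

// Performer Rating API: organizes performers in leaderboards
interface PerformerRatingApi {
    List<Performer> leaderboard(int topN);
}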
4.2.2 Conflict Manager
The Conflict Manager is the component of the CUbRIK architecture devoted to the management of conflicts and conflict assignments to applications and performers. A conflict is a situation during the analysis of a given ContentObject where an absence of, or contradictions between, facts recorded in Annotations may arise [5]. Conflicts are detected by content annotation Pipelines: for example, in the logo detection h-demo a conflict is detected when the matching between a logo image and a video keyframe computed by the SIFT automatic component falls below a configurable threshold. A detected conflict is notified to the Conflict Manager, which, in turn, engages the crowd in the conflict resolution activities. As the crowd resolves conflicts, the Conflict Manager notifies the invoking Pipelines about the updated annotation. In CUbRIK, humans are involved in the multimedia content processing and querying processes to bring the quality of content analysis and querying to the next level, typically by overcoming the limitations of state-of-the-art content analysis components. In this section we describe the main architectural aspects of the Conflict Manager, its programming interfaces, and examples of interaction from and to other CUbRIK components.
Architecture
The Conflict Manager component is based on CrowdSearcher [1][2][3], a software module that embodies a general-purpose query and execution paradigm bridging conventional information analysis and retrieval to crowdsourcing and social network exploitation. The Conflict Manager is a CUbRIK module that works by:
• Taking as input a set of conflicts, formatted according to the CUbRIK data model;
• Feeding those conflicts in the form of human tasks to workers;
• Collecting the output of the human performers and providing the human feedback to the CUbRIK platform, formatted according to the CUbRIK data model.
Workers can be activated in multiple ways:
• On a paid crowdsourcing market, like Microtask;
• In a social network, like Facebook;
• In a microblogging application, like Twitter;
• In an ad hoc manner, e.g., via email or a dedicated web application.
Figure 12: High Level Architecture of the Conflict Manager
The Conflict Manager is composed of two main elements: the Configurator and the Task Execution Framework. The Configurator allows the creation of human tasks, either manually, with a configuration user interface, or programmatically, using a set of client APIs. Using this component, the Task Creator (e.g., a CUbRIK administrator or a CUbRIK Pipeline) can define:
• The set of content objects or annotations that are part of the conflict to resolve and that are going to compose the conflict resolution Task;
• The crowd engines (e.g., a social network, or a crowdsourcing platform) to use for the conflict resolution;
• A Task Planning strategy, which defines how to transform the conflict resolution task into micro-tasks, and how to assign them to performers;
• A Task Emission strategy, which defines when external CUbRIK applications/Pipelines/components should be notified about changes in the execution status of a given Task; for instance, an external module could be notified about the answers provided by each performer, immediately after the execution of the assigned micro-task; alternatively, the external module will be notified when all the planned micro-tasks have been executed.
The Configurator publishes conflict resolution Tasks in three different ways:
• Stand-alone: the task is executed on an ad hoc UI (which can be the default execution UI, the Task Execution Framework, or the interface of a custom CUbRIK App) and uses the social network only as a repository of social data;
• Embedded: the social/crowd platform is used to embed a social network application or game within the Conflict Resolution Manager;
• Native: the system behaves as an external application that directly uses the native features of the social platform for creating queries and collecting results (e.g., using the Like widget of Facebook to express preferences).
The Task Execution Framework (TEF) is a utility system that eases the creation of crowd-enabled applications using the Conflict Manager. The TEF uses the Configurator APIs to obtain the task description in order to dynamically create a user interface for executing the conflict resolution task. After the task is executed, the TEF sends the responses back to the system. Finally, the Task Creator can monitor the status of his task using either the dashboard offered by the system or the API. The Conflict Manager also interacts with the Performer Store to retrieve information about the set of performers that could contribute to the conflict resolution activities, and it uploads the performance statistics resulting from the task executions.
Client Libraries
The Conflict Manager is shipped with a set of Java APIs for eased integration with Pipeline code. Figure 13 provides the UML class diagram for the Conflict Manager Java API, where the abovementioned concepts and data structures are represented in an object-oriented fashion. A complete description of the Conflict Manager Java APIs is available in [3].
[UML class diagram omitted: it shows the ConflictManager facade (with operations such as addTask(), getTask(), setStrategies(), publishTask(), getAnswer(), postAnswer(), getPerformers() and getScripts()) together with the Task, Schema, CrowdObject, Performer, Answer and Annotation classes and the planning, emission, invitation and performer assignment strategy types.]
Figure 13: UML Class diagram of the Conflict Manager Java API
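Before turning to concrete interaction examples, the following self-contained sketch illustrates how the API of Figure 13 might be driven from Pipeline code. Only the class and method names are taken from Figure 13; all bodies below are stubs standing in for the real classes, whose documentation is available in [3].

import java.util.ArrayList;
import java.util.List;

// Stub: a conflict object with the <brandName, recordid, imageUrl> schema
class CrowdObject {
    final String brandName, recordid, imageUrl;
    CrowdObject(String brandName, String recordid, String imageUrl) {
        this.brandName = brandName; this.recordid = recordid; this.imageUrl = imageUrl;
    }
}

// Stub: a conflict resolution task collecting the objects to evaluate
class Task {
    final String question;
    final List<CrowdObject> objects = new ArrayList<CrowdObject>();
    Task(String question) { this.question = question; }
    void addObject(CrowdObject o) { objects.add(o); }
}

// Stub facade; method names follow Figure 13, signatures are assumed
class ConflictManager {
    Task addTask(String question) { return new Task(question); }
    void setStrategies(Task t, String planning, String emission) {
        System.out.println("planning=" + planning + ", emission=" + emission);
    }
    void publishTask(Task t) {
        System.out.println("published task with " + t.objects.size() + " objects");
    }
}

public class LogoConflictExample {
    public static void main(String[] args) {
        ConflictManager cm = new ConflictManager();
        Task task = cm.addTask("Which of these images really show the brand logo?");
        // One conflict object per candidate image retrieved from Google Images
        task.addObject(new CrowdObject("SomeBrand", "rec-42",
                "http://example.org/logo-candidate.jpg"));
        // Redundant planning and per-answer emission, as described in the text
        cm.setStrategies(task, "RedundantPlanning(5 objects, redundancy 3)",
                "NotifyEachAnswer");
        cm.publishTask(task);
    }
}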
A typical example of usage of the Java APIs in the context of the Logo Detection application consists of the creation of a new conflict resolution task, with conflicts defined as objects having the schema <brandName, recordid, imageUrl>, where brandName is the name of the brand for the logo to be found in the picture and imageUrl is the URL of the picture to be shown to the performer. The conflict resolution task is then set up, configured, and opened for execution.
Interactions with CUbRIK Components
The following sections provide exemplary interaction patterns that take place between the Conflict Resolution Manager and the CUbRIK components. In more detail, we show how a CUbRIK component can create a new conflict resolution task, configure it and feed it with new conflicts. Then we show how a Conflict Performer Client can interact with the Conflict Resolution Manager to allow a performer to execute a conflict task. Finally, we show how the Conflict Resolution Manager notifies third-party applications (e.g., a CUbRIK Pipeline) about the conclusion of a conflict resolution task. The examples are contextualized to the Logo Detection application.
Task Creation
In the Logo Detection application, a new Conflict Resolution Task is created when a request for the detection of new brands in the video collection occurs. When the user of the Logo Detection application specifies a new brand, a set of representative brand logos is retrieved by querying the brand name in Google Images. Unfortunately, some of the retrieved images might be irrelevant or wrong; therefore, a new Conflict Resolution Task is created. The sequence diagram of Figure 14 shows how the Logo Retrieval job of the Logo Detection application interacts with the Conflict Resolution Manager when a new brand is provided. The Logo Retrieval job is invoked once for each bucket of images (e.g., 5) retrieved from Google Images; therefore, the job creates a new Conflict Resolution Task only when the first bucket arrives, by interacting with the Conflict Resolution Manager to define the planning (object splitting and micro-task assignment) and emission strategies to enact for the task execution. Then, as new images arrive in new buckets, they are added to the list of objects to be evaluated by the crowd.
Figure 14: Sequence Diagram for Task Creation
Task Execution and Notification
The sequence diagram of Figure 15 depicts an example of the interaction between the Conflict Resolution Manager and a CUbRIK Pipeline upon the execution of a micro-task by a performer. When the performer accesses the Conflict Performer Client to execute a micro-task, the description of the corresponding Task is retrieved, together with the list of objects to evaluate. Then, upon the selection of the correct images by the performer, the Conflict Resolution Manager stores the answer in the Conflict Store and updates its internal information about the evaluated objects and about the performer; it then notifies the Logo Validation job of the Logo Detection application about the provided answers which, in turn, updates the statistics about the evaluated images.
Figure 15: Interaction between the Conflict Resolution Manager and an external module
Task End Notification
The sequence diagram of Figure 16 shows another example of a notification performed by the Conflict Resolution Manager: upon the end of the planned conflict resolution task, the CRM notifies the Logo Validation job about the end of the task, so as to allow the job to free resources (e.g., cache) occupied for the processing of the brand logos.
Figure 16: Interaction between the Conflict Resolution Manager and an external module
References
1. Alessandro Bozzon, Marco Brambilla, and Stefano Ceri. 2012. Answering search queries with CrowdSearcher. In Proceedings of the 21st International Conference on World Wide Web (WWW '12). ACM, New York, NY, USA, 1009-1018. DOI=10.1145/2187836.2187971, http://doi.acm.org/10.1145/2187836.2187971
2. Alessandro Bozzon, Marco Brambilla, Andrea Mauri. 2012. A Model-Driven Approach for Crowdsourcing Search. In Proceedings of the 1st International Workshop on Crowdsourcing Web Search.
3. The CrowdSearcher framework, http://crowdsearcher.search-computing.org
4. Empolis GmbH: Technisches White Paper e:Information Access Suite. Technical Report, empolis GmbH, http://www.empolis.de/executive-forum/Downloads/m_Technisches-WP-IAS V1.1.1_050906.pdf
5. CUbRIK Metadata Models – Deliverable D2.1, V1.0 – 31 March 2012