CUbRIK Space and Time Entity Repository


SPACE AND TIME ENTITY REPOSITORY Human-enhanced time-aware multimedia search

CUbRIK Project FP7-ICT-287704 Deliverable D4.1 WP4

Deliverable Version 1.0 - 30 September 2012 Document ref.: CUbRIK.D41.UNITN.WP4.V1.0


Programme Name: ICT
Project Number: 287704
Project Title: CUbRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INNEN, HOM, CVCE, EIPCM
Document Number: CUbRIK.D41.UNITN.CERTH.LUH.WP4.V1.0
Work-Package: WP4
Deliverable Type: Document
Contractual Date of Delivery: 30 September 2012
Actual Date of Delivery: 30 September 2012
Title of Document: Space and Time Entity Repository
Author(s): Vincenzo Maltese (UNITN), Uladzimir Kharkevich (UNITN), Anca-Livia Radu (UNITN), Theodoros Semertzidis (CERTH), Michalis Lazaridis (CERTH), Dimitris Rafailidis (CERTH), Anastasios Drosou (CERTH), Mihai Georgescu (LUH)
Approval of this report:
Summary of this report: Description of a space and time entity repository which is used in the CUbRIK project; populating the entity repository with entities from GeoNames and YAGO; enriching the entity repository with media entities from Flickr, Twitter, Picasa and Panoramio.
History:
Keyword List: entity repository, space, time, media entity annotation
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.



Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.



Table of Contents

EXECUTIVE SUMMARY  1
1. ENTITY REPOSITORY  2
  1.1 THE DATA MODEL  2
  1.1.1 A Domain-Centric Data Model  2
  1.1.2 Space Representation  6
  1.1.3 Time Representation  6
  1.2 POPULATING THE ENTITY REPOSITORY  7
  1.3 IMPORTING BASIC TERMINOLOGY  8
  1.4 THE SPACE DOMAIN  9
  1.5 THE TIME DOMAIN  9
  1.6 IMPORTING ENTITIES FROM GEONAMES  10
  1.7 IMPORTING ENTITIES FROM YAGO  11
  1.7.1 Discovery  11
  1.7.2 Evaluation  12
  1.7.3 Modularization  14
  1.7.4 Translation  15
  1.7.5 Refinement  16
  1.7.6 Entity Matching and Integration  20
  1.7.7 Evaluation of the Selected Knowledge  20
  1.8 ENTITY REPOSITORY STATISTICS  22
2. MEDIA ENTITY ANNOTATION H-DEMO  23
  2.1 SYSTEM ARCHITECTURE  23
  2.2 ENTITY REPOSITORY API  25
  2.3 CRAWLING COMPONENTS  27
  2.3.1 COFetch tool  27
  2.3.2 PHP COFetch for Flickr Images  28
  2.3.3 Java crawler for Flickr Images  31
  2.3.4 Java crawler for Panoramio Images  32
  2.3.5 Java crawler for Picasa Images  33
  2.3.6 YouTube Crawler for Video Metadata  33
  2.3.7 Twitter multimedia crawlers  33
  2.4 MEDIA CLEANING BASED ON CONTENT SIMILARITY  36
  2.5 SOCIAL RE-RANKING OF FLICKR PHOTOS  41
  2.6 SMILA PIPELINE  42
  2.6.1 Entity Search and Entity Update Pipelets  42
  2.6.2 Crawling pipelets: Flickr, Panoramio, Picasa  43
  2.6.3 Twitter Pipelets  43
3. CONCLUSIONS AND NEXT STEPS  44
4. REFERENCES  45



Abbreviations table

ABox       Assertion Axioms (Box)
API        Application Programming Interface
CLI        Command Line Interface
COFetch    Content Object Fetch tool
CRUD       Create, Read, Update and Delete
DAG        Directed Acyclic Graph
DL         Description Logic
GB         Geometric Blur Algorithm
GPE        Geo-Political Entity
GPS        Global Positioning System
GWAP       Game With a Purpose
JSON       JavaScript Object Notation
LCS        Least Common Subsumer
NLP        Natural Language Processing
PAT        Autonomous Province of Trento
POS        Part Of Speech
RDF        Resource Description Framework
RDFS       Resource Description Framework Schema
REST       REpresentational State Transfer
SIFT       Scale Invariant Feature Transform
SSIM       Self Similarity algorithm
TBox       Terminological Axioms (Box)
TGN        Thesaurus of Geographical Names



Executive Summary

Deliverable D4.1 “Space and Time Entity Repository” discusses the processes and tools used for the creation of the CUbRIK Entity Repository and for updating it with multimedia content through the Media Entity Annotation H-Demo.

The first part of the deliverable discusses the creation of a highly accurate knowledge base of real world entities and facts about these entities, including knowledge in the time and space domains, but also in other semantic fields (e.g., people, institutions). The knowledge base consists of classes (e.g. lake, city, monument), entities (e.g. Lago di Molveno, Trento, Buonconsiglio Castle), their metadata (e.g. latitude and longitude coordinates, colour, size, chronological or other time-related information) and relations between them (e.g. part-of, instance-of, near). The knowledge base is bootstrapped by importing data from high quality resources such as YAGO [8] and GeoWordNet [9]. The quality of the imported data is evaluated, and incompleteness and inconsistency problems are detected. These problems are addressed by applying higher-level type checkers that control the coherence of the knowledge base both at the level of the entity, seen as an aggregation of facts, and at the level of single facts.

The second part of the deliverable describes how the knowledge base of entities can be enriched with images of named entities, as well as with other entities, relations and attributes related to the images. For instance, people entities related to the images, and the data socially collected by these people (e.g. the tags and comments they have created), are analyzed and imported into the entity repository. The Media Entity Annotation H-Demo is used for this purpose. Its primary goal is to analyse and exploit popular multimedia social networks (Flickr, Twitter, Picasa and Panoramio) in order to harvest representative images for the named entities stored in the entity repository. Harvesting of representative images is performed both automatically, e.g. using focused crawlers, and by means of human computation techniques, e.g. the COFetch tool. The social data attached to the photos is also used to re-rank them on the basis of social features, in order to improve the selection of representative images for named entities.

The representative images collected by the H-Demo can be used to improve both entity-centric and media-centric search. For instance, they can be used to visualize named entities, e.g. in entity search results. Content-based multimedia search can be performed on the harvested images in order to find the entities depicted in a photo, and the entity repository can then provide additional information associated with these entities. The SMILA pipelets and pipelines that implement the H-Demo are also presented.



1. Entity Repository

In [1] we emphasized how - despite the progress made in the last forty years - one of the main barriers towards the use of semantics in practical applications is the lack of background knowledge, i.e. knowledge of specific domains such as geography (see for instance the work described in [31]), music and sport. Our proposed solution to this problem consists in the development of what we called a diversity-aware knowledge base, i.e. a very large and virtually unbounded knowledge base which is able to capture the diversity of the world, especially in language and knowledge. The first instantiation of this idea is what we called Entitypedia1. Entitypedia is an entity based repository initially developed within the Living Knowledge EU project2.

The entity repository developed in CUbRIK is an extension of the Entitypedia entity repository. The CUbRIK entity repository introduces the representation of Time, refines the entity importing methodology and instantiates it for the importing of entities from YAGO. Novel modularization techniques are also developed; they allow selecting, from the source ontology being imported (e.g. YAGO), the entities of specific types (locations, organizations and persons are currently supported) that satisfy the required quality criteria. In this section, the CUbRIK entity repository and its data model, along with its improvements over the original version of the Entitypedia entity repository, are shortly presented.

The next sections are organized as follows. Section 1.1 describes the data model, with special emphasis on Space and Time representation. Section 1.2 explains the steps and the methodologies that we have been following to progressively populate the entity repository. Sections 1.3 to 1.7 describe these steps in detail, with particular emphasis on the importing of entities from YAGO. Finally, Section 1.8 provides some statistics about the current size of the entity repository.

1.1 The Data Model

1.1.1 A Domain-Centric Data Model

For the definition of the data model we followed and adapted the faceted approach [2]. Created by the Indian librarian and mathematician Ranganathan, it is used for the organization of knowledge in libraries and is centred on the notions of domain and facet. As described in [10], a domain can be defined as any area of knowledge or field of study that we are interested in or that we are communicating about. Domains provide a bird's eye view of the whole field of knowledge. Domains may include any conventional field of study (e.g., library science, mathematics, physics), applications of pure disciplines (e.g., engineering, agriculture), any aggregate of such fields (e.g., physical sciences, social sciences), and they may also capture knowledge about our everyday lives (e.g., music, movies, sport, recipes, tourism).

A domain can be broken down into a number of facets, each of them describing a specific aspect of the domain. For instance, in the Medicine domain we can distinguish among the body parts, the diseases that can affect them and the different treatments that can be taken to overcome or prevent them. Each of these aspects provides more detailed knowledge. A facet can be defined as a hierarchy of homogeneous terms describing an aspect of the domain, where each term in the hierarchy denotes an atomic concept [4]. In our adaptation of the faceted approach, an atomic concept is a class, a relation name, an attribute name or a value in the domain. Entities (individuals of the classes) are at the leaves of the facets and populate them.

1 http://entitypedia.org/
2 http://livingknowledge-project.eu/



In the original approach, since the purpose is to classify bibliographic material, facets are classification ontologies, i.e. each concept in the ontologies denotes a set of documents, while links between concepts denote subset relations [4][5]. As we emphasize in [1], the major drawback of the original approach lies in the fact that it fails to make explicit the way the meaning (semantics) of subjects (what the books are about) is built starting from the semantics of their constituents. In fact, it only considers the syntactic form by which subjects are described in natural language (syntax). Consequently, it does not allow for a direct translation of its elements - terms and arcs in the facets - into a formal language, e.g. in the form of Description Logic (DL) [3] axioms. It does not explicitly specify the taxonomical is-a and instance-of (genus/species) and mereological part-of (whole/part) relations between the classes, thus limiting its applicability. Making them explicit is a fundamental step towards automation and interoperability.

To overcome these limitations, in our approach we define facets as descriptive ontologies, i.e. ontologies where concepts denote real world objects and their properties. Specifically, we define a domain as a 4-tuple <C, E, R, A>, where:

• C is a set of classes
• E is a set of entities
• R is a set of binary relations
• A is a set of attributes

These sets correspond to what in the faceted approach are called fundamental categories. More in detail:

• C: Elements in C denote classes of real world objects.
• E: Elements in E represent the instances of the classes in C.
• R: The set R provides structure to the domain by relating entities and classes. It includes the canonical is-a (between classes in C), instance-of (associating instances in E to classes in C) and part-of (between classes in C or between entities in E) relations and is extended with additional relations according to the purpose, scope and subject of the ontology. We assume is-a and part-of to be transitive and asymmetric and therefore we refer to them as hierarchical. We call associative those relations not having these properties.
• A: Elements in A denote qualitative (when the value is expressed through a quality), quantitative (when the value is expressed through a quantity) and descriptive attributes of the entities. We further differentiate between attribute names and attribute values. Each attribute name in A denotes a relation associating each entity to the corresponding attribute values. With this purpose, we also define a value-of relation that associates each attribute name to the corresponding set of possible values (the range of the relation).

Within each fundamental category, we organize each domain in three levels:

• Formal language level: it provides the terms used to denote the elements in C/E/R/A. We call them formal terms to indicate that they are language independent and that they have a precise meaning and role in (logical) semantics. Each term in C denotes a class (e.g. lake, river and city). Each term in E denotes an entity (e.g. Garda lake). Each term in R represents the name of a relation (e.g. direction). Each term in A denotes either an attribute name (e.g. depth) or an attribute value (e.g. deep). Elements in C, R and A are arranged into facets using is-a, part-of and value-of relations.
• Knowledge level: it codifies what is known about the entities in E in terms of attributes (e.g. Garda lake is deep), the relations between them (e.g. Tiber is part of Rome) and with the corresponding classes (e.g. Tiber is an instance of river). Terms in E are at the leaves of the facets and populate them. The knowledge level is codified using the formal language described in the previous item and is, therefore, also language independent.
• Natural language level: we define a natural language as a set of words (i.e. strings),
that we also call natural language terms, such that words with the same meaning within each natural language are grouped together and mapped to the same formal term. This level can be instantiated to multiple languages (e.g. English and Italian). Similarly to WordNet [13], and following the same terminology, words are disambiguated by providing their meaning, also called sense. The meaning of each word can be partially described by associating with it a natural language description. For instance, stream can be defined as "a natural body of running water flowing on or under the earth". Within a language, words with the same meaning (synonymy) are grouped into a synset. For instance, since stream and watercourse have the same meaning in English, they are part of the same synset. Given that a word can have multiple meanings (homonymy), the same word can correspond to different senses and therefore belong to different synsets. For instance, the word bank may mean "sloping land (especially the slope beside a body of water)", "a building in which the business of banking is transacted" or "a financial institution that accepts deposits and channels the money into lending activities". In our data model, within a language each synset is associated with a set of words (the synonyms), a natural language description, a part of speech (noun, adjective, verb or adverb) and a corresponding formal term.

In each domain we clearly separate the elements of C/R/A, which provide the basic terminology, from those in E, which provide the instantiation of the domain. The data model we propose has a direct formalization in DL: classes correspond to concepts, entities to instances, and relations and attributes to roles. The formal language level provides the TBox, while the knowledge level provides the ABox for the domain. Together they correspond to what people call the background knowledge [6], i.e. the a-priori knowledge which must exist to make semantics effective. Each facet corresponds to what in logics is called a logical theory [7] and to what in computer science is called an ontology, or more precisely a lightweight ontology [4], and plays a fundamental role in task automation (formal reasoning). The natural language level instead provides an interface to humans and can be exploited, for instance, in Natural Language Processing (NLP).
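To make the natural language level more concrete, the sketch below shows one possible in-memory representation of a synset linked to its language-independent formal term. The class and field names are illustrative assumptions made for this deliverable; they do not reflect the actual Entitypedia schema.

import java.util.List;

// Illustrative sketch only: a synset groups synonymous words of one language,
// carries a gloss and a part of speech, and points to a language-independent
// formal term (concept) at the formal language level.
public class SynsetSketch {

    enum PartOfSpeech { NOUN, ADJECTIVE, VERB, ADVERB }

    record Synset(long conceptId,          // language-independent formal term
                  String language,         // e.g. "en", "it"
                  List<String> words,      // synonyms, e.g. stream, watercourse
                  String gloss,            // natural language description
                  PartOfSpeech pos) { }

    public static void main(String[] args) {
        Synset stream = new Synset(
                42L, "en",
                List.of("stream", "watercourse"),
                "a natural body of running water flowing on or under the earth",
                PartOfSpeech.NOUN);
        System.out.println(stream);
    }
}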



Figure 1: A small fragment of the Space domain following the proposed data model

In Figure 1, we provide an example of the instantiation of the model for a small fragment of the Space domain, where classes are represented with circles, entities with squares, relation names with hexagons, attribute names with trapezoids and attribute values with stars. Letters inside the nodes (capital letters for entities and small letters for classes, relations and attributes) denote formal terms, while the corresponding natural language terms are provided as labels of the nodes. For the sake of simplicity, synonyms are not given. Arrows denote relations between the elements in C/E/R/A: solid arrows represent the relations constituting the facets (is-a, part-of and value-of relations), which are part of the formal language level; dashed arrows represent instance-of, part-of and the other relations or attributes (depth in this case), which are part of the knowledge level. Here the hierarchies rooted in body of water, populated place and landmass are facets of entity classes and are subdivisions of location; the one rooted in direction is a facet of relations, and the one rooted in depth is a facet of attributes.

With the model presented above, real world entities in E are described as a set {a} of attributes/relations, each of them being a pair a = <AN, {AV}>, where AN is the attribute/relation name and {AV} is the set of its values, consistent with what is defined in A (definition of the attributes) and R (definition of the relations). Each entity is then associated to one or more classes in C through the instance-of relation. The picture below shows an example of an entity following the data model.



Figure 2: An example of an entity following the data model

Here Garda lake is described as an individual of the class lake, with six additional attributes/relations: depth is a qualitative attribute; origin is a descriptive attribute; latitude and longitude are quantitative; inflow and part-of are spatial relations. In fact, Sarca is a river flowing into Garda, which is a lake in Trento.
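As a further illustration, the sketch below encodes the Garda lake example of Figure 2 as a set of attribute/relation name-value pairs plus an instance-of link. The types and field names are assumptions made for illustration, and for brevity a single value is shown per attribute while the model in general allows a set {AV}; the numeric coordinate values are likewise only indicative.

import java.util.Map;
import java.util.Set;

// Illustrative sketch: an entity is a set of <attribute name, value> pairs
// plus one or more instance-of links to classes (cf. Figure 2).
public class EntitySketch {

    record Entity(String name,
                  Set<String> instanceOf,
                  Map<String, Object> attributes) { }

    public static void main(String[] args) {
        Entity gardaLake = new Entity(
                "Garda lake",
                Set.of("lake"),
                Map.of(
                        "depth",     "deep",      // qualitative attribute
                        "origin",    "glacial",   // descriptive attribute (value assumed)
                        "latitude",  45.63,       // quantitative attribute (value indicative)
                        "longitude", 10.68,       // quantitative attribute (value indicative)
                        "inflow",    "Sarca",     // spatial relation
                        "part-of",   "Trento"));  // spatial relation
        System.out.println(gardaLake);
    }
}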

1.1.2 Space Representation

As shown in Figure 1, we represent Space as a domain along the three levels of our data model. At the formal language level, the formal terms denoting the classes, the relations, the attributes and the corresponding entities are arranged into facets. At the knowledge level, as shown in Figure 2, entities are described in terms of attributes and relations. In particular, depth, latitude and longitude are examples of spatial attributes, while inflow and part-of are examples of spatial relations. At the natural language level, the terminology of the domain is currently given in English only. The steps that we have followed to construct the Space domain are briefly described in Section 1.4.

1.1.3 Time Representation

We represent Time either as an attribute or as a meta-attribute, according to whether the information pertains to an entity as a whole or to one of its attributes.

Time as attribute

We assume all entities to have an associated lifespan that corresponds to three attributes Start, End and Duration:

• Start specifies the moment in time an entity started to exist
• End specifies the moment in time an entity ceased to exist
• Duration specifies the duration of existence of an entity

We represent the first two as standard dates, and the last as the distance between the two dates. They can be specialized into different attribute names for more specific entities. For instance, for persons they might be called date of birth, date of death and lifetime respectively; for organizations they might be called date of establishment, date of disestablishment and lifetime respectively.

Time as meta-attribute

As a meta-attribute, Time specifies the time validity of a relation or an attribute of a certain entity. Similarly to the attribute case, we represent it as a validity triple <Start, End, Duration> attached to the corresponding relation or attribute value. For instance, we can specify that the validity of the pair <president, USA> (where the first element is the relation name and the second is the relation value) of the person Ronald Reagan is <1981-01-20, 1989-01-20, 8 years>.
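A minimal sketch of how the lifespan triple and the validity meta-attribute described above could be represented is given below; the field names and the use of java.time are our own assumptions for illustration, not the repository's actual schema.

import java.time.LocalDate;
import java.time.Period;

// Illustrative sketch of Time as attribute (lifespan) and as meta-attribute
// (validity of an attribute/relation value).
public class TimeSketch {

    // <Start, End, Duration>; Duration is the distance between the two dates.
    record Validity(LocalDate start, LocalDate end, Period duration) {
        static Validity between(LocalDate start, LocalDate end) {
            return new Validity(start, end, Period.between(start, end));
        }
    }

    // A relation value whose truth is bounded in time, e.g. <president, USA>.
    record TimedRelation(String name, String value, Validity validity) { }

    public static void main(String[] args) {
        Validity lifespan = Validity.between(
                LocalDate.of(1911, 2, 6), LocalDate.of(2004, 6, 5)); // Ronald Reagan
        TimedRelation presidency = new TimedRelation(
                "president", "USA",
                Validity.between(LocalDate.of(1981, 1, 20), LocalDate.of(1989, 1, 20)));
        System.out.println(lifespan);
        System.out.println(presidency);
    }
}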



1.2 Populating the Entity Repository

Following the data model presented in the previous section, the general strategy that we have been pursuing to incrementally populate the entity repository is as follows:

• Importing basic terminology. We initially populated the entity repository with general terminology imported from WordNet 2.1 and the Italian section of MultiWordNet. This essentially provided what is needed to bootstrap the natural language level, in English and Italian, respectively3.
• Populating the knowledge base with concepts by developing domains. Following the principles at the basis of the faceted approach, specific domains - such as Space, food, sports, tourism, music and movie - are developed by importing concepts from high quality data sources. We started with the Space domain. This basically corresponds to the development of the TBox.
• Populating the knowledge base with entities. The knowledge level is populated with entities and their properties. This basically corresponds to the development of the ABox. So far, around 7 million locations have been automatically imported from GeoNames4 [9], around 20,000 locations of the Trentino region in Italy have been imported from a local dataset of the Autonomous Province of Trento (PAT) [11] and around 500,000 additional locations, 700,000 persons and 150,000 organizations have been imported from YAGO [8].

The methodology that we follow to develop domains is an adaptation of the faceted approach. The main steps, which are applied manually, are [10]:

1. Identification of the domain terminology. Sources of domain specific terms are identified and evaluated. The best candidates are selected. Natural language terms are extracted from them and disambiguated into formal terms.
2. Analysis. Following the principles at the basis of the faceted approach, the terms collected during the previous phase are analyzed in terms of their distinguishing characteristics.
3. Synthesis. By using the characteristics identified in the previous step, the terms are used as building blocks for the construction of the facets of the domain.
4. Standardization and ordering. For each different notion, a standard term is chosen among the synonyms. Terms within each synset and within each facet are then ordered.

The methodology that we follow to populate the entity repository with entities consists of the following macro-steps:

1. Discovery: candidate data sources are identified.
2. Evaluation: data sources are evaluated against quantitative (number and kind of entities) and qualitative (accuracy of the entity metadata) criteria, with the purpose of selecting the best candidate.
3. Modularization: relevant parts of the data source are selected, for instance by isolating only the entities of a certain kind (e.g. locations and organizations) or those satisfying specific quality criteria.
4. Translation: the selected data source is pre-processed to comply with an internal data storage format.
5. Refinement: the selected entities are modified and extended with the purpose of refining their classes, relations and attributes, of correcting the errors detected during the evaluation and of enriching them semantically. The latter consists in disambiguating the meaning of classes, relations and attributes and making their semantics compliant with the entity repository model and content.
6. Entity matching and integration: the refined entities are matched against the entity repository to detect possible duplicates with entities already present in it. With the integration, entities are merged with those in the entity repository: new entities are created, while matched entities are merged with existing ones.

The steps above are consistent with state of the art solutions in ontology reuse such as those described by Simperl in [14]. However, while these solutions focus on the importing of ontologies, we import entities from any structured data source. The importing of the basic terminology is described in Section 1.3; the construction of the Space domain is described in Section 1.4 (the development of additional domains is left as future work); entity importing is described in Section 1.6 for the locations imported from GeoNames and the PAT dataset, and in Section 1.7 for the entities imported from YAGO.

3 These two languages were selected because of the importance that the English and Italian languages have, respectively, in the context of the Living Knowledge (http://livingknowledge-project.eu) and the Live Memories (http://www.livememories.org) projects we had been involved in.
4 http://www.geonames.org/

1.3 Importing Basic Terminology

Words, synsets and the lexical relations between them were imported from WordNet and MultiWordNet into the natural language part of the entity repository, instantiated for the English and Italian languages respectively. WordNet instances/entities were not imported for two main reasons: firstly, they are not a significant number and no attributes are provided for them; secondly, we plan to import huge quantities of entities and corresponding metadata from other resources. Note that the official number of entities in WordNet is 7,671 [12], while we found out that 683 of them are actually common nouns. The wrong ones were identified by manually verifying those with no uppercased lemma. These were converted into noun synsets, while the other 6,988 were still considered entities. Figures are provided in Table 1. Excluding the 6,988 entities and the corresponding relations, WordNet was completely imported. MultiWordNet, mainly due to the heuristics used to reconstruct the mapping with WordNet 2.1, was only partially imported. In particular, 92.47% of the words, 94.28% of the senses and 94.30% of the synsets were imported. Moreover, the 318 (Italian) lexical and semantic relations provided were not imported.

Object                   WordNet 2.1    MultiWordNet
Synset                   110,609        36,448
Relation                 204,481        -
Word                     147,252        41,705
Sense                    192,620        63,595
Word exceptional form    4,728          -

Table 1: Data imported from WordNet 2.1 and MultiWordNet

For each synset in the two languages, a language-independent concept was created at the formal language level. If the same notion can be expressed in the two languages, the corresponding synsets are linked to the same concept. Since MultiWordNet is aligned with the older WordNet 1.6 version, the mapping between the two languages was reconstructed by combining the existing mapping5 between WordNet 1.6 and 2.0 with another one we created expressly between WordNet 2.0 and 2.1 using some heuristics. Notice that for adjectives and adverbs we had to compute the mapping between WordNet 1.6 and 2.1 directly, since it was not available elsewhere. Notice also that, due to the partial coverage of the language in MultiWordNet and the well-known problem of gaps in languages (i.e. given a lexical unit in a language, it is not always possible to identify an equivalent lexical unit in another language), not all concepts have a corresponding synset in Italian.

Hypernym (is-a) and transitive part meronym (part-of) relations were selected as semantic hierarchical relations. All the other relations were defined as associative relations. We plan to significantly reorganize them in the future.

5 http://www.cse.unt.edu/~rada/downloads.html#wordnet
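The cross-version alignment essentially amounts to composing two synset mappings (WordNet 1.6 to 2.0 and 2.0 to 2.1). The sketch below shows such a composition on plain identifier maps; the identifiers are invented for illustration, and the real alignment additionally relies on the heuristics mentioned above.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: composing a WordNet 1.6 -> 2.0 mapping with a
// 2.0 -> 2.1 mapping to obtain the 1.6 -> 2.1 alignment needed for
// MultiWordNet. Synsets with no image in the second mapping are dropped,
// which is one reason why the MultiWordNet import is only partial.
public class MappingComposition {

    static Map<String, String> compose(Map<String, String> first,
                                       Map<String, String> second) {
        Map<String, String> result = new HashMap<>();
        first.forEach((from16, to20) -> {
            String to21 = second.get(to20);
            if (to21 != null) {
                result.put(from16, to21);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> wn16to20 = Map.of("16-00001740-n", "20-00001930-n");
        Map<String, String> wn20to21 = Map.of("20-00001930-n", "21-00002056-n");
        System.out.println(compose(wn16to20, wn20to21)); // {16-00001740-n=21-00002056-n}
    }
}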

1.4 The Space Domain

This section provides a brief description of the instantiation of the steps that led to the construction of the formal and natural language levels of the Space domain (the TBox), while the description of how we constructed the knowledge level (the ABox) is provided in Sections 1.6-1.7. Following the faceted approach, this has been done in four subsequent steps [10]:

1. Identification of the domain terminology. We selected the Thesaurus of Geographical Names (TGN)6 and GeoNames as main sources. The terms extracted from them were disambiguated and mapped to the terms already present in the entity repository (those coming from WordNet and MultiWordNet). With the integration, the missing terms were added to the entity repository.
2. Analysis. The terms collected and disambiguated during the previous phase were analysed in terms of their topological, geometric or geographical characteristics. For instance, a mountain can be described as (a) a well-defined elevated land, (b) formed by geological formation and (c) with altitude in general >500 m, while a hill can be described as (a) a well-defined elevated land, (b) formed by geological formation and (c) with altitude in general <500 m.
3. Synthesis. By using the characteristics identified in the previous step, the terms were used as building blocks for the construction of the facets of the Space domain. For instance, given that mountain and hill have characteristics in common, they can be put in the same facet, i.e. landform.
4. Standardization and ordering. For each different notion, a standard term is chosen among the synonyms, based on the study of some relevant scientific publications or standard vocabularies. Terms within the facets are then ordered. For instance, we may decide to order landforms in decreasing order of altitude.

The steps above are manual. Overall, the Space domain we developed currently consists of around 1,000 classes, 70 different relations and 31 attributes. For instance, facets of entity classes include administrative division, populated place, facility, landform and body of water; facets of relations include direction (east, west, south, north, etc.), relative level (above, below, etc.), external spatial (alongside, adjacent, near, etc.) and sideways spatial (right, left, etc.) relations; attributes include latitude, longitude, population, inflow, depth and time zone.

6 http://www.getty.edu/research/conducting_research/vocabularies/tgn

1.5 The Time Domain

The Time domain is taken as the terminology - at the formal and natural language levels - which is necessary to conceive and communicate time in its most general sense. In fact, as time entities span different domains - they include, for instance, historical events (e.g. wars, revolutions), social events (e.g. parties, ceremonies), academic events (e.g. conferences, workshops), sports events (e.g. contests, races, soccer matches) and so on - addressing them would require more specific domains in which such terminology specializes. This specialization is application dependent and will therefore be addressed as future work.

Similarly to Space, we followed the faceted approach. Since the domain is generic, we mainly used WordNet and Wikipedia as sources of terminology. The extracted terms were manually analyzed, disambiguated and arranged into facet hierarchies. The level of maturity of the Time domain can be considered much lower than that of Space. However, an initial number of facets of entity classes, relations, attributes and units of time were created. Overall, the Time domain currently consists of around 200 concepts that have been lexicalized in English and Italian. Some examples are provided in the following table:

ENTITY CLASSES:  Weekday (Monday, Tuesday, Wednesday, ...); Month (January, February, March, ...); Holiday - Christian Holiday (Advent, Christmas, Easter, ...), Jewish Holiday (Rosh Hashanah, Yom Kippur, Succoth, ...)
RELATIONS:       Time relation (Before, After, Concurrent, During)
UNITS OF TIME:   Daytime unit (Second, Minute, Hour)
ATTRIBUTES:      Start; End; Duration; Frequency (Daily, Weekly, Monthly, ...)

Table 2: Examples of Time domain concepts

1.6 Importing Entities from GeoNames

As described in the previous section, TGN and GeoNames were selected as the main sources of terminology to construct the Space domain following the faceted approach. The constructed domain was mapped and integrated with the entity repository. Given the limited coverage of TGN, geographical entities (locations) and their attributes were automatically imported only from GeoNames [9]. This was also motivated by the fact that TGN has a proprietary licence. More in detail, this has been done as follows:



1. Discovery: we selected an initial set of candidates including Wikipedia7, DBPedia8, GEMET9, the ADL gazetteer10, GeoNames and TGN.
2. Evaluation: we selected GeoNames as the best candidate, as it contains 7 million locations with high quality metadata integrated from several geographical resources.
3. Modularization: this step does not apply in this case, as GeoNames was fully used.
4. Translation: GeoNames was converted into an intermediate storage format.
5. Refinement: each entity from GeoNames is associated, by means of an instance-of relation, with a class that was previously mapped to the entity repository. We also created part-of relations between such entities, according to the information provided in GeoNames. For instance, we codify the information that Florence is an instance of city and is part of the Tuscany region in Italy (see the sketch after this list). Some minor modifications were applied concerning the UTF-8 encoding of the text.
6. Entity matching and integration: as the entity repository was initially empty, there was no need to look for duplicates. Using the mapping of the GeoNames classes with the entity repository, the entities from GeoNames were imported into the entity repository at the knowledge level. The attributes associated with each entity in GeoNames were imported as attributes and corresponding values (focusing on English and Italian names).

This process generated around 7 million locations with 70 million attributes and corresponding values. We released a significant portion of this dataset as open source under the name GeoWordNet11. Similarly, we imported around 20,000 locations of the Trentino region in Italy from a local dataset of the Autonomous Province of Trento [11]. However, in this case we also applied some heuristics to avoid duplicates.

7 http://www.wikipedia.org/
8 http://dbpedia.org/About
9 http://www.eionet.europa.eu/gemet/about
10 http://www.alexandria.ucsb.edu/gazetteer/
11 http://geowordnet.semanticmatching.org/
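The following sketch illustrates the refinement step (step 5) on a single record: a GeoNames entry becomes an entity with an instance-of statement towards a previously mapped class and a part-of statement towards its parent place. The record fields, class names and feature code shown are assumptions made for illustration and do not reflect the actual import code.

import java.util.List;

// Illustrative sketch of the refinement step of the GeoNames import:
// each record becomes an entity with an instance-of link to a mapped class
// plus part-of links to its administrative parents
// (e.g. Florence instance-of city, part-of Tuscany).
public class GeoNamesRefinementSketch {

    record GeoNamesRecord(String name, String featureCode, String parent,
                          double lat, double lon) { }

    record Statement(String subject, String predicate, String object) { }

    static List<Statement> refine(GeoNamesRecord r, String mappedClass) {
        return List.of(
                new Statement(r.name(), "instance-of", mappedClass),
                new Statement(r.name(), "part-of", r.parent()),
                new Statement(r.name(), "latitude", Double.toString(r.lat())),
                new Statement(r.name(), "longitude", Double.toString(r.lon())));
    }

    public static void main(String[] args) {
        GeoNamesRecord florence =
                new GeoNamesRecord("Florence", "PPLA", "Tuscany", 43.77, 11.25);
        refine(florence, "city").forEach(System.out::println);
    }
}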

1.7 Importing Entities from YAGO

The importing of entities from GeoNames and the PAT dataset, briefly described in the previous section, did not present strong difficulties. In fact, the entities were homogeneous (only locations) and there was a one-to-one mapping of the entities with a limited number of classes that had been preliminarily refined and mapped to the entity repository. As we wanted to enrich the entity repository with entities of different kinds, including also organizations and persons, we looked at additional sources of information. The following sections describe the steps taken in each of the importing phases:

1.7.1 Discovery

The initial candidates as entity sources included Wikipedia, DBPedia and YAGO (2009 version). However, the final preference went to YAGO for the following reasons: (a) similarly to DBPedia, YAGO provides entities of different kinds and the corresponding classes, attributes and relations, automatically extracted from Wikipedia and in a structured format; overall it contains around 2.5 million entities and 29 million facts (classes, relations and attributes) about them; (b) with respect to Wikipedia and DBPedia, the claimed accuracy of YAGO is higher, given the quality checks performed during the extraction process; (c) entity classes in YAGO are mapped to WordNet, thus making the mapping with the backbone structure of the entity repository easier.

As described by Suchanek in [15], the YAGO model is compliant with RDFS, in which entities are described in terms of facts of the kind <source, relation, target>. Overall, 95 kinds of

CUbRIK Space and Time Entity Repository

Page 11

D4.1 Version 1.0


relations are instantiated in YAGO. For instance:

Elvis_Presley    isMarriedTo    Priscilla_Presley
Elvis_Presley    bornOnDate     1935-01-08
Elvis_Presley    type           wordnet_musician_110340312
Elvis_Presley    type           wikicategory_Musicians_from_Tennessee

where, in our terms, isMarriedTo is a relation, bornOnDate is an attribute and type corresponds to instance-of, thus connecting an entity to a class. In YAGO, classes are of three different kinds:

• WordNet classes (denoted with prefix "wordnet_") correspond to nouns imported from WordNet 3.0
• Wikipedia classes (denoted with prefix "wikicategory_") are those extracted from Wikipedia categories
• YAGO classes (such as "YagoLiteral" or "YAGOActor") are those introduced by YAGO, for instance to enforce type checking on the domain and range of facts

The mapping with WordNet is computed by extracting and disambiguating the head noun of the Wikipedia category name. Such mapping is maintained through the subClassOf relation. For instance, the head of the Wikipedia category Musicians from Tennessee is musician, which is disambiguated as wordnet_musician_110340312. Therefore, the following fact is also added to YAGO:

wikicategory_Musicians_from_Tennessee  subClassOf  wordnet_musician_110340312

In the majority of the cases, when the meaning of the identified head is ambiguous, YAGO automatically assigns the first sense in WordNet. However, there are a few cases in which exceptions were made; unfortunately, no documentation is available about these exceptions. Attributes and relations are extracted from Wikipedia infoboxes using a set of heuristics. Quality control is mainly guaranteed by checking that they are consistent with the domain and range defined for them.
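The facts above are plain <source, relation, target> triples. The sketch below parses a few such triples from a tab-separated representation and separates type facts (instance-of) from attribute/relation facts; the input lines are illustrative and the exact YAGO 2009 file layout is not reproduced here.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: reading YAGO-style <source, relation, target> facts
// from tab-separated lines and separating instance-of ("type") facts from
// attribute/relation facts, as done conceptually during the import.
public class YagoFactSketch {

    record Fact(String source, String relation, String target) { }

    static Fact parse(String line) {
        String[] parts = line.split("\t");
        return new Fact(parts[0], parts[1], parts[2]);
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Elvis_Presley\tisMarriedTo\tPriscilla_Presley",
                "Elvis_Presley\tbornOnDate\t1935-01-08",
                "Elvis_Presley\ttype\twordnet_musician_110340312");
        List<Fact> typeFacts = new ArrayList<>();
        List<Fact> otherFacts = new ArrayList<>();
        for (String line : lines) {
            Fact f = parse(line);
            (f.relation().equals("type") ? typeFacts : otherFacts).add(f);
        }
        System.out.println("instance-of facts: " + typeFacts);
        System.out.println("attributes/relations: " + otherFacts);
    }
}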

1.7.2 Evaluation

There is in general a lack of methodologies providing a comprehensive and global approach to ontology evaluation [20]. The OntoClean methodology [18], based on philosophical principles, provides a set of meta-properties that, among other things, impose a set of constraints on the taxonomic structure of ontologies and turn out to be very useful in evaluating and improving them [19]. OntoMetric [21] provides guidelines to help choosing the most appropriate ontology for a certain task among a set of candidates. EvaLexon [22] focuses only on syntactic aspects, i.e. on measuring the appropriateness of the terms used in the ontology. According to Gomez-Perez [16][17], the goal of the evaluation process is to determine what the ontology defines correctly, does not define, or even defines incorrectly. This should be done in two steps: verification and validation. The purpose of verification is to check the syntactic correctness, i.e. that the ontology complies with the syntax of the representation language used. The purpose of validation is mainly to check its consistency, completeness and conciseness. An ontology is consistent when contradictory conclusions cannot be obtained, for instance because of cycles or partition errors, i.e. when by mistake we have an entity that is in the extension of two disjoint classes. An ontology is complete if it fully captures what it is supposed to represent of the real world. For instance, it may be judged as incomplete if it lacks important classes or properties; in the presence of partition omissions, i.e. the omission of an explicit definition of disjoint classes; or in case of missing or partial definitions of the domain and range of the properties, or of the definitions and documentation of the terms used. An ontology is concise if it
does not contain redundancies, such as duplicated definitions or axioms, which are not functional to the usefulness of the ontology. Specifically for RDF, the ODEval tool [23] allows checking for circularity and redundancy problems. Tao et al. [24] present the TW OIE tool, which is not only able to check for standard syntax errors and logical inconsistencies, but also for type mismatches in property ranges, logical redundancy and violations of number restrictions. In comparing it with other state of the art tools, they underline how the latter functionalities are typically not offered by them.

Structural problems

Despite the claimed quality, we found out that YAGO has a series of structural defects:

• The provided transitive closure is faulty, as it lacks a significant number of the subClassOf relations. We found that the skipped relations have target or source in thing, entity, causal agent, matter, psychological feature, attribute, relation, communication, measure, ability, cognition, creativity, imagination, whole, organism, orientation, attitude, social group, abstraction, unit, content, quality. This obviously creates problems when traversing the YAGO hierarchy and reasoning with it.
• YAGO does not include all the noun synsets of WordNet. In fact, while WordNet 3.0 contains 82,115 noun synsets, YAGO only contains 66,117 WordNet classes. No rationale is provided in the documentation, so it is not clear whether this has been done on purpose, for instance because no instances for them were found in Wikipedia, or whether it is a mistake.
• There are 113,228 classes with prefix "wikicategory_wordnet". This is not in line with the way in which classes in YAGO are defined (see the previous section). As they are, they are neither Wikipedia nor WordNet classes. Therefore, we refer to them as malformed.
• 19,191 Wikipedia classes (15,480 if we exclude those with prefix "wikicategory_wordnet") are direct subclasses of entity. From the manual analysis of a randomly selected part of them (see [32]) we found out that (a) 18% of them are actually proper nouns, thus they correspond to individuals rather than classes and as such should not be linked to WordNet classes; (b) 56% of them could have been mapped to a more specific concept in WordNet; (c) 26% of them would require the creation of a new concept, as WordNet lacks a suitable concept for them.
• 108,113 entities do not have Wikipedia classes; as Wikipedia classes are the means by which entities are imported into YAGO from Wikipedia through the mapping to WordNet classes, it is not clear how these entities were actually mapped to WordNet classes.

Incompleteness and inconsistency

Gomez-Perez [16] specifies that an ontology is incomplete when it lacks information concerning the world it is supposed to describe - for instance when important classes are missing or when the domain and range of the relations are not precisely delimited - and that it is inconsistent in the presence of cycles or partition errors, i.e. when the disjointness of classes is not properly defined. We consider YAGO as incomplete mainly because:

• Locations lack latitude and longitude coordinates, which are very important for their characterization.
• No semantics of the relation names is given, thus they turn out to be ambiguous. For instance, the YAGO relation hasHeight can be interpreted as height (in the sense of stature) in the case of persons, altitude in the case of locations and tallness for buildings. These correspond to three different senses in WordNet and the entity repository.
• In some cases the domain and range of the defined relations look too broad. For instance, the relation isAffiliatedTo is defined between any two generic entities, while we believe the domain should include only persons and organizations and the range should only include organizations.



• Wikipedia classes lack an explicit definition and therefore turn out to be ambiguous. Consider for instance the class Cemeteries in Canada. Is Canada the location? Which Canada? There are at least 70 locations of different types (e.g. countries, villages, hills, lakes and so on) with this name in the world12.

We consider YAGO as inconsistent mainly because:

• As no explicit disjointness of the classes is provided, YAGO ends up containing persons which are, for instance, also fishes, dogs or hills, organizations which are also events, and so on. For instance, the city Rome (the capital of Italy) is also defined as a person. By adding such disjointness constraints, YAGO becomes clearly inconsistent.

Accuracy of the linking of Wikipedia classes to WordNet classes

As classes are the main means by which we categorize entities, we pay particular attention to their disambiguation as performed by YAGO. This corresponds to the way in which the linking between Wikipedia categories and WordNet nouns has been carried out. The claimed accuracy in this case is 95%. We took 500 random Wikipedia categories from the whole of YAGO; 66 of them had a totally wrong disambiguation. For instance, the class arena extracted from Indoor arenas in Lithuania is wrongly mapped to the first WordNet sense:

• sphere, domain, area, orbit, field, arena (a particular environment or walk of life)

while we believe that the correct one is the third sense:

• stadium, bowl, arena, sports stadium (a large structure for open-air sports or entertainments)

So, according to our evaluation, the accuracy of the classes is rather (500 - 66) / 500 = 86.8%. However, there are some cases in which, despite the proximity of the right sense, a more general or a more specific sense would be more appropriate. For instance, YAGO selects the word station from Coal-fired power stations in Uzbekistan, thus linking it to:

• station (a facility equipped with special equipment and personnel for a particular purpose)

while we believe that the correct class should be power station, corresponding to the sense:

• power station, power plant, powerhouse (an electrical generating station)

If we account also for these cases, the accuracy drops to 82% (410 correct categories out of 500). However, this problem is more a weakness of the extraction phase than of the disambiguation. We also found 4 mistakes due to the lack of senses in WordNet. For instance, Eredivisie derbies was mapped to the only sense available in WordNet for derby:

• bowler hat, bowler, derby hat, derby, plug hat (a felt hat that is round and hard with a narrow brim)

while we believe that it refers rather to the notion of football derby (i.e. a particularly important football match). These cases were not counted as mistakes in the evaluation above.

12 See for instance: http://www.geonames.org/search.html?q=canada&startRow=0

1.7.3 Modularization

We understand modularization as the process of identifying self-contained portions of a dataset, which can be an ontology, a database or any other structured data source. D'Aquin et al. [28] define ontology modularization as the task of partitioning a large ontology into smaller parts, each of them covering a particular sub-vocabulary (typically in terms of classes, properties or individuals). They also describe and evaluate, in terms of the quality of the result (size, redundancy, connectedness and distance between axioms and terms) and of the tool (assumptions, level of interaction and run-time performance), some prominent tools developed for this task. Doran et al. [30] define an ontology module as a self-contained subset of the parent ontology where all concepts in the module are defined in terms of other concepts in the module, and do not refer to any concept outside the module. They reduce the problem of module extraction to the traversal of a graph from a given starting vertex, which ensures that the module is transitively closed with respect to the traversed relations. Cuenca Grau et al. [29] specify that the partitioning should preserve the semantics of the terms used, i.e. the inferences that can be made with the terms within the partition must be the same as if the whole ontology had been used.

In our work, locations, organizations and persons were selected from YAGO by taking all the entities whose class (identified through the type relation) is equivalent to, or more specific than (as identified through the subClassOf relation), one of those in Table 3. In particular, we consider facilities, buildings, bodies of water, geological formations and dry lands as more specific kinds of locations. The table also shows the number of entities and Wikipedia classes found in each sub-tree.

Class                                      Entities    Wikipedia classes
wordnet_location_100027167                 412,839     16,968
wordnet_person_100007846                   771,852     67,419
wordnet_organization_108008335             213,952     19,851
wordnet_facility_103315023                 83,184      8,790
wordnet_building_102913152                 49,409      6,892
wordnet_body_of_water_109225146            36,347      1,820
wordnet_geological_formation_109287968     19,650      1,978
wordnet_land_109334396                     9,621       805

Table 3: Number of entities and Wikipedia classes for the classes under consideration

Overall, we identified 1,568,080 entities. As entities might belong to more than one class (both Wikipedia and WordNet classes), this number does not correspond to the sum of the entities given in Table 3.
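Conceptually, the selection described above is a reachability test over the subClassOf hierarchy: an entity is kept if one of its classes equals, or transitively specializes, one of the roots in Table 3. The sketch below shows this test on a small in-memory map; the non-root class identifiers and the single-parent chain are simplifying assumptions (the real hierarchy is a DAG).

import java.util.Map;
import java.util.Set;

// Illustrative sketch of the modularization step: keep an entity if one of
// its classes is equal to, or a (transitive) subclass of, one of the selected
// root classes of Table 3.
public class ModularizationSketch {

    // child class -> parent class (subClassOf); identifiers are illustrative.
    static final Map<String, String> SUB_CLASS_OF = Map.of(
            "wordnet_city_example", "wordnet_location_100027167",
            "wordnet_lake_example", "wordnet_body_of_water_109225146");

    static boolean reaches(String cls, Set<String> roots) {
        for (String c = cls; c != null; c = SUB_CLASS_OF.get(c)) {
            if (roots.contains(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> roots = Set.of(
                "wordnet_location_100027167", "wordnet_body_of_water_109225146");
        System.out.println(reaches("wordnet_city_example", roots));       // true
        System.out.println(reaches("wordnet_person_100007846", roots));   // false
    }
}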

1.7.4 Translation

Data was translated into an internal database format on top of which the other phases are performed. We call this database the intermediate schema. YAGO entities, their classes, relations and attributes were imported into the intermediate schema. Alternative names in English and Italian were also imported. The table below provides the statistics about the amount of objects created.

Kind of object                Amount
Classes                       3,966
Entities                      1,568,081
instance-of relations         3,453,952
Attributes/Relations          3,229,320
Alternative English names     3,609,373
Alternative Italian names     220,151

Table 4: Kind and amount of objects in the intermediate schema



The classes above do not include Wikipedia classes. As we wanted to map entities to the backbone structure of the entity repository, which is aligned with WordNet (though in our case with version 2.1), we only imported WordNet classes. However, as we were not satisfied with the accuracy of the mapping between Wikipedia and WordNet classes as computed by YAGO, we recomputed it starting from the Wikipedia classes. This was done through the use of a POS tagger (developed and trained with the work presented in [25]) and a grammar generated to work on annotated Wikipedia classes. The grammar turns out to be able to process roughly from 96.1% to 98.7% of the Wikipedia classes, according to the sub-tree in which they are rooted, where the roots are the WordNet classes listed in Table 3. For the uncovered cases, we reused the YAGO extraction. The final grammar we generated, able to recognize class names and entity names from Wikipedia classes, is as follows:

Wikipedia-Class ::= classes IN [DT] [pre-ctx] entity {post-ctx}* | classes
classes         ::= class [, class] [CC class]
class           ::= {modifier}* classname
classname       ::= {NNS}+ | NNS IN {JJ}* NN [^NNP]
modifier        ::= JJ | NN | NNP | CD | VBN
entity          ::= {NNP}+ | CD {NNP}*
pre-ctx         ::= ctxclass IN
post-ctx        ::= VBN IN {CD | DT | JJ | NNS | NN | NNP}* | CD | , entity | ctxclass | (ctxclass) | (entity [ctxclass])
ctxclass        ::= {NN}+

Note that [^NNP] means "not followed by NNP". In case multiple classes can be extracted from a Wikipedia class, the modifiers of the first class are considered to apply to all the classes. For instance, Ski areas and resorts in Kyrgyzstan means Ski areas and ski resorts in Kyrgyzstan. Some modifiers can explicitly - with NNP - or implicitly - with JJ - denote a named entity. Examples of the first kind are Hawaii countries and New Mexico countries, while an example of the second kind is Russian subdivisions. Nevertheless, entities used as modifiers may denote any kind of entity and not only locations, e.g. Star Trek locations. As a final note, the less frequent POS tags (e.g. NNPS, VBG, POS, TO) were not included in the grammar. For example, from the Wikipedia class City, towns and villages in Ca Mau Province we extract three classes: city, town and village (while YAGO extracts city only); from Low-power FM radio stations we extract radio station (while YAGO extracts station only). The disambiguation of these classes is done as part of the refinement and is therefore described in the next section.
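To give a feel for the kind of processing the grammar performs, the sketch below applies a drastically simplified version of the classname rule to a POS-tagged category, keeping the plural nouns (NNS) that precede the preposition (IN). It is not an implementation of the grammar above, and the tagged input is invented for illustration.

import java.util.ArrayList;
import java.util.List;

// Illustrative, heavily simplified version of the class-name extraction:
// given a POS-tagged Wikipedia category, keep the plural nouns (NNS) that
// occur before the preposition (IN). The real grammar shown above also
// handles modifiers, context and entity names.
public class CategoryHeadSketch {

    static List<String> classNames(String taggedCategory) {
        List<String> heads = new ArrayList<>();
        for (String token : taggedCategory.split("\\s+")) {
            String[] wordAndTag = token.split("/");
            if (wordAndTag.length < 2) {
                continue;
            }
            if (wordAndTag[1].equals("IN")) {   // stop at the preposition
                break;
            }
            if (wordAndTag[1].equals("NNS")) {  // plural noun => candidate class name
                heads.add(wordAndTag[0]);
            }
        }
        return heads;
    }

    public static void main(String[] args) {
        System.out.println(classNames(
                "Cities/NNS ,/, towns/NNS and/CC villages/NNS in/IN Ca/NNP Mau/NNP Province/NNP"));
        // prints [Cities, towns, villages]
    }
}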

1.7.5 Refinement

The purpose of the refinement is to fix the problems described in Section 1.7.2, with particular emphasis on the incompleteness and inconsistency problems. Most of them have been handled, with a few left as future work.

Fixing the transitive closure

To overcome the faulty transitive closure, a dedicated piece of software was developed to detect the missing axioms (by comparing the YAGO hierarchy with WordNet). The other structural problems were sidestepped by skipping the malformed classes during the importing.

Enriching entities with latitude and longitude coordinates

As YAGO lacks latitude and longitude coordinates, which we believe are very important for locations, they were extracted directly from Wikipedia. The Wikipedia version closest to the one used by YAGO 2009 was selected. The heuristics we used allowed us to extract 440,687 latitude/longitude pairs, which is comparable with what is currently extracted by DBPedia on a much more recent Wikipedia version. The extracted latitude-longitude coordinates were stored in the intermediate schema, converting them when necessary, as latitude and longitude attributes in the WGS84 decimal format (the same used by GeoNames).

Assigning a type to each entity

As discussed in the evaluation section, since no explicit disjointness of the classes is provided, YAGO ends up containing trivial implicit inconsistencies, such as the city Rome which is also a person. To detect these problems the following constraints were defined:

All entities can have the following attributes: {hasWebsite, hasMotto, hasWonPrice, hasPredecessor, hasSuccessor}

Persons can have the following attributes: {hasHeight, hasWeight, bornOnDate, diedOnDate, bornIn, diedIn, originatesFrom, livesIn, isCitizenOf, graduatedFrom, hasChild, isMarriedTo, interestedIn, worksAt, affiliatedTo, influences, isNumber, actedIn, produced, created, musical_role, directed, discovered, wrote, hasAcademicAdvisor, madeCoverOf, politicianOf, participatedIn}

Organizations can have the following attributes: {hasNumberOfPeople, isAffiliatedTo, hasBudget, hasRevenue, hasProduct, establishedOnDate, createdOnDate, isLeaderOf, influences, dealsWith, participatedIn, produced, created, musicalRole, isOfGenre}

• Locations (including facilities, buildings, bodies of water, geological formations and dry lands) can have the following attributes: {latitude, longitude, hasHeight, hasUTCOffset, establishedOnDate, hasArea, inTimeZone}

• Geo-political entities (GPE), e.g. countries and cities, are more specific locations that can have the following additional attributes: {hasPopulation, hasPopulationDensity, hasCurrency, hasOfficialLanguage, hasCallingCode, hasWaterPart, hasInflation, hasEconomicGrowth, hasGini, hasPoverty, hasCapital, hasGDPPPP, hasTLD, hasHDI, hasNominalGDP, hasUnemployment, isLeaderOf, has_labour, dealsWith, imports, exports, has_imports, has_exports, has_expenses, participatedIn}

• Facilities and buildings are more specific locations that can have the following additional attributes: {createdOnDate, hasNumberOfPeople}

• Locations, persons and organizations are disjoint.

For each entity, by taking into account the associated attributes and all the possible senses of the associated classes, type X is assigned by checking that:
1. ALL the classes associated to the entity have a sense which is more specific or more general than X;
2. ALL the attributes of the entity are characteristic of the type X;
3. The type X is the only one exhibiting properties 1 and 2.

As reported in Table 5 below, in this way we managed to unambiguously assign a type to 1,389,505 entities, corresponding to around 89% of the entities in the intermediate schema (a minimal sketch of this check is given after the table).

Type                       Amount
Person                     719,551
Organization               154,153
Location                   284,267
Geological formation       14,426
Body of water              34,958
Geo-political entity       100,910
Building and facilities    81,240
Total:                     1,389,505

Table 5: Type assignment to entities
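The following minimal sketch illustrates the type-assignment check just described (conditions 1-3); the interfaces and names are illustrative assumptions rather than the actual repository code, and each type's admitted attribute set is assumed to also contain the attributes allowed for all entities.

import java.util.*;

// Sketch of the type-assignment check: a candidate type X is accepted only if
// (1) every class of the entity has a sense comparable with X in the is-a
// hierarchy, (2) every attribute of the entity is admitted for X, and
// (3) X is the only type satisfying (1) and (2).
public class TypeAssigner {

    public interface Taxonomy {
        // true if the sense is an ancestor or descendant of the type in the is-a hierarchy
        boolean comparable(String sense, String type);
        Set<String> sensesOf(String entityClass);
    }

    public static Optional<String> assignType(Set<String> classes,
                                              Set<String> attributes,
                                              Map<String, Set<String>> admittedAttributes,
                                              Taxonomy taxonomy) {
        List<String> candidates = new ArrayList<>();
        for (String type : admittedAttributes.keySet()) {
            boolean classesOk = classes.stream().allMatch(
                    c -> taxonomy.sensesOf(c).stream().anyMatch(s -> taxonomy.comparable(s, type)));
            boolean attributesOk = admittedAttributes.get(type).containsAll(attributes);
            if (classesOk && attributesOk) candidates.add(type);
        }
        // unambiguous assignment only if exactly one candidate type survives
        return candidates.size() == 1 ? Optional.of(candidates.get(0)) : Optional.empty();
    }
}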



20,135 entities were categorized as ambiguous, i.e. more than one type is consistent with the classes and the attributes of the entity. 158,441 entities were not categorized because of lacking or conflicting information. As shown in Section 1.7.7, these entities turn out to be very noisy.

Refining attribute names
The attributes imported from YAGO were renamed according to the type assigned to each single entity in the previous step. This has been done in order to map them to the entity repository schema. In general, each YAGO attribute/relation name corresponds to exactly one attribute/relation name in the entity repository, with some exceptions. For instance, while the YAGO attribute hasWebsite is always mapped to the entity repository attribute webpage, the YAGO attribute hasHeight is mapped to height (in the sense of stature) in the case of persons, altitude in the case of generic locations and tallness for buildings. In the entity repository, we further characterize each attribute by providing a corresponding definition:

• webpage: a document connected to the World Wide Web and viewable by anyone connected to the internet who has a web browser

• height: (of a standing person) the distance from head to foot

• altitude: elevation especially above sea level or above the earth's surface

• tallness: the vertical dimension of extension; distance from the base of something to the top

Particular attention has been paid to the attributes/relations codifying Space and Time information. Concerning Space, the YAGO relation locatedIn has been mapped to the entity repository part-of relation. Examples of Time attributes include bornOnDate (date of birth in the entity repository, which specializes lifespan.start) and diedOnDate (date of death, which specializes lifespan.end) for persons, establishedOnDate (date of establishment, which specializes lifespan.start) for organizations and locations, and createdOnDate (date of creation, which specializes lifespan.start) for buildings and facilities. We did not import any Time meta-attribute from YAGO; this is actually one of the recent enhancements introduced with YAGO2 [33].

Refining attribute values
The YAGO relations were kept only if both the source and the target entities were assigned a type and they satisfied the domain and range constraints listed in Table 6 below.

YAGO Relation     Source entity type(s)        Target entity type(s)
isLeaderOf        Person, Organization, GPE    Organization, GPE
influences        Person, Organization         Person
isAffiliatedTo    Person, Organization         Organization
locatedIn         Location                     Location, GPE
hasCapital        Location                     GPE
bornIn            Person                       Location, GPE
originatesFrom    Person                       Location, GPE
livesIn           Person                       Location, GPE
diedIn            Person                       Location, GPE
isCitizenOf       Person                       GPE
graduatedFrom     Person                       Organization
hasChild          Person                       Person
isMarriedTo       Person                       Person
worksAt           Person                       Organization
created           Person, Organization         Facility, Organization

Table 6: Domain and range constraints on relations

These constraints pertain to relations only and were applied only to those relations whose domain or range is a type included in our analysis (location, person, organization). For instance, we did not enforce any constraint on the relation directed, as its domain should include, for instance, the type movie. We could have defined additional constraints for the attributes as well; we left this as future work.

Class disambiguation
We disambiguated the classes of the entities according to the type X associated with them. In particular, we always assigned to them the WordNet sense more specific than the type X. In case more than one sense with this property was available, we assigned the sense with the highest rank among them. For instance, the class bank is automatically disambiguated in one of the following three ways according to the type X:
a) If X is building or facility: bank, bank building (a building in which the business of banking is transacted)
b) If X is organization: depository financial institution, bank, banking concern, banking company (a financial institution that accepts deposits and channels the money into lending activities)
c) If X is location or geological formation: bank (sloping land (especially the slope beside a body of water))

Using the classes we also computed the least common subsumer (LCS) [26], representing the concept that describes the largest set of commonalities between the classes. As the entity repository orders classes by preference, we set the LCS as the preferred class. We do so by (a) first discarding the classes with senses that are not more specific than X and (b) selecting the LCS of the senses of the remaining classes. Note that in the case of acyclic terminologies the LCS can be efficiently computed and corresponds to the top concept in the worst case [27]. However, the LCS does not necessarily exist. In some cases there is no common ancestor at all (think of two isolated concepts or two concepts in two completely separate terminologies). In some other cases, there might be many common ancestors (typical of DAGs) but none of them with the property of being the most specific. In the former case we assign as a class the concept of the type X itself. In the latter case we can recursively compute the LCS on these common ancestors until either we find their LCS or we fall again in the latter case. However, condition (a) excludes the former case and ensures that, after a finite number of iterations, the LCS will be the concept X itself in the worst case. For performance reasons, though, we do not recursively compute the LCS but directly assign the concept X in case the LCS does not exist. Note that the LCS may not correspond to any of the classes assigned to the entity. In this case a new class corresponding to the LCS was associated with the entity. For instance:

LCS (river, body of water) = body of water
LCS (river, lake) = body of water
LCS (foothill, aquifer, diapir) = geological formation

Notice that, as Wikipedia classes are ambiguous, we preferred not to import them into the entity repository. As described in the next section, we only use WordNet classes as the main means for the integration of YAGO entities into the entity repository.
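As an illustration of the LCS-based selection of the preferred class, the sketch below computes the LCS over a tree-shaped is-a hierarchy given as a child-to-parent map; the names are assumptions for the example, and the DAG and no-common-ancestor cases simply fall back to the type X, as described above.

import java.util.*;

// Minimal sketch of the LCS selection used to pick the preferred class (names
// are illustrative, not the actual repository code). It assumes a tree-shaped
// is-a hierarchy; when no common ancestor exists, the fallback is the type X.
public class LeastCommonSubsumer {

    static String lcsOrType(List<String> senses, Map<String, String> parentOf, String typeX) {
        // ancestors of the first sense, ordered from the sense itself upwards
        LinkedHashSet<String> common = new LinkedHashSet<>(ancestors(senses.get(0), parentOf));
        for (String s : senses.subList(1, senses.size()))
            common.retainAll(new HashSet<>(ancestors(s, parentOf)));
        // in a tree, the first surviving ancestor is the deepest one, i.e. the LCS
        return common.isEmpty() ? typeX : common.iterator().next();
    }

    private static List<String> ancestors(String sense, Map<String, String> parentOf) {
        List<String> chain = new ArrayList<>();
        for (String s = sense; s != null; s = parentOf.get(s)) chain.add(s);
        return chain;
    }

    public static void main(String[] args) {
        Map<String, String> parentOf = Map.of(
                "river", "body of water",
                "lake", "body of water",
                "body of water", "location");
        System.out.println(lcsOrType(List.of("river", "lake"), parentOf, "location")); // body of water
    }
}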



1.7.6 Entity Matching and Integration

The final step consisted in importing the successfully typed entities and the corresponding refined classes, relations and attributes into the entity repository. This has been done by connecting each entity to the corresponding class, using the WordNet classes that we recomputed, as described in the previous section, on the basis of the identified entity type. Classes are associated with entities using the entity repository attribute instance-of. The entity matching needed to detect possible duplicates with entities already imported from GeoNames and the PAT dataset is left for future work.

1.7.7 Evaluation of the Selected Knowledge

With the modularization, 1,568,080 entities (around 56% of YAGO) that are candidates to be a location, person or organization were extracted from YAGO and imported into our intermediate schema. With the refinement:
• CASE I: 1,389,505 entities (around 89% of the imported entities) were assigned a type X in the set {person, organization, location, geological formation, body of water, geo-political entity, building, facilities}.
• CASE II: 20,135 entities were marked as ambiguous, i.e. more than one type is consistent with the classes and the attributes of the entity.
• CASE III: 158,441 entities were not categorized because of lacking or conflicting information.

In the following, for each of the cases above, the quality of our class disambiguation was evaluated against the disambiguation of the classes done by YAGO. Notice that, as the entity classes were recomputed by extracting them from the Wikipedia classes, the two sets might differ from each other. Moreover, our classes were disambiguated only in CASE I. Finally, the accuracy of the type assignment for the first case (CASE I) was also evaluated.

Evaluation of the entities of CASE I
Using over 100 randomly selected entities, we found that our assignment of the type was 100% correct, while our disambiguation of the classes was 98% correct (5 mistakes over 250 classes). By checking the Wikipedia classes of these entities in YAGO, we found that their disambiguation is 97.2% correct (6 mistakes over 216 Wikipedia classes). The kinds of mistakes tend to be the same, for instance:
• both engines (Entitypedia and Wikipedia) map crater to volcanic crater instead of impact crater;
• both engines map manager to manager as director instead of manager as coach;
• both engines map captain to captain as military officer instead of captain as group leader.

Evaluation of the entities of CASE II
50 randomly selected entities among those that we categorized as ambiguous were used to find out that the disambiguation of the Wikipedia classes in YAGO is 72.3% correct (18 mistakes over 65 Wikipedia classes). Mistakes include for instance:
• Bank as river slope instead of organization
• Ward as person instead of administrative division
• Carrier as person instead of warship
• Division as military division instead of geographical administrative division

However, 10% of these entities (5 over 50) are neither locations, nor organizations, nor persons. They are in fact events about the participation of some country in some competition, while they are treated as countries (e.g. Norway in the Eurovision Song Contest 2005). Moreover, 4% of them (2 over 50) are not even entities: one is actually a list of categories and the other a list of countries. Overall, 14% of them are therefore wrong. In addition, among the 72.3% of the cases considered as correct there are actually 18 ambiguous cases where:
• it is not clear if the class is really meant as geographical or political division (e.g. subdivision);
• it is not clear if the class is meant as geographical or political entity (e.g. country);
• it is not clear if the class is meant as organization or building (e.g. hospital).

As a last curiosity, notice that around 42% of the correctly disambiguated classes (28 over 65) correspond to exactly the same case, i.e. range as mountain range. Therefore, if we count unique classes, the percentage of correct disambiguation decreases significantly.

Evaluation of the entities of CASE III
For this experiment we selected 50 random entities among those to which we preferred not to assign any type, either because of lack of information (e.g. one class and zero attributes) or because of the presence of conflicting information (e.g. a class denoting an event and one denoting a country, or conflicting attributes characteristic of different types). The outcome was that the disambiguation of the Wikipedia classes in YAGO is 86.14% correct (14 mistakes over 101 Wikipedia classes). Mistakes include for instance:
• Unit as unit of measurement instead of military unit
• Model as fashion model instead of mathematical model

However, 72% (!) of the candidates (36 over 50) present some form of wrong information or are not even entities. They include for instance:
• entities which are partly events and partly organizations;
• entities which are partly animals and partly persons;
• entities which are partly organizations and partly persons.

They are mostly due to the presence of conflicting classes. For instance, we have political parties that are also categorized as politics; persons that are also categorized as fishes (e.g. we found out that in YAGO there are 137 persons categorized as argentine fish instead of Argentinian, and 4,216 entities categorized as dogs while most of them are doges, i.e. the chief magistrates of the republics of Venice and Genoa; the latter might be due to a mistake in lemmatization); persons that are also categorized as families; and persons that are categorized as biographies.



1.8 Entity Repository Statistics

Overall, the entity repository (i.e., Entitypedia and its CUbRIK extension) currently contains around 8.5 million entities and more than 80 million axioms (classes, attributes and relations about them). Table 7 below shows some statistics.

Object                                 Quantity
Natural language level
  English synsets                      110,531
  English words                        134,581
  Italian synsets                      35,305
  Italian words                        41,683
Formal language level
  Classes, relations and attributes    112,934
  Entities                             ~8.5 million
Knowledge level
  is-a and part-of relations           96,454
  instance-of relations                ~8.5 million

Table 7: Statistics about the current size of the knowledge base



2. Media Entity Annotation H-Demo

The media entity annotation horizontal demo (H-Demo) is used to harvest representative images for named entities stored in the entity repository. The goal of the H-Demo is twofold: a) to enhance the entity repository so that it also includes multimedia content which can be used to visualize the named entities, e.g. in entity search results, and b) to perform content-based multimedia search in order to find entities depicted in a photo, where the entity repository can be used to provide additional information associated with these entities. As an initial input dataset for this H-Demo we used a set of famous Italian monuments with expert-generated metadata (Tables 8-10). In total, experts collected a set of ~100 monuments located in different Italian cities such as Rome and Florence. Entities related to the cities or monuments were also collected. The goal of the components and pipelets that were developed is to crawl online multimedia social networks in order to fetch multimedia content (in particular images) related to the monuments and to update the records of the entity repository with the freshly retrieved multimedia content and its metadata. The related entities, relations, and attributes are also imported. The next sections are organized as follows. Section 2.1 describes the system architecture for the media entity annotation H-Demo and the use case for harvesting representative images of named entities. Section 2.2 describes the Java client which is used by the H-Demo in order to access the entity repository. Section 2.3 provides the details on the implementation of the crawling components which are used in the H-Demo. Section 2.4 describes how content similarity can be used for media cleaning. Section 2.5 discusses the use of social features for re-ranking of related photos. Section 2.6 presents details of the SMILA pipeline implementation.

2.1 System Architecture

The lifecycle of entities in a semantic multimedia entity repository includes the following 4 main steps: (i) harvesting, (ii) cleaning, (iii) searching, and (iv) visualizing entities. These steps are shown in Figure 3.

Figure 3: Media-Entity Annotation H-Demo



Harvest. Images and other entities are imported into the system from existing data sets or by using focused web crawling. Images are further ranked and filtered using social features to obtain relevant and socially attractive images. Crowdsourcing techniques can be used to implement higher quality crawling.

Clean. The quality of imported images, entities and their metadata can further be improved by using human computation (e.g., GWAP) techniques.

Search. Entity search is used to provide access to entities and harvested images.

Show. Entity visualization is used as a module during all the steps in the lifecycle of entities in the entity repository.

In this deliverable, we concentrate on the problem of automatic harvesting of representative images for entities. Given that monument entities had been selected as the dataset, images and entities were automatically related by using (GPS) locations and entity names. As shown in [41], names and locations are the most useful attributes for finding representative images for this kind of entity. Note that other attributes can be useful in the case of different entities, e.g. organizations or people. The media harvesting use case is shown in Figure 4.

Figure 4: Use case diagram for Media Harvesting for Entities



Basic course of actions:

• The administrator provides a set of entity ids or a search query.

• The list of entities with associated metadata is extracted from the entity repository.

• Media content and metadata are automatically harvested from popular social media sharing web sites, e.g. Twitter, Flickr, Picasa and Panoramio. The first 100 images are downloaded for every entity from every resource. Crowd information is captured (author, user response, etc.) to allow research on image quality and user feedback. Images are further ranked and/or filtered with respect to their relevance to the initial query (i.e., the name of the monument), taking into account the community appeal (i.e., social features).

• The resulting media with metadata are stored in the entity repository.

2.2 Entity Repository API

Remote programmatic access to the entity repository is implemented by using the JSON Entitypedia API13. The documentation for the API can be found on the Entitypedia development wiki page14. In order to support the H-Demo, an instance of the entity repository was set up15. A high-level Java client16 was built on top of the JSON API in order to facilitate the integration of the API with SMILA and other Java programs. The client currently supports the basic CRUD (create-read-update-delete) operations on entities and their attributes. Searching entities by entity name is also supported by the Java client. The data model for entities as used for the H-Demo is presented in Figure 5. Note that the data model of the entity repository is an open-schema model, i.e., in addition to the required attributes, any number of new attributes can be added to describe an entity. Moreover, it can be extended with new entity types. For instance, in Figure 5, we show how a new type Monument (any structure erected to commemorate persons or events) was defined by extending the type Location. The entity data model uses the traditional Boolean, Integer, Long, Float, String, Date, and URL data types for qualifying entity properties. The value of each entity attribute can also be associated with provenance and confidence information. In order to support the provenance information, the standard data types were extended; for instance, the StringProvenance class extends the String class.

13 Entitypedia API: http://api.entitypedia.org/webapi/ping
14 Entitypedia API Documentation: https://dev.entitypedia.org/trac/entitypedia/wiki/WebAPI
15 http://api.entitypedia.org:8082/webapi/ping
16 https://dev.entitypedia.org/trac/entitypedia/attachment/wiki/CUbRIKAPI/cubrick-client-1.0-SNAPSHOT-javadoc.jar



Figure 5: Entity Repository Data Model



2.3 Crawling components

In the process of populating the entity repository with content, this part of the work aims to provide images relevant to the named entities. For this task a set of tools was developed or reused. The COFetch tool is an existing tool that was developed in the context of the I-Search EU project17, while the rest of the tools were developed within the CUbRIK project. The idea was to first use existing tools, or implement offline tools, to bootstrap the entity repository population and the initial experimentation, and then to implement SMILA pipelets to be integrated in a CUbRIK release in order to be available for different applications. For the bootstrapping and testing processes the COFetch and php-COFetch tools were used. In parallel to the entity repository population and the evaluation of the algorithms, Twitter, Flickr, Picasa and Panoramio SMILA pipelets were also developed. In the following subsections the available crawlers are presented, while the SMILA-specific implementation details are presented in Section 2.6.

2.3.1 COFetch tool

COFetch is a "Content Object" [38] creation tool. With COFetch the user poses text queries and the tool searches various online multimedia sharing websites to find similar multimedia content of various modalities. The tool searches Wikipedia (text), Flickr (images), YouTube (video), Freesound (audio), Google Warehouse (3D models), etc. and fetches the relevant content from each modality. The user is then able to select one or more multimedia objects from different modalities and construct a "Content Object". As a result of the process, the UI of the COFetch tool provides a link to download the RUCOD [38] file, which contains all the metadata of the constructed "Content Object" in either XML or JSON format.


17http://www.isearch-project.eu/isearch/




Figure 6: COFetch tool: a) search in the online web services; b) relevant 3D models from Google Warehouse; c) images found in Flickr; d) the extracted RUCOD files

COFetch is an easy-to-use tool, but it is only suitable when the user needs to add some specific multimedia content (not only images) to a small set of named entities, since user intervention is needed to select the appropriate multimedia content to add to the RUCOD file. This is extremely useful for a clean dataset with high-confidence entries. However, in our case we also needed large volumes of data, which would be extremely difficult, if not impossible, to collect manually. For this reason an automated tool was implemented to fetch only images and construct RUCOD files with all the available metadata along with the target content (see subsection 2.3.2).

2.3.2 PHP COFetch for Flickr Images

PHP-COFetch aims to automate the process of populating the entity repository with images and their metadata for a selected set of named entities. PHP-COFetch was developed in PHP using the phpFlickr library18 to access the Flickr web services19. This implementation is extremely lightweight and can be used through the PHP CLI (command line interface), as any other scripting language, without the need for an HTTP server. One major advantage of this approach is the ability to scale the application by simply starting multiple command-line instances on one or more PCs. Since the Flickr web services use unique API keys to authenticate each client, we registered a new authentication key for each of the instances.

18 http://phpflickr.com/
19 http://www.flickr.com/services/api/

The tool gets a set of text queries in a text file and, optionally, geo-coordinates, and iterates through them to fetch relevant images and metadata from the Flickr web services. For each retrieved image, a new RUCOD file is created to store all the available metadata, such as: time of creation, time of upload, user tags if available, geolocation, and other sensor information if available (temperature, altitude etc.). In the queries txt file, the queries are written one row per named entity. For each row in the queries file there is a corresponding row in the geo txt file, which contains the latitude and longitude of the named entity in order to make the queries to the online services more accurate. This approach saved us from ambiguities or noisy user tags that would lead to low-quality results. By using the latitude and longitude of each Italian monument that was queried, the tool constructed a bounding box into which the accepted results should fall. The Flickr web API allows search queries to be refined with geo bounding boxes, so the process was straightforward and noise-free as far as the location is concerned. However, many of the retrieved results were actually irrelevant since, e.g., the location was correct but the orientation was wrong (i.e. the camera was not looking at the monument but had it behind it). For these kinds of noisy data we had to use different multimedia-based cleaning techniques, which we analyse in Section 2.4.

In the process of bootstrapping the content population of the entity repository, a set of ~100 queries for Italian monuments, along with their geographical coordinates (longitude and latitude), was extracted from the entity repository. This set of queries was used as input to the php-COFetch tool to fetch images and metadata from Flickr. The initial goal was to retrieve images from Flickr using these text queries and their geolocation. The information given for every query was, as shown in Table 8, the location, the class, the name of the monument in English, the name of the monument in Italian, the latitude, the longitude and the corresponding Wikipedia web page. Table 9 shows the number of query monuments per location, while Table 10 shows the monuments per class.
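As a rough illustration of such a geo-bounded query, the sketch below calls the public Flickr REST endpoint (flickr.photos.search) with a bbox parameter built around the monument coordinates; the API key, the bounding-box size and the use of the raw REST API (instead of phpFlickr) are assumptions for the example.

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Rough illustration of a geo-bounded Flickr query against the REST API.
// The API key is a placeholder and the bounding-box half-side is an assumption.
public class FlickrBboxSearch {

    public static void main(String[] args) throws Exception {
        String apiKey = "YOUR_API_KEY";                    // placeholder
        String query = "Basilica di San Petronio Bologna";
        double lat = 44.492778, lon = 11.343611;           // monument coordinates
        double d = 0.01;                                   // roughly 1 km half-side

        // Flickr expects bbox as min_lon,min_lat,max_lon,max_lat
        String bbox = (lon - d) + "," + (lat - d) + "," + (lon + d) + "," + (lat + d);

        String url = "https://api.flickr.com/services/rest/?method=flickr.photos.search"
                + "&api_key=" + apiKey
                + "&text=" + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&bbox=" + bbox
                + "&per_page=100&format=json&nojsoncallback=1";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());               // JSON list of matching photos
    }
}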

Location    Class    English name             Italian name                Latitude     Longitude    Web page
Bologna     Church   San Petronio Basilica    Basilica di San Petronio    44.492778    11.343611    http://en.wikipedia.org/wiki/San_Petronio_Basilica

Table 8: Example of information given for every query.

Location             Number of monuments
Bologna              9
Caserta              1
Ferrara              1
Florence             11
Genoa                4
Lecce                10
Messina              1
Milan                5
Naples               7
Pesaro and Urbino    1
Pisa                 6
Rome                 22
Siena                4
Torino               9
Trento               4
Venice               4
Verona               6

Table 9: Number of monuments for every location.



Class           Number of monuments
amphitheater    2
ancient site    3
arcade          1
arch            5
bridge          3
campanile       1
castle          10
cemetery        1
church          26
column          2
fountain        4
lighthouse      1
monument        6
museum          2
obelisk         1
palace          20
square          11
temple          1
tower           4
villa           1

Table 10: Number of monuments per class

After some experimentation, in order to retrieve more results, the query for every monument was formed as "Italian name Location", together with the latitude and longitude information, e.g. "Basilica di San Petronio Bologna". However, in some cases this formulation did not retrieve enough results, so either the location was omitted, or the geo-location information needed corrections, or the Italian name had to be changed. For example, "Castello di Otranto Lecce" was changed to "Castello di Otranto", "Chiesa di San Giovanni Battista Lecce" was searched without geo-tagging, and "Santa Maria del Fiore/Duomo Florence" was changed to "Basilica di Santa Maria del Fiore Florence". Initially, 14,628 photos were retrieved from Flickr. Of those, only 6,430 were kept in a new database after a first pre-processing stage. During this stage only the images that clearly depict the corresponding monument were selected, and 3 of the 105 queries that did not produce any results, or produced only one result, were excluded from the database. These were "Colonna votiva di San Pietro Lecce", "Santuario di Maria Ausiliatrice Torino" and "Fontana del Gigante Naples". In some cases, there were photos that were retrieved by different queries. This occurred because one monument may be part of another one or next to another one. For example, in Figure 7, image (a) was retrieved using both the query "Piazza del Campo Siena" and the query "Palazzo Pubblico Siena", because the second one is part of the first, which is a square. Similarly, photo (b) of Figure 7 was retrieved using the queries "Arco di Settimio Severo Rome" and "Foro Romano Rome". Moreover, there are 3 queries that correspond to the same monument, namely "Le Due Torri Bologna", "Torre Asinelli Bologna" and "Torre Garisenda Bologna", which means that many of the photos they retrieve are the same. So, in a second pre-processing step these images were excluded from the database, resulting in a new one of 6,342 images.




Figure 7: Flickr results that were retrieved by different queries and thus possibly contain more than one monument.

2.3.3 Java crawler for Flickr Images

A Flickr crawler tool was also implemented in Java, focusing on the integration with SMILA and the media harvesting H-Demo. The tool retrieves Flickr photos related to the named entities using Flickr's API20 and the Java library flickrj21. Similarly to the PHP Flickr tool described in the previous section, the Java Flickr crawler also aims to populate the entity repository with images, metadata and related entities for the selected set of named entities.

Flickr stores images in different sizes and provides URLs where these can be accessed. For each photo, the URLs of the small, medium, original and large versions were retrieved. Aside from the actual physical photo, each Flickr photo has a set of associated metadata. The basic metadata for a photo consists of: Flickr internal id, title, URL, owner, description, date taken, date added, date uploaded, date posted and license. Additional metadata includes information about tags, favourites, comments, URLs, contexts and notes.

Tags are keyword or category labels. Tags help Flickr users find photos which have something in common. A user can assign up to 75 tags to each photo. The tag information retrieved for a photo consists of a list containing the tag (id, raw and processed form) and the user (id, name) that produced it. A photo can be declared by one or more users as a favourite. For each photo, when available, the list of users (id, name) that declared it as a favourite was retrieved, together with the information about the time when this action occurred. Users can also comment on photos. For the photos that elicited such comments, the list of users (id, name), the creation date, a link to the comment, and the text of the comment itself were retrieved.

20 http://www.flickr.com/services/api/
21 http://flickrj.sourceforge.net/



A photo can appear in one or more contexts. A context is either a group photo pool or a photoset. Users can organize themselves in groups and share photos in a common group pool; for example, the group "CeBit 2012" would have a group pool where the members share photos taken at the respective event. Users can organize their own photos in photosets containing photos that share common characteristics, for example the "Vacation 2012 South Africa" photo set. For the photos that appear in a certain context, the type of context and its id and title were retrieved. Users are also offered the possibility to create a bounding box on a photo and attach a note to it, enabling them to label or comment on specific regions of a photo, for example to tag persons in a group photo. For the photos that have notes attached, the id, the author (id, name), the coordinates of the bounding box and the text were retrieved.

The crawler can work as an independent tool. In this case, the tool gets a set of queries from a text file and iterates through them to fetch relevant images and metadata from the Flickr web services. Each row in the text file corresponds to a named entity. Two such files were used, one containing the named entities in English and one in Italian. For each entity, Flickr was queried for the relevant images, which were retrieved along with all the associated metadata and ranked according to Flickr's relevance. The gathered information is stored in an Oracle database. A few statistics on the crawled data are presented in Table 11.

Number of entity queries          169
Number of retrieved photos        208,193
Photos with Geo Information       71,126
Number of photo, tag pairs        2,188,406
Number of photo, note pairs       12,506
Number of photo, comment pairs    673,436
Number of photo, context pairs    740,666
Number of favorite assignments    194,727
Size of all retrieved images      64 GB

Table 11: Data crawled from Flickr for the Italian monuments dataset

The tool can also directly upload the retrieved information to the entity repository, or be part of a SMILA pipeline that populates the entity repository with named-entity-related images and associated metadata. Furthermore, a tool was developed that uploads the information already stored in the database to the entity repository.

2.3.4 Java crawler for Panoramio Images

A Panoramio crawler was implemented in Java using the Panoramio API22. The GPS coordinates of the monument were used as a query, and images within a distance smaller than a predefined range were crawled through the API. Images and their owners, together with the associated metadata, are represented as entities in the entity repository and linked back to the monument entity. A sketch of such a distance filter is given below.

22http://www.panoramio.com/api/data/api.html
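A minimal sketch of the distance filter described above follows; the haversine formula and the 1 km threshold are assumptions for illustration, not the actual crawler code.

// Minimal sketch of the distance filter: keep only photos whose GPS position
// lies within a predefined range of the monument.
public class GeoFilter {

    private static final double EARTH_RADIUS_KM = 6371.0;

    // great-circle distance between two lat/lon points (haversine formula)
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    static boolean withinRange(double monumentLat, double monumentLon,
                               double photoLat, double photoLon, double rangeKm) {
        return distanceKm(monumentLat, monumentLon, photoLat, photoLon) <= rangeKm;
    }

    public static void main(String[] args) {
        // a photo taken a few hundred metres from San Petronio Basilica is accepted with a 1 km range
        System.out.println(withinRange(44.492778, 11.343611, 44.4950, 11.3420, 1.0));
    }
}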



2.3.5 Java crawler for Picasa Images

A Picasa crawler was implemented in Java using the Picasa Web Albums Data API23. In the case of Picasa, images are returned according to the relevance (similarity) of the surrounding text with respect to the input query. Images, together with their creators, are represented as entities and interlinked with the corresponding monument entity in the entity repository.

2.3.6 YouTube Crawler for Video Metadata

In order to be able to populate the entity repository also with videos and their metadata, for the selected set of named entities as well as for any query submitted to the SMILA pipelet, a YouTube crawler was also implemented. The YouTube crawler was developed in Java, making use of the Google Data API24 for YouTube and the provided Java client library. The tool works similarly to the Flickr crawler and has the same functionalities: it reads a set of text queries from a text file and iterates through them to fetch relevant data from the Google Data services and store it in a database or in the entity repository, either standalone or as part of a SMILA pipeline. For each entity or search query, the API was called and the first 300 relevant videos were retrieved. For each video the following associated metadata was retrieved: id, title, duration, uploader, link date, tags, favourite count, view count, number of likes, number of dislikes, and number of comments. In addition to this basic metadata, the data on the first 1000 comments associated with the video were retrieved: the id, the rank, the text, the author, the total rating, and the posting date. The video itself was not retrieved; it is available through a link to the YouTube website. Note that the limitations on the number of videos and comments are imposed by the Google Data API. Some statistics about the data gathered for the Italian monuments dataset are presented in Table 12.

Number of distinct entity queries    166
Number of videos                     31,067
Number of video, comment pairs       194,308

Table 12: Data crawled from YouTube for the Italian monuments dataset

23 https://developers.google.com/picasa-web/docs/2.0/developers_guide_java
24 https://developers.google.com/gdata/

2.3.7 Twitter multimedia crawlers

The entity repository can be used in various applications through the CUbRIK platform, and thus the enhancement with new content, and the type of this content, may vary. Since users of the social web share extremely large volumes of multimedia through different sharing websites such as Flickr, Photobucket, Instagram and Facebook, to name a few, we can use these data streams to update and enhance the multimedia content of the entity repository. Twitter is one of the most active sharing/distribution channels, where users share not only their status but also images stored in different online storage services. Twitter provides a set of APIs to access its databases in different ways:

Twitter Search API
The Twitter Search API can be used to search for past tweets. It is based on the REST protocol and it can return tweets that were posted between the time of the request and up to 6-9 days before. Each request returns up to 100 tweets per page and a maximum of 15 pages, giving a theoretical maximum of 1,500 tweets for a single query. This API has a rate limit, depending on the complexity and frequency of requests. Unfortunately, Twitter does not make its rate limits and thresholds publicly available, in order to avoid misuse. The query may be a plain text query such as "Mykonos 2012" or a more complex one. Complex search queries have several operators to define geo-location, distance, number of results per page, as well as the type of results, such as popular, real time or recent25.

Twitter Streaming API
The Streaming API26, on the other hand, can be used when the user wants to retrieve real-time data starting from the time the server connection is established. It uses a continuous connection to the Twitter servers, which is maintained for as long as the client needs to collect tweets, possibly for days or even months. This approach allows multiple queries to be set, and the service returns results whenever one of them is matched. The implementation is an event-based loop which is activated every time a new event, i.e. a result matching one of the queries, is triggered.

Based on these two APIs, two types of crawlers for retrieving data from Twitter have been implemented. The first one uses the Twitter Search API and the second one uses the Twitter Streaming API. Both of them have been developed in Java, using the Twitter4j27 library to access the Twitter API web services. They take as input a text query and retrieve tweets that contain it. The retrieved tweets are processed to find whether they contain any link and, if so, to check whether this link leads to an image. Both crawlers are connected to a MySQL database, where they store all these data. The schema of the MySQL database is depicted in Figure 8, while the overall process for both crawlers is presented in Figure 9. Please note that this MySQL database is independent of the entity repository and is used as a media storage database. The crawlers are also responsible for updating the entities in the entity repository, as presented in the following SMILA sections.
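As a rough illustration (not the project's actual crawler code), a Search-API query issued with Twitter4j could look like the sketch below; the OAuth credentials are assumed to be configured in twitter4j.properties, and the link extraction and MySQL storage steps are omitted.

import java.util.List;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

// Sketch of a Search-API-based crawl with Twitter4j: submit a text query and
// iterate over the returned tweets (link extraction and storage are omitted).
public class TwitterSearchCrawlerSketch {
    public static void main(String[] args) throws Exception {
        Twitter twitter = new TwitterFactory().getInstance();   // credentials from twitter4j.properties
        Query query = new Query("Basilica di San Petronio");    // plain text query
        QueryResult result = twitter.search(query);
        List<Status> tweets = result.getTweets();
        for (Status tweet : tweets) {
            // each tweet would be checked for links leading to images (see the URL
            // expansion step described in the text) and stored in the MySQL database
            System.out.println(tweet.getId() + " @" + tweet.getUser().getScreenName()
                    + ": " + tweet.getText());
        }
    }
}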

Figure 8: Schema of the twitterCrawler MySQL database

25 https://dev.twitter.com/docs/api/1/get/search
26 https://dev.twitter.com/docs/streaming-apis
27 http://twitter4j.org/en/index.html



Figure 9: Searching Twitter with a query in order to retrieve the images posted through URLs in the relevant tweets

When a relevant tweet is retrieved, it is stored in the MySQL database accompanied by its ID, the date it was posted, the ID of the user who posted the tweet, a flag that shows whether it contains a link or not, and the query it refers to. Twitter has a limit of 140 characters for each tweet (post). Due to this constraint, Twitter clients use shortening services that produce a shortened URL in order to save characters for writing more text. This trick creates the need for each client that uses a URL to expand it and find the original one. In case a tweet contains a link, this is processed to find the expanded URL. Since there are many URL shortening services, and not all of them provide an API to reverse the shortening process, the URL expander algorithm does exactly what a browser does: it follows each URL and, by reading the HTTP headers, finds the redirects of the shortened URLs to the final destination URL. The final URL is checked against a list of given photo sharing / storage sites (Flickr, Photobucket, Instagram, etc.) to find whether it leads to an image or not. In the meantime, the initial URL is stored in the DB along with the corresponding expanded one, the tweet ID and a flag that shows whether it leads to a photo site or not. In the case of a photo site, the URL of the uploaded image is obtained through an HTML parser and its content is downloaded with a unique file name to a specified folder. This file name is also stored in the DB with the corresponding tweet ID and the image URL. The downloading of the original image is optional, since we have the actual URL of the image; it is only done in order to use the image and extract its multimedia content descriptors for the multimedia indexing, search and retrieval process. The image file may then be deleted from the local file system.
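The following sketch illustrates the URL expansion step just described, following HTTP redirects until the final destination URL is reached and checking it against a list of photo-sharing hosts; the host list and the redirect limit are assumptions for illustration.

import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the URL expansion step: follow the HTTP redirects of a shortened
// URL until the final destination is reached, then check it against a list of
// known photo-sharing hosts (both the list and the hop limit are assumptions).
public class UrlExpander {

    static final String[] PHOTO_HOSTS = {"flickr.com", "photobucket.com", "instagram.com"};

    static String expand(String shortUrl) throws Exception {
        String current = shortUrl;
        for (int hops = 0; hops < 10; hops++) {                  // avoid redirect loops
            HttpURLConnection con = (HttpURLConnection) new URL(current).openConnection();
            con.setInstanceFollowRedirects(false);               // read Location headers ourselves
            con.setRequestMethod("HEAD");
            int code = con.getResponseCode();
            String location = con.getHeaderField("Location");
            con.disconnect();
            if (code < 300 || code >= 400 || location == null) return current;
            current = location.startsWith("http") ? location
                    : new URL(new URL(current), location).toString();   // handle relative redirects
        }
        return current;
    }

    static boolean leadsToPhotoSite(String expandedUrl) {
        for (String host : PHOTO_HOSTS)
            if (expandedUrl.contains(host)) return true;
        return false;
    }
}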



2.4 Media Cleaning based on Content Similarity

As stated above, after harvesting the images from the various sources, a cleaning process should take place. This cleaning process aims to filter out low-quality images that do not depict the actual monument, or that contain only a very small part of the monument and are thus not very informative. Moreover, there are some images that contain more than one of the selected Italian monuments. The last problem is one of the most difficult to solve using only multimedia descriptors. However, the focus of the entity repository is to provide valid information to the end user, and one of the criteria for this is to provide multimedia content that clearly depicts the requested entity. With this in mind, a fast but accurate approach to cleaning the content is to discard the images that were retrieved by more than one query. The rationale behind this decision is based on the assumption that if an image is retrieved by more than one query, then either the image contains more than one monument, or there is an ambiguity, noisy tags or wrong geolocation information in the query. This exact issue is presented in Figure 10, where Basilica di San Petronio, Palazzo dei Notai and Palazzo d'Accursio all appear in the same image. A minimal sketch of this filtering rule is given below.
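The sketch below illustrates this first cleaning rule: images that appear in the result sets of more than one monument query are discarded; the data structures and names are assumptions for the example.

import java.util.*;

// Minimal sketch of the multi-query filter: drop every image retrieved by more
// than one monument query, since it likely shows more than one monument or
// comes from ambiguous tags / wrong geolocation.
public class MultiQueryFilter {

    public static Map<String, Set<String>> dropSharedImages(Map<String, Set<String>> imagesPerQuery) {
        // count in how many query result sets each image appears
        Map<String, Integer> occurrences = new HashMap<>();
        for (Set<String> images : imagesPerQuery.values())
            for (String img : images)
                occurrences.merge(img, 1, Integer::sum);

        // keep only images retrieved by exactly one query
        Map<String, Set<String>> cleaned = new HashMap<>();
        imagesPerQuery.forEach((query, images) -> {
            Set<String> kept = new LinkedHashSet<>();
            for (String img : images)
                if (occurrences.get(img) == 1) kept.add(img);
            cleaned.put(query, kept);
        });
        return cleaned;
    }
}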

Figure 10: Multiple monuments in the same image

A second step in refining the retrieved results is the typical content-based similarity approach. The study of the appropriate multimedia descriptors and of the multimedia indexing approach are tasks that will be addressed in Task 4.2, starting in M13 of the project. However, a first evaluation of state-of-the-art algorithms was carried out in this period in order to push the results into the entity repository. With the following work in Task 4.2 the multimedia descriptors for each image will be updated. The evaluation of the descriptors [34][35][36][37] and the selection of the appropriate multimedia descriptor are part of Task 4.2 of the work package; however, an initial set of multimedia descriptor algorithms was already examined. The following algorithms were used to extract local features and then codebooks of different sizes:

• C-SIFT with 512 and 1024 dimensions
• RGB-SIFT with 512 and 1024 dimensions
• Opponent SIFT with 512 and 1024 dimensions
• HSV SIFT with 512 and 1024 dimensions
• Geometric Blur with 500 and 1000 dimensions
• Self Similarity with 500 and 1000 dimensions
• CEDD (global descriptor with 144 dimensions)

Moreover, we have evaluated different distance metrics for these descriptor vectors (a small sketch of some of them is given after the list), such as:
• Bray-Curtis
• Cosine
• L1
• L2
• Spearman
• Chi-squared (X2)
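For reference, the sketch below shows straightforward implementations of some of the listed metrics applied to descriptor vectors; the actual evaluation relied on existing implementations, so this is only illustrative.

// Illustrative implementations of some distance metrics for bag-of-words
// descriptor vectors (L1, L2, cosine distance, Bray-Curtis, chi-squared).
public class DescriptorDistances {

    static double l1(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }

    static double l2(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(d);
    }

    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static double brayCurtis(double[] a, double[] b) {
        double num = 0, den = 0;
        for (int i = 0; i < a.length; i++) { num += Math.abs(a[i] - b[i]); den += Math.abs(a[i] + b[i]); }
        return den == 0 ? 0 : num / den;
    }

    static double chiSquared(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double s = a[i] + b[i];
            if (s > 0) d += (a[i] - b[i]) * (a[i] - b[i]) / s;
        }
        return d / 2;  // common normalization when comparing histograms
    }
}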

Figure 11: Extraction of multimedia descriptors using different algorithms and distance metrics

The monuments depicted in the harvested images have strong spatial information in their structures, e.g. the windows and doors of a specific monument are in the same place with respect to the whole structure. However, the simple bag-of-words approach, which most multimedia similarity methodologies use, loses this kind of information. A solution for keeping such spatial information in the extracted descriptor vectors is to use spatial decomposition techniques. For this purpose we implemented pyramidal decompositions in configurations of 3, 4 and 16 parts and generated descriptors of up to ~10,000 dimensions. Figure 12 depicts typical pyramidal decompositions of level 0, 1 and 2 and their respective histograms.



Figure 12: Pyramidal decompositions of level 0, 1 and 2 [39]

For each image a set of content-based multimedia descriptors was extracted. Then, by using a human-selected "good query" (i.e. one that contains the whole monument and is of good quality), the images most similar to it were selected. Since for the "Italian Monuments" application we needed a set of 5 good-quality images for each monument, the top-5 images were selected. The selection of the query image was a manual process that will be replaced by an automated process after the analysis of the social metadata of the crawled images. A crowdsourcing methodology will be used in the future to select the appropriate query image to be used in order to fetch images similar to the retrieved set. This approach is actually a query expansion methodology where the initial text query is expanded not only with additional textual information existing in the entity repository but also with multimedia content. The approach brings the human into the loop and solves critical, "long tail" problems that appear for rare, ambiguous or locally eminent named entities. The YAGO K2 project [41] also works in the same direction, populating its repository with multimedia content. However, their approach aims at a fully automated process that needs training sets of already annotated multimedia in order to perform. In our future work, in Task 4.2 (starting in M13), a combination of the two approaches will be considered. Since no annotated dataset exists yet, the evaluation of descriptor algorithms using typical evaluation metrics such as mAP or ROC area is not possible yet. In the following experiment, from a set of 30 relevant images of the Anfiteatro Romano in Lecce, a top-5 set of images was selected as the most similar, using each time the same query image but different descriptor algorithms. The selected algorithm for constructing the top-5 sets for the ~100 Italian monument queries is Opponent SIFT, both for its performance (evaluated on a small set of manually annotated samples) and for the speed of descriptor extraction, which is critical in an online process.

Query image:
Results with Opponent SIFT (1024 dim):

Results with Geometric Blur:



Results with GB level 1 pyramidal decomposition:



Results with Self Similarity (1000 dimensions BoW):

2.5 Social re-Ranking of Flickr Photos

For a given monument entity, the photos retrieved from Flickr are ordered by relevance based on Flickr's algorithms. We propose to re-rank them by using measures derived from the social interaction between users and the photos. Each photo has a certain number of tags or comments, and it was designated as a favourite by a certain number of users; these criteria may be exploited for re-ranking the photos accordingly. To go a step further, we want to aggregate as many social dimensions as possible, and not be limited to just one. In order to do this we use Borda ranking. Borda's voting method [40] works as follows: given N candidates and multiple voters, points of N-1, N-2, ..., 0 are assigned to the first-ranked, second-ranked, ..., and last-ranked candidate in each voter's preference order. The points for each candidate are summed across all voters and the winning candidate is the one with the greatest total number of points. Instead of voters we use multiple criteria. If each criterion is thought of as being a voter and r_ik is the rank of result i under criterion k, the Borda count for result i is

b_i = Σ_k (N − r_ik)

and the results are ordered according to these counts. We propose to use a slight variation of this ranking method in the following direction. For each criterion k, each result i that is placed at rank r_ik gets a score of 1/r_ik. The final counts aggregating all the criteria are given by

b(i) = Σ_k 1/r_ik

and the results are reordered according to these counts. We consider the top-N results for each criterion. Some results might appear in the top-N ranked list for some criteria, but not for others. For these we can use a placeholder, e.g. we may assume that the photo is ranked at rank N+1 (optimistically), or at an average rank given the number of results returned in the union set of all photos for this query across all criteria.
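The sketch below illustrates the modified Borda aggregation just described: each criterion contributes 1/r_ik for the photos in its top-N list, photos missing from a criterion's list are optimistically assigned rank N+1, and photos are reordered by the aggregated score; the names are illustrative.

import java.util.*;

// Sketch of the social re-ranking: each criterion (tags, comments, favourites,
// ...) provides a top-N ranked list of photo ids; a photo at rank r contributes
// 1/r, and photos missing from a criterion's top-N are treated as ranked N+1.
public class SocialReRanker {

    public static List<String> reRank(List<List<String>> rankingsPerCriterion, int topN) {
        Map<String, Double> score = new HashMap<>();
        Set<String> allPhotos = new HashSet<>();
        rankingsPerCriterion.forEach(allPhotos::addAll);

        for (String photo : allPhotos) {
            double b = 0.0;
            for (List<String> ranking : rankingsPerCriterion) {
                int idx = ranking.indexOf(photo);               // 0-based position, -1 if absent
                int rank = (idx >= 0) ? idx + 1 : topN + 1;     // optimistic placeholder rank N+1
                b += 1.0 / rank;
            }
            score.put(photo, b);
        }
        List<String> result = new ArrayList<>(allPhotos);
        result.sort(Comparator.comparingDouble((String p) -> score.get(p)).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<String> byTags       = List.of("p1", "p2", "p3");
        List<String> byComments   = List.of("p2", "p1", "p4");
        List<String> byFavourites = List.of("p2", "p3", "p1");
        // p2 wins: 1/2 + 1/1 + 1/1 = 2.5, ahead of p1: 1 + 1/2 + 1/3
        System.out.println(reRank(List.of(byTags, byComments, byFavourites), 3));
    }
}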

2.6 SMILA Pipeline

For the first version of the Media Entity Annotation H-Demo (media harvesting H-Demo) we prepared a set of SMILA pipelets and a SMILA pipeline for triggering the media harvesting in order to populate the entity repository with multimedia content and update the stored entities with new content. The SMILA pipeline is shown in Figure 13.

Figure 13: SMILA Pipeline for the Media Harvesting H-Demo

This first version of the Media Harvesting H-Demo consists of a set of pipelets for media crawling and pipelets for querying and updating the entity repository. The Entity Search pipelet retrieves monument entities from the entity repository. The crawling pipelets search for similar content from the Twitter, Flickr, Panoramio, and Picasa web services, using the entity name and/or GPS coordinates. The updating pipelet is used to store the results retrieved by the media crawlers as entities in the entity repository.

2.6.1 Entity Search and Entity Update Pipelets

The Entity Search and Entity Update pipelets are implemented on top of the Java client for the entity repository API (Section 2.2). The Entity Search pipelet takes a monument's name as input and submits it as a query to the entity repository. The output of the pipelet is a monument entity retrieved from the entity repository. The Entity Update pipelet takes a set of (related) entities as input and creates (or updates already existing) entities (e.g., people, photos, tags) in the entity repository. Created entities are also linked to the related monument entity. In order to allow a tight integration of the pipelets built on top of the entity repository API Java client into SMILA pipelines, namely to allow transferring entities between different pipelets, a conversion of entities to and from SMILA records was implemented. The JSON representation of entities, along with additional information about entity relationships and provenance, is stored as string values in the SMILA record.



2.6.2 Crawling pipelets: Flickr, Panoramio, Picasa

In order to simplify the integration of the various crawling pipelets into the media harvesting pipeline, an abstract MediaCrawlerPipelet class was implemented. MediaCrawlerPipelet performs the routine work with entities, e.g. the conversion of entities from/to JSON strings and the storing/reading of JSON representations of entities to/from SMILA records. Concrete implementations of the MediaCrawlerPipelet class need to implement a single method, crawlMedia. The method takes as input the monument entity (received from the EntitySearchPipelet) that it has to crawl and returns a list of crawled entities related to the given monument, which is then further processed by the EntityUpdatePipelet. MediaCrawlerPipeletFlickr, MediaCrawlerPipeletPanoramio and MediaCrawlerPipeletPicasa are examples of implementations of the abstract MediaCrawlerPipelet class. The functionalities of these pipelets are described in Sections 2.3.3 - 2.3.5.
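A simplified sketch of this pattern is shown below; the Entity placeholder type and the method names other than crawlMedia are assumptions, and the SMILA record conversion performed by the real base class is only indicated by comments.

import java.util.List;

// Simplified sketch of the crawler-pipelet pattern: the base class handles the
// routine record/JSON conversion, concrete crawlers only implement crawlMedia.
public abstract class MediaCrawlerPipeletSketch {

    /** Placeholder for the entity representation exchanged between pipelets. */
    public interface Entity { String name(); Double latitude(); Double longitude(); }

    // Routine work (reading the monument entity from the SMILA record, converting
    // the crawled entities back to JSON strings) would be handled here.
    public final List<Entity> process(Entity monument) {
        List<Entity> related = crawlMedia(monument);
        // ... convert 'related' to JSON and attach it to the SMILA record ...
        return related;
    }

    /** The only method a concrete crawler (Flickr, Panoramio, Picasa) must implement. */
    protected abstract List<Entity> crawlMedia(Entity monument);
}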

2.6.3 Twitter Pipelets

Two services have been implemented for crawling Twitter and downloading/storing images relevant to specific keywords. The first service has been implemented as a SMILA pipelet, namely TwitterCrawlerPipelet. The TwitterCrawlerPipelet can be called through SMILA's search client or any RESTful web service client, with a keyword as input. The pipelet crawls through the relevant messages of the last 9 days, saves their content (text and images) and returns. This pipelet is implemented as a wrapper around the first Twitter crawler presented in Section 2.3.7. The second Twitter crawler has been implemented as an OSGi service (org.eclipse.smila.services.twitterCrawlerService), which is started along with the compiled SMILA framework bundle. The service implements the second crawler presented in Section 2.3.7, which uses the Twitter Streaming API. The user can communicate with the service through a set of pipelets: a) the TwitterCrawlerStartPipelet, which accepts as input a set of keywords, starts a permanent search over the keywords and returns a (String) receipt to the caller; b) the TwitterCrawlerStatusPipelet, which accepts a receipt and returns the status of the initiated search (e.g. Running); and c) the TwitterCrawlerStopPipelet, which accepts a receipt and stops the search. This approach was selected in order to overcome the pipelet timeout issue of SMILA. SMILA is designed in such a way that a pipelet should not exceed a certain time duration; beyond that duration the pipelet is timed out and the process is stopped. This approach is very useful when querying an indexing structure or performing a real-time filter on some data. However, it is not suitable for event-based processes that need to stay available for the lifespan of the rest of the services (e.g. for days or even months).



3. Conclusions and next steps

Deliverable D4.1 "Space and Time entity repository" presented the work done in Task 4.1 to adapt Entitypedia as the CUbRIK entity repository and to evolve it into an advanced multipurpose entity repository with advanced notions of space and time. Moreover, the instantiation of the service and the laborious and demanding task of populating the entity repository with large amounts of valuable content were also presented. In detail, the report presented the entity repository, its data model and the methodology that has been followed to progressively populate it by reusing high-quality data sources. In particular, the report described in detail the work done with YAGO, in which a novel modularization technique has been experimented with; this technique allows us to select entities of specific types and, as shown by the evaluation, with a high degree of accuracy. Moreover, a refinement has been employed to fix most of the structural inconsistency and incompleteness problems which we identified in YAGO. In the second part of the deliverable, the media harvesting H-Demo and a detailed discussion of the components and pipelets that were developed were presented. The aim of the H-Demo is to demonstrate the ability to enhance the entity repository with multimedia content and its metadata to enable new usage scenarios in various applications.

As future work we plan to further evolve our methodology and develop an importing framework supporting it. In fact, the software components that we developed to support the different phases are mostly customized for GeoNames and YAGO. This has to be generalized to support the importing of generic data sources. In particular, we plan to develop an entity matching tool to ensure that no duplicates are introduced with the importing. We will keep evolving the entity repository by importing entities of different types from suitable data sources. Moreover, the functionality and pipelines developed in the context of the media harvesting H-Demo will be incorporated into larger demo applications inside CUbRIK to enable complete multimedia search pipelines.


