PANGAEA


PANGAEA - Providing access to geoscientific data using Apache Lucene Java

Uwe Schindler
PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de


My Background

I am a committer and PMC member of Apache Lucene and Solr. My main focus is the development of Lucene Java, where I implemented fast numerical search and maintain the new attribute-based text analysis API. I studied physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geospatial retrieval functions with Lucene Java. I give talks about Lucene at international conferences such as ApacheCon EU/US, Lucene Eurocon, and Berlin Buzzwords, and at various local meetups.


About PANGAEA

- since 1993: Information system for earth system science data, hosted by AWI & MARUM
- 2001: Mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDCMARE)
- 2007: Mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC)
- 2010 (certification in progress): Mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC)


Network of World Data Centers

Geophysical Year 1957

Airglow Mitaka, Japan

Astronomy Beijing, China

Meteorology Asheville NC, USA Beijing, China Obninsk, Russia

Marine Geology and Geophysics Boulder CO, USA Nuclear Radiation Moscow, Russia Tokyo, Japan

Atmospheric Trace Gases Oak Ridge TN, USA

Seismology Denver CO, USA Beijing, China

Cosmic Rays Toyokawa, Japan

Soils Wageningen, The Netherlands

Earth Tides Brussels, Belgium

Solar Activity Meudon, France

Geology Beijing, China

Solar Radio Emission Nagano, Japan

Geomagnetism Copenhagen, Denmark Edinburgh, UK Kyoto, Japan Colaba, India

Solar Terrestrial Physics Boulder CO, USA Didcot Oxon, UK Moscow, Russia Haymarket, Australia

Glaciology Boulder CO, USA Cambridge, UK Lanzhou, China

Ionosphere Tokyo, Japan

Marine Environmental Sciences Bremen, Germany (2001)

Rotation of the Earth Obninsk, Russia Washington DC, USA

Satellite Information Greenbelt MD, USA

Aurora Tokyo, Japan

Human Interactions in the Environment Palisades NY, USA

Rockets and Satellites Obninsk, Russia

WDC Co-ordination Offices Washington DC, USA Beijing, China

Oceanography Obninsk, Russia Silver Spring MD, USA Tianjin, China

Recent Crustal Movements Ondrejov, Czech Republic

Paleoclimatology Boulder CO, USA

Renewable Resources and Environment Beijing, China

Remotely Sensed Land Data Sioux Falls SD, USA

Solid Earth Geophysics Beijing, China Boulder CO, USA Moscow, Russia

Space Science Beijing, China

Space Science Satellites Kanagawa, Japan

Sunspot Index Brussels, Belgium


Why do we need Data Libraries?

- Good scientific practice
- Needed for verification of scientific work
- Good availability of data for large scale and complex scientific approaches
- "Data recycling" is more effective than reproduction


Geosciences before 1900

William Smith, 1815

HMS Challenger, 1875

Turin papyrus, ~1160 BC


Technical Improvements

ENIAC, 1944

Magnetometer


Development of the global climate

[Figures: climate curves with the time axes "Thousands of years before present" and "The last 1300 years"]


Information increase in empirical sciences?

[Chart: Publications and Data, 1970 to 2010; vertical axis from 0 to 30]


Archiving and publication of scientific data

- Data acquisition
- Quality assurance
- Long-term availability and access


Long term archive

- Open access & non restricted data
  o Creative Commons license
- Data accepted from individual scientists, institutes, and science projects
- Long term funding for basic operation
  o hardware, software, system management & organisation
- Long term preservation of data
  o Technical: security, migration of media
  o Usability: preserving the integrity & semantics of data sets


Contents


Data Types in PANGAEA

- Profiles => doi:10.1594/PANGAEA.701299
- Time series => doi:10.1594/PANGAEA.323487
- Sea bed photos => doi:10.1594/PANGAEA.319877
- Distributed samples => doi:10.1594/PANGAEA.51749
- Complex data => doi:10.1594/PANGAEA.108079
- Air photos => doi:10.1594/PANGAEA.323540
- Audio record => doi:10.1594/PANGAEA.339110

[Figure: sediment core profiles PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1 (IRD, Sand, CaCO3, TOC, Radio, Smect) plotted against Age (kyr), max. 233.55 kyr]

[Figure: map from 11° to 15° longitude and 54°0' to 55°30' latitude showing grain size classes (KOLP A, KOLP B, KOLP DIN, KOEHN, KOEHN2), geochemistry samples, the world vector shore line and the 20 m depth line. Scale: 1:2695194 at Latitude 0°. Source: Baltic Sea Research Institute, Warnemünde.]


Statistics (9/2010)

[Pie chart by category: unclassified, Ice, Atmosphere, Sediment, Corals, Water]

- Total number of data sets: ~ 1 million
- Data items: ~ 8 billion


Now the technical details :-)


PANGAEA Architecture

[Architecture diagram, component labels: Editorial system, Sybase ASE, RDB, Harddisk + tape (silo), Apache Lucene, Webserver, Middleware, PANGAEA search engine, Google Maps / Earth]


Indexing contents from relational database with dynamic updates

[Diagram labels: Staffs, Projects, Data Set, Data Series, Events, Update Log, XML Data Set Description (Metadata)]


Indexed Information

- Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types)
- Fulltext data set contents
- Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery]
- Soon: Fulltext of attached external documentation (PDF)
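
The geographical fields listed above are searched with numeric range queries. As a rough illustration of the matching logic only (not of Lucene's index structures, and not PANGAEA's actual code), a bounding-box search can be read as a conjunction of one inclusive range test per coordinate. The sketch below is plain Java without Lucene; the class `BoundingBox` and its methods are hypothetical names chosen for this example. In Lucene Java 3.x, each per-field range would correspond to a `NumericRangeQuery` (for instance one created with `NumericRangeQuery.newDoubleRange`) on a `NumericField`, combined in a `BooleanQuery`.

```java
/** Hypothetical helper: a latitude/longitude box matched with inclusive range tests. */
final class BoundingBox {
    final double minLat;
    final double maxLat;
    final double minLon;
    final double maxLon;

    BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        if (minLat > maxLat) {
            throw new IllegalArgumentException("minLat must not exceed maxLat");
        }
        this.minLat = minLat;
        this.maxLat = maxLat;
        this.minLon = minLon;
        this.maxLon = maxLon;
    }

    /** Inclusive range test: the condition a single numeric range query expresses. */
    static boolean inRange(double value, double min, double max) {
        return value >= min && value <= max;
    }

    /** A point matches when its latitude AND its longitude fall into their ranges. */
    boolean contains(double lat, double lon) {
        if (!inRange(lat, minLat, maxLat)) {
            return false;
        }
        if (minLon <= maxLon) {
            return inRange(lon, minLon, maxLon);
        }
        // minLon > maxLon means the box crosses the 180° meridian,
        // so the longitude condition becomes a union of two ranges.
        return inRange(lon, minLon, 180.0) || inRange(lon, -180.0, maxLon);
    }

    public static void main(String[] args) {
        // Example values only: a box from 54° to 55.5° N and 11° to 15° E.
        BoundingBox box = new BoundingBox(54.0, 55.5, 11.0, 15.0);
        System.out.println(box.contains(54.5, 12.0));  // true
        System.out.println(box.contains(53.08, 8.8));  // false
    }
}
```

The special case for boxes that cross the 180° meridian is an addition made here for completeness; the slides do not say how PANGAEA handles it.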


Geo-Retrieval with Lucene


Using scored queries with KML regions as filters


Apache Lucene as fast Key-Value Store

- Lucene is used for almost every query on the web-client
- Lots of keyword terms indexed for quick retrieval of data sets
- Example: Lookup of datasets related to publications using DOI
  o PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers
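
The DOI lookup described above is an exact-match retrieval from one key to the data sets stored under it. A minimal sketch of that access pattern follows, using an in-memory map in place of a Lucene index. This is an assumption made to keep the example self-contained; in Lucene the comparable query would typically be a `TermQuery` on a non-analyzed keyword field. The class `DoiLookup`, the publication DOI `10.1000/example.1`, and its pairing with the data set DOIs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical stand-in for a keyword index: publication DOI -> related data set DOIs. */
final class DoiLookup {
    private final Map<String, List<String>> datasetsByPublicationDoi = new HashMap<>();

    /** DOI names are case-insensitive, so keys are lower-cased and a "doi:" prefix is dropped. */
    static String normalize(String doi) {
        String key = doi.trim().toLowerCase(Locale.ROOT);
        return key.startsWith("doi:") ? key.substring(4) : key;
    }

    /** Registers one data set under the publication it belongs to. */
    void add(String publicationDoi, String datasetDoi) {
        datasetsByPublicationDoi
                .computeIfAbsent(normalize(publicationDoi), k -> new ArrayList<>())
                .add(datasetDoi);
    }

    /** Exact-match lookup; returns an empty list for unknown publications. */
    List<String> lookup(String publicationDoi) {
        List<String> hits = datasetsByPublicationDoi.get(normalize(publicationDoi));
        return hits == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(hits);
    }

    public static void main(String[] args) {
        DoiLookup index = new DoiLookup();
        // Made-up pairing, for illustration only.
        index.add("10.1000/example.1", "doi:10.1594/PANGAEA.701299");
        index.add("10.1000/example.1", "doi:10.1594/PANGAEA.323487");
        System.out.println(index.lookup("doi:10.1000/EXAMPLE.1"));
    }
}
```

Normalizing the key once at indexing time and once at query time keeps the lookup a single hash access, which is the property that matters when many such requests arrive per second.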




Live

PRESENTATION


Contact

Uwe Schindler

PANGAEA - Publishing Network for Geoscientific & Environmental Data
MARUM, Leobener Str., 28359 Bremen, Germany
uschindler@pangaea.de

SD DataSolutions GmbH
Wätjenstr. 49, 28213 Bremen, Germany
uschindler@sd-datasolutions.de


Thank you! Learn more about Apache Lucene at www.lucidimagination.com

