PANGAEA - Providing access to geoscientific data using Apache Lucene Java
Uwe Schindler
PANGAEA / SD DataSolutions GmbH, uschindler@pangaea.de
My Background
I am a committer and PMC member of Apache Lucene and Solr. My main focus is the development of Lucene Java, where I implemented fast numerical search and maintain the new attribute-based text analysis API. I studied physics at the University of Erlangen-Nuremberg and work as a consultant and software architect for PANGAEA (Publishing Network for Geoscientific & Environmental Data) in Bremen, Germany, where I implemented the portal's geospatial retrieval functions with Lucene Java. I give talks about Lucene at international conferences such as ApacheCon EU/US, Lucene Eurocon and Berlin Buzzwords, and at various local meetups.
About PANGAEA
Since 1993: Information system for earth system science data, hosted by AWI & MARUM
2001: Mandate of the International Council for Science (ICSU): World Data Center for Marine Environmental Sciences (WDCMARE)
2007: Mandate of the World Meteorological Organisation (WMO): World Radiation Monitoring Center (WRMC)
2010 (certification in progress): Mandate of the World Meteorological Organisation (WMO): Data Collection and Processing Center (DCPC)
Network of World Data Centers
Geophysical Year 1957
Airglow Mitaka, Japan
Astronomy Beijing, China
Meteorology Asheville NC, USA Beijing, China Obninsk, Russia
Marine Geology and Geophysics Boulder CO, USA Nuclear Radiation Moscow, Russia Tokyo, Japan
Atmospheric Trace Gases Oak Ridge TN, USA
Seismology Denver CO, USA Beijing, China
Cosmic Rays Toyokawa, Japan
Soils Wageningen, The Netherlands
Earth Tides Brussels, Belgium
Solar Activity Meudon, France
Geology Beijing, China
Solar Radio Emission Nagano, Japan
Geomagnetism Copenhagen, Denmark Edinburgh, UK Kyoto, Japan Colaba, India
Solar Terrestrial Physics Boulder CO, USA Didcot Oxon, UK Moscow, Russia Haymarket, Australia
Glaciology Boulder CO, USA Cambridge, UK Lanzhou, China
Ionosphere Tokyo, Japan
Marine Environmental Sciences Bremen, Germany (2001)
Rotation of the Earth Obninsk, Russia Washington DC, USA
Satellite Information Greenbelt MD, USA
Aurora Tokyo, Japan
Human Interactions in the Environment Palisades NY, USA
Rockets and Satellites Obninsk, Russia
WDC Co-ordination Offices Washington DC, USA Beijing, China
Oceanography Obninsk, Russia Silver Spring MD, USA Tianjin, China
Recent Crustal Movements Ondrejov, Czech Republic
Paleoclimatology Boulder CO, USA
Renewable Resources and Environment Beijing, China
Remotely Sensed Land Data Sioux Falls SD, USA
Solid Earth Geophysics Beijing, China Boulder CO, USA Moscow, Russia
Space Science Beijing, China
Space Science Satellites Kanagawa, Japan
Sunspot Index Brussels, Belgium
Why do we need Data Libraries?
- Good scientific practice
- Needed for verification of scientific work
- Good availability of data for large scale and complex scientific approaches
- "Data recycling" is more effective than reproduction
Geosciences before 1900 William Smith, 1815
HMS Challenger, 1875
Turin papyrus, ~1160 BC
Technical Improvements ENIAC, 1944
Magnetometer
Development of the global climate
[Figure: two climate curves. Time axis: thousands of years before present. Second panel: the last 1300 years.]
Information increase in empirical sciences?
[Figure: growth of publications and data, 1970 to 2010; vertical axis from 0 to 30.]
Archiving and publication of scientific data
Data acquisition, quality assurance, long-term availability and access
Long term archive
- Open access & non restricted data
  o Creative Commons license
- Data accepted from individual scientists, institutes, and science projects
- Long term funding for basic operation
  o hardware, software, system management & organisation
- Long term preservation of data
  o Technical: security, migration of media
  o Usability: preserving the integrity & semantics of data sets
Contents
Data Types in PANGAEA
- Profiles => doi:10.1594/PANGAEA.701299
- Time series => doi:10.1594/PANGAEA.323487
- Sea bed photos => doi:10.1594/PANGAEA.319877
- Distributed samples => doi:10.1594/PANGAEA.51749
- Complex data => doi:10.1594/PANGAEA.108079
- Air photos => doi:10.1594/PANGAEA.323540
- Audio record => doi:10.1594/PANGAEA.339110
[Figure: profiles for sediment cores PS1389-3, PS1390-3, PS1431-1, PS1640-1 and PS1648-1 (PS1389-3ff): IRD (grav/10 cm3), Sand (%), CaCO3 (%), TOC (%), Radio (%/sand) and Smect (%/clay), plotted against Age (kyr), max.: 233.55 kyr.]
[Map: 11° to 15° longitude, 54° 0' to 55°30' latitude. Scale: 1:2695194 at Latitude 0°. Source: Baltic Sea Research Institute, Warnemünde. Legend: World vector shore line; Grain size class KOLP A; Grain size class KOLP B; Grain size class KOLP DIN; Grain size class KOEHN; Grain size class KOEHN2; Geochemistry; 20 m.]
Statistics (9/2010)
[Chart of data sets by category: unclassified, Ice, Atmosphere, Sediment, Corals, Water.]
Total number of data sets: ~1 million
Data items: ~8 billion
Now the technical details :-)
PANGAEA Architecture
Editorial system
Sybase ASE
Harddisk + tape (silo)
Apache Lucene
RDB
Webserver
Google Maps / Earth
PANGAEA search engine
Middleware
Indexing contents from relational database with dynamic updates
[Diagram labels: Staffs, Projects, Data Set, Data Series, Events, Update Log, XML Data Set Description (Metadata).]
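The slide only names an update log, so the following is a rough sketch of how log-driven incremental indexing might look, not PANGAEA's actual implementation. The class `UpdateLogIndexer`, the `LogEntry` layout (sequence number, data set id, delete flag, regenerated XML) and the in-memory map standing in for the search index are all assumptions made for illustration. With Lucene, the upsert step would presumably map to `IndexWriter.updateDocument(Term, Document)`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class names and log layout are assumptions, not PANGAEA's schema.
class UpdateLogIndexer {

    /** One row of the assumed update log: which data set changed and how. */
    static final class LogEntry {
        final long sequence;      // increasing position in the log
        final String dataSetId;   // key of the changed data set
        final boolean deleted;    // true if the data set was removed
        final String metadataXml; // regenerated XML description (null when deleted)

        LogEntry(long sequence, String dataSetId, boolean deleted, String metadataXml) {
            this.sequence = sequence;
            this.dataSetId = dataSetId;
            this.deleted = deleted;
            this.metadataXml = metadataXml;
        }
    }

    // In-memory stand-in for the search index: data set id -> XML metadata.
    private final Map<String, String> index = new LinkedHashMap<>();
    // Highest log position that has already been applied.
    private long lastSequence = 0;

    /** Applies every entry newer than the last processed one; returns the number applied. */
    int applyUpdates(List<LogEntry> log) {
        int applied = 0;
        for (LogEntry e : log) { // the log is assumed to be ordered by sequence
            if (e.sequence <= lastSequence) {
                continue; // already indexed in an earlier pass
            }
            if (e.deleted) {
                index.remove(e.dataSetId);
            } else {
                index.put(e.dataSetId, e.metadataXml); // insert or overwrite
            }
            lastSequence = e.sequence;
            applied++;
        }
        return applied;
    }

    String get(String dataSetId) { return index.get(dataSetId); }

    int size() { return index.size(); }

    public static void main(String[] args) {
        UpdateLogIndexer indexer = new UpdateLogIndexer();
        List<LogEntry> log = new ArrayList<>();
        log.add(new LogEntry(1, "ds-1", false, "<md>v1</md>"));
        log.add(new LogEntry(2, "ds-1", false, "<md>v2</md>"));
        System.out.println(indexer.applyUpdates(log) + " entries applied");
    }
}
```

Remembering the last applied log position means a repeated pass over the same log does no extra work, which is what makes frequent polling cheap in this sketch.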
Indexed Information
- Textual metadata: citation (authors, title), abstract, measurement parameters, methods, associated projects, comments, documentation (including field info for all XML schema element types)
- Fulltext data set contents
- Geographical information: latitude/longitude/BBOX/track, dates, geological age, depth/elevation [NumericField/NumericRangeQuery]
- Soon: Fulltext of attached external documentation (PDF)
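Since latitude and longitude are indexed as numeric values, a bounding box search can be thought of as two numeric range constraints that must both hold. The sketch below shows that idea in plain Java without Lucene; `BoundingBoxSketch`, `Point` and the sample coordinates are hypothetical. In Lucene 3.x each constraint would presumably be a `NumericRangeQuery.newDoubleRange(...)` on the corresponding field, combined in a `BooleanQuery`. Boxes that cross the date line are not handled here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a bounding box as two inclusive numeric ranges.
class BoundingBoxSketch {

    /** A data set position; the ids and coordinates used below are made up. */
    static final class Point {
        final String id;
        final double lat;
        final double lon;

        Point(String id, double lat, double lon) {
            this.id = id;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Inclusive range check, the plain-Java analogue of one numeric range query. */
    static boolean inRange(double value, double min, double max) {
        return value >= min && value <= max;
    }

    /** Returns the ids of all points whose latitude AND longitude fall inside the box. */
    static List<String> search(List<Point> points,
                               double south, double north, double west, double east) {
        List<String> hits = new ArrayList<>();
        for (Point p : points) {
            if (inRange(p.lat, south, north) && inRange(p.lon, west, east)) {
                hits.add(p.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Point> points = new ArrayList<>();
        points.add(new Point("station-a", 54.18, 12.08));
        points.add(new Point("station-b", 53.08, 8.80));
        System.out.println(search(points, 54.0, 55.5, 11.0, 15.0));
    }
}
```

A linear scan is used only to keep the sketch short; the point of a trie-encoded numeric field is that the index answers each range without visiting every document.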
Geo-Retrieval with Lucene
Using scored queries with KML regions as filters
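As an illustration of filtering scored hits by a region, the following sketch treats a KML region as a simple polygon of longitude/latitude pairs and keeps only the hits that lie inside it, preserving their score order. It uses the JDK's `java.awt.geom.Path2D` rather than Lucene's filter API; `RegionFilterSketch`, `Hit` and the planar point-in-polygon test are assumptions, and the slide does not say how PANGAEA implements this step.

```java
import java.awt.geom.Path2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a polygon region applied as a filter to already scored hits.
class RegionFilterSketch {

    /** A search hit with its relevance score and position; the values below are made up. */
    static final class Hit {
        final String id;
        final float score;
        final double lat;
        final double lon;

        Hit(String id, float score, double lat, double lon) {
            this.id = id;
            this.score = score;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Builds a closed polygon from KML-style coordinates given as {lon, lat} pairs. */
    static Path2D.Double polygon(double[][] lonLat) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(lonLat[0][0], lonLat[0][1]);
        for (int i = 1; i < lonLat.length; i++) {
            path.lineTo(lonLat[i][0], lonLat[i][1]);
        }
        path.closePath();
        return path;
    }

    /** Keeps only hits inside the region; the incoming score order is left unchanged. */
    static List<Hit> filter(List<Hit> scoredHits, Path2D region) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : scoredHits) {
            if (region.contains(h.lon, h.lat)) { // x = longitude, y = latitude
                kept.add(h);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Path2D.Double region = polygon(new double[][] {{0, 0}, {10, 0}, {0, 10}});
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 0.9f, 2.0, 2.0));
        hits.add(new Hit("b", 0.8f, 8.0, 8.0));
        for (Hit h : filter(hits, region)) {
            System.out.println(h.id + " " + h.score);
        }
    }
}
```

Keeping scoring and region membership separate means the region only decides which hits survive and never changes their ranking.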
Apache Lucene as fast Key-Value Store
- Lucene is used for almost every query on the web client
- Lots of keyword terms are indexed for quick retrieval of data sets
- Example: lookup of data sets related to publications using DOI
  - PANGAEA is hit by hundreds of DOI lookup queries per second from scientific publishers
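The key-value use can be pictured as an exact-match lookup from one keyword term to the data sets that carry it. The sketch below models that with a hash map instead of a Lucene index; `DoiLookupSketch`, the normalization rule and the example DOIs are hypothetical placeholders, not real PANGAEA records. In Lucene the same lookup would presumably be a `TermQuery` on a DOI field indexed without tokenization.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: exact keyword lookup from a publication DOI to related data sets.
class DoiLookupSketch {

    // Normalized publication DOI -> DOIs of related data sets.
    private final Map<String, List<String>> postings = new HashMap<>();

    /** Registers a data set as related to a publication. */
    void add(String publicationDoi, String dataSetDoi) {
        postings.computeIfAbsent(normalize(publicationDoi), k -> new ArrayList<>())
                .add(dataSetDoi);
    }

    /** Returns the related data sets, or an empty list if the DOI is unknown. */
    List<String> lookup(String publicationDoi) {
        List<String> hits = postings.get(normalize(publicationDoi));
        if (hits == null) {
            return Collections.<String>emptyList();
        }
        return Collections.unmodifiableList(hits);
    }

    /** DOIs are case-insensitive, so keys are lower-cased and a "doi:" prefix is dropped. */
    static String normalize(String doi) {
        String d = doi.trim().toLowerCase(Locale.ROOT);
        return d.startsWith("doi:") ? d.substring(4) : d;
    }

    public static void main(String[] args) {
        DoiLookupSketch store = new DoiLookupSketch();
        // Both DOIs are placeholders invented for this example.
        store.add("10.5555/example.paper", "10.1594/PANGAEA.EXAMPLE");
        System.out.println(store.lookup("doi:10.5555/Example.Paper"));
    }
}
```

Normalizing the key once at index time and once at query time is what keeps the lookup a single exact match, which suits a load of many small, identical-shaped requests.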
Live presentation
Contact Uwe Schindler
PANGAEA - Publishing Network for Geoscientific & Environmental Data MARUM, Leobener Str., 28359 Bremen, Germany uschindler@pangaea.de
SD DataSolutions GmbH, Wätjenstr. 49, 28213 Bremen, Germany uschindler@sd-datasolutions.de
Thank you! Learn more about Apache Lucene at www.lucidimagination.com