The Challenge of Big Data in Science (4th International LSDMA Symposium)
Thursday 1 October 2015
08:30 - 09:00  Registration
Room: Foyer of Lecture Hall NTI
09:00 - 09:30  Welcome, Introduction
Achim Streit (KIT), Michael Decker (KIT)
Room: Lecture Hall NTI
09:30 - 10:30  Data and software preservation for open science - connecting publications with Cyberinfrastructure
Jarek Nabrzyski (University of Notre Dame)
Room: Lecture Hall NTI
Many science domains are exploring mechanisms to preserve research data artifacts so that they can be reused in the future, consistent with the scientific principles of reproducibility. For the computational sciences, research artifacts include not just data but also the software that produced the data artifacts. Without access to the software, it is difficult to form a proper scientific contextualization of, and therefore a judgement about, the resultant data. One possible mechanism for preserving and sharing data is to apply the principles of Linked Open Data. These principles allow data to be discovered, shared, understood and reused for scientific research using web standards for handling data, and they encourage publication of data under an open license. However, it has been observed that linked data without proper context is just more data in a different schema. This observation is particularly true in the sciences, where requirements such as provenance, quality, credit, attribution and methods are critical to the scientific process. For the computational sciences, the context and provenance of data artifacts require a connection to the software that produced those artifacts, as well as a connection to the conceptual and mathematical models that constitute the algorithms instantiated in that software. In this talk, various US-based projects addressing the issues of scientific reproducibility will be introduced. Next, a model of publishing software that would facilitate connecting data artifacts to the software algorithms that produced them, using a Linked Open Data model, will be presented in detail. This model extends work done by different scientific communities to share measurements in a standardized way that captures the provenance, methods, conditions and units of the measurement process, and it extends the conceptualization to include the model and algorithm that constitute a “computational measurement”.
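As an illustration of the Linked Open Data approach described above (this sketch is not part of the talk): the snippet below uses Python with rdflib and the W3C PROV-O vocabulary to link a data artifact to the software release that produced it. All identifiers (the DOIs and the run URI) are hypothetical placeholders.

# Minimal sketch, assuming rdflib is installed; DOIs and run URI are hypothetical.
from rdflib import Graph, Namespace, URIRef, Literal, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.bind("prov", PROV)

dataset  = URIRef("https://doi.org/10.0000/example-dataset")    # hypothetical dataset DOI
software = URIRef("https://doi.org/10.0000/example-code-v1.2")  # hypothetical software DOI
run      = URIRef("urn:example:analysis-run-42")                # hypothetical run identifier

# The dataset is an entity generated by an analysis run that used the software.
g.add((dataset,  RDF.type, PROV.Entity))
g.add((software, RDF.type, PROV.Entity))
g.add((run,      RDF.type, PROV.Activity))
g.add((dataset,  PROV.wasGeneratedBy, run))
g.add((run,      PROV.used, software))
g.add((run,      PROV.startedAtTime, Literal("2015-10-01T09:30:00")))

print(g.serialize(format="turtle"))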
10:30 - 11:00  Coffee break
Room: Foyer of Lecture Hall NTI
11:00 - 11:30  Computational Requirements for Climate Science
Peter Braesicke (KIT)
Room: Lecture Hall NTI
11:30 - 12:00  From seismic stations to integrated data centers and computational facilities
Javier Quinteros (GFZ Potsdam)
Room: Lecture Hall NTI
GEOFON is a component of the Scientific Infrastructure at the German Research Centre for Geosciences (GFZ) [1]. GEOFON operates a global real-time seismic network and a seismological archive, and it provides rapid global earthquake information. It is also part of the European Integrated Data Archive (EIDA) [2]. EIDA today is a distributed data archive with 10 institutions, 5000 stations and 300 TB of data; it processes more than 100,000 data requests per day and sends between 100 and 300 GB of data per day when an important earthquake has taken place. The EIDA system is also a fundamental service and component of EPOS Seismology within the European Plate Observing System [3]. One of the challenges for GEOFON/EIDA is not the amount of data to store but, given our completely decentralized approach, making the federation look like a single data centre from the user's perspective, in particular while integrating new data centres continuously. For the near future, we plan to add two new functionalities: 1) reproducibility of datasets, i.e. the ability to reconstruct the data that were sent at a particular moment in time; with the large number of requests and with (meta)data changing frequently during its initial phase, this is a challenge for our "Dynamic Data" developments; and 2) finding the best way to bring big data requests together with processing tools at computational facilities, where the data will be processed. We are actively working in the context of the EUDAT2020 project [4] to provide these services to our community.
[1] http://geofon.gfz-potsdam.de/
[2] http://www.orfeus-eu.org/eida/eida.html
[3] http://www.epos-eu.org/
[4] http://eudat.eu/
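For illustration (not part of the abstract): the sketch below requests waveform data from the GEOFON EIDA node through the standard FDSN web-service interface using the ObsPy client. ObsPy, the station code and the time window are example choices, not taken from the talk.

# Minimal sketch, assuming ObsPy is installed.
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

client = Client("GFZ")                   # GEOFON data centre at GFZ Potsdam
t0 = UTCDateTime("2015-09-17T00:00:00")  # arbitrary example start time

# Fetch ten minutes of vertical-component broadband data for one GEOFON (GE) station.
stream = client.get_waveforms(network="GE", station="SNAA", location="*",
                              channel="BHZ", starttime=t0, endtime=t0 + 600)
print(stream)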
12:00 - 12:30  Indigo DataCloud
Isabel Campos (CSIC)
Room: Lecture Hall NTI
The presentation will provide an overview of the goals of the INDIGO project, with an emphasis on the strategy for improving the usability of e-infrastructures. In this context, the work of the INDIGO consortium during the first months of the project on the requirements of research communities and on the strategy for supporting research data will be discussed.
12:30 - 14:00  Lunch break
Room: Foyer of Lecture Hall NTI
14:00 - 14:30  Data management challenges in Astronomy and Astroparticle Physics
Giovanni Lamanna (LAPP/IN2P3)
Room: Lecture Hall NTI
Astronomy is experiencing a deluge of data from the next generation of telescopes prioritised by the European Strategy Forum on Research Infrastructures (ESFRI) and from other world-class facilities. The new ASTERICS-H2020 project brings the scientific communities concerned together in Europe to find common solutions to their Big Data, interoperability, and data-access challenges. The presentation will highlight these new challenges in astronomy and the work being undertaken, also in cooperation with federated initiatives of major computing and data centres and with e-infrastructures in Europe.
14:30 - 15:00  Science SQL: Advancing from Data to Service Stewardship
Peter Baumann (Jacobs University Bremen)
Room: Lecture Hall NTI
In today's science archives, data typically are managed separately from the metadata, and with different, restricted retrieval capabilities. While databases are good at metadata modelled in tables, XML hierarchies, and RDF graphs, they traditionally do not support "the data", in particular multi-dimensional arrays. Consequently, file-based solutions let users "drown in data files" rather than presenting just a few datacubes for dissection and rejoining with other cubes. In the quest for improved service quality, the new paradigm is to allow users to "ask any question, any time", thereby enabling them to "build their own product on the go". This requires a new generation of services with new quality parameters, such as flexibility, ease of access, embedding into well-known user tools, and scalability mechanisms that remain completely transparent to users. In the field of massive spatio-temporal arrays this gap is being closed by Array Databases, pioneered by the scalable rasdaman ("raster data manager") array engine. Its declarative query language, rasql, extends SQL with array operators which are optimized and parallelized on the server side, including dynamic mash-up configuration. As of today, rasdaman is in operational use on hundreds of Terabytes of satellite image timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL and the Open Geospatial Consortium (OGC) Big Geo Data standards, with manifold take-up by both open-source and commercial systems. We show how array queries enable flexible data access and describe the rasdaman architecture with its optimization and parallelization techniques.
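For illustration (not part of the abstract): the snippet below embeds a rasql-style query as a Python string to show the kind of declarative array operation described above, trimming a 3-D x/y/time datacube to one time slice and encoding it as PNG. The coverage name "SatTimeseries" and the subset indices are hypothetical, and submitting the query through an actual rasdaman client is not shown.

# Minimal sketch; the coverage name and indices are hypothetical placeholders.
rasql_query = (
    'SELECT encode( c[0:999, 0:999, 10], "png" ) '  # spatial slice at time index 10
    'FROM SatTimeseries AS c'                       # a 3-D x/y/time datacube
)
print(rasql_query)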
15:00 - 15:30  Coffee break
Room: Foyer of Lecture Hall NTI
15:30 - 16:00  How astronomy shares and reuses big and small data
Francoise Genova (Strasbourg Astronomical Data Centre CDS)
Room: Lecture Hall NTI
Astronomy is based on observations from ground- and space-based telescopes, and it has been for many years at the forefront of the world-wide sharing of scientific data, with open access (in general after a proprietary period), the definition of common formats, and the early networking of on-line services on the web, including observatory archives, added-value databases such as those developed by the Strasbourg astronomical data centre CDS, and academic journals. Since the turn of the century, the development of the Virtual Observatory, a framework of standards and interoperable tools, has allowed seamless access to the wealth of available on-line resources. The VO framework is open and inclusive, and all data producers, large agencies as well as teams in research labs, can provide their data within the framework. Services also host 'smaller' data, such as the CDS VizieR database for data attached to publications, and this 'long tail' of research data is made available, discoverable and usable, just like the 'Big Data' produced by the large facilities and the large sky surveys. The talk will give an overview of the history and status of data sharing in the discipline. On-line resources are used by astronomers in their daily research work; the combination of data from different origins for multi-wavelength, multi-instrument studies is at the core of the scientific process, and the change of research paradigm towards 'Open science' has already largely been accomplished.
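For illustration (not part of the abstract): the sketch below queries the CDS VizieR service from Python using the astroquery package; astroquery and the chosen catalogue identifier are example assumptions, not something mentioned in the talk.

# Minimal sketch, assuming astroquery is installed.
from astroquery.vizier import Vizier

# Look up catalogues whose descriptions match a keyword.
catalog_list = Vizier.find_catalogs("NGC 2000")
for name, resource in list(catalog_list.items())[:5]:
    print(name, ":", resource.description)

# Retrieve the first table of one catalogue, keeping the row limit small.
Vizier.ROW_LIMIT = 10
tables = Vizier.get_catalogs("VII/118")  # example catalogue identifier
print(tables[0])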
16:00 - 17:00  Data Management and Data Analysis in CLARIN-D
Erhard Hinrichs (Uni Tübingen)
Room: Lecture Hall NTI
CLARIN (short for Common Language Resources and Technology Infrastructure) is part of the ESFRI roadmap for research infrastructures that operate on a European scale. CLARIN's objective is to provide a research infrastructure and associated data services for researchers in the humanities and social sciences working with language-related material. CLARIN operates under European law as a European Research Infrastructure Consortium (ERIC) with currently 15 member countries. The CLARIN research infrastructure is organized as a federation of data centers which are geographically distributed across Europe. The presentation will highlight the challenges that such a geographically distributed research infrastructure poses for data management, data access, and data analysis.