The Challenge of Big Data in Science (4th International LSDMA Symposium)

Lecture Hall NTI (Building 30.10, KIT (Campus South))

Description
Management of scientific Big Data poses challenges at all stages of the data life cycle: acquisition, ingest, access, replication, preservation, etc. For scientific communities, data exploration, commonly considered the fourth pillar of science alongside experiment, theory and simulation, is of utmost importance for gaining new scientific insights and knowledge.
At this symposium, organized by the cross-program initiative “Large Scale Data Management and Analysis” (LSDMA) of the German Helmholtz Alliance, international experts address several aspects of Big Data.
The symposium also provides a common platform for discussion and for identifying new perspectives. The participation fee of 50 EUR is payable in cash on arrival (an alternative payment method is available for KIT employees) and includes coffee, lunch and shuttle bus service.
For further information, please contact lsdma@scc.kit.edu.
    • 08:30
      Registration (Foyer of Lecture Hall NTI)

    • 1
      Welcome, Introduction

      Speakers: Prof. Achim Streit (KIT), Prof. Michael Decker (KIT)
      Slides
    • 2
      Data and software preservation for open science: connecting publications with cyberinfrastructure

      Many science domains are exploring mechanisms to preserve research data artifacts so that they can be reused in the future, consistent with scientific principles of reproducibility. For the computational sciences, research artifacts include not just data but also the software that produced the data. Without access to that software, it is difficult to establish proper scientific context for, and therefore judgement about, the resulting data. One possible mechanism for preserving and sharing data is to apply the principles of Linked Open Data. These principles enable data to be discovered, shared, understood and reused for scientific research using web standards for handling data, and they encourage publication of data under an open license. However, it has been observed that linked data without proper context is just more data in a different schema. This observation is particularly true in the sciences, where requirements such as provenance, quality, credit, attribution and methods are critical to the scientific process. For the computational sciences, establishing the context and provenance of data artifacts requires a connection to the software that produced those artifacts, as well as to the conceptual and mathematical models behind the algorithms instantiated in that software. In this talk, various US-based projects addressing scientific reproducibility will be introduced. A model of publishing software that connects data artifacts to the algorithms that produced them, using a Linked Open Data model, will then be presented in detail. This model extends work done by different scientific communities to share measurements in a standardized way that captures the provenance, methods, conditions and units of the measurement process, and it extends that conceptualization to include the model and algorithm that together constitute a “computational measurement”.
      Speaker: Jarek Nabrzyski (University of Notre Dame)
      Slides
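The abstract's central idea, connecting a data artifact to the software run and inputs that produced it, can be sketched as a handful of RDF triples. The Python snippet below emits them in N-Triples syntax; the predicates come from the W3C PROV-O vocabulary, while the example.org URIs, the dataset and the software release are all hypothetical, invented for illustration only.

```python
# Sketch: expressing "dataset 42 was produced by a run of simcode v1.2 that
# consumed dataset 41" as Linked Open Data. URIs are hypothetical; the
# predicates are from the W3C PROV-O vocabulary. Output is N-Triples,
# a simple line-based RDF syntax.

PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace for this sketch

def triple(s, p, o):
    """Format one RDF triple in N-Triples syntax (all terms are URIs here)."""
    return f"<{s}> <{p}> <{o}> ."

triples = [
    # The dataset is the output of a concrete software run (a PROV Activity).
    triple(EX + "dataset/42", PROV + "wasGeneratedBy", EX + "run/2015-11-30"),
    # The run was carried out by a specific, versioned software release.
    triple(EX + "run/2015-11-30", PROV + "wasAssociatedWith",
           EX + "software/simcode-v1.2"),
    # The run consumed input data, closing the provenance chain.
    triple(EX + "run/2015-11-30", PROV + "used", EX + "dataset/41"),
]

if __name__ == "__main__":
    print("\n".join(triples))
```

Because each statement is just a subject/predicate/object link, a consumer can follow the chain from any published dataset back to the exact software release, which is the "computational measurement" connection the talk describes.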
    • 10:30
      Coffee break (Foyer of Lecture Hall NTI)

    • 3
      Computational Requirements for Climate Science

      Speaker: Peter Braesicke (KIT)
      Slides
    • 4
      From seismic stations to integrated data centers and computational facilities

      GEOFON is a component of the Scientific Infrastructure at the German Research Centre for Geosciences (GFZ) [1]. GEOFON operates a global real-time seismic network and a seismological archive, and provides rapid global earthquake information. It is also part of the European Integrated Data Archive (EIDA) [2]. EIDA today is a distributed data archive spanning 10 institutions, with 5,000 stations and 300 TB of data; it processes more than 100,000 data requests per day and sends 100-300 GB of data per day when a major earthquake occurs. The EIDA system is also a fundamental service of EPOS Seismology within the European Plate Observing System [3]. One of the challenges for GEOFON/EIDA is not the amount of data to store but, despite our completely decentralized approach, making the federation look like a single data centre from the user's perspective, in particular while continuously integrating new data centres. For the near future, we plan to add two new capabilities: 1) reproducibility of datasets, i.e. the ability to reconstruct the data that were sent at a particular moment in time; with (meta)data changing frequently during their initial phase and a large number of requests, this is a challenge for our "Dynamic Data" developments; and 2) finding the best way to bring big data requests together with the processing tools at the computational facilities where the data will be processed. We are actively working in the context of the EUDAT2020 [4] project to provide these services to our community. [1] http://geofon.gfz-potsdam.de/ [2] http://www.orfeus-eu.org/eida/eida.html [3] http://www.epos-eu.org/ [4] http://eudat.eu/
      Speaker: Javier Quinteros (GFZ Potsdam)
      Slides
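The first planned capability, reconstructing exactly the data that were sent at a particular moment even though (meta)data keep changing, is essentially an "as-of" query over versioned records. A minimal Python sketch of that idea follows; the record identifier, timestamps and metadata fields are invented for illustration and this is not GEOFON's actual implementation.

```python
# Sketch of a "reproducible request": keep every revision of a record together
# with the time it became valid, and answer queries as of a past moment.
from bisect import bisect_right

class VersionedStore:
    """Append-only store answering 'what did record X look like at time T?'."""

    def __init__(self):
        self._times = {}     # record id -> sorted list of version timestamps
        self._payloads = {}  # record id -> payloads, parallel to _times

    def put(self, rec_id, valid_from, payload):
        # Assumes versions arrive in chronological order; nothing is deleted.
        self._times.setdefault(rec_id, []).append(valid_from)
        self._payloads.setdefault(rec_id, []).append(payload)

    def as_of(self, rec_id, when):
        """Return the payload current at `when`, or None if none existed yet."""
        i = bisect_right(self._times.get(rec_id, []), when)
        return self._payloads[rec_id][i - 1] if i > 0 else None

store = VersionedStore()
store.put("station/XX.TEST", 1, {"lat": 37.07, "gain": 1.0})  # initial metadata
store.put("station/XX.TEST", 5, {"lat": 37.07, "gain": 1.2})  # later correction
# A request replayed "as of" time 3 sees the original calibration (gain 1.0),
# while a request at time 7 sees the corrected value (gain 1.2).
old = store.as_of("station/XX.TEST", 3)
new = store.as_of("station/XX.TEST", 7)
```

Because old versions are never overwritten, any past response can be regenerated byte-for-byte, which is the property the "Dynamic Data" work requires.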
    • 5
      INDIGO-DataCloud

      The presentation will provide an overview of the goals of the INDIGO project, with an emphasis on its strategy for improving the usability of e-infrastructures. In this context, the work of the INDIGO consortium during the first months of the project on the requirements of research communities, and its strategy for supporting research data, will be discussed.
      Speaker: Isabel Campos (CSIC)
      Slides
    • 12:30
      Lunch break (Foyer of Lecture Hall NTI)

    • 6
      Data management challenges in Astronomy and Astroparticle Physics

      Astronomy is experiencing a deluge of data from the next generation of telescopes prioritised by the European Strategy Forum on Research Infrastructures (ESFRI) and from other world-class facilities. The new ASTERICS-H2020 project brings the scientific communities concerned in Europe together to work on common solutions to their Big Data, interoperability and data-access challenges. The presentation will highlight these new challenges in astronomy and the work being undertaken, also in cooperation with federated initiatives of major computing and data centres and with e-infrastructures in Europe.
      Speaker: Giovanni Lamanna (LAPP/IN2P3)
      Slides
    • 7
      Science SQL: Advancing from Data to Service Stewardship

      In today's science archives, data are typically managed separately from the metadata, and with different, restricted retrieval capabilities. While databases are good at metadata modelled as tables, XML hierarchies and RDF graphs, they traditionally do not support "the data" themselves, in particular multi-dimensional arrays. Consequently, file-based solutions let users "drown in data files" rather than presenting just a few datacubes for dissection and rejoining with other cubes. In the quest for improved service quality, the new paradigm is to allow users to "ask any question, any time", thereby enabling them to "build their own product on the go". This requires a new generation of services with new quality parameters, such as flexibility, ease of access, embedding into well-known user tools, and scalability mechanisms that remain completely transparent to users. In the field of massive spatio-temporal arrays this gap is being closed by Array Databases, pioneered by the scalable rasdaman ("raster data manager") array engine. Its declarative query language, rasql, extends SQL with array operators which are optimized and parallelized on the server side, including dynamic mash-up configuration. As of today, rasdaman is in operational use on hundreds of terabytes of satellite image timeseries datacubes, with transparent query distribution across more than 1,000 nodes. Its concepts have shaped international Big Data standards in the field, including the forthcoming array extension to ISO SQL and the Open Geospatial Consortium (OGC) Big Geo Data standards, with manifold take-up by both open-source and commercial systems. We show how array queries enable flexible data access and describe the rasdaman architecture with its optimization and parallelization techniques.
      Speaker: Peter Baumann (Jacobs University Bremen)
      Slides
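To make the "ask any question, any time" style concrete, here is a small query in the spirit of rasql; the collection name and index ranges are invented for illustration. The bracket operator trims a 3-D x/y/time datacube to one time slice and a spatial window, and the avg_cells condenser reduces that sub-array to a single mean value on the server, so only one number travels back to the client.

```sql
-- Illustrative rasql-style query (hypothetical collection "SatTimeseries",
-- assumed to be a 3-D x/y/time datacube of satellite imagery).
-- c[0:999, 0:999, 120] trims to a 1000x1000 window at time index 120;
-- avg_cells collapses that sub-array to its mean, computed server-side.
select avg_cells( c[0:999, 0:999, 120] )
from   SatTimeseries as c
```

The point of the design is that subsetting and aggregation happen next to the data, which is what keeps such services scalable over hundreds of terabytes.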
    • 15:00
      Coffee break (Foyer of Lecture Hall NTI)

    • 8
      How astronomy shares and reuses big and small data

      Astronomy is based on observations from ground- and space-based telescopes, and it has for many years been at the forefront of the world-wide sharing of scientific data, with open access (in general after a proprietary period), the definition of common formats, and the early networking of on-line services on the web, including observatory archives, added-value databases such as those developed by the Strasbourg astronomical data centre CDS, and academic journals. Since the turn of the century, the development of the Virtual Observatory, a framework of standards and interoperable tools, has allowed seamless access to the wealth of available on-line resources. The VO framework is open and inclusive, and all data producers, large agencies as well as teams in research labs, can provide their data within it. Services also host 'smaller' data, such as the CDS VizieR database for data attached to publications, so this 'long tail' of research data is made available, discoverable and usable just like the 'Big Data' produced by the large facilities and the large sky surveys. The talk will give an overview of the history and status of data sharing in the discipline. On-line resources are used by astronomers in their daily research work; the combination of data from different origins for multi-wavelength, multi-instrument studies is at the core of the scientific process, and the change of the research paradigm towards 'Open Science' is already largely accomplished.
      Speaker: Francoise Genova (Strasbourg Astronomical Data Centre CDS)
      Slides
    • 9
      Data Management and Data Analysis in CLARIN-D

      CLARIN (short for Common Language Resources and Technology Infrastructure) is part of the ESFRI roadmap for research infrastructures that operate on a European scale. CLARIN's objective is to provide a research infrastructure and associated data services for researchers in the humanities and social sciences working with language-related material. CLARIN operates under European law as a European Research Infrastructure Consortium (ERIC) with currently 15 member countries. The CLARIN research infrastructure is organized as a federation of data centers that are geographically distributed across Europe. The presentation will highlight the challenges that such a geographically distributed research infrastructure poses for data management, data access and data analysis.
      Speaker: Erhard Hinrichs (Uni Tübingen)