The Challenge of Big Data in Science (3rd International LSDMA Symposium)

Europe/Berlin
Lecture Hall NTI (Building 30.10, KIT Campus South)

Engesserstraße 5 Karlsruhe Germany
Achim Streit (KIT), Christopher Jung (KIT)
Description
Management of scientific Big Data poses challenges at all stages of the data life cycle: acquisition, ingest, access, replication, preservation, etc. For scientific communities, data exploration, commonly considered the fourth pillar besides experiment, theory and simulation, is of utmost importance for gaining new scientific insights and knowledge. At this symposium, organized by the portfolio extension “Large Scale Data Management and Analysis” (LSDMA) of the German Helmholtz Alliance, international experts address several aspects of Big Data. The symposium also provides a common platform for discussion, for identifying new perspectives and for learning more about LSDMA. The participation fee of 50 EUR is payable in cash on arrival (different payment methods are available for KIT employees) and includes coffee, lunch and shuttle bus service. For further information, please contact lsdma@scc.kit.edu
    • 08:30 09:00
      Registration 30m Entrance Hall NTI

    • 09:00 09:30
      Welcome, Introduction 30m Lecture Hall NTI

      Speakers: Prof. Achim Streit (KIT), Wilfried Juling (KIT)
    • 09:30 10:30
      An Open Source Big Data Ecosystem 1h Lecture Hall NTI

      Big Data challenges arise in several scientific domains, including astronomy with the Square Kilometre Array (SKA), climate science with the Intergovernmental Panel on Climate Change (IPCC), and intelligence projects with DARPA, including XDATA and Memex. These challenges span the Volume, Velocity and Variety mix: 700 TB/sec from the SKA; hundreds of thousands of files and formats in IPCC model to remote sensing data comparisons; and language translation and automatic file identification of 50+ thousand files in DARPA and other contexts. In response, I have proposed and published a Vision for Data Science in Nature that tackles these challenges through a combination of: (1) rapid science algorithm integration; (2) intelligent data movement; (3) automated and accurate extraction of text, metadata, and language from thousands of file formats; and (4) the promotion of open source software and communities to push this agenda forward. NASA, DARPA, NSF, and many other government agencies in the US are seeing the benefits and reaping the rewards of open source software products. From the traditional consumption model to groups learning how to produce open source and participate in community-oriented ecosystems, there is a large, emerging and fast-paced environment that connects government to industry, to academia and to the outside world. In this talk, I will describe an Open Source Big Data ecosystem at the Apache Software Foundation and elsewhere, including projects and progress towards implementing the Vision for Data Science.
      Speaker: Chris Mattmann (NASA/JPL, USC)
    • 10:30 11:00
      Coffee break 30m Lecture Hall NTI

    • 11:00 11:30
      The Open Science Imperative - Opportunities, Challenges and Limits 30m Lecture Hall NTI

      Over the last decade, for a variety of reasons, open access to research data has come to be seen as a guiding principle of handling research data, if not of good scientific practice. Summoning first principles, the Royal Society observed in 2012 that “Open inquiry is at the heart of the scientific enterprise”. And there are quite “practical” reasons as well: it was shown in 2014 that in a vast field of research up to one half of publications are not reproducible, a situation that could only be remedied by openly providing all manner of supporting evidence, including primary data and software codes. On a more positive note, there is also evidence that sharing data for reuse could double the number of publications based on it. But there are a number of real and perceived barriers to openness, and a lack of drivers. Many of the barriers derive from tightly interwoven reasons, and some may even be traced to an underdeveloped common understanding of terms and concepts or a hidden overload of meanings.
      Speaker: Hans Pfeiffenberger (AWI)
    • 11:30 12:00
      Smart Data Innovation Lab: Turning Big Data into Smart Data 30m Lecture Hall NTI

      At the Smart Data Innovation Lab in Karlsruhe, Germany, various partners from industry and academia are committed to revealing the secrets locked in Big Data to tackle society’s most demanding challenges. The SDIL provides cutting-edge research and analytical capabilities for large data sets from industry and public sources, and facilitates the flow of information across conventional borders of the various economic sectors to help build competitive advantages. At the Smart Data Innovation Lab, there will be a variety of opportunities to learn from the best: be it in collaborating with leading-edge big data institutions or as a member of the four data innovation communities (Industry 4.0, Energy, Smart Cities or Personalised Medicine), where we can jointly discuss relevant smart data use cases with academic institutions and industry partners. Designed in close collaboration between industry and research, the SDIL is operating at the KIT. The scientists have access to real company data stored securely on the platform within the framework of defined projects. The analysis, specification, and structuring of specific data sets and the detection of anomalies facilitate closer collaboration with the respective business partners, in turn making knowledge and technology transfers possible faster than ever before.
      Speaker: Laure Le Bars (SAP)
    • 12:00 12:30
      Opening Big Data; in large and small chunks 30m Lecture Hall NTI

      Sharing data openly involves much more than the challenges of access, universal identifiers, and so forth. Data must be prepared for reuse and preservation, and accompanied by the software and documentation needed to interpret and access the values and structures within. I will describe how we are approaching these challenges for two contrasting big data stores: first, a single-domain store of large data sets from CERN, and second, the multi-domain store of numerous small data sets from all sciences in OpenAIRE's Zenodo.
      Speaker: Tim Smith (CERN)
    • 12:30 14:00
      Lunch break 1h 30m Foyer of Lecture Hall NTI

    • 14:00 14:30
      Digital Curation Centre 30m Lecture Hall NTI

      The Digital Curation Centre (DCC) is the UK's national centre of expertise in digital preservation and curation, specialising in the management of data created by and used in research. DCC activities and services include: providing advice around best practice for digital preservation, data management and curation; helping universities to assess their existing institutional capability levels; providing generic and tailored training and guidance for different stakeholder groups; helping to develop and implement data-related policies; and support with data management planning. This presentation will provide an overview of the DCC's work with universities and other stakeholders, highlighting relevant activities and resources developed as part of our national role and via international partnerships and liaison.
      Speaker: Martin Donnelly (DCC)
    • 14:30 15:00
      Big Data Research at DKRZ 30m Lecture Hall NTI

      DKRZ runs one of the largest climate data archives. As of August 2014, the mass storage archive (HPSS) contains about 36 PB. This includes the long-term archive World Data Center for Climate (WDCC), with more than 4 PB of climate model reference data. The emphasis at DKRZ is on production, analysis, storage, curation and dissemination of climate model data and related observations. On the current HPC system at DKRZ (HLRE-2), observed annual growth rates are 8 PB for HPSS and 0.5 PB to 1.5 PB for WDCC. For the next generation of the HPC system (HLRE-3), estimated annual data growth rates are 75 PB for HPSS and 8 PB for WDCC. Important for HLRE-3 will be a seamless end-to-end workflow that covers all steps in the data life cycle. This means optimizing not only the services around data processing and data storage but also the parallelization of climate model code and the I/O processes. Optimizing the end-to-end workflow is key to making optimal use of existing HPC resources. The aim is not to produce pure numbers but to generate climate information.
      Speaker: Michael Lautenschlager
    • 15:00 15:30
      Coffee break 30m Lecture Hall NTI

    • 15:30 16:00
      Big Infrastructure to support Big Data in the Life Sciences 30m Lecture Hall NTI

      The life sciences are generating huge volumes of data of many different types from many different sources. Exploiting this deluge of data to generate information and knowledge is a challenge facing EMBL-EBI and the life science community as a whole. EMBL-EBI provides an infrastructure to store and analyse over 46 PB of data and is a leading node in the ELIXIR research infrastructure that is being established in Europe. The presentation will highlight these challenges and the work being undertaken at EMBL-EBI and in the broader ELIXIR collaboration.
      Speaker: Steven Newhouse (EBI)
    • 16:00 17:00
      The Data-enabled Revolution in Science and Society: A Need for National Data Services and Policy 1h Lecture Hall NTI

      Modern science is undergoing a profound transformation as it aims to tackle the complex problems of the 21st century. Across all domains, science is becoming highly collaborative, computational, and data-intensive, requiring new methods, new services, new approaches, and new policies to accelerate discovery through data sharing, publication, and repurposing. These issues are not only critical to supporting the increasing demand for interdisciplinary approaches to science; they also go to the heart of the reproducibility of science in a digital world. I will illustrate trends with examples from astrophysics, materials science, social science, and more, and describe a vision for national data services and policies that aim to address some of these needs.
      Speaker: Ed Seidel (NCSA)