Name: The Challenge of Big Data in Science (3rd International LSDMA Symposium)
Start: 2014-10-07T08:30:00+02:00
End: 2014-10-07T17:30:00+02:00
Location: Building 30.10, KIT Campus South

The Challenge of Big Data in Science (3rd International LSDMA Symposium)

Tuesday 7 October 2014 - 08:30


08:30 Registration
Registration
08:30 - 09:00
Room: Entrance Hall NTI
09:00 Welcome, Introduction - Achim Streit (KIT), Wilfried Juling (KIT)
Welcome, Introduction
- Achim Streit (KIT)
- Wilfried Juling (KIT)
09:00 - 09:30
Room: Lecture Hall NTI
09:30 An Open Source Big Data Ecosystem - Chris Mattmann (NASA/JPL, USC)
An Open Source Big Data Ecosystem
- Chris Mattmann (NASA/JPL, USC)
09:30 - 10:30
Room: Lecture Hall NTI
Big Data challenges arise in several scientific domains, including astronomy with the Square Kilometre Array (SKA), climate science with the Intergovernmental Panel on Climate Change (IPCC), and intelligence projects with DARPA, including XDATA and Memex. These challenges are of the Volume, Velocity and Variety mix: 700 TB/sec from the SKA; hundreds of thousands of files and formats in IPCC model to remote sensing data comparisons; and language translation and automatic file identification of 50+ thousand files in the DARPA and other contexts. To address these challenges I have proposed and published a Vision for Data Science in Nature that combines: (1) rapid science algorithm integration; (2) intelligent data movement; (3) automated and accurate extraction of text, metadata, and language from thousands of file formats; and (4) the promotion of open source software and communities to push this agenda forward. NASA, DARPA, NSF, and many other government agencies in the US are seeing the benefits and reaping the rewards of open source software products. From the traditional consumption model to groups learning how to produce open source and participate in community-oriented ecosystems, there is a large, emerging, and fast-paced environment that connects government to industry, to academia, and to the outside world. In this talk, I will describe an Open Source Big Data ecosystem at the Apache Software Foundation and elsewhere, including projects and progress towards implementing the Vision for Data Science.
10:30 Coffee break
Coffee break
10:30 - 11:00
Room: Lecture Hall NTI
11:00 The Open Science Imperative - Opportunities, Challenges and Limits - Hans Pfeiffenberger (AWI)
The Open Science Imperative - Opportunities, Challenges and Limits
- Hans Pfeiffenberger (AWI)
11:00 - 11:30
Room: Lecture Hall NTI
Over the last decade, for a variety of reasons, open access to research data has come to be seen as a guiding principle of handling research data, if not of good scientific practice. Summoning first principles, the Royal Society observed in 2012 that “Open inquiry is at the heart of the scientific enterprise”. And there are quite “practical” reasons as well: it was shown in 2014 that in a vast field of research up to one half of publications are not reproducible, a situation that could only be remedied by openly providing all manner of supporting evidence, including primary data and software code. On a more positive note, there is also evidence that sharing data for reuse could double the number of publications based on it. But there are a number of real and perceived barriers to openness, and a lack of drivers. Many of the barriers derive from tightly interwoven reasons, and some may even be traced to an underdeveloped common understanding of terms and concepts or a hidden overload of meanings.
11:30 Smart Data Innovation Lab: Turning Big Data into Smart Data - Laure Le Bars (SAP)
Smart Data Innovation Lab: Turning Big Data into Smart Data
- Laure Le Bars (SAP)
11:30 - 12:00
Room: Lecture Hall NTI
At the Smart Data Innovation Lab in Karlsruhe, Germany, various partners from industry and academia are committed to revealing the secrets locked in Big Data to tackle society’s most demanding challenges. The SDIL provides cutting-edge research and analytical capabilities for large data sets from industry and public sources, and facilitates the flow of information across conventional borders of the various economic sectors to help build competitive advantages. At the Smart Data Innovation Lab, there will be a variety of opportunities to learn from the best: be it in collaborating with leading-edge big data institutions or as a member of the four data innovation communities (Industry 4.0, Energy, Smart Cities or Personalised Medicine), where we can jointly discuss relevant smart data use cases with academic institutions and industry partners. Designed in close collaboration between industry and research, the SDIL is operated at KIT. Scientists have access to real company data stored securely on the platform within the framework of defined projects. The analysis, specification, and structuring of specific data sets and the detection of anomalies facilitate closer collaboration with the respective business partners, in turn making knowledge and technology transfers possible faster than ever before.
12:00 Opening Big Data; in large and small chunks - Tim Smith (CERN)
Opening Big Data; in large and small chunks
- Tim Smith (CERN)
12:00 - 12:30
Room: Lecture Hall NTI
Sharing data openly involves much more than the challenges of access, universal identifiers, and so forth. Data must be prepared for reuse and preservation, and accompanied by the necessary software and documentation to interpret and access the values and structures within. I will describe how we are approaching these challenges for two contrasting big data stores: firstly, a single-domain store of large data sets from CERN and, secondly, the multi-domain store of numerous small data sets from all sciences in OpenAIRE's Zenodo.
12:30 Lunch break
Lunch break
12:30 - 14:00
Room: Foyer of Lecture Hall NTI
14:00 Digital Curation Centre - Martin Donnelly (DCC)
Digital Curation Centre
- Martin Donnelly (DCC)
14:00 - 14:30
Room: Lecture Hall NTI
The Digital Curation Centre (DCC) is the UK's national centre of expertise in digital preservation and curation, specialising in the management of data created by and used in research. DCC activities and services include: providing advice around best practice for digital preservation, data management and curation; helping universities to assess their existing institutional capability levels; providing generic and tailored training and guidance for different stakeholder groups; helping to develop and implement data-related policies; and supporting data management planning. This presentation will provide an overview of the DCC's work with universities and other stakeholders, highlighting relevant activities and resources developed as part of our national role and via international partnerships and liaison.
14:30 Big Data Research at DKRZ - Michael Lautenschlager
Big Data Research at DKRZ
- Michael Lautenschlager
14:30 - 15:00
Room: Lecture Hall NTI
DKRZ runs one of the largest climate data archives. As of August 2014, the mass storage archive (HPSS) contains about 36 PB. This includes the long-term archive World Data Center for Climate (WDCC), with more than 4 PB of climate model reference data. The emphasis at DKRZ is on production, analysis, storage, curation and dissemination of climate model data and related observations. On the current HPC system at DKRZ (HLRE-2), annual growth rates of 8 PB for HPSS and 0.5 PB to 1.5 PB for WDCC are observed. For the next generation of the HPC system (HLRE-3), annual data growth rates are estimated at 75 PB for HPSS and 8 PB for WDCC. Important for HLRE-3 will be a seamless end-to-end workflow that covers all steps in the data life cycle. This means that not only services around data processing and data storage have to be optimized, but also the parallelization of climate model code and I/O processes. Optimizing the end-to-end workflow is key to making optimal use of existing HPC resources. The aim is not to produce pure numbers but to generate climate information.
15:00 Coffee break
Coffee break
15:00 - 15:30
Room: Lecture Hall NTI
15:30 Big Infrastructure to support Big Data in the Life Sciences - Steven Newhouse (EBI)
Big Infrastructure to support Big Data in the Life Sciences
- Steven Newhouse (EBI)
15:30 - 16:00
Room: Lecture Hall NTI
The life sciences are generating huge volumes of data of many different types from many different sources. Exploiting this deluge of data to generate information and knowledge is a challenge facing EMBL-EBI and the life science community as a whole. EMBL-EBI provides an infrastructure to store and analyse over 46 PB of data and is a leading node in the ELIXIR research infrastructure that is being established in Europe. The presentation will highlight these challenges and the work being undertaken at EMBL-EBI and in the broader ELIXIR collaboration.
16:00 The Data-enabled Revolution in Science and Society: A Need for National Data Services and Policy - Ed Seidel (NCSA)
The Data-enabled Revolution in Science and Society: A Need for National Data Services and Policy
- Ed Seidel (NCSA)
16:00 - 17:00
Room: Lecture Hall NTI
Modern science is undergoing a profound transformation as it aims to tackle the complex problems of the 21st century. Across all domains, science is becoming highly collaborative, computational, and data-intensive, requiring new methods, new services, new approaches, and new policies to accelerate discovery through data sharing, publication, and repurposing. These issues are not only critical to supporting the increasing demand for interdisciplinary approaches to science; they also go to the heart of the reproducibility of science in a digital world. I will illustrate trends with examples from astrophysics, materials science, social science, and more, and describe a vision for national data services and policies that aim to address some of these needs.