The Challenge of Big Data in Science - with a focus on Big Data Analytics (2nd International LSDMA Symposium)

Timezone: Europe/Berlin
Aula FTU, KIT
Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
Achim Streit (KIT), Christopher Jung (KIT)
Description
Management of scientific Big Data poses challenges at all stages of the data life cycle: acquisition, ingest, access, replication, preservation, and more. For scientific communities, data exploration – commonly considered the fourth pillar besides experiment, theory and simulation – is of utmost importance for gaining new scientific insights and knowledge. At this symposium, organized by the portfolio extension “Large Scale Data Management and Analysis” (LSDMA) of the German Helmholtz Alliance, international experts address topics in Big Data analysis and beyond. The opening keynote will be given by Beth Plale, Director of the Data To Insight Center at Indiana University; the closing keynote by Sayeed Choudhury, Director of the Digital Research and Curation Center at Johns Hopkins University. Leif Laaksonen, CSC, will provide a recent update on the Research Data Alliance (RDA). The symposium also provides a common platform for discussion, for identifying new perspectives, and for getting in contact with LSDMA.
    • 09:00–09:30
      Welcome, Introduction (30m, Aula FTU)
      Speakers: Prof. Achim Streit (KIT), Prof. Wilfried Juling (KIT)
      Slides
    • 09:30–10:30
      Big Data and Open Access: on track for collision of cosmic proportions? (1h, Aula FTU)
      Big Data is an opportunity in science, health, and social well-being to bring together information in ways that create new insight. Much of the opportunity lies in the growth of the predictive and real-time aspects of data, marked by an increase in the velocity of information due to growing numbers of real-time sources. As the value of data to social well-being grows, so do concerns about ensuring that data becomes an entrenched part of our scholarly record – for today, but more importantly for future generations of citizens and scientists who do not have a voice at the table today. Recent open-access directives across the world address this in part. We discuss these two exciting forces, Big Data and Open Access, which without due attention to their dual-sided nature will result in a collision of cosmic proportions. To give concreteness to solutions, both technological and organizational, we discuss technical developments in provenance, data preservation, and text analytics, drawn from the work of researchers in the Data To Insight Center at Indiana University and elsewhere, as well as policy-level developments, specifically the Research Data Alliance.
      Speaker: Prof. Beth Plale (Indiana University)
      Slides
    • 10:30–11:00
      Coffee break (incl. LSDMA poster session) (30m, Aula FTU)
    • 11:00–11:30
      Research Data Alliance as a platform for global research data interoperability (30m, Aula FTU)
      The Research Data Alliance (RDA) implements the technology, practice, and connections that make data work across barriers. The RDA aims to accelerate and facilitate research data sharing and exchange. Its work is primarily undertaken through working groups: community-based efforts spanning worldwide activities that address cross-domain challenges for the common good and seek answers to major scientific and societal challenges. The RDA tackles these key challenges through a bottom-up approach, with working groups focusing on short-term concrete goals. This presentation gives an update on the outcomes of the recent 2nd Plenary, held in Washington DC on 16–18 September 2013, and complements this with details of the supporting activities of RDA Europe, the European plug-in into the RDA.
      Speaker: Dr Leif Laaksonen (CSC)
      Slides
    • 11:30–12:00
      Subspace Cluster and Outlier Detection in Big Data (30m, Aula FTU)
      Outlier mining and clustering are important when analyzing big data. Outliers are objects that deviate strongly from the regular objects in their neighborhood; clusters, in turn, are sets of objects with very little deviation from each other. In many applications, outliers and clusters do not show up in the full space but only in subspaces. Identifying the subspaces likely to contain outliers is the open research issue our group is currently addressing. In this presentation, we present three subspace-search methods we have proposed recently and explain their importance for various scientific domains present at KIT.
      Speaker: Prof. Klemens Böhm (KIT)
      Slides
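The core idea of the abstract – an object can look perfectly ordinary in the full space yet be clearly outlying in a low-dimensional subspace – can be illustrated with a minimal sketch. The scoring method below (average k-nearest-neighbour distance) and all data are illustrative assumptions, not the subspace-search methods presented in the talk:

```python
import math
import random

random.seed(0)

def knn_score(points, i, dims, k=5):
    """Average distance from point i to its k nearest neighbours,
    measured only in the given subspace dimensions."""
    dists = sorted(
        math.dist([points[i][d] for d in dims], [points[j][d] for d in dims])
        for j in range(len(points)) if j != i
    )
    return sum(dists[:k]) / k

# 100 regular points: dims 0 and 1 are tightly correlated, dims 2-31 are noise.
points = []
for _ in range(100):
    x = random.random()
    points.append([x, x + random.gauss(0, 0.005)] + [random.random() for _ in range(30)])

# Planted outlier: unremarkable in every single dimension, but it breaks
# the correlation between dims 0 and 1 -- visible only in that subspace.
points.append([0.3, 0.7] + [random.random() for _ in range(30)])
idx = len(points) - 1

def outlier_rank(dims):
    """Rank of the planted point when sorting by score; 0 = most outlying."""
    scores = [knn_score(points, i, dims) for i in range(len(points))]
    return sorted(range(len(points)), key=lambda i: -scores[i]).index(idx)

print("rank in subspace {0,1}:", outlier_rank([0, 1]))
print("rank in full space:", outlier_rank(range(32)))
```

Single-dimension statistics reveal nothing here, since both coordinates of the planted point lie well inside [0, 1]; the two-dimensional subspace scan ranks it as the top outlier, while in the full 32-dimensional space the noise dimensions dilute the distances and the point typically disappears into the crowd.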
    • 12:00–12:30
      Big Data at Google and How to Process a Trillion Cells per Mouse Click (30m, Aula FTU)
      This talk gives a brief overview of two or three of the main tools and languages used at Google for web-scale data analysis. The main part then focuses on a column-oriented datastore developed as one of the central components of PowerDrill, an internal Google analysis project. PowerDrill achieves large speed-ups that enable a highly interactive web UI in which a single mouse click commonly leads to processing a trillion values in the underlying dataset.
      Speaker: Dr Alex Hall (Google)
      Slides
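Two techniques commonly used in column-oriented engines of this kind – dictionary-encoding each column and skipping whole chunks that cannot match a filter – can be sketched in a few lines. This is a toy illustration of the general idea, not PowerDrill's actual implementation:

```python
from collections import Counter

class Column:
    """Toy column-store chunk: distinct values live in a small dictionary,
    the data itself is an array of integer codes into that dictionary."""
    def __init__(self, values):
        self.dict = sorted(set(values))
        code = {v: i for i, v in enumerate(self.dict)}
        self.codes = [code[v] for v in values]

    def group_count(self):
        """GROUP BY this column, COUNT(*): a single pass over compact codes."""
        return {self.dict[c]: n for c, n in Counter(self.codes).items()}

    def count_eq(self, value):
        """COUNT(*) WHERE column = value. If the chunk's dictionary does not
        contain the value, the whole chunk is skipped without scanning."""
        if value not in self.dict:
            return 0
        c = self.dict.index(value)
        return sum(1 for x in self.codes if x == c)

country = Column(["de", "us", "de", "fr", "de", "us"])
print(country.group_count())   # {'de': 3, 'us': 2, 'fr': 1}
print(country.count_eq("de"))  # 3
print(country.count_eq("jp"))  # 0 -- chunk skipped entirely
```

At scale, each chunk holds its own small dictionary, so a selective filter lets the engine discard most chunks by a dictionary lookup alone, which is one reason interactive clicks can "touch" a trillion cells without scanning them all.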
    • 12:30–14:00
      Lunch break (1h 30m, Canteen, KIT)
    • 14:00–14:30
      Dead or Alive. Access to Research Data in Trustworthy Digital Archives (30m, Aula FTU)
      The massive increase of data production in all scientific fields poses the challenge of how to keep data accessible over time. This challenge is of course not new: some disciplines have about half a century of experience in selecting, documenting, managing, and preserving research data for re-use. There appear to be both differences and similarities across disciplines in the problems, solutions, and priorities of keeping research data alive for the long term. This talk will present a few examples based on data-sharing practices, mainly from the humanities and social sciences. The added value of data archives has been demonstrated time and again, though choices have to be made about what to preserve: the costs of digital preservation must be balanced against the benefits. Data re-use undoubtedly leads to new research and publications; sharing data is also a deterrent against sloppy data practices and outright data fraud. Sharing data puts demands on the data creator, the data repository, and the data user, for which quality guarantees and guidelines have been developed.
      Speaker: Dr Peter Doorn (DANS)
      Slides
    • 14:30–15:00
      Big Data from Crowdsourcing in the Humanities: Potentials and Challenges (30m, Aula FTU)
      Mobile devices, which enable easy interaction and provide advanced sensors plus audio and video recording for virtually everybody, have great potential to revolutionize science – in particular, they will raise crowd-sourcing to a new level, providing “big” data streams for the humanities. App-based crowd-sourcing has a potentially transforming effect on several phases of the research process:
      * Data creation, elicitation, and collection: many more subjects can easily be actively involved, overcoming shortcomings such as unbalanced or lacking participation in questionnaires.
      * Data storing, pre-processing, and management: advanced back-end infrastructures have to be built to receive and distribute data, pre-process it (assess, select, curate), and organize it for later archiving and analysis, dealing with data magnitudes bigger than what we handle currently.
      * Data processing (compilation, enrichment, annotation): the annotation bottleneck can be overcome by massive contributions from the crowd or from citizen-specialists.
      * Dissemination of results and evaluation: data sets can be made easily accessible, accompanying publications or as published results in their own right, with visualization app technology.
      Several technological and societal challenges have to be addressed to make the most of the potential of big data from crowd-sourcing:
      * How to engage the users: making them aware of the app, and making them engage in using it.
      * How to receive, manage, pre-process, and store the data sustainably: setting up an appropriate central back-end infrastructure; adding metadata; automatic pre-processing; storing large amounts of dynamic data; distributing stimuli and resulting data sets.
      * Provenance information and quality assessment: it is crucial to have data quality assessment and information on who the user was; when, where, and in which context the data were generated; on the basis of which stimulus; etc.
      * How to curate the data: a strategy for data curation has to be included in the plans for an app infrastructure and workflow right from the start.
      * Privacy, intellectual property, authorship, access restrictions: protecting the privacy of contributors and curators and giving appropriate credit for contributions.
      * Life cycle, policy-based handling, and de-commitment: dealing with data sets in a systematic manner requires policy-based automated treatment, possibly including de-commitment.
      Generally, the successful employment of crowd-sourcing, and in particular of apps for mobile devices, depends on much more than a well-designed and well-programmed app. This implies longer development times and costs that have to be taken into consideration. In particular, the back-end infrastructure needs careful planning and installation in order to deal adequately with the incoming data streams and sets. This seems to be a task for data centres that already have some experience with handling complex and large sets of digital data.
      Speaker: Dr Sebastian Drude (MPI for Psycholinguistics)
      Slides
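The provenance requirements listed in the abstract (who contributed, when and where the data were generated, on the basis of which stimulus, plus quality assessment) suggest what a single crowd-sourced record needs to carry. The sketch below is a hypothetical record shape; all field names and values are illustrative assumptions, not taken from any existing LSDMA or MPI system:

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class CrowdObservation:
    """One crowd-sourced observation with its provenance metadata."""
    contributor_id: str   # who (pseudonymous, to protect privacy)
    collected_at: str     # when (ISO 8601, UTC)
    location: tuple       # where (latitude, longitude)
    stimulus_id: str      # which prompt/stimulus produced the data
    payload_uri: str      # pointer to the audio/video/text itself
    quality_flags: list = field(default_factory=list)  # assessment results

obs = CrowdObservation(
    contributor_id="anon-4711",
    collected_at=datetime.now(timezone.utc).isoformat(),
    location=(49.09, 8.43),
    stimulus_id="picture-task-17",
    payload_uri="s3://example-bucket/obs/4711.wav",
    quality_flags=["auto-checked"],
)
print(asdict(obs))
```

Recording these fields at ingest time, rather than reconstructing them later, is what makes the downstream curation, quality assessment, and policy-based life-cycle handling described above feasible.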
    • 15:00–15:30
      Coffee break (incl. LSDMA poster session) (30m, Aula FTU)
    • 15:30–16:30
      Data Curation Lessons Learned from Johns Hopkins University Data Management Services (1h, Aula FTU)
      The launch of Johns Hopkins University Data Management Services (JHUDMS) represented the culmination of over a decade of research and development, prototyping, needs assessment, capacity building and sustainability planning. Beginning with the archiving and preservation of the Sloan Digital Sky Survey (SDSS) and advancing through the work of the Data Conservancy, the Sheridan Libraries at Johns Hopkins University (JHU) developed substantial expertise and initial infrastructure for data management that contributed directly to the development of the JHUDMS. This presentation will describe lessons learned and implications for Big Data analytics from the experiences at JHU.
      Speaker: Prof. Sayeed Choudhury (Johns Hopkins University)
      Slides
    • 16:30–17:00
      Panel Discussion with Speakers (30m, Aula FTU)
      Speaker: Prof. Achim Streit (KIT)