DIAS: Data Integration, Analytics and Services in ProteomicsDB
The architectural and molecular complexity of living systems is astounding. To maintain normal life, the actions of thousands of genes, proteins, metabolites and other biomolecules need to be tightly coordinated in space and time. It is now well accepted that systems-level measurements are necessary to understand biological phenomena comprehensively. As proteins execute and control most biological processes and are the targets for most therapeutic drugs, there is enormous interest and potential value in proteome-level research. The emergence of Omics technologies is a direct consequence of the desire to study - ideally all - biomolecules in a molecular, cellular or organismal context, turning modern life science research into data science.
However, this also presents substantial challenges. It may seem paradoxical, but the ease of generating large-scale data impairs the ability to understand it, since too many potential hypotheses are generated with each new experiment. As a result, only a very small portion of the data is used for analysis, and the data is eventually lost when a scientist continues their research journey. This unsatisfactory situation arises because of a shortage of computational and experimental tools that can systematically use and re-use acquired data and translate it into meaningful information. Such tools would support scientists in making informed decisions about the most appropriate next steps in a research program to create actual knowledge and, eventually, translate it into some form of utility. Furthermore, powerful computational systems and algorithms are required not only for data collection, but also for data analytics and for integration with other types of data, such as molecular or phenotypic data.
Therefore, there is a strong need to extract more knowledge from such high-throughput data than is currently the practice, and to make the existing and growing mass of data more usable, insightful and re-usable for developing targeted applications with high utility. Modern machine learning approaches, including deep learning, and performant database systems offer great opportunities for meeting these needs. Most efforts in proteomics informatics have so far focused on processing the raw mass spectrometric data files into lists of protein identifications and quantitative values. Consequently, virtually no tools exist in the public domain that go to the next step and systematically integrate and analyze a large number of very large experiments.
To address these challenges, ProteomicsDB was developed in a previous project. It is an in-memory database built on enterprise-grade technology (SAP HANA) and hardware (IBM Power). The overarching aims of this project are to substantially extend the capabilities of ProteomicsDB, to integrate it with other important resources in the field and to develop several computational tools that benefit the scientific community and the related industry, transforming ProteomicsDB into a Proteomics-as-a-Service platform. These software tools and data services should help (non-expert) biologists and (expert) bioinformaticians alike to use and re-use their own and public high-throughput data. They should also help to integrate and analyze this data, delivering a manageable number of sound hypotheses that can then be tested by experiments. The specific aims are:
- to provide FAIR access to research data, data analytics and data services in ProteomicsDB to the world-wide scientific community.
- to demonstrate that the integrated analysis of heterogeneous data modalities leads to new models providing insight into complex biological relationships, by implementing efficient algorithms and visualizations in ProteomicsDB, exemplified by protein expression, drug:protein interaction, protein structure and phenotypic drug sensitivity data.
- to demonstrate that molecular, drug and phenotypic data and data services in ProteomicsDB can be translated into clinical decision making, exemplified by the molecular tumor board of the Comprehensive Cancer Center Munich.
The project is split into the following seven work packages (WPs):
- WP1: Service Engineering in Life Science Research
- WP2: Data Access, Integration and Exchange
- WP3: Public Data Analytics Services in ProteomicsDB
- WP4: Data Services for Mass Spectrometry Based Proteomics
- WP5: Integration & Analytics of Drug Effect Data
- WP6: Integration & Analytics of Protein Structure Data
- WP7: Supporting Decision Making in Molecular Tumor Boards.
Each WP itself contains further tasks. In addition, the WPs have interdependencies with each other, resulting in multi-disciplinary project teams consisting of different stakeholders and project partners for each WP. The chair for information systems is responsible for WP1 (Service Engineering in Life Science Research), which lays the foundation for all other WPs (WP2 - WP7). This WP is divided into four tasks. Furthermore, the chair is involved in WP2, WP3, WP4 and WP7 and actively supports and contributes to each of them. In the following, WP1 is explained in detail.
Data and tools are key to advances in life science research, particularly if one aims to study biological processes on a system level. This requires researchers to be able to find, retrieve and reuse data, tools and services. However, bringing together relevant data, usable tools and meaningful services is already a big challenge. By applying a Design Science Research approach, the following three goals should be reached as an outcome of each cycle and its phases:
- Establish guidelines on data integration strategies and service engineering in life science research, focusing on the implementation and execution of the FAIR guidelines.
- Define the requirements for an “Analytics Platform” and implement it.
- Evaluate economic models for providing computational tools and infrastructure to support analytical services.
To reach these goals, the chair will work on the following tasks:
Task 1.1: Implementation of FAIR Principles in ProteomicsDB
Task 1.2: ProteomicsDB as an “Analytics Platform”
Task 1.3: Service Science and Engineering for ProteomicsDB
Task 1.4: Economic Sustainability Models for Open Source and FAIR Resources
The new Proteomics-as-a-Service platform, based on ProteomicsDB, will cater to the experimental life sciences community by providing a unique vantage point from which to view and merge relevant protein-centric computational approaches. The value added by extending ProteomicsDB will include:
- placing results of individual experiments into context to be able to ask much larger scientific questions than currently possible
- integrating other molecular, structural and drug phenotypic information leading to new insights
- bringing proteomic data, tools and results to (non-expert) parts of the community that usually would not have access to such information and capabilities, enabling new collaborations and innovations in therapies, medication or other fields in healthcare
- bringing together hitherto often independently operating scientific communities, in this case proteomics, drug discovery and structural biology
- enabling systematic data re-use and analytics across large experiment sets, opening up completely new possibilities for studying and understanding proteins in the context of biological systems
- generating information and data analysis concepts that can be commercialized either directly or indirectly.
This platform will not only contribute to society by fostering collaboration and expediting progress in (proteomics) research, but will also offer a controlled research environment in which long-term effects can be monitored and studied from a business information systems perspective.
The interdisciplinary DIAS cluster is funded by the German Federal Ministry of Education and Research (BMBF).