Data Science

Extracting the best of data

An estimated trillion bacterial species populate the Earth, the vast majority of which remain undiscovered, uncultivated or unexplored. They hold a large reservoir of metabolites with potentially valuable biological activities, which appeared out of reach until a few years ago. Recent scientific and technological advances have changed the game and are starting to unlock the best-kept secrets of biodiversity in terms of bioactive compounds. An essential part of this revolution has been the ability to handle and analyse large and complex data sets. High-throughput sequencing, genomics and metabolomics are just some of the rapidly evolving fields that could never have developed without the ability to generate and process these data. The data science hub has been essential in the development of DEINOVE’s R&D platform. In constant collaboration with all the technology units, the team analyses the large amounts of data generated at every step of the process and develops custom-made tools to deepen this analysis while ensuring data management and traceability across the entire platform.

Core activities

The activities described in this section correspond to tools developed by the data science unit to ease data management and analysis throughout DEINOVE’s R&D platform.

LIMS: data management and traceability

DEINOVE’s R&D platform uses automation and high-throughput methods at every step of the discovery process, from biodiversity exploration to pre-industrial production of a lead compound. Each of these steps generates tremendous amounts of data, which need to be precisely documented and archived for future reference. The laboratory information management system (LIMS) managed and operated by the data science hub is the result of a continuous effort to capture and structure all the data generated. The tool is under constant improvement to adapt to the evolving needs of each technology unit.

SLiMe: predicting the metabolites produced by a bacterial species

Dereplication aims to identify the chemical entity responsible for a given biological activity. While part of this process is achieved by the advanced analytics hub through analytical separation and detection of the metabolites present in a bacterial extract, it greatly benefits from an integrated analysis of the genomic sequence of the bacterial species in which the activity was detected. To perform such analyses, the data science hub has developed SLiMe (Species Links to Molecules), a tool that predicts the metabolites produced by a given bacterial species by combining prior knowledge on natural products and on bacterial genomes. No single data source currently presents all publicly available information on natural antimicrobial compounds. The data science unit has therefore built an in-house database by aggregating and restructuring data on bacterial ecology, taxonomy, genomics and metabolomics. This tool will ultimately accelerate dereplication of antimicrobial agents as well as structure elucidation and medicinal chemistry efforts.
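The idea of linking a species to candidate molecules can be illustrated with a minimal sketch. It is not SLiMe itself: it assumes, purely for illustration, that each sequenced strain has a list of detected biosynthetic gene cluster families and that prior knowledge maps some of those families to known natural products. All names and records below are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical toy records, for illustration only (not real data).
# Gene cluster families assumed to have been detected in each genome.
GENOME_CLUSTERS = {
    "strain_A": {"bgc_family_1", "bgc_family_2"},
    "strain_B": {"bgc_family_2"},
}
# Taxonomic assignment of each strain.
SPECIES_OF = {"strain_A": "species_X", "strain_B": "species_Y"}
# Prior knowledge linking a cluster family to a known natural product.
CLUSTER_PRODUCTS = {
    "bgc_family_1": "compound_alpha",
    "bgc_family_2": "compound_beta",
}


def predict_metabolites(species):
    """Return {metabolite: sorted list of supporting strains} for one species."""
    support = defaultdict(set)
    for strain, families in GENOME_CLUSTERS.items():
        if SPECIES_OF.get(strain) != species:
            continue
        for family in families:
            product = CLUSTER_PRODUCTS.get(family)
            if product is not None:  # families with no known product are skipped
                support[product].add(strain)
    return {product: sorted(strains) for product, strains in support.items()}


print(predict_metabolites("species_X"))
```

Keeping the supporting strains alongside each predicted metabolite preserves the link back to the genomic evidence, which is what makes such a prediction usable during dereplication.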

Support activities


The data science hub analyses high-throughput sequencing data obtained from the 16S ribosomal RNA genes of the bacterial mixture present in an environmental sample. This analysis is performed in collaboration with the biodiversity farming unit and aims to identify the bacterial species present in the sample. After processing with dedicated in-house tools, this information can be used to guide the biodiversity farming unit in isolating the most interesting bacteria from the environment.

Under certain circumstances, the barcoding genes can be extended to other molecular markers conserved and shared across various taxonomic groups, such as the 18S rRNA gene or the internal transcribed spacer (ITS) regions of the rRNA operon.
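As a rough illustration of how such a taxonomic profile could guide isolation, the sketch below assumes that each sequencing read has already been assigned to a taxon by an upstream classifier. It then computes relative abundances and shortlists taxa that are not yet in a strain collection. The function names, the `min_share` threshold and the shortlisting rule are assumptions made for this example, not a description of the in-house tools.

```python
from collections import Counter


def relative_abundance(assignments):
    """Map each taxon to its share of the assigned reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def shortlist(assignments, already_isolated, min_share=0.01):
    """Taxa at or above min_share that are not yet isolated, most abundant first."""
    profile = relative_abundance(assignments)
    candidates = [
        (taxon, share)
        for taxon, share in profile.items()
        if share >= min_share and taxon not in already_isolated
    ]
    # Sort by decreasing abundance, then by name for a stable order.
    return sorted(candidates, key=lambda item: (-item[1], item[0]))


# Hypothetical per-read assignments for one environmental sample.
reads = ["genus_A"] * 6 + ["genus_B"] * 3 + ["genus_C"]
print(shortlist(reads, already_isolated={"genus_A"}, min_share=0.2))
```

The same profile could be built from 18S or ITS reads, since the functions only depend on the per-read taxon labels.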


DEINOVE systematically performs whole-genome sequencing of the most interesting strains. Genomic sequences reveal a great deal of information about the studied bacteria. For example, the data science unit annotates these genomes to map the metabolic capabilities of the strains. The sequences can also be “mined” to find gene clusters more or less buried inside the bacterial genome. In collaboration with the synthetic biology unit, which can disrupt target genes, it is possible, for example, to determine which genes or gene clusters are responsible for, or involved in, the production of a given active compound.

Data analytics

Advanced data analytics is required to cope with the large amount and throughput of data generated by the discovery platform. For example, in collaboration with the activity testing unit, the data science hub has developed in-house image processing tools to process high-content biological activity screens.

More generally, the data science hub provides continuous biostatistics support for data analysis across DEINOVE’s discovery platform.


In collaboration with the advanced analytics hub, the data science unit analyses data obtained from metabolomic analyses to decipher the metabolic pathways that lead to or affect the biosynthesis of a specific compound in a bacterial strain.

