Data Science

Extracting the best of data

An estimated trillion bacterial species populate the Earth, the vast majority of which remain undiscovered, uncultivated or unexplored. They hold a large reservoir of metabolites with potentially interesting biological activities, which seemed out of reach until a few years ago. Recent scientific and technological advances have changed the game and are starting to unlock the best-kept secrets of biodiversity in terms of bioactive compounds. An essential part of this revolution has been the ability to handle and analyze large and complex data sets. High-throughput sequencing, genomics and metabolomics are just some of the rapidly evolving fields that could not have developed without the ability to process big data. The data science hub has been essential in the development of DEINOVE’s R&D platform. In constant collaboration with all the technology units, the team analyzes the large amounts of data generated at every step of the process and develops custom-made tools to deepen this analysis while ensuring data management and traceability across the entire platform.

Core activities

The activities described in this section correspond to tools developed by the data science unit to ease data management and analysis throughout DEINOVE’s R&D platform.

LIMS: data management and traceability

DEINOVE’s R&D platform uses automation and high-throughput methods at every step of the discovery process, from biodiversity exploration to pre-industrial production of a lead compound. Each of these steps generates large amounts of data that must be precisely documented and archived for future reference. The laboratory information management system (LIMS) developed by the data science hub is the result of a continuous effort to capture and structure all of this data. The tool is under constant improvement to adapt to the evolving needs of each technology unit.

SLiMe: predicting the metabolites produced by a bacterial species

Dereplication aims to identify the chemical entity responsible for a given biological activity. While part of this process is achieved by the advanced analytics hub through analytical separation and detection of the metabolites present in a bacterial extract, it greatly benefits from an integrated analysis of the genomic sequence of the bacterial species in which the activity was detected. To perform such analyses, the data science hub has developed SLiMe (Species Links to Molecules), a tool that predicts the metabolites produced by a given bacterial species by combining prior knowledge of natural products and of bacterial genomes.
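The idea of combining the two kinds of knowledge can be illustrated with a minimal sketch. It is not SLiMe's actual method, which is not described here. It assumes, purely for illustration, that natural product knowledge is available as a mapping from gene cluster families to reported metabolites, and that genome knowledge is a mapping from species to the cluster families found in their genomes. All names and data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical natural product knowledge:
# gene cluster family -> metabolites reported for that family.
CLUSTER_TO_METABOLITES = {
    "cluster_family_A": {"metabolite_1", "metabolite_2"},
    "cluster_family_B": {"metabolite_3"},
}

# Hypothetical genome knowledge:
# species -> gene cluster families annotated in its genome.
SPECIES_TO_CLUSTERS = {
    "species_X": {"cluster_family_A", "cluster_family_B"},
    "species_Y": {"cluster_family_B", "cluster_family_C"},
}


def predict_metabolites(species, species_to_clusters, cluster_to_metabolites):
    """Return {metabolite: sorted list of supporting cluster families}.

    A metabolite is predicted when at least one cluster family found in the
    species' genome is linked to it in the knowledge base. Unknown species
    and cluster families without known products yield no prediction.
    """
    support = defaultdict(set)
    for cluster in species_to_clusters.get(species, ()):
        for metabolite in cluster_to_metabolites.get(cluster, ()):
            support[metabolite].add(cluster)
    return {m: sorted(clusters) for m, clusters in sorted(support.items())}


if __name__ == "__main__":
    print(predict_metabolites("species_Y", SPECIES_TO_CLUSTERS, CLUSTER_TO_METABOLITES))
```

Keeping the supporting cluster families alongside each predicted metabolite makes a prediction traceable to the genomic evidence behind it, which is useful when comparing predictions with the metabolites actually detected in an extract.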

Bankiise: a natural product knowledge management platform

To date, no single database presents all the information that is publicly available on antimicrobial natural compounds. The data science unit has undertaken the ambitious task of gathering heterogeneous knowledge scattered across diverse databases and publications in a single knowledge management platform named Bankiise. By aggregating and restructuring data on bacterial ecology, taxonomy, genomics, and metabolomics, this tool will ultimately accelerate dereplication of antimicrobial agents.
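As a rough illustration of what aggregating heterogeneous sources involves, the sketch below merges records describing the same compound from several sources and keeps track of where each record came from. It is an assumption-laden toy, not a description of Bankiise: the source names, the field names and the name-based matching rule are all hypothetical.

```python
from collections import defaultdict


def normalize_name(name):
    """Normalize a compound name for matching: lowercase, collapsed whitespace."""
    return " ".join(name.lower().split())


def merge_records(sources):
    """Merge compound records coming from several sources.

    sources: {source_name: [record, ...]}, where each record is a dict with a
    'compound' key plus any other fields (values must be hashable).
    Returns {normalized compound name: {"fields": {field: set of values},
                                        "sources": set of source names}}.
    Conflicting values are kept side by side rather than overwritten.
    """
    merged = defaultdict(lambda: {"fields": {}, "sources": set()})
    for source_name, records in sources.items():
        for record in records:
            entry = merged[normalize_name(record["compound"])]
            entry["sources"].add(source_name)
            for field, value in record.items():
                if field != "compound":
                    entry["fields"].setdefault(field, set()).add(value)
    return dict(merged)


if __name__ == "__main__":
    # Hypothetical records from two hypothetical sources.
    example = {
        "database_a": [{"compound": "Compound  One", "producer": "species_X"}],
        "publication_b": [{"compound": "compound one", "target": "cell wall"}],
    }
    print(merge_records(example))
```

Matching on a normalized name is the simplest possible rule. A real platform would more likely rely on structural identifiers, since the same compound often appears under several names.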

Support activities


The data science hub analyzes high-throughput sequencing data obtained from the 16S ribosomal RNA (rRNA) genes of the bacterial mixture present in an environmental sample. This analysis is performed in collaboration with the biodiversity farming unit and aims to identify the bacterial species present in the sample. When a previously unknown species is found, the sequencing data is used for its taxonomic classification.


At least two activities require genomic analysis by the data science unit. (1) In collaboration with the synthetic biology hub, the data science hub performs genome analysis to determine which genes or gene clusters are responsible for, or involved in, the production of a given compound. (2) Bacteria of particular interest undergo whole genome sequencing. The data science hub analyzes the resulting sequencing data and annotates the genome to address specific needs (species identification, gene cluster identification).


To better characterize the taxonomic affiliations of microbial communities in an environmental sample, the data science team is expanding the analysis of 16S ribosomal RNA (rRNA) gene sequences to other molecular markers that are conserved and shared across various taxonomic groups, such as the 18S rRNA gene or the internal transcribed spacer (ITS) regions of the rRNA genes. With these additional markers, the analysis of operational taxonomic units (OTUs) yields taxonomic groups that are easier to identify.
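For readers unfamiliar with operational taxonomic units, the sketch below shows one common way of forming them: greedy clustering of marker gene reads around centroid sequences at a fixed identity threshold. It is a simplified illustration, not the pipeline used by the team. It assumes reads that are already aligned to the same length and sorted by decreasing abundance, and it uses the conventional 97% threshold as a default. The function names are hypothetical.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads, threshold=0.97):
    """Group reads into OTUs by greedy centroid clustering.

    reads: list of (read_id, sequence), assumed sorted by decreasing abundance.
    Each read joins the first OTU whose centroid it matches at or above the
    threshold; otherwise it becomes the centroid of a new OTU.
    Returns a list of (centroid_sequence, [member read ids]).
    """
    otus = []
    for read_id, seq in reads:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(read_id)
                break
        else:
            # No existing centroid is close enough: start a new OTU.
            otus.append((seq, [read_id]))
    return otus


if __name__ == "__main__":
    # Toy reads, far shorter than real marker gene sequences.
    toy_reads = [
        ("read_1", "ACGTACGTAC"),
        ("read_2", "ACGTACGTAT"),  # one mismatch with read_1
        ("read_3", "TTTTTTTTTT"),
    ]
    for centroid, members in greedy_otu_clustering(toy_reads, threshold=0.9):
        print(centroid, members)
```

The same procedure applies to any marker, whether 16S, 18S or ITS. Only the reference data used to assign each centroid to a taxonomic group changes.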

Screening data analysis

In collaboration with the activity testing unit, the data science hub analyzes data from high-throughput and high-content biological activity screens.


In collaboration with the advanced analytics hub, the data science unit analyzes data obtained from metabolomic analyses to decipher the metabolic pathways that lead to or affect the biosynthesis of a specific compound in a bacterial strain.


The data science hub provides continuous biostatistics support for data analysis across DEINOVE’s discovery platform.

