MIP, my proposal for a high-performance analysis pipeline for whole exome sequencing

Saturday, August 23, 2014

Hi all,

It has been a while since my last post, but the reason is worth the long time absent. Since January I am co-leading the bioinformatics and I.T department of one of Genomika Diagnósticos

Genomika is one of most advanced clinical genetics laboratory in Brazil. Located in Recife, Pernambuco, in the Northeast of Brazil, it provides cutting edge molecular testing in cancer samples to better define treatment options and prognosis, making personalized cancer management a reality. It also has a vast menu of tests to evaluate inherited diseases, including cancer susceptibility syndromes and  rare disorders. Equipped with state-of-the-art next-generation sequencing instruments and a world-class team of specialists in the field of genetic testing, Genomika focus on test methods that improve patient care and have immediate impact on management.  There is available a pitch video about our lab and one of our exams (unfortunately the spoken language is portuguese).

Our video about sequencing exams spoken in portuguese

My daily work is to provide tools; infra-structure and systems to support our clients and teams in the lab. One of major teams is the molecular biology sector. It is responsible for the DNA sequencing exams, which includes targeted-panels, specific genes or exons or whole exome.  Each one of those genetic tests, before delivered to the patient and the doctor,  goes under several data pre-processing and analysis steps organised in a ordered set of sequential steps, which we call a pipeline.

There is a customised pipeline for clinical sequencing; where we bioinformaticians and specialists study the genetic basis of human phenotypes. In our lab pipeline we are interested on selecting and capturing the protein-coding portion of the genome (we call the exome).  This region responsible for only 3% of our human DNA can be used to elucidate the genetic causes of many human diseases, starting from single gene disorders and moving on more complex genetic disorders, including complex traits and cancer.

Clinical Sequencing Pipeline overview

For this task,  we use several tools that must handle with large volumes of data, specially because of the new next-generation DNA sequencing machines (yeah we have one at our lab from Illumina). Those machines are capable of producing in shorter times and lower costs  large amount of NGS data. 

Taking those challenges into account, we perform our sequencing, alignment, detection and data analysis of human samples in order to seek variants.  This study we call variant analysis. Variant analysis looks for variant information, that is, possible mutations that may be associated to genetic diseases . Let's consider as examples of mutation or variant as follows: a change of nucleotide (A for T) (single nucleotide variant or SNV) or even a small insertion or deletion (INDEL's) that can impact the functional activity of the protein.  Looking after variants and even further seek and identify those related to diseases or genetic disorders is a big challenge in terms of technology, tools and interpretation.

The reference in the genome at bottom; the variants above.
In this example there's a possible exchange of A to G (SNV) in a specific position of the genome.

In our lab we are developing a streamlined, highly automated pipeline for exome and targeted panel regions data analysis. In our pipeline we handle multiple datasets and state of the art tools that are integrated in a custom pipeline for generating, annotating and analyzing sequence variants.

We named our internal pipeline tool as MIP (Mutation Identification pipeline). Some minimal requirements we stablished for MIP in order to use it with maximum performance and productivity. 

1. It must be automatic;  with a limited team like ours (2 or 3 bioinformaticians) we need a efficient service that is capable to execute the complete analysis without typing commands at terminals calling software or converting files among several data formats.

MIP pipeline overview for clinical sequencing. All those steps requires tools and files in a specific format.
Our engine   would be capable of manage and execute all or some of those steps with
specific parameters defined by the specialist .

2. It must be user-oriented; it mean that MIP platform must provide an easy-to-use interface, that any researcher of lab could use the system and start out-of-box their sequencing analysis.  For biologists and geneticists it would allow them to focus their work on what matters: the downstream experiments.

3. Scalable-out architecture;  More and more hight throughput sequencing data is pulled out from NGS instruments, so MIP must be designed to be a building block for a scalable genomics infrastructure. It means that we must work with distributed and parallel approaches and the best practices from high-performance computing and big data to efficient take advantage of all resources available at our infra-structure while thinking on continuous optimization in order to minimize the network and shared disk I/O footprint.

My draft proposal to our exome sequencing pipeline

4. Rich-detailed reports and smart software and dataset updates;  In order to maintain our execution engine working healthy, it requires that our software stack always being updated. Since our engine is written on top of numerous open-source biological and big data packages, we need a self-contained management system that could not only check for any new versions but also with a few clicks start any update and perform a post-check for any possible corruptions at the pipeline.  In addition to the third-party genomics software used on MIP, we are also developing our tool for variant annotation. So it stands for an engine that could query and analyze several genomic dataset, generate real-time interactive reports where the researchers could filter out variants based on specific criteria and output in formats of QC reports, target and sequencing depth information, descriptions of the annotations and variants hyperlinked to public datasets in order to get further details about a variation.

Example of web interface where a researcher could select any single or combination of annotations to display. Links to the original datasources are readily available (Figure from WEP annotation system)

5. Finally, we think the most important requirement nowadays to MIP is the integration with our current LMS (Laboratory management system), in order to put the filtered variants  as input to our existing laboratory report analysis and publishing workflow. It means more productivity and automation with our existing infrastructure.

MIP could be also be acessible via RESful API, where the runs output
 would be interchanged with our external LMS solution.

As you may see, there's a huge effort on coding, design and infrastructure to meet those requirements. But we are thrilled to make this happen!  One of our current works in this project is the genv tool. Genv is what we call our genomika environment builder. The basic idea behind it is a tool written in python and fabric package, that provides instant access to biological software, programming libraries and data. The expected result  is a fully automated infrastructure that installs all software and data required to start MIP pipeline.  We are thinking of also arranging pre-built images with Docker.  Of course I will need a whole post to explain more about it!

To sum up,  I hope I could summarise one of the projects I've been working this first semester. At Genomika Diagnósticos we are facing big scientific challenges and the best part is that those tools are helping our lab to provide a next level of health information to the patients,  all from our DNA!

If you are interested on working with us, keep checking our github homepage with any open positions at our bioinformatics team.

Until next time!