WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

Patterson, Murray; Marschall, Tobias; Pisanti, Nadia; van Iersel, Leo; Stougie, Leen; Klau, Gunnar W.; Schönhuth, Alexander

doi:10.1007/978-3-319-05269-4_19

Murray Patterson^20,21,
Tobias Marschall²⁰,
Nadia Pisanti^22,23,
Leo van Iersel²⁰,
Leen Stougie^20,24,
Gunnar W. Klau^20,24 &
…
Alexander Schönhuth²⁰

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 8394))

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

3436 Accesses
3 Altmetric

Abstract

The human genome is diploid, that is each of its chromosomes comes in two copies. This requires to phase the single nucleotide polymorphisms (SNPs), that is, to assign them to the two copies, beyond just detecting them. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which avoid making use of direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing.

Future sequencing technologies, however, bear the promise to generate reads of lengths and error rates that allow to bridge all SNP positions in the genome at sufficient amounts of SNPs per read. Existing haplotype assembly approaches, however, profit precisely, in terms of computational complexity, from the limited length of current-generation reads, because their runtime is usually exponential in the number of SNPs per sequencing read. This implies that such approaches will not be able to exploit the benefits of long enough, future-generation reads.

Here, we suggest WhatsHap, a novel dynamic programming approach to haplotype assembly. It is the first approach that yields provably optimal solutions to the weighted minimum error correction (wMEC) problem in runtime linear in the number of SNPs per sequencing read, making it suitable for future-generation reads. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20x, processing chromosomes on standard workstations in only 1-2 hours. Our simulation study shows that the quality of haplotypes assembled by WhatsHap significantly improves with increasing read length, both in terms of genome coverage as well as in terms of switch errors. The switch error rates we achieve in our simulations are superior to those obtained by state-of-the-art statistical phasers.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

GenHap: a novel computational method based on genetic algorithms for haplotype assembly

Article Open access 18 April 2019

HapCHAT: adaptive haplotype assembly for efficiently leveraging high coverage in long reads

Article Open access 03 July 2018

High-Performance Haplotype Assembly

References

Aguiar, D., Istrail, S.: Hapcompass: A fast cycle basis algorithm for accurate haplotype assembly of sequence data. J. of Comp. Biol. 19(6), 577–590 (2012)
Article MathSciNet Google Scholar
Aguiar, D., Istrail, S.: Haplotype assembly in polyploid genomes and identical by descent shared tracts. Bioinformatics 360, i352–i360 (2013)
Google Scholar
Bansal, V., Bafna, V.: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 24(16), i153–i159 (2008)
Google Scholar
Bansal, V., et al.: An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Research 18(8), 1336–1346 (2008)
Article MathSciNet Google Scholar
Boomsma, D.I., et al.: The Genome of the Netherlands: design, and project goals. European Journal of Human Genetics (2013), doi:10.1038/ejhg.2013.118
Google Scholar
Chen, Z.Z., Deng, F., Wang, L.: Exact algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 29(16), 1938–1945 (2013)
Article Google Scholar
Cilibrasi, R., van Iersel, L., Kelk, S., Tromp, J.: On the complexity of several haplotyping problems. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 128–139. Springer, Heidelberg (2005)
Chapter Google Scholar
Delaneau, O., Howie, B., Cox, A., Zagury, J., Marchini, J.: Haplotype estimation using sequencing reads. Am. J. of Human Genetics 93(4), 687–696 (2013)
Article Google Scholar
Deng, F., Cui, W., Wang, L.: A highly accurate heuristic algorithm for the haplotype assembly problem. BMC Genomics 14(suppl. 2), S2 (2013)
Google Scholar
Earl, D.A., et al.: Assemblathon 1: A competitive assessment of de novo short read assembly methods. Genome Research (2011), doi:10.1101/gr.126599.111
Google Scholar
Fouilhoux, P., Mahjoub, A.: Solving VLSI design and DNA sequencing problems using bipartization of graphs. Comp. Optim. and Appl. 51(2), 749–781 (2012)
Article MATH MathSciNet Google Scholar
Greenberg, H., Hart, W., Lancia, G.: Opportunities for combinatorial optimization in computational biology. Informs J. on Computing 16(3), 211–231 (2004)
Article MATH MathSciNet Google Scholar
Hartl, D., Clark, A.: Principles of Population Genetics. Sinauer Associates, Inc., Sunderland (2007)
Google Scholar
He, D., Eskin, E.: Hap-seqX: expedite algorithm for haplotype phasing with imputataion using sequence data. Gene. 518(1), 2–6 (2013)
Article Google Scholar
He, D., Han, B., Eskin, E.: Hap-seq: an optimal algorithm for haplotype phasing with imputation using sequencing data. J. Comp. Biol. 20(2), 80–92 (2013)
Article MathSciNet Google Scholar
He, D., et al.: Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics 26(12), i183–i190 (2010)
Google Scholar
Howie, B., Donnelly, P., Marchini, J.: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6), e1000529 (2009)
Google Scholar
Lancia, G., Bafna, V., Istrail, S., Lippert, R., Schwartz, R.: SNPs problems, complexity and algorithms. In: Meyer auf der Heide, F. (ed.) ESA 2001. LNCS, vol. 2161, pp. 182–193. Springer, Heidelberg (2001)
Chapter Google Scholar
Levy, S., et al.: The diploid genome sequence of an individual human. PLoS Bio. (2007), doi:10.1371/journal.pbio.0050254
Google Scholar
Li, H.: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv (1303.3997) (2013)
Google Scholar
Li, Y., et al.: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834 (2010)
Article Google Scholar
Lippert, R., et al.: Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in Bioinformatics 3(1), 23–31 (2002)
Article Google Scholar
Menelaou, A., Marchini, J.: Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold. Bioinformatics 29(1), 84–91 (2013)
Article Google Scholar
Mossige, S.: An algorithm for Gray codes. Computing 18, 89–92 (1977)
Article MATH MathSciNet Google Scholar
Panconesi, A., Sozio, M.: Fast hare: A fast heuristic for single individual SNP haplotype reconstruction. In: Jonassen, I., Kim, J. (eds.) WABI 2004. LNCS (LNBI), vol. 3240, pp. 266–277. Springer, Heidelberg (2004)
Chapter Google Scholar
Scheet, P., Stephens, M.: A fast and flexible statistical model for large-scale population genotype data: Applications to inferring missing genotypes and haplotypic phase. American Journal of Human Genetics 78, 629–644 (2006)
Article Google Scholar
Selvaraj, S., et al.: Whole-genome haplotype reconstruction using proximity-ligation and shotgun sequencing. Nature Biotechnology 31, 1111–1118 (2013)
Article Google Scholar
The 1000 Genomes Project Consortium: A map of human genome variation from population-scale sequencing. Nature 467(7319), 1061–1073 (2010)
Google Scholar
The International HapMap Consortium: A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007)
Google Scholar
The International HapMap Consortium: Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010)
Google Scholar
Wang, R.S., Wu, L.Y., Li, Z.P., Zhang, X.S.: Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics 21(10), 2456–2462 (2005)
Article Google Scholar
Yang, W.Y., Hormozdiari, F., Wang, Z., He, D., Pasaniuc, B., Eskin, E.: Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data. Bioinformatics 29(18), 2245–2252 (2013)
Article MATH Google Scholar
Zhang, Y.: A dynamic Bayesian Markov model for phasing and characterizing haplotypes in next-generation sequencing. Bioinformatics 29(7), 878–885 (2013)
Article Google Scholar
Zhao, Y.T., Wu, L.Y., Zhang, J.H., Wang, R.S., Zhang, X.S.: Haplotype assembly from aligned weighted SNP fragments. Computational Biology and Chemistry 29, 281–287 (2005)
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

Life Sciences, CWI, Amsterdam, The Netherlands
Murray Patterson, Tobias Marschall, Leo van Iersel, Leen Stougie, Gunnar W. Klau & Alexander Schönhuth
LBBE, CNRS, Université de Lyon 1, Villeurbanne, France
Murray Patterson
Department of Computer Science, University of Pisa, Italy
Nadia Pisanti
LIACS, Leiden University, The Netherlands
Nadia Pisanti
VU University Amsterdam, The Netherlands
Leen Stougie & Gunnar W. Klau

Authors

Murray Patterson
View author publications
You can also search for this author in PubMed Google Scholar
Tobias Marschall
View author publications
You can also search for this author in PubMed Google Scholar
Nadia Pisanti
View author publications
You can also search for this author in PubMed Google Scholar
Leo van Iersel
View author publications
You can also search for this author in PubMed Google Scholar
Leen Stougie
View author publications
You can also search for this author in PubMed Google Scholar
Gunnar W. Klau
View author publications
You can also search for this author in PubMed Google Scholar
Alexander Schönhuth
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

School of Computer Science, Tel Aviv University, 69978, Tel Aviv, Israel
Roded Sharan

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Patterson, M. et al. (2014). WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads. In: Sharan, R. (eds) Research in Computational Molecular Biology. RECOMB 2014. Lecture Notes in Computer Science(), vol 8394. Springer, Cham. https://doi.org/10.1007/978-3-319-05269-4_19

Download citation

DOI: https://doi.org/10.1007/978-3-319-05269-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05268-7
Online ISBN: 978-3-319-05269-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics