Papers, reviews, tutorials

2021

Genome Research
Data structures based on k-mers for querying large collections of sequencing data sets

Camille Marchet, Christina Boucher, Simon J. Puglisi, Paul Medvedev, Mikaël Salson, and Rayan Chikhi

Genome Research, Jan 2021

High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.

@article{marchet_data_2021,
  title = {Data structures based on k-mers for querying large collections of sequencing data sets},
  volume = {31},
  doi = {10.1101/gr.260604.119},
  number = {1},
  journal = {Genome Research},
  author = {Marchet, Camille and Boucher, Christina and Puglisi, Simon J. and Medvedev, Paul and Salson, Mikaël and Chikhi, Rayan},
  month = jan,
  year = {2021},
  pages = {1--12},
}

ACM Comput. Surv.
Data Structures to Represent a Set of K-Long DNA Sequences

Rayan Chikhi, Jan Holub, and Paul Medvedev

ACM Comput. Surv., Mar 2021

The analysis of biological sequencing data has been one of the biggest applications of string algorithms. The approaches used in many such applications are based on the analysis of k-mers, which are short fixed-length strings present in a dataset. While these approaches are rather diverse, storing and querying a k-mer set has emerged as a shared underlying component. A set of k-mers has unique features and applications that, over the past 10 years, have resulted in many specialized approaches for its representation. In this survey, we give a unified presentation and comparison of the data structures that have been proposed to store and query a k-mer set. We hope this survey will serve as a resource for researchers in the field as well as make the area more accessible to researchers outside the field.

@article{chikhi_data_2021,
  title = {Data {Structures} to {Represent} a {Set} of {K}-{Long} {DNA} {Sequences}},
  volume = {54},
  issn = {0360-0300},
  url = {https://doi.org/10.1145/3445967},
  doi = {10.1145/3445967},
  number = {1},
  journal = {ACM Comput. Surv.},
  author = {Chikhi, Rayan and Holub, Jan and Medvedev, Paul},
  month = mar,
  year = {2021},
}

Genome Biology
Technology dictates algorithms: recent developments in read alignment

Mohammed Alser, Jeremy Rotman, Dhrithi Deshpande, Kodi Taraszka, Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang, Victor Xue, Sergey Knyazev, Benjamin D. Singer, Brunilda Balliu, David Koslicki, Pavel Skums, Alex Zelikovsky, Can Alkan, Onur Mutlu, and Serghei Mangul

Genome Biology, Aug 2021

Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.

@article{alser_technology_2021,
  title = {Technology dictates algorithms: recent developments in read alignment},
  volume = {22},
  issn = {1474-760X},
  url = {https://doi.org/10.1186/s13059-021-02443-7},
  doi = {10.1186/s13059-021-02443-7},
  number = {1},
  journal = {Genome Biology},
  author = {Alser, Mohammed and Rotman, Jeremy and Deshpande, Dhrithi and Taraszka, Kodi and Shi, Huwenbo and Baykal, Pelin Icer and Yang, Harry Taegyun and Xue, Victor and Knyazev, Sergey and Singer, Benjamin D. and Balliu, Brunilda and Koslicki, David and Skums, Pavel and Zelikovsky, Alex and Alkan, Can and Mutlu, Onur and Mangul, Serghei},
  month = aug,
  year = {2021},
  pages = {249},
}