So you want to learn bioinformatics for bacterial population genomics
by Taj Azarian, Lauren Cowley and Karel Brinda
In the ever-changing world of genomics and epidemiology, technological advancements in sequencing and analysis may often move faster than the workforce can keep up. Here we provide resources for those who want to become more familiar with sequencing, bioinformatics, and population genomic analysis in the emerging field of genomic epidemiology.
Taj and Lauren both began their careers in public health practice, Taj working for the Florida Department of Health and Lauren for Public Health England (PHE). Having successfully (they think) made the transition from shoe-leather infectious disease epidemiology to genomic epidemiology, they often receive questions from past public health colleagues interested in making the jump into the bioinformatics/Next-Generation Sequencing (NGS)/phylogenetic/population genetic world (hereafter referred to simply as genomic epidemiology). Certainly, with the increasing buzz about genomic epi and the growing presence of pathogen genome sequencing in public health practice, long-time epidemiologists want to learn these new tools to stay up to date. At the very least, they want to understand the lexicon so that they can read genomic epi papers or interpret phylogenetic/genomic results in the course of an outbreak investigation. Those employed by public health laboratories are witnessing this transition firsthand. Pulsed-field gel electrophoresis (PFGE), for example, is slated for eventual replacement by whole-genome sequencing (WGS), and we have already seen the CDC and PHE apply WGS to surveillance, outbreak detection, and investigation of notifiable infectious diseases (e.g., Salmonella, Listeria, and E. coli). Epidemiologists are now being handed phylogenies (or single-nucleotide polymorphism (SNP) distance matrices) during outbreak investigations and are expected to interpret them, which often prompts the question “How much relatedness (in terms of genetic distance) counts as related (i.e., epidemiologically linked)?” [Perhaps a topic for a future blog post]. In all, they and other members of CCDD have in various ways tried to help colleagues and friends make this transition and soften the overwhelming feeling of opening the terminal for the first time.
So here is the attempt by Taj, Lauren, and Karel (a seasoned, algorithms-focused bioinformatician) to consolidate that advice into one location. This is by no means the be-all and end-all guide to genomic epi, but it is a place to start.
1. Read scientific articles and blogs
This is a must for a first step. You have to build a detailed understanding of concepts and applications before diving into analysis. This will also help you focus on the areas of learning that you want to pursue. Time invested up front will pay dividends later. As one of our labmates likes to say, “An hour of reading will save you a day of hacking.” Here are some suggested articles, which we feel are a good start. There are numerous publications covering the same topics, but we find these among the most representative of the field, and they are written by some really sharp researchers (and some of our friends).
ARTICLES
- Grad YH, Lipsitch M. Epidemiologic data and pathogen genome sequences: a powerful synergy for public health. Genome Biol. 2014;15(11):538. doi:10.1186/s13059-014-0538-4.
- Loman N, Watson M. So you want to be a computational biologist? Nat Biotechnol. 2013;31(11):996. doi:10.1038/nbt.2740.
- Köser CU, Ellington MJ, Cartwright EJP, et al. Routine use of microbial whole genome sequencing in diagnostic and public health microbiology. Rall GF, ed. PLoS Pathog. 2012;8(8):e1002824. doi:10.1371/journal.ppat.1002824.
- Croucher NJ, Didelot X. The application of genomics to tracing bacterial pathogen transmission. Curr Opin Microbiol. 2015;23:62-67. doi:10.1016/j.mib.2014.11.004.
- Sintchenko V, Holmes EC. The role of pathogen genomics in assessing disease transmission. BMJ. 2015;350:h1314.
- Altmann A, Weber P, Bader D, Preuss M, Binder EB, Müller-Myhsok B. A beginners guide to SNP calling from high-throughput DNA-sequencing data. Hum Genet. 2012;131(10):1541-1554. doi:10.1007/s00439-012-1213-z.
- Nielsen R, Paul JS, Albrechtsen A, Song YS. Genotype and SNP calling from next-generation sequencing data. Nat Rev Genet. 2011;12(6):443-451. doi:10.1038/nrg2986.
- Dallman TJ, Byrne L, Ashton PM, Cowley LA, et al. Whole-Genome Sequencing for National Surveillance of Shiga Toxin–Producing Escherichia coli O157. Clin Infect Dis. 2015;61(3):305-312. doi:10.1093/cid/civ318
BLOGS
- The Holt Lab
- Bits and Bugs
- Loman Labs
- Heng Li’s blog
- Blue Collar Bioinformatics by Brad Chapman
- Living in the Ivory Basement by C. Titus Brown
- Bits of DNA by Lior Pachter
- The Genome Factory by Torsten Seemann
2. Get over your fear of the command line (i.e., Terminal)
Even though this may require some time, mastering the Unix shell is essential for a bioinformatician (and people who want to talk to bioinformaticians). However, once you get familiar with the underlying philosophy, it becomes surprisingly easy. Remember that there is the man command and the --help option which can help you whenever you feel lost.
(Anecdote from Taj) When I started making my way into bioinformatics, I began using a local instance of Galaxy, which was a great learning tool. It is essentially a drag-and-drop web-based graphical interface that allows you to create analysis pipelines (i.e., stringing analyses together so that the results of one analysis are used as the input for the next). I became very proficient in using Galaxy, but as the size of the datasets I was working with increased from tens of isolates to hundreds, I began to see its limitations. When I sought help to increase the throughput of my Galaxy pipelines, a friend at the high-performance computing center finally told me “Look, Galaxy is a coloring book and the command line is the blank pages of a fine novel. It’s time to make the transition.” When I finally did, I realized that I had been missing out on the flexibility of creating custom pipelines and on higher throughput. Sure, that same flexibility (essentially a blank slate) can be overwhelming, but once you grasp the concept of what you are trying to do analytically, building your own pipeline (or using someone else’s) is not as hard as you think. But how do you make that transition, you ask… well, continue below.
3. Attend workshops or complete self-directed tutorials
Bioinformaticians are fairly good at sharing their knowledge and teaching others. If you are at an academic institution, your research computing center probably hosts several training courses. In addition, there is a range of great courses on the Internet; some of the best, in our opinion, are:
Taught courses
Online courses
- Algorithms for DNA Sequencing by Ben Langmead & Jacob Pritt (available also on YouTube)
- Journey to the Frontier of Computational Biology by Pavel Pevzner & Phillip E. C. Compeau et al.
Self-directed tutorials
4. There’s this thing called Google…
There is no shortage of online resources, including a myriad of message boards and websites dedicated to bioinformatics, phylogenetics, and bacterial population genomics. The trick is knowing how to search for what you are looking for. Remember that reading we suggested earlier? This is where it pays dividends. Once you have a handle on the lingo, you will be able to search accordingly. Remember, a pet peeve of long-standing members of any message board community is newbies posting a question that has already been answered a number of times. Always search first! Here are a few message boards we frequent:
5. Find out what resources your institution has available and what programs they currently use for NGS data analysis
If you are reading this post, you are likely employed by an academic institution, public health entity, or private institution. Ask around or search your intranet for key words like “high-performance computing” or “research computing”. If you don’t find any hits, go to someone in your IT department and they can hopefully point you in the right direction. Become familiar with what resources are available to you, what analysis platforms, if any, your institution uses, and whom you need to network with to learn more about the resources or gain access. If your institution does not have any compute resources at the moment, you will have to consider third-party services (more on this in a future post). Don’t be discouraged. These days a lot of these analyses can be run on a desktop or laptop.
6. Download and install bioinformatics “tools”
In your reading assignments, you will learn about analyses such as reference-based or de novo assembly. These terms are frequent in the methods sections of most genomic epidemiology papers. It will soon become obvious that researchers are not creating their own tools to perform these tasks but instead using published (and hopefully peer-reviewed) tools to do the job. For example, for de novo assemblies, Velvet and SPAdes are commonly used tools. So how do you download and install these tools?
Bioinformaticians often tend to (compile and) install bioinformatics tools manually, but this can be very time-consuming and the resulting environments are difficult to replicate. Bioconda, a recently emerged channel of bioinformatics software for the conda package manager, greatly simplifies the installation of both programs and libraries. With a single command, you can set up an environment and install all the required packages. Using Bioconda will save you an incredible amount of time. In future blog posts, we will talk more about setting up and using Bioconda.
7. Learn how to write reproducible pipelines
Our resident GitHub advocate would like us to now make a public service announcement. GitHub is an online service for developing and storing the code of your pipelines, built around the Git version control system. It is good practice to use GitHub or a similar service because it offers several benefits, among them data security and reproducibility. There is a bit of a learning curve to using Git, and it may seem like you are far off from ever needing such a thing. However, once you get to the point where you are writing code frequently, it is probably time to start considering Git. At this point, you should begin refining your code, thinking about how to make it more efficient and more intuitive overall. It is also a good idea to imagine yourself coming back to the code after 3 years and trying to figure out how or why you did something. As such, comment your code and try to make it self-explanatory.
Remember that every time you write code, it’s an opportunity to introduce yet another bunch of bugs into your (probably already buggy) pipeline. So try to avoid having to write new code! Prefer programs, programming languages, and libraries that allow you to write minimalistic code or to avoid writing new code at all. As wise software engineers say, a line of code that doesn’t exist is a line with no bug. When you cannot avoid writing new code, keep asking yourself how a potential bug would show itself and what you could do to spot it early. Learn about defensive programming, especially about asserts and tests. Once you become familiar with the console, learn how to use GNU Make and Snakemake. They can do a great job for you.
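To make “asserts and tests” a little more concrete, here is a minimal sketch in Python. The function reverse_complement and its test are hypothetical examples written for this post, not part of any particular tool; the point is only to show input checking, a cheap sanity check on the output, and a small test that exercises both normal and bad input.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    seq = seq.upper()

    # Defensive check: fail early and loudly on unexpected characters
    # instead of silently producing a wrong sequence.
    unexpected = set(seq) - set(complement)
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {sorted(unexpected)}")

    result = "".join(complement[base] for base in reversed(seq))

    # Cheap sanity check on the output.
    assert len(result) == len(seq), "output length differs from input length"
    return result


def test_reverse_complement() -> None:
    """Small test covering typical input, edge cases, and bad input."""
    assert reverse_complement("AAC") == "GTT"
    assert reverse_complement("ACGT") == "ACGT"  # its own reverse complement
    assert reverse_complement("") == ""
    # Applying the function twice should give back the original sequence.
    assert reverse_complement(reverse_complement("GATTACA")) == "GATTACA"
    # Bad input must raise an error rather than pass through.
    try:
        reverse_complement("ACGU")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for non-DNA input")


if __name__ == "__main__":
    test_reverse_complement()
    print("all tests passed")
```

Running the file directly executes the test; a test runner such as pytest would also pick up test_reverse_complement by its name.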
8. Try it on your own data set
Once you have braved the previous steps (or skipped from 1 to 4 then 6), it is time to jump into working with your own data. Hopefully at this point you have a good overall idea of where to begin. Maybe you have some raw sequencing data you are trying to assemble, or you have sequences you are trying to align and use to infer a phylogeny. The best thing to do is write out the steps of your analysis and how you will move from one step to another. This is the foundation of your “pipeline”. We could spend an entire post on how to do this (and probably will at some point), but just think of it as your analysis roadmap. If you hit a roadblock, see #4. When it comes to interpreting your results, keep in mind that anything that looks incredibly interesting and/or is largely unexpected is also almost certainly (and unfortunately) wrong*. Be your own biggest critic. Step back and look at your results objectively… do they make sense? Are they what you expected? Once you have convinced yourself, give yourself a pat on the back and then go back and double-check your code one last time to make sure there are no errors.
*This is Hanage’s 2nd Law.
9. Start back at #1
Closing remarks
Hopefully these tips will help you in the process of learning about genomic epidemiology and bioinformatics at large. In closing, we included some vocabulary words, bioinformatic tool packages, and file types you should work to become familiar with. If you have questions or comments, feel free to leave them below.
Vocabulary words:
- Core, accessory, and pan-genome
- Recombination
- Mobile genetic element (MGE)
- Heterozygosity
- Ortholog, homolog, paralog
- Coverage (in terms of sequencing)
- Reference genome
- Reference-based assembly
- Read depth
- De novo assembly
- Contig
- Multiple sequence alignment
- Single-nucleotide polymorphism (SNP)
- Tree/phylogeny/dendrogram/clade
- Out-group
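A couple of these terms (multiple sequence alignment, SNP) are easier to grasp with a toy example. The following Python sketch counts pairwise SNP differences between already-aligned sequences, which is the idea behind the SNP distance matrices mentioned in the introduction. The isolate names and sequences are made up for illustration, and skipping gaps and ambiguous bases is just one possible convention; real tools differ in how they treat such positions.

```python
from itertools import combinations
from typing import Dict, Tuple


def snp_distance(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences carry different bases.

    Positions where either sequence has a gap or an ambiguous base
    (anything other than A, C, G, T) are ignored. This is an assumption
    of this sketch, not a universal rule.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in valid and b in valid and a != b
    )


def distance_matrix(alignment: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return SNP distances for every unordered pair of sequences."""
    return {
        (name_a, name_b): snp_distance(alignment[name_a], alignment[name_b])
        for name_a, name_b in combinations(sorted(alignment), 2)
    }


# Toy multiple sequence alignment (made-up data).
alignment = {
    "isolate_1": "ACGTACGTAC",
    "isolate_2": "ACGTACGTAT",
    "isolate_3": "ACTTACG-AT",
}

if __name__ == "__main__":
    for (name_a, name_b), dist in distance_matrix(alignment).items():
        print(f"{name_a}\t{name_b}\t{dist}")
```

In this toy alignment, isolate_1 and isolate_2 differ at one position, isolate_1 and isolate_3 at two, and isolate_2 and isolate_3 at one (the gapped column is skipped). Whether distances like these count as “related” is exactly the epidemiological question raised earlier; the code only produces the numbers.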
Bioinformatic Tool Packages commonly used/discussed:
- TextWrangler or BBEdit: not a “tool” per se, but a text editor
- GATK
- Smalt
- Bowtie/Bowtie2
- Velvet
- SPAdes
- Prokka
- Roary
- Mauve
- RAxML
- FastTree
- For a more detailed overview see this great Guideline for Bioinformatic Tools on the Bits and Bugs blog.
File formats to be familiar with:
- Fasta
- Fastq
- Gff
- Gbk (or gb)
- Embl
- Nexus (important for some phylogenetic analysis programs)
- Learn how to find and download files from ENA and NCBI
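As a first hands-on step with these formats, it can help to see how simple FASTA really is: a header line starting with “>” followed by one or more lines of sequence. Below is a minimal Python sketch of a FASTA reader. The function name read_fasta and the example records are our own inventions for illustration; for real work you would more likely use an established library (for example, Biopython provides Bio.SeqIO for this) rather than your own parser.

```python
import io
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(handle: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA file or any iterable of lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Made-up example with one sequence wrapped over two lines.
example = io.StringIO(">contig_1 example\nACGT\nACGT\n>contig_2\nGGCC\n")
records = list(read_fasta(example))

if __name__ == "__main__":
    for header, sequence in records:
        print(header, len(sequence))
```

With a real file you would pass an open file handle instead of the in-memory example, e.g. with open("assembly.fasta") as handle (the file name here is hypothetical).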