This blog post deals with the various ways of downloading large amounts of sequencing data (e.g., from NCBI’s SRA database). When I needed to bulk download short reads for a recent project, it took me some time to figure out how to do this efficiently, and I am sharing my experience here in the hope that it might be useful.
The problem: you want to download lots of sequencing data (typically in the form of Illumina-generated reads), e.g., to reproduce a published experiment. The amount of data makes it impractical to click and download through a browser interface. There are two potential solutions: 1) download via NCBI’s SRA toolkit, and 2) access FTP servers directly.
Two new scripts for creating simple GC-coverage plots from SPAdes assemblies and analysing PhyloBayes trace files
Before writing about science again (new post is in the making), I have uploaded two scripts that I find useful for my work:
This script is basically a very simplified version of the blobplots function from the BlobTools package. It creates GC-coverage plots directly from SPAdes assembly files, without the need to map the reads back to the assembly. Since I mainly use SPAdes anyway, this has been quite handy. The script will also annotate the plot when a taxonomy file is provided, which can be generated, e.g., from blast outputs. Below is an example of the plots that can be generated with the script.
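To illustrate why no read mapping is needed, here is a minimal Python sketch (not the script itself). It assumes the usual SPAdes contig header layout, `>NODE_<id>_length_<bp>_cov_<coverage>`, where the coverage value is SPAdes' k-mer coverage rather than read depth. The function names `read_fasta` and `gc_coverage` are made up for this example.

```python
import re

# Assumption: SPAdes-style headers, e.g. NODE_1_length_8_cov_12.5
HEADER = re.compile(r"^NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_coverage(lines):
    """Return (header, length, GC fraction, coverage) for each contig.

    Coverage is read from the header; contigs whose header does not
    match the assumed SPAdes layout are skipped.
    """
    rows = []
    for header, seq in read_fasta(lines):
        match = HEADER.match(header)
        if match is None:
            continue
        seq = seq.upper()
        # Ignore ambiguous bases (N etc.) when computing the GC fraction.
        acgt = sum(seq.count(base) for base in "ACGT")
        gc = (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0
        rows.append((header, len(seq), gc, float(match.group(3))))
    return rows


example = [
    ">NODE_1_length_8_cov_12.5",
    "GGCC",
    "AATT",
    ">NODE_2_length_4_cov_3.0",
    "ATAT",
]
rows = gc_coverage(example)
```

For a real assembly, passing an open file handle works the same way, e.g. `with open("contigs.fasta") as fh: rows = gc_coverage(fh)`. The resulting GC and coverage columns can then be handed to any plotting library.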
I have written a small script that automates the download of fastq files from the European Nucleotide Archive (ENA). I wrote it because I was annoyed with the speed of NCBI's sra-tools. It takes NCBI SRA accession numbers as input and downloads the fastq files directly from the ENA using wget. The download speed is thus basically limited only by your bandwidth. Any feedback is very welcome!
Find it on the resources page or on github.
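For readers curious how such a download can work in principle, here is a rough Python sketch (not the script itself). It assumes the ENA FTP directory layout as I understand it: `vol1/fastq/<first six characters>/<zero-padded extra digits>/<accession>/`, with files named `<accession>_1.fastq.gz` and `<accession>_2.fastq.gz` for paired-end runs, or `<accession>.fastq.gz` for single-end runs. Both the layout and the function names are assumptions for illustration, so please check them against the ENA documentation before relying on them.

```python
ENA_FASTQ_ROOT = "ftp://ftp.sra.ebi.ac.uk/vol1/fastq"  # assumed base path


def ena_fastq_dir(run_accession):
    """Build the assumed ENA FTP directory for a run accession (SRR/ERR/DRR)."""
    acc = run_accession.strip()
    if acc[:3] not in ("SRR", "ERR", "DRR") or not acc[3:].isdigit() or len(acc) < 9:
        raise ValueError(f"not a run accession: {run_accession!r}")
    path = f"{ENA_FASTQ_ROOT}/{acc[:6]}"
    extra = acc[9:]  # digits beyond the ninth character, if any
    if extra:
        # Assumed rule: these digits become a zero-padded three-digit folder.
        path += "/" + extra.zfill(3)
    return f"{path}/{acc}"


def fastq_urls(run_accession, paired=True):
    """Return the assumed fastq URLs for one run."""
    acc = run_accession.strip()
    base = ena_fastq_dir(acc)
    if paired:
        return [f"{base}/{acc}_1.fastq.gz", f"{base}/{acc}_2.fastq.gz"]
    return [f"{base}/{acc}.fastq.gz"]


def wget_commands(accessions, paired=True):
    """Return one wget argument list per file, without running anything."""
    return [["wget", url] for acc in accessions for url in fastq_urls(acc, paired)]


commands = wget_commands(["SRR1234567"])
```

Each entry in `commands` could be passed to `subprocess.run` to start the download. The accession above is a placeholder, not a real dataset.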
This post is a ‘behind the paper’ story of our publication ‘Short reads from honey bee (Apis sp.) sequencing projects reflect microbial associate diversity’ which was just published in PeerJ. I will explain the motivation behind the study and also show some new data generated with our approach.
UPDATE (July 24, 2017): Our paper was covered in the "The Molecular Ecologist" blog: http://www.molecularecologist.com/2017/07/genomes-are-coming-sequence-libraries-from-the-honey-bee-reflect-associated-microbial-diversity/
Part one: background & motivation
I work as a postdoc in Greg Hurst’s group, which has projects on many different bacterial symbionts (https://sites.google.com/site/hurstlab/home). The project of his PhD student Georgia Drew aims to identify the potential impacts of Arsenophonus on honey bee health (https://eegid.wordpress.com/phd-students/georgia-drew/). This bacterium is an inherited symbiont of arthropods, and has occasionally been found in honey bees and other bees (Aizenberg-Gershtein et al. 2013; Gerth et al. 2015; Yañez et al. 2016; McFrederick et al. 2017). Its exact role in honey bees is unclear, and this is what Georgia studies.
I became involved when Greg suggested extracting genomic data of the honey bee associated Arsenophonus from Apis Illumina data stored in public databases. In many cases, symbionts are sequenced inadvertently alongside their hosts, and a number of symbiont genomes have been extracted from sequencing data of their hosts before (e.g., Salzberg et al. 2005; Siozios et al. 2013). There’s plenty of honey bee sequencing data around, so it was definitely worth checking if we could get an Arsenophonus genome ‘for free’, without actually sequencing it ourselves.
Our new paper "Comparative genomics provides a timeframe for Wolbachia evolution and exposes a recent biotin synthesis operon transfer" was published today!
Check out the paper here: www.nature.com/articles/nmicrobiol2016241 – please email me for a pdf of this article!
I also wrote a short "Behind the paper" blogpost for the Nature Microbiology Community website.
Read it here:
EDIT (July 12, 2017): You can access the paper for free under http://rdcu.be/t8tX
This post deals with a recent paper about Wolbachia in plant parasitic nematodes, and with Wolbachia phylogeny in general.
Almost 30 genomes of the bacterial endosymbiont Wolbachia have been sequenced so far, and this trend is likely to continue. Wolbachia are found in a large proportion of arthropods (insects, arachnids, and allies) and in filarial nematodes. Very generally speaking, Wolbachia in arthropods are opportunistic, with varied fitness effects for their hosts, and may switch hosts horizontally. In contrast, Wolbachia in filarials are highly specialized and absolutely required for their hosts (the mechanisms underlying this co-dependence are not 100% clear yet).
The differences in lifestyle between Wolbachia from arthropods and filarials are also reflected in their genomic architecture. For example, arthropod Wolbachia typically harbour many mobile genetic elements (e.g., insertion sequences, prophages & other phage-derived elements) that are almost always missing in the very streamlined and reduced filarial Wolbachia genomes.
Now, for the first time, there is genomic data from more 'exotic' Wolbachia strains: Brown et al. have sequenced the genome of Wolbachia from a plant-parasitic nematode (wPpe from Pratylenchus penetrans), and, in a recent publication (Brown et al. 2016) compare it to the rest of the genomes of Wolbachia from arthropods and filarial nematodes. They also include in their analysis a strain from the banana aphid (wPni from Pentalonia nigronervosa) and a springtail (wFol from Folsomia candida). These strains were sequenced previously (De Clerck et al. 2015 & Gerth et al. 2014, respectively), but never investigated in a comparative framework before. All three strains are genetically very divergent from typical arthropod and filarial Wolbachia, so it was really cool to see this analysis published.
Here, I want to briefly summarize the main findings of Brown et al.'s study and comment on what phylogenomic datasets and gene repertoires can tell us about evolutionary relationships within Wolbachia.
In my first blog post, I will discuss a recent paper about Wolbachia classification.
In a recent study, Wang et al. (2016) investigated Wolbachia sequences from cave spiders (Telema spp.). They found that these belong to a genetic lineage distinct from all other described Wolbachia strains (in Wolbachia, those genetically distinct lineages are called “supergroups”). I re-analysed these data and found that, in fact, Wolbachia strains from cave spiders cluster within supergroup A (Gerth 2016).
[If you are unfamiliar with Wolbachia biology, or the supergroup classification system, the excellent review by Werren et al. (2008) is a good starting point.]
After I uploaded my re-analysis to bioRxiv, Guan-Hong Wang kindly sent me the alignment files they used in their study. In this post, I want to use these data to illustrate why their analysis and mine are discordant, and also why their conclusions are likely mistaken.
This is the website of Michael Gerth. I am a biologist with an interest in insects and the microbes within them. Click here to learn more.