This post is a ‘behind the paper’ story of our publication ‘Short reads from honey bee (Apis sp.) sequencing projects reflect microbial associate diversity’ which was just published in PeerJ. I will explain the motivation behind the study and also show some new data generated with our approach.
UPDATE (July 24, 2017): Our paper was covered in the "The Molecular Ecologist" blog: http://www.molecularecologist.com/2017/07/genomes-are-coming-sequence-libraries-from-the-honey-bee-reflect-associated-microbial-diversity/
Part one: background & motivation
I work as a postdoc in Greg Hurst’s group, who has projects on many different bacterial symbionts (https://sites.google.com/site/hurstlab/home). The project of his PhD student Georgia Drew aims to identify the potential impacts of Arsenophonus on honey bee health (https://eegid.wordpress.com/phd-students/georgia-drew/). This bacterium is an inherited symbiont of arthropods, and has been found in honey bees and other bees occasionally (Aizenberg-Gershtein et al. 2013; Gerth et al. 2015; Yañez et al. 2016; McFrederick et al. 2017). Its exact role in honey bees is unclear, and this is what Georgia studies.
I became involved when Greg suggested to extract genomic data of the honey bee associated Arsenophonus from Apis Illumina data stored public databases. In many cases, symbionts are sequenced inadvertently alongside with their hosts, and a number of symbiont genomes have been extracted from sequencing data of their hosts before (e.g., Salzberg et al. 2005; Siozios et al. 2013). There’s plenty of honey bee sequencing data around, so it was definitely worth checking if we could get an Arsenophonus genome ‘for free’, without actually sequencing it ourselves.
The question was how to extract symbiont genomes from short sequencing reads of the host. This is not a new problem, and there are very neat software solutions that can be used for this, e.g., Blobtools or Anvi’o. What these usually involve is a meta-assembly (assembling all reads, microbial & host), mapping of the reads to the contigs of this meta-assembly, and a taxonomic annotation of the contigs, e.g., via BLAST. Symbionts can then be identified via GC- and/or coverage distributions, or other binning methods. It was immediately clear to us that this is not the best approach for our purpose. There are hundreds of Apis sequencing libraries stored in NCBI’s sequencing read archive, and processing them all in this way would take much too long for our quick screen.
We thus decided we would not assemble the short reads, but map them against a small Arsenophonus reference sequence (16S). There a very fast mappers (we used NextGenMap), and mapping against a small sequence does not take long even for very large sequencing libraries. If for any of the libraries, this mapping would yield many reads matching to the Arsenophonus reference sequence, we would be able to go back to the library and do an assembly, mapping, etc.
We thus began to download all Apis DNA sequencing libraries from NCBI using sra-tools. This proved to be very tedious, as the connections were frequently interrupted and the conversion into gzipped fastq files took forever. We later switched to a much faster alternative: direct download from ENA via ftp (http://www.ebi.ac.uk/ena/browse/read-download#downloading_files_ftp; EDIT July 28, 2017: I have written a script that automates the process of downloading fastq files from ENA). After downloading, the mapping to Arsenophonus 16S did not take long, and the result was that there was not a single Apis sequencing library in which we could detect Arsenophonus. This was a bit disappointing and was the end of our search for Arsenophonus in honey bee sequencing data.
However, we now had all publicly available honey bee sequencing libraries on a single hard drive, and a quick method to detect microbes in those libraries. Luckily, due to lots of great work by others (summarized in Schwarz et al. 2015; Engel et al. 2016; Kwong & Moran 2016), we also pretty much knew what microbes to expect in healthy and unhealthy honey bees. We thus decided to screen for all typical honey bee symbionts, and also pathogens, just the way we did for Arsenophonus. Later on, we also downloaded libraries from honey bee RNA sequencing projects and searched for common honey bee RNA viruses in those.
It was a lot of fun diving into this huge amount of data, and we did find quite a lot of symbiont/pathogen data in the short reads. I do not want to detail our results here (please see our paper for this), but can summarize the main findings in two points:
1) There’s loads of useful genomic data on symbionts & pathogens in honey bee (and probably many other eukaryotic) sequencing projects. In most cases, this data is treated as contamination. We would argue that contamination from host-associated microbes could potentially yield useful information and should be reported in any sequencing project. For example, in our screen, we found 25 different Lactobacillus strains and reconstructed draft genomes for three bacterial honey bee symbionts. All of these were ‘hidden’ in the sequencing libraries.
2) When performing an RNA sequencing project and the sequences contain a lot of viral reads, one should be cautious in interpreting the results. Uneven viral loads across samples may lead to biased gene expression and therefore skew interpretations of differential gene expression analyses. For example, we found several honey bee RNA sequencing experiments with strongly biased viral loads between samples. We think that the presence of viruses very likely impacted gene expression profiles in these examples.
Part two: applying our approach to newly discovered viruses
In our paper, we suggested that archived short read data of honey bees can be used to learn about microbial symbionts that may yet await discovery. Some time after we finished our analyses, a number of honey bee viruses were newly described (Remnant et al. 2017). We could not include these in our search, but can now test our prediction that undescribed microbes are present in Apis short read data. I quickly performed a search for seven newly discovered viruses in honey bee RNA data from short read sequencing projects just as described in our paper.
This way, I found several isolates of two of the seven novel viruses (Apis mellifera Rhabdovirus 1 and 2). Interestingly, these two were reported to be the most widespread of the novel viruses by Remnant et al. (2017). The viral sequences of all ARV-1 and ARV-2 isolates are each highly conserved, with only few changes at the nucleotide level (see figure below). This is also in agreement with Remnant et al. (2017), who reported 98–99% nucleotide identity for the analysed sequences of these viruses.
For both ARV-1 and ARV-2, I reconstructed the phylogeny from complete and partial genomes (alignments can be found below this post). Although the sample size is quite small, there seems to be evidence for geographic clustering of the viral isolates. For example, in the ARV-1 and ARV-2 trees below, the European samples are distinct from the other samples. The same is true for the samples from Brazil, which also appear distinct. Again, this is based on a small samples size and the phylogenetic analyses cannot be considered as comprehensive – still its interesting what you can tell from 'contamination' in Apis sequencing libraries.
Maximum likelihood phylogenies for ARV-1 and ARV-2 based on whole genome alignments. Reference sequences previously sequenced by Remnant et al. (2017) are marked with an asterisk. All other strains are from honey bee RNA sequencing libraries, and the labels correspond to NCBI's SRA accession numbers. Geographic origin of all samples is color-coded (legend in top right). Numbers are support values from 1000 ultrafast bootstrap replicates. Please note that not all of the strains extracted from short read libraries are represented by a complete genome (see alignments under this post).
In summary, Apis RNA sequencing data confirm ARV-1 and ARV-2 to be widespread viruses in honey bees. The distribution range of these viruses reported by Remnant et al. (2017) can be expanded to include the USA, Germany, and Brazil. This short example demonstrates that archived short read data not only can help to quickly assess the distribution of newly discovered viruses or other microbes, but also to retrieve additional genomic data of the microbe in question. All data shown here and in our paper is from past sequencing projects and freely accessible through sequence repositories. Therefore, if one finds a ‘new’ symbiont or virus, I think it often is a sensible approach to verify its presence in host sequencing data.
Read the paper here: https://peerj.com/articles/3529/?td=bl
Link to the repository that hopefully makes our approach comprehensible and reproducible: https://github.com/gerthmicha/symbiont-sra
The screening procedure for the novel RNA viruses was performed as described in our paper. A fasta file of the reference sequences can be found below. Reads from all libraries matching any of the reference sequences were assembled with Megahit. I only found matches to ARV-1 or ARV-2. If the assembly resulted in more than one contig from ARV-1 or ARV-2, the contigs were first aligned to the reference sequenced and then merged manually into a single contig. The geographic origin of short read libraries was extracted manually from the metadata associated with the corresponding BioSample entries at NCBI. If the sample was not associated with a geographic origin, I used the geographic location of the submitter's affiliation. Trees were generated with IQ-TREE.
Aizenberg-Gershtein Y, Izhaki I, Halpern M (2013) Do honeybees shape the bacterial community composition in floral nectar? PLoS ONE 8, e67556.
Engel P, Kwong WK, McFrederick Q, Anderson KE, Barribeau SM, Chandler JA, Cornman RS, Dainat J, de Miranda JR, Doublet V, Emery O, Evans JD, Farinelli L, Flenniken ML, Granberg F, Grasis JA, Gauthier L, Hayer J, Koch H, Kocher S, Martinson VG, Moran N, Munoz-Torres M, Newton I, Paxton RJ, Powell E, Sadd BM, Schmid-Hempel P, Schmid-Hempel R, Song SJ, Schwarz RS, vanEngelsdorp D, Dainat B (2016) The bee microbiome: impact on bee health and model for evolution and ecology of host-microbe interactions. mBio 7, e02164–15.
Gerth M, Saeed A, White JA, Bleidorn C (2015) Extensive screen for bacterial endosymbionts reveals taxon-specific distribution patterns among bees (Hymenoptera, Anthophila). FEMS Microbiology Ecology 91, fiv047.
Kwong WK, Moran NA (2016) Gut microbial communities of social bees. Nature Reviews Microbiology 14, 374–384.
McFrederick QS, Thomas JM, Neff JL, Vuong HQ, Russell KA, Hale AR, Mueller UG (2017) Flowers and wild megachilid bees share microbes. Microbial Ecology 73, 188.
Remnant EJ, Shi M, Buchmann G, Blacquière T, Holmes EC, Beekman M, Ashe A (2017) A diverse range of novel RNA viruses in geographically distinct honey bee populations. Journal of Virology, doi: 10.1128/JVI.00158-17.
Salzberg SL, Hotopp J, Delcher A, Pop M, Smith D, Eisen M, Nelson W (2005) Serendipitous discovery of Wolbachia genomes in multiple Drosophila species. Genome Biology 6, R23.
Schwarz RS, Huang Q, Evans JD (2015) Hologenome theory and the honey bee pathosphere. Current Opinion in Insect Science 10, 1–7.
Siozios S, Cestaro A, Kaur R (2013) Draft genome sequence of the Wolbachia endosymbiont of Drosophila suzukii. Genome Announcements 1, e00032–13.
Yañez O, Gauthier L, Chantawannakul P, Neumann P (2016) Endosymbiotic bacteria in honey bees: Arsenophonus spp are not transmitted transovarially. FEMS Microbiology Letters 363, fnw147.
This is the website of Michael Gerth. I am a biologist with an interest in insects and the microbes within them. Click here to learn more.