Resources

Github

I keep a copy of the scripts I use on my my github page. This is also where I upload data, examples and/or explanations accompanying some of my publications.

Perl scripts

This is a collection of some of the Perl scripts that I use regularly. Although their usage should be self-explanatory, I have added short explanations and the "help" screens that appear when scripts are run without arguments.

Disclaimer: I would be happy if this is of help to anyone. However, everything comes without warranty. The code is probably not very elegant and some of the comments in it are in German. Please feel free to contact me in case of any questions.

gc_cov.pl

When working with genomic data of unculturable bacterial symbionts, one common challenge is separating bacterial from host data. GC coverage plots can be very helpful in this, especially if the bacterial community of the sample is not very complex.

This script creates very simple GC coverage plots from assembly files created with the assembler SPAdes, which can be considered the standard assembler for bacterial genomes. Conveniently, SPAdes provides kmer coverage estimates in the fasta definition lines of the assembly files. The script takes advantage of that and therefore does not require a mapping step (as opposed to the more sophisticated software BlobTools). This makes it very straightforward and fast to use, and in many cases this is all I need for a quick overview of the assembly. Taxonomy of contigs can also be displayed, but the annotation needs to be provided by the user.

    
~$ perl gc_cov.pl

This script will draw simple GC coverage plots from SPAdes assembly files. It uses the information on kmer coverage and contig lengths given in the SPAdes assembly files.
Please note that the coverage therefore does not correspond to actual nucleotide coverage, but to kmer coverage as estimated by SPAdes with the largest kmer used in assembly!
Requires Perl module 'Statistics::R' and R packages 'ggplot2' and 'viridis'.

USAGE	perl gc_cov.pl [OPTIONS] assembly.fasta
OPTIONS
  --taxonomy [FILE]	path to taxonomy file that will be used to annotate the plot (optional)
  --ntaxa [N]		the number of most frequently occuring taxa to annotate (optional, only in conjunction with --taxonomy)

Format for taxonomy file:
contigname1	taxonname
contigname2	taxonname
...

Example usage & output:

    
~$ perl gc_cov.pl --taxonomy taxonomy.txt --ntaxa 2 scaffolds.fasta

sra_download.pl

This script automates the download of fastq files from the European Nucleotide Archive. It uses regular NCBI SRA accession numbers as input. This is much faster & more convenient than using NCBI's sra-tools.

    
This script will download fastq files from the European Nucleotide Archive database.

USAGE	perl sra_dowload.pl [OPTIONS] accessions.txt
OPTIONS
  --ascp	Use ascp (Aspera secure copy) instead of wget to fetch the files
  --id		Path to Aspera private key file (defaults to ~/.aspera/connect/etc/asperaweb_id_dsa.openssh, only in conjunction with --ascp)

NOTES:	-Requires EITHER wget version 1.16 & up (https://ftp.gnu.org/gnu/wget/) OR ascp (https://downloads.asperasoft.com/en/downloads/50)
	-Input file should contain only a single SRA accession number per line and no white spaces (accession numbers are identical to NCBI's SRA accession numbers)
	-This script will access European servers, and thus is probably most efficient when used in Europe.
	-The script will sort the file with accession numbers, if an unsorted one was provided.

extract_seqs_by_id.pl

I use this script to extract any number of sequences from a fasta file.

    
~$ perl extract_seqs_by_id.pl

This script will extract those sequences from a fasta file that match (or do not match) provided IDs

USAGE  perl extract_seqs_by_id.pl [--exclude] input.fasta IDs.txt (one ID per line)
        --exclude creates fasta file containing all sequences that DO NOT match provided IDs
        (default: create fasta file containing all sequences that DO match provided IDs)

blast2fasta.pl

This is my preferred way to extract those parts of a fasta sequence that were identified with a BLAST+ search. When the match is on the minus strand, the script automatically creates the reverse-complement of the hit. Requires tabular BLAST+ output.

    
~$ perl blast2fasta.pl

This script will parse BLAST hits from a fasta file.
USAGE  perl blast2fasta.pl [BLAST output] [fasta file]

split_partitions.pl

When doing large phylogenetic analyses, partitioning the dataset is often required. After the best partitioning scheme was determined (e.g., with PartionFinder or IQTREE), I sometimes want to perform some analysis for each partition separately. This script is for the purpose of creating single partion fasta files from a supermatrix.

    
~$ perl split_partions.pl

This script will create single fasta files out of a larger alignment based on a partitioning scheme.
USAGE   perl split_partitions.pl [fasta file] [partition file]

Example of a partition file:
    
partition1=1-100 201-300
partition2=101-200
...

ncbi2fasta.pl

Using NCBI databases can be frustrating. When I need to download many individual nucleotide sequences at once, I use this script. Requires network connection.

    
~$ perl ncbi2fasta.pl

This script will fetch nucleotide sequences from NCBI GenBank based on their accession numbers.
Please provide path to file containing accession numbers (one per line or separated by commas).
Output will be directed to STDOUT!

ry_coder.pl

Recodes a nucleotide alignment: each purine (A & G) is recoded to R, and each pyrimidine (C & T) to Y. Sometimes, reducing the information content of an alignment like this can help to minimize the impact of long-branched taxa on phylogenetic inference.

    
~$ perl ry_coder.pl

This script turns a regular nucleotid alignment into an RY-coded alignment.
Please provide path to a fasta file!

filter_length.pl

A simple script that filters fasta sequences by length.

    
~$ perl filter_length.pl

This script will filter fasta files by length. 
USAGE   perl filter_length.pl [minlength] [maxlength] input.fasta

seqlength.pl

Display sequence lengths for each sequence in fasta file.

    
~$ perl seqlength.pl

This script shows the length of each sequence of a given fasta file.
USAGE perl seqlength.pl input.fasta

gc.pl

Calculate GC-content for each sequence in a fasta file.

    
~$ perl gc.pl

This script calculates GC values from each sequence of a given fasta file.
USAGE perl gc.pl input.fasta

assembly_stats.pl

Often I want to quickly look at assembly statistics, without having to use a more sophisticated tool such as QUAST. This script calculates N75, N50, N25 values and outputs the number of contigs, length of the longest contig, total assembly size, and GC content.

Two assemblies can be compared with outputs being directly shown on the screen. Also, minimal contig lengths can be specified, as this may differ between assemblers. Finally, a number of graphs helping to interpret assembly statistics can be generated (using R).

It should be noted that I use this script mainly for bacterial assemblies or sometimes insect assemblies, and that it may not be efficient for very large assembly files.

    
~$ perl assembly_stats.pl

This script will calculate and display basic assembly statistics.

USAGE    perl assembly_stats.pl [OPTIONS] assembly.fasta [assembly2.fasta]
OPTIONS
   --min_contig_length [N]      minimal size of contigs to be included in statistics (optional)
   --pdf                        generate a PDF file containing three descriptive plots (requires R installation in system path)
   --compare                    calculate assembly statistics for two assemblies in parallel

Example usage & output:

    
~$ perl assembly_stats.pl --compare --pdf filter15.fas filter25.fas

                                    filter15                filter25
N75                                     2015                     753
N50                                     3857                    1278
N25                                     6559                    2054
Number of contigs                     91,187                 175,719
Longest contig                        31,565                  24,230
Total assembly size              196,508,114             170,814,936
Average GC-content                    40.64%                  41.66%

R scripts

I use mainly Bash, Perl, and R for my work, but I have only recently started to realize that executable R scripts can be very handy, especially for plotting. Find below my first attempts.

pbplot.R

This script plots the parameters sampled during a PhyloBayes run 'on the go'. Although this is not a very sophisticated way of assessing convergence between chains, I find this helpful to get an idea if the chains are on the same track, and also to assess appropriate number of burnin generations. As such, it is very similar to some of the plotting functions in Tracer, only that running the script should be quicker and more convenient in many cases.

Must be run in a directory containing the trace files generated during at least one, and up to four PhyloBayes runs. Plots can be displayed on screen (default) or saved as pdf.

    
~$ Rscript pbplot.R

Usage: /usr/local/bin/pbplot [options]


Options:
	-b NUMBER, --burnin=NUMBER
		number of burnin iterations to discard [required]

	-f, --file
		Shall plot be saved to pdf file instead of disp? [default: FALSE]

	-h, --help
		Show this help message and exit


Error: 
NOTE:	This script plots the parameters estimated during a phylobayes run.
	It should be run in a directory containing 1–4 trace files created by phylobayes.
	Requires R packages ggplot2, optparse, and gridExtra.

Example usage & output:

    
~$ Rscript pbplot.R --file --burnin=0
Found 2 trace files!
Writing to 'plots.pdf'