README

Here are some instructions on how to set up BLAST on your local machine (as opposed to run it at NCBI on the web). It is useful if you have lots of sequences to BLAST or if you want to manipulate its output for easier processing downstream.

I included a short presentation on BLAST and the syntax of its run, as well as an example query and a database (see details below) so that you could try it yourself. What I do not cover in these instructions is how to prepare a custom BLAST database from your sequences (if you can’t rely on ready-made databases at NCBI), but it is covered in section 7 of the description of BLAST databases from NCBI (included here in full, below).

Important update

Update to the update :-)

The issue described below has been resolved and is described by Madden TL et al. in Bioinformatics (2019), 35:15, 2699–2700. Briefly, the behaviour was a bug affecting specific cases (but not all cases) of BLAST+. If -max_target_seqs was set to 1 but multiple equivalent hits were present in the database, the order of these hits in the database decides which one is reported by BLAST+. Current versions of BLAST+ provide a warning when -max_target_seqs is set below 5.

End of update to the update

Apparently, as of September 2018, the local version of BLAST has a function that behaves differently from the identically named function in the online version of BLAST. Option -max_target_seqs specifies N number of sequences returned in the output. In the local BLAST, it simply returns the N hits above the e value threshold, irrespectively of the BLAST score or the actual e value. In the online version of the BLAST, the same function returns the N top hits by score (which is what many people assume it should do). This has been very recently reported by Shah N. et al. in Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. See also post by Peter Cock What BLAST’s max-target-sequences doesn’t do and Sujai Kumar’s report at Github for a bit more nuanced description of the behaviour. I also note that Peter’s and Sujai’s report do not suggest a different behaviour between online and local versions of the BLAST.

In my guidelines below, the local BLAST is run with -max_target_seqs set to 1. Contrary to my assumption, it does not necessarily return the best hit. As this behaviour is still being investigated, the safest solution currently is to run your local blast several times, with -max_target_seqs set to different values (depending on the size of your database) and then rank the output by score and e value for every query ID. Proceed with caution.

Download and install the programme

Prepare the query

The query for this exercise is a set of 1000 sequences of all unique oligonucleotide probes from Agilent’s D. melanogaster gene expression microarray in a multi-fasta format. The query is included in this repository directly and is called blast_query.tar.gz. Uncompressed file is 34 kB. Download the query and unzip it into a folder (let’s call the folder blast_pratice for the purpose of this exercise).

Prepare the database

The database contains list of all known and predicted transcripts from D. melanogaster in the Ensembl format, as downloaded in March 2017. The database is available to download from Figshare and the compressed file is approximately 100 MB. Download it and unzip into the blast_practice folder - it will unzip into its own folder called Dmel_transcripts_Ensembl.

I originally created the database to map Agilent probes against known and predicted transcripts to eliminate probes that do not align well. I used this data in the Bryk et al. 2017 paper Transcriptional effects of a positive feedback circuit in Drosophila melanogaster. If I have some time, I may reduce both the queries and the database for easier sharing and downloading.

If you want to download BLAST databases yourself, the best way to do it is to use NCBI’s ftp server. I created a little video guide on how to download BLAST executables and databases: it is available on Figshare.

BLAST off!

To run the BLAST (assuming it is installed), enter the blast_practice folder and run a megablast of the query against the database with the following command:

blastn -task megablast -db Dmel_transcripts_Ensembl/Dmel_genes_all.fa -query blast_query.txt -dust no -max_target_seqs 1 -outfmt "6 qseqid sseqid evalue pident stitle" -out outputfile.txt

The various options are explained in the presentation and in the section BLAST options below.

BLAST databases available at NCBI

This is copied directly from the NCBI website and it was up to date as of March 6th, 2017.

                         The BLAST Databases
                    Last updated on March 6, 2017

This document describes the BLAST databases available on the NCBI FTP site under
the /blast/db directory. The direct URL is ftp://ftp.ncbi.nlm.nih.gov/blast/db

1. Quick Start
    * Get all numbered files for a database with the same base name:
      Each of these files represents a subset (volume) of that database,
      and all of them are needed to reconstitute the database.
    * After extraction, there is no need to concatenate the resulting files:
      Call the database with the base name, for nr database files, use "-db nr".
    * For easy download, use the update_blastdb.pl script from the blast+ package.
    * Incremental update is not available.

2. General Introduction

BLAST search pages under the Basic BLAST section of the NCBI BLAST home page
(http://blast.ncbi.nlm.nih.gov/) use a standard set of BLAST databases for
nucleotide, protein, and translated BLAST searches.  These databases are made
available as compressed archives of pre-formatted form) and can be donwloaded from
the /db directory of the BLAST ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/).
The FASTA files reside under the /FASTA subdirectory.

The pre-formatted databases offer the following advantages:
    * Pre-formatting removes the need to run makeblastdb;
    * Species-level taxonomy ids are included for each database entry;
    * Databases are broken into smaller-sized volumes and are therefore easier
      to download;
    * Sequences in FASTA format can be generated from the pre-formatted databases
      by using the blastdbcmd utility;
    * A convenient script (update_blastdb.pl) is available in the blast+ package
      to download the pre-formatted databases.

Pre-formatted databases must be downloaded using the update_blastdb.pl script or
via FTP in binary mode. Documentation for this script can be obtained by running
the script without any arguments; Perl installation is required.

The compressed files downloaded must be inflated with gzip or other decompress
tools. The BLAST database files can then be extracted out of the resulting tar
file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on
Windows and Macintosh platforms, respectively.

Large databases are formatted in multiple one-gigabyte volumes, which are named
using the basename.##.tar.gz convention. All volumes with the same base name are
required. An alias file is provided to tie individual volumes together so that
the database can be called using the base name (without the .nal or .pal
extension). For example, to call the est database, simply use "-db est" option
in the command line (without the quotes).

Additional BLAST databases that are not provided in pre-formatted formats may
be available in the FASTA subdirectory. For other genomic BLAST databases,
please check the genomes ftp directory at:
    ftp://ftp.ncbi.nlm.nih.gov/genomes/

3. Contents of the /blast/db/ directory

The pre-formatted BLAST databases are archived in this directory. The names of
these databases and their contents are listed below.

+-----------------------------+------------------------------------------------+
 File Name                    | Content Description
+-----------------------------+------------------------------------------------+
16SMicrobial.tar.gz	          | Bacterial and Archaeal 16S rRNA sequences from
                                BioProjects 33175 and 33117
FASTA/                        | Subdirectory for FASTA formatted sequences
README                        | README for this subdirectory (this file)
Representative_Genomes.*tar.gz| Representative bacterial/archaeal genomes database
cdd_delta.tar.gz              | Conserved Domain Database sequences for use with
                                stand alone deltablast
cloud/	                      | Subdirectory of databases for BLAST AMI; see
                                http://1.usa.gov/TJAnEt
env_nr.*tar.gz                | Protein sequences for metagenomes
env_nt.*tar.gz                | Nucleotide sequences for metagenomes
est.tar.gz                    | This file requires est_human.*.tar.gz,
                                est_mouse.*.tar.gz, and est_others.*.tar.gz files
                                to function. It contains the est.nal alias so that
                                searches against est (-db est) will include
                                est_human, est_mouse and est_others.
est_human.*.tar.gz            | Human subset of the est database from the est
                                division of GenBank, EMBL and DDBJ.
est_mouse.*.tar.gz            | Mouse subset of the est databasae
est_others.*.tar.gz           | Non-human and non-mouse subset of the est database
gss.*tar.gz                   | Sequences from the GSS division of GenBank,
                                EMBL, and DDBJ
htgs.*tar.gz                  | Sequences from the HTG division of GenBank, EMBL,
                                and DDBJ
human_genomic.*tar.gz         | Human RefSeq (NC_### ##) chromosome records with
                                gap adjusted concatenated NT_ contigs
nr.*tar.gz                    | Non-redundant protein sequences from GenPept,
                                Swissprot, PIR, PDF, PDB, and NCBI RefSeq
nt.*tar.gz                    | Partially non-redundant nucleotide sequences from
                                all traditional divisions of GenBank, EMBL, and DDBJ
                                excluding GSS,STS, PAT, EST, HTG, and WGS.
other_genomic.*tar.gz         | RefSeq chromosome records (NC_### ##) for non-human
                                organisms
pataa.*tar.gz                 | Patent protein sequences
patnt.*tar.gz                 | Patent nucleotide sequences. Both patent databases
                                are directly from the USPTO, or from the EPO/JPO
                                via EMBL/DDBJ
pdbaa.*tar.gz                 | Sequences for the protein structure from the
                                Protein Data Bank
pdbnt.*tar.gz                 | Sequences for the nucleotide structure from the
                                Protein Data Bank. They are NOT the protein coding
                                sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz        | NCBI genomic reference sequences
refseq_protein.*tar.gz        | NCBI protein reference sequences
refseq_rna.*tar.gz            | NCBI Transcript reference sequences
sts.*tar.gz                   | Sequences from the STS division of GenBank, EMBL,
                                and DDBJ
swissprot.tar.gz              | Swiss-Prot sequence database (last major update)
taxdb.tar.gz                  | Additional taxonomy information for the databases
                                listed here
                              | providing common and scientific names
tsa_nt.*tar.gz                | Sequences from the TSA division of GenBank, EMBL,
                                and DDBJ
vector.tar.gz                 | Vector sequences from 2010, see Note 2 in section 4.
+-----------------------------+------------------------------------------------+

4. Contents of the /blast/db/FASTA directory

This directory contains FASTA formatted sequence files. The file names
and database contents are listed below. These files must be unpacked and
processed through blastdbcmd before they can be used by the BLAST programs.

+-----------------------+-----------------------------------------------------+
|File Name              | Content Description                                 |
+-----------------------+-----------------------------------------------------+
alu.a.gz                | translation of alu.n repeats
alu.n.gz                | alu repeat elements (from 2003)
drosoph.aa.gz           | CDS translations from drosophila.nt
drosoph.nt.gz           | genomic sequences for drosophila (from 2003)
env_nr.gz*              | Protein sequences for metagenomes, taxid 408169
env_nt.gz*              | Nucleotide sequences for metagenomes, taxid 408169
est_human.gz*           | human subset of the est database (see Note 1)
est_mouse.gz*           | mouse subset of the est database
est_others.gz*          | non-human and non-mouse subset of the est database
gss.gz*                 | sequences from the GSS division of GenBank, EMBL,
                          and DDBJ
htgs.gz*                | sequences from the HTG division of GenBank, EMBL,
                          and DDBJ
human_genomic.gz*       | human RefSeq (NC_### ##) chromosome records
                          with gap adjusted concatenated NT_ contigs
igSeqNt.gz              | human and mouse immunoglobulin variable region
                          nucleotide sequences
igSeqProt.gz            | human and mouse immunoglobulin variable region
                          protein sequences
mito.aa.gz              | CDS translations of complete mitochondrial genomes
mito.nt.gz              | complete mitochondrial genomes
nr.gz*                  | non-redundant protein sequence database with entries
                           from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq
nt.gz*                  | nucleotide sequence database, with entries from all
                          traditional divisions of GenBank, EMBL, and DDBJ;
                          excluding bulk divisions (gss, sts, pat, est, htg)
                          and wgs entries. Partially non-redundant.
other_genomic.gz*       | RefSeq chromosome records (NC_### ##) for organisms
                          other than human
pataa.gz*               | patent protein sequences
patnt.gz*               | patent nucleotide sequences. Both patent sequence
                          files are from the USPTO, or EPO/JPO via EMBL/DDBJ
pdbaa.gz*               | protein sequences from pdb protein structures
pdbnt.gz*               | nucleotide sequences from pdb nucleic acid
                          structures. They are NOT the protein coding
                          sequences for the corresponding pdbaa entries.
sts.gz*                 | database for sequence tag site entries
swissprot.gz*           | swiss-prot database (last major release)
vector.gz               | vector sequences from 2010. (See Note 2)
yeast.aa.gz             | protein translations from yeast.nt
yeast.nt.gz             | yeast genomes (from 2003)
+-----------------------+---------------------------------------------------+
NOTE:
(1) NCBI does not provide the complete est database in FASTA format. One
    needs to get all three subsets (est_human, est_mouse, and est_others
    and concatenate them into the complete est fasta database).
(2) For screening for vector contamination, use the UniVec database:
    ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/
 *  marked files have pre-formatted counterparts.

5. Database updates

The BLAST databases are updated regularly. There is no established incremental
update scheme. We recommend downloading the complete databases regularly to
keep their content current.

6. Non-redundant defline syntax

The non-redundant databases are nr, nt and pataa. Identical sequences are
merged into one entry in these databases. To be merged two sequences must
have identical lengths and every residue at every position must be the
same.  The FASTA deflines for the different entries that belong to one
record are separated by control-A characters invisible to most
programs. In the example below both entries Q57293.1 and AAB05030.1
have the same sequence, in every respect:

>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE

Individual sequences are now identifed simply by their accession.version.

For databases whose entries are not from official NCBI sequence databases,
such as Trace database, the gnl| convention is used. For custom databases,
this convention should be followed and the id for each sequence must be
unique, if one would like to take the advantage of indexed database,
which enables specific sequence retrieval using blastdbcmd program included
in the blast executable package.  One should refer to documents
distributed in the standalone BLAST package for more details.

7. Formatting a FASTA file into a BLASTable database

FASTA files need to be formatted with makeblastdb before they can be used in local
blast search. For those from NCBI, the following makeblastdb commands are
recommended:

For nucleotide fasta file:   makeblastdb -in input_db -dbtype nucl -parse_seqids
For protein fasta file:      makeblastdb -in input_db -dbtype prot -parse_seqids

In general, if the database is available as BLAST database, it is better to use the
preformatted database.

8. Technical Support

Questions and comments on this document and NCBI BLAST related questions
should be sent to the blast-help group at:
      blast-help@ncbi.nlm.nih.gov

For information about other NCBI resources/services, please send email to
NCBI User Service at:
      info@ncbi.nlm.nih.gov

BLAST options

DESCRIPTION

Nucleotide-Nucleotide BLAST 2.2.31+

USAGE

blastn [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-perc_identity float_value] [-qcov_hsp_perc float_value] [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value] [-template_type type] [-template_length int_value] [-dust DUST_options] [-filtering_db filtering_database] [-window_masker_taxid window_masker_taxid] [-window_masker_db window_masker_db] [-soft_masking soft_masking] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-use_index boolean] [-index_name string] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version]

OPTIONAL ARGUMENTS

Input query options

General search options

BLAST-2-Sequences options

Formatting options

Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. The supported format specifiers are:

When - -outfmt options are not specified, the default value is: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore, which is equivalent to the keyword std.

Query filtering options

Restrict search or results

Discontiguous MegaBLAST options

Statistical options

Search strategy options

Extension options

Miscellaneous options