Mastering seeds for genomic size nucleotide BLAST searches

Affiliation

  • 1 Institute of Molecular Evolutionary Genetics and Department of Biology, The Pennsylvania State University, 514 Mueller Lab, University Park, PA 16802, USA.
  • PMID: 14627826
  • PMCID: PMC290255
  • DOI: 10.1093/nar/gkg886

Free PMC article

Abstract

One of the most common activities in bioinformatics is the search for similar sequences. These searches are usually carried out with the help of programs from the NCBI BLAST family. As the majority of searches are routinely performed with default parameters, a question that should be addressed is how reliable the results obtained using the default parameter values are, i.e. what fraction of potential matches have been retrieved by these searches. Our primary focus is on the initial hit parameter, also known as the seed or word, used by the NCBI BLASTn, MegaBLAST and other similar programs in searches for similar nucleotide sequences. We show that the use of default values for the initial hit parameter can have a big negative impact on the proportion of potentially similar sequences that are retrieved. We also show how the hit probability of different seeds varies with the minimum length and similarity of sequences desired to be retrieved and describe methods that help in determining appropriate seeds. The experimental results described in this paper illustrate situations in which these methods are most applicable and also show the relationship between the various BLAST parameters.
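To make the idea of seed hit probability concrete, here is a minimal sketch rather than the authors' implementation. It assumes an ungapped alignment in which each position matches independently with probability equal to the alignment similarity, and it treats a "hit" as a run of at least `seed_width` consecutive matches. The function names and the independence model are assumptions introduced for illustration. The computation walks a small automaton whose states are the current run length of matches.

```python
def contiguous_seed_hit_probability(seed_width, alignment_length, similarity):
    """Probability that an ungapped alignment contains at least one run of
    `seed_width` consecutive matches.

    Assumption (not from the paper): positions match independently with
    probability `similarity`.
    """
    if seed_width <= 0:
        raise ValueError("seed_width must be positive")
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if alignment_length < seed_width:
        return 0.0

    # state[k] = probability that no hit has occurred yet and the current
    # run of consecutive matches has length k (0 <= k < seed_width).
    state = [0.0] * seed_width
    state[0] = 1.0
    hit = 0.0  # absorbing "seed found" state

    for _ in range(alignment_length):
        nxt = [0.0] * seed_width
        for run, prob in enumerate(state):
            if prob == 0.0:
                continue
            nxt[0] += prob * (1.0 - similarity)   # mismatch resets the run
            if run + 1 == seed_width:
                hit += prob * similarity          # run completes the seed
            else:
                nxt[run + 1] += prob * similarity  # run grows by one
        state = nxt
    return hit


def largest_seed_meeting_threshold(alignment_length, similarity, threshold=0.95):
    """Largest contiguous seed width whose hit probability is >= threshold.

    Returns 0 if even a width-1 seed falls below the threshold.
    """
    best = 0
    for width in range(1, alignment_length + 1):
        if contiguous_seed_hit_probability(width, alignment_length, similarity) < threshold:
            break  # hit probability only decreases as the seed grows
        best = width
    return best
```

Under this model, calling `contiguous_seed_hit_probability(11, 100, 0.8)` and `contiguous_seed_hit_probability(28, 100, 0.8)` compares a short seed against a long one for 100-base alignments at 80% similarity. The widths 11 and 28 are used here as the commonly cited BLASTn and MegaBLAST word sizes, so check the defaults of your own BLAST version before relying on them.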

Figures

(A) Graphical representation of the deterministic finite automaton (DFA) that recognizes…

Variation of contiguous seed hit probability with alignment similarity (the interval between 50…

Variation of contiguous seed hit probability with alignment size (distributions for alignment sizes…

The actual number of hits that are generated by running MegaBLAST and the…

The seed size shown is the largest seed with hit probability >0.95 for…
