Glossary of terms (with links to reference material)

The following is a list of terminology used on these pages, as well as some references to other pages which describe the terminology and methods used in this project in more detail. This list is partially derived from questions received from participants regarding terminology or methods they have been unclear on.

Annotation - (Yes, we're starting with the basics!) Notation added to sequence data including where genes, regulatory elements, etc. are located, and descriptions of probable or known functions of said genes (etc.).

BLAST - Basic Local Alignment Search Tool - a set of similarity search programs used to identify sequences in a database that share similarity with your query sequence. This method uses a heuristic algorithm that seeks local, as opposed to global, alignments and can therefore identify isolated regions of similarity. For more information see NCBI's BLAST overview, BLAST 2.0 information, the BLAST manual, Altschul et al., 1990, a list of further references, and the brief description of each of the BLAST programs.

BLASTP, BLASTX, TBLASTN, etc. - see NCBI's summary of the different BLAST programs and the BLAST definition shown above for more information.

COGs - clusters of orthologous groups. COGs are groups of related protein sequences that are present in at least 3 phylogenetic lineages (8 complete genomes representing 6 major phylogenetic lineages have been used for the analysis so far). Each COG corresponds to an ancient conserved domain, since it must be present in at least 3 of the deeply branching phylogenetic lineages. See the COGs web site for more details.

contig - a contiguous region of sequence constructed by aligning many sequence "reads" (one "read" is the data generated from one sequencing reaction).

E. coli/Blattner functional categories - Categories used by the Blattner Laboratory to group E. coli genes by function.  For more information, see Blattner, F.R. et al. The complete genome sequence of Escherichia coli K-12. Science 277(5331), 1453-1462 (1997). Please note that E. coli functional groups will not necessarily be used for the final functional group categorization of P. aeruginosa genes. Monica Riley's classification, used by TIGR and other Genome Centers, will likely be adapted for use with P. aeruginosa (Monica Riley. 1993. Microbiol. Rev. 57, 862).

Expect value (BLAST Expect value) - Sometimes referred to as a probability value. Estimates the statistical significance of a match by specifying the number of matches with a given score that are expected by chance in a search of a database of the given size. An Expect value of two for a given score would indicate that two matches with this score are expected purely by chance, so a lower Expect value indicates a more significant match. The Expect value is often set at a certain threshold for reporting matches against database sequences.
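
As a rough illustration of how the Expect value depends on the score and on the size of the search, the sketch below uses the Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m is the query length, and n is the total length of the database. The function name and the example lengths are hypothetical, and the edge-effect corrections that BLAST applies to m and n are left out, so the numbers will not match real BLAST output exactly.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate number of chance matches scoring at least bit_score.

    Uses E = m * n * 2**(-S'), without the effective-length
    corrections that BLAST itself applies (a simplifying assumption).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 250-residue query against 10 million residues.
    for bits in (20, 30, 40, 50):
        print(bits, expect_value(bits, 250, 10_000_000))
```

Under this relation, each additional bit of score halves the Expect value, and doubling the database size doubles it.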

filtering - Masks off segments of your BLAST query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993). The segments are replaced with XXXXX's or NNNNN's, as viewed in your BLAST output. For more information, see NCBI's description of filtering.

hit or hits - sequence(s) in a database that is (are) found to be similar to a given query sequence - also used as a verb.

in silico - performed on a computer or by computer analysis (i.e., computer generated), as opposed to in vivo or in vitro.

ORF - open reading frame within a sequence that may be a gene, but has not yet been demonstrated as such.
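
To make the ORF definition concrete, here is a minimal sketch that scans the three forward reading frames of a DNA sequence for a stretch that begins with ATG and runs through the first in-frame stop codon. The function name and the min_codons parameter are hypothetical, and the sketch deliberately ignores the reverse strand, alternative start codons, and nested start sites, all of which real gene-finding programs take into account.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based with the end exclusive. An ORF here starts
    at the first ATG seen in a frame and ends with the first in-frame
    stop codon; min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close it and keep scanning this frame
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATAGGG"  # made-up sequence for illustration
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```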

orthologs - genes in different organisms that are derived from a common ancestral gene and that diverged because the organisms they are associated with diverged (i.e., through speciation). They tend to have similar functions.

paralogs - genes derived from a common ancestral gene that duplicated within an organism and then subsequently diverged. They tend to have differing functions.

probability (for BLAST analyses) - see Expect value.

tab-delimited file format - a basic text file format that can be imported into any spreadsheet program, such as Microsoft Excel. This format uses tabs to separate the fields in a file. It can also be opened in any text file viewing program; however, this is not advised, as the columns of data will not be lined up. In Excel, a tab-delimited file can be opened directly by using the File-Open commands and then clicking on "Finish" when the Text Import Wizard appears. The resulting table columns can then be resized as needed.
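
For those who prefer to process such a file with a script rather than a spreadsheet, the following is a minimal sketch using Python's standard csv module with a tab as the delimiter. The column names and rows in SAMPLE are made up for illustration and do not reflect the actual layout of the project's files.

```python
import csv
import io

# Made-up tab-delimited content; the column names are illustrative only.
SAMPLE = (
    "gene_id\tstart\tend\tdescription\n"
    "ORF_A\t100\t400\thypothetical protein\n"
    "ORF_B\t450\t900\tprobable transporter\n"
)


def read_tab_delimited(handle):
    """Return the rows of a tab-delimited file as dicts keyed by the header line."""
    return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    for row in read_tab_delimited(io.StringIO(SAMPLE)):
        print(row["gene_id"], row["start"], row["end"], row["description"])
```

To read a file on disk instead of the in-memory sample, pass an open file handle, for example one obtained with open(path, newline="").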

query - the sequence you are using to perform your search of a database.

score (BLAST score) - The score in a BLAST output is usually given in 'bits'. The bit score is defined as: S' (bits) = [lambda * S (raw) - ln K] / ln 2, where lambda and K are Karlin-Altschul parameters. The expression of the score in terms of bits makes it independent of the scoring system used (i.e., which matrix). A more intuitive way to rank results involves the use of the Expect value (see above definition).
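
The bit score formula above can be written out as a short function. This is only a sketch: the function name is hypothetical, and the lambda and K values in the example are illustrative placeholders, since the real values depend on the scoring matrix and gap penalties and are reported at the end of the BLAST output.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


if __name__ == "__main__":
    # Placeholder Karlin-Altschul parameters, for illustration only.
    lam, k = 0.318, 0.134
    for raw in (50, 100, 200):
        print(raw, round(bit_score(raw, lam, k), 1))
```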

subject - a sequence in the database that shares similarity to your query sequence.

word size threshold (BLAST word size) - this refers to the neighborhood word score threshold (Altschul et al., 1990). A critical part of the BLAST search method is that the initial word hits act as seeds for initiating searches to find longer regions of similarity. A higher threshold means that larger words are allowed to seed the search for regions of similarity. The larger this word is allowed to become, the faster your search; however, the sensitivity of your search will be lower, because fewer seeds are found.
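
The seeding idea can be illustrated with a simplified sketch that finds exact word matches between a query and a subject sequence. The function names are hypothetical, and this is not BLAST's actual implementation: protein BLAST also accepts similar "neighborhood" words that score above the threshold, whereas this sketch only looks for identical words.

```python
def query_words(query, word_size):
    """Return (offset, word) for every overlapping word of length word_size."""
    return [(i, query[i:i + word_size])
            for i in range(len(query) - word_size + 1)]


def exact_seed_hits(query, subject, word_size):
    """Return (query_offset, subject_offset) pairs where a word matches exactly."""
    index = {}
    for i, word in query_words(query, word_size):
        index.setdefault(word, []).append(i)
    hits = []
    for j, word in query_words(subject, word_size):
        for i in index.get(word, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Made-up sequences; compare how many seeds each word size produces.
    for w in (3, 4):
        print(w, exact_seed_hits("ACGTAC", "TTACGT", w))
```

With these made-up sequences, the larger word size yields fewer seed hits, which is the source of both the speed gain and the loss of sensitivity.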
 



Pseudomonas aeruginosa Community Annotation Project
Last updated: January 12, 1999
Copyright © 1999