1.
In any segment of DNA, typically only one frame in one strand is used for
a protein-coding gene. That is, each double-stranded segment of DNA is
generally part of only one gene.
2.
Genes rarely overlap by more than a few bp, although overlaps of up to
about 30 bp are legitimate.
3.
The gene density in phage genomes is very high, so genes tend to be
tightly packed. Thus, large non-coding gaps between genes are uncommon.
4.
Most protein-coding genes will have coding potential predicted by
Glimmer, GeneMarkS (self), or GeneMarkHost (version 2.5). Start sites are
chosen to include all coding potential. These are, by far, the strongest
pieces of data for predicting genes.
5.
Many phage genes are unique and will not have homologues in any
database. This is OK, and a lack of similar sequences in databases should
not be the sole reason for removing a Glimmer or GeneMark gene prediction
from an annotation.
6.
Some protein-coding genes may not be predicted by Glimmer or GeneMark.
Therefore, all ORFs over 120 bp that fall into gaps between predicted
genes in the annotation should be carefully evaluated for similarity to
genes in the databases. In this case, evidence such as strong sequence
similarity to previously annotated genes in GenBank or phagesdb.org, or a
likely functional prediction with HHPred, is sufficient for inclusion in
the annotation. If you have no data to support the filling of a gap, do
not fill the gap.
7.
If there are two genes transcribed in opposite directions whose start
sites are near one another, there typically has to be space between them
for transcription promoters in both directions. This usually requires a
gap of at least 50 bp.
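The 50 bp check can be sketched in a few lines of Python. This is only an illustration: the Gene record, its field names, and the 1-based inclusive coordinates are assumptions made for the example, not part of any annotation tool.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record for this sketch."""
    name: str
    left: int    # leftmost coordinate, 1-based, inclusive
    right: int   # rightmost coordinate, 1-based, inclusive
    strand: str  # "+" (forward) or "-" (reverse)


MIN_DIVERGENT_GAP = 50  # bp, taken from the guideline above


def divergent_gaps(genes):
    """Yield (left_gene, right_gene, gap_bp) for adjacent divergent pairs.

    A reverse-strand gene followed by a forward-strand gene has both start
    sites facing the same intergenic region.
    """
    ordered = sorted(genes, key=lambda g: g.left)
    for a, b in zip(ordered, ordered[1:]):
        if a.strand == "-" and b.strand == "+":
            yield a, b, b.left - a.right - 1


def flag_tight_divergent_pairs(genes, min_gap=MIN_DIVERGENT_GAP):
    """Return (name, name, gap) for divergent pairs closer than min_gap."""
    return [(a.name, b.name, gap)
            for a, b, gap in divergent_gaps(genes) if gap < min_gap]
```

A flagged pair is a prompt to re-examine the two start sites, not proof that either call is wrong.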
8.
Protein-coding genes are generally at least 120 bp (40 codons) long.
There are a small number of exceptions. Genes below about 200 bp require
careful examination.
9.
Switches in gene orientation (from forward to reverse, or vice versa) are
relatively rare. In other words, it is common to find groups of genes
transcribed in the same direction.
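As a rough illustration, orientation switches can be counted from the list of strands in genome order. The "+"/"-" encoding and the function names are assumptions of this sketch.

```python
from itertools import groupby


def orientation_switches(strands):
    """Count places where consecutive genes are on different strands."""
    return sum(1 for a, b in zip(strands, strands[1:]) if a != b)


def strand_runs(strands):
    """Collapse the strand list into (strand, number_of_genes) runs."""
    return [(strand, len(list(group))) for strand, group in groupby(strands)]
```

A draft annotation with many short runs is worth a second look, since each switch should be supported by evidence.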
10.
Each protein-coding gene ends with a stop codon (TAG, TGA, or TAA).
11.
Each protein-coding gene starts with an initiation codon: ATG, GTG, or
TTG. Note that ATG accounts for 68% of the starts called in the
Actinobacteriophage database of phage genes, GTG for 26%, and TTG for 7%.
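The start and stop codon rules lend themselves to a simple sanity check. The sketch below assumes the genome is a plain string of A, C, G, and T and that gene coordinates are 1-based and inclusive; the function names are made up for this example.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAG", "TGA", "TAA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def coding_sequence(genome, left, right, strand):
    """Return the gene read 5' to 3' on its own strand."""
    seq = genome[left - 1:right].upper()
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


def check_gene_call(genome, left, right, strand):
    """Return a list of problems found; an empty list means none were found."""
    cds = coding_sequence(genome, left, right, strand)
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if cds[:3] not in START_CODONS:
        problems.append(f"start codon {cds[:3]} is not ATG, GTG, or TTG")
    if cds[-3:] not in STOP_CODONS:
        problems.append(f"last codon {cds[-3:]} is not TAG, TGA, or TAA")
    # An in-frame stop before the end means the ORF is shorter than called.
    for i in range(3, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at codon {i // 3 + 1}")
            break
    return problems
```

This only checks that a call is well formed; it says nothing about whether the chosen start is the best one.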
12.
An important task is choosing between different possible translation
initiation (i.e., start) codons. The best choice of start site is
gene-specific, and gene function and synteny must be carefully
considered. As phage genes are frequently co-transcribed and
co-translated, less weight may be given to optimal ribosome binding site
sequences in start site selection. Identifying the correct start site is
not always easy and is predicated on the following sub-principles:
12a.
The relationship to the closest upstream gene is important. Usually,
there is neither a large gap nor a large overlap (i.e., more than about
7 bp). If the genes are part of an operon, a 1 bp or 4 bp overlap (ATGA
in the 4 bp case), where a start codon overlaps the stop codon of the
upstream gene, is preferred by the ribosome. Therefore, RBS scores may
have little bearing in this type of gene arrangement. (The 4 bp overlap
is commonly found in the genes of the genomes in the Actinobacteriophage
database. This is demonstrated by the data: TGA is the most commonly used
stop codon at 65%, with TAG at 17% and TAA at 18%.)
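The relationship between a candidate start and the upstream gene reduces to simple coordinate arithmetic. The sketch below covers forward-strand genes only and assumes 1-based, inclusive coordinates; the function names and the return format are invented for illustration.

```python
def describe_junction(upstream_right, start_left):
    """Relate a forward-strand candidate start to the upstream gene's last base.

    Returns ("overlap", n) when the genes share n bases, otherwise
    ("gap", n), where n is the number of bases between them
    (0 means the genes are directly adjacent).
    """
    shared = upstream_right - start_left + 1
    if shared > 0:
        return ("overlap", shared)
    return ("gap", start_left - upstream_right - 1)


def is_coupled_overlap(kind, n):
    """True for the 1 bp or 4 bp overlaps described for genes in an operon."""
    return kind == "overlap" and n in (1, 4)


def needs_review(kind, n, max_overlap=7):
    """Flag overlaps larger than about 7 bp; gap size is left to judgment."""
    return kind == "overlap" and n > max_overlap
```

For example, if the upstream gene ends at position 100 and the candidate start is at position 97, describe_junction(100, 97) reports a 4 bp overlap, which is the ATGA arrangement.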
12b.
The position of the start site is often conserved among homologues of
genes. Therefore, the start site of a gene in your phage is likely to be
in the same position as those in related genes in other genomes. But be
aware that one or more previously annotated and published start sites
could be suboptimal, and you may have the opportunity to help change them
to better ones. Homologues in more distantly related genomes (those of a
different cluster) may prove more informative because alternate incorrect
start sites are less likely to be conserved. Use Starterator!
12c.
The preferred start site usually has a favorable RBS score among all the
potential start codons, but not necessarily the best. A notable exception
is the integrase in many genomes, which has a very low RBS score. Our
experimental data suggest that some genes do not have a Shine-Dalgarno
(SD) sequence.
12d.
Manual inspection can be helpful to distinguish between possible start
sites. The consensus ribosome binding site is as follows: AAGGAGG – 3-12
bp – start codon.
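To support manual inspection, the region upstream of a candidate start can be compared against the AAGGAGG consensus at each allowed spacing. The sketch below simply counts matching bases; it is not the RBS score reported by DNA Master or any other tool. It assumes a forward-strand sequence, a 0-based index for the first base of the start codon, and that "3-12 bp" means the number of bases between the consensus and the start codon.

```python
SD_CONSENSUS = "AAGGAGG"


def best_sd_match(seq, start_idx, min_spacer=3, max_spacer=12):
    """Return (matches, spacer) for the best consensus-like window.

    matches is the number of positions (0-7) identical to AAGGAGG and
    spacer is the number of bases between that window and the start codon.
    Returns None when no window fits before the start of the sequence.
    """
    k = len(SD_CONSENSUS)
    best = None
    for spacer in range(min_spacer, max_spacer + 1):
        window_start = start_idx - spacer - k
        if window_start < 0:
            break  # larger spacers would run off the sequence
        window = seq[window_start:window_start + k].upper()
        matches = sum(a == b for a, b in zip(window, SD_CONSENSUS))
        if best is None or matches > best[0]:
            best = (matches, spacer)  # ties keep the shorter spacer
    return best
```

Comparing the result for each candidate start gives a quick, rough ranking to weigh alongside the other sub-principles.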
12e.
Your final start-site selection will likely represent a compromise of
these sub-principles. A corollary to the start-selection guidelines:
sometimes the best start comes down to a choice between two tandem start
codons (i.e., one immediately follows the other). Based on a small amount
of mass spectrometry data and some basic biology principles, always
choose the second start codon. Examples are the Met-Met “ATGATG” and
Met-Leu “TGATGTTGA” start codons.
• It is important to check the six-frame translation!
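The tandem-start corollary is easy to express in code. This minimal sketch assumes a forward-strand sequence and a 0-based index for the first base of the candidate start codon; the function name is hypothetical.

```python
START_CODONS = ("ATG", "GTG", "TTG")


def resolve_tandem_start(seq, start_idx):
    """Return the index of the start to call, preferring the second of two
    tandem start codons as the corollary above recommends."""
    first = seq[start_idx:start_idx + 3].upper()
    second = seq[start_idx + 3:start_idx + 6].upper()
    if first in START_CODONS and second in START_CODONS:
        return start_idx + 3
    return start_idx
```

With the document's own example, resolve_tandem_start("TGATGTTGA", 2) moves the call from the ATG at index 2 to the TTG at index 5.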
13.
tRNA genes are not called precisely by the program embedded in DNA
Master and require extra attention.
14.
Protein assignments require rigorous review of the ever-increasing
available data. At a minimum, each gene should be evaluated using HHPred
and BLASTP, as well as examined in the context of the functions of the
flanking genes (synteny).
15.
Iteration is key. Annotation is like writing a paper; after you've made a
rough draft, you will need to refine, revise, and polish all your gene
calls to produce a cohesive whole.