trGEN User Manual
Release 15, December 2005
trGEN contains sequences translated from the EMBL Nucleotide Sequence
Database, prepared by the European Bioinformatics Institute. For a recent
reference see: Stoesser G., Baker W., van den Broek A., Camon E.,
Garcia-Pastor M., Kanz C., Kulikova T., Lombard V., Lopez R., Parkinson H.,
Redaschi N., Sterk P., Stoehr P., Tuli M.A.; Nucleic Acids Res. 29:17-21(2001).
Since the trGEN format is similar to that of SWISS-PROT, only the differences from the
SWISS-PROT format are listed here. The main difference lies in the FT lines. All other information concerning the SWISS-PROT format can be found in the SWISS-PROT Protein Knowledgebase User Manual.
Cross-references are made in trGEN to:
- The EMBL Nucleotide Sequence Database, prepared by the European Bioinformatics Institute.
Copyright notice
trGEN is copyright. It is produced at the Swiss Institute of Bioinformatics.
There are no restrictions on its use by non-profit institutions as long as its
content is in no way modified. Usage by and for commercial entities requires a
license agreement. For information about the licensing scheme see:
COPYRIGHT.
The above copyright notice also applies to this user manual as well as to any other
trGEN document.
Citation
If you want to cite trGEN in a publication, please use the following reference:
- Pagni M., Iseli C., Junier T., Falquet L., Jongeneel V. and Bucher P.
- trEST, trGEN and Hits: access to databases of predicted protein sequences
- Nucleic Acids Res. 29:148-151(2001).
1. What is trGEN?
2. Conventions used in the database
- 2.1. General structure of the database
- 2.2. Classes of data
- 2.3. Structure of a sequence entry
3. The different line types
- 3.1 The ID line
- 3.2 The AC line
- 3.3 The DT line
- 3.4 The DE line
- 3.6 The OS, OC and OX lines
- 3.7 The CC line
- 3.8 The DR line
- 3.9 The FT line
- 3.10 The SQ line
- 3.11 The sequence data line
- 3.12 The // line
4. How is trGEN updated?
1. What is trGEN?
The amino acid sequences in the trGEN database are predicted from genomic and High Throughput Genome
(HTG) sequences in the EMBL database for species such as Homo sapiens, Mus musculus and Rattus
norvegicus. Vectors and bacterial contaminants (if any) are masked, and sequences shorter than
10,000 bp are discarded. The remaining sequences are then searched for putative genes and their coding regions
with GENSCAN. Although GENSCAN is one of the best gene prediction programs available, it is not foolproof:
it misses about one in ten exons, and about one in ten of the predicted exons may be
false positives. This means that the majority of trGEN entries contain some errors. Moreover, it must
be stressed that trGEN entries are NOT real protein sequences. They are intended to help researchers
retrieve relevant genomic or HTG entries from the EMBL/GenBank databases and use these as a basis
for further gene discovery.
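As a rough illustration of why most entries are expected to contain errors, the sketch below assumes that each true exon is missed independently with probability 0.1 (the "one in ten" figure quoted above). The independence assumption and the function name `prob_no_missed_exon` are illustrative only and are not part of the trGEN pipeline.

```python
def prob_no_missed_exon(n_exons, miss_rate=0.1):
    """Probability that none of n_exons true exons is missed.

    Assumes exons are missed independently of one another, which is a
    simplification made for this illustration only.
    """
    if n_exons < 0:
        raise ValueError("n_exons must be non-negative")
    return (1.0 - miss_rate) ** n_exons


if __name__ == "__main__":
    # Show how quickly the chance of a fully recovered gene decays.
    for n in (1, 3, 7, 10):
        print(n, round(prob_no_missed_exon(n), 3))
```

Under these assumptions the chance of recovering every exon drops below one half once a gene has seven or more exons, before false-positive exons are even considered.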
2. Conventions used in the database
2.1. General structure of the database
The trGEN protein sequence database is composed of sequence entries. Each
entry corresponds to a single contiguous protein sequence as predicted by GENSCAN.
References to positions within a sequence are made using sequential numbering of the residues,
beginning with 1 at the N-terminal end of the sequence.
2.2. Classes of data
Due to the nature of the trGEN entries, a new class (HYPOTHETICAL) has been defined. All entries in trGEN belong
to this class.
2.3. Structure of a sequence entry
The entries in the trGEN database are structured like those in the SWISS-PROT database.
ID AC106881_2 HYPOTHETICAL; PRT; 320 AA.
AC AC106881_2;
DT 30-Sep-2002 (Rel. 04.00, Created)
DT 30-Sep-2002 (Rel. 04.00, Last sequence update)
DT 30-Sep-2002 (Rel. 04.00, Last annotation update)
DE Homo sapiens; Clone: RP11-710C12; Chromosome: 4; Map: 4; 1 ordered
DE piece.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606;
CC --------------------------------------------------------------------------
CC This entry is NOT a real protein sequence. It is a hypothetical protein
CC obtained by genscan from a DNA sequence. It is known to be error prone.
CC --------------------------------------------------------------------------
DR EMBL; AC106881; -; -.
FT GENSCAN 1 226 FIRST EXON; p-value: 0.159.
FT GENSCAN 227 262 INTERNAL EXON; p-value: 0.093.
FT GENSCAN 263 269 INTERNAL EXON; p-value: 0.065.
FT GENSCAN 270 320 LAST EXON; p-value: 0.074.
SQ SEQUENCE 320 AA; 36554 MW; 60628D6FC1F6FAC0 CRC64;
MIISIDAEKA FDKIQQPFML KTLNKLGIDG TYLKIIRAIY DKPTANITLN GQKLEAFPLK
TGTRQGCPLS PLLFNIVLEV LARAIRQEKE IKGIQLGKEE VKLSLFADDM IVYLENPIVS
AQNLLKLIRN FSKVSGYKIN VQKSQAFLYT NNRQTESQIM SELPFTIASK RIKYLGIQLK
RDVKDLFKEN YKPLLNEIKE DTNKWNNIPC SWVGRINIVK MAILPKKIPF KYQYVSKLYN
VAVTDIVWEH QGQPDHFGLY QNVSAKAIKM LCIYVHYRLQ RCPSTTTTKI FLVAVSITTS
TSGLTNLGLV MVAKYVNHTE
//
Each line begins with a two-character line code, which indicates the type of data contained
in the line. The current line types and line codes, and the order in which they appear in an
entry, are shown in the table below. The line types and line codes used in trGEN are a reduced
set of those appearing in SWISS-PROT.
Line code | Content | Occurrence in an entry |
ID | Identification | Once; starts the entry |
AC | Accession number(s) | Once or more |
DT | Date | Three times |
DE | Description | Once or more |
OS | Organism species | Once |
OC | Organism classification | Once or more |
OX | Taxonomy cross-reference(s) | Once |
CC | Comments or notes | Once or more |
DR | Database cross-references | Once or more |
FT | Feature table data | Once or more |
SQ | Sequence header | Once |
| (blanks) sequence data | Once or more |
// | Termination line | Once; ends the entry |
The following line codes are currently omitted: GN, OG, RN, RP, RC, RX, RA, RT, RL and KW.
Each entry must begin with an identification line (ID) and end with a terminator line (//).
A detailed description of each line type that differs from a standard SWISS-PROT entry as defined
in the SWISS-PROT Protein Knowledgebase User Manual
is given in the next section of this document.
The two-character line-type code that begins each line is always followed
by three blanks, so that the actual information begins with the sixth
character. Information is not extended beyond character position 75.
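The layout rules above can be sketched as a small line splitter. This is a minimal sketch, assuming input lines keep their original column layout (two-character code, three blanks, data from column 6); the label `SEQ` for sequence data lines and the function names are hypothetical conveniences, not part of the format.

```python
def split_line(line):
    """Split one entry line into (line_code, data)."""
    if line.startswith("//"):
        return "//", ""
    code = line[:2].strip()
    data = line[5:].rstrip()  # data starts at the sixth character
    # Sequence data lines carry a blank line code; label them "SEQ" here.
    return (code or "SEQ"), data


def check_entry(lines):
    """Return (code, data) pairs for an entry, raising on basic violations."""
    lines = [l.rstrip("\n") for l in lines if l.strip()]
    for l in lines:
        if len(l.rstrip()) > 75:
            raise ValueError("line extends beyond character position 75")
    parsed = [split_line(l) for l in lines]
    if not parsed or parsed[0][0] != "ID":
        raise ValueError("entry must begin with an ID line")
    if parsed[-1][0] != "//":
        raise ValueError("entry must end with a // line")
    return parsed


if __name__ == "__main__":
    entry = [
        "ID   AC106881_2     HYPOTHETICAL; PRT;   320 AA.",
        "AC   AC106881_2;",
        "     MIISIDAEKA FDKIQQPFML",
        "//",
    ]
    for code, data in check_entry(entry):
        print(code, "|", data)
```

The column spacing inside the sample lines is an assumption; only the position of the line code and the start of the data are taken from the rules above.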
3. The different line types
3.1. The ID line
The ID (IDentification) line is always the first line of an entry. The general form of the ID line
is:
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
The first item on the ID line is the entry name of the sequence. This name is a useful means
of identifying a sequence. The entry name is composed of the EMBL accession number of the contig
on which the protein was predicted, followed by an underscore and a number that enumerates the
proteins in the order in which they are found on the contig.
Two examples of ID lines are shown below:
ID AC106881_1 HYPOTHETICAL; PRT; 108 AA.
ID AC106881_2 HYPOTHETICAL; PRT; 163 AA.
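The ID line layout described above can be read with a short regular expression. This is a sketch only: the function name `parse_id_line` and the dictionary keys are hypothetical, and the pattern tolerates any amount of whitespace between fields rather than relying on fixed columns.

```python
import re

ID_RE = re.compile(r"^ID\s+(\S+)\s+(\w+);\s+(\w+);\s+(\d+)\s+AA\.$")


def parse_id_line(line):
    """Parse a trGEN ID line into its components."""
    m = ID_RE.match(line.strip())
    if m is None:
        raise ValueError("not a valid ID line: %r" % line)
    entry_name, data_class, molecule_type, length = m.groups()
    # The entry name is <EMBL accession>_<protein number on the contig>.
    accession, sep, index = entry_name.rpartition("_")
    if not sep or not index.isdigit():
        raise ValueError("unexpected entry name: %r" % entry_name)
    return {
        "entry_name": entry_name,
        "embl_accession": accession,
        "protein_index": int(index),
        "data_class": data_class,
        "molecule_type": molecule_type,
        "length": int(length),
    }


if __name__ == "__main__":
    print(parse_id_line("ID AC106881_2 HYPOTHETICAL; PRT; 163 AA."))
```

Splitting on the last underscore keeps the sketch working even if an accession itself were to contain an underscore.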
3.2. The AC line
The AC (ACcession number) line lists the accession number(s) associated
with an entry. In the trGEN database the accession number is identical to the
entry name given on the ID line.
An accession number is dropped only when the data to which it was assigned
have been completely removed from the database.
3.3. The DT line
The DT (DaTe) lines show the date of creation and last modification of the
database entry. The format of the DT line is:
DT DD-MMM-YYYY (Rel. XX.YY, Comment)
The main difference from the DT line found in standard SWISS-PROT entries is that
the release number is split into two parts, where 'XX' is the trGEN main
release number and 'YY' is the number of the weekly release.
Example of a block of DT lines:
DT 21-Aug-2002 (Rel. 04.00, Created)
DT 21-Aug-2002 (Rel. 04.00, Last sequence update)
DT 21-Aug-2002 (Rel. 04.00, Last annotation update)
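A DT line in this format can be taken apart as sketched below. The function name `parse_dt_line` and the returned keys are hypothetical; month names are mapped through an explicit table so that the result does not depend on the system locale.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}

DT_RE = re.compile(
    r"^DT\s+(\d{2})-([A-Za-z]{3})-(\d{4})\s+\(Rel\.\s+(\d+)\.(\d+),\s+(.+)\)$"
)


def parse_dt_line(line):
    """Parse a trGEN DT line into date, release numbers and comment."""
    m = DT_RE.match(line.strip())
    if m is None:
        raise ValueError("not a valid DT line: %r" % line)
    day, month, year, main_release, weekly_release, comment = m.groups()
    return {
        "date": date(int(year), MONTHS[month.capitalize()], int(day)),
        "main_release": int(main_release),      # 'XX' in Rel. XX.YY
        "weekly_release": int(weekly_release),  # 'YY' in Rel. XX.YY
        "comment": comment,
    }


if __name__ == "__main__":
    print(parse_dt_line("DT 21-Aug-2002 (Rel. 04.00, Last sequence update)"))
```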
3.4. The DE line
The DE (DEscription) lines contain general descriptive information about
the sequence stored. Since the protein is unknown in most cases, the DE
lines contain information about the underlying clone, the chromosome,
the number of pieces the clone consists of, and which of those pieces encodes
the protein.
The format of the DE line is:
DE Description.
Example:
DE Clone: RP11-710C12; Chromosome: 4; 6 unordered pieces;
DE Part 3.
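Judging from the examples in this manual, a DE block is a semicolon-separated list in which some items have the form "Key: value". The sketch below relies on that observation, which is an assumption rather than a documented rule; the function name `parse_de_lines` is hypothetical.

```python
def parse_de_lines(lines):
    """Join DE lines and split them into keyed fields and free-text notes."""
    parts = []
    for line in lines:
        tokens = line.split(None, 1)
        if len(tokens) == 2 and tokens[0] == "DE":
            parts.append(tokens[1].strip())
    text = " ".join(parts)
    if text.endswith("."):
        text = text[:-1]  # drop the final full stop of the description
    fields, notes = {}, []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
        else:
            notes.append(item)
    return fields, notes


if __name__ == "__main__":
    block = [
        "DE Clone: RP11-710C12; Chromosome: 4; 6 unordered pieces;",
        "DE Part 3.",
    ]
    print(parse_de_lines(block))
```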
3.6. The OS, OC and OX lines
The OS, OC and OX lines are identical to those found in a standard SWISS-PROT entry (see the SWISS-PROT Protein Knowledgebase User Manual).
3.7. The CC line
The CC lines are free text comments on the entry, and are used to convey
any useful information. Because nothing further is known about the predicted protein,
the CC lines in the trGEN database currently contain only the following text:
CC This entry is NOT a real protein sequence. It is a hypothetical protein
CC obtained by genscan from a DNA sequence. It is known to be error prone.
3.8. The DR line
The DR (Database cross-Reference) lines are used as pointers to information related to
trGEN entries and found in data collections other than trGEN. Currently trGEN is
cross-referenced only to the EMBL database. For more information about the DR line, see the SWISS-PROT Protein Knowledgebase User Manual.
The format of the DR line is:
DR EMBL; ACCESSION_NUMBER; -; -.
An example of a DR line in a trGEN entry:
DR EMBL; AC106881; -; -.
3.9. The FT line
The FT (Feature Table) lines provide a precise but simple means for the annotation of the
sequence data. For details about the format, see the SWISS-PROT Protein Knowledgebase User Manual.
An example of a feature table is shown below:
FT GENSCAN 1 44 FIRST EXON; p-value: 0.666.
FT GENSCAN 45 75 INTERNAL EXON; p-value: 0.218.
The only FT Key currently used in trGEN entries is 'GENSCAN '. It was specifically introduced
for trGEN.
GENSCAN - Predictions made by GENSCAN.
Examples of GENSCAN key feature lines:
FT GENSCAN 1 44 FIRST EXON; p-value: 0.666.
FT GENSCAN 45 75 INTERNAL EXON; p-value: 0.218.
FT GENSCAN 76 76 AA on splice site: CT/T -> L.
FT GENSCAN 77 92 INTERNAL EXON; p-value: 0.337.
FT GENSCAN 93 93 AA on splice site: G/TT -> V.
FT GENSCAN 94 163 LAST EXON; p-value: 0.787.
FT GENSCAN 1 196 SINGLE EXON; p-value: 0.973.
The predictions made by GENSCAN that appear in the descriptions for the key 'GENSCAN '
are the following:
- FIRST EXON
- INTERNAL EXON
- LAST EXON
- SINGLE EXON for single exon genes
The p-value is the exon probability calculated by GENSCAN (the sum over all parses containing the exon).
Despite its name, it is not a statistical significance level: it indicates the degree of certainty that should
be ascribed to an exon predicted by the program, so higher values indicate more reliable predictions.
For more information about GENSCAN see:
- Burge C and Karlin S.
- Prediction of Complete Gene Structures in Human Genomic DNA
- J. Mol. Biol. 268:78-94 (1997)
In addition, the following description may appear for the key 'GENSCAN ':
- AA on splice site: CT/T -> L.
This description indicates an amino acid whose codon is split across two exons: the 5' part
of the codon lies at the 3' end of the previous exon and the 3' part at the 5' end of the
subsequent exon. The slash '/' marks the position of the intron.
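The two kinds of GENSCAN descriptions can be distinguished as sketched below. This is a minimal sketch based only on the example lines in this section: the function names, the dictionary keys and the label `SPLICE SITE` are hypothetical. The helper `feature_residues` shows how the 1-based, inclusive positions map onto a Python slice.

```python
import re

FT_RE = re.compile(r"^FT\s+GENSCAN\s+(\d+)\s+(\d+)\s+(.+)$")
EXON_RE = re.compile(
    r"^(FIRST|INTERNAL|LAST|SINGLE)\s+EXON;\s*p-value:\s*(\d+(?:\.\d+)?)$"
)
SPLICE_RE = re.compile(r"^AA on splice site:\s*([ACGT]*)/([ACGT]*)\s*->\s*([A-Z])$")


def parse_genscan_ft(line):
    """Parse one FT line with the GENSCAN key."""
    m = FT_RE.match(line.strip())
    if m is None:
        raise ValueError("not a GENSCAN FT line: %r" % line)
    feature = {"start": int(m.group(1)), "end": int(m.group(2))}
    description = m.group(3).strip()
    if description.endswith("."):
        description = description[:-1]  # drop the closing full stop
    exon = EXON_RE.match(description)
    splice = SPLICE_RE.match(description)
    if exon:
        feature["kind"] = exon.group(1) + " EXON"
        feature["p_value"] = float(exon.group(2))
    elif splice:
        feature["kind"] = "SPLICE SITE"
        # Rejoin the codon halves found on either side of the intron.
        feature["codon"] = splice.group(1) + splice.group(2)
        feature["residue"] = splice.group(3)
    else:
        raise ValueError("unrecognised description: %r" % description)
    return feature


def feature_residues(sequence, feature):
    """Return the residues covered by a feature (positions are 1-based, inclusive)."""
    return sequence[feature["start"] - 1:feature["end"]]


if __name__ == "__main__":
    print(parse_genscan_ft("FT GENSCAN 1 44 FIRST EXON; p-value: 0.666."))
    print(parse_genscan_ft("FT GENSCAN 76 76 AA on splice site: CT/T -> L."))
```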
3.10. The SQ line
The SQ (SeQuence header) line marks the beginning of the sequence data and gives a
quick summary of its content. For more details, see the SWISS-PROT Protein Knowledgebase User Manual.
The molecular mass in trGEN entries takes unknown amino acids such as 'X' into account.
The mass used for them is the average mass of all amino acids, namely 136.9 Da.
3.11. The sequence data line
The characters used for the amino acids are the standard IUPAC one-letter codes (see also the
SWISS-PROT user manual, Appendix A, or http://www.chem.qmul.ac.uk/iupac/AminoAcid/).
An example of sequence data lines is shown here:
DWMVSMIMDR EYSVAVEAVR LLILILKNME GVLMDVDCES VYPIVLFYPE CEIRTMGGRE
QRQSPGAQRT FFQLLLSFFV ESKVTYTEIT LAVVHRTYKW AGVGGSRX
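Sequence data lines can be joined into a single string by dropping the blanks that separate the blocks of residues. The sketch below assumes nothing beyond what the example shows; the function name `read_sequence` is hypothetical.

```python
def read_sequence(lines):
    """Concatenate sequence data lines, removing all blanks."""
    sequence = "".join("".join(line.split()) for line in lines)
    if not sequence.isalpha() or not sequence.isupper():
        raise ValueError("sequence data must consist of upper-case letters")
    return sequence


if __name__ == "__main__":
    data = [
        "DWMVSMIMDR EYSVAVEAVR LLILILKNME GVLMDVDCES VYPIVLFYPE CEIRTMGGRE",
        "QRQSPGAQRT FFQLLLSFFV ESKVTYTEIT LAVVHRTYKW AGVGGSRX",
    ]
    seq = read_sequence(data)
    print(len(seq), seq.count("X"))
```

Comparing the resulting length with the SEQUENCE_LENGTH field of the ID line is a simple consistency check.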
3.12. The // line
The // (terminator) line contains no data or comments and designates the end of an entry.
4. How is trGEN updated?
trGEN is updated weekly. The list of entries is compared to the list of EMBL contigs. Where a new sequence version
for a contig exists, the old sequences are removed and the new sequences are added.
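The comparison step can be pictured as follows. This is a sketch under stated assumptions, not the actual update procedure: it supposes that each contig's sequence version is available as an integer for both trGEN and EMBL, and the function name `contigs_to_refresh` and the accession `AC000001` in the usage example are hypothetical.

```python
def contigs_to_refresh(trgen_versions, embl_versions):
    """Return contigs whose EMBL sequence version is newer than the one in trGEN.

    Both arguments map an EMBL contig accession to a sequence version number.
    Contigs missing from embl_versions are left untouched in this sketch.
    """
    return sorted(
        accession
        for accession, version in trgen_versions.items()
        if embl_versions.get(accession, version) > version
    )


if __name__ == "__main__":
    current = {"AC106881": 2, "AC000001": 1}
    latest = {"AC106881": 3, "AC000001": 1}
    # For each contig listed, the old predicted sequences would be removed
    # and new ones predicted from the updated contig.
    print(contigs_to_refresh(current, latest))
```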