trGEN
User Manual

Release 15, December 2005

trGEN contains sequences translated from the EMBL Nucleotide Sequence Database, prepared by the European Bioinformatics Institute. For a recent reference see: Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Lombard V., Lopez R., Parkinson H., Redaschi N., Sterk P., Stoehr P., Tuli M.A.; Nucleic Acids Res. 29:17-21(2001).

Since the trGEN format is similar to that of SWISS-PROT, only the differences from the SWISS-PROT format are listed here. The main differences are in the FT lines. All other information concerning the SWISS-PROT format can be found in the SWISS-PROT Protein Knowledgebase User Manual.

Cross-references are made in trGEN to the EMBL Nucleotide Sequence Database.

Copyright notice

trGEN is copyright. It is produced at the Swiss Institute of Bioinformatics. There are no restrictions on its use by non-profit institutions as long as its content is in no way modified. Usage by and for commercial entities requires a license agreement. For information about the licensing scheme see: COPYRIGHT.

The above copyright notice also applies to this user manual as well as to any other trGEN document.


Citation

If you want to cite trGEN in a publication, please use the following reference:

Pagni M., Iseli C., Junier T., Falquet L., Jongeneel V., Bucher P.
trEST, trGEN and Hits: access to databases of predicted protein sequences
Nucleic Acids Res. 29:148-151(2001).

Table of contents

1. What is trGEN?

2. Conventions used in the database
2.1.   General structure of the database
2.2.   Classes of data
2.3.   Structure of a sequence entry

3. The different line types
3.1.   The ID line
3.2.   The AC line
3.3.   The DT line
3.4.   The DE line
3.6.   The OS, OC and OX lines
3.11.  The CC line
3.12.  The DR line
3.14.  The FT line
3.15.  The SQ line
3.16.  The sequence data line
3.17.  The // line

4. How is trGEN updated?

1. What is trGEN?

The amino acid sequences of the trGEN database are predicted from genomic and High Throughput Genome (HTG) sequences from the EMBL database for species such as Homo sapiens, Mus musculus and Rattus norvegicus. Vectors and bacterial contaminants (if any) are masked, and sequences under 10,000 bp are discarded. The sequences are then searched for putative genes and their coding regions with GENSCAN. Although GENSCAN is one of the best gene prediction programs available, it is not foolproof: it misses about one in ten exons, and about one in ten of the exons it predicts may be false positives. This means that the majority of trGEN entries contain some errors. Moreover, it must be stressed that trGEN entries are NOT real protein sequences. They are intended to help researchers retrieve relevant genomic or HTG entries from the EMBL/GenBank databases, and to use these as a basis for further gene discovery.

2. Conventions used in the database

2.1. General structure of the database
The trGEN protein sequence database is composed of sequence entries. Each entry corresponds to a single contiguous protein sequence as predicted by GENSCAN.

References to positions within a sequence are made using sequential numbering of the residues, beginning with 1 at the N-terminal end of the sequence.

2.2. Classes of data
Due to the nature of the trGEN entries, a new class (HYPOTHETICAL) has been defined. All entries in trGEN belong to this class.

2.3. Structure of a sequence entry
The entries in the trGEN database are structured like those in the SWISS-PROT database.


ID   AC106881_2 HYPOTHETICAL;      PRT;   320 AA.
AC   AC106881_2;
DT   30-Sep-2002 (Rel. 04.00, Created)
DT   30-Sep-2002 (Rel. 04.00, Last sequence update)
DT   30-Sep-2002 (Rel. 04.00, Last annotation update)
DE   Homo sapiens; Clone: RP11-710C12; Chromosome: 4; Map: 4; 1 ordered
DE   piece.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
CC   --------------------------------------------------------------------------
CC   This entry is NOT a real protein sequence.  It is a hypothetical protein
CC   obtained by genscan from a DNA sequence. It is known to be error prone.
CC   --------------------------------------------------------------------------
DR   EMBL; AC106881; -; -.
FT   GENSCAN       1    226       FIRST EXON; p-value: 0.159.
FT   GENSCAN     227    262       INTERNAL EXON; p-value: 0.093.
FT   GENSCAN     263    269       INTERNAL EXON; p-value: 0.065.
FT   GENSCAN     270    320       LAST EXON; p-value: 0.074.
SQ   SEQUENCE   320 AA;  36554 MW;  60628D6FC1F6FAC0 CRC64;
     MIISIDAEKA FDKIQQPFML KTLNKLGIDG TYLKIIRAIY DKPTANITLN GQKLEAFPLK
     TGTRQGCPLS PLLFNIVLEV LARAIRQEKE IKGIQLGKEE VKLSLFADDM IVYLENPIVS
     AQNLLKLIRN FSKVSGYKIN VQKSQAFLYT NNRQTESQIM SELPFTIASK RIKYLGIQLK
     RDVKDLFKEN YKPLLNEIKE DTNKWNNIPC SWVGRINIVK MAILPKKIPF KYQYVSKLYN
     VAVTDIVWEH QGQPDHFGLY QNVSAKAIKM LCIYVHYRLQ RCPSTTTTKI FLVAVSITTS
     TSGLTNLGLV MVAKYVNHTE
//
Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below. The line types and line codes used in trGEN are a reduced set of those appearing in SWISS-PROT.

Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           Once or more
DT         Date                          Three times
DE         Description                   Once or more
OS         Organism species              Once
OC         Organism classification       Once or more
OX         Taxonomy cross-reference(s)   Once
CC         Comments or notes             Once or more
DR         Database cross-references     Once or more
FT         Feature table data            Once or more
SQ         Sequence header               Once
(blanks)   Sequence data                 Once or more
//         Termination line              Once; ends the entry


The following line codes are currently omitted: GN, OG, RN, RP, RC, RX, RA, RT, RL and KW.

Each entry must begin with an identification line (ID) and end with a terminator line (//).

A detailed description of each line type that differs from a standard SWISS-PROT entry as defined in the SWISS-PROT Protein Knowledgebase User Manual is given in the next section of this document.

The two-character line-type code that begins each line is always followed by three blanks, so that the actual information begins with the sixth character. Information is not extended beyond character position 75.
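The layout rule above (two-character code, three blanks, data from the sixth character) can be sketched in Python as follows. This is only an illustration, not a tool distributed with trGEN: the function name `parse_entry` and the key `"seq"` used for sequence data lines are assumptions made for this example.

```python
from collections import defaultdict


def parse_entry(text):
    """Group the lines of one entry by their two-character line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines between entries
        if line.startswith("//"):
            break  # terminator line: end of the entry
        code = line[:2]
        if code == "  ":
            code = "seq"  # hypothetical key for the (blanks) sequence data lines
        # The actual information begins with the sixth character.
        fields[code].append(line[5:].rstrip())
    return dict(fields)


entry = """\
ID   AC106881_2 HYPOTHETICAL;      PRT;   320 AA.
AC   AC106881_2;
OS   Homo sapiens (Human).
DR   EMBL; AC106881; -; -.
//
"""

fields = parse_entry(entry)
print(fields["ID"])
print(fields["DR"])
```

Keeping a list per line code preserves the order of repeated lines such as DT, DE, OC and FT.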

3. The different line types

3.1. The ID line

The ID (IDentification) line is always the first line of an entry. The general form of the ID line is:

ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
The first item on the ID line is the entry name of the sequence. This name is a useful means of identifying a sequence. The entry name is composed of the EMBL accession number of the contig on which the protein was predicted, followed by an underscore and a number that enumerates the proteins in the order in which they are found on the contig.

Two examples of ID lines are shown below:

ID   AC106881_1 HYPOTHETICAL;      PRT;   108 AA.

ID   AC106881_2 HYPOTHETICAL;      PRT;   163 AA.
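As a minimal sketch, an ID line of the form shown above could be taken apart in Python like this. The regular expression and the function names are assumptions based only on the two example lines; they are not part of any official trGEN software.

```python
import re

# Pattern inferred from the example ID lines; the exact spacing is assumed to vary.
ID_RE = re.compile(r"^ID   (\S+)\s+(\w+);\s+(\w+);\s+(\d+) AA\.$")


def parse_id_line(line):
    """Return (entry_name, data_class, molecule_type, length) from an ID line."""
    match = ID_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a trGEN ID line: %r" % line)
    name, data_class, molecule_type, length = match.groups()
    return name, data_class, molecule_type, int(length)


def split_entry_name(entry_name):
    """Split an entry name into the EMBL accession and the protein number."""
    accession, _, number = entry_name.rpartition("_")
    return accession, int(number)


id_fields = parse_id_line("ID   AC106881_1 HYPOTHETICAL;      PRT;   108 AA.")
print(id_fields)
print(split_entry_name(id_fields[0]))
```

Splitting at the last underscore keeps the accession intact even if it were to contain an underscore itself.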

3.2. The AC line

The AC (ACcession number) line lists the accession number(s) associated with an entry. In the trGEN database the accession number is identical to the entry name given on the ID line. An accession number is dropped only when the data to which it was assigned have been completely removed from the database.

3.3. The DT line

The DT (DaTe) lines show the date of creation and last modification of the database entry. The format of the DT line is:
DT   DD-MMM-YYYY (Rel. XX.YY, Comment)
The main difference from the DT line found in standard SWISS-PROT entries is that the release number is split into two parts, where 'XX' is the trGEN main release number and 'YY' is the number of the weekly release. Example of a block of DT lines:

DT   21-Aug-2002 (Rel. 04.00, Created)
DT   21-Aug-2002 (Rel. 04.00, Last sequence update)
DT   21-Aug-2002 (Rel. 04.00, Last annotation update)
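The DT format above could be read with a small Python routine such as the one below. It is a sketch under the assumption that month names are always the three-letter English abbreviations seen in the examples; the names `DT_RE` and `parse_dt_line` are made up for this illustration.

```python
import re
from datetime import date

# Month abbreviations as they appear in the examples (assumed to be English).
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

DT_RE = re.compile(
    r"^DT   (\d{2})-([A-Za-z]{3})-(\d{4}) \(Rel\. (\d+)\.(\d+), (.+)\)$")


def parse_dt_line(line):
    """Return (date, main_release, weekly_release, comment) from a DT line."""
    match = DT_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a trGEN DT line: %r" % line)
    day, month, year, main, weekly, comment = match.groups()
    when = date(int(year), MONTHS[month.upper()], int(day))
    return when, int(main), int(weekly), comment


dt_fields = parse_dt_line("DT   21-Aug-2002 (Rel. 04.00, Last sequence update)")
print(dt_fields)
```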
3.4. The DE line

The DE (DEscription) lines contain general descriptive information about the sequence stored. Since the protein is not known in most cases, the DE lines contain information about the underlying clone, the chromosome, the number of pieces the clone consists of and which one of those pieces encodes the protein.

The format of the DE line is:
DE   Description.
Example:

DE   Clone: RP11-710C12; Chromosome: 4; 6 unordered pieces;
DE   Part 3.
3.6. The OS, OC and OX lines

The OS, OC and OX lines are identical to the ones found in a standard SWISS-PROT entry (see the SWISS-PROT Protein Knowledgebase User Manual).

3.11. The CC line

The CC lines are free text comments on the entry, and are used to convey any useful information. Because no information about the protein itself is available, the CC lines in the trGEN database currently always read as follows:
CC   This entry is NOT a real protein sequence.  It is a hypothetical protein
CC   obtained by genscan from a DNA sequence. It is known to be error prone.
3.12. The DR line

The DR (Database cross-Reference) lines are used as pointers to information related to trGEN entries and found in data collections other than trGEN. Currently trGEN is only cross-referenced to the EMBL database. For more information about the DR line, see the SWISS-PROT Protein Knowledgebase User Manual.

The format of the DR line is:

DR   EMBL; ACCESSION_NUMBER; -; -.
and a corresponding line in a trGEN entry is:

DR   EMBL; AC106881; -; -.
3.14. The FT line

The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. For details about the format, see the SWISS-PROT Protein Knowledgebase User Manual. An example of a feature table is shown below:

FT   GENSCAN      1     44       FIRST EXON; p-value: 0.666.
FT   GENSCAN     45     75       INTERNAL EXON; p-value: 0.218.
The only FT Key currently used in trGEN entries is 'GENSCAN '. It was specifically introduced for trGEN.

GENSCAN - Predictions made by GENSCAN.

Examples of GENSCAN key feature lines:

FT   GENSCAN      1     44       FIRST EXON; p-value: 0.666.
FT   GENSCAN     45     75       INTERNAL EXON; p-value: 0.218.
FT   GENSCAN     76     76       AA on splice site: CT/T -> L.
FT   GENSCAN     77     92       INTERNAL EXON; p-value: 0.337.
FT   GENSCAN     93     93       AA on splice site: G/TT -> V.
FT   GENSCAN     94    163       LAST EXON; p-value: 0.787.

FT   GENSCAN      1    196       SINGLE EXON; p-value: 0.973.
The descriptions that appear for the key 'GENSCAN ' are the following:

FIRST EXON, INTERNAL EXON, LAST EXON or SINGLE EXON, followed by a p-value: The type of exon predicted by GENSCAN. The p-value is the probability of the exon (summed over all parses containing the exon) as calculated by GENSCAN, and indicates the degree of certainty that should be ascribed to the predicted exon. For more information about GENSCAN see:
Burge C. and Karlin S.
Prediction of Complete Gene Structures in Human Genomic DNA
J. Mol. Biol. 268:78-94(1997).

AA on splice site: This description indicates an amino acid whose codon is split in two by an intron, the 5' part of the codon being at the 3' end of the previous exon and the 3' part at the 5' end of the subsequent exon. The slash '/' marks the position of the intron.
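The two kinds of GENSCAN descriptions seen in the examples can be told apart with a short Python sketch like the one below. The regular expressions are inferred from the example lines only, and the dictionary keys (`kind`, `p_value`, `codon`, `residue`) are names invented for this illustration. The helper `feature_residues` shows how the 1-based, inclusive positions map onto a Python slice.

```python
import re

# Patterns inferred from the example FT lines in this manual.
EXON_RE = re.compile(
    r"^(FIRST|INTERNAL|LAST|SINGLE) EXON; p-value: (\d*\.\d+)\.$")
SPLICE_RE = re.compile(r"^AA on splice site: ([A-Z]*)/([A-Z]*) -> ([A-Z])\.$")


def parse_ft_line(line):
    """Parse one 'FT   GENSCAN ...' line into a dictionary."""
    key, start, end, description = line[5:].rstrip().split(None, 3)
    feature = {"key": key, "start": int(start), "end": int(end)}
    exon = EXON_RE.match(description)
    splice = SPLICE_RE.match(description)
    if exon:
        feature.update(kind=exon.group(1), p_value=float(exon.group(2)))
    elif splice:
        # The slash marks the intron; both halves together form the codon.
        feature.update(kind="SPLICE",
                       codon=splice.group(1) + splice.group(2),
                       residue=splice.group(3))
    else:
        feature["kind"] = "UNKNOWN"
    return feature


def feature_residues(sequence, feature):
    """Residues covered by a feature (positions are 1-based and inclusive)."""
    return sequence[feature["start"] - 1:feature["end"]]


exon = parse_ft_line("FT   GENSCAN     45     75       INTERNAL EXON; p-value: 0.218.")
splice = parse_ft_line("FT   GENSCAN     76     76       AA on splice site: CT/T -> L.")
print(exon)
print(splice)
```

Descriptions that match neither pattern are kept with the kind "UNKNOWN" rather than rejected, so that a change in the description wording does not stop the parse.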

3.15. The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content. For more details, see the SWISS-PROT Protein Knowledgebase User Manual.

The molecular mass given in trGEN entries takes unknown amino acids, written as 'X', into account. Each 'X' is assigned the average mass of all amino acids, namely 136.9 Da.
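The SQ line of the example entry in section 2.3 can be read with a pattern like the one in the following Python sketch. The pattern is inferred from that single example line, and the function name `parse_sq_line` is an assumption made for illustration.

```python
import re

# Inferred from: SQ   SEQUENCE   320 AA;  36554 MW;  60628D6FC1F6FAC0 CRC64;
SQ_RE = re.compile(
    r"^SQ   SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]+) CRC64;$")


def parse_sq_line(line):
    """Return (length, molecular_mass, crc64) from an SQ line."""
    match = SQ_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not an SQ line: %r" % line)
    length, mass, crc64 = match.groups()
    return int(length), int(mass), crc64


sq_fields = parse_sq_line(
    "SQ   SEQUENCE   320 AA;  36554 MW;  60628D6FC1F6FAC0 CRC64;")
print(sq_fields)
```

The length obtained here can be compared with the length on the ID line and with the number of residues actually present in the sequence data lines.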

3.16. The sequence data line

The characters used for the amino acids are the standard IUPAC one-letter codes (see also the SWISS-PROT user manual, Appendix A, or http://www.chem.qmul.ac.uk/iupac/AminoAcid/).

An example of sequence data lines is shown here:

     DWMVSMIMDR EYSVAVEAVR LLILILKNME GVLMDVDCES VYPIVLFYPE CEIRTMGGRE
     QRQSPGAQRT FFQLLLSFFV ESKVTYTEIT LAVVHRTYKW AGVGGSRX
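A minimal Python sketch for reading and writing sequence data lines is given below. The layout it assumes (five leading blanks, blocks of 10 residues separated by one blank, 60 residues per line) is inferred from the examples in this manual; the function names are invented for this illustration.

```python
def read_sequence(lines):
    """Join sequence data lines into one string, dropping all blanks."""
    return "".join("".join(line.split()) for line in lines)


def format_sequence(sequence, block=10, blocks_per_line=6):
    """Lay a sequence out as in the examples (layout assumed, see above)."""
    width = block * blocks_per_line
    lines = []
    for start in range(0, len(sequence), width):
        chunk = sequence[start:start + width]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        lines.append("     " + " ".join(blocks))
    return lines


data_lines = [
    "     DWMVSMIMDR EYSVAVEAVR LLILILKNME GVLMDVDCES VYPIVLFYPE CEIRTMGGRE",
    "     QRQSPGAQRT FFQLLLSFFV ESKVTYTEIT LAVVHRTYKW AGVGGSRX",
]

sequence = read_sequence(data_lines)
print(len(sequence), sequence.count("X"))
```

Formatting the recovered sequence again with `format_sequence` reproduces the two input lines, which is a convenient consistency check.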
3.17. The // line

The // (terminator) line contains no data or comments and designates the end of an entry.

4. How is trGEN updated?

trGEN is updated weekly. The list of entries is compared to the list of EMBL contigs. Where a new sequence version for a contig exists, the old sequences are removed and the new sequences are added.
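The comparison step described above might look like the following Python sketch. It is purely illustrative: how trGEN actually records the sequence version of a contig is not described in this manual, so the two dictionaries (accession number mapped to sequence version), the accession AC000001 and the function name are all hypothetical.

```python
def contigs_to_rebuild(trgen_versions, embl_versions):
    """List contigs for which EMBL holds a newer sequence version.

    Both arguments are hypothetical mappings from EMBL accession number
    to sequence version. For each contig returned, the old predicted
    sequences would be removed and new ones added.
    """
    stale = []
    for accession, used_version in trgen_versions.items():
        current = embl_versions.get(accession)
        if current is not None and current > used_version:
            stale.append(accession)
    return sorted(stale)


# Made-up example data: AC106881 has a newer version in EMBL.
trgen_versions = {"AC106881": 2, "AC000001": 1}
embl_versions = {"AC106881": 3, "AC000001": 1}

to_rebuild = contigs_to_rebuild(trgen_versions, embl_versions)
print(to_rebuild)
```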