THE SWISS-PROT PROTEIN SEQUENCE DATA BANK USER MANUAL Release 36, July 1998 Amos Bairoch Swiss Institute of Bioinformatics (SIB) Centre Medical Universitaire 1, rue Michel Servet 1211 Geneva 4 Switzerland Telephone: +41-22-784 40 82 Fax: +41-22-702 55 02 Electronic mail address: bairoch@medecine.unige.ch WWW server: http://www.expasy.ch/ Rolf Apweiler The EMBL Outstation - The European Bioinformatics Institute (EBI) Wellcome Trust Genome Campus Hinxton Cambridge CB10 1SD United Kingdom Telephone: +44-1223-494 400 Fax: +44-1223-494 468 Electronic mail address: datalib@ebi.ac.uk WWW server: http://www.ebi.ac.uk/ ----------------------------------------------------------------------- Acknowledgements This release of SWISS-PROT has been prepared by: o Amos Bairoch, Marie-Claude Blatter Garin, Brigitte Boeckmann, Silvia Braconi, Nathalie Farriol, Serenella Ferro, Alain Gateau, Chantal Hulo, Janet James, Madelaine Moinat, Julia Williams Nef and Shyamala Sundaram at the Swiss Institute of Bioinformatics and the Medical Biochemistry Department of the University of Geneva; o Rolf Apweiler, Sergio Contrino, Christian Desaintes, Vivien Junker, Stephanie Kappus, Fiona Lang, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna and Claire O'Donovan at the European Bioinformatics Institute (EBI). SWISS-PROT contains sequences translated from the EMBL Nucleotide Sequence Database, prepared by the European Bioinformatics Institute For a recent reference see: Stoesser G., Moseley M.A., Sleep J., McGowran M., Garcia-Pastor M. and Sterk P.; Nucleic Acids Res. 26:8- 15(1998). A small part of the information in SWISS-PROT was originally adapted from information contained in the Protein Sequence Database of the Protein Information Resource (PIR) supported by the Division of Research Resources of the NIH, National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir road, N.W., Washington, D.C. 20007, U.S.A. 
For a recent reference see: Barker W.C., Garavelli J.S., Haft D.H.,
Hunt L.T., Marzec C.R., Orcutt B.C., Srinivasarao G.Y., Mewes H.-W.,
Pfeiffer F. and Tsugita A.; Nucleic Acids Res. 26:27-32(1998).

Cross-references are made in SWISS-PROT to:

 o  The PROSITE dictionary of sites and patterns in proteins prepared by
    Amos Bairoch and Philipp Bucher at the Swiss Institute of
    Bioinformatics.
    - Reference: Bairoch A., Bucher P. and Hofmann K.; Nucleic Acids
      Res. 25:217-221(1997).

 o  The X-ray crystallography Protein Data Bank (PDB) compiled at the
    Brookhaven National Laboratory.
    - Reference: Abola E.E., Manning N.O., Prilusky J., Stampf D.R. and
      Sussman J.L.; J. Res. Natl. Inst. Stand. Technol.
      101:231-241(1996).

 o  The Mendelian Inheritance in Man data bank (MIM) prepared under the
    supervision of Victor McKusick at Johns Hopkins University.
    - Reference: Pearson P., Francomano C., Foster P., Bocchini C.,
      Li P. and McKusick V.A.; Nucleic Acids Res. 22:3470-3473(1994).

 o  The Mouse Genome Database (MGD) prepared by the Mouse Genome
    Informatics group at Jackson Laboratory.
    - Reference: Blake J.A., Eppig J.T., Richardson J.E. and
      Davisson M.T.; Nucleic Acids Res. 26:130-137(1998).

 o  The restriction enzymes database (REBASE) prepared by Richard
    Roberts and Dana Macelis at New England BioLabs.
    - Reference: Roberts R.J. and Macelis D.; Nucleic Acids Res.
      26:338-350(1998).

 o  The G-protein-coupled receptor database (GCRDb) prepared by Lee
    Frank Kolakowski at the Department of Pharmacology of the University
    of Texas, San Antonio.
    - Reference: Kolakowski L.F. Jr.; Receptors Channels 2:1-7(1994).

 o  The EcoGene section of the EcoSeq/EcoMap integrated Escherichia coli
    K12 database and the StyGene section of the StySeq/StyMap integrated
    Salmonella typhimurium LT2 database, both prepared by Ken Rudd at
    the Department of Biochemistry and Molecular Biology of the
    University of Miami School of Medicine.
    - Reference: Rudd K.E.; ASM News 59:335-341(1993).

 o  The Encyclopedia of Escherichia coli genes and metabolism (EcoCyc)
    prepared under the supervision of Peter Karp at Pangea Systems and
    Monica Riley at MBL.
    - Reference: Karp P.D., Riley M., Paley S.M., Pellegrini-Toole A.
      and Krummenacker M.; Nucleic Acids Res. 26:50-53(1998).

 o  The gene-protein database of Escherichia coli K12 (2D-gel spots)
    (ECO2DBASE) prepared under the supervision of Ruth VanBogelen.
    - Reference: VanBogelen R.A., Abshire K.Z., Moldover B., Olson E.R.
      and Neidhardt F.C.; Electrophoresis 18:1243-1251(1997).

 o  The SubtiList relational database for the Bacillus subtilis 168
    genome prepared under the supervision of Ivan Moszer at the Pasteur
    Institute.
    - Reference: Moszer I., Glaser P. and Danchin A.; Microbiology
      141:261-268(1995).

 o  The human keratinocyte 2D gel protein database from the universities
    of Aarhus and Ghent.
    - Reference: Celis J.E., Rasmussen H.H., Olsen E., Madsen P.,
      Leffers H., Honore B., Dejgaard K., Gromov P., Hoffmann H.J.,
      Nielsen M., Vassiliev A., Vintermyr O., Hao J., Celis A.,
      Basse B., Lauridsen J.B., Ratz G.P., Andersen A.H., Walbum E.,
      Kjaergaard I., Puype M., Van Damme J. and Vandekerckhove J.;
      Electrophoresis 14:1091-1198(1993).

 o  The 2D gel protein database (SWISS-2DPAGE) of the Faculty of
    Medicine of the University of Geneva.
    - Reference: Hoogland C., Sanchez J.-C., Tonella L., Bairoch A.,
      Hochstrasser D.F. and Appel R.D.; Nucleic Acids Res.
      26:332-333(1998).

 o  The Saccharomyces Genome Database (SGD) prepared under the
    supervision of Mike Cherry at Stanford.
    - Reference: Cherry J.M., Adler C., Ball C., Chervitz S.A.,
      Dwight S.S., Hester E.T., Jia Y., Juvik G., Roe T., Schroeder M.,
      Weng S. and Botstein D.; Nucleic Acids Res. 26:73-79(1998).

 o  The Yeast Electrophoresis Protein Database (YEPD) prepared under the
    supervision of Jim Garrels from Proteome Inc.
    - Reference: Payne W.E. and Garrels J.I.; Nucleic Acids Res.
      25:57-62(1997).

o The Harefield Hospital 2D gel protein databases prepared under the supervision of Mike Dunn. - Reference: Corbett J.M., Wheeler C.H., Baker C.S., Yacoub M.H. and Dunn M.J.; Electrophoresis 15:1459-1465(1994). o The Drosophila genome database (FlyBase) prepared under the supervision of Michael Ashburner at the Department of Genetics, University of Cambridge. - Reference: Nucleic Acids Res. 26:85-88(1998). o The Maize genome database (MaizeDB) developed by the USDA-ARS Maize Genome Project as part of the National Agricultural Library's Plant Genome Research Program. o The WormPep database prepared by Richard Durbin and Erik Sonnhammer from the MRC Laboratory of Molecular Biology and Sanger Center at Hinxton Hall, Cambridge. o The DictyDb database prepared by Douglas W. Smith and Bill Loomis from the University of California, San Diego (UCSD). o The Human Retroviruses and AIDS compilation of nucleic and amino acid sequences (HIV Sequence Database) edited by G. Myers, A.B. Rabson, S.F. Josephs, T.F. Smith, J.A. Berzofsky, F. Wong-Staal; published by the Theoretical Biology and Biophysics Group T-10 at Los Alamos National Laboratory; and funded by the AIDS program of the National Institute of Allergy and Infectious Diseases through an interagency agreement with the United States Department of Energy. o The database of Homology-derived Secondary Structure of Proteins (HSSP) prepared under the supervision of Chris Sander at the EBI. - Reference: Dodge C., Schneider R. and Sander C.; Nucleic Acids Res. 26:313-315(1998). o The transcription factor database (Transfac) developed under the supervision of Edgar Wingender from the Gesellschaft fuer Biotechnologische Forschung mbH in Braunschweig. - Reference: Heinemeyer T., Wingender E., Reuter I., Hermjakob H., Kel A.E., Kel O.V., Ignatieva E.V., Annako E.A., Podkolodnaya O.A., Kolpakov F.A., Podkolodny N.L. and Kolchanov N.A.; Nucleic Acids Res. 26:362-367(1998). 
Notes This manual and the accompanying data bank may be copied and redistributed freely, without advance permission, provided that this statement is reproduced with each copy. Suggestions and comments are welcome. For information on how to submit data, please refer to the data submission form (the SUBMIT.TXT file). Citation If you want to cite SWISS-PROT in a publication please use the following reference: Bairoch A. and Apweiler R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998. Nucleic Acids Res. 26:38-42(1998). ----------------------------------------------------------------------- TABLE OF CONTENTS 1) What is SWISS-PROT ? 2) Conventions used in the data bank 2.1 General structure of the data bank 2.2 Classes of data 2.3 Structure of a sequence entry 3) The different line types 3.1 The ID line 3.2 The AC line 3.3 The DT line 3.4 The DE line 3.5 The GN line 3.6 The KW line 3.7 The OS line 3.8 The OG line 3.9 The OC line 3.10 The reference (RN, RP, RC, RX, RA, RL) lines 3.11 The DR line 3.12 The FT line 3.13 The SQ line 3.14 The sequence data line 3.15 The CC line 3.16 The // line Appendix A: Feature table keys A.1 Change indicators A.2 Amino acid modifications A.3 Regions A.4 Secondary structure A.5 Others Appendix B: Amino acid codes Appendix C: Format differences between the SWISS-PROT and EMBL data banks C.1 Generalities C.2 Differences in line types present in both data banks C.3 Line types defined by SWISS-PROT but currently not used by EMBL C.4 Line types defined by EMBL but currently not used by SWISS-PROT ----------------------------------------------------------------------- (1). WHAT IS SWISS-PROT ? 

SWISS-PROT is an annotated protein sequence database established in 1986
and maintained collaboratively, since 1987, by the group of Amos Bairoch
first at the Department of Medical Biochemistry of the University of
Geneva and now at the Swiss Institute of Bioinformatics (SIB) and the
EMBL Data Library (now the EMBL Outstation - The European Bioinformatics
Institute (EBI)). The SWISS-PROT protein sequence data bank consists of
sequence entries. Sequence entries are composed of different line types,
each with its own format. For standardization purposes the format of
SWISS-PROT follows as closely as possible that of the EMBL Nucleotide
Sequence Database.

The SWISS-PROT database distinguishes itself from other protein sequence
databases by four distinct criteria:

a) Annotation

In SWISS-PROT, as in most other sequence databases, two classes of data
can be distinguished: the core data and the annotation. For each sequence
entry the core data consists of the sequence data, the citation
information (bibliographical references) and the taxonomic data
(description of the biological source of the protein), while the
annotation consists of the description of the following items:

 o  Function(s) of the protein
 o  Post-translational modification(s). For example carbohydrates,
    phosphorylation, acetylation, GPI-anchor, etc.
 o  Domains and sites. For example calcium binding regions, ATP-binding
    sites, zinc fingers, homeobox, kringle, etc.
 o  Secondary structure
 o  Quaternary structure
 o  Similarities to other proteins
 o  Disease(s) associated with deficiency(ies) in the protein
 o  Sequence conflicts, variants, etc.

We try to include as much annotation information as possible in
SWISS-PROT. To obtain this information we use, in addition to the
publications that report new sequence data, review articles to
periodically update the annotations of families or groups of proteins.
We also make use of external experts, who have been recruited to send us
their comments and updates concerning specific groups of proteins.

We believe that our systematic recourse both to publications other than
those reporting the core data and to subject referees represents a unique
and beneficial feature of SWISS-PROT.

In SWISS-PROT, annotation is mainly found in the comment lines (CC), in
the feature table (FT) and in the keyword lines (KW). Most comments are
classified by `topics'; this approach permits the easy retrieval of
specific categories of data from the database.

b) Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate
entries which correspond to different literature reports. In SWISS-PROT
we try as much as possible to merge all these data so as to minimize the
redundancy of the database. If conflicts exist between various sequencing
reports, they are indicated in the feature table of the corresponding
entry.

c) Integration with other databases

It is important to provide the users of biomolecular databases with a
degree of integration between the three types of sequence-related
databases (nucleic acid sequences, protein sequences and protein tertiary
structures) as well as with specialized data collections. SWISS-PROT is
currently cross-referenced with 28 different databases. Cross-references
are provided in the form of pointers to information related to SWISS-PROT
entries and found in data collections other than SWISS-PROT. This
extensive network of cross-references allows SWISS-PROT to play a major
role as a focal point of biomolecular database interconnectivity.

d) Documentation

SWISS-PROT is distributed with a large number of index files and
specialized documentation files. Some of these files have been available
for a long time (this user manual, the release notes, the various indices
for authors, citations, keywords, etc.), but many have been created
recently and we are continuously adding new files.
The release notes contain an up-to-date descriptive listing of all the
distributed document files.


(2). CONVENTIONS USED IN THE DATA BANK

The following sections describe the general conventions used in
SWISS-PROT to achieve uniformity of presentation. Experienced users of
the EMBL Database can skip these sections and directly refer to
Appendix C, which lists the minor differences in format between the two
data collections.

(2.1). General structure of the data bank

The SWISS-PROT protein sequence data bank is composed of sequence
entries. Each entry corresponds to a single contiguous sequence as
contributed to the bank or reported in the literature. In some cases,
entries have been assembled from several papers that report overlapping
sequence regions. Conversely, a single paper can provide data for several
entries, e.g. when related sequences from different organisms are
reported.

References to positions within a sequence are made using sequential
numbering, beginning with 1 at the N-terminal end of the sequence.

Except for initiator N-terminal methionine residues, which are not
included in a sequence when their absence from the mature sequence has
been proven, the sequence data correspond to the precursor form of a
protein before post-translational modifications and processing.

(2.2). Classes of data

In order to make data available to users as quickly as possible after
publication, SWISS-PROT is now distributed with a supplement called
TrEMBL, where entries are released before all their details are
finalized. To distinguish between fully annotated entries and those in
TrEMBL, the `class' of each entry is indicated on the first (ID) line of
the entry. The two defined classes are:

   STANDARD   : Data which are complete to the standards laid down by
                the SWISS-PROT data bank.

   PRELIMINARY: Sequence entries which have not yet been annotated by
                the SWISS-PROT staff up to the standards laid down by
                SWISS-PROT. These entries are exclusively found in
                TrEMBL.

(2.3). Structure of a sequence entry

The entries in the SWISS-PROT data bank are structured so as to be usable
by human readers as well as by computer programs. The explanations,
descriptions, classifications and other comments are in ordinary English.
Wherever possible, symbols familiar to biochemists, protein chemists and
molecular biologists are used.

Each sequence entry is composed of lines. Different types of lines, each
with its own format, are used to record the various data which make up
the entry. A sample sequence entry is shown below.

ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   15-JUL-1998 (REL. 36, LAST ANNOTATION UPDATE)
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 87217060.
RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85086244.
RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL   NATURE 312:724-729(1984).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137898.
RA   SHIRAI T., YAMAGUCHI H., ITO H., TODD C.W., WALLACE R.B.;
RL   NATURE 313:803-806(1985).
RN   [4]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86016093.
RA   NEDWIN G.E., NAYLOR S.L., SAKAGUCHI A.Y., SMITH D.H.,
RA   JARRETT-NEDWIN J., PENNICA D., GOEDDEL D.V., GRAY P.W.;
RL   NUCLEIC ACIDS RES. 13:6361-6373(1985).
RN   [5]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85142190.
RA   WANG A.M., CREASEY A.A., LADNER M.B., LIN L.S., STRICKLER J.,
RA   VAN ARSDELL J.N., YAMAMOTO R., MARK D.F.;
RL   SCIENCE 228:149-154(1985).
RN   [6]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86030296.
RA   MARMENOUT A., FRANSEN L., TAVERNIER J., DER HEYDEN J., TIZARD R.,
RA   KAWASHIMA E., SHAW A., JOHNSON M.J., SEMON D., MUELLER R.,
RA   RUYSSCHAERT M.R., VAN VLIET A., FIERS W.;
RL   EUR. J. BIOCHEM. 152:515-522(1985).
RN   [7]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 93272029.
RA   IRIS F.J.M., BOUGUELERET L., PRIEUR S., CATERINA D., PRIMAS G.,
RA   PERROT V., JURKA J., RODRIGUEZ-TOME P., CLAVERIE J.-M.,
RA   DAUSSET J., COHEN D.;
RL   NAT. GENET. 3:137-145(1993).
RN   [8]
RP   X-RAY CRYSTALLOGRAPHY (2.9 ANGSTROMS).
RX   MEDLINE; 89159409.
RA   JONES E.Y., STUART D.I., WALKER N.P.;
RL   NATURE 338:225-228(1989).
RN   [9]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX   MEDLINE; 90008932.
RA   ECK M.J., SPRANG S.R.;
RL   J. BIOL. CHEM. 264:17595-17605(1989).
RN   [10]
RP   X-RAY CRYSTALLOGRAPHY (2.9 ANGSTROMS).
RX   MEDLINE; 91193276.
RA   JONES E.Y., STUART D.I., WALKER N.P.;
RL   J. CELL SCI. SUPPL. 13:11-18(1990).
RN   [11]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX   MEDLINE; 90008932.
RA   ECK M.J., SPRANG S.R.;
RL   J. BIOL. CHEM. 264:17595-17605(1989).
RN   [12]
RP   MUTAGENESIS.
RX   MEDLINE; 91184128.
RA   OSTADE X.V., TAVERNIER J., PRANGE T., FIERS W.;
RL   EMBO J. 10:827-836(1991).
RN   [13]
RP   MYRISTOYLATION.
RX   MEDLINE; 93018820.
RA   STEVENSON F.T., BURSTEN S.L., LOCKSLEY R.M., LOVETT D.H.;
RL   J. EXP. MED. 176:1053-1062(1992).
CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC       UNDER CERTAIN CONDITIONS.
CC   -!- SUBUNIT: HOMOTRIMER.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC       AN EXTRACELLULAR SOLUBLE FORM.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC       HEALTH AND MALNUTRITION.
CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR   EMBL; X02910; G37210; -.
DR   EMBL; M16441; G339741; -.
DR   EMBL; X01394; G37220; -.
DR   EMBL; M10988; G339738; -.
DR   EMBL; M26331; G339764; -.
DR   EMBL; Z15026; G37212; -.
DR   PIR; B23784; QWHUN.
DR   PIR; A44189; A44189.
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
DR   PDB; 1A8M; 17-JUN-98.
DR   MIM; 191160; -.
DR   PROSITE; PS00251; TNF_1; 1.
DR   PROSITE; PS50049; TNF_2; 1.
KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW   MYRISTYLATION; 3D-STRUCTURE.
FT   PROPEP        1     76
FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT   LIPID        19     19       MYRISTATE.
FT   LIPID        20     20       MYRISTATE.
FT   DISULFID    145    177
FT   MUTAGEN     105    105       L->S: LOW ACTIVITY.
FT   MUTAGEN     108    108       R->W: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     112    112       L->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     162    162       S->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     167    167       V->A,D: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     222    222       E->K: BIOLOGICALLY INACTIVE.
FT   CONFLICT     63     63       F -> S (IN REF. 5).
FT   STRAND       89     93
FT   TURN         99    100
FT   TURN        109    110
FT   STRAND      112    113
FT   TURN        115    116
FT   STRAND      118    119
FT   STRAND      124    125
FT   STRAND      130    143
FT   STRAND      152    159
FT   STRAND      166    170
FT   STRAND      173    174
FT   TURN        183    184
FT   STRAND      189    202
FT   TURN        204    205
FT   STRAND      207    212
FT   HELIX       215    217
FT   STRAND      218    218
FT   STRAND      227    232
SQ   SEQUENCE   233 AA;  25644 MW;  666D7069 CRC32;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//

Each line begins with a two-character line code, which indicates the type
of data contained in the line. The current line types and line codes, and
the order in which they appear in an entry, are shown below:

   ID - Identification.
   AC - Accession number(s).
   DT - Date.
   DE - Description.
   GN - Gene name(s).
   OS - Organism species.
   OG - Organelle.
   OC - Organism classification.
   RN - Reference number.
   RP - Reference position.
   RC - Reference comments.
   RX - Reference cross-references.
   RA - Reference authors.
   RL - Reference location.
   CC - Comments or notes.
   DR - Database cross-references.
   KW - Keywords.
   FT - Feature table data.
   SQ - Sequence header.
      - (blanks) sequence data.
   // - Termination line.

Some entries do not contain all of the line types, and some line types
occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//). In addition
the following line types are always present in an entry: AC (1 or more),
DT (3 times), DE (1 or more), OS (1 or more), OC (1 or more), RN (1 or
more), RP (1 or more), RA (1 or more), RL (1 or more), SQ (once), and at
least one sequence data line. The other line types (GN, OG, RC, RX, CC,
DR, KW and FT) are optional. A detailed description of each line type is
given in the next section of this document.
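
The rules on mandatory line types can be illustrated with a short Python
sketch. It is only an illustration under stated assumptions, not a tool
distributed with SWISS-PROT: the function name `check_entry`, the input
convention (a list of lines without trailing newlines) and the wording of
the messages are all hypothetical.

```python
from collections import Counter

# Line codes that this section lists as present one or more times.
AT_LEAST_ONCE = ("AC", "DE", "OS", "OC", "RN", "RP", "RA", "RL")
# Line codes that this section lists with an exact count.
EXACT_COUNT = {"DT": 3, "SQ": 1}
# Sequence data lines carry blanks where other lines carry a line code.
SEQUENCE_DATA = "  "


def check_entry(lines):
    """Return a list of problems for one entry (empty list: none found).

    `lines` is a list of strings without trailing newlines; this input
    convention is an assumption made for the sketch.
    """
    problems = []
    if not lines or lines[0][:2] != "ID":
        problems.append("entry does not begin with an ID line")
    if not lines or lines[-1].rstrip() != "//":
        problems.append("entry does not end with a // line")

    # The first two characters of every line are its line code.
    counts = Counter(line[:2] for line in lines)
    for code in AT_LEAST_ONCE:
        if counts[code] < 1:
            problems.append("missing mandatory line type " + code)
    for code, expected in EXACT_COUNT.items():
        if counts[code] != expected:
            problems.append(
                f"expected {expected} {code} line(s), found {counts[code]}"
            )
    if counts[SEQUENCE_DATA] < 1:
        problems.append("no sequence data line")
    return problems
```

The sketch only counts line codes; it does not check the order of the
lines or the format of their contents.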
It must be noted that all SWISS-PROT line types exist in the EMBL
Database. A description of the format differences between the SWISS-PROT
and EMBL data banks is given in Appendix C of this document.

The two-character line type code which begins each line is always
followed by three blanks, so that the actual information begins with the
sixth character. Information is not extended beyond character position
75, with one exception: CC lines that contain the "DATABASE" topic (see
section 3.15).


(3). THE DIFFERENT LINE TYPES

(3.1). The ID line

The ID (IDentification) line is always the first line of an entry. The
general form of the ID line is:

ID   ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

(3.1.1). Entry Name

The first item on the ID line is the entry name of the sequence. This
name is a useful means of identifying a sequence. The entry name consists
of up to ten uppercase alphanumeric characters. SWISS-PROT uses a general
purpose naming convention which can be symbolized as X_Y, where X is a
mnemonic code of at most 4 alphanumeric characters representing the
protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is for
Hemoglobin alpha chain and INS is for Insulin.

The `_' sign serves as a separator.

Y is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the protein. This code
is generally made of the first three letters of the genus and the first
two letters of the species. Examples: PSEPU is for Pseudomonas putida and
NAJNI is for Naja nivea.

However, for species commonly encountered in the data bank,
self-explanatory codes are used. There are 16 of those codes.
They are: BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia
coli, HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE
for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT for
Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine max),
TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat (Triticum
aestivum), YEAST for Baker's yeast (Saccharomyces cerevisiae).

As it was not possible to apply the above rules to viruses, they were
given arbitrary, but generally easy to remember, identification codes.

In some cases it was not possible to assign a definitive code to a
species. In these cases a temporary code was chosen.

Examples of complete protein sequence entry names are: RL1_ECOLI for
ribosomal protein L1 from Escherichia coli, FER_HALHA for ferredoxin from
Halobacterium halobium.

All the presently defined species identification codes are listed in the
SWISS-PROT document file SPECLIST.TXT.

(3.1.2). Data class

The second item on the ID line indicates the data class of the entry (see
section 2.2).

(3.1.3). Molecule type

The third item on the ID line is a three-letter code which indicates the
type of molecule of the entry: in SWISS-PROT it is PRT (for PRoTein).

(3.1.4). Length of the molecule

The fourth and last item of the ID line is the length of the molecule,
which is the total number of amino acids in the sequence. This number
includes the positions reported to be present but which have not been
determined (coded as `X'). The length is followed by the letter code AA
(Amino Acids).

(3.1.5). Examples of identification lines

Two examples of ID lines are shown below:

ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
ID   GIA2_GIALA     STANDARD;      PRT;   296 AA.


(3.2). The AC line

The AC (ACcession number) line lists the accession numbers associated
with an entry.
An example of an accession number line is shown below:

AC   P00321; P05348;

The accession numbers are separated by semicolons and the list is
terminated by a semicolon. If necessary, more than one AC line will be
used.

The purpose of accession numbers is to provide a stable way of
identifying entries from release to release. It is sometimes necessary
for reasons of consistency to change the names of the entries, for
example, to ensure that related entries have similar names. However, an
accession number is always conserved, and therefore allows unambiguous
citation of SWISS-PROT entries.

Researchers who wish to cite entries in their publications should always
cite the first accession number.

Entries will have more than one accession number if they have been merged
or split. For example, when two entries are merged into one, a new
accession number goes at the start of the AC line, and those from the
merged entries are listed after this one. Similarly, if an existing entry
is split into two or more entries (a rare occurrence), the original
accession number list is retained in all the derived entries.

An accession number is dropped only when the data to which it was
assigned have been completely removed from the data bank.


(3.3). The DT line

The DT (DaTe) lines show the date of entry or last modification of the
sequence entry. The format of the DT lines is:

DT   DD-MMM-YEAR (REL. XX, COMMENT)

where `DD' is the day, `MMM' the month, `YEAR' the year, and `XX' the
SWISS-PROT release number. The comment portion of the line indicates the
action taken on that date. There are ALWAYS three DT lines in each entry,
and each of them is associated with a specific comment:

 -  The first DT line indicates when the entry first appeared in the data
    bank. The associated comment is `CREATED'.
 -  The second DT line indicates when the sequence data was last
    modified. The associated comment is `LAST SEQUENCE UPDATE'.
 -  The third DT line indicates when any data other than the sequence was
    last modified. The associated comment is `LAST ANNOTATION UPDATE'.

Example of a block of DT lines:

DT   01-JAN-1988 (REL. 06, CREATED)
DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)


(3.4). The DE line

The DE (DEscription) lines contain general descriptive information about
the sequence stored. This information is generally sufficient to identify
the sequence precisely. The format of the DE lines is:

DE   DESCRIPTION.

The description is given in ordinary English and is free-format. In some
cases, more than one DE line is required; in this case, the text is
divided only between words and only the last DE line is terminated by a
period.

When the complete sequence was not determined the last information given
on the DE lines will be `(FRAGMENT)' or `(FRAGMENTS)'.

Two examples of description lines are given here:

DE   NADH DEHYDROGENASE (EC 1.6.99.3).

DE   LYSOPINE DEHYDROGENASE (EC 1.5.1.16) (OCTOPINE SYNTHASE)
DE   (LYSOPINE SYNTHASE) (FRAGMENT).


(3.5). The GN line

The GN (Gene Name) line contains the name(s) of the gene(s) that encode
the stored protein sequence. The format of the GN line is:

GN   NAME1[ AND|OR NAME2...].

Examples:

GN   ALB.
GN   REX-1.

It often occurs that more than one gene name has been assigned to an
individual locus. In that case all the synonyms will be listed. The word
`OR' separates the different designations. The first name in the list is
assumed to be the most correct (or most current) designation. Example:

GN   HNS OR DRDX OR OSMZ OR BGLY.

In a few cases, multiple genes encode an identical protein sequence. In
that case all the different gene names will be listed. The word `AND'
separates the designations. Example:

GN   CECA1 AND CECA2.

In very rare cases `AND' and `OR' can both be present. In that case
parentheses are used as shown in the following example:

GN   GVPA AND (GVPB OR GVPA2).


(3.6).
The KW line

The KW (KeyWord) lines provide information which can be used to generate
cross-reference indexes of the sequence entries based on functional,
structural, or other categories. The keywords chosen for each entry serve
as a subject reference for the sequence. Often several KW lines are
necessary for a single entry. The format of the KW lines is:

KW   KEYWORD[; KEYWORD...].

More than one keyword may be listed on each KW line; the keywords are
separated by semicolons, and the last keyword is followed by a period.
Keywords may consist of more than one word (they may contain blanks), but
are never split between lines. An example of a KW line is:

KW   OXIDOREDUCTASE; ACETYLATION.

The order of the keywords is not significant. The above example could
also have been written:

KW   ACETYLATION; OXIDOREDUCTASE.


(3.7). The OS line

The OS (Organism Species) line specifies the organism(s) from which the
stored sequence was obtained. In the rare case where all the species
information will not fit on a single line, more than one OS line is used.
The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and
species designation followed by the English name (in parentheses). For
viruses, only the common English name is given.

In cases where a protein sequence is identical in more than one species
the OS line(s) will list the names of all those species.

Examples of OS lines are shown here:

OS   ESCHERICHIA COLI.
OS   HOMO SAPIENS (HUMAN).
OS   ROUS SARCOMA VIRUS (STRAIN SCHMIDT-RUPPIN).
OS   NAJA NAJA (INDIAN COBRA), AND NAJA NIVEA (CAPE COBRA).


(3.8). The OG line

The OG (OrGanelle) lines indicate if the gene coding for a protein
originates from the mitochondria, the chloroplast, a cyanelle, or a
plasmid. The format of the OG line is:

OG   CHLOROPLAST.
OG   CYANELLE.
OG   MITOCHONDRION.
OG   PLASMID name.

where `name' is the name of the plasmid.


(3.9).
The OC line

The OC (Organism Classification) lines contain the taxonomic
classification of the source organism. The classification is listed
top-down as nodes in a taxonomic tree in which the most general grouping
is given first. The classification may be distributed over several OC
lines, but nodes are not split or hyphenated between lines. The
individual items are separated by semicolons and the list is terminated
by a period. The format of the OC lines is:

OC   NODE[; NODE...].

For example the classification lines for a human sequence would be:

OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.


(3.10). The reference (RN, RP, RC, RX, RA, RL) lines

These lines comprise the literature citations within SWISS-PROT. The
citations indicate the papers from which the data has been abstracted.
The reference lines for a given citation occur in a block, and are always
in the order RN, RP, RC, RX, RA, RL. Within each such reference block the
RN and RP lines occur once, the RC and RX lines occur zero or more times,
and the RA and RL lines each occur one or more times. If several
references are given, there will be a reference block for each.

An example of a complete reference is:

RN   [1]
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC   STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;
RX   MEDLINE; 91002678.
RA   CHAN Y.-L., PAZ V., OLVERA J., WOOL I.G.;
RL   BIOCHIM. BIOPHYS. ACTA 1050:69-73(1990).

The formats of the individual lines are explained below.

(3.10.1). The RN line

The RN (Reference Number) line gives a sequential number to each
reference citation in an entry. This number is used to indicate the
reference in comments and feature table notes. The format of the RN line
is:

RN   [N]

where N denotes the nth reference for this entry. The reference number is
always enclosed in square brackets.

(3.10.2). The RP line

The RP (Reference Position) line describes the extent of the work carried
out by the authors of the reference cited.
The format of the RP line is: RP COMMENT. Typical examples of RP lines are shown below: RP SEQUENCE FROM N.A. RP SEQUENCE FROM N.A., AND SEQUENCE OF 12-35. RP SEQUENCE OF 34-56; 67-73 AND 123-345, AND DISULFIDE BONDS. RP REVISIONS TO 67-89. RP STRUCTURE BY NMR. RP X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS). RP CHARACTERIZATION. RP MUTAGENESIS OF TYR-56. RP REVIEW. RP VARIANT ALA-58. RP VARIANTS XLI LEU-341; ARG-372 AND TYR-446. (3.10.3). The RC line The RC (Reference Comment) lines are optional lines which are used to store comments relevant to the reference cited. The format of the RC line is: RC TOKEN1=TEXT; TOKEN2=TEXT; ..... Where the currently defined tokens are: PLASMID SPECIES STRAIN TISSUE TRANSPOSON The `SPECIES' token is only used when an entry describes a sequence which is identical in more than one species; similarly the `PLASMID' is only used if an entry describes a sequence identical in more than one plasmid. An example of an RC line is: RC STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER; (3.10.4). The RX line The RX (Reference cross-reference) line is an optional line which is used to indicate the identifier assigned to a specific reference in a bibliographic database. The format of the RX line is: RX BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER. where the valid bibliographic database names and their associated identifier are: Name: MEDLINE Database: Medline from the National Library of Medicine (NLM) Identifier: Eight digit Medline Unique Identifier (UID) Example of RX line: RX MEDLINE; 91002678. (3.10.5). The RA line The RA (Reference Author) lines list the authors of the paper (or other work) cited. All of the authors are included, and are listed in the order given in the paper. The names are listed surname first followed by a blank followed by initial(s) with periods. The authors' names are separated by commas and terminated by a semicolon. Author names are not split between lines. 
An example of the use of RA lines is shown below:

RA YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,
RA HOROWITZ H., VAN CLEEMPUT M., WU A.M.;

As many RA lines as necessary are included for each reference.

(3.10.6). The RL line

The RL (Reference Location) lines contain the conventional citation information for the reference. In general, the RL lines alone are sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation, the volume number, the page range, and the year. The format for such an RL line is:

RL JOURNAL VOL:PP-PP(YEAR).

Journal names are abbreviated according to the conventions used by the National Library of Medicine (NLM) and are based on the existing ISO and ANSI standards. A list of the abbreviations currently in use is given in the SWISS-PROT document file JOURLIST.TXT. An example of an RL line is:

RL J. MOL. BIOL. 168:321-331(1983).

When a reference is made to a paper which is `in press' at the time when the data bank is released, the page range, and possibly also the volume number, are indicated as '0' (zero). An example of an RL line of this type is shown here:

RL NUCLEIC ACIDS RES. 22:0-0(1994).

b) Book citations

A variation of the RL line format is used for papers found in books or other similar publications, which are cited as shown below:

RL (IN) THE ENZYMES, 3RD ED., VOL.11, PART A, BOYER P.D., ED.,
RL PP.397-547, ACADEMIC PRESS, NEW YORK, (1975).

The first RL line contains the designation `(IN)', which indicates that this is a book reference. These citations generally include the following information: the title of the book, the name of the editor(s), the page range, the publisher name, the city where it is published, and the year of publication (which is always shown in parentheses). The (IN) prefix is also used for references to the electronic Plant Gene Register (See http://www.tarweed.com/pgr/). Example:

RL (IN) PLANT GENE REGISTER PGR98-023.
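As a rough illustration of the journal and book formats in a) and b) above, the following Python sketch extracts the fields of a single-line journal citation and skips `(IN)' book references. The function name parse_rl_journal and the dictionary keys are illustrative choices, not part of SWISS-PROT, and the pattern assumes purely numeric volume and page values.

```python
import re

# Pattern for "RL JOURNAL VOL:PP-PP(YEAR)." as described under a).
# Assumption: volume and pages are plain integers.
_JOURNAL_RE = re.compile(
    r"^RL\s+(?P<journal>.+?)\s+(?P<volume>\d+):"
    r"(?P<first>\d+)-(?P<last>\d+)\((?P<year>\d{4})\)\.$"
)


def parse_rl_journal(line):
    """Return a dict for a single-line journal citation, or None otherwise."""
    line = line.rstrip()
    # Book references and Plant Gene Register entries carry the (IN) prefix.
    if re.match(r"^RL\s+\(IN\)", line):
        return None
    m = _JOURNAL_RE.match(line)
    if m is None:
        return None
    first, last = int(m.group("first")), int(m.group("last"))
    return {
        "journal": m.group("journal"),
        "volume": int(m.group("volume")),
        "first_page": first,
        "last_page": last,
        "year": int(m.group("year")),
        # A page range of 0-0 marks a paper that was in press at release time.
        "in_press": first == 0 and last == 0,
    }
```

Calling parse_rl_journal on the `in press' example above would flag it through the in_press key, while the other RL forms described in this section simply yield None.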
c) Unpublished results

RL lines for unpublished results follow the format shown in the following example:

RL UNPUBLISHED RESULTS, CITED BY:
RL ULRICH E.L., KROGMANN D.W., MARKLEY J.L.;
RL J. BIOL. CHEM. 257:9356-9364(1982).

d) Unpublished observations

For unpublished observations the format of the RL line is:

RL UNPUBLISHED OBSERVATIONS (MMM-YEAR).

Where `MMM' is the month and `YEAR' is the year. We use the `unpublished observations' RL line to cite communications by scientists to SWISS-PROT of unpublished information concerning various aspects of a sequence entry.

e) Thesis

For Ph.D. theses the format of the RL line is:

RL THESIS (YEAR), INSTITUTION_NAME, COUNTRY.

An example of such a line is given here:

RL THESIS (1972), GEORGE WASHINGTON UNIVERSITY, U.S.A.

f) Patent applications

For patent applications the format of the RL line is:

RL PATENT NUMBER PAT_NUMB, DD-MMM-YYYY.

Where `PAT_NUMB' is the international publication number of the patent, `DD' is the day, `MMM' is the month and `YYYY' is the year.

g) Submissions

The final form that an RL line can take is that used for submissions. The format of such an RL line is:

RL SUBMITTED (MMM-YEAR) TO DATABASE_NAME.

Where `MMM' is the month, `YEAR' is the year and `DATABASE_NAME' is one of the following:

EMBL/GENBANK/DDBJ DATA BANKS
THE SWISS-PROT DATA BANK
THE ECOSEQ DATA BANK
THE HIV DATA BANK
THE MIM DATA BANK
THE NEWAT DATA BANK
THE PDB DATA BANK
THE PIR DATA BANK

Two examples of submission RL lines are given here:

RL SUBMITTED (APR-1994) TO EMBL/GENBANK/DDBJ DATA BANKS.
RL SUBMITTED (FEB-1995) TO THE SWISS-PROT DATA BANK.

(3.11). The DR line

(3.11.1). Definition

The DR (Database cross-Reference) lines are used as pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.
For example, if the X-ray crystallographic atomic coordinates of a sequence are stored in the Brookhaven Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry (or entries) in that data bank. For a sequence translated from a nucleotide sequence there can be DR lines pointing to entries in the EMBL/Genbank/DDBJ database which correspond to the DNA or RNA sequence(s) from which it was translated. The format of the DR line is:

DR DATA_BANK_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.

This format applies to all cross-references except those to the EMBL/Genbank/DDBJ nucleotide sequence database and to the PROSITE database. The specific formats for these cross-references are described in sections 3.11.5 and 3.11.6.

(3.11.2). Data bank identifier

The first item on the DR line, the data bank identifier, is the abbreviated name of the data collection to which reference is made. The currently defined data bank identifiers are the following:

EMBL                 Nucleotide sequence database of EMBL (EBI) (see 3.11.5)
DICTYDB              Dictyostelium discoideum genome database
ECO2DBASE            Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE)
ECOGENE              Escherichia coli K12 genome database (EcoGene)
FLYBASE              Drosophila genome database (FlyBase)
GCRDB                G-protein--coupled receptor database (GCRDb)
HIV                  HIV sequence database
HSC-2DPAGE           Harefield hospital 2D gel protein databases (HSC-2DPAGE)
HSSP                 Homology-derived secondary structure of proteins database (HSSP).
MAIZEDB              Maize genome database (MaizeDB)
MAIZE-2DPAGE         Maize genome 2D Electrophoresis database (Maize-2DPAGE)
MGD                  Mouse genome database (MGD)
MENDEL               Plant gene nomenclature database (Mendel)
MIM                  Mendelian Inheritance in Man Database (MIM)
PDB                  Brookhaven Protein Data Bank (PDB)
PIR                  Protein sequence database of the Protein Information Resource (PIR)
PROSITE              PROSITE dictionary of sites and patterns in proteins (see 3.11.6)
REBASE               Restriction enzyme database (REBASE)
AARHUS/GHENT-2DPAGE  Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
SGD                  Saccharomyces Genome Database (SGD)
STYGENE              Salmonella typhimurium LT2 genome database (StyGene)
SUBTILIST            Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE         Human 2D Gel Protein Database from the University of Geneva (SWISS-2DPAGE)
TIGR                 The bacterial database(s) of 'The Institute for Genomic Research' (TIGR)
TRANSFAC             Transcription factor database (Transfac)
WORMPEP              Caenorhabditis elegans genome sequencing project protein database (Wormpep)
YEPD                 Yeast electrophoresis protein database (YEPD)

(3.11.3). The primary identifier

The second item on the DR line, the primary identifier, is an unambiguous pointer to the information entry in the data bank to which reference is being made.

- For a DictyDb, EcoGene, FlyBase, GCRDb, HIV, HSC-2DPAGE, MAIZE-2DPAGE, Mendel, MGD, MIM, PIR, REBASE, SGD, StyGene, SubtiList, SWISS-2DPAGE or Transfac reference the primary identifier is the first accession number (also called the Unique Identifier in some data banks) of the entry to which reference is being made.
- For a PDB reference the primary identifier is the entry name.
- For an AARHUS/GHENT-2DPAGE, ECO2DBASE or YEPD reference the primary identifier is the protein spot alphanumeric designation.
- For a WormPep reference the primary identifier is the cosmid-derived name given to that protein by the C.elegans genome sequencing project.
- For a MaizeDB reference the primary identifier is the "Gene-product" accession ID.
- For a TIGR reference the primary identifier is the genome Open Reading Frame (ORF) code.
- For an HSSP reference the primary identifier is the accession number of a SWISS-PROT entry cross-referenced to a PDB entry whose structure is expected to be similar to that of the entry in which the HSSP cross-reference is present.

(3.11.4). The secondary identifier

The third item on the DR line, the secondary identifier, is generally used to complement the information given by the primary identifier.

- For an HIV, PIR, or REBASE reference the secondary identifier is the entry's name.
- For a PDB reference the secondary identifier is the most recent date on which PDB revised the entry (last `REVDAT' record).
- For a DictyDb, EcoGene, FlyBase, Mendel, MGD, SGD, StyGene or SubtiList reference the secondary identifier is the gene designation. If the gene designation is not available a dash "-" is used.
- For an ECO2DBASE reference the secondary identifier is the latest release number or edition of the database that has been used to derive the cross-reference.
- For a SWISS-2DPAGE, HSC-2DPAGE or MAIZE-2DPAGE reference the secondary identifier is the species or tissue of origin.
- For an AARHUS/GHENT-2DPAGE reference the secondary identifier is either `IEF' (for isoelectric focusing) or `NEPHGE' (for non-equilibrium pH gradient electrophoresis).
- For a WormPep reference the secondary identifier is a number attributed by the C.elegans genome sequencing project to that protein.
- For a GCRDb, MaizeDB, MIM, TIGR, Transfac or YEPD reference the secondary identifier is not defined and a dash "-" is stored in that field.
- For an HSSP reference the secondary identifier is the entry name of the PDB structure related to that of the entry in which the HSSP cross-reference is present.

Examples of complete DR lines are shown here:

DR AARHUS/GHENT-2DPAGE; 8006; IEF.
DR DICTYDB; DD01047; MYOA.
DR ECO2DBASE; G052.0; 6TH EDITION. DR ECOGENE; EG10054; ARAC. DR FLYBASE; FBgn0000055; Adh. DR GCRDB; GCR_0087; -. DR HIV; K02013; NEF$BRU. DR HSC-2DPAGE; P47985; HUMAN. DR HSSP; P00438; 1DOB. DR MAIZEDB; 25342; -. DR MAIZE-2DPAGE; P80607; COLEOPTILE. DR MENDEL; 294; Amahy;psbA;1. DR MGD; MGI:87920; ADFP. DR MIM; 249900; -. DR PDB; 3ADK; 16-APR-88. DR PIR; A02768; R5EC7. DR REBASE; RB00005; EcoRI. DR SGD; L0000008; AAR2. DR STYGENE; SG10312; PROV. DR SUBTILIST; BG10774; OPPD. DR SWISS-2DPAGE; P10599; HUMAN. DR TIGR; MJ0125; -. DR TRANSFAC; T00141; -. DR WORMPEP; ZK637.7; CE00437. DR YEPD; 4270; -. (3.11.5). Cross-references to the nucleotide sequence database The specific format for cross-references to the EMBL/Genbank/DDBJ nucleotide sequence database is: DR EMBL; ACCESSION_NUMBER; PID; STATUS_IDENTIFIER. Where 'PID' stands for the "Protein IDentification" number. It is a number which is stored, in nucleotide sequence entries, in a qualifier called "/db_xref" which is tagged to every CDS in the nucleotide database. Example: FT CDS 54..1382 FT /note="ribulose-1,5-bisphosphate carboxylase/ FT oxygenase/activase precursor" FT /db_xref="PID:g1006835" The 'STATUS_IDENTIFIER' provides information about the relationship between the sequence in the SWISS-PROT entry and the CDS in the corresponding EMBL entry. a) In most cases the translation of the EMBL nucleotide sequence CDS results in the same sequence as shown in the corresponding SWISS-PROT entry or the differences are mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines. In these cases the status identifier shows a dash ("-"). Example: DR EMBL; Y00312; G63880; -. 
b) In some cases the translation of the EMBL nucleotide sequence CDS results in a sequence different from the sequence shown in the corresponding SWISS-PROT entry and the differences are either not mentioned in the SWISS-PROT feature (FT) lines as CONFLICT, VARIANT or VARSPLIC and in the RP lines, or simply do not meet the criteria for such situations.

1) If the difference is due to a different start of the sequence (e.g. SWISS-PROT believes that the start of the sequence is upstream or downstream of the site annotated as the start of the sequence in the EMBL database), the status identifier shows the comment "ALT_INIT". Example:

DR EMBL; L29151; G466334; ALT_INIT.

2) If the difference is due to a different termination of the sequence (e.g. SWISS-PROT believes that the termination of the sequence is upstream or downstream of the site annotated as the end of the sequence in the EMBL database), the status identifier shows the comment "ALT_TERM". Example:

DR EMBL; L20562; G398099; ALT_TERM.

3) If the difference is due to frameshifts in the EMBL sequence, the status identifier shows the comment "ALT_FRAME". Example:

DR EMBL; M95935; G146416; ALT_FRAME.

4) If the difference is not due to the cases mentioned above (e.g. wrong intron-exon boundaries given in the EMBL entry) or to a mixture of the cases mentioned above, the status identifier shows the comment "ALT_SEQ". Example:

DR EMBL; X79206; G809602; ALT_SEQ.

c) In some cases the nucleotide sequence of a complete CDS is divided into exons present in different EMBL entries. We point to the exon-containing EMBL entries by citing the PID as secondary identifier and placing the comment "JOINED" in the status identifier. These EMBL entries do not contain a CDS feature; they contain exons joined to a CDS feature which is labeled with the given PID. Example:

DR EMBL; M63397; G177196; -.
DR EMBL; M63395; G177196; JOINED.
DR EMBL; M63396; G177196; JOINED.
In the above example the SWISS-PROT sequence is derived from the CDS labeled with the PID G177196. This CDS feature can be found in the EMBL entry M63397. Exons belonging to this CDS are not only found in EMBL entry M63397, but also in the EMBL entries M63395 and M63396.

d) In some cases there is no CDS feature key annotating a protein translation in an EMBL entry and thus no PID for that CDS. Therefore it is not possible for us to point to a PID as a secondary identifier. In these cases we point to the relevant EMBL entries by including a dash ("-") in the position of the missing PID and "NOT_ANNOTATED_CDS" in the status identifier. Example:

DR EMBL; J04126; -; NOT_ANNOTATED_CDS.

(3.11.6). Cross-references to the PROSITE database

The specific format for cross-references to the PROSITE protein domain and family database is:

DR PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.

Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE pattern or profile entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is one of the following:

n
FALSE_NEG
PARTIAL
UNKNOWN_n

Where 'n' is the number of hits of the pattern or profile in that particular protein sequence. The 'FALSE_NEG' status indicates that while the pattern or profile did not detect the protein sequence, it is a member of that particular family or domain. The 'PARTIAL' status indicates that the pattern or profile did not detect the sequence because that sequence is not complete and lacks the region on which the pattern/profile is based. Finally the 'UNKNOWN' status indicates uncertainty as to whether the sequence is a member of the family or domain described by the pattern/profile. Example of PROSITE cross-references:

DR PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
DR PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
DR PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.
DR PROSITE; PS00383; TYR_PHOSPHATASE_1; UNKNOWN_1.

(3.12).
The FT line

The FT (Feature Table) lines provide a precise but simple means for the annotation of the sequence data. The table describes regions or sites of interest in the sequence. In general the feature table lists post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included in the feature table. The feature table is updated when more becomes known about a given sequence.

The FT lines have a fixed format. The column numbers allocated to each of the data items within each FT line are shown in the following table (column numbers not referred to in the table are always occupied by blanks):

+---------------+-----------------------+
| Columns       | Data item             |
+---------------+-----------------------+
|  1- 2         | FT                    |
|  6-13         | Key name              |
| 15-20         | `FROM' endpoint       |
| 22-27         | `TO' endpoint         |
| 35-75         | Description           |
+---------------+-----------------------+

The key name and the endpoints are always on a single line, but the description may require continuation. For this purpose, the next line contains blanks in the key, the `FROM', and the `TO' column positions, and the description is continued in its normal position. Thus a blank key always denotes a continuation of the previous description. An example of a feature table is shown below:

FT   NON_TER       1      1
FT   PEPTIDE       1      9       ARG-VASOPRESSIN.
FT   PEPTIDE      13    107       NEUROPHYSIN 2.
FT   PEPTIDE     109    147       COPEPTIN.
FT   DISULFID      1      6
FT   MOD_RES       9      9       AMIDATION (ACTIVE ARG-VASOPRESSIN).
FT   CONFLICT    102    102       D -> S (IN REF. 2).
FT   CONFLICT    105    105       MISSING (IN REF. 3).
FT   CARBOHYD    114    114

The first item on each FT line is the key name, which is a fixed abbreviation (up to 8 characters) with a defined meaning. A list of the currently defined key names can be found in Appendix A of this document. Following the key name are the `FROM' and `TO' endpoint specifications.
These fields designate (inclusively) the endpoints of the feature named in the key field. In general, these fields simply contain residue numbers indicating positions in the sequence as listed. Note that these positions are always specified assuming a numbering of the listed sequence from 1 to n; this numbering is not necessarily the same as that used in the original reference(s). The following should be noted in interpreting these endpoints:

- If the `FROM' and `TO' specifications are equal, the feature indicated consists of the single amino acid at that position.
- When a feature is known to extend beyond the end(s) of the sequenced region, the endpoint specification will be preceded by `<' for features which continue to the left end (N-terminal direction) or by `>' for features which continue to the right end (C-terminal direction).
- Unknown endpoints are denoted by `?'.

See also the notes concerning each of the key names in Appendix A.

The remaining portion of the FT line is a description which contains additional information about the feature. For example, for a residue post-translational modification (key MOD_RES) the chemical nature of that modification is given, while for a sequence variation (key VARIANT) the nature of the variation is indicated. This portion of the line is generally in free form, and may be continued on additional lines when necessary.

(3.13). The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data and gives a quick summary of its content. The format of the SQ line is:

SQ SEQUENCE XXXX AA; XXXXX MW; XXXXX CRC32;

The line contains the length of the sequence in amino acids (AA) followed by the molecular weight (MW) rounded to the nearest mass unit (Dalton) and the sequence 32-bit CRC (Cyclic Redundancy Check) value (CRC32). An example of an SQ line is shown here:

SQ SEQUENCE 233 AA; 25630 MW; 146A1B48 CRC32;

The information in the SQ line can be used as a check on accuracy or for statistical purposes.
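The accuracy check mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names parse_sq_line and sequence_length_matches are hypothetical, the CRC32 value is read as a hexadecimal string but not recomputed (the checksum parameters are not given in this section), and only the declared length is compared with the residues found on the sequence data lines described in 3.14.

```python
import re

# Matches "SQ SEQUENCE <length> AA; <mw> MW; <crc> CRC32;"
_SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(?P<length>\d+)\s+AA;\s+(?P<mw>\d+)\s+MW;"
    r"\s+(?P<crc>[0-9A-Fa-f]+)\s+CRC32;$"
)


def parse_sq_line(line):
    """Return (length, molecular_weight, crc32_hex) from an SQ line."""
    m = _SQ_RE.match(line.rstrip())
    if m is None:
        raise ValueError("not a valid SQ line: %r" % line)
    return int(m.group("length")), int(m.group("mw")), m.group("crc").upper()


def sequence_length_matches(sq_line, sequence_lines):
    """Compare the declared AA count with the residues actually listed."""
    declared, _mw, _crc = parse_sq_line(sq_line)
    # Sequence data lines hold blank-separated groups of residues;
    # removing all whitespace leaves the raw one-letter codes.
    residues = "".join("".join(line.split()) for line in sequence_lines)
    return declared == len(residues)
```

A mismatch reported by sequence_length_matches would usually point to a truncated or corrupted entry rather than to an error in the SQ line itself.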
The word `SEQUENCE' is present solely for readability.

(3.14). The sequence data line

The sequence data line has a line code consisting of two blanks rather than the two-letter codes used up until now. The sequence is written 60 amino acids per line, in groups of 10 amino acids, beginning in position 6 of the line. The characters used for the amino acids are the standard IUPAC one letter codes (see Appendix B). An example of sequence data lines is shown here:

     GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG
     EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE

(3.15). The CC line

The CC lines are free text comments on the entry, and may be used to convey any useful information. The comments always appear below the last reference line and are grouped together in comment blocks; a block consists of one or more comment lines. The start of a block is marked with the characters `-!-'. The format of a comment block is:

CC -!- FIRST LINE OF A COMMENT BLOCK;
CC     SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

A large proportion of the comment blocks are arranged according to what we designate as `topics'. The format of a comment block which belongs to a `topic' is:

CC -!- TOPIC: FREE TEXT DESCRIPTION.

The current topics and their definitions are:

ALTERNATIVE PRODUCTS
   Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene(s) or by the use of alternative initiation codons.
CATALYTIC ACTIVITY
   Description of the reaction(s) catalysed by an enzyme [1].
CAUTION
   This topic warns you about possible errors and/or grounds for confusion.
COFACTOR
   Description of an enzyme cofactor.
DATABASE
   Description of a cross-reference to a network database/resource for a specific protein [2].
DEVELOPMENTAL STAGE
   Description of the developmental specific expression of a protein.
DISEASE
   Description of the disease(s) associated with a deficiency of a protein.
DOMAIN
   Description of the domain structure of a protein.
ENZYME REGULATION
   Description of an enzyme regulatory mechanism.
FUNCTION
   General description of the function(s) of a protein.
INDUCTION
   Description of the compound(s) which stimulate the synthesis of a protein.
MASS SPECTROMETRY
   Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods [3].
PATHWAY
   Description of the metabolic pathway(s) with which a protein is associated.
POLYMORPHISM
   Description of polymorphism(s).
PTM
   Description of a post-translational modification.
SIMILARITY
   Description of the similarity (or similarities), in sequence or structure, of a protein with other proteins.
SUBCELLULAR LOCATION
   Description of the subcellular location of a mature protein product.
SUBUNIT
   Description of the quaternary structure of a protein.
TISSUE SPECIFICITY
   Description of the tissue specificity of a protein.

[1] For the 'CATALYTIC ACTIVITY' topic: wherever possible we have described the catalytic activity of an enzyme according to the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB) as published in Enzyme Nomenclature, NC-IUBMB, Academic Press, New York, (1992).

[2] The syntax of the 'DATABASE' topic is:

CC -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].

where

- "NAME" is the name of the database;
- "NOTE" (optional) is a free text note;
- "WWW" (optional) is the WWW address (URL) of the database;
- "FTP" (optional) is the anonymous FTP address (including the directory name) where the database file(s) are stored.

Note: this is currently the only part of the database where lines longer than 75 characters can be found, as we do not reformat long URL or FTP addresses.

[3] The syntax of the 'MASS SPECTROMETRY' topic is:

CC -!- MASS SPECTROMETRY: MW=XXX[; MW_ERR=XX][; METHOD=XX][; RANGE=XX-XX].
where

- "MW=XXX" is the determined molecular weight (MW);
- "MW_ERR=XX" (optional) is the accuracy or error range of the MW measurement;
- "METHOD=XX" (optional) is the mass spectrometric method;
- "RANGE=XX-XX" (optional) is used to indicate what part of the protein sequence entry corresponds to the molecular weight. If this qualifier is not present, the MW value corresponds to the full length of the protein sequence.

We show here, for each of the defined topics, two examples of its usage:

CC -!- ALTERNATIVE PRODUCTS: TWO FORMS OF CLC-K2 ARE PRODUCED BY
CC     ALTERNATIVE SPLICING. THE SHORT FORM (CLC-K2S) DIFFERS FROM
CC     THE LONG FORM (CLC-K2L) BY A DELETION OF 55 RESIDUES.
CC -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
CC     THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
CC     ISOZYMES: ALPHA, BETA AND BETA'.
CC -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) = ADP +
CC     GLUTAMINE + PHOSPHATE.
CC -!- CATALYTIC ACTIVITY: (R)-2,3-DIHYDROXY-3-METHYLBUTANOATE +
CC     NADP(+) = (S)-2-HYDROXY-2-METHYL-3-OXOBUTANOATE + NADPH.
CC -!- CAUTION: REF.2 SEQUENCE DIFFERS FROM THAT SHOWN IN POSITIONS
CC     92 TO 165 DUE TO A FRAMESHIFT.
CC -!- CAUTION: IT IS UNCERTAIN WHETHER MET-1 OR MET-3 IS THE
CC     INITIATOR.
CC -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC -!- COFACTOR: FAD AND NONHEME IRON.
CC -!- DATABASE: NAME=CD40Lbase;
CC     NOTE=European CD40L defect database (mutation db);
CC     WWW="http://www.expasy.ch/www/cd40lbase.html";
CC     FTP="ftp://www.expasy.ch/databases/cd40lbase".
CC -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC     WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".
CC -!- DEVELOPMENTAL STAGE: EXPRESSED EARLY DURING CONIDIAL (DORMANT
CC     SPORES) DIFFERENTIATION.
CC -!- DEVELOPMENTAL STAGE: EXPRESSED IN EMBRYONIC AND EARLY LARVAL
CC     STAGES.
CC -!- DISEASE: DEFECTS IN PHKA1 ARE LINKED TO X-LINKED MUSCLE
CC     GLYCOGENOSIS, A DISEASE CHARACTERIZED BY SLOWLY PROGRESSIVE,
CC     PREDOMINANTLY DISTAL MUSCLE WEAKNESS AND ATROPHY.
CC -!- DISEASE: DEFECTS IN ALD ARE THE CAUSE OF X-LINKED
CC     ADRENOLEUKODYSTROPHY, A PEROXISOMAL DISORDER CHARACTERIZED BY
CC     PROGRESSIVE DEMYELINATION OF THE CNS AND ADRENAL
CC     INSUFFICIENCY.
CC -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC     TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.
CC -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC     CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).
CC -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
CC     BY ADENYLATION. THE FULLY ADENYLATED ENZYME IS INACTIVE.
CC -!- ENZYME REGULATION: ACTIVATED BY GRAM-NEGATIVE BACTERIAL
CC     LIPOPOLYSACCHARIDES AND CHYMOTRYPSIN.
CC -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.
CC -!- FUNCTION: INHIBITOR OF FUNGAL POLYGALACTURONASE. IT IS AN
CC     IMPORTANT FACTOR FOR PLANT RESISTANCE TO PHYTOPATHOGENIC
CC     FUNGI.
CC -!- INDUCTION: BY SALT STRESS AND BY ABSCISIC ACID (ABA).
CC -!- INDUCTION: BY INFECTION, PLANT WOUNDING, OR ELICITOR
CC     TREATMENT OF CELL CULTURES.
CC -!- MASS SPECTROMETRY: MW=71890; MW_ERR=7; METHOD=ELECTROSPRAY.
CC -!- MASS SPECTROMETRY: MW=8597.5; METHOD=ELECTROSPRAY;
CC     RANGE=40-119.
CC -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS PATHWAY.
CC -!- PATHWAY: LAST STEP IN PROTOHEME BIOSYNTHESIS. IN ERYTHROID
CC     CELLS, FERROCHELATASE APPEARS TO BE THE RATE-LIMITING ENZYME.
CC -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC     HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE
CC     WITH ARG-191 WITH A HIGH TURNOVER NUMBER.
CC -!- POLYMORPHISM: THE TWO MAIN ALLELES OF HP ARE CALLED HP1F
CC     (FAST) AND HP1S (SLOW). THE SEQUENCE SHOWN HERE IS THAT OF THE
CC     HP1S FORM.
CC -!- PTM: O-GLYCOSYLATED; AN UNUSUAL FEATURE AMONG VIRAL
CC     GLYCOPROTEINS.
CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC     PROTEOLYTIC PROCESSING.
CC -!- SIMILARITY: BELONGS TO PEPTIDASE FAMILY S8; ALSO KNOWN AS THE
CC     SUBTILASE FAMILY.
CC -!- SIMILARITY: BELONGS TO THE ATP-BINDING TRANSPORT PROTEIN
CC     FAMILY (ABC TRANSPORTERS). BELONGS TO THE MDR SUBFAMILY.
CC -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
CC -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN. INNER
CC     MEMBRANE.
CC -!- SUBUNIT: HOMOTETRAMER.
CC -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED
CC     BY A DISULFIDE BOND.
CC -!- TISSUE SPECIFICITY: KIDNEY, SUBMAXILLARY GLAND, AND URINE.
CC -!- TISSUE SPECIFICITY: SHOOTS, ROOTS, AND COTYLEDON FROM
CC     DEHYDRATING SEEDLINGS.

(3.16). The // line

The // (terminator) line contains no data or comments. It designates the end of an entry.

-----------------------------------------------------------------------

APPENDIX A: FEATURE TABLE KEYS

The definition of each of the key names used in the feature table is explained here. It is probable that new key names will progressively be added to this list. For each key a number of examples are presented.

(A.1). Change indicators

CONFLICT - Different papers report differing sequences. Examples of CONFLICT key feature lines:

FT CONFLICT 33 33 MISSING (IN REF. 2).
FT CONFLICT 60 60 P -> A (IN REF. 3 AND 4).
FT CONFLICT 81 84 ASTQ -> GWT (IN REF. 3).

VARIANT - Authors report that sequence variants exist. Examples of VARIANT key feature lines:

FT VARIANT 3 3 V -> I.
FT VARIANT 87 87 L -> T (IN STRAIN 2.3.1).
FT VARIANT 1 2 MISSING (IN 25% OF THE CHAINS).

VARSPLIC - Description of sequence variants produced by alternative splicing. Examples of VARSPLIC key feature lines:

FT VARSPLIC 194 196 GRP -> DVR (IN SHORT FORM).
FT VARSPLIC 197 211 MISSING (IN SHORT FORM).

MUTAGEN - Site which has been experimentally altered. Examples of MUTAGEN key feature lines:

FT MUTAGEN 65 65 H->F: 100% LOSS OF ACTIVITY.
FT MUTAGEN 123 123 G->R,L,M: DNA BINDING LOST.

(A.2). Amino-acid modifications

MOD_RES - Post-translational modification of a residue. The chemical nature of the modification is given in the description.
The general format of the MOD_RES description field is:

FT MOD_RES xxx xxx MODIFICATION (COMMENT).

The most frequently occurring modifications are the following:

ACETYLATION                  N-terminal or other.
AMIDATION                    Generally at the C-terminal of a mature active peptide.
BLOCKED                      Undetermined N- or C-terminal blocking group.
FORMYLATION                  Of the N-terminal methionine.
GAMMA-CARBOXYGLUTAMIC ACID
HYDROXYLATION                Of asparagine, aspartic acid, proline or lysine.
METHYLATION                  Generally of lysine or arginine.
PHOSPHORYLATION              Of serine, threonine, tyrosine, aspartic acid or histidine.
PYRROLIDONE CARBOXYLIC ACID  N-terminal glutamate which has formed an internal cyclic lactam.
SULFATATION                  Generally of tyrosine.

Examples of MOD_RES key feature lines:

FT MOD_RES 1 1 ACETYLATION.
FT MOD_RES 11 11 PHOSPHORYLATION (BY PKC).
FT MOD_RES 2 2 SULFATATION (BY SIMILARITY).
FT MOD_RES 8 8 AMIDATION (G-9 PROVIDE AMIDE GROUP).
FT MOD_RES 9 9 METHYLATION (MONO-, DI- & TRI-).

LIPID - Covalent binding of a lipidic moiety. The chemical nature of the bound lipid moiety is given in the description. The general format of the LIPID description field is:

FT LIPID xxx xxx MODIFICATION (COMMENT).

The modifications which are currently defined are the following:

MYRISTATE
   Myristate group attached through an amide bond to the N-terminal glycine residue of the mature form of a protein [1,2] or to an internal lysine residue.
PALMITATE
   Palmitate group attached through a thioether bond to a cysteine residue or through an ester bond to a serine or threonine residue [1,2].
FARNESYL
   Farnesyl group attached through a thioether bond to a cysteine residue [3,4].
GERANYL-GERANYL
   Geranyl-geranyl group attached through a thioether bond to a cysteine residue [3,4].
GPI-ANCHOR
   Glycosyl-phosphatidylinositol (GPI) group linked to the alpha-carboxyl group of the C-terminal residue of the mature form of a protein [5,6].
N-ACYL DIGLYCERIDE  N-terminal cysteine of the mature form of a
                    prokaryotic lipoprotein with an amide-linked fatty
                    acid and a glyceryl group to which two fatty acids
                    are linked by ester linkages [7].

[1] Grand R.J.A. Biochem. J. 258:626-638(1989).
[2] McLhinney R.A.J. Trends Biochem. Sci. 15:387-391(1990).
[3] Glomset J.A., Gelb M.H., Farnsworth C.C. Trends Biochem. Sci.
    15:139-142(1990).
[4] Sinensky M., Lutz R.J. BioEssays 14:25-31(1992).
[5] Low M.G. FASEB J. 3:1600-1608(1989).
[6] Low M.G. Biochim. Biophys. Acta 988:427-454(1989).
[7] Hayashi S., Wu H.C. J. Bioenerg. Biomembr. 22:451-471(1990).

Examples of LIPID key feature lines:

FT LIPID 1 1 MYRISTATE.
FT LIPID 65 65 PALMITATE (BY SIMILARITY).
FT LIPID 354 354 GPI-ANCHOR.

DISULFID - Disulfide bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by an intra-chain disulfide bond. If the `FROM' and `TO'
endpoints are identical, the disulfide bond is an interchain one and
the description field indicates the nature of the cross-link.

Examples of DISULFID key feature lines:

FT DISULFID 27 44 PROBABLE.
FT DISULFID 14 14 INTERCHAIN (WITH A LIGHT CHAIN).

THIOLEST - Thiolester bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by the thiolester bond.

THIOETH - Thioether bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by the thioether bond.

CARBOHYD - Glycosylation site.

The nature of the carbohydrate (if known) is given in the description
field.

Examples of CARBOHYD key feature lines:

FT CARBOHYD 103 103 GLUCOSYLGALACTOSE.
FT CARBOHYD 256 256 POTENTIAL.

METAL - Binding site for a metal ion.

The description field indicates the nature of the metal.

Examples of METAL key feature lines:

FT METAL 18 18 IRON (HEME AXIAL LIGAND).
FT METAL 87 87 COPPER (POTENTIAL).

BINDING - Binding site for any chemical group (co-enzyme, prosthetic
group, etc.).

The chemical nature of the group is given in the description field.
Examples of BINDING key feature lines:

FT BINDING 14 14 HEME (COVALENT).
FT BINDING 250 250 PYRIDOXAL PHOSPHATE.

(A.3). Regions

SIGNAL - Extent of a signal sequence (prepeptide).

TRANSIT - Extent of a transit peptide (mitochondrial, chloroplastic,
cyanelle or for a microbody).

Examples of TRANSIT key feature lines:

FT TRANSIT 1 42 CHLOROPLAST.
FT TRANSIT 1 34 CYANELLE (BY SIMILARITY).
FT TRANSIT 1 25 MITOCHONDRION.
FT TRANSIT 1 23 MICROBODY (POTENTIAL).

PROPEP - Extent of a propeptide.

Examples of PROPEP key feature lines:

FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT PROPEP 550 574 REMOVED IN MATURE FORM.

CHAIN - Extent of a polypeptide chain in the mature protein.

Examples of CHAIN key feature lines:

FT CHAIN 21 119 BETA-2 MICROGLOBULIN.
FT CHAIN 37 >42 FACTOR XIIIA.

PEPTIDE - Extent of a released active peptide.

Examples of PEPTIDE key feature lines:

FT PEPTIDE 13 107 NEUROPHYSIN 2.
FT PEPTIDE 235 239 MET-ENKEPHALIN.

DOMAIN - Extent of a domain of interest on the sequence.

The nature of that domain is given in the description field.

Examples of DOMAIN key feature lines:

FT DOMAIN 22 788 EXTRACELLULAR (POTENTIAL).
FT DOMAIN 140 152 ANCESTRAL CALCIUM SITE.

CA_BIND - Extent of a calcium-binding region.

DNA_BIND - Extent of a DNA-binding region.

NP_BIND - Extent of a nucleotide phosphate binding region.

The nature of the nucleotide phosphate is indicated in the description
field.

Examples of NP_BIND key feature lines:

FT NP_BIND 13 25 ATP.
FT NP_BIND 45 49 GTP (POTENTIAL).
FT NP_BIND 8 34 FAD (ADP PART).

TRANSMEM - Extent of a transmembrane region.

ZN_FING - Extent of a zinc finger region.

Examples of ZN_FING key feature lines:

FT ZN_FING 110 134 GATA-TYPE.
FT ZN_FING 559 579 C4-TYPE.

SIMILAR - Extent of a similarity with another protein sequence.

Precise information relative to that sequence is given in the
description field.

Examples of SIMILAR key feature lines:

FT SIMILAR 351 456 STRONG, WITH KAPPA CHAIN V REGIONS.
FT SIMILAR 580 1182 HIGH, WITH ERBB TRANSFORMING PROTEIN.

REPEAT - Extent of an internal sequence repetition.

Examples of REPEAT key feature lines:

FT REPEAT 75 300 APPROXIMATE.
FT REPEAT 390 600 APPROXIMATE.

(A.4). Secondary structure

The feature table of sequence entries of proteins whose tertiary
structure is known experimentally contains the secondary structure
information corresponding to that protein. The secondary structure
assignment is made according to DSSP (see Kabsch W., Sander C.;
Biopolymers, 22:2577-2637(1983)) and the information is extracted from
the coordinate data sets of the Protein Data Bank (PDB).

In the feature table only three types of secondary structure are
specified: helices (key HELIX), beta-strands (key STRAND) and turns
(key TURN). Residues not specified in one of these classes are in a
`loop' or `random-coil' structure. Because the DSSP assignment has
more than the three common secondary structure classes, we have
converted the following DSSP assignments to HELIX, STRAND, and TURN:

DSSP  DSSP definition                                 SWISS-PROT
code                                                  assignment

H     Alpha-helix                                     HELIX
G     3(10) helix                                     HELIX
I     Pi-helix                                        HELIX
E     Hydrogen bonded beta-strand (extended strand)   STRAND
B     Residue in an isolated beta-bridge              STRAND
T     H-bonded turn (3-turn, 4-turn or 5-turn)        TURN
S     Bend (five-residue bend centered at residue i)  Not specified

One should be aware of the following facts:

a) Segment Length. For helices (alpha and 3-10), the residues just
   before and just after the helix as given by DSSP each participate
   in the helical hydrogen bonding pattern with a single H-bond. For
   some practical purposes, one can therefore extend the HELIX range
   by one residue on each side. E.g. HELIX 25-35 instead of HELIX
   26-34. Also, the ends of secondary structure segments are less well
   defined for lower resolution structures. A fluctuation of +/- one
   residue is common.

b) Missing segments. In low resolution structures, badly formed
   helices or strands may be omitted in the DSSP definition.
c) Special helices and strands. Helices of length three are 3-10
   helices, those of length four and longer are either alpha-helices
   or 3-10 helices (pi helices are extremely rare). A strand of length
   one corresponds to a residue in an isolated beta-bridge. Such
   bridges can be structurally important.

d) Missing secondary structure. No secondary structure is currently
   given in the feature table in the following cases:

   - No sequence data in the PDB entry.
   - Structure for which only C-alpha coordinates are in PDB.
   - NMR structure with more than one coordinate data set.
   - Model (i.e. theoretical) structure.

Examples:

FT HELIX 3 14
FT TURN 15 15
FT TURN 20 21
FT STRAND 23 23
FT HELIX 25 35

(A.5). Others

ACT_SITE - Amino acid(s) involved in the activity of an enzyme.

Examples of ACT_SITE key feature lines:

FT ACT_SITE 193 193 ACCEPTS A PROTON DURING CATALYSIS.
FT ACT_SITE 99 99 CHARGE RELAY SYSTEM.

SITE - Any other interesting site on the sequence.

Examples of SITE key feature lines:

FT SITE 285 288 PREVENT SECRETION FROM ER.
FT SITE 241 242 CLEAVAGE (BY ANIMAL COLLAGENASES).

INIT_MET - The sequence is known to start with an initiator
methionine.

This feature key is mostly associated with a zero value in the `FROM'
and `TO' fields.

FT INIT_MET 0 0

NON_TER - The residue at an extremity of the sequence is not the
terminal residue.

If applied to position 1, this signifies that the first position is
not the N-terminus of the complete molecule. If applied to the last
position, it signifies that this position is not the C-terminus of the
complete molecule. There is no description field for this key.

Examples of NON_TER key feature lines:

FT NON_TER 1 1
FT NON_TER 150 150

NON_CONS - Non-consecutive residues.

Indicates that two residues in a sequence are not consecutive and that
there are a number of unsequenced residues between them.

Examples of NON_CONS key feature lines:

FT NON_CONS 1036 1037
FT NON_CONS 33 34 N-TERMINAL / C-TERMINAL.
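
As an illustration only, feature lines such as those shown in this
appendix can be split into their key, `FROM', `TO' and description
fields with a few lines of Python. The sketch below is not part of the
SWISS-PROT distribution: the names Feature, parse_ft_line and position
are hypothetical, and the fields are assumed to be separated by white
space, as in the examples above, rather than read from fixed columns.
Description text continued on a following FT line is not handled.

```python
import re
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    key: str          # feature key, e.g. "CHAIN" or "NON_TER"
    start: str        # `FROM' endpoint, kept as text (may carry a marker)
    end: str          # `TO' endpoint, kept as text (e.g. ">42")
    description: str  # empty string when the key has no description


# Assumption: fields are white-space separated, not column-positioned.
_FT_LINE = re.compile(r"^FT\s+(\S+)\s+(\S+)\s+(\S+)(?:\s+(.*\S))?\s*$")


def parse_ft_line(line: str) -> Optional[Feature]:
    """Return the fields of a single FT line, or None for other lines."""
    match = _FT_LINE.match(line)
    if match is None:
        return None
    key, start, end, description = match.groups()
    return Feature(key, start, end, description or "")


def position(endpoint: str) -> Optional[int]:
    """Numeric part of an endpoint, ignoring a leading '<' or '>'."""
    digits = endpoint.lstrip("<>")
    return int(digits) if digits.isdigit() else None


if __name__ == "__main__":
    for text in (
        "FT CHAIN 37 >42 FACTOR XIIIA.",
        "FT NON_TER 150 150",
        "FT CONFLICT 81 84 ASTQ -> GWT (IN REF. 3).",
    ):
        print(parse_ft_line(text))
```

The endpoints are kept as text because a value such as `>42' carries
information that a plain integer would lose; position() recovers the
number when only the residue index is needed.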
UNSURE - Uncertainties in the sequence.

Used to describe region(s) of a sequence for which the authors are
unsure about the sequence assignment.

-----------------------------------------------------------------------

APPENDIX B: AMINO ACID CODES

The one-letter and three-letter codes for amino acids used in
SWISS-PROT are those adopted by the Commission on Biochemical
Nomenclature of the IUPAC-IUB (see the reference listed below).

A  Ala  Alanine.
R  Arg  Arginine.
N  Asn  Asparagine.
D  Asp  Aspartic acid.
C  Cys  Cysteine.
Q  Gln  Glutamine.
E  Glu  Glutamic acid.
G  Gly  Glycine.
H  His  Histidine.
I  Ile  Isoleucine.
L  Leu  Leucine.
K  Lys  Lysine.
M  Met  Methionine.
F  Phe  Phenylalanine.
P  Pro  Proline.
S  Ser  Serine.
T  Thr  Threonine.
W  Trp  Tryptophan.
Y  Tyr  Tyrosine.
V  Val  Valine.
B  Asx  Aspartic acid or Asparagine.
Z  Glx  Glutamine or Glutamic acid.
X  Xaa  Any amino acid.

Reference

IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN).
Nomenclature and Symbolism for Amino Acids and Peptides.
Recommendations 1983.
Eur. J. Biochem. 138:9-37(1984).

-----------------------------------------------------------------------

APPENDIX C: FORMAT DIFFERENCES BETWEEN THE SWISS-PROT AND EMBL DATA
BANKS

(C.1). Generalities

The format of SWISS-PROT follows as closely as possible that of the
EMBL Database. The general structure of an entry is identical. The
data classes used in both data banks are the same except that
SWISS-PROT does not make use of the `UNREVIEWED' and `UNANNOTATED'
data classes. Two line types used in SWISS-PROT do not exist in the
EMBL Database (see section C.3); conversely SWISS-PROT does not
currently make use of every EMBL line type (see section C.4).

(C.2). Differences in line types present in both data banks

(C.2.1). The ID line (IDentification)

Differences with the EMBL Database ID line format are:

o The entry name can be up to 10 characters long (instead of 9 in the
  EMBL Database) and can begin with a numerical character.
o EMBL entry ID lines have an additional three-letter taxonomic
  division `token' inserted between the data class and the molecule
  type.
o The molecule type is listed as `PRT' rather than `DNA' or `RNA'.
o The length of the molecule is followed by `AA' (Amino Acid) instead
  of `BP' (Base Pairs).

(C.2.2). The AC line (ACcession number)

The format of this line type completely follows that defined by the
EMBL Database. SWISS-PROT accession numbers do not overlap with those
used in the EMBL/GenBank/DDBJ nucleotide sequence database.

(C.2.3). The DT line (DaTe)

o In the EMBL Database there are two DT lines per entry instead of
  three in SWISS-PROT.
o The format of the DT line that serves to indicate when an entry was
  created is identical to that defined in SWISS-PROT, but the two DT
  lines that convey information relevant to the updating of an entry
  are replaced by a single line in the EMBL Database. This is shown in
  the example below.

DT lines in a SWISS-PROT entry:

DT 21-JUL-1986 (REL. 01, CREATED)
DT 23-OCT-1986 (REL. 02, LAST SEQUENCE UPDATE)
DT 01-APR-1990 (REL. 14, LAST ANNOTATION UPDATE)

DT lines in an EMBL Database entry:

DT 10-MAR-1990 (REL. 22, CREATED)
DT 12-APR-1990 (REL. 23, LAST UPDATED, VERSION 3)

(C.2.4). The DE line (DEscription)

o In SWISS-PROT the species of origin is not included in the
  description.
o In the EMBL Database the last DE line is not terminated by a period.

(C.2.5). The OS line (Organism Species)

o In some cases the SWISS-PROT OS line includes more than one organism
  name (when the relevant sequence is completely conserved in
  different species).
o In the EMBL Database the last OS line is not terminated by a period.

(C.2.6). The OG line (Organelle)

o EMBL makes a distinction between `MITOCHONDRION' and `KINETOPLAST',
  while SWISS-PROT does not use the latter designation.
o In the EMBL Database the OG line is not terminated by a period.

(C.2.7).
The RP and RC lines

o In the EMBL Database, contrary to SWISS-PROT, the RC line precedes
  the RP line.
o In the EMBL Database the RC line is in free format and is generally
  not used.

(C.2.8). The FT line (Feature Table)

The format of this line is totally different from that currently
defined for the EMBL Database. The format used in SWISS-PROT is
similar to that which was used in older versions of the EMBL Database,
prior to the introduction of the common EMBL/GenBank/DDBJ feature
table.

(C.2.9). The CC line (Comment)

The comment lines, which are free text in the EMBL Database and can
appear anywhere in an entry, are grouped together in the SWISS-PROT
data bank, are always listed below the last reference line, and follow
a precise syntax (see section 3.15).

(C.2.10). The SQ line (SeQuence header)

Although the rough format and purpose of this line type are conserved,
its exact content differs from that of the EMBL Database. The
numerical length of the sequence is listed, followed by AA (Amino
Acid) instead of BP (Base Pairs). In place of the sequence
composition, which for protein sequences would not fit on a single
line, the molecular weight and the 32-bit CRC (Cyclic Redundancy
Check) value of the sequence are included.

(C.3). Line types defined by SWISS-PROT but currently not used by EMBL

Presently, there is only one line type which exists in SWISS-PROT and
which is not used in the EMBL Database; it is the GN line.

(C.4). Line types defined by EMBL but currently not used by SWISS-PROT

There are three line types which exist in the EMBL Database and which
are not, presently, used in SWISS-PROT. These are: FH, RT and XX.

- The FH and XX lines contain no data and are present in the EMBL
  Database only to improve readability of an entry when it is printed
  or displayed on a terminal screen. These lines are not included in
  SWISS-PROT so as to keep it as compact as possible and thereby
  facilitate its use on small computer systems.
- The RT line type (Reference Title) is presently not implemented.

--End of document--