SWISS-PROT RELEASE 36.0 RELEASE NOTES

!! Important: do not forget to read section 11 of these release notes.
   It contains an important announcement relevant to SWISS-PROT and
   PROSITE !!


1. INTRODUCTION

Release 36.0 of SWISS-PROT contains 74'019 sequence entries, comprising
26'840'295 amino acids abstracted from 59'911 references. This
represents an increase of 7% over release 35. The growth of the data
bank is summarized below.

   Release   Date     Number of     Number of
                        entries   amino acids

    2.0      09/86         3939       900 163
    3.0      11/86         4160       969 641
    4.0      04/87         4387     1 036 010
    5.0      09/87         5205     1 327 683
    6.0      01/88         6102     1 653 982
    7.0      04/88         6821     1 885 771
    8.0      08/88         7724     2 224 465
    9.0      11/88         8702     2 498 140
   10.0      03/89        10008     2 952 613
   11.0      07/89        10856     3 265 966
   12.0      10/89        12305     3 797 482
   13.0      01/90        13837     4 347 336
   14.0      04/90        15409     4 914 264
   15.0      08/90        16941     5 486 399
   16.0      11/90        18364     5 986 949
   17.0      02/91        20024     6 524 504
   18.0      05/91        20772     6 792 034
   19.0      08/91        21795     7 173 785
   20.0      11/91        22654     7 500 130
   21.0      03/92        23742     7 866 596
   22.0      05/92        25044     8 375 696
   23.0      08/92        26706     9 011 391
   24.0      12/92        28154     9 545 427
   25.0      04/93        29955    10 214 020
   26.0      07/93        31808    10 875 091
   27.0      10/93        33329    11 484 420
   28.0      02/94        36000    12 496 420
   29.0      06/94        38303    13 464 008
   30.0      10/94        40292    14 147 368
   31.0      02/95        43470    15 335 248
   32.0      11/95        49340    17 385 503
   33.0      02/96        52205    18 531 384
   34.0      10/96        59021    21 210 389
   35.0      11/97        69113    25 083 768
   36.0      07/98        74019    26 840 295


2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 35

2.1 Sequences and annotations

4'976 sequences have been added since release 35, the sequence data of
712 existing entries has been updated and the annotations of 9'954
entries have been revised.

2.2 What's happening with the model organisms

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

 . Be as complete as possible. All sequences available at a given time
   should be immediately included in SWISS-PROT. This also includes
   sequence corrections and updates;
 . Provide a higher level of annotation;
 . Provide cross-references to specialized database(s) that contain,
   among other data, some genetic information about the genes that code
   for these proteins;
 . Provide specific indices or documents.

What was done since the last release or in preparation for the next
release concerning model organisms:

 - We have continued our effort in catching up with the backlog of
   sequences from other model organisms. In particular we added about
   350 entries from human and from E.coli, 300 from mouse, 250 from
   S.pombe, 200 from M.jannaschii, 150 from C.elegans, 100 from
   B.subtilis, H.pylori and from M.tuberculosis.
 - We plan to finish as quickly as possible the annotation of the
   Escherichia coli and Haemophilus influenzae sequence entries which
   are not yet part of SWISS-PROT.

Here is the current status of the model organisms in SWISS-PROT:

   Organism        Database          Index file      Number of
                   cross-referenced                  sequences
   --------------  ----------------  --------------  ---------
   A.thaliana      None yet          In preparation        719
   B.subtilis      SubtiList         SUBTILIS.TXT         1970
   C.albicans      None yet          CALBICAN.TXT          192
   C.elegans       Wormpep           CELEGANS.TXT         1887
   D.discoideum    DictyDB           DICTY.TXT             280
   D.melanogaster  FlyBase           FLY.TXT              1042
   E.coli          EcoGene           ECOLI.TXT            4416
   H.influenzae    HiDB (TIGR)       HAEINFLU.TXT         1693
   H.sapiens       MIM               MIMTOSP.TXT          4980
   H.pylori        HpDB (TIGR)       HPYLORI.TXT           334
   M.genitalium    MgDB (TIGR)       MGENITAL.TXT          470
   M.musculus      MGD               MGDTOSP.TXT          3253
   M.jannaschii    MjDB (TIGR)       MJANNASC.TXT         1283
   M.tuberculosis  None yet          None yet              873
   S.cerevisiae    SGD               YEAST.TXT            4787
   S.typhimurium   StyGene           SALTY.TXT             706
   S.pombe         None yet          POMBE.TXT            1315
   S.solfataricus  None yet          None yet               72

Collectively the entries from the above model organisms represent 40.9%
of all SWISS-PROT entries.

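The 40.9% figure can be checked directly from the table. The short
Python sketch below is only an illustration: the counts are copied from
the table above and the total from section 1, while the variable and
function names are our own and are not part of any SWISS-PROT tool.

```python
# Cross-referenced sequence counts per model organism, copied from the
# status table in section 2.2.
MODEL_ORGANISM_ENTRIES = {
    "A.thaliana": 719, "B.subtilis": 1970, "C.albicans": 192,
    "C.elegans": 1887, "D.discoideum": 280, "D.melanogaster": 1042,
    "E.coli": 4416, "H.influenzae": 1693, "H.sapiens": 4980,
    "H.pylori": 334, "M.genitalium": 470, "M.musculus": 3253,
    "M.jannaschii": 1283, "M.tuberculosis": 873, "S.cerevisiae": 4787,
    "S.typhimurium": 706, "S.pombe": 1315, "S.solfataricus": 72,
}

# Total number of entries in release 36.0 (section 1).
TOTAL_ENTRIES = 74019


def coverage_percent(counts, total):
    """Return the share of `total` covered by the summed counts, in percent."""
    return 100.0 * sum(counts.values()) / total


covered = sum(MODEL_ORGANISM_ENTRIES.values())
share = coverage_percent(MODEL_ORGANISM_ENTRIES, TOTAL_ENTRIES)
print(f"{covered} of {TOTAL_ENTRIES} entries ({share:.1f}%)")
```

The same calculation can be repeated for later releases by replacing
the counts and the total.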
2.3 Changes affecting the accession numbers

With the creation of the TrEMBL database (see section 6) and the rapid
increase in the amount of sequence data, we are faced with a problem of
availability of accession numbers. Currently we use a system based on a
one-letter prefix followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved the prefix
letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
out of space (due mainly to ESTs), have been forced to start using a new
format based on a two-letter prefix followed by 6 digits. We have used
up all possible numbers with 'P' and 'Q', and the only letter prefix
which was not used by the nucleotide databases is 'O'.

As we believe that changing the format of the accession numbers to that
now used by the nucleotide databases would create havoc in the numerous
software packages using SWISS-PROT, we have decided to keep a system of
accession numbers based on a six-character code, but with the following
changes:

1) We have started using 'O'. This extra letter should allow the
   continuation of the present format (1 prefix letter + 5 digits) for
   approximately one year.

2) When we have finished using up 'O', we will introduce a system based
   on the following format:

      1          2        3             4             5             6
      [O,P,Q]    [0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [A-Z, 0-9]    [0-9]

   What the above means is that we will keep a six-character code, but
   that in positions 3, 4 and 5 of this code any combination of letters
   and numbers can be present. This format allows a total of 14 million
   accession numbers (up from 300'000 with the current system). We only
   allow numbers in positions 2 and 6 so that the SWISS-PROT accession
   numbers cannot be mistaken for gene names, acronyms, other types of
   accession numbers or any type of word!

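For software developers, the extended format can be expressed as a
regular expression. The Python sketch below is only an illustration of
the rule described above, not an official validator; the names
EXTENDED_AC, is_valid_accession, capacity_current and
capacity_extended are our own. It also reproduces the two capacity
figures quoted above.

```python
import re

# Position 1: O, P or Q; position 2: digit; positions 3-5: letter or
# digit; position 6: digit. Current-format numbers (one letter followed
# by five digits) are a subset of this pattern.
EXTENDED_AC = re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]")


def is_valid_accession(ac: str) -> bool:
    """Return True if `ac` matches the six-character extended format."""
    return EXTENDED_AC.fullmatch(ac) is not None


def capacity_current() -> int:
    """Three prefix letters, each followed by five digits."""
    return 3 * 10 ** 5


def capacity_extended() -> int:
    """Three prefixes x digit x three alphanumeric positions x digit."""
    return 3 * 10 * 36 ** 3 * 10


print(capacity_current())            # 300000
print(capacity_extended())           # 13996800, i.e. about 14 million
print(is_valid_accession("P0A3S2"))  # True
print(is_valid_accession("PA1234"))  # False: position 2 must be a digit
```

A program that stores accession numbers in a six-character field needs
no change; only code that assumes five trailing digits has to be
revised.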
   Examples: P0A3S2, Q2ASD4, O13YX2, P9B123

2.4 Changes concerning the reference location line (RL)

The (IN) prefix used for books is now also used for references to the
electronic Plant Gene Register (see http://www.tarweed.com/pgr/).

Example:

RL   (IN) PLANT GENE REGISTER PGR98-023.

2.5 Cleaning up of the SIMILARITY comment line (CC) topic

We have started a major overhaul of the "SIMILARITY" topic. We would
like the majority of the information stored in this topic to be usable
by computer programs (while remaining human-readable). We are therefore
standardizing the format of this topic using two different subformats.

One describes the family to which a protein belongs:

CC   -!- SIMILARITY: BELONGS TO THE {Name1} FAMILY [OF {Name2}].
CC       [{Name3} SUBFAMILY.]

Examples:

CC   -!- SIMILARITY: BELONGS TO THE 14-3-3 FAMILY.

CC   -!- SIMILARITY: BELONGS TO THE 6-PHOSPHOGLUCONATE DEHYDROGENASE
CC       FAMILY.

CC   -!- SIMILARITY: BELONGS TO THE AAA FAMILY OF ATPASES.

CC   -!- SIMILARITY: BELONGS TO THE IRON/ASCORBATE-DEPENDENT FAMILY OF
CC       OXIDOREDUCTASES.

CC   -!- SIMILARITY: BELONGS TO THE ANTP FAMILY OF HOMEOBOX PROTEINS.
CC       "DEFORMED" SUBFAMILY.

CC   -!- SIMILARITY: BELONGS TO THE KINESIN-LIKE PROTEIN FAMILY. KINESIN
CC       SUBFAMILY.

The other describes which domains are found in a given protein:

CC   -!- SIMILARITY: CONTAINS n {Name} [DOMAIN|REPEAT][S].

Examples:

CC   -!- SIMILARITY: CONTAINS 1 FHA DOMAIN.

CC   -!- SIMILARITY: CONTAINS 45 EGF-LIKE DOMAINS.

CC   -!- SIMILARITY: CONTAINS 2 SH3 DOMAINS.

CC   -!- SIMILARITY: CONTAINS 2 SUSHI (SCR) REPEATS.

We have already updated many entries in this release and plan to
continue to do so for the next release.

2.6 Changes concerning cross-references (DR line)

We have added cross-references from SWISS-PROT to the Mendel database,
a plant gene nomenclature database from the Commission for Plant Gene
Nomenclature (CPGN).
These cross-references are present in the DR lines:

Data bank identifier: MENDEL
Primary identifier  : The Mendel accession number for a gene in a given
                      species.
Secondary identifier: Composed of the acronym of the species (generally
                      the same five-letter code as that defined and used
                      by SWISS-PROT in the entry name), the gene name
                      and a number.
Example:

DR   MENDEL; 294; Amahy;psbA;1.


3. PLANNED CHANGES

3.1 Extension of the accession number system

As already explained in detail under 2.3, we will extend the accession
number system once we have used up the 'O' series of accession numbers.
We expect this to happen in October 1998.

3.2 Switch to the NCBI taxonomy

To standardize the taxonomies used by different databases, we will
change our taxonomy with release 37. We will switch to the NCBI
taxonomy, which is already used as the common taxonomy by the
DDBJ/EMBL/GenBank nucleotide sequence databases.

3.3 Introduction of RT lines

With release 37 we will introduce a new line type, the RT (Reference
Title) line. This optional line will be placed between the RA and RL
lines. The RT line gives the title of the paper (or other work) as
exactly as possible given the limitations of the computer character set.
The form which will be used is that which would be used in a citation
rather than that displayed at the top of the published paper. For
instance, where journals capitalize major title words this is not
preserved. The title is enclosed in double quotes, and may be continued
over several lines as necessary. The title lines are terminated by a
semicolon. An example of the use of RT lines is shown below:

RT   "Sequence analysis of the genome of the unicellular cyanobacterium
RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT   region from map positions 64% to 92% of the genome.";


4. STATUS OF THE DOCUMENTATION FILES

SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. Since release 35, we have added three new
document files. The following table lists all the documents that are
currently available.

USERMAN.TXT     User manual
RELNOTES.TXT    Release notes
OLDRLNOT.TXT    Release notes for previous release [1,2]
SHORTDES.TXT    Short description of entries in SWISS-PROT
JOURLIST.TXT    List of abbreviations for journals cited [3]
KEYWLIST.TXT    List of keywords in use
SPECLIST.TXT    List of organism identification codes
TISSLIST.TXT    List of tissues [4]
EXPERTS.TXT     List of on-line experts for PROSITE and SWISS-PROT
SUBMIT.TXT      Submission of sequence data to SWISS-PROT

ACINDEX.TXT     Accession number index
AUTINDEX.TXT    Author index
CITINDEX.TXT    Citation index
KEYINDEX.TXT    Keyword index
SPEINDEX.TXT    Species index
DELETEAC.TXT    Deleted accession number index

7TMRLIST.TXT    List of 7-transmembrane G-linked receptor entries
AATRNASY.TXT    List of aminoacyl-tRNA synthetases
ALLERGEN.TXT    Nomenclature and index of allergen sequences
BLOODGRP.TXT    List of blood group antigen proteins
CALBICAN.TXT    Index of Candida albicans entries and their
                corresponding gene designations
CDLIST.TXT      CD nomenclature for surface proteins of human leucocytes
CELEGANS.TXT    Index of Caenorhabditis elegans entries and their
                corresponding gene Wormpep cross-references
DICTY.TXT       Index of Dictyostelium discoideum entries and their
                corresponding gene designations and DictyDB
                cross-references
EC2DTOSP.TXT    Index of Escherichia coli Gene-protein database entries
                referenced in SWISS-PROT
ECOLI.TXT       Index of Escherichia coli K12 chromosomal entries and
                their corresponding EcoGene cross-references
EMBLTOSP.TXT    Index of EMBL Database entries referenced in SWISS-PROT
EXTRADOM.TXT    Nomenclature of extracellular domains
FLY.TXT         Index of Drosophila entries and FlyBase cross-references
GLYCOSID.TXT    Classification of glycosyl hydrolase families and index
                of glycosyl hydrolase entries
HAEINFLU.TXT    Index of Haemophilus influenzae RD chromosomal entries
HOXLIST.TXT     Vertebrate homeotic Hox proteins: nomenclature and index
HPYLORI.TXT     Index of Helicobacter pylori strain 26695 chromosomal
                entries
HUMCHR17.TXT    Index of protein sequence entries encoded on human
                chromosome 17 [1]
HUMCHR18.TXT    Index of protein sequence entries encoded on human
                chromosome 18
HUMCHR19.TXT    Index of protein sequence entries encoded on human
                chromosome 19
HUMCHR20.TXT    Index of protein sequence entries encoded on human
                chromosome 20
HUMCHR21.TXT    Index of protein sequence entries encoded on human
                chromosome 21
HUMCHR22.TXT    Index of protein sequence entries encoded on human
                chromosome 22
HUMCHRX.TXT     Index of protein sequence entries encoded on human
                chromosome X
HUMCHRY.TXT     Index of protein sequence entries encoded on human
                chromosome Y
HUMPVAR.TXT     Index of human proteins with sequence variants [1]
INITFACT.TXT    List and index of translation initiation factors
MIMTOSP.TXT     Index of MIM entries referenced in SWISS-PROT
METALLO.TXT     Classification of metallothioneins and index of entries
                in SWISS-PROT
MGDTOSP.TXT     Index of MGD entries referenced in SWISS-PROT
MGENITAL.TXT    Index of Mycoplasma genitalium chromosomal entries
MJANNASC.TXT    Index of Methanococcus jannaschii entries
NGR234.TXT      Table of putative genes in Rhizobium plasmid pNGR234a
NOMLIST.TXT     List of nomenclature related references for proteins
PCC6803.TXT     Index of Synechocystis strain PCC 6803 entries
PDBTOSP.TXT     Index of X-ray crystallography Protein Data Bank (PDB)
                entries referenced in SWISS-PROT
PEPTIDAS.TXT    Classification of peptidase families and index of
                peptidase entries
PLASTID.TXT     List of chloroplast and cyanelle encoded proteins
POMBE.TXT       Index of Schizosaccharomyces pombe entries in SWISS-PROT
                and their corresponding gene designations
RESTRIC.TXT     List of restriction enzyme and methylase entries
RIBOSOMP.TXT    Index of ribosomal proteins classified by families on
                the basis of sequence similarities
SALTY.TXT       Index of Salmonella typhimurium LT2 chromosomal entries
                and their corresponding StyGene cross-references
SUBTILIS.TXT    Index of Bacillus subtilis 168 chromosomal entries and
                their corresponding SubtiList cross-references
UPFLIST.TXT     UPF (Uncharacterized Protein Families) list and index of
                members
YEAST.TXT       Index of Saccharomyces cerevisiae entries and their
                corresponding gene designations
YEAST1.TXT      Yeast Chromosome I entries
YEAST2.TXT      Yeast Chromosome II entries
YEAST3.TXT      Yeast Chromosome III entries
YEAST5.TXT      Yeast Chromosome V entries
YEAST6.TXT      Yeast Chromosome VI entries
YEAST7.TXT      Yeast Chromosome VII entries
YEAST8.TXT      Yeast Chromosome VIII entries
YEAST9.TXT      Yeast Chromosome IX entries
YEAST10.TXT     Yeast Chromosome X entries
YEAST11.TXT     Yeast Chromosome XI entries
YEAST13.TXT     Yeast Chromosome XIII entries
YEAST14.TXT     Yeast Chromosome XIV entries

Notes:

1  New in release 36.
2  We apologize for not having included the corresponding release notes
   with release 35. We are therefore including them with this release.
   As we believe that it may be useful to always distribute the release
   notes of the previous release, we will do so from now on; this file
   will be known as "OLDRLNOT.TXT".
3  Has been extensively updated and contains Web links to more than 640
   journals.
4  Has been extensively updated and now includes synonyms for many
   tissues.

We have continued to include in some SWISS-PROT document files the
references of Web sites relevant to the subject under consideration.
There are now 24 documents that include such links.


5. THE EXPASY WORLD-WIDE WEB SERVER

5.1 Background information

The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy.
The ExPASy server was made available to the public in September 1993
and is reachable at the following address:

   http://www.expasy.ch/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, SubtiList/NRSub, OMIM, PDB, HSSP, ProDom, REBASE, SGD,
YEPD and Medline. ExPASy also offers many tools for the analysis of
protein sequences and 2D gels.

5.2 SWISS-SHOP

We provide, on ExPASy, a service called SWISS-SHOP. SWISS-SHOP allows
any user of SWISS-PROT to indicate which proteins he or she is
interested in. This can be done using various criteria that can be
combined:

 - By entering one or more words that should be present in the
   description line;
 - By entering one or more species name(s) or taxonomic division(s);
 - By entering one or more keywords;
 - By entering one or more author names;
 - By entering the accession number (or entry name) of a PROSITE pattern
   or a user-defined sequence pattern;
 - By entering the accession number (or entry name) of an existing
   SWISS-PROT entry or by entering a "private" sequence.

Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent by electronic mail.

5.3 What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed
of the changes made to the server you can either:

 - Read the document "History of changes, improvements and new features"
   which is available at the address:

      http://www.expasy.ch/www/history.html

 - Subscribe to SWISS-Flash, a service that reports news of databases,
   software and services developments.
   By subscribing to this service, you will automatically get
   SWISS-Flash bulletins by electronic mail. To subscribe use the
   address:

      http://www.expasy.ch/www/swiss-flash.html


6. TREMBL - A SUPPLEMENT TO SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation
of new incoming data indefinitely. But as we also want to make the
sequences available as fast as possible, we have introduced a
computer-annotated supplement to SWISS-PROT. This supplement consists of
entries in SWISS-PROT-like format derived from the translation of all
coding sequences (CDS) in the EMBL nucleotide sequence database, except
those already included in SWISS-PROT.

We name this supplement TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT
release is supplemented by TrEMBL release 6.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

 - SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (150'329 in
   release 6) which should be incorporated into SWISS-PROT. SWISS-PROT
   accession numbers have been assigned for all SP-TrEMBL entries.

 - REM-TrEMBL (REMaining TrEMBL) contains the entries (27'428 in
   release 6) that we do not want to include in SWISS-PROT for a variety
   of reasons (synthetic sequences, pseudogenes, translations of
   incorrect open reading frames, fragments with fewer than eight amino
   acids, patent-derived sequences, immunoglobulins and T-cell
   receptors, etc.).

TrEMBL is available by FTP from the EBI server (ftp.ebi.ac.uk) in the
directory '/pub/databases/trembl'. It can be queried on WWW by the EBI
SRS server (http://www.ebi.ac.uk/).
It is also available on the SWISS-PROT CD-ROM and is searchable on the
FASTA, BIC and BLAST servers of the EBI.


7. WEEKLY UPDATES OF SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:

new_seq.dat     Contains all the new entries since the last full
                release;
upd_seq.dat     Contains the entries for which the sequence data has
                been updated since the last release;
upd_ann.dat     Contains the entries for which one or more annotation
                fields have been updated since the last release.

Currently these files are available on the following anonymous FTP
servers:

Organization    Swiss Institute of Bioinformatics (SIB)
Address         ftp.expasy.ch
Directory       /databases/swiss-prot/updates

Organization    European Bioinformatics Institute (EBI)
Address         ftp.ebi.ac.uk
Directory       /pub/databases/swissprot/new

!! Important notes !!

 - Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.
 - Due to the current mechanism used to build a release, the entries
   that are provided in these updates are not guaranteed to be error
   free.
 - Instead of using the above files, you can, every week, download an
   updated copy of the SWISS-PROT database. This file is available in
   the directory containing the non-redundant database (see next
   section).


8. NON-REDUNDANT DATABASE

A few months ago, we started to distribute, on the ExPASy and EBI FTP
servers, files that make up a non-redundant (see further) and complete
protein sequence database consisting of three components:

1) SWISS-PROT
2) TrEMBL
3) New entries to be later integrated into TrEMBL (hereafter known as
   TrEMBL_New)

Every week three files are completely rebuilt. These files are named:
sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z. As indicated by their
".Z" extension these are Unix "compress" format files which, when
decompressed, will produce ASCII files in SWISS-PROT format.
Three other files are also available (sprot.fas.Z, trembl.fas.Z and
trembl_new.fas.Z), which are compressed "fasta" format sequence files
useful for building the databases used by FASTA, BLAST and other
sequence similarity search programs. Please do not use these files for
other purposes, as you lose all annotations by using this very primitive
format.

The files for the non-redundant database are stored in the directory
"/databases/sp_tr_nrdb" on the ExPASy FTP server (ftp.expasy.ch) and in
the directory "/pub/databases/sp_tr_nrdb" on the EBI FTP server
(ftp.ebi.ac.uk).

Additional notes

 - The SWISS-PROT file continuously grows as new annotated sequences are
   added.

 - The TrEMBL file decreases in size as sequences are moved out of that
   section after being annotated and moved into SWISS-PROT. Four times a
   year a new release of TrEMBL is built at EBI; at this point the
   TrEMBL file increases in size as it then includes all of the new data
   (see next section) that has accumulated since the last release.

 - The TrEMBL_New file starts as a very small file and grows in size
   until a new release of TrEMBL is available.

 - SWISS-PROT and TrEMBL share the same system of accession numbers.
   Therefore you will not find any primary accession number duplicated
   between the two sections. A TrEMBL entry (and its associated
   accession number(s)) can either move to SWISS-PROT as a new entry or
   be merged with an existing SWISS-PROT entry. In the latter case, the
   accession number(s) of that TrEMBL entry are added to those of the
   SWISS-PROT entry.

 - TrEMBL_New does not have real accession numbers. However, it was
   necessary to have an "AC" line so as to be able to use it with
   different software products. This AC line contains a temporary
   identifier which consists of the pID (protein identifier) of the
   coding sequence in the parent nucleotide sequence.

 - While these three files allow you to build what we call a
   "non-redundant" database, it must be noted that this description is
   not entirely accurate.
   Without going into a long explanation, we can say that this is
   currently the best attempt at providing a complete selection of
   protein sequence entries while trying to eliminate redundancies.
   While SWISS-PROT is completely (well, 99.994%!) non-redundant, TrEMBL
   is far from being non-redundant, and the combination of SWISS-PROT +
   TrEMBL is even less so.

 - To describe to your users the version of the non-redundant database
   that you are providing to them, you should use a statement of the
   form:

      SWISS-PROT release 36 and updates until {current_date};
      TrEMBL release 6 minus data integrated into SWISS-PROT as of
      {current_date};
      New preliminary TrEMBL entries created since release 6 of TrEMBL


9. ENZYME and PROSITE

9.1 The ENZYME data bank

Release 23.0 of the ENZYME data bank is distributed with release 36 of
SWISS-PROT. ENZYME release 23.0 contains information relative to 3704
enzymes. It also differs from the previous release (release 22 of
November 1997) in that the "DE" (Description), "AN" (Alternative Names),
"CF" (Cofactor) and "CC" (Comments) lines are now in mixed-case
characters instead of being all in UPPER case. For example, what was
before:

ID   1.4.4.2
DE   GLYCINE DEHYDROGENASE (DECARBOXYLATING).
AN   GLYCINE DECARBOXYLASE.
AN   GLYCINE CLEAVAGE SYSTEM P-PROTEIN.
CA   GLYCINE + LIPOYLPROTEIN = S-AMINOMETHYLDIHYDROLIPOYLPROTEIN + CO(2).
CF   PYRIDOXAL-PHOSPHATE.
CC   -!- LIPOAMIDE CAN ALSO ACT AS ACCEPTOR.
CC   -!- A COMPONENT, WITH EC 2.1.2.10, OF THE GLYCINE CLEAVAGE SYSTEM,
CC       PREVIOUSLY KNOWN AS GLYCINE SYNTHASE.
DI   NONKETOTIC HYPERGLYCINEMIA TYPE II; MIM:238310.
DR   P54376, GCS1_BACSU;  P54377, GCS2_BACSU;  P49361, GCSA_FLAPR;
DR   P49362, GCSB_FLAPR;  P15505, GCSP_CHICK;  P33195, GCSP_ECOLI;
DR   O49850, GCSP_FLAAN;  O49852, GCSP_FLATR;  P23378, GCSP_HUMAN;
DR   Q50601, GCSP_MYCTU;  P26969, GCSP_PEA ;  Q09785, GCSP_SCHPO;
DR   O49954, GCSP_SOLTU;  P49095, GCSP_YEAST;
//

is now:

ID   1.4.4.2
DE   Glycine dehydrogenase (decarboxylating).
AN   Glycine decarboxylase.
AN   Glycine cleavage system P-protein.
CA   GLYCINE + LIPOYLPROTEIN = S-AMINOMETHYLDIHYDROLIPOYLPROTEIN + CO(2).
CF   Pyridoxal-phosphate.
CC   -!- Lipoamide can also act as acceptor.
CC   -!- A component, with EC 2.1.2.10, of the glycine cleavage system,
CC       previously known as glycine synthase.
DI   NONKETOTIC HYPERGLYCINEMIA TYPE II; MIM:238310.
DR   P54376, GCS1_BACSU;  P54377, GCS2_BACSU;  P49361, GCSA_FLAPR;
DR   P49362, GCSB_FLAPR;  P15505, GCSP_CHICK;  P33195, GCSP_ECOLI;
DR   O49850, GCSP_FLAAN;  O49852, GCSP_FLATR;  P23378, GCSP_HUMAN;
DR   Q50601, GCSP_MYCTU;  P26969, GCSP_PEA ;  Q09785, GCSP_SCHPO;
DR   O49954, GCSP_SOLTU;  P49095, GCSP_YEAST;
//

We plan to convert the "CA" (Catalytic Activity) lines to mixed-case for
the next release.

9.2 The PROSITE data bank

Release 15.0 of the PROSITE data bank is distributed with release 36 of
SWISS-PROT. This release of PROSITE contains 1014 documentation entries
that describe 1'352 different patterns, rules and profiles/matrices.


10. WE NEED YOUR HELP !

We welcome feedback from our users. We would especially appreciate being
notified if you find that sequences belonging to your field of expertise
are missing from the data bank. We would also like to be notified about
annotations to be updated if, for example, the function of a protein has
been clarified or if new post-translational information has become
available.

To facilitate such feedback, we offer on the ExPASy WWW server a form
that allows the submission of updates and/or corrections to SWISS-PROT:

   http://www.expasy.ch/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the
ExPASy server, to submit updates and/or corrections for that particular
entry. Finally, you can also send your comments by electronic mail to
the address:

   swiss-prot@expasy.ch


11. IMPORTANT ANNOUNCEMENT

It has become obvious in recent years that the tremendous increase in
data flow has created a requirement for resources which cannot be
addressed in full by public funding.
This is causing databases to fall behind the research. We believe that
the only solution to the resource shortfall is to ask commercial users
to participate by paying a license fee. No fee will be charged to
academic users, nor will any restriction be imposed on their use or
reuse of the data. Both SWISS-PROT and PROSITE are affected by these
changes; ENZYME is not.

A document fully describing the impact of this change on SWISS-PROT is
available with the SWISS-PROT distribution files on FTP (SP_98.TXT). You
can also access this document, as well as other relevant ones, from:

   http://www.expasy.ch/announce/
   http://www.ebi.ac.uk/news.html

If you do not have the time to read this document, the most important
take-home message is that these changes should not have any impact on
the way SWISS-PROT or PROSITE are accessed or redistributed.

Academic users will not be affected by these changes. Industrial
end-users will also not be directly affected as long as their employer
pays the license fee. The same holds true for bioinformatics companies.
Academic software or database developers, as well as providers of
database distribution services, will be only minimally affected by these
changes.

We hope to be able to keep the spirit of SWISS-PROT and PROSITE alive
and at the same time ensure their long-term financial survival. We
sincerely hope and believe that in the next two years the only change
that will matter will be the increase in scope and timeliness of the
databases.

Finally, it should be noted that release 36 of SWISS-PROT and release 15
of PROSITE are not affected by these changes. There are no restrictions
on their use and their distribution.

========================================================================

APPENDIX A: SOME STATISTICS

A.1 Amino acid composition

A.1.1 Composition in percent for the complete data bank

   Ala (A) 7.58   Gln (Q) 3.99   Leu (L) 9.42   Ser (S) 7.15
   Arg (R) 5.14   Glu (E) 6.35   Lys (K) 5.93   Thr (T) 5.69
   Asn (N) 4.47   Gly (G) 6.83   Met (M) 2.37   Trp (W) 1.24
   Asp (D) 5.28   His (H) 2.24   Phe (F) 4.08   Tyr (Y) 3.18
   Cys (C) 1.67   Ile (I) 5.80   Pro (P) 4.91   Val (V) 6.56

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01

A.1.2 Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp

A.2 Distribution of the sequences by their organism of origin

Total number of species represented in this release of SWISS-PROT: 6002

The first twenty species represent 35826 sequences: 48.4 % of the total
number of entries.

A.2.1 Table of the frequency of occurrence of species

   Species represented      1x: 2754
                            2x:  951
                            3x:  479
                            4x:  332
                            5x:  238
                            6x:  212
                            7x:  159
                            8x:   99
                            9x:  102
                           10x:   73
                       11- 20x:  277
                       21- 50x:  176
                       51-100x:   72
                         >100x:   78

A.2.2 Table of the most represented species

   Number  Frequency  Species

        1       4980  Human
        2       4787  Baker's yeast (Saccharomyces cerevisiae)
        3       4416  Escherichia coli
        4       3253  Mouse
        5       2491  Rat
        6       1970  Bacillus subtilis
        7       1887  Caenorhabditis elegans
        8       1693  Haemophilus influenzae
        9       1315  Fission yeast (Schizosaccharomyces pombe)
       10       1283  Methanococcus jannaschii
       11       1088  Bovine
       12       1042  Fruit fly (Drosophila melanogaster)
       13        873  Mycobacterium tuberculosis
       14        840  Chicken
       15        719  Arabidopsis thaliana (Mouse-ear cress)
       16        706  Salmonella typhimurium
       17        697  African clawed frog (Xenopus laevis)
       18        616  Synechocystis sp. (strain PCC 6803)
       19        607  Pig
       20        563  Rabbit
       21        489  Mycoplasma pneumoniae
       22        470  Mycoplasma genitalium
       23        406  Maize
       24        403  Rhizobium sp. (strain NGR234)
       25        345  Pseudomonas aeruginosa
       26        334  Helicobacter pylori
       27        304  Rice
       28        284  Dog
       29        280  Slime mold (Dictyostelium discoideum)
       30        278  Tobacco
       31        273  Bacteriophage T4
       32        253  Vaccinia virus (strain Copenhagen)
       33        250  Mycobacterium leprae
       34        244  Sheep
       35        240  Pea
       36        219  Porphyra purpurea
       37        215  Barley
       38        212  Staphylococcus aureus
       39        209  Neurospora crassa
       40        208  Soybean
       41        205  Wheat
       42        195  Tomato
       43        193  Rhodobacter capsulatus
                 193  Human cytomegalovirus (strain AD169)
       45        192  Candida albicans
                 192  Potato
       47        191  Klebsiella pneumoniae
       48        190  Methanobacterium thermoautotrophicum
       49        185  Bacillus stearothermophilus
       50        184  Vaccinia virus (strain WR)
       51        178  Pseudomonas putida
       52        164  Agrobacterium tumefaciens
       53        160  Spinach
                 160  Guinea pig
       55        158  Chlamydomonas reinhardtii
       56        157  Rhizobium meliloti
       57        154  Autographa californica nuclear polyhedrosis virus
       58        150  Marchantia polymorpha (Liverwort)
       59        146  Variola virus
                 146  Cyanophora paradoxa
       61        145  Aspergillus nidulans
       62        139  Odontella sinensis
       63        136  Streptomyces coelicolor
                 136  Golden hamster
                 136  Lactococcus lactis (subsp. lactis)
       66        134  Orgyia pseudotsugata multicapsid polyhedrosis virus
       67        130  Horse
       68        127  Kluyveromyces lactis
       69        125  Thermus aquaticus (subsp. thermophilus)
       70        124  Trypanosoma brucei brucei
       71        122  Synechococcus sp. (strain PCC 7942)
       72        114  Anabaena sp. (strain PCC 7120)
       73        113  Bradyrhizobium japonicum
       74        111  Alcaligenes eutrophus
       75        110  Bombyx mori (Silk moth)
       76        107  Archaeoglobus fulgidus
       77        105  Yersinia enterocolitica
       78        101  Brassica napus (Rape)

A.3 Distribution of the sequences by size

        From   To  Number             From   To  Number
           1-  50    3048             1001-1100     667
          51- 100    6272             1101-1200     511
         101- 150    9004             1201-1300     348
         151- 200    7032             1301-1400     233
         201- 250    6626             1401-1500     193
         251- 300    6172             1501-1600     119
         301- 350    5852             1601-1700     112
         351- 400    5882             1701-1800      86
         401- 450    4500             1801-1900      91
         451- 500    4176             1901-2000      58
         501- 550    3138             2001-2100      33
         551- 600    2191             2101-2200      68
         601- 650    1688             2201-2300      67
         651- 700    1221             2301-2400      35
         701- 750    1095             2401-2500      41
         751- 800     891             >2500         207
         801- 850     685
         851- 900     736
         901- 950     509
         951-1000     432

A.4 Longest sequences

The longest sequences (>=4000 residues) are listed here, in decreasing
order of length (read across each row):

   HTS1_COCCA  5217     MUC2_HUMAN  5179     FAT_DROME   5147
   RYNR_RABIT  5037     RYNR_PIG    5035     RYNR_HUMAN  5032
   RYNC_RABIT  4969     LRP_CAEEL   4753     DYHC_DICDI  4725
   PLEC_RAT    4687     LRP2_RAT    4660     DYHC_RAT    4644
   DYHC_DROME  4639     DYHC_CAEEL  4568     DYHB_CHLRE  4568
   APB_HUMAN   4563     APOA_HUMAN  4548     LRP1_HUMAN  4544
   LRP1_CHICK  4543     DYHC_PARTE  4540     RRPA_CVMJH  4488
   DYHG_CHLRE  4485     DYHC_ANTCR  4466     DYHC_TRIGR  4466
   GRSB_BACBR  4451     PKSK_BACSU  4447     PKSL_BACSU  4427
   PGBM_HUMAN  4393     YP73_CAEEL  4385     DYHC_NEUCR  4367
   DYHC_NECHA  4349     DYHC_EMENI  4344     PKD1_HUMAN  4303
   DYHC_SCHPO  4196     DYHC_YEAST  4092     RRPA_CVH22  4085

A.5 Statistics for journal citations

Total number of journals cited in this release of SWISS-PROT: 913

A.5.1 Table of the frequency of journal citations

   Journals cited           1x:  339
                            2x:  124
                            3x:   70
                            4x:   39
                            5x:   37
                            6x:   23
                            7x:   17
                            8x:   15
                            9x:   14
                           10x:   10
                       11- 20x:   63
                       21- 50x:   65
                       51-100x:   24
                         >100x:   73

A.5.2 List of the most cited journals in SWISS-PROT

   Citations   Journal abbreviation
   ---------   ----------------------------------
        6303   J. BIOL. CHEM.
        3814   PROC. NATL. ACAD. SCI. U.S.A.
        3384   NUCLEIC ACIDS RES.
        2714   J. BACTERIOL.
        2498   GENE
        2058   FEBS LETT.
        1932   EUR. J. BIOCHEM.
        1780   BIOCHEM. BIOPHYS. RES. COMMUN.
1732 BIOCHEMISTRY 1713 EMBO J. 1617 NATURE 1438 BIOCHIM. BIOPHYS. ACTA 1339 J. MOL. BIOL. 1228 CELL 1184 MOL. CELL. BIOL. 953 MOL. GEN. GENET. 929 PLANT MOL. BIOL. 888 BIOCHEM. J. 873 GENOMICS 808 SCIENCE 768 MOL. MICROBIOL. 764 VIROLOGY 682 J. BIOCHEM. 515 J. VIROL. 464 YEAST 461 J. CELL BIOL. 445 J. GEN. VIROL. 417 PLANT PHYSIOL. 407 GENES DEV. 376 HUM. MOL. GENET. 346 J. IMMUNOL. 342 HUM. MUTAT. 323 ARCH. BIOCHEM. BIOPHYS. 319 CURR. GENET. 312 ONCOGENE 312 INFECT. IMMUN. 305 MOL. BIOCHEM. PARASITOL. 270 FEMS MICROBIOL. LETT. 264 BIOL. CHEM. HOPPE-SEYLER 261 STRUCTURE 254 AM. J. HUM. GENET. 247 NAT. GENET. 239 DEVELOPMENT 237 MOL. ENDOCRINOL. 234 J. CLIN. INVEST. 218 J. MOL. EVOL. 218 J. GEN. MICROBIOL. 213 HOPPE-SEYLER'S Z. PHYSIOL. CHEM. 204 MICROBIOLOGY 202 GENETICS 191 HUM. GENET. 188 NAT. STRUCT. BIOL. 186 DNA CELL BIOL. 182 J. EXP. MED. 181 BLOOD 175 DEV. BIOL. 174 APPL. ENVIRON. MICROBIOL. 172 NEURON 157 PROTEIN SCI. 153 DNA 145 IMMUNOGENETICS 137 ENDOCRINOLOGY 136 DNA SEQ. 125 PLANT CELL 115 HEMOGLOBIN 113 CANCER RES. 113 BIOCHIMIE 109 J. NEUROCHEM. 109 BIOORG. KHIM. 108 MOL. BIOL. EVOL. 107 AGRIC. BIOL. CHEM. 106 BRAIN RES. MOL. BRAIN RES. 105 PLANT J. ======================================================================== APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR DATABASES The current status of the relationships (cross-references) between SWISS-PROT and some biomolecular databases is shown in the following schematic: *********************** * EMBL Nucleotide * * Sequence Database * * [EBI] * *********************** ^ ^ ^ ^ ^ ^ ^ ^ ^ ****************** | | | I | | | | | ********************** * FlyBase * <-------+ | | I | | | | +-------> * MGD [Mouse] * ****************** | | | I | | | | | ********************** | | | I | | | | | ****************** | | | I | | | | | ********************** * SubtiList * <---------+ | I | | | +---------> * GCRDb [7TM recep.] 
* * [B.subtilis] * | | | I | | | | | ********************** ****************** | | | I | | | | | | | | I | | | | | ********************** ****************** | | | I | | +-----------> * EcoGene [E.coli] * * Mendel [Plant] * <-----+ | | | I | | | | | ********************** ****************** | | | | I | | | | | | | | | I | | | | | ********************** ****************** | | | | I +---------------> * SGD [Yeast] * * MaizeDb * <-----------+ I | | | | | ********************** * [Zea mays] * | | | | I | | | | | ****************** | | | | I | | | | | ********************** | | | | I | +-------------> * DictyDB [D.disco.] * ****************** | | | | I | | | | | ********************** * WormPep * | | | | I | | | | | * [C.elegans] * <---+ | | | | I | | | | | ********************** ****************** | | | | | I | | | | | +-----> * ENZYME [Nomencl.] * | | | | | I | | | | | | ********************** ****************** | v v v v v v v v v v v v * REBASE * ************************* ********************** * [Restriction * <-- * SWISS-PROT * ----> * OMIM [Human] * * enzymes] * * Protein Sequence * ********************** ****************** * Data Bank * ************************* ********************** ****************** ^ ^ ^ ^ ^ ^ ^ | ^ ^ ^ * ECO2DBASE [2D] * * StyGene * | | | | | | | | | | +--------> ********************** * [S.Typhimurium]* <----+ | | | | | | | | | ****************** | | | | | | | | | ********************** | | | | | | | | +----------> * Maize-2DPAGE [2D] * ****************** | | | | | | | | ********************** * Transfac * <------+ | | | | | | | ****************** | | | | | | | ********************** | | | | | | +------------> * SWISS-2DPAGE [2D] * ****************** | | | | | | ********************** * Harefield [2D] * <--------+ | | | | | ****************** | | | | | ********************** | | | | +--------------> * Aarhus/Ghent [2D] * ****************** | | | | ********************** * PROSITE * | | | | * [Patterns and * <----------+ | | 
+----------------> ********************** * profiles] * | | * YEPD [Yeast] [2D] * ****************** | +----------------+ ********************** | v | | *********************** +-> ********************** +--------> * PDB [3D structures] * <----- * HSSP [3D similar.] * *********************** ********************** =End=of=SWISS-PROT=release=36=notes=====================================