SWISS-PROT RELEASE 35.0 RELEASE NOTES


1. INTRODUCTION

Release 35.0 of SWISS-PROT contains 69'113 sequence entries, comprising
25'083'768 amino acids abstracted from 59'101 references. This represents
an increase of 18.3% in the number of amino acids over release 34. The
growth of the data bank is summarized below.

   Release   Date    Number of   Number of amino
                       entries             acids

       2.0   09/86        3939           900 163
       3.0   11/86        4160           969 641
       4.0   04/87        4387         1 036 010
       5.0   09/87        5205         1 327 683
       6.0   01/88        6102         1 653 982
       7.0   04/88        6821         1 885 771
       8.0   08/88        7724         2 224 465
       9.0   11/88        8702         2 498 140
      10.0   03/89       10008         2 952 613
      11.0   07/89       10856         3 265 966
      12.0   10/89       12305         3 797 482
      13.0   01/90       13837         4 347 336
      14.0   04/90       15409         4 914 264
      15.0   08/90       16941         5 486 399
      16.0   11/90       18364         5 986 949
      17.0   02/91       20024         6 524 504
      18.0   05/91       20772         6 792 034
      19.0   08/91       21795         7 173 785
      20.0   11/91       22654         7 500 130
      21.0   03/92       23742         7 866 596
      22.0   05/92       25044         8 375 696
      23.0   08/92       26706         9 011 391
      24.0   12/92       28154         9 545 427
      25.0   04/93       29955        10 214 020
      26.0   07/93       31808        10 875 091
      27.0   10/93       33329        11 484 420
      28.0   02/94       36000        12 496 420
      29.0   06/94       38303        13 464 008
      30.0   10/94       40292        14 147 368
      31.0   02/95       43470        15 335 248
      32.0   11/95       49340        17 385 503
      33.0   02/96       52205        18 531 384
      34.0   10/96       59021        21 210 389
      35.0   11/97       69113        25 083 768


2. DESCRIPTION OF THE CHANGES MADE TO SWISS-PROT SINCE RELEASE 34

2.1 Sequences and annotations

10'189 sequences have been added since release 34, the sequence data of
1654 existing entries has been updated and the annotations of 15'683
entries have been revised.

2.2 What's happening with the model organisms

We have selected a number of organisms that are the target of genome
sequencing and/or mapping projects and for which we intend to:

 . Be as complete as possible. All sequences available at a given time
   should be immediately included in SWISS-PROT. This also includes
   sequence corrections and updates;
 . Provide a higher level of annotation;
 . Provide cross-references to specialized database(s) that contain,
   among other data, some genetic information about the genes that code
   for these proteins;
 . Provide specific indices or documents.

What has been done since the last release, or is in preparation for the
next release, concerning model organisms:

 . We have added Methanococcus jannaschii, Helicobacter pylori and
   Synechocystis PCC 6803 to the list of model organisms. The genomes of
   these organisms have been completely sequenced and we plan to
   annotate them fully in SWISS-PROT. Specific documents have been added
   (see section 4) for each of these organisms.
 . We have also added mouse (Mus musculus) as a model organism. A
   significant effort has been made to add new mouse sequences (542 have
   been added since the last release); we have added links to MGD (the
   Mouse Genome Database; see section 2.5.2) and we have also created a
   specific document (MGDTOSP.TXT) that lists the cross-references
   between MGD and SWISS-PROT.
 . We have continued our effort to catch up with the backlog of
   sequences from other model organisms. In particular we added 410
   entries from yeast, 644 from human, 89 from S.pombe, 527 from
   C.elegans, 95 from A.thaliana and 92 from D.melanogaster.
 . We have added to SWISS-PROT all the sequences from yeast chromosome
   XIII. We plan to integrate data from the remaining chromosomes (IV,
   XII, XV and XVI) very soon so as to have a complete set of annotated
   yeast sequences.
 . We have finished the annotation of all Mycoplasma genitalium entries.
 . We plan to finish as quickly as possible the annotation of the
   Escherichia coli and Haemophilus influenzae sequence entries which
   are not yet part of SWISS-PROT.
Here is the current status of the model organisms in SWISS-PROT:

   Organism        Database          Index file      Number of
                   cross-referenced                  sequences
   --------------  ----------------  --------------  ---------
   A.thaliana      None yet          In preparation        658
   B.subtilis      SubtiList         SUBTILIS.TXT         1882
   C.albicans      None yet          CALBICAN.TXT          167
   C.elegans       Wormpep           CELEGANS.TXT         1735
   D.discoideum    DictyDB           DICTY.TXT             272
   D.melanogaster  FlyBase           FLY.TXT              1002
   E.coli          EcoGene           ECOLI.TXT            4098
   H.influenzae    HiDB (TIGR)       HAEINFLU.TXT         1687
   H.sapiens       MIM               MIMTOSP.TXT          4644
   H.pylori        HpDB (TIGR)       HPYLORI.TXT           257
   M.genitalium    MgDB (TIGR)       MGENITAL.TXT          470
   M.musculus      MGD               MGDTOSP.TXT          2971
   M.jannaschii    MjDB (TIGR)       MJANNASC.TXT         1064
   M.tuberculosis  None yet          None yet              796
   S.cerevisiae    SGD               YEAST.TXT            4750
   S.typhimurium   StyGene           SALTY.TXT             680
   S.pombe         None yet          POMBE.TXT            1045
   S.solfataricus  None yet          None yet               42

Collectively the entries from the above model organisms represent 35.4%
of all SWISS-PROT entries.

2.3 Changes affecting the accession numbers

With the creation of the TrEMBL database (see section 6) and the rapid
increase in the amount of sequence data, we are faced with a shortage of
available accession numbers. Currently we use a system based on a
one-letter prefix followed by 5 digits. This system was also used by the
nucleotide sequence databases, which had originally reserved the prefix
letters 'P' and 'Q' for SWISS-PROT. The nucleotide databases, having run
out of space (due mainly to ESTs), have been forced to start using a new
format based on a two-letter prefix followed by 6 digits. We have used up
all possible numbers with 'P' and 'Q', and the only letter prefix which
was not used by the nucleotide databases is 'O'.
As we believe that changing the format of the accession numbers to that
now used by the nucleotide databases would wreak havoc on the numerous
software packages using SWISS-PROT, we have decided to keep a system of
accession numbers based on a six-character code, but with the following
changes:

1) We have started using 'O'. This extra letter should allow the
   continuation of the present format (1 prefix letter + 5 digits) for
   approximately one year.

2) When we have used up 'O', we will introduce a system based on the
   following format:

      1         2         3            4            5          6
   [O,P,Q]   [0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [A-Z, 0-9]   [0-9]

   In other words, we will keep a six-character code, but any
   combination of letters and digits can be present in positions 3, 4
   and 5. This format allows a total of about 14 million accession
   numbers (3 x 10 x 36 x 36 x 36 x 10 = 13'996'800, up from 300'000
   with the current system). We only allow digits in positions 2 and 6
   so that SWISS-PROT accession numbers cannot be mistaken for gene
   names, acronyms, other types of accession numbers or any other kind
   of word!

   Examples: P0A3S2, Q2ASD4, O13YX2, P9B123

2.4 Introduction of a new CC line-type topic (DATABASE)

There is an increasing number of databases that cater for a specific
protein or for a very limited number of proteins. Most of these
databases are mutation databases, reporting defects linked to a genetic
disease. We want to add cross-references to these databases when they
are available electronically, either by WWW or by FTP. In this release
we have therefore added a new comment (CC) line-type topic, "DATABASE",
whose syntax is the following:

CC   -!- DATABASE: NAME=Text[; NOTE=Text][; WWW="Address"][; FTP="Address"].

Where 'NAME' is the name of the database; 'NOTE' (optional) is a free
text note; 'WWW' (optional) is the WWW address (URL) of the database;
'FTP' (optional) is the anonymous FTP address (including the directory
name) where the database file(s) are stored.
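The DATABASE syntax above can be read mechanically. The following is a minimal Python sketch, not an official SWISS-PROT parser: the function name parse_database_topic is hypothetical, and the sketch assumes that a topic may continue over several CC lines, that WWW and FTP values are always double-quoted, and that the topic ends with a full stop. The sample lines are the CD40Lbase example given in this section.

```python
import re

# Matches NAME=..., NOTE=..., WWW="...", FTP="..." inside one DATABASE topic.
# Quoted values (WWW, FTP) may contain ';', free-text values may not.
FIELD = re.compile(r'(NAME|NOTE|WWW|FTP)=(?:"([^"]*)"|([^;]*))')


def parse_database_topic(cc_lines):
    """Return the fields of one CC DATABASE topic as a dictionary."""
    parts = []
    for line in cc_lines:
        # Drop the 'CC' line code and, on the first line, the '-!-' marker.
        parts.append(re.sub(r"^CC\s+(-!-\s+)?", "", line.strip()))
    topic = " ".join(parts)
    if not topic.startswith("DATABASE:"):
        raise ValueError("not a DATABASE topic")
    body = topic[len("DATABASE:"):].strip()
    if body.endswith("."):
        body = body[:-1]  # remove the terminating full stop
    fields = {}
    for match in FIELD.finditer(body):
        key, quoted, plain = match.groups()
        fields[key] = quoted if quoted is not None else plain.strip()
    if "NAME" not in fields:
        raise ValueError("NAME is mandatory in a DATABASE topic")
    return fields


example = [
    'CC   -!- DATABASE: NAME=CD40Lbase;',
    'CC       NOTE=European CD40L defect database (mutation db);',
    'CC       WWW="http://www.expasy.ch/www/cd40lbase.html";',
    'CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".',
]
print(parse_database_topic(example))
```

Matching the quoted form first keeps a URL intact even if it contains a semicolon, which is why the two value forms are handled by separate alternatives rather than by a plain split on ';'.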
Examples of its usage:

CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="http://www.expasy.ch/www/cd40lbase.html";
CC       FTP="ftp://www.expasy.ch/databases/cd40lbase".
CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD80 entry;
CC       WWW="http://www.ncbi.nlm.nih.gov/prow/cd/cd80.htm".

Please note that this topic, along with some forms of the DR lines (see
next section), is the first occurrence in SWISS-PROT of lower case
characters (yes, we plan to go to mixed case soon!). It is also,
currently, the only part of SWISS-PROT where lines longer than 75
characters can be found, as we do not reformat long URL or FTP addresses.

2.5 Changes concerning cross-references (DR line)

2.5.1 TIGR

We have added cross-references from SWISS-PROT to the TIGR database, a
collection of genomic databases for microbes, plants and animals
maintained by The Institute for Genomic Research (TIGR) in Rockville,
Maryland, USA. These cross-references are present in the DR lines:

   Data bank identifier : TIGR
   Primary identifier   : The genome Open Reading Frame (ORF) code
   Secondary identifier : Not defined, a dash ("-") is stored in that
                          field
   Example              : DR   TIGR; HP1563; -.

2.5.2 MGD

We have added cross-references from SWISS-PROT to the Mouse Genome
Database (MGD), maintained by The Jackson Laboratory in Bar Harbor,
Maine, USA. These cross-references are present in the DR lines:

   Data bank identifier : MGD
   Primary identifier   : The accession number
   Secondary identifier : The gene designation
   Example              : DR   MGD; MGI:109323; HTR2B.

2.5.3 LISTA

We have removed the cross-references from SWISS-PROT to the LISTA
database, which is no longer maintained and which has been superseded by
the SGD database, to which SWISS-PROT is fully cross-referenced.

2.5.4 PROSITE

The format for cross-references to the PROSITE protein domain and family
database used to be:

DR   PROSITE; ACCESSION_NUMBER; ENTRY_NAME.

It has been changed to:

DR   PROSITE; ACCESSION_NUMBER; ENTRY_NAME; STATUS.
Where 'ACCESSION_NUMBER' stands for the accession number of the PROSITE
pattern or profile entry; 'ENTRY_NAME' is the name of the entry and
'STATUS' is one of the following:

   n
   FALSE_NEG
   PARTIAL
   UNKNOWN_n

Where "n" is the number of hits of the pattern or profile in that
particular protein sequence. The "FALSE_NEG" status indicates that while
the pattern or profile did not detect the protein sequence, it is a
member of that particular family or domain. The "PARTIAL" status
indicates that the pattern or profile did not detect the sequence
because that sequence is not complete and lacks the region on which the
pattern/profile is based. Finally, the "UNKNOWN" status indicates
uncertainty as to whether the sequence is a member of the family or
domain described by the pattern/profile.

Examples of PROSITE cross-references:

DR   PROSITE; PS00107; PROTEIN_KINASE_ATP; 1.
DR   PROSITE; PS00028; ZINC_FINGER_C2H2; 6.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR; FALSE_NEG.
DR   PROSITE; PS01128; SHIKIMATE_KINASE; PARTIAL.

2.5.5 REBASE

Two small changes have been made to the syntax of cross-references to
the REBASE database:

 - REBASE has recently changed its accession numbers to add an
   additional digit (an extra leading zero).
 - We are now using mixed case characters in the secondary identifier
   (the name of the restriction system) so as to represent the
   information exactly as it is stored in REBASE.

Example:

DR   REBASE; RB0005; ECORI.

has been changed to:

DR   REBASE; RB00005; EcoRI.


3. PLANNED CHANGES

3.1 Extension of the accession number system

As already explained in detail under 2.3, we will extend the accession
number system once we have used up the 'O' series of accession numbers.
This is anticipated for October 1998.

3.2 Switch to the NCBI taxonomy

To standardize the taxonomies used by different databases, we will
change our taxonomy with release 37.
We will switch to the NCBI taxonomy, which is already used as the common
taxonomy by the DDBJ/EMBL/GenBank nucleotide sequence databases.

3.3 Introduction of RT lines

With release 37 we will introduce a new line type, the RT (Reference
Title) line. This optional line will be placed between the RA and RL
lines. The RT line gives the title of the paper (or other work) as
exactly as possible given the limitations of the computer character set.
The form used will be that of a citation rather than that displayed at
the top of the published paper. For instance, where journals capitalize
major title words, this capitalization is not preserved. The title is
enclosed in double quotes, and may be continued over several lines as
necessary. The title lines are terminated by a semicolon. An example of
the use of RT lines is shown below:

RT   "Sequence analysis of the genome of the unicellular cyanobacterium
RT   Synechocystis sp. strain PCC6803. I. Sequence features in the 1 Mb
RT   region from map positions 64% to 92% of the genome.";


4. STATUS OF THE DOCUMENTATION FILES

SWISS-PROT is distributed with a large number of documentation files.
Some of these files have been available for a long time (the user
manual, release notes, the various indices for authors, citations,
keywords, etc.), but many have been created recently and we are
continuously adding new files. Since release 34, we have added 15 new
document files. The following table lists all the documents that are
currently available.
   USERMAN.TXT    User manual
   RELNOTES.TXT   Release notes
   SHORTDES.TXT   Short description of entries in SWISS-PROT
   JOURLIST.TXT   List of abbreviations for journals cited
   KEYWLIST.TXT   List of keywords in use
   SPECLIST.TXT   List of organism identification codes
   TISSLIST.TXT   List of tissues
   EXPERTS.TXT    List of on-line experts for PROSITE and SWISS-PROT
   SUBMIT.TXT     Submission of sequence data to SWISS-PROT

   ACINDEX.TXT    Accession number index
   AUTINDEX.TXT   Author index
   CITINDEX.TXT   Citation index
   KEYINDEX.TXT   Keyword index
   SPEINDEX.TXT   Species index
   DELETEAC.TXT   Deleted accession number index [1]

   7TMRLIST.TXT   List of 7-transmembrane G-linked receptors entries
   AATRNASY.TXT   List of aminoacyl-tRNA synthetases
   ALLERGEN.TXT   Nomenclature and index of allergen sequences
   BLOODGRP.TXT   List of blood group antigen proteins [1]
   CALBICAN.TXT   Index of Candida albicans entries and their
                  corresponding gene designations
   CDLIST.TXT     CD nomenclature for surface proteins of human
                  leucocytes
   CELEGANS.TXT   Index of Caenorhabditis elegans entries and their
                  corresponding gene Wormpep cross-references
   DICTY.TXT      Index of Dictyostelium discoideum entries and their
                  corresponding gene designations and DictyDb
                  cross-references
   EC2DTOSP.TXT   Index of Escherichia coli Gene-protein database
                  entries referenced in SWISS-PROT
   ECOLI.TXT      Index of Escherichia coli K12 chromosomal entries and
                  their corresponding EcoGene cross-references
   EMBLTOSP.TXT   Index of EMBL Database entries referenced in
                  SWISS-PROT
   EXTRADOM.TXT   Nomenclature of extracellular domains
   FLY.TXT        Index of Drosophila entries and FlyBase
                  cross-references [1]
   GLYCOSID.TXT   Classification of glycosyl hydrolase families and
                  index of glycosyl hydrolase entries
   HAEINFLU.TXT   Index of Haemophilus influenzae RD chromosomal entries
   HOXLIST.TXT    Vertebrate homeotic Hox proteins: nomenclature and
                  index
   HPYLORI.TXT    Index of Helicobacter pylori strain 26695 chromosomal
                  entries [1]
   HUMCHR18.TXT   Index of protein sequence entries encoded on human
                  chromosome 18 [1]
   HUMCHR19.TXT   Index of protein sequence entries encoded on human
                  chromosome 19 [1]
   HUMCHR20.TXT   Index of protein sequence entries encoded on human
                  chromosome 20
   HUMCHR21.TXT   Index of protein sequence entries encoded on human
                  chromosome 21
   HUMCHR22.TXT   Index of protein sequence entries encoded on human
                  chromosome 22
   HUMCHRX.TXT    Index of protein sequence entries encoded on human
                  chromosome X
   HUMCHRY.TXT    Index of protein sequence entries encoded on human
                  chromosome Y
   INITFACT.TXT   List and index of translation initiation factors [1]
   MIMTOSP.TXT    Index of MIM entries referenced in SWISS-PROT
   METALLO.TXT    Classification of metallothioneins and index of
                  entries in SWISS-PROT [1]
   MGDTOSP.TXT    Index of MGD entries referenced in SWISS-PROT [1]
   MGENITAL.TXT   Index of Mycoplasma genitalium chromosomal entries [1]
   MJANNASC.TXT   Index of Methanococcus jannaschii entries [1]
   NGR234.TXT     Table of putative genes in Rhizobium plasmid
                  pNGR234a [1]
   NOMLIST.TXT    List of nomenclature related references for proteins
   PCC6803.TXT    Index of Synechocystis strain PCC 6803 entries [1]
   PDBTOSP.TXT    Index of X-ray crystallography Protein Data Bank (PDB)
                  entries referenced in SWISS-PROT
   PEPTIDAS.TXT   Classification of peptidase families and index of
                  peptidase entries
   PLASTID.TXT    List of chloroplast and cyanelle encoded proteins
   POMBE.TXT      Index of Schizosaccharomyces pombe entries in
                  SWISS-PROT and their corresponding gene designations
   RESTRIC.TXT    List of restriction enzyme and methylase entries
   RIBOSOMP.TXT   Index of ribosomal proteins classified by families on
                  the basis of sequence similarities
   SALTY.TXT      Index of Salmonella typhimurium LT2 chromosomal
                  entries and their corresponding StyGene
                  cross-references
   SUBTILIS.TXT   Index of Bacillus subtilis 168 chromosomal entries and
                  their corresponding SubtiList cross-references
   UPFLIST.TXT    UPF (Uncharacterized Protein Families) list and index
                  of members [1]
   YEAST.TXT      Index of Saccharomyces cerevisiae entries and their
                  corresponding gene designations
   YEAST1.TXT     Yeast Chromosome I entries
   YEAST2.TXT     Yeast Chromosome II entries
   YEAST3.TXT     Yeast Chromosome III entries
   YEAST5.TXT     Yeast Chromosome V entries
   YEAST6.TXT     Yeast Chromosome VI entries
   YEAST7.TXT     Yeast Chromosome VII entries
   YEAST8.TXT     Yeast Chromosome VIII entries
   YEAST9.TXT     Yeast Chromosome IX entries
   YEAST10.TXT    Yeast Chromosome X entries
   YEAST11.TXT    Yeast Chromosome XI entries
   YEAST13.TXT    Yeast Chromosome XIII entries [1]
   YEAST14.TXT    Yeast Chromosome XIV entries

   Notes:

   [1] New in release 35.

We have continued to include in some SWISS-PROT document files the
references of World-Wide Web sites relevant to the subject under
consideration. There are now 12 documents that include such links.


5. THE EXPASY WORLD-WIDE WEB SERVER

5.1 Background information

The most efficient and user-friendly way to browse interactively in
SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE and other databases is to use
the World-Wide Web (WWW) molecular biology server ExPASy. The ExPASy
server was made available to the public in September 1993. It is
reachable at the following address:

   http://www.expasy.ch/

The ExPASy WWW server allows access, using the user-friendly hypertext
model, to the SWISS-PROT, PROSITE, ENZYME, SWISS-2DPAGE, SWISS-3DIMAGE
and CD40Lbase databases and, through any SWISS-PROT protein sequence
entry, to other databases such as EMBL, Eco2DBASE, EcoCyc, FlyBase,
GCRDb, MaizeDB, SubtiList/NRSub, OMIM, PDB, HSSP, ProDom, REBASE, SGD,
YEPD and Medline. ExPASy also offers many tools for the analysis of
protein sequences and 2D gels.

5.2 SWISS-SHOP

We provide, on ExPASy, a service called SWISS-SHOP. SWISS-SHOP allows
any user of SWISS-PROT to indicate which proteins he/she is interested
in.
This can be done using various criteria that can be combined:

 - By entering one or more words that should be present in the
   description line;
 - By entering one or more species name(s) or taxonomic division(s);
 - By entering one or more keywords;
 - By entering one or more author names;
 - By entering the accession number (or entry name) of a PROSITE pattern
   or a user-defined sequence pattern;
 - By entering the accession number (or entry name) of an existing
   SWISS-PROT entry or by entering a "private" sequence.

Every week, the new sequences entered in SWISS-PROT are automatically
compared with all the criteria that have been defined by the users. If a
sequence corresponds to the selection criteria defined by a user, that
sequence is sent to that user by electronic mail.

5.3 What is new on ExPASy

ExPASy is constantly modified and improved. If you wish to be informed
of the changes made to the server you can either:

 - Read the document "History of changes, improvements and new
   features", which is available at the address:

   http://www.expasy.ch/www/history.html

 - Subscribe to SWISS-Flash, a service that reports news of database,
   software and service developments. By subscribing to this service,
   you will automatically receive SWISS-Flash bulletins by electronic
   mail. To subscribe use the address:

   http://www.expasy.ch/www/swiss-flash.html


6. TREMBL - A SUPPLEMENT TO SWISS-PROT

The ongoing genome sequencing and mapping projects have dramatically
increased the number of protein sequences to be incorporated into
SWISS-PROT. Since we do not want to dilute the quality standards of
SWISS-PROT by incorporating sequences into the database without proper
sequence analysis and annotation, we cannot speed up the incorporation
of new incoming data indefinitely. But as we also want to make the
sequences available as fast as possible, we have introduced alongside
SWISS-PROT a computer-annotated supplement.
This supplement consists of entries in SWISS-PROT-like format derived
from the translation of all coding sequences (CDS) in the EMBL
nucleotide sequence database, except those already included in
SWISS-PROT.

We name this supplement TrEMBL (Translation from EMBL). It can be
considered as a preliminary section of SWISS-PROT. This SWISS-PROT
release is supplemented by TrEMBL release 5.

TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL:

 - SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries (140'555 in
   release 5) which should be incorporated into SWISS-PROT. SWISS-PROT
   accession numbers have been assigned to all SP-TrEMBL entries.
 - REM-TrEMBL (REMaining TrEMBL) contains the entries (25'806 in release
   5) that we do not want to include in SWISS-PROT for a variety of
   reasons (synthetic sequences, pseudogenes, translations of incorrect
   open reading frames, fragments with less than eight amino acids,
   patent-derived sequences, immunoglobulins and T-cell receptors,
   etc.).

TrEMBL is available by FTP from the EBI server (ftp.ebi.ac.uk) in the
directory '/pub/databases/trembl'. It can be queried on WWW by the EBI
SRS server (http://www.ebi.ac.uk/). It is also available on the
SWISS-PROT CD-ROM and is searchable on the FASTA, BIC_SW and BLAST
servers of the EBI.


7. WEEKLY UPDATES OF SWISS-PROT

Weekly updates of SWISS-PROT are available by anonymous FTP. Three files
are updated at each update:

   new_seq.dat   Contains all the new entries since the last full
                 release;
   upd_seq.dat   Contains the entries for which the sequence data has
                 been updated since the last release;
   upd_ann.dat   Contains the entries for which one or more annotation
                 fields have been updated since the last release.
Currently these files are available on the following anonymous FTP
servers:

   Organization  ExPASy (Geneva University Expert Protein Analysis
                 System)
   Address       expasy.hcuge.ch  (or 129.195.254.61)
   Directory     /databases/swiss-prot/updates

   Organization  European Bioinformatics Institute (EBI)
   Address       ftp.ebi.ac.uk  (or 193.62.196.6)
   Directory     /pub/databases/swissprot/new

!!! Important notes !!!

 - Although we try to follow a regular schedule, we do not promise to
   update these files every week. In some cases two weeks will elapse
   between two updates.
 - Due to the current mechanism used to build a release, the entries
   that are provided in these updates are not guaranteed to be error
   free.


8. ENZYME and PROSITE

8.1 The ENZYME data bank

Release 22.0 of the ENZYME data bank is distributed with release 35 of
SWISS-PROT. ENZYME release 22.0 contains information relative to 3651
enzymes.

8.2 The PROSITE data bank

Release 14.0 of the PROSITE data bank is distributed with release 35 of
SWISS-PROT. This release of PROSITE contains 997 documentation entries
that describe 1'335 different patterns, rules and profiles/matrices.

Release 14.0 is the first completely new release of PROSITE since
November 1995. Since that date we have added 114 entries and modified
566 entries. The long time that elapsed between this release of PROSITE
and the last one is partially due to a complete rewriting of the
software tools that maintain the database and allow it to be
bidirectionally linked to SWISS-PROT. Thanks to those changes, we will
now be able to produce a PROSITE release with each release of SWISS-PROT
and also to offer frequent updates of the database on the ExPASy server.


9. WE NEED YOUR HELP!

We welcome feedback from our users. We would especially appreciate being
notified if you find that sequences belonging to your field of expertise
are missing from the data bank.
We would also like to be notified about annotations to be updated, if,
for example, the function of a protein has been clarified or if new
post-translational information has become available.

To facilitate such feedback we offer on the ExPASy WWW server a form
that allows the submission of updates and/or corrections to SWISS-PROT:

   http://www.expasy.ch/sprot/sp_update_form.html

It is also possible, from any entry in SWISS-PROT displayed by the
ExPASy server, to submit updates and/or corrections for that particular
entry. Finally, you can also send your comments by electronic mail to
the address:

   swiss-prot@expasy.ch

========================================================================

APPENDIX A: SOME STATISTICS

A.1 Amino acid composition

A.1.1 Composition in percent for the complete data bank

   Ala (A) 7.57   Gln (Q) 4.00   Leu (L) 9.39   Ser (S) 7.15
   Arg (R) 5.15   Glu (E) 6.34   Lys (K) 5.95   Thr (T) 5.70
   Asn (N) 4.50   Gly (G) 6.83   Met (M) 2.35   Trp (W) 1.24
   Asp (D) 5.29   His (H) 2.23   Phe (F) 4.08   Tyr (Y) 3.18
   Cys (C) 1.68   Ile (I) 5.78   Pro (P) 4.91   Val (V) 6.55

   Asx (B) 0.001  Glx (Z) 0.001  Xaa (X) 0.01

A.1.2 Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp

A.2 Repartition of the sequences by their organism of origin

Total number of species represented in this release of SWISS-PROT: 5713

The first twenty species represent 34020 sequences: 49.2 % of the total
number of entries.
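The two derived statements above (the frequency ranking of A.1.2 and the 49.2 % share quoted in A.2) can be recomputed from figures given in these notes. The short Python sketch below is only an illustration: the variable names are ours, the composition values are copied from A.1.1, and the total of 69'113 entries comes from section 1.

```python
# Percent composition of the 20 standard amino acids, copied from A.1.1.
composition = {
    "Ala": 7.57, "Gln": 4.00, "Leu": 9.39, "Ser": 7.15,
    "Arg": 5.15, "Glu": 6.34, "Lys": 5.95, "Thr": 5.70,
    "Asn": 4.50, "Gly": 6.83, "Met": 2.35, "Trp": 1.24,
    "Asp": 5.29, "His": 2.23, "Phe": 4.08, "Tyr": 3.18,
    "Cys": 1.68, "Ile": 5.78, "Pro": 4.91, "Val": 6.55,
}

# Sort residue names by decreasing frequency to obtain the A.1.2 ranking.
ranking = sorted(composition, key=composition.get, reverse=True)
print(", ".join(ranking))

# Share of the twenty most represented species among all entries.
total_entries = 69113      # entries in release 35.0 (section 1)
top20_sequences = 34020    # sequences from the first twenty species (A.2)
share = 100 * top20_sequences / total_entries
print(f"{share:.1f} %")
```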
A.2.1 Table of the frequency of occurrence of species

   Species represented      1x: 2609
                            2x:  891
                            3x:  480
                            4x:  321
                            5x:  225
                            6x:  209
                            7x:  148
                            8x:   94
                            9x:  113
                           10x:   58
                       11- 20x:  261
                       21- 50x:  165
                       51-100x:   64
                         >100x:   75

A.2.2 Table of the most represented species

   Number  Frequency  Species
        1       4750  Baker's yeast (Saccharomyces cerevisiae)
        2       4644  Human
        3       4098  Escherichia coli
        4       2971  Mouse
        5       2398  Rat
        6       1882  Bacillus subtilis
        7       1735  Caenorhabditis elegans
        8       1687  Haemophilus influenzae
        9       1064  Methanococcus jannaschii
       10       1047  Bovine
       11       1045  Fission yeast (Schizosaccharomyces pombe)
       12       1002  Fruit fly (Drosophila melanogaster)
       13        799  Chicken
       14        786  Mycobacterium tuberculosis
       15        680  Salmonella typhimurium
       16        658  Arabidopsis thaliana (Mouse-ear cress)
       17        648  African clawed frog (Xenopus laevis)
       18        551  Pig
       19        541  Rabbit
       20        494  Synechocystis sp. (strain PCC 6803)
       21        489  Mycoplasma pneumoniae
       22        470  Mycoplasma genitalium
       23        403  Rhizobium sp. (strain NGR234)
       24        398  Maize
       25        340  Pseudomonas aeruginosa
       26        292  Rice
       27        273  Bacteriophage T4
       28        272  Slime mold (Dictyostelium discoideum)
       29        257  Helicobacter pylori
       30        256  Tobacco
       31        253  Vaccinia virus (strain Copenhagen)
       32        248  Dog
       33        231  Pea
       34        223  Sheep
       35        219  Porphyra purpurea
       36        209  Barley
       37        203  Neurospora crassa
       38        199  Wheat
                 199  Staphylococcus aureus
       40        196  Mycobacterium leprae
       41        193  Human cytomegalovirus (strain AD169)
       42        192  Soybean
       43        190  Klebsiella pneumoniae
       44        184  Vaccinia virus (strain WR)
       45        183  Rhodobacter capsulatus
                 183  Pseudomonas putida
       47        180  Bacillus stearothermophilus
       48        175  Potato
       49        174  Tomato
       50        167  Candida albicans
       51        162  Agrobacterium tumefaciens
       52        156  Spinach
       53        154  Rhizobium meliloti
                 154  Autographa californica nuclear polyhedrosis virus
       55        151  Chlamydomonas reinhardtii
       56        150  Marchantia polymorpha (Liverwort)
       57        149  Guinea pig
       58        146  Variola virus
       59        145  Cyanophora paradoxa
       60        139  Odontella sinensis
       61        138  Aspergillus nidulans
       62        134  Orgyia pseudotsugata multicapsid polyhedrosis virus
       63        132  Lactococcus lactis (subsp. lactis)
       64        131  Streptomyces coelicolor
       65        122  Thermus aquaticus (subsp. thermophilus)
       66        120  Horse
       67        116  Golden hamster
       68        113  Trypanosoma brucei brucei
                 113  Anabaena sp. (strain PCC 7120)
                 113  Synechococcus sp. (strain PCC 7942)
       71        108  Kluyveromyces lactis
       72        107  Bombyx mori (Silk moth)
       73        105  Bradyrhizobium japonicum
                 105  Alcaligenes eutrophus
       75        102  Yersinia enterocolitica

A.3 Repartition of the sequences by size

   From   To  Number      From   To  Number
      1-  50    2882      1001-1100     627
     51- 100    5886      1101-1200     484
    101- 150    8453      1201-1300     339
    151- 200    6661      1301-1400     226
    201- 250    6184      1401-1500     186
    251- 300    5742      1501-1600     115
    301- 350    5369      1601-1700     102
    351- 400    5392      1701-1800      79
    401- 450    4149      1801-1900      86
    451- 500    3905      1901-2000      52
    501- 550    2927      2001-2100      30
    551- 600    2053      2101-2200      67
    601- 650    1560      2201-2300      64
    651- 700    1159      2301-2400      32
    701- 750    1032      2401-2500      39
    751- 800     831          >2500     203
    801- 850     652
    851- 900     685
    901- 950     464
    951-1000     396

A.4 Longest sequences

The longest sequences (>=4000 residues) are listed here:

   HTS1_COCCA  5217
   MUC2_HUMAN  5179
   FAT_DROME   5147
   RYNR_RABIT  5037
   RYNR_PIG    5035
   RYNR_HUMAN  5032
   RYNC_RABIT  4969
   LRP_CAEEL   4753
   DYHC_DICDI  4725
   PLEC_RAT    4687
   LRP2_RAT    4660
   DYHC_RAT    4644
   DYHC_DROME  4639
   DYHC_CAEEL  4568
   DYHB_CHLRE  4568
   APB_HUMAN   4563
   APOA_HUMAN  4548
   LRP1_HUMAN  4544
   LRP1_CHICK  4543
   DYHC_PARTE  4540
   RRPA_CVMJH  4488
   DYHG_CHLRE  4485
   DYHC_ANTCR  4466
   DYHC_TRIGR  4466
   GRSB_BACBR  4451
   PKSK_BACSU  4447
   PKSL_BACSU  4427
   PGBM_HUMAN  4393
   YP73_CAEEL  4385
   DYHC_NEUCR  4367
   DYHC_NECHA  4349
   DYHC_EMENI  4344
   PKD1_HUMAN  4303
   DYHC_YEAST  4092
   RRPA_CVH22  4085

A.5 Statistics for journal citations

Total number of journals cited in this release of SWISS-PROT: 861

A.5.1 Table of the frequency of journal citations

   Journals cited      1x:  326
                       2x:  117
                       3x:   61
                       4x:   39
                       5x:   30
                       6x:   23
                       7x:   14
                       8x:   13
                       9x:   10
                      10x:   12
                  11- 20x:   66
                  21- 50x:   58
                  51-100x:   23
                    >100x:   69

A.5.2 List of the most cited journals in SWISS-PROT

   Citations  Journal abbreviation
   ---------  ----------------------------------
        6038  J. BIOL. CHEM.
        3672  PROC. NATL. ACAD. SCI. U.S.A.
        3356  NUCLEIC ACIDS RES.
        2604  J. BACTERIOL.
        2352  GENE
        1992  FEBS LETT.
        1853  EUR. J. BIOCHEM.
        1693  BIOCHEM. BIOPHYS. RES. COMMUN.
        1651  EMBO J.
        1596  BIOCHEMISTRY
        1540  NATURE
        1367  BIOCHIM. BIOPHYS. ACTA
        1244  J. MOL. BIOL.
        1177  CELL
        1137  MOL. CELL. BIOL.
         920  MOL. GEN. GENET.
         899  PLANT MOL. BIOL.
         850  BIOCHEM. J.
         764  SCIENCE
         750  VIROLOGY
         748  GENOMICS
         731  MOL. MICROBIOL.
         661  J. BIOCHEM.
         502  J. VIROL.
         444  J. CELL BIOL.
         439  YEAST
         435  J. GEN. VIROL.
         418  PLANT PHYSIOL.
         381  GENES DEV.
         333  HUM. MOL. GENET.
         323  J. IMMUNOL.
         313  CURR. GENET.
         305  ARCH. BIOCHEM. BIOPHYS.
         303  INFECT. IMMUN.
         287  ONCOGENE
         287  MOL. BIOCHEM. PARASITOL.
         262  BIOL. CHEM. HOPPE-SEYLER
         248  FEMS MICROBIOL. LETT.
         230  MOL. ENDOCRINOL.
         230  HUM. MUTAT.
         220  J. CLIN. INVEST.
         220  AM. J. HUM. GENET.
         219  NAT. GENET.
         219  DEVELOPMENT
         216  J. GEN. MICROBIOL.
         213  HOPPE-SEYLER'S Z. PHYSIOL. CHEM.
         194  J. MOL. EVOL.
         185  GENETICS
         180  STRUCTURE
         178  MICROBIOLOGY
         177  BLOOD
         172  HUM. GENET.
         169  DNA CELL BIOL.
         168  J. EXP. MED.
         163  APPL. ENVIRON. MICROBIOL.
         158  DEV. BIOL.
         156  NEURON
         152  DNA
         136  IMMUNOGENETICS
         124  ENDOCRINOLOGY
         123  DNA SEQ.
         122  PLANT CELL
         115  NAT. STRUCT. BIOL.
         109  HEMOGLOBIN
         108  PROTEIN SCI.
         108  BIOCHIMIE
         106  AGRIC. BIOL. CHEM.
         105  BIOORG. KHIM.
         101  CANCER RES.
===========================================================================

APPENDIX B: RELATIONSHIPS BETWEEN SWISS-PROT AND SOME BIOMOLECULAR
            DATABASES

The current status of the relationships (cross-references) between
SWISS-PROT and some biomolecular databases is shown in the following
schematic:

[Schematic: the SWISS-PROT Protein Sequence Data Bank, drawn at the
centre, is linked by arrows to the databases listed below. HSSP is in
turn linked to PDB.]

   EMBL Nucleotide Sequence Database [EBI]
   FlyBase
   MGD [Mouse]
   SubtiList [B.subtilis]
   GCRDb [7TM recep.]
   EcoGene [E.coli]
   Mendel [Plant]
   SGD [Yeast]
   MaizeDb [Zea mays]
   DictyDB [D.disco.]
   WormPep [C.elegans]
   ENZYME [Nomencl.]
   REBASE [Restriction enzymes]
   OMIM [Human]
   ECO2DBASE [2D]
   StyGene [S.Typhimurium]
   Maize-2DPAGE [2D]
   Transfac
   SWISS-2DPAGE [2D]
   Harefield [2D]
   Aarhus/Ghent [2D]
   PROSITE [Patterns and profiles]
   YEPD [Yeast] [2D]
   PDB [3D structures]
   HSSP [3D similar.]

===End=of=SWISS-PROT=release=35=notes=====================================