SIFTS Methodology
We have adopted NCBI taxonomy identifiers as the standard way of representing taxonomic information for all PDB entries in the PDBe database. Ideally, every PDB entry should record the organism from which each component of the structure derives, but in the legacy archive this is not always the case: many entries have no such record, and the records that are present have historically been prone to typographical and spelling errors. For entries with no taxonomy information, manual searches of the PDB file or the accompanying literature were performed. For all entries, we have put in place mechanisms that automatically check the user-supplied taxonomy information against the NCBI database, using the standard NCBI taxonomy identifier assigned to each PDB entry. This allows us to correct spelling mistakes in legacy PDB files and to identify PDB entries whose taxonomy information is incorrect. Furthermore, by using a stable, curated taxonomy identifier throughout the database, we gain access to the wealth of annotation in the NCBI database or UniProt Taxonomy database, such as synonyms and hierarchical relationships between taxonomic nodes. Figure 1 shows the database schema for the taxonomy data mart in the PDBe database.
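As an illustration of the kind of automatic check described above, the following Python sketch resolves a user-supplied organism name to a taxonomy identifier and tolerates small spelling mistakes. It is a minimal sketch only: the in-memory table `NAME_TO_TAX_ID`, the function `resolve_tax_id` and the similarity cut-off are assumptions made for this example, and fuzzy matching with `difflib` stands in for whatever checks the PDBe pipeline actually runs against the NCBI database.

```python
import difflib

# Illustrative lookup table: lower-cased organism name or synonym -> NCBI tax_id.
# A real check would query the NCBI taxonomy database instead.
NAME_TO_TAX_ID = {
    "escherichia coli": 562,
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
}


def resolve_tax_id(raw_name, name_to_tax_id=NAME_TO_TAX_ID, cutoff=0.85):
    """Return (tax_id, matched_name), or (None, None) if nothing is close enough."""
    key = " ".join(raw_name.lower().split())  # normalise case and whitespace
    if key in name_to_tax_id:
        return name_to_tax_id[key], key
    # Fall back to approximate matching to catch typographical errors.
    close = difflib.get_close_matches(key, list(name_to_tax_id), n=1, cutoff=cutoff)
    if close:
        return name_to_tax_id[close[0]], close[0]
    return None, None
```

A misspelt record such as "Eschericia coli" resolves to the same identifier as the correctly spelt name, while a name with no close match returns `(None, None)` and would be left for manual inspection.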
The cleaned-up taxonomic information for every macromolecular structure is available in the XML files from the FTP archive.
We have used sequence identity and taxonomy as the characteristics on which to link protein sequence data (from UniProt) and protein structure data (from PDBe).
Since the sequence of a structure in PDBe may represent either the native protein sequence or that of an engineered mutant or other variant, the criterion for sequence identity during the automatic procedure was 95% or higher agreement between the sequence of a protein structure and the corresponding sequence in UniProt. If no match was found, this criterion was relaxed to 90% during manual annotation. For entries not represented in the UniProt archive, new UniProt entries were created from the information given in the PDB entry.
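The two identity thresholds can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the text does not say how identity is calculated, so here it is taken as the fraction of aligned, non-gap positions at which the residues agree, and the function names and return labels are hypothetical.

```python
AUTOMATIC_THRESHOLD = 0.95  # cut-off used by the automatic procedure
MANUAL_THRESHOLD = 0.90     # relaxed cut-off applied during manual annotation


def fraction_identical(aligned_pdb, aligned_uniprot):
    """Fraction of aligned, non-gap positions at which the residues agree.

    Both arguments are rows of a pairwise alignment, with '-' marking gaps.
    """
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_pdb, aligned_uniprot):
        if a == "-" or b == "-":
            continue  # gap columns are ignored (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def classify_match(aligned_pdb, aligned_uniprot):
    """Label a candidate cross-reference by the threshold it satisfies."""
    identity = fraction_identical(aligned_pdb, aligned_uniprot)
    if identity >= AUTOMATIC_THRESHOLD:
        return "automatic"
    if identity >= MANUAL_THRESHOLD:
        return "manual"
    return "no_match"
```

Under this definition, a candidate with identity between 90% and 95% is not accepted automatically but remains eligible for acceptance by an annotator.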
As protein structure is better conserved across evolutionary time than protein sequence, and the structural differences between proteins with high sequence identity are small, the rule for accepting a UniProt cross-reference on taxonomic grounds was relaxed: the taxonomy IDs of the two entries, PDBe and UniProt, must either be the same or share a common parent within one or two levels up the taxonomic tree. Using this rule, we have also cleaned up the UniProt cross-references for every entry in the PDB. Figure 2 shows the database schema for the UniProt cross-reference data mart in the PDBe database.
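One possible reading of the relaxed taxonomy rule is sketched below in Python. The interpretation is an assumption: two identifiers are treated as compatible if they are equal, or if one can reach a shared node by climbing at most `max_levels` steps from each. The `parent_of` dictionary and its identifiers are toy values invented for the example, not real NCBI data.

```python
def ancestors_within(tax_id, parent_of, max_levels):
    """Return tax_id followed by its ancestors, at most max_levels steps up."""
    lineage = [tax_id]
    current = tax_id
    for _ in range(max_levels):
        parent = parent_of.get(current)
        if parent is None or parent == current:
            break  # reached the root of the (toy) tree
        lineage.append(parent)
        current = parent
    return lineage


def taxonomy_compatible(pdb_tax_id, uniprot_tax_id, parent_of, max_levels=2):
    """True if the two IDs are equal or share a node within max_levels of each."""
    pdb_lineage = set(ancestors_within(pdb_tax_id, parent_of, max_levels))
    uniprot_lineage = ancestors_within(uniprot_tax_id, parent_of, max_levels)
    return any(node in pdb_lineage for node in uniprot_lineage)


# Toy tree (hypothetical IDs): 1 is the root, 10 and 20 are genera,
# 100 and 200 are species, 1000 and 1001 are strains of species 100.
TOY_PARENT_OF = {1000: 100, 1001: 100, 100: 10, 10: 1, 200: 20, 20: 1}
```

With this toy tree, the two strains 1000 and 1001 are compatible because they share the parent 100, whereas 1000 and 200 are not, since their only common ancestor is the root, three levels above 1000.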
The clean-up of the UniProt cross-references has allowed us to link the macromolecular structure information to other important data resources such as:
- GOA database which provides assignments of gene products to the Gene Ontology (GO) resource
- InterPro database which provides information on protein families, domains and functional sites
- Pfam database which is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
- IntEnz database which provides up-to-date information on Enzyme Nomenclature
- SCOP database of protein families based on protein structure
- CATH database which is a hierarchical classification of protein domain structures
- PubMed citation database
Residue level mapping
After the clean-up of the archive was completed, it became possible to map the sequences from PDB entries accurately onto the corresponding UniProt entries. The main difficulty in determining this mapping is that many structures in the PDB have regions of unobserved residues in the middle of continuous polypeptide chains. This discontinuity in the sequence of the structure arises because it is often impossible to reliably construct a model for poorly defined regions of structure, such as flexible loops. Such gaps in the sequence are not taken into account by traditional sequence alignment algorithms, leading to incorrect alignments for the regions surrounding the unobserved residues.
To circumvent this problem we modified the standard alignment protocol and developed software to use sequences of connected segments of a polypeptide chain from the PDB entry corresponding to the observed regions of a protein structure. The separate alignments for these segments were then merged together to assemble the complete alignment between the sequence of the observed residues from the PDB entry and the complete sequence of the protein used in the experiment. The latter sequence is shown in the SEQRES record in the PDB entry and does not have gaps reflecting unobserved residues. A similar procedure was carried out to obtain alignments between the sequences of observed residues and the corresponding UniProt entry. These two composite alignments were then merged to give the complete residue-level mapping between the sequence of the complete polypeptide from the experiment and its UniProt counterpart. This complex procedure also allows us to extract annotations from the PDB and UniProt entries to explain any differences detected between the two sequences, such as variants, isoforms, modified residues or engineered mutations. Unobserved residues and N- or C-terminal tags for the polypeptide chains in the PDB entry are also annotated. Regions from the UniProt entry that do not form part of the studied polypeptide and are not included in the PDB entry are clearly marked.
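The segment-wise procedure can be illustrated with a small Python sketch. It is not the SIFTS software: exact substring matching stands in for a real alignment algorithm, positions are 0-based, and the function names and example sequences are invented for the illustration. Each observed segment is placed separately on the SEQRES sequence and on the UniProt sequence, and the two placements are then merged into a SEQRES-to-UniProt mapping.

```python
def place_segments(segments, full_sequence):
    """Locate each observed segment in full_sequence, working left to right.

    Returns {(segment_index, offset): position_in_full_sequence}.
    Exact matching is a stand-in for a proper pairwise alignment.
    """
    placement = {}
    search_from = 0
    for seg_index, segment in enumerate(segments):
        start = full_sequence.find(segment, search_from)
        if start < 0:
            raise ValueError(f"segment {seg_index} not found")
        for offset in range(len(segment)):
            placement[(seg_index, offset)] = start + offset
        search_from = start + len(segment)  # keep segments in chain order
    return placement


def merge_mappings(observed_to_seqres, observed_to_uniprot, seqres_length):
    """Compose the two placements into SEQRES position -> UniProt position.

    SEQRES positions with no observed residue (tags, disordered loops)
    are left as None in this sketch.
    """
    seqres_to_uniprot = [None] * seqres_length
    for key, seqres_pos in observed_to_seqres.items():
        seqres_to_uniprot[seqres_pos] = observed_to_uniprot.get(key)
    return seqres_to_uniprot


# Invented example: a 3-residue tag, then a fragment with an unobserved loop.
uniprot = "MKTAYIAKQRQISFVK"
seqres = "GSHTAYIAKQRQISF"
observed = ["TAYI", "RQISF"]  # the loop "AKQ" is not seen in the structure

mapping = merge_mappings(
    place_segments(observed, seqres),
    place_segments(observed, uniprot),
    len(seqres),
)
```

In this example the concatenated observed sequence "TAYIRQISF" does not occur in either full sequence, which is why the segments are placed one at a time. The tag and loop positions come out as `None`; a fuller implementation would annotate them as tags or unobserved residues, as described above.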
The programme also copes with the more complex situation in chimeric structures, where sequences from two or more UniProt entries are involved. In this case the correct boundaries are manually confirmed and this information is stored in a temporary table in the database. The programme uses this information to identify the correct alignments for each segment of the polypeptide chain.
Data update mechanism
Both the PDBe and UniProt groups have developed relational databases to store their data. The databases are implemented in Oracle and are used as the primary archival system for the data. This has allowed us to use various mechanisms provided by Oracle to exchange information between the two databases without exporting the data into flat files. Figure 3 shows how the data is exchanged between the databases.
When new PDB entries are deposited, the source taxonomy is validated against the NCBI taxonomy database and the tax_id is determined. The DBREF data is extracted and sent to the UniProt group, who validate those proteins that have UniProt cross-references and determine the UniProt reference for those with only GenBank or EMBL cross-references. During validation, the UniProt group can directly access data in the PDBe production database via views, which greatly facilitates the process. If no existing UniProt entry matching the sequence can be found, a new TrEMBL entry is created for that sequence. The validated taxonomy and DBREF data are stored in the PDBe production database.
Using the validated DBREF, the residue-level mapping is carried out as described above and the validated taxonomy, DBREF and mappings are loaded into the PDBe production database, which contains the rest of the data for the PDB entries. A series of views in the PDBe production database are made available to the UniProt group, who can then automatically access the structural information they need for inclusion in UniProt entries.
Because both the PDBe and UniProt databases are held in Oracle and connected by database links, it is much simpler for each database to keep track of changes in the other and to update its own data accordingly. For example, if the CRC64 of a sequence in the UniProt database changes, it is simple to determine that the sequence needs to be re-mapped in PDBe. Similarly, it is possible to track when a UniProt accession code becomes secondary, or when the protein ID or taxonomy of a particular UniProt entry changes. This is critical at a time when UniProt is demerging many entries so that each accession code becomes associated with a single tax_id.
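The checksum comparison can be sketched as follows. In the system described here it is carried out inside Oracle across database links; the Python below only illustrates the logic, and the function name, accession labels and checksum strings are placeholders invented for the example.

```python
def sequences_to_remap(stored_checksums, current_checksums):
    """Return accessions whose checksum differs from the one recorded at mapping time.

    stored_checksums:  accession -> CRC64 recorded when the mapping was made
    current_checksums: accession -> CRC64 currently held for the sequence
    """
    return sorted(
        accession
        for accession, checksum in current_checksums.items()
        if accession in stored_checksums
        and stored_checksums[accession] != checksum
    )


# Placeholder values, not real accessions or checksums.
stored = {"ACC_A": "0123456789ABCDEF", "ACC_B": "FEDCBA9876543210"}
current = {
    "ACC_A": "0123456789ABCDEF",  # unchanged
    "ACC_B": "1111111111111111",  # sequence has changed
    "ACC_C": "2222222222222222",  # not yet mapped, so nothing to re-map
}
changed = sequences_to_remap(stored, current)
```

Only `ACC_B` is flagged here: its checksum has changed since the mapping was made, so its residue-level mapping would need to be recomputed.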