Supra-domains - evolutionary units larger than single protein domains

Christine Vogel1*, Carlo Berzuini2, Matthew Bashton1, Julian Gough3, Sarah A. Teichmann1

1MRC Laboratory of Molecular Biology, Cambridge, CB2 2QH, UK; 2MRC Biostatistics Unit, Cambridge, CB2 2QH, UK; 3Genome Exploration Research Group, RIKEN Genomic Sciences Centre, W121 1-7-22 Suehiro-cho, Tsurumi-ki, Yokohama 230-0045, Japan, and Department of Structural Biology, Fairchild bldg, D109, Stanford, CA 94305-5126, USA.

* To whom correspondence should be addressed: cvogel[at]mrc-lmb.cam.ac.uk

DATA PREPARATION

The structural assignments to all proteins were obtained from SUPERFAMILY (Gough et al.) for 131 genomes.

The domains are defined according to the Structural Classification of Proteins database (SCOP). Each domain superfamily is encoded by a 5-digit number as used in SCOP and listed here.

The structural assignments were post-processed to obtain strings of consecutive domains, which define the domain architecture of a protein.
The length of each domain was taken as the average length of all domains of that superfamily in SCOP (Lo Conte et al., 2002). To estimate the probability that an unknown domain lies between two known domains in a protein, one standard deviation of the domain length was added to each of the two known domains. A threshold of 30 amino acids, the length of the smallest known domains, was then used to decide whether a residual unknown domain is present. Such an unknown domain, or 'unassigned region', is then denoted in the domain architecture of the protein.
Further information can be found here.
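
The gap rule described above can be illustrated with a minimal Python sketch. This is one possible reading of the procedure, not the code used for the paper. It assumes that each domain is modelled as a segment of length (mean + one standard deviation) centred on the midpoint of its assigned region; the names Assignment, modelled_extent, domain_architecture, the marker "_gap_" and the length statistics are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass

MIN_DOMAIN_LENGTH = 30   # threshold in amino acids, as stated in the text
UNASSIGNED = "_gap_"     # hypothetical marker for an 'unassigned region'


@dataclass
class Assignment:
    superfamily: str  # superfamily identifier (placeholder labels below)
    start: int        # first residue of the assigned region (1-based)
    end: int          # last residue of the assigned region


def modelled_extent(assignment, length_stats):
    """Return (start, end) of the modelled domain.

    Assumption: the domain spans mean + 1 SD residues, centred on the
    midpoint of the assigned region. length_stats maps a superfamily
    identifier to (mean length, standard deviation).
    """
    mean, sd = length_stats[assignment.superfamily]
    half = (mean + sd) / 2.0
    centre = (assignment.start + assignment.end) / 2.0
    return centre - half, centre + half


def domain_architecture(assignments, length_stats):
    """Build the domain architecture as a list of superfamily identifiers,
    inserting UNASSIGNED wherever the residual gap between two neighbouring
    modelled domains is at least MIN_DOMAIN_LENGTH residues."""
    ordered = sorted(assignments, key=lambda a: a.start)
    architecture = []
    for i, current in enumerate(ordered):
        if i > 0:
            _, prev_end = modelled_extent(ordered[i - 1], length_stats)
            cur_start, _ = modelled_extent(current, length_stats)
            if cur_start - prev_end >= MIN_DOMAIN_LENGTH:
                architecture.append(UNASSIGNED)
        architecture.append(current.superfamily)
    return architecture


# Illustrative values only; these are not SCOP statistics.
example_stats = {"sf_A": (100.0, 10.0), "sf_B": (80.0, 20.0)}
far_apart = [Assignment("sf_A", 1, 100), Assignment("sf_B", 200, 279)]
adjacent = [Assignment("sf_A", 1, 100), Assignment("sf_B", 121, 200)]
print(domain_architecture(far_apart, example_stats))
print(domain_architecture(adjacent, example_stats))
```

In this sketch only regions between two assigned domains are examined, which matches the description above; how the termini of a protein were treated is not stated here and is therefore left out.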

Domain architectures as a string of consecutive domains for all sequences used.
Please also refer to the downloads page of the SUPERFAMILY database.

IDENTIFICATION OF OVER-REPRESENTED SUPRA-DOMAINS

A more detailed description of the statistical procedures than that given in the manuscript can be found here.

The program (S-PLUS) can be downloaded here.

Illustrations of the relationship between R² and the p-value can be found here.

FUNCTIONAL ANNOTATION

Based on COGs, we defined 41 categories of domain function which were then grouped into six larger functional classes, listed here. The six larger functional classes are defined as

The functional classes are not entirely exclusive: Regulation can be seen as a subset of Information, and Energy can be seen as a subset of Metabolism. The relationships between the classes become clearer when looking at the smaller functional categories. Note that some of these categories could be associated with more than one class.
1105 superfamilies were assigned to one of these 41 functional categories and larger functional classes; the assignment is available upon request from the corresponding author.
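
The two-level annotation described above (superfamily to functional category, category to functional class) can be pictured as two lookup tables. The following Python sketch is for illustration only: the superfamily labels and category names are placeholders, not entries from the actual assignment, and only the class names Metabolism and Information are taken from the text.

```python
# Placeholder mapping from functional category to functional class.
# The real annotation has 41 categories and six classes.
CATEGORY_TO_CLASS = {
    "category_a": "Metabolism",
    "category_b": "Information",
}

# Placeholder mapping from superfamily to functional category.
SUPERFAMILY_TO_CATEGORY = {
    "sf_X": "category_a",
    "sf_Y": "category_b",
}


def functional_class(superfamily):
    """Return the functional class of a superfamily, or None if the
    superfamily has no functional category assigned."""
    category = SUPERFAMILY_TO_CATEGORY.get(superfamily)
    if category is None:
        return None
    return CATEGORY_TO_CLASS[category]


print(functional_class("sf_X"))
print(functional_class("sf_unknown"))
```

Each category maps to exactly one class in this sketch, in line with the one-to-one assignment used here, even though some categories could in principle be associated with more than one class.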



Christine Vogel cvogel[at]mrc-lmb.cam.ac.uk
Last modified: Tue Nov 25 11:33:58 GMT 2003