DIVCLUS

A protein sequence domain clustering program

Sarah Teichmann and Jong Park


Harvard Univ. Mirror

<Download divclus.pl here>


1. Domain clustering (also denoted as subclustering)

multidomain sequences result in wrong clusters ( 217 out of 1622 clusters). To avoid this problem,

protein sequence domains should be used instead of the whole sequence for clustering related

sequences. A detailed description of a more correct clustering will be given in this chapter.

1.0. Introduction: clustering protein sequences

However, as certain proteins are made up of more than one domain, when there are more than

two sequences, it is possible to cluster two unrelated sequences via one intermediate sequence

which share the domains with the other two as illustrated in figure 1.0.1. A, B, and C are

multidomain protein sequences having sequence domains of D1 to D5. Because D2 and D4 are

present in the sequence B, A and C are clustered together even though they do not have any

common region.

called 'sequence domain'. In this thesis, when the term 'seqlet' is used, it indicates a sequence

domain rather than a protein structural domain. A sequence domain may correspond to a

structural domain, but there is no study on how well sequence domains match with structural

domains.

Figure 1.0.1. 3 different sequences A, B, and C are clustered wrongly due to domains in

107

sequence B.

domains and cluster the domains to calculate the accurate level of gene duplication as shown in

figure 1.0.2. Originally, the subclustering of the initial wrong clustering was done manually by

first running Blastp program (Altschul, et.al., 1990) on the sets of sequences in each single cluster

to find the matched regions, and then the MSPcrunch program (Sonnhammer and Durbin, 1994)

to show the overlap regions. Only later was the SSEARCH program used. What MSPcrunch does

is to link the Blastp matched sequence fragments into bigger fragments when the gaps between

them are small (which is a parameter of the program). Careful examination by eye could show

wrongly clustered sequences. For small size genomes, such a manual sorting is possible, but

with larger genomes, it became too time-consuming. Also, for large clusters, it is very difficult to

sort correctly as the linkage can be very complex. Therefore, it was more appropriate to

implement the manual process as a computer program.

Figure 4.1.0.2. Clustering domains of multidomain proteins. A to G represent each single

domain in different sequences. B and C are the same domains of D2. E and F are the same

domains of D4. Therefore, there are 5 distinct sequence domains.

Alex Bateman and Jong Park) algorithms (in a program, divicl.pl) which replace the manual

sorting and breaking down of wrongly clustered sequence families by examining the alignments

made by either using SSEARCH or additionally using ClustalW (all after initial SSEARCH

comparison ).

(Gouzy, et. al., 1996). The main difference between the algorithms is that divicl.pl uses

expectation values from SSEARCH compared to the other work (Sonnhammer and Kahn, 1994),

where the output of the Blastp search method is used. Blastp produces fragments of sequence

matches which do not include gaps. Therefore, what it produces is not necessarily sequence

domains, but sequence fragments which are biologically less meaningful than the larger fragments

produced by SSEARCH method. MSPcrunch, which connects such fragmented Blastp output,

can improve the situation. However, we know of no study on whether connected Blastp output is

equivalent to the gapped search of full Smith-Waterman SSEARCH. Also, the P value scoring

system used in Blastp is much less selective than the E value of SSEARCH (Pearson, 1995) which

will result in less robust clustering. Simple test carried out in our lab showed that SSEARCH

produced much longer alignments than Blastp with post processing of MSPCrunch (Park, et. al.,

data not shown). The details of the procedure for the domain clustering will be described in the

following section (1.1.).

1.1. Outline of the procedure.

Figure 4.1.1. Schematic procedure of domain clustering.

1.2. Searching the database.

SSEARCH version 3 (Pearson, 1996) is used to compare all the sequences to each other. The

expectation (E) value used to determine the homology was 0.001 to allow a low rate of false

positives. In theory, the error rate corresponding to 0.001 is close to 0.01%. This is because, in a

previous study it was shown that 1% error was made with a search with a database size of 971

sequences at the expectation value of 0.01 (Brenner, et. al., 1997). As the database size of 6

genome is 12013, E value of 0.1 is supposed to bring around 1% error (as E value is dependent

on the size of database) . However, there is no way to get the truly correct error

rate with genomic sequences which do not have all the corresponding structures, so a very strict

expectation value of 0.001 was used to minimise the error rate. Theoretically, this should give us

0.01% error rate in clustering sequences by any single linkage program.

throughout the study this option was used. All the pairs of possible combinations were

compressed and saved in a UNIX machine. This file format is called SSO (SSEARCH Search

Output) file here.

1.3. Single linkage clustering after SSEARCH.

whenever they are significantly matched by SSEARCH with the threshold expectation value of

0.001. The program retrieves all the sequence match pairs with less than 0.001 E value. If any

sequence pair is connected to any other pairs of sequences, regardless of the region overlapped,

they are regarded as belonging to the same cluster. This produces wrong clusters but includes all

the sequences linked. The resulting file (CLU file, for CLUstering) had the list of sequences

belonging to the cluster size 2 to the largest found (with 6 genomes, maximum 828 sequences

were clustered together). This wrong clustering file is used as an input for the program which

produces MSP files which will be described in the next section (1.4).

1.4. Generating MSP files (format change)

(a modified MSPcrunch file format; Sonnhammer and Durbin, 1994) files. If there were 2000

clusters in various sizes, there would be 2000 MSP files which contain the details of

SSEARCH score, E value and most importantly, the matched region. An MSP file format

example is shown in figure 1.4. MSP files were made not as part of the algorithm, but as an

output which can provide important information about clustering for other later use. For

example, a program has been developed to visualise the alignments of the sequences

in MSP files. This is very useful to check any single linkage or subclustering by eye. In fact this

visual inspection was carried out until it was possible to use automatic subclustering to sort

the wrong clusters into smaller and correct clusters.

Figure 1.4. Example of MSP file format. The first column of MSP file has

Smith-Waterman scores. The 2nd column has expectation values. The 3rd and 5th

columns show the sequence regions matched. The 4th and 6th columns show the

match and target sequences.

1.5. Dividing complex clusters.

clusters. It compares the matched regions of all the sequences in one cluster made by any single

linkage clustering program. If all the matched regions of the sequences have a commonly

overlapping region in the cluster, the cluster is correct by the single linkage clustering. If not, the

cluster is broken down to smaller clusters according into the domain sequences. Details are

shown below.

1.5.1. Merging similar sequences.

are overlapping to a large extent in the original single linkage cluster. The accepted overlap

between any sequence pair is a variable of divicl.pl and the default threshold is 70% of the

commonly overlapped region of two pairs of sequence which contain one common sequence,

with the minimum overlapping region being 30 amino acid residues. This means if any 3

sequences (in 2 pairs) are overlap by 70%, they are regarded as truly matched. This is

illustrated in figure 1.5.1.1. Sequences A, B and C have regions which are overlapping by over

70% (overlap1 over overlap2 or overlap4 in the figure is more than 70%), so A, B and C are

regarded as having thesame seqlet and belong to the same cluster.

overlap2 or overlap4, or the shorter one (default) or the average of the two can be chosen as the

denominator in calculating the percentage of overlap as an option in the program. Subsequently,

according to the option either the smaller overlap region, or the larger region or the average of

the two overlap (default) becomes the new overlap of ABC from the pairs AB or BC. The smaller

common region subclustering will not produce wrong large subclusters, but it may fail to detect

the true connection between pairs of sequences. In other words, it can be too strict. The larger

overlap region option in this merging stage can cause wrong clusters, but it will have less

probability to fail to cluster true connections between pairs. Taking the average overlap size of

the compared pairs before comparing the next pair of sequences can produce wrong clusters,

but with a lower probability than taking the larger overlap size. The average size of overlap

option was chosen empirically after testing with some clusters. Only the chosen overlap region

(either as average, smaller, or larger ) of the randomly chosen pair is used for the next pair to be

compared.

percentage between overlap1 and overlap2 is much less than the default threshold of 70% of

the. As an option, thesize of overlap2 can be the average of the overlap regions of the pairs. This

overlap comparison process is iterative and the result of the overlap detection is put into an array

which has the sequence names with the overlapped regions. After iterations, another subroutine

compares the content of the array and examines the sequence names with overlap regions to

merge the elements of the array. This iteration continues until there is no change in the array.

More details on the actual program will not be described as it is too technical.

down all the clusters correctly especially with a large cluster which contains more than a certain

number of sequences (Errors were detected later in clusters with the size of 266. The errors are

being corrected after manual inspection). This is due to the fact that by default the average size

of the overlap of the 2 pairs of matched sequences are used. Also, the threshold for the overlap

region (70%) is arbitrary which is chosen empirically rather than calibrating the error rate at

various different thresholds due to the limit of computing power we have. For sequences with

different lengths, ideally it is necessary to set different thresholds. However, no fine tuning has

been done yet. It has been found that the 3 large clusters with 828, 340 and 266 sequences

respectively were not broken down thoroughly in manual inspections (with ClustalW alignments).

Also the 30 amino acid residue minimum is an arbitrary threshold chosen empirically from the

original manual subclustering. Even so, it is comparable with the sequence length minimum set

for the protein structure comparison program DALI (Holm and Sander, 1993) which has 40.

Several other values were tested showing that the members of wrong clusters increased when

the threshold was smaller than 30. However, larger values were too strict, resulting in more

and small subclusters. In figure 1.5.1.1. a case of mering is shown. In figure 1.5.1.2., a case of

not merging 2 pairs of sequence alignment is shown. The overall procedure of iterative merging

and subclustering is shown in figure 1.5.1.3.

Figure 1.5.1.1. Merging similar sequence pairs according to the region overlapped.

A case in which sequences are merged is shown.

Figure 1.5.1.2. Merging similar sequence pairs according to the region overlapped.

A case in which sequences are not merged is shown.

Figure 1.5.1.3. Schematic view of merging iteratively and subclustering the merged

sequences.

are: sso_to_msp.pl which converts raw SSEARCH output to MSP file format. divicl.pl which

merges sequences and subclusters according to given expectation values, scores and

overlapping factors. To increase the accuracy of the subclustering, the ratio of the overlap can be

increased as well as the overlap Smith-Waterman scores and expectation values in divicl.pl.

distributed as part of the Bioperl project (http://ind5.mrc-lmb.cam.ac.uk/bioperl.html) which

intends to provide various useful reusable subroutines and objects for biologists.

1.5.2. Speed of the subclustering.

12013 sequences and 1622 single link clusters, the time taken to subcluster was on the order

of days with a DEC alpha UNIX computer. Perl is an ideal language for prototyping for academic

applications which need a lot of testing procedures. However, it would be faster to run as a fully

compiled program after implementing the algorithms in a language like C.


Information on PRODOM/DOMAINER:

A link to PRODOM page

Prodom paper abstract (from medline)

Prodom paper full text


1.6. References

Brenner, S. E., Chothia, C., and Hubbard, T.J.P., (1998) Assessing sequence comparison

methods with reliable structurally identified distant evolutionary relationships. PNAS, 95, 6073-6078..

Brenner, S. E., Hubbard, T., Murzin, A., and Chothia, C., (1995) Gene duplications in H.

Influenzae. Nature, 378, 140.

Gouzy, J., Corpet, F., and Kahn, D., (1996) Graphical interface for Prodom domain families.

Trends Biochem. Sci., 21, 493.

Holm, L., and Sander, C., (1993) Protein structure comparison by alignment of distance

matrices. J. Mol. Biol., 233, 123-138.

Pearson, W., (1995) Effective protein sequence comparison, Methods in Enzymology, 268,

227-258.

Pearson, W. R., (1995) Comparison of the methods for searching protein-sequence databases.

Protein Sci., 4, 1145-1160.

Riley, M., and Labedan, B., (1997) Protein Evolution viewed through Escherichia coli protein

sequences: Introducing the notion of a structural segment of homology, the module. J. Mol. Biol.,

268, 857-868.

Sonnhammer, E. L., and Kahn, D., (1994) Modular arrangement of proteins as inferred from

analysis of homology. Protein Sci., 3, 482-492.

Sonnhammer, E.L. and Durbin, R., (1994) A workbench for large-scale sequence homology

analysis. Comput. Appl. Biosci., 10, 301-307.


Sarah Teichmann and Jong Park