DIVCLUS
A protein sequence domain clustering program
Sarah Teichmann and Jong Park
1. Domain clustering (also denoted as subclustering)
In the genome sequence analysis for the calculations of the levels of gene duplication,
multidomain sequences result in wrong clusters ( 217 out of 1622 clusters). To avoid this problem,
protein sequence domains should be used instead of the whole sequence for clustering related
sequences. A detailed description of a more correct clustering will be given in this chapter.
1.0. Introduction: clustering protein sequences
By using sequence comparison methods, it is possible to cluster homologous sequences.
However, as certain proteins are made up of more than one domain, when there are more than
two sequences, it is possible to cluster two unrelated sequences via one intermediate sequence
which share the domains with the other two as illustrated in figure 1.0.1. A, B, and C are
multidomain protein sequences having sequence domains of D1 to D5. Because D2 and D4 are
present in the sequence B, A and C are clustered together even though they do not have any
common region.
Here, 'domain' means something different to protein structural domains. It is more correctly
called 'sequence domain'. In this thesis, when the term 'seqlet' is used, it indicates a sequence
domain rather than a protein structural domain. A sequence domain may correspond to a
structural domain, but there is no study on how well sequence domains match with structural
domains.

Figure 1.0.1. 3 different sequences A, B, and C are clustered wrongly due to domains in
107
sequence B.
To avoid the wrong clustering, it is necessary to breakdown the protein sequences into
domains and cluster the domains to calculate the accurate level of gene duplication as shown in
figure 1.0.2. Originally, the subclustering of the initial wrong clustering was done manually by
first running Blastp program (Altschul, et.al., 1990) on the sets of sequences in each single cluster
to find the matched regions, and then the MSPcrunch program (Sonnhammer and Durbin, 1994)
to show the overlap regions. Only later was the SSEARCH program used. What MSPcrunch does
is to link the Blastp matched sequence fragments into bigger fragments when the gaps between
them are small (which is a parameter of the program). Careful examination by eye could show
wrongly clustered sequences. For small size genomes, such a manual sorting is possible, but
with larger genomes, it became too time-consuming. Also, for large clusters, it is very difficult to
sort correctly as the linkage can be very complex. Therefore, it was more appropriate to
implement the manual process as a computer program.

Figure 4.1.0.2. Clustering domains of multidomain proteins. A to G represent each single
domain in different sequences. B and C are the same domains of D2. E and F are the same
domains of D4. Therefore, there are 5 distinct sequence domains.
For the analysis of 6 complete bacterial genomes, we have developed (by Sarah Teichmann,
Alex Bateman and Jong Park) algorithms (in a program, divicl.pl) which replace the manual
sorting and breaking down of wrongly clustered sequence families by examining the alignments
made by either using SSEARCH or additionally using ClustalW (all after initial SSEARCH
comparison ).
There has been a domain finding program which was used to create the Prodom database
(Gouzy, et. al., 1996). The main difference between the algorithms is that divicl.pl uses
expectation values from SSEARCH compared to the other work (Sonnhammer and Kahn, 1994),
where the output of the Blastp search method is used. Blastp produces fragments of sequence
matches which do not include gaps. Therefore, what it produces is not necessarily sequence
domains, but sequence fragments which are biologically less meaningful than the larger fragments
produced by SSEARCH method. MSPcrunch, which connects such fragmented Blastp output,
can improve the situation. However, we know of no study on whether connected Blastp output is
equivalent to the gapped search of full Smith-Waterman SSEARCH. Also, the P value scoring
system used in Blastp is much less selective than the E value of SSEARCH (Pearson, 1995) which
will result in less robust clustering. Simple test carried out in our lab showed that SSEARCH
produced much longer alignments than Blastp with post processing of MSPCrunch (Park, et. al.,
data not shown). The details of the procedure for the domain clustering will be described in the
following section (1.1.).
1.1. Outline of the procedure.
The outline view of the whole domain clustering procedure is shown below (figure 1.1).

Figure 4.1.1. Schematic procedure of domain clustering.
1.2. Searching the database.
In this thesis, the database for clustering is comprised of 6 complete genomes.
SSEARCH version 3 (Pearson, 1996) is used to compare all the sequences to each other. The
expectation (E) value used to determine the homology was 0.001 to allow a low rate of false
positives. In theory, the error rate corresponding to 0.001 is close to 0.01%. This is because, in a
previous study it was shown that 1% error was made with a search with a database size of 971
sequences at the expectation value of 0.01 (Brenner, et. al., 1997). As the database size of 6
genome is 12013, E value of 0.1 is supposed to bring around 1% error (as E value is dependent
on the size of database) . However, there is no way to get the truly correct error
rate with genomic sequences which do not have all the corresponding structures, so a very strict
expectation value of 0.001 was used to minimise the error rate. Theoretically, this should give us
0.01% error rate in clustering sequences by any single linkage program.
SSEARCH has an option for machine readable output format specified by -m 10 and
throughout the study this option was used. All the pairs of possible combinations were
compressed and saved in a UNIX machine. This file format is called SSO (SSEARCH Search
Output) file here.
1.3. Single linkage clustering after SSEARCH.
An existing program (Brenner, et. al., 1995) was used to cluster sequences
whenever they are significantly matched by SSEARCH with the threshold expectation value of
0.001. The program retrieves all the sequence match pairs with less than 0.001 E value. If any
sequence pair is connected to any other pairs of sequences, regardless of the region overlapped,
they are regarded as belonging to the same cluster. This produces wrong clusters but includes all
the sequences linked. The resulting file (CLU file, for CLUstering) had the list of sequences
belonging to the cluster size 2 to the largest found (with 6 genomes, maximum 828 sequences
were clustered together). This wrong clustering file is used as an input for the program which
produces MSP files which will be described in the next section (1.4).
1.4. Generating MSP files (format change)
According to the clustering in CLU file, all the SSO files were processed to produce MSP
(a modified MSPcrunch file format; Sonnhammer and Durbin, 1994) files. If there were 2000
clusters in various sizes, there would be 2000 MSP files which contain the details of
SSEARCH score, E value and most importantly, the matched region. An MSP file format
example is shown in figure 1.4. MSP files were made not as part of the algorithm, but as an
output which can provide important information about clustering for other later use. For
example, a program has been developed to visualise the alignments of the sequences
in MSP files. This is very useful to check any single linkage or subclustering by eye. In fact this
visual inspection was carried out until it was possible to use automatic subclustering to sort
the wrong clusters into smaller and correct clusters.

Figure 1.4. Example of MSP file format. The first column of MSP file has
Smith-Waterman scores. The 2nd column has expectation values. The 3rd and 5th
columns show the sequence regions matched. The 4th and 6th columns show the
match and target sequences.
1.5. Dividing complex clusters.
A program (divicl.pl, by Perl programming language) was used to divide all the complex
clusters. It compares the matched regions of all the sequences in one cluster made by any single
linkage clustering program. If all the matched regions of the sequences have a commonly
overlapping region in the cluster, the cluster is correct by the single linkage clustering. If not, the
cluster is broken down to smaller clusters according into the domain sequences. Details are
shown below.
1.5.1. Merging similar sequences.
The first step in checking and dividing complex a cluster is to merge any sequences which
are overlapping to a large extent in the original single linkage cluster. The accepted overlap
between any sequence pair is a variable of divicl.pl and the default threshold is 70% of the
commonly overlapped region of two pairs of sequence which contain one common sequence,
with the minimum overlapping region being 30 amino acid residues. This means if any 3
sequences (in 2 pairs) are overlap by 70%, they are regarded as truly matched. This is
illustrated in figure 1.5.1.1. Sequences A, B and C have regions which are overlapping by over
70% (overlap1 over overlap2 or overlap4 in the figure is more than 70%), so A, B and C are
regarded as having thesame seqlet and belong to the same cluster.
In the comparison of 2, randomly chosen pairs are compared in this way first. Either
overlap2 or overlap4, or the shorter one (default) or the average of the two can be chosen as the
denominator in calculating the percentage of overlap as an option in the program. Subsequently,
according to the option either the smaller overlap region, or the larger region or the average of
the two overlap (default) becomes the new overlap of ABC from the pairs AB or BC. The smaller
common region subclustering will not produce wrong large subclusters, but it may fail to detect
the true connection between pairs of sequences. In other words, it can be too strict. The larger
overlap region option in this merging stage can cause wrong clusters, but it will have less
probability to fail to cluster true connections between pairs. Taking the average overlap size of
the compared pairs before comparing the next pair of sequences can produce wrong clusters,
but with a lower probability than taking the larger overlap size. The average size of overlap
option was chosen empirically after testing with some clusters. Only the chosen overlap region
(either as average, smaller, or larger ) of the randomly chosen pair is used for the next pair to be
compared.
A case where a merge is not done is shown in figure 1.5.1.2. In this case, the overlap region
percentage between overlap1 and overlap2 is much less than the default threshold of 70% of
the. As an option, thesize of overlap2 can be the average of the overlap regions of the pairs. This
overlap comparison process is iterative and the result of the overlap detection is put into an array
which has the sequence names with the overlapped regions. After iterations, another subroutine
compares the content of the array and examines the sequence names with overlap regions to
merge the elements of the array. This iteration continues until there is no change in the array.
More details on the actual program will not be described as it is too technical.
This iterative solution in the merging step is not perfect, so it is possible that it fails to break
down all the clusters correctly especially with a large cluster which contains more than a certain
number of sequences (Errors were detected later in clusters with the size of 266. The errors are
being corrected after manual inspection). This is due to the fact that by default the average size
of the overlap of the 2 pairs of matched sequences are used. Also, the threshold for the overlap
region (70%) is arbitrary which is chosen empirically rather than calibrating the error rate at
various different thresholds due to the limit of computing power we have. For sequences with
different lengths, ideally it is necessary to set different thresholds. However, no fine tuning has
been done yet. It has been found that the 3 large clusters with 828, 340 and 266 sequences
respectively were not broken down thoroughly in manual inspections (with ClustalW alignments).
Also the 30 amino acid residue minimum is an arbitrary threshold chosen empirically from the
original manual subclustering. Even so, it is comparable with the sequence length minimum set
for the protein structure comparison program DALI (Holm and Sander, 1993) which has 40.
Several other values were tested showing that the members of wrong clusters increased when
the threshold was smaller than 30. However, larger values were too strict, resulting in more
and small subclusters. In figure 1.5.1.1. a case of mering is shown. In figure 1.5.1.2., a case of
not merging 2 pairs of sequence alignment is shown. The overall procedure of iterative merging
and subclustering is shown in figure 1.5.1.3.

Figure 1.5.1.1. Merging similar sequence pairs according to the region overlapped.
A case in which sequences are merged is shown.

Figure 1.5.1.2. Merging similar sequence pairs according to the region overlapped.
A case in which sequences are not merged is shown.

Figure 1.5.1.3. Schematic view of merging iteratively and subclustering the merged
sequences.
All the programming was done with the Perl5 programming language. The main programs
are: sso_to_msp.pl which converts raw SSEARCH output to MSP file format. divicl.pl which
merges sequences and subclusters according to given expectation values, scores and
overlapping factors. To increase the accuracy of the subclustering, the ratio of the overlap can be
increased as well as the overlap Smith-Waterman scores and expectation values in divicl.pl.
All the subroutines used by the programs are kept in a Perl library file which will be
distributed as part of the Bioperl project (http://ind5.mrc-lmb.cam.ac.uk/bioperl.html) which
intends to provide various useful reusable subroutines and objects for biologists.
1.5.2. Speed of the subclustering.
The overall speed of the programs is dependent on how big the complex cluster is. With
12013 sequences and 1622 single link clusters, the time taken to subcluster was on the order
of days with a DEC alpha UNIX computer. Perl is an ideal language for prototyping for academic
applications which need a lot of testing procedures. However, it would be faster to run as a fully
compiled program after implementing the algorithms in a language like C.
Information on PRODOM/DOMAINER:
Prodom paper abstract (from medline)
1.6. References
Brenner, S. E., Chothia, C., and Hubbard, T.J.P., (1998) Assessing sequence comparison
methods with reliable structurally identified distant evolutionary relationships. PNAS, 95, 6073-6078..
Brenner, S. E., Hubbard, T., Murzin, A., and Chothia, C., (1995) Gene duplications in H.
Influenzae. Nature, 378, 140.
Gouzy, J., Corpet, F., and Kahn, D., (1996) Graphical interface for Prodom domain families.
Trends Biochem. Sci., 21, 493.
Holm, L., and Sander, C., (1993) Protein structure comparison by alignment of distance
matrices. J. Mol. Biol., 233, 123-138.
Pearson, W., (1995) Effective protein sequence comparison, Methods in Enzymology, 268,
227-258.
Pearson, W. R., (1995) Comparison of the methods for searching protein-sequence databases.
Protein Sci., 4, 1145-1160.
Riley, M., and Labedan, B., (1997) Protein Evolution viewed through Escherichia coli protein
sequences: Introducing the notion of a structural segment of homology, the module. J. Mol. Biol.,
268, 857-868.
Sonnhammer, E. L., and Kahn, D., (1994) Modular arrangement of proteins as inferred from
analysis of homology. Protein Sci., 3, 482-492.
Sonnhammer, E.L. and Durbin, R., (1994) A workbench for large-scale sequence homology
analysis. Comput. Appl. Biosci., 10, 301-307.
Sarah Teichmann and Jong Park