Sequence Search Algorithm Assessment and Testing Toolkit (SAT)
Jong Park1, Liisa Holm1, and Cyrus Chothia2
1 EBI, Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
2 LMB, MRC Centre, Hills Road, Cambridge, CB2 2QH, UK
1 Tel: +44 (0)1223 494613
1 Fax: +44 (0)1223 494468
Abstract
Motivation: The SAT aims to be a complete package for
the comparison of different protein homology search algorithms. The structural
classification of proteins can provide us with a clear criterion for judgment
in homology detection. There have been several assessments based on
structurally classified sequences, but a good deal of similar work is now
being repeated with locally developed procedures and programs. The SAT
will provide developers with a complete package which will save time and
produce more comparable performance assessments for search algorithms.
The package is complete in the sense that it provides a large non-redundant
sequence resource database, a well-characterized query database of protein
domains, all the parsers and some previous results from PSI-BLAST and a
hidden Markov model algorithm.
Results: An analysis of two different data sets was carried out using the SAT package. It compared the performance of a full protein sequence database (RSDB100) with that of a non-redundant representative sequence database derived from it (RSDB50). The performance measurement indicated that the full database is sub-optimal for a homology search. This result justifies the use of the much smaller and faster RSDB50, rather than RSDB100, for the SAT.
Availability: A web site is up. The whole package is accessible via WWW and FTP at ftp://ftp.ebi.ac.uk/pub/contrib/holm/Jong/SAT and http://cyrah.ebi.ac.uk:1111/Proj/Bio/SAT . Some previous assessment results produced by the package are also included for reference.
Contact: jong@ebi.ac.uk
Introduction
The sequence search method is believed to be the most cost-effective
approach to working with biological sequences. Since the 1970s (Needleman and Wunsch, 1970),
there has been a consistent development in algorithms
(Smith and Waterman, 1981, Waterman, 1986, Pearson and Lipman, 1988, Altschul, et al., 1990)
to cope with the ever-increasing number of biological sequences,
especially from genome projects.
An objective measurement of the performance of search methods for
biological databases can be critically important for the analysis and development
of such algorithms. For proteins, there have been several resources for
such an assessment because of the availability of well-characterized structural
databases like SCOP (Murzin, 1995), CATH (Orengo, 1998) and FSSP (Holm and Sander, 1998).
Furthermore, large non-redundant sequence databases
(Bleasby, 1994, Holm and Sander, 1998, Kallberg, 1999) can provide us with a valuable
resource of aligned multiple sequences for better profiles (Gribskov, et al., 1987)
and hidden Markov models (HMM, Krogh, et al., 1994, Baldi, et al., 1994, Eddy, 1995)
for very sensitive searches like BLAST 2 (Altschul, et al., 1997).
A study has shown that using multiple sequences from such non-redundant databases
can increase detection two- to threefold (Park, et al., 1998), especially when
the searches are iterated (Tatusov, et al., 1994) over the large database.
Programs developed for the whole procedure of assessing search
algorithms and database quality (Pearson, 1996, Park, et al., 1997, Brenner, et al., 1998, Rost, 1999)
can be easily modularized and distributed to avoid repetition and to standardize
the assessment procedure. Here, we present all the necessary components
in a package. It is easy to run and maintain, as all the databases are
included and the programs are written in the Perl programming language (http://www.perl.org/),
which runs on almost all computer platforms. The databases for both
query and target sequences have been well maintained and are reliable.
As an example of the package's use, we assessed two databases of different sizes
with the BLAST 2.0.9 algorithm. One is the original, fully redundant sequence database
and the other is a reduced set of it with 50% or less mutual sequence identity.
Methods
The main idea of an assessment of search algorithms using structural
information is, first, to embed a set of structurally known protein sequences
in any large protein sequence database. Second, searches using the embedded
set as query will find matches which belong to the same structurally proven
kinship in the large database. Third, by comparing the numbers of true
and false matches, it is possible to assess the performance.
This is possible because structural similarity is better preserved than
sequence similarity in proteins. The most important components of the SAT
are a well characterised and accurately classified query database derived from
the PDB structure database, and a small (non-redundant) but extensive resource
sequence database which has an even distribution of diverse sequences.
The SAT production and assessment procedures consist of the following steps. The databases and programs associated are described along with the steps.
(1) Target resource database preparation.
(2) Query structurally known sequence database preparation.
(3) Running any search algorithm.
(4) Parsing the outputs into a standard matching sequence pair (MSP) format.
(4.1) Producing a ranked sequence pair file (PARF file).
(4.2) Final non-homologue versus homologue column file (NHCO file) generation.
(4.2.1) Direct pair match performance calculation result is generated.
(4.2.2) Additionally, an indirect linkage clustering performance calculation over all pair match combinations is generated.
(5) Analysis of the results.
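The indirect linkage calculation of step (4.2.2) can be illustrated with a minimal Python sketch. It is not the code of write_nhco_files.pl; it assumes one plausible reading of the step, namely that every accepted match links two sequences, that chains of links form single-linkage clusters, and that all pair combinations within a cluster are then counted as found. The domain names and SCOP codes are hypothetical.

```python
from itertools import combinations


def single_linkage_clusters(matched_pairs):
    """Group sequence names that are connected by any chain of direct matches."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving keeps trees shallow
            name = parent[name]
        return name

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for name in parent:
        clusters.setdefault(find(name), set()).add(name)
    return list(clusters.values())


def count_cluster_pairs(clusters, superfamily_of):
    """Count homologous and non-homologous pairs over all within-cluster combinations."""
    homologs = non_homologs = 0
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            if superfamily_of[a] == superfamily_of[b]:
                homologs += 1
            else:
                non_homologs += 1
    return homologs, non_homologs


# Hypothetical data: d1-d2 and d2-d3 match directly, d1-d3 is linked only indirectly.
superfamily_of = {"d1": "1.1.1", "d2": "1.1.1", "d3": "1.1.1",
                  "d4": "2.1.1", "d5": "2.1.1"}
direct_matches = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]

clusters = single_linkage_clusters(direct_matches)
homologs, non_homologs = count_cluster_pairs(clusters, superfamily_of)
print(len(direct_matches), "direct matches give", homologs, "homologous pairs after clustering")
```

In this toy case three direct matches yield four homologous pairs, because the d1-d3 pair is recovered through d2. A false direct match would likewise merge two clusters and add many non-homologous pairs at once, which is why the indirect figure is best read alongside the direct one.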
1) The target resource sequence database preparation.
A non-redundant representative sequence database (RSDB) is used to
provide the neighbouring sequences. This is necessary for algorithms which
utilise multiple sequence alignments in building profiles and hidden Markov
models (HMM). For the SAT package, the default database is "RSDB50",
although several others are also accessible. In this report, RSDB50 and "RSDB100"
are used in an analysis to find which gives the best performance in homology search.
The procedure of creating the resource database is:
(1) The NRDB90 algorithm (Holm and Sander, 1998) was applied to the union of the following databases: Swissprot, Swissnew, Trembl, Tremblnew, Genbank, PIR, Wormpep and PDB. The NRDB90 algorithm uses two filters (a deca- and a penta-peptide filter) to remove redundant sequences, followed by a dynamic programming algorithm, Smith-Waterman, to remove sequences down to 90% mutual sequence identity. This normally results in a size reduction of around 46%. The parameters used were: +1 for identities, -0.15 for amino acid substitutions, and gap opening and elongation penalties of (0, -1) for insertions in the query sequence and (-1, 0) for deletions in the query sequence. The amino acid ambiguity codes B, X and Z always counted as mismatches. The resulting RSDB90 database in FASTA format was then filtered to remove near-neighbour sequences at a mutual pair percentage identity of 50%, using the alignments generated by all-against-all searches with the gapped BLAST 2 program (version 2.0) with default parameters. The programs used for generating RSDB50 are not included in the SAT package, because generating RSDB50 requires a special database, PAIRSDB, which contains all the pairwise alignments of RSDB100 sequences and is not portable at the moment. RSDB50 will be updated periodically at ftp://ftp.ebi.ac.uk/pub/databases/rsdb.
(2) The resulting database is RSDB50 in FASTA format. RSDB50 was subjected to the low-complexity region masking program SEG (Wootton, 1994). The parameters for SEG were all defaults: window size for scan = 12, low trigger complexity = 2.2, high extension complexity = 2.5, with the -x option to replace low-complexity regions with the 'x' character. 2,912,304 residues (7.8%) out of 37,427,010 were masked.
2) Query sequence database preparation:
(1) A set of PDB (Protein Databank) sequences was derived from the classification in SCOP. PDB40D is generated from PDB100D, which contains all the SCOP structure domains. Iterative removal of domains aligned by the BLAST algorithm at different identity cut-offs results in PDB90D, PDB50D, PDB40D and so on. The 40% mutual pair identity cut-off results in PDB40D-J5. The suffix J5 is a version number associated with Jong Park. Because it is a protein domain database, there is no risk of a multi-domain linkage problem in the evaluation stage. All the known linkage problems with SCOP version 1.37 were manually inspected and removed. There were very few cases of such multi-domain linkage problems in PDB40D.
(2) PDB40D-J5 was processed with write_pdbg_files.pl, which produces a file (PDBG file format) containing all the SCOP superfamily information for look-up. The PDBG file contains only the names of the SCOP domains and, at the end of the file, a statistical table for the folds and superfamilies. The PDBG file is accessed for SCOP classification look-up when test query sets do not contain SCOP classification information. Members of the same SCOP superfamily are regarded as evolutionarily related with certainty. The file also records the number of possible homologues (denoted as "Homolog") and non-homologues (denoted as "Nomolog" in the file to make the word length equal to "Homolog").
(3) There are 6,964 possible true homologue pairs in 261 superfamily-level groups out of 1,226,961 total possible pairs. Therefore, there are 1,219,997 possible false positives. In terms of secondary structure composition, the query database contains 848 possible pairs of all-alpha proteins, 3,025 possible pairs of all-beta proteins and 3,006 possible pairs of alpha and beta class proteins.
3) Running search algorithm.
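For the search step itself, a wrapper in the spirit of simple_psi_blast_search.pl has to turn the z, j, h, b and v parameters into a command line. The Python sketch below only assembles such a command and does not run it. The executable name `blastpgp` and the `-i`, `-d` and `-o` options are assumptions based on the legacy NCBI toolkit and are not taken from the SAT package; the query and output file names are hypothetical.

```python
def build_psi_blast_command(query, database, output,
                            j=5, h=0.0005, z=0, b=1000, v=1000):
    """Return an argument list for one PSI-BLAST run (assumed legacy blastpgp syntax)."""
    command = [
        "blastpgp",        # assumed executable name for PSI-BLAST 2.0.x
        "-i", query,       # single query sequence in FASTA format (assumed option)
        "-d", database,    # target resource database (assumed option)
        "-o", output,      # raw search output, later parsed into MSP format (assumed option)
        "-j", str(j),      # number of iterations; j=1 corresponds to GAP-BLAST
        "-h", str(h),      # step E-value cut-off for profile inclusion
        "-b", str(b),      # number of alignments to show
        "-v", str(v),      # number of one-line descriptions
    ]
    if z:
        # A z value of 0 means "use the real database size", so the option is
        # only passed when an explicit effective length is requested.
        command += ["-z", str(z)]
    return command


# RSDB50 searched with its own size, and with the size of RSDB100 as effective length.
default_run = build_psi_blast_command(
    "query_domain.fa", "RSDB50_segged_with_PDB40D-J5.mpfa", "query_domain.bla")
rescaled_run = build_psi_blast_command(
    "query_domain.fa", "RSDB50_segged_with_PDB40D-J5.mpfa", "query_domain.bla",
    z=111958534)
print(" ".join(rescaled_run))
```

Returning a list rather than a single string makes the command safe to hand to a process launcher without shell quoting, and keeps each parameter easy to inspect when results from different settings are compared.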
4) Parsing the outputs into easy sequence pair matching formats.
Parsing the results of any algorithm can be done by the SAT user. As
long as a file in the MSP (Matching Sequence Pair) format is created from
the search, all evaluation can be done automatically. For the FASTA, SSEARCH
and BLAST algorithms, parsers for the MSP format are included in the package
(convert_sso_to_msp.pl and convert_bla_to_msp.pl). An example MSP file
is also included in the package. It is a simple representation of which
sequences match which others, with scores and the regions of sequence matched. It
is a space-delimited format of many columns which can be parsed easily
column by column. Once an MSP file is created from a search, there are two
more steps, which can be run separately or as one.
(1) With the given MSP file, a paired ranking file (PARF) can be generated which contains the homology information. The additional input at this stage is the previously created PDBG file for PDB40D-J5. The file format of PARF is shown in Figure 1. This format is the most informative for further analysis.
Figure 1. PARF file format. The first two columns show the matched pairs. They are sorted and hence non-reflexive. The third column shows the homology information. The fourth column shows the E-value (expectation value). The final two columns show the SCOP classification. The first three numbers separated by dots give the superfamily-level classification. The columns are separated by spaces. The number sign (#) can be used for comments.
(2) write_nhco_files.pl is run on the PARF file to produce the final two-column data set, which gives the numbers of non-homologues and the numbers of homologues. write_nhco_files.pl also calculates the performances for the all alpha, all beta, alpha-beta and alpha plus beta classes from PDB40D-J5. As another way of calculating performance, it does an indirectly matched single linkage clustering and a combinatorial calculation for the pairs within each group. This results in a much higher performance, showing the ability of the search algorithm used to find very distant homologues. It is an additional feature for the final analysis of the members of the families found. The output cannot be configured easily at the moment. However, all the calculations can be easily changed by modifying write_nhco_files.pl, as it is a pure Perl5 program.
(3) Performance measurement: There are different ways of calculating the performances (Pearson, 1996, Rost, 1999) which users can apply. However, different approaches usually do not affect the relative performance of algorithms and databases. The number of mismatches per possible good pairs (MPGP) is the simplest concept for the purpose of this report. If a search algorithm can find all the 6,964 true homologous pairs with 70 non-homologous pairs (around 1% of 6,964), its performance will be 100% at the error rate of 1%. The results of previous assessments of the PSI-BLAST and SAM-T98 hidden Markov model algorithms in PARF file format are available through FTP: ftp://ftp.ebi.ac.uk/pub/contrib/holm/Jong/SAT/Test_set_PARF_files/
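The direct pair calculation described in steps (1) to (3) can be sketched in Python as follows. This is an illustration and not the code of write_parf_files.pl or write_nhco_files.pl. It assumes six space-separated columns in the order given in the Figure 1 caption, and it re-derives homology from the first three SCOP numbers instead of trusting the third column. The example lines, domain names and the "Homolog"/"Nomolog" labels in the third column are hypothetical.

```python
def parse_parf_line(line):
    """Return (seq_a, seq_b, evalue, scop_a, scop_b), or None for comments and blanks."""
    fields = line.split("#", 1)[0].split()  # drop anything after a number sign
    if len(fields) < 6:
        return None
    seq_a, seq_b, _homology_label, evalue, scop_a, scop_b = fields[:6]
    return seq_a, seq_b, float(evalue), scop_a, scop_b


def same_superfamily(scop_a, scop_b):
    """Two domains are homologous if their first three SCOP numbers agree."""
    return scop_a.split(".")[:3] == scop_b.split(".")[:3]


def nhco_curve(parf_lines):
    """Rank pairs by E-value and return cumulative (non_homologs, homologs) counts."""
    records = [r for r in map(parse_parf_line, parf_lines) if r is not None]
    records.sort(key=lambda record: record[2])  # most significant matches first
    curve = []
    non_homologs = homologs = 0
    for _, _, _, scop_a, scop_b in records:
        if same_superfamily(scop_a, scop_b):
            homologs += 1
        else:
            non_homologs += 1
        curve.append((non_homologs, homologs))
    return curve


def homologs_at(curve, max_non_homologs):
    """Number of homologous pairs found before exceeding the allowed mismatches."""
    found = 0
    for non_homologs, homologs in curve:
        if non_homologs > max_non_homologs:
            break
        found = homologs
    return found


example_parf = [
    "# seq_a seq_b label evalue scop_a scop_b (hypothetical records)",
    "d1aaa_ d1bbb_ Homolog 1e-30 1.1.1.1 1.1.1.2",
    "d1aaa_ d1ccc_ Nomolog 0.5 1.1.1.1 2.3.4.1",
    "d1bbb_ d1ddd_ Homolog 2e-05 1.1.1.2 1.1.1.3",
    "d1ccc_ d1eee_ Homolog 3.0 2.3.4.1 2.3.4.2",
]
curve = nhco_curve(example_parf)
print(curve)
print(homologs_at(curve, 0), homologs_at(curve, 1))
```

With the real query database, `homologs_at(curve, 70)` divided by 6,964 would give the fraction of possible good pairs found at the 1% error rate.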
The list of software and databases included in the SAT.
Table 1 shows all the software and databases included
in the SAT. These are stored in several separate subdirectories under the
above FTP URL.
| Software | |
| convert_bla_to_msp.pl | A conversion tool for BLAST output to MSP file. |
| convert_sso_to_msp.pl | A conversion tool for FASTA/SSEARCH output to MSP file. |
| simple_gap_blast_search.pl | A GAP-BLAST wrapper program in Perl. |
| simple_psi_blast_search.pl | A PSI-BLAST wrapper program in Perl. |
| write_parf_files.pl | A PARF file generator from MSP files for the assessment. |
| write_nhco_files.pl | A NHCO file generator from PARF files for the assessment. |
| Bioinf.pl | A Perl5 subroutine library containing all the routines for the SAT. |
| Databases | |
| PDB40D-J5.mpfa | The query database of structural domain sequences. |
| pdb40d.mpfa | An older version of PDB40D from the SCOP database. |
| pdb40d_j.mpfa | A previous version of PDB40D-J for compatibility with some earlier assessments. |
| pdb40d_1.37.mpfa | The PDB40D version from SCOP 1.37 version. |
| pdb40c.mpfa | A full chain (multi-domain) PDB40D from SCOP. |
| RSDB50.mpfa | RSDB50 without PDB40D-J5 embedded. |
| RSDB50_segged_with_PDB40D-J5.mpfa | RSDB50 SEGged with PDB40D-J5 embedded. |
| RSDB90_segged_with_PDB40D-J5.mpfa | RSDB90 SEGged with PDB40D-J5 embedded. |
| RSDB100_segged_with_PDB40D-J5.mpfa | RSDB100 SEGged with PDB40D-J5 embedded. The full sequence database. |
| pdb40d_1to5.pdbg | A PDBG file for PDB40D-J5. |
Table 1. The list of software and databases included in the SAT.
* The extension .mpfa with the databases stands for Multiple Protein FAsta.
A test assessment by the SAT package using RSDB50 database with PSI-BLAST.
| Test Set | NRDB type | z value | j value | h value | Performance at 70 mismatches | b & v values |
| 1 | RSDB50 | 0 | 1 | N/A | 981 | 1,000 |
| 2 | RSDB100 | 0 | 5 | 0.0005 | 1,698 | 1,000 |
| 3 | RSDB100 | 37,427,010 | 5 | 0.0005 | 1,742 | 1,000 |
| 4 | RSDB100 | 0 | 5 | 0.0005 | 1,896 | 15,000 |
| 5 | RSDB50 | 111,958,534 | 5 | 0.0005 | 1,870 | 1,000 |
| 6 | RSDB50 | 0 | 5 | 0.0005 | 1,930 | 1,000 |
| 7 | RSDB50 | 0 | 5 | 0.0015 | 1,969 | 1,000 |
Table 2. Test sets with different parameters. Test set 1 is for GAP-BLAST as a control. A z value of 0 indicates the default, which is the physical size of the database in amino acid residues: 111,958,534 for RSDB100 and 37,427,010 for RSDB50. The -v parameter in BLAST 2.0.9 is the number of one-line descriptions as an integer (default = 500). The -b parameter is the number of alignments to show in the output of BLAST 2.0.9 (default = 250). The default values of b and v are too low for large protein families and reduce the sensitivity dramatically. The j value is the iteration number in PSI-BLAST. The h value is the step E-value cut-off for each iteration and profile generation in PSI-BLAST. The version of PSI-BLAST used was 2.0.9.
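To compare the test sets on a common scale, the performance column of Table 2 can be expressed as a percentage of the 6,964 possible true homologue pairs. The Python sketch below does this arithmetic. It assumes that the column "Performance at 70 mismatches" is a count of true homologous pairs found, which is how the performance measure is described in the Methods; the variable names are illustrative.

```python
TRUE_HOMOLOG_PAIRS = 6964  # possible true homologue pairs in PDB40D-J5

# (test set, database, pairs found at 70 mismatches), copied from Table 2
table2 = [
    (1, "RSDB50", 981),
    (2, "RSDB100", 1698),
    (3, "RSDB100", 1742),
    (4, "RSDB100", 1896),
    (5, "RSDB50", 1870),
    (6, "RSDB50", 1930),
    (7, "RSDB50", 1969),
]


def coverage_percent(pairs_found, total=TRUE_HOMOLOG_PAIRS):
    """Percentage of all possible true homologue pairs that were found."""
    return 100.0 * pairs_found / total


# The 1% error rate corresponds to the 70 mismatches used in the table header.
allowed_mismatches = round(0.01 * TRUE_HOMOLOG_PAIRS)

coverage = {test_set: round(coverage_percent(found), 1)
            for test_set, _, found in table2}

for test_set, database, found in table2:
    print(f"test set {test_set} ({database}): {found} pairs, {coverage[test_set]}%")
```

On this reading, the GAP-BLAST control finds about 14% of the possible pairs and the PSI-BLAST runs between roughly 24% and 28%.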
There were three RSDB50 performances with PSI-BLAST. All of them had the iteration option j=5 and an E-value cut-off of 0.0005 (0.0015 for test set 7, as explained below). The cut-off of 0.0005 has been shown to be safe (Park, et al., 1998), excluding dissimilar false positives at less than 1% mismatches per all the possible good pairs. The b and v parameters for all the RSDB50 test sets (test sets 1, 5, 6 and 7) were 1,000.
1) RSDB50 with 1 iteration of PSI-BLAST, which is the same as normal GAP-BLAST 2.0: test set 1. This test set is the control used as a performance reference.
2) RSDB50 with the effective database length of RSDB100 (z parameter), which calculates the statistical E-value according to the size of RSDB100: test set 5.
3) RSDB50 with the normal z parameter, which calculates the statistical E-value according to the real size of RSDB50: test set 6.
4) RSDB50 with an E-value of 0.0015, which is derived from a manual adjustment by the difference (2.99 times) in the size of RSDB50 (37,427,010 aa residues) and RSDB100 (111,958,534 aa residues). This test set was used because the E-value is a function of the database size. However, as PSI-BLAST is an iterative search method in which newly matched sequences are incorporated into profiles during the search, calculating the statistical score and comparing it with the one from a different database size cannot be exact. Therefore, 0.0015 for RSDB50 is a rough manual adjustment to match 0.0005 for RSDB100. The manual setting is an alternative to changing the z value: test set 7.
For comparison, RSDB100 performances under the same conditions as RSDB50 were measured. They are:
1) RSDB100 with an E-value of 0.0005 and b and v parameters of 1,000 at the default z value: test set 2.
2) RSDB100 with a z value of 37,427,010 (from the size of RSDB50), an E-value of 0.0005 and b=v=1,000: test set 3.
3) RSDB100 with an E-value of 0.0005 and b and v parameters of 15,000. Higher b and v values are additionally considered because, with a large redundant database, very close homologues flood the results of search algorithms. Over 10,000 IG domain hits can be recorded, and some distant homologues will be excluded even though their E-value is statistically significant (2e-11, for example). The time taken for one single query with b and v of 15,000 was over 12 hours; therefore, the b=v=15,000 setting should be regarded as an experimental exception: test set 4.
All the searches were done on DEC Alpha UNIX machines running at 500 MHz with 256 megabytes or more of memory.
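The manual E-value adjustment used for test set 7 can be reproduced with a few lines of Python. The sketch only repeats the arithmetic stated in the text (the 0.0005 cut-off multiplied by the ratio of the two database sizes); it does not model how PSI-BLAST computes E-values, and the function names are illustrative.

```python
RSDB50_RESIDUES = 37_427_010    # physical size of RSDB50 in amino acid residues
RSDB100_RESIDUES = 111_958_534  # physical size of RSDB100 in amino acid residues


def size_ratio(larger, smaller):
    """How many times larger one database is than the other."""
    return larger / smaller


def manually_adjusted_cutoff(base_cutoff, ratio):
    """Scale a step E-value cut-off by a database size ratio, as done for test set 7."""
    return base_cutoff * ratio


ratio = size_ratio(RSDB100_RESIDUES, RSDB50_RESIDUES)
adjusted = manually_adjusted_cutoff(0.0005, ratio)

print(f"size ratio: {ratio:.2f}")
print(f"adjusted cut-off: {adjusted:.4f}")
```

The ratio rounds to 2.99 and the adjusted cut-off to 0.0015, matching the h value listed for test set 7 in Table 2.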
Results and discussion.
Figure 2 shows that RSDB50 performance
is equivalent to that of RSDB100, which is three times larger, across different
effective database length parameter values (z). The RSDB50 searches were 6.3 times
faster than the RSDB100 searches (24 vs 151 hours). Even with the manual adjustment
of E-values to match the database size difference for a more precise comparison
between RSDB50 and RSDB100, RSDB50 scored consistently higher. The best
RSDB100 performance, with high b and v, is marginally better than the lowest
RSDB50 test set. However, high b and v values should not be considered
normal because of the excessive time taken. Multiple computer CPUs had to be used
to finish the searches with b=v=15,000. Therefore, the decrease of performance
with default or reasonably high b and v should be regarded as more of a
feature for speed with a very large database like RSDB100, which has 370,000
sequences (March 11th, 1999 version). It is therefore clear that, for the
SAT, RSDB50 is a better resource database than RSDB100.
Even though RSDB50 has been derived from RSDB100, the z parameter (37,427,010)
for the true database size performs better than the one from RSDB100, emphasizing
the fact that the difference in information content between the two databases
is more relevant than the physical database size. Therefore, for speed
and sensitivity, generating a concentrated subset database like RSDB50
would be a better solution than using a full database. The findings obtained
with this particular set of tools, including PSI-BLAST, the RSDBs and the
SCOP classification, may not be completely general. To assess the performance
of different algorithms with biased sequence representation, one can use
RSDB100, which is included in the SAT. The equivalent performance down
to the 50% sequence identity level in the database corresponds to the fact
that protein sequences have a "twilight zone" and that, down to a certain
level of identity, it is very straightforward to align and search protein sequences
(Rost, 1999).
The RSDB100 performance with a z value of 37,427,010 (from RSDB50) is better
than the one with its own z value (the second dotted line from the bottom).
This, again, suggests that the database size parameter is more relevant
to the sequence diversity than to the physical size of the database. It also supports the
value of good weighting in profile or HMM building
(Krogh and Mitchison, 1995, Karchin and Hughey, 1998) and scoring, or of an
evenly concentrated database like RSDB50. The PARF files from all the test sets
are available in the SAT so that users can compare them with the results of their own algorithms.
In conclusion, an assessment kit like SAT can help speed the analysis
and development of search algorithms and sequence database characteristics,
as it can provide objectivity.
Measuring the performance of a homology search algorithm does not necessarily
reflect the quality of the sequence alignments made by
the algorithm. Alignments are often very important for methods in protein
structure prediction, for very sensitive homology searches by profile and HMM
generation, and for structural and functional model building. At present, the
SAT does not include any structural alignments which can be compared with
the sequence alignments generated by search algorithms. A structural
alignment assessment component, with an assessment program and
a structure alignment database, will be included in the future.
Figure 2. The bottom solid line is for GAP-BLAST (i.e., PSI-BLAST iteration option j=1). The 3 dotted lines are for RSDB100. The bottom dotted line is the performance with b and v values of 1,000. The second dotted line is the performance with z=37,427,010. The third dotted line is the performance with b and v values of 15,000. The top 3 solid lines are the performances of RSDB50. The lowest of the three is for the z parameter set to the size of RSDB100 (with b=v=1,000). The second lowest is with the default z parameter (0) for the true size of RSDB50. The top line is for the step E-value of PSI-BLAST (h) set to 0.0015.
Acknowledgment
J.P. thanks many members of the George Church Lab at Harvard Medical
School, Sarah Teichmann for earlier collaboration, Tim Hubbard for PDB40D,
Steve Brenner for an earlier work, Alex Batemann for good discussions,
Terry Horsnell for computers, Maryana Huston for her caring assistance,
and most importantly many people who have shown passion, ruthless honesty
and objectivity in pursuing science.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J., (1990), Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J., (1997), Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M.A., (1994), Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA, 91, 1059-1063.
Bleasby, A.J., Akrigg, D., and Attwood, T.K., (1994), OWL--a non-redundant composite protein sequence database. Nucleic Acids Res., 22, 3574-3577.
Brenner, S.E., Chothia, C., and Hubbard, T., (1998), Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. USA, 95, 6073-6078.
Eddy, S.R., Mitchison, G., and Durbin, R., (1995), Maximum discrimination hidden Markov models of sequence consensus. J. Comput Biol. 2, 9-23.
Eddy, S.R., (1995), ISMB 95 (Intelligent Systems in Molecular Biology conference) Multiple Alignment Using Hidden Markov Models, ISMB 95.
Gribskov, M., McLachlan, A.D., and Eisenberg, D., (1987), Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84, 4355-4358.
Holm, L., and Sander, C., (1996), Mapping the protein universe. Science, 273, 595-603.
Holm, L., and Sander, C., (1998), Touring protein fold space with Dali/FSSP. Nucleic Acids Res., 26, 316-319.
Holm, L., and Sander, C., (1998), Removing near-neighbour redundancy from large protein sequence collections. Bioinformatics, 14, 423-429.
Holm, L. and Sander, C., (1999), Protein folds and families: sequence and structure alignments. Nucleic Acids Res., 27, 244-247.
Kallberg, Y., and Persson, B., (1999), KIND-a non-redundant protein database. Bioinformatics, 15, 260-261.
Karchin, R., and Hughey, R., (1998), Weighting hidden Markov models for maximum discrimination. Bioinformatics, 14, 772-782.
Krogh, A., Brown, M., Mian I.S., Sjolander, K., and Haussler, D., (1994), Hidden Markov-Models in Computational Biology - Applications in Protein Modelling. J. Mol. Biol., 235, 1501-1531.
Krogh, A., and Mitchison, G., (1995), ISMB95 (Intelligent Systems in Molecular Biology conference) Maximum entropy weighting of aligned sequences of proteins or DNA. ISMB95, 215-221.
Murzin, A., Brenner, S.E., Hubbard, T., and Chothia, C., (1995), scop: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540. (see also: http://scop.mrc-lmb.cam.ac.uk/scop).
Needleman, S. B., and Wunsch, C. D., (1970), A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.
Orengo, C. A., Martin, A. M., Hutchinson, G., Jones, S., Jones, D. T., Michie, A. D., Swindells, M.B., and Thornton, J. M., (1998), Classifying a protein in the CATH database of domain structures. Acta Crystallogr D Biol Crystallogr., 54, 1155-1167.
Park, J., Teichmann, S.A., Hubbard, T., and Chothia, C., (1997), Intermediate sequences increase the detection of distant sequence homologies. J. Mol. Biol., 273, 349-354.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T., and Chothia, C., (1998), Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201-1210.
Pearson, W.R., and Lipman, D.J., (1988), Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA, 85(8), 2444-2448.
Pearson, W.R., (1996), Effective protein sequence comparison. Methods Enzymol. 266, 227-258.
Rost B., (1999), Twilight zone of protein sequence alignments. Protein Eng., 12, 85-94.
Smith, T.F., and Waterman, M.S., (1981), Identification of common molecular subsequences. J. Mol. Biol., 147, 195-197.
Tatusov, R.L., Altschul, S.F., and Koonin, E.V., (1994), Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc. Natl. Acad. Sci. USA, 91, 12091-12095.
Waterman, M.S., (1986), Multiple sequence alignment by consensus. Nucleic Acids Research, 14, 9095-9102.
Wootton, J.C., (1994), Sequences with unusual amino-acid compositions. Curr. Op. Struct. Biol., 4, 413-421.