Bioinformatics – An aid for biological research

M. Madan Babu, Center for Biotechnology, Anna University

Bioinformatics is the application of information technology to store, organize and analyse the vast amount of biological data available in the form of sequences and structures of proteins (the building blocks of organisms) and nucleic acids (the information carriers). The biological information of nucleic acids is available as sequences, while the data of proteins is available as both sequences and structures. A sequence is a one-dimensional representation, whereas a structure contains the three-dimensional data of a sequence.

Initial interest in bioinformatics was propelled by the necessity to create databases of biological sequences. The first database was created within a short period after the insulin protein sequence was made available in 1956. Incidentally, insulin was the first protein to be sequenced. The sequence of insulin consists of just 51 residues (analogous to letters in a sentence), which characterise the sequence. Around the mid-1960s, the first nucleic acid sequence, that of yeast tRNA with 77 bases (the individual units of nucleic acids), was determined. During this period, the three-dimensional structures of proteins were studied, and the well-known Protein Data Bank was developed as the first protein structure database, with only 10 entries in 1972. This has now grown into a large database with over 10,000 entries. While the initial databases of protein sequences were maintained at individual laboratories, the development of a consolidated formal database, the SWISS-PROT protein sequence database, was initiated in 1986. It now has about 70,000 protein sequences from more than 5000 model organisms, a small fraction of all known organisms. This huge variety of divergent data resources is now available for study and research by both academic institutions and industries. They are made available as public domain information, in the larger interest of the research community, through the Internet (www.ncbi.nlm.nih.gov) and on CD-ROMs (on request from www.rcsb.org). These databases are constantly updated with additional entries.

Computers and software tools are extensively used to create these databases, predict the function of proteins, model the structure of proteins, determine the coding (useful) regions of nucleic acid sequences, find suitable drug compounds from a large pool, and optimise the drug development process by predicting possible targets.

As the analysis involves a lot of computational effort, it is impractical to do it manually. Some of the software tools that are handy in the analysis include BLAST, commonly used for comparing sequences; Annotator, an interactive genome (nucleic acid sequence) analysis tool; and GeneFinder, a tool to identify coding regions and splice sites.

The sequence information generated by human genome research, initiated in 1988, has now been stored as a primary information source for future applications in medicine. The available data is so large that, if compiled in books, it would run into 200 volumes of 1000 pages each, and reading it alone (ignoring the time needed to understand it) would require 26 years of working around the clock. For a population of about 5 billion human beings, with any two individuals differing in about three million bases, the genomic sequence difference database would have about 15,000,000 billion entries. The present challenge in handling such a huge volume of data is to improve database design, develop software for database access and manipulation, and devise data-entry procedures that compensate for the varied computer procedures and systems used in different laboratories. The much celebrated complete human genome sequence, formally announced on 26 June 2000, involved more than 500 x 10^18 (500 million trillion) calculations during the assembly of the sequences alone. This is the biggest exercise in the history of computational biology.
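The figures quoted above can be checked with a few lines of arithmetic. The sketch below only multiplies out the numbers given in the text; the variable and function names are illustrative and not part of any tool.

```python
# Back-of-the-envelope check of the figures quoted in the text.
POPULATION = 5_000_000_000           # about 5 billion people
DIFFERENCES_PER_PERSON = 3_000_000   # about 3 million differing bases each


def difference_entries(population: int, differences: int) -> int:
    """Entries needed to record the differing bases of every individual."""
    return population * differences


entries = difference_entries(POPULATION, DIFFERENCES_PER_PERSON)
entries_in_billions = entries // 1_000_000_000

# Printed size of the genome data: 200 volumes of 1000 pages each.
VOLUMES = 200
PAGES_PER_VOLUME = 1000
total_pages = VOLUMES * PAGES_PER_VOLUME

# Calculations quoted for assembling the sequences: 500 x 10^18.
assembly_calculations = 500 * 10**18

print(f"{entries_in_billions:,} billion entries")
print(f"{total_pages:,} pages")
print(f"{assembly_calculations:.0e} calculations")
```

The product of 5 billion people and 3 million differences is 1.5 x 10^16, which is the 15,000,000 billion entries mentioned above.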

The growth of the primary databases gave rise to serious and valid questions about the format of the sequences and the reliability and comprehensiveness of the databases. To address the format issues, in-house software solutions have been developed to convert the format of one database to that of another. The public domain software FORCON can also be used. The newer software tools used for analysis accept data in multiple formats. The problem with the reliability of the data is the possibility of misannotation. Misannotations are sometimes introduced by automated annotation, which is carried out extensively with the help of computers. Once introduced, a misannotation multiplies in subsequent additions and may accumulate to an unbelievable extent and create confusion. A possible way to prevent this is to flag every protein sequence that has been annotated by sequence comparison but whose function has not been validated by experimental methods.
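The flagging idea can be sketched as a small record type that stores how each annotation was obtained. This is a minimal illustration only: the `Annotation` class, the evidence labels "experimental" and "similarity", and the sequence identifiers are all hypothetical and do not correspond to the schema of any actual database.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    sequence_id: str   # hypothetical identifier
    function: str      # annotated function of the protein
    evidence: str      # "experimental" or "similarity" (assumed labels)

    @property
    def needs_validation(self) -> bool:
        """True when the function was inferred, not shown by experiment."""
        return self.evidence != "experimental"


def flag_unvalidated(annotations):
    """Return the ids of sequences whose function is not yet validated."""
    return [a.sequence_id for a in annotations if a.needs_validation]


records = [
    Annotation("SEQ1", "protein kinase", "experimental"),
    Annotation("SEQ2", "protein kinase", "similarity"),
    Annotation("SEQ3", "transporter", "similarity"),
]
flagged = flag_unvalidated(records)
print(flagged)
```

A tool that annotates a new sequence by comparing it with SEQ2 could then see that SEQ2 itself was never validated, rather than passing the inferred function on as if it were established.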

A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources. Different composite databases use different primary databases and different criteria in their search algorithms. Various search options have also been incorporated in the composite databases. The National Center for Biotechnology Information (NCBI), which hosts these nucleotide and protein databases on its large, highly available, redundant array of computer servers, provides free access to everyone involved in research.

Secondary databases are created and hosted by various researchers at their individual laboratories and contain pattern data. Examples include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford. These patterns are the most highly conserved regions, are arrived at by multiple sequence alignment of a set of related proteins, and are often crucial to the structure and function of the protein.
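To make the idea of a pattern concrete, the sketch below converts a pattern written in PROSITE-style notation into a Python regular expression and scans a sequence with it. It is a simplified illustration that handles only the common elements of the notation (`x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` or `(n,m)` for repetition, and `<` / `>` for the sequence ends). The function names and the test sequence are made up for the example; the pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):          # repetition such as x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                 # any residue
            core = "."
        elif element.startswith("{"):      # any residue except these
            core = "[^" + element[1:-1] + "]"
        else:                              # one residue or a [..] set
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(pattern: str, sequence: str):
    """Return (start, match) pairs for non-overlapping occurrences."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motifs("N-{P}-[ST]-{P}", "MKNGTAANPSW"))
```

Note that `re.finditer` reports non-overlapping matches only, so a real motif scanner would need extra care where occurrences overlap.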

Apart from maintaining the large databases, mining useful information from this set of primary and secondary databases is very important. Many efficient algorithms have been developed for data mining and knowledge discovery. These are computation-intensive and need fast, parallel computing facilities to handle multiple queries simultaneously.

Predicting structure and function with great accuracy is one of the goals that people are working toward in bioinformatics. This area is known as structural and functional genomics, where the structure and function of proteins are identified from their sequences using a host of similarity search criteria. Visualising the structure plays an important role in understanding it. However, visualisation requires computing facilities and special software tools such as MOLMOL, RasMol and WebLab.

One of the still unsolved problems in biology is protein structure prediction (protein folding): predicting the native 3D structure of a protein from its sequence. A striking observation is that, though there are more than 10,000 structure entries in the Protein Data Bank, there are only about 1500 unique protein structures. Thus the 3D structure of proteins is more or less restricted to a relatively small structure space. Any change in structure dramatically alters the function of a protein, and even the slightest change in the folding process can turn a desirable protein into the cause of a disease. The scientific community considers protein folding one of the most significant and fundamental problems in biological science, one that has broad economic and scientific impact and whose solution can be advanced only by applying high-performance computing technologies. Recognising the gravity of this problem, IBM announced in December 1999 a new $100 million exploratory research initiative to build a supercomputer 500 times more powerful than the world's fastest existing computer and 2 million times faster than today's fastest desktop PC. This new computer, nicknamed "Blue Gene" by IBM researchers, will be capable of performing more than one quadrillion operations per second (one petaflop, equal to 10^15 operations per second). A better understanding of how proteins fold will give scientists and doctors better insight into diseases and ways to combat them.

Pharmaceutical companies could design high-tech prescription drugs customised to the specific needs of individual people, and doctors could respond more rapidly to changes in bacteria and viruses that cause them to become drug-resistant. According to a recent IBM press release, "One day, you're going to be able to walk into a doctor's office and have a computer analyse a tissue sample, identify the offending pathogen, and then instantly prescribe a treatment best suited to your specific illness and your genetic makeup."

From the pharmaceutical industry's point of view, bioinformatics is the key to rational drug discovery. It reduces the number of trials in the screening of drug compounds and helps in identifying potential drug targets for a particular disease, using high-power computing workstations and software such as Insight. This profound application of bioinformatics to genome sequences has led to a new area in pharmacology, pharmacogenomics, where potential targets for drug development are hypothesized from the genome sequences. Molecular modeling, which requires a lot of calculation, has become faster due to advances in computer processors and their architecture.

In plant biotechnology, bioinformatics is found to be useful in identifying disease resistance genes and designing plants with high nutritional value.

To conclude, today it is possible to perform (using heuristic algorithms) searches that are 80% accurate, and perhaps 90-95% accurate with the leading software systems. More sensitive methods that improve search accuracy, such as hidden Markov models and the Smith-Waterman algorithm, are also available but take more time to execute the search. To handle these demanding needs, computers are now being designed around the biologists.
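To illustrate why a sensitive method such as Smith-Waterman costs more time than a heuristic search, here is a minimal sketch of Smith-Waterman local alignment. It fills a full scoring matrix, so the work grows with the product of the two sequence lengths. The scoring values (match +2, mismatch -1, gap -1) and the linear gap penalty are assumptions chosen for the example, not the parameters of any particular software system.

```python
def smith_waterman(a: str, b: str, match=2, mismatch=-1, gap=-1):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best local alignment score ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # start a new local alignment
                H[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,       # gap in b
                H[i][j - 1] + gap,       # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAAGGGTTT", "CCGGGCC"))
```

For the two short sequences above, the only shared region is `GGG`, which the function reports with a score of 6. Heuristic tools avoid filling the whole matrix, which is where their speed comes from.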

Leading bioinformatics companies are developing software systems that permit research scientists to integrate their diverse data and tools under common graphical user interfaces (GUIs). This creates more opportunity for research and discovery through savings in time and data co-ordination. It also permits scientists to share information and provides a powerful solution for archiving data.

Thus bioinformatics has become synonymous with biological research. New academic programs that train students in bioinformatics are providing them with a background in molecular biology and in computer science, including database design and analytical approaches. Bioinformatics tools for efficient research will have significant implications for medical sciences and the betterment of human lives.