
1 How Bioinformatics Can Change Your Life: Basic Concepts of Bioinformatics
M. Alroy Mascrenghe MBCS, MIEEE, MIT
A lecture given for the BCS Wolverhampton Branch at the University of Wolverhampton

2 TOC
- Introduction
- Basic concepts in molecular biology
- Bioinformatics techniques
- Areas in bioinformatics
- Applications
- Related computer technology
- Conference in Glasgow
- Acknowledgements
- References

3 Introduction

4
- A major event happened that was to change the course of human history
- It was a joint British and American effort (nothing to do with Iraq!)
- It was a race: who would finish first?
- Race test: not whether they have taken drugs, but whether they can produce them!
- The human genome was sequenced

5 A Situ… somewhere in the near future
- A virus (not the I Love You virus) creates an epidemic
- Geneticists and bioinformaticians roll up their sleeves
- The genetic material of the virus is compared with the existing base of known genetic material of other viruses, whose characteristics are already known
- From the genetic material, computer programs derive the proteins necessary for the survival of the virus
- Once the protein (sequence and structure) is known, medicines can be designed

6 What is bioinformatics?
- The marriage between computer science and molecular biology
- The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists
- Information technology applied to the management and analysis of biological data
- Storage and analysis are two of the important functions; bioinformaticians build tools for each

7 [Diagram labels: Biology, Chemistry, Statistics, Computer Science, Bioinformatics]

8 What is bioinformatics? (continued)
- This is the age of information technology
- However, storing information is nothing new
- Information equal in volume to the Encyclopaedia Britannica is stored in each of our cells
- Bioinformatics tries to determine which information is biologically important

9 Basics of Molecular Biology

10 DNA & Genes
- DNA is where the genetic information is stored
- Traits such as blonde hair and blue eyes are inherited through it
- Gene: the basic unit of heredity
- There are genes for characteristics, e.g. a gene for blonde hair
- Genes contain their information as a sequence of nucleotides
- Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately
- Genes are made up of nucleotides

12 Nucleotide (nt)
- Each nt is made up of:
  - Sugar
  - Phosphate group
  - Base
- The base is the only difference between one nt and another
- There are 4 different bases: G(uanine), A(denine), T(hymine), C(ytosine)
- The information is in the order of the nucleotides; the order is the information
- Genes can be many thousands of nt long
- The complete set of genetic instructions is called the genome

13 Chromosomes
- DNA strings make up chromosomes
- Analogy:
  - Letters: nt
  - Sentences: genes
  - Individual volumes of the Encyclopaedia Britannica: chromosomes
  - All volumes together: the genome

14 Double Helix
- DNA is a double helix
- Each strand carries complementary information
- Each base in one strand is bonded to a particular base in the other strand:
  - G - C
  - A - T
- For example:
  AATGC   one strand
  TTACG   other strand
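The pairing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration added for clarity, not part of the lecture; the names `PAIRS` and `complement` are made up for the example.

```python
# Base pairing from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("AATGC"))  # prints TTACG, the slide's other strand
```

Because the two strands run in opposite directions, tools usually report the reverse complement, which here would be `complement(s)[::-1]`.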

15 Proteins
- Proteins are very important biological molecules
- Amino acids make up proteins
- There are 20 different amino acids
- The function of a protein depends on the order of its amino acids

16 Proteins (continued)
- The information required to assemble amino acids (aa) into proteins is stored in DNA
- DNA sequence determines amino acid sequence
- Amino acid sequence determines protein structure
- Protein structure determines protein function
- A substance called RNA carries the information stored in DNA, which in turn is used to make proteins
- Storage: DNA
- Information transfer: RNA
- RNA is the message boy!

17 Central dogma
DNA -> (transcription, by RNA polymerase) -> RNA -> (translation, by ribosomes) -> Protein

19 Proteins (continued)
- Since there are 20 amino acids to encode and only 4 bases, one nt cannot correspond to one aa; neither can a pair of nt (4 x 4 = 16 combinations)
- So protein information is carried in triplet codes called codons (4 x 4 x 4 = 64 combinations)
- The codons that do not correspond to an amino acid are stop codons: UAA, UAG, UGA (RNA has U instead of T)
- AUG codes for methionine and is also used as the start codon
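A minimal Python sketch of the codon idea follows. It is an illustration added for clarity, not code from the lecture. It assumes the DNA string is the coding strand, so transcription is reduced to swapping T for U; the function names are hypothetical. The amino acid string is the standard genetic code in one-letter symbols, listed in UCAG order, with `*` marking stop codons.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in one-letter amino acid symbols; "*" marks a stop codon.
# Order: first base varies slowest, third base fastest, each in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Simplified transcription: treat dna as the coding strand, T becomes U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the RNA."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":  # stop codon: UAA, UAG or UGA
            break
        protein.append(aa)
    return "".join(protein)


# Why triplets: 1 or 2 bases give too few combinations for 20 amino acids.
print([len(BASES) ** n for n in (1, 2, 3)])  # [4, 16, 64]
print(translate(transcribe("ATGGCCTAA")))    # MA (methionine, alanine, then stop)
```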

20 Protein Structure
- Protein structures show wide variety, whereas the structure of DNA is uniform
- X-ray crystallography or nuclear magnetic resonance (NMR) is used to determine the structure
- Structure is related to function; or rather, structure determines function
- Although proteins are created as a linear chain of aa, they fold into a 3D structure
- If you stretch them and release them, they return to this structure; this is the native structure of the protein
- Only in its native structure does a protein function well
- Even after translation is over, the protein goes through some changes to its structure

21 Gene Expression
- Gene expression: the process of transcribing DNA and translating RNA to make a protein
- Where do the genes begin in a chromosome?
- How does the transcription machinery (RNA polymerase) identify the beginning of a gene?
- A single nt cannot mark the beginning of a gene, as each nt occurs far too frequently
- But a particular combination of nucleotides can
- Promoter sequences: the order of nt that marks the beginning of a gene
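The point about single nucleotides versus combinations can be shown with a small motif search. This is only a sketch: the motif `TATAAT` and the test sequence are illustrative assumptions, not taken from the lecture, and `find_motif` is a hypothetical helper name.

```python
def find_motif(sequence: str, motif: str) -> list[int]:
    """Return every start position (0-based) where motif occurs in sequence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)  # allow overlapping hits
    return positions


sequence = "GCGTATAATGCCATGGCA"        # made-up example sequence
print(find_motif(sequence, "A"))       # [4, 6, 7, 12, 17]: a single nt is everywhere
print(find_motif(sequence, "TATAAT"))  # [3]: a six-letter combination is specific
```

In random sequence a given single base is expected at one position in 4, while a given six-base combination is expected at roughly one position in 4^6 = 4096, which is why a combination can act as a marker.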

22 Bioinformatics Techniques

23 Prediction and Pattern Recognition
- The two main areas of bioinformatics are:
  - Pattern recognition: a particular sequence or structure has been seen before, and a particular characteristic can be associated with it
  - Prediction: from a sequence (what we know) we can predict the structure and function (what we don't know)

24 Dot plots
- A simple way of evaluating the similarity between two sequences
- One sequence is placed along one axis of a grid and the other sequence along the other axis
- Wherever the two sequences match, the grid is marked
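A dot plot is easy to sketch as a grid of true/false cells. The code below is a minimal illustration, assuming exact single-character matches with no windowing or filtering; `dot_plot` and `render` are hypothetical names.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[list[bool]]:
    """Grid with one row per character of seq_a and one column per character of seq_b."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a: str, seq_b: str) -> str:
    """Draw the grid as text: '*' marks a match, '.' marks a mismatch."""
    lines = ["  " + seq_b]
    for a, row in zip(seq_a, dot_plot(seq_a, seq_b)):
        lines.append(a + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TTACTATA", "TAGATA"))
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree.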

26 Alignments
- A matching, by similarity, between the characters of two or more sequences
- E.g.
  TTACTATA
  TAGATA
- There are many ways to align these two sequences:
  TTACTATA
  TAGATA

  TTACTATA
   TAGATA

  TTACTATA
    TAGATA
- So which one do we choose, and on what basis?
- The solution is to assign a match score and a mismatch score
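As a rough sketch of scoring, the code below slides the shorter sequence along the longer one without gaps and totals a match score and a mismatch score at each offset. The values +1 and -1 are assumptions chosen for illustration; the lecture does not give numbers at this point.

```python
def score_ungapped(long_seq, short_seq, offset, match=1, mismatch=-1):
    """Score short_seq laid against long_seq starting at offset, with no gaps."""
    window = long_seq[offset:offset + len(short_seq)]
    return sum(match if a == b else mismatch for a, b in zip(window, short_seq))


long_seq, short_seq = "TTACTATA", "TAGATA"
scores = {
    offset: score_ungapped(long_seq, short_seq, offset)
    for offset in range(len(long_seq) - len(short_seq) + 1)
}
print(scores)  # {0: 0, 1: -2, 2: 0}
```

With these scores, offsets 0 and 2 tie and offset 1 is worse, so the score gives a basis for ranking the candidate alignments.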

27 Gaps
- Introduce gaps, with a penalty score for each gap:
  TTACTATA
  T_A_GATA
- In gap scoring, a single indel two characters long is preferred to two indels that are each one character long
- However, not all gaps are bad. How do we align these two?
  TTGCAATCT
  ---CAA---
- These end gaps are not biologically significant
- Semi-global alignments do not penalise them
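One common way to make a single long indel cheaper than several short ones is an affine gap penalty: a larger cost to open a gap and a smaller cost to extend it. The sketch below assumes that scheme, uses `-` as the gap character, and picks the penalty values -3 and -1 arbitrarily; none of these specifics come from the lecture.

```python
import re


def gap_penalty(gapped, gap_open=-3, gap_extend=-1, free_ends=False):
    """Total penalty for the gap runs in one row of an alignment.

    Each run of '-' costs gap_open plus gap_extend for every extra position.
    With free_ends=True, leading and trailing gaps cost nothing (semi-global).
    """
    body = gapped.strip("-") if free_ends else gapped
    runs = re.findall(r"-+", body)
    return sum(gap_open + gap_extend * (len(run) - 1) for run in runs)


print(gap_penalty("AC--GT"))                     # -4: one indel of length 2
print(gap_penalty("A-C-GT"))                     # -6: two indels of length 1
print(gap_penalty("---CAA---"))                  # -10: end gaps penalised (global)
print(gap_penalty("---CAA---", free_ends=True))  # 0: end gaps free (semi-global)
```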

28 Scoring Matrix
- For DNA/protein sequence alignment we create a scoring matrix, for example:
  - A aligned with A scores 1
  - A aligned with T scores -5
  - A aligned with C scores -1
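A scoring matrix can be held as a lookup table keyed by pairs of bases. In the sketch below only the three A-row values come from the slide; every other entry is filled with an assumed placeholder (+1 for identical bases, -1 otherwise) so that the table is complete enough to run.

```python
BASES = "ACGT"

# The three values given on the slide.
SLIDE_SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}

# Placeholders for pairs the slide does not list (assumptions, not lecture data).
ASSUMED_MATCH = 1
ASSUMED_MISMATCH = -1


def build_matrix():
    """Build a symmetric score table covering every ordered pair of bases."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            default = ASSUMED_MATCH if x == y else ASSUMED_MISMATCH
            matrix[(x, y)] = SLIDE_SCORES.get((x, y), SLIDE_SCORES.get((y, x), default))
    return matrix


def score_columns(seq_a, seq_b, matrix):
    """Sum the matrix score of each aligned column (equal-length, ungapped)."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


matrix = build_matrix()
print(score_columns("AAT", "ATT", matrix))  # 1 + (-5) + 1 = -3
```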

29 Dynamic Programming
- As the query sequences get longer, and as the difference in length between the two sequences grows, more gaps have to be inserted in more places
- We cannot perform an exhaustive search
- A combinatorial explosion occurs: there are too many combinations to search
- Dynamic programming avoids the exhaustive search by building the best alignment from the best alignments of shorter prefixes, so each sub-problem is solved only once
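A minimal sketch of dynamic programming for global alignment (the Needleman-Wunsch recurrence) is shown below. The lecture does not name a specific algorithm, so treat this as one standard instance; the scores +1, -1 and -2 are assumptions, and a linear gap penalty is used for brevity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b, filled in one table cell at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = score[i - 1][j] + gap  # a[i-1] aligned to a gap
            gap_in_a = score[i][j - 1] + gap  # b[j-1] aligned to a gap
            score[i][j] = max(diagonal, gap_in_b, gap_in_a)
    return score[-1][-1]


print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is computed once, so the work grows with the product of the two lengths rather than with the number of possible alignments.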

30 Databases
- Sequence information is stored in databases so that it can be manipulated easily
- The databases (next slide) are located in different places
- They exchange information daily so that they stay up to date and in sync
- Primary db: sequence data

31 Major Primary DB
Nucleic acid       Protein
EMBL (Europe)      PIR (Protein Information Resource)
GenBank (USA)      MIPS
DDBJ (Japan)       SWISS-PROT (University of Geneva, now with EBI)
                   TrEMBL (a supplement to SWISS-PROT)
                   NRL-3D

32 Composite DB
- As there are many databases, which one should we search? Some are strong in some aspects and weak in others
- A composite db is the answer: it uses several primary databases as its base data
- Searching a composite db is indexed and streamlined, so that the same stored sequence is not searched twice in different databases

33 Composite DB (continued)
- OWL has these as its primary databases:
  - SWISS-PROT (top priority)
  - PIR
  - GenBank
  - NRL-3D

34 Secondary db
- Store secondary structure info or the results of searches of the primary db
Secondary db   Primary source
PROSITE        SWISS-PROT
PRINTS         OWL

35 Database Searches
- We have sequenced and identified genes, so we know what they do
- The sequences are stored in databases
- So if we find a new gene in the human genome, we compare it with the already known genes stored in the databases
- Since the databases hold a very large number of sequences, we cannot do a full sequence alignment against each and every one
- So heuristics must be used
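One heuristic of this kind is to index short words (k-mers) from every stored sequence and fully align the query only against sequences that share a word with it. The sketch below illustrates that seeding idea in general terms; the lecture does not say which heuristic it has in mind, and the database contents, names and `k=3` are made up for the example.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every k-letter word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidates(query, index, k=3):
    """Names of sequences sharing at least one word with query, best first."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits, key=lambda name: (-hits[name], name))


database = {"gene_a": "TTACTATA", "gene_b": "GGGCCCGG"}  # hypothetical entries
index = build_index(database)
print(candidates("TAGATA", index))  # ['gene_a']: only this one is worth aligning
```

The trade-off is speed against sensitivity: a related sequence that shares no exact word with the query is never examined.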

36 Areas in Bioinformatics

37 Genomics
- In a multicellular organism each cell type expresses genes in a different way, although every cell has the same genetic content
- i.e. all the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them

38 Genomics - Finding Genes
- Finding a gene in sequence data is like finding a needle in a haystack
- However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data
- Within a whole array of nt, we try to find and mark the borders of a set of nt that forms a gene
- This is one of the challenges of bioinformatics
- Neural networks and dynamic programming are being employed
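The neural network and dynamic programming gene finders mentioned here are beyond a short example, so the sketch below swaps in a much simpler stand-in: scanning for open reading frames (a start codon followed in the same frame by a stop codon). It looks at the forward strand only, and `find_orfs` and `min_codons` are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches in the three forward frames.

    min_codons is the minimum number of codons before the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate gene at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index includes the stop codon
                start = None
    return orfs


dna = "CCATGGCCTAAGG"
print(find_orfs(dna))                       # [(2, 11)]
print([dna[s:e] for s, e in find_orfs(dna)])  # ['ATGGCCTAA']
```

Real gene finders add statistical evidence on top of this, since many open reading frames occur by chance and many real genes are interrupted.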

39
Organism       Genome size (Mb; 1 Mb = 1,000,000 bp)   Gene number   Web site
Yeast          13.5                                    6,241         u/Saccharomyces
Fruit flies    180                                     13,601        indiana.edu
Homo sapiens   3,000                                   45,000        lm.nih.gov/genome/guide

40 Proteomics
- The proteome is the sum total of an organism's proteins
- More difficult than genomics:
  - Simple chemical makeup (DNA) versus complex (proteins)
  - Can be duplicated (DNA) versus can't (proteins)
- We are entering the post-genome era
- Meaning that much has been done with genes, not that it is all over

41 Proteomics (continued)
- The RNA and the protein it codes for are usually very different
- After translation, proteins change
- So the aa sequence tells us nothing about post-translational changes
- Proteins are not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell
- So the aa sequence only hints at these things
- Also, proteins must be handled more carefully in labs, as they tend to change on contact with inappropriate materials

42 Protein Structure Prediction
- One of the biggest challenges of bioinformatics, and especially of biochemistry
- No algorithm yet exists that consistently predicts the structure of proteins

43 Structure Prediction Methods
- Comparative modelling:
  - The target protein's structure is compared with those of related proteins
  - Proteins with similar sequences are searched for known structures

44 Phylogenetics
- The taxonomic system reflects evolutionary relationships
- Phylogenetic trees depict evolutionary relationships as a picture/graph
- Rooted trees have a single common ancestor
- Unrooted trees just show the relationships
- Phylogenetic tree reconstruction algorithms are also an area of research

45 Applications

46 Medical Implications
- Pharmacogenomics
  - Not all drugs work on all patients; some good drugs cause death in some patients
  - So by doing a gene analysis before treatment, the harmful drugs can be avoided
  - Also, drugs that cause death in most patients can be used on the minority whose genes suit that drug (volunteers wanted!)
  - Customized treatment
- Gene therapy
  - Replace or supply the defective or missing gene
  - E.g. insulin, and Factor VIII for haemophilia
- Bioweapons (??)

47 Diagnosis of Disease
- Identifying the genes that cause a disease helps to detect it at an early stage, e.g. Huntington's disease:
  - Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment
  - Death in years
  - The gene responsible for the disease has been identified
  - It contains excessively repeated sections of CAG
- So once the analysis is done, the couple can be counselled
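Counting repeated CAG sections is a simple pattern-matching task. The sketch below finds the longest uninterrupted run of CAG in a DNA string; the function name and the test sequences are made up, and no clinical threshold is implied.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Number of CAG units in the longest uninterrupted CAG repeat."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("TTCAGCAGCAGGA"))  # 3
print(longest_cag_run("CAGTTCAGCAG"))    # 2
print(longest_cag_run("ACGT"))           # 0
```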

48 Drug Design
- Can take up to 15 years and $700 million
- One of the goals of bioinformatics is to reduce the time and cost involved
- The process:
  - Discovery (computational methods can improve this)
  - Testing

49 Discovery: Target Identification
- Identify the molecule on which the germ relies for its survival (the target)
- Then develop another molecule, i.e. a drug, that binds to the target
- So the germ will not be able to interact with the target
- Proteins are the most common targets

50 Discovery (continued)
- For example, HIV produces HIV protease, a protein that in turn cuts up other proteins
- HIV protease has an active site where it binds to other molecules
- So an HIV drug binds to that active site
- Easier said than done!

51 Discovery (continued)
- Lead compounds are the molecules that bind to the target protein's active site
- Traditionally, finding them has been a trial-and-error process
- Now this is moving into the realm of computers

52 Related Computer Technology

53 PERL
- Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings
- The default CGI language
- It started out as a scripting language but has become a fully fledged language
- It has everything now, even web service support

54 The place of XML & Web Services
- Various markup languages are being created (Gene Markup Language etc.) to represent sequence/gene data
- Web services: program-to-program interaction, making the web application-centric as opposed to human-centric
- So this has to be platform and language independent
- Protocols like SOAP help in this regard
- Bioinformatics uses various databases, platforms, languages etc.
- So web services help achieve platform independence and program interaction
- Since sequence databases come in various formats and on various platforms, SOAP also helps in this regard

55 The place of GRID
- GRID: the new kid on the block
- Using many computers to fulfil a single computational task
- Bioinformatics is an ideal application, as it has to deal with large amounts of data in alignments and searches
- E-science initiative in the UK
- Oracle 10g: the world's first GRID database

56 Databases and Mining
- Many of the sequence databases are publicly available
- As databases are involved, various data mining techniques are used to pull the data out
- As there is also a lot of literature (articles etc.) in this area, mining the literature rather than the sequence data has become a PhD topic for many

57 European Molecular Biology Network (EMBnet)
- A central system for sharing, training and centralizing up-to-date bio info
- Some of the EMBnet sites are:
  - SEQNET
  - UCL
  - wser/embnet/
  - EBI: European Bioinformatics Institute

58 References
- Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
- Arthur M. Lesk, Intro to Bioinformatics
- T.K. Attwood & D.J. Parry-Smith, Intro to Bioinformatics
- Dr Patrick Dixon, The Genetic Revolution
- Prof David Gilbert's site

59 Thank You!

