Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)


1 Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)

2 Final Project

3 Topic
Sequence alignment
Protein clustering
Classification
Other analysis techniques
–association rule
–frequent pattern
–network

4 Must be a web server

5 Sequence alignment
First class
–a novel sequence alignment algorithm
–ClustalW
Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Second class
–an application using ClustalW
–E1DS
Chien,T.Y., Chang,D.T.H., Chen,C.Y., Weng,Y.Z. and Hsu,C.M. (2008) E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res., 36, W291–W296.

6 Protein clustering
First class
–CD-HIT
Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283.
Second class
–Protemot
Chang,D.T.H., Weng,Y.Z., Lin,J.H., Hwang,M.J. and Oyang,Y.J. (2006) Protemot: prediction of protein binding sites with automatically extracted geometrical templates. Nucleic Acids Res., 34, W303–W309.

7 First class
There might be some state-of-the-art packages
–sequence alignment: BLAST (1990), ClustalW, FASTA, HMMER (1998), HHpred/HHsearch (2005), PSI-BLAST (1997), T-Coffee, SSEARCH and so on
–overtaking them is very difficult, but there is still some room, especially for special-purpose alignment
Abascal,F., Zardoya,R. and Telford,M.J. (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. Advance Access published on April 30,
Some possible directions
–add some constraints (special purpose), speed up the algorithm
–combine multiple tools, e.g. domain-conserved alignment
–instead of implementing from scratch, manipulate the input to existing packages (preprocessing) or start from the output of existing packages (postprocessing)

8 Second class
The programming part is less challenging, but is still heavy and probably more niggling
You need a good/interesting theme
–predicting DNA-binding proteins
–predicting protein-protein interactions
–mapping any ID to a specific database
–connecting predicted TFBS to DNA/RNA sequences
–…
Implementing a specific algorithm and turning it into a web server might be okay
–http://nar.oxfordjournals.org/papbyrecent.dtl has many up-to-date web servers

9 In either class, you need to discuss with me

10 Final project schedule

11 Discuss with me
Before 5/12 (a soft deadline)

12 What is machine learning?

13 A very trivial machine learning tool
K-Nearest-Neighbors (KNN)
The predicted class of the query sample depends on the voting among its k nearest neighbors
[Figure: O and X samples scattered around a query sample "?"]

14 [Figure: the same samples with k = 3; the query is predicted as O]

15 [Figure: the same samples with k = 5; the query is predicted as X]
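
The voting rule on slides 13 to 15 can be sketched in a few lines of Python. This is only a minimal illustration: the function name `knn_predict`, the Euclidean distance and the toy coordinates are assumptions, not taken from the slides. The toy data are arranged so that, as in the figures, the prediction flips from O to X when k grows from 3 to 5.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Return the majority label among the k training samples nearest to query.

    train is a list of (point, label) pairs; points are equal-length tuples.
    """
    # Sort training samples by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda s: math.dist(s[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    # most_common(1) returns the label with the highest vote count.
    return votes.most_common(1)[0][0]

# Hypothetical 2-D toy data: the two closest samples are O,
# but the next three are X.
train = [
    ((1.0, 0.0), "O"), ((0.0, 1.2), "O"), ((-1.5, 0.0), "X"),
    ((0.0, -2.0), "X"), ((2.2, 0.0), "X"), ((5.0, 5.0), "O"),
]
query = (0.0, 0.0)

print(knn_predict(train, query, 3))  # two O votes against one X
print(knn_predict(train, query, 5))  # two O votes against three X
```

An odd k avoids ties when there are only two classes, which is why the slides use k = 3 and k = 5.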

16 Although KNN is very trivial, it can
Example: in vitro fertilization
Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome
Given a set of known instances, predict the outcome for new instances
So, KNN learnt something related to the definition of a good embryo

17 Although KNN is very trivial, it can
(same example as slide 16)
So, has KNN learnt something related to the definition of a good embryo?

18 Can machines really learn?
Notice that here we call KNN a machine
Definitions of learning from the dictionary:
–to get knowledge of by study, experience, or being taught
–to become aware by information or from observation
–to commit to memory
–to be informed of, ascertain; to receive instruction
Operational definition: things learn when they change their behavior in a way that makes them perform better in the future
Difficult to measure
Trivial for computers
Does a slipper learn?

19 In short, machine learning is
Machine: e.g. KNN
Training data: a set of known instances
Testing data: a query instance
Outcome: class of the query instance
Knowledge/Information

20 Furthermore, learning is
Machine: e.g. KNN
Training data: a set of known instances
Testing data: a query instance
Outcome: class of the query instance
Knowledge/Information
When the training data increase, the machine delivers a better outcome (e.g. higher accuracy)

21 Classifier
In: two sets of samples, tr and te
Out: accuracy of using tr to predict te
Requirement
- implement KNN with a parameter k
- invoke RVKDE
- complexity/teamwork report
- using Perl would be the best
Bonus
- invoke LIBSVM
- a script to decide the best k in a range
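
As a rough sketch of the In/Out contract on this slide, the following Python code computes the accuracy of using tr to predict te and scans a range of k. Python is used here only for brevity (the slide prefers Perl), and the names `accuracy` and `best_k` as well as the toy data are assumptions made for illustration.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority label among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(tr, te, k):
    """Fraction of samples in te whose label is predicted correctly from tr."""
    hits = sum(1 for point, label in te if knn_predict(tr, point, k) == label)
    return hits / len(te)

def best_k(tr, te, k_values):
    """Return (k, accuracy) with the highest accuracy; ties keep the smallest k."""
    scored = [(accuracy(tr, te, k), k) for k in k_values]
    best_acc, k = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return k, best_acc

# Hypothetical toy data with two well-separated classes.
tr = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
      ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
te = [((0.5, 0.5), "A"), ((5.5, 5.5), "B")]

print(accuracy(tr, te, 3))
print(best_k(tr, te, [1, 3, 5]))
```

For the complexity report, note that this sketch sorts all of tr for every query, so one evaluation costs on the order of |te| · |tr| log |tr| distance comparisons. Choosing k on te itself tends to give an optimistic accuracy; a held-out part of tr or cross-validation is the usual alternative.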

22 Deadline
/5/11 23:59
Zip your code, step-by-step README, complexity analyses and anything worthy of extra credit. to

23 Materials for the exercise 9
Input sample (Iris)
–http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale
–1 1: :0.25 3: : : : : : : : :
Test your program on satimage
–http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.tr
–http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t
RVKDE
–http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde
–$ wget http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz
 $ tar zxvf rvkde-current-linux32.tgz
 $ rvkde final/rvkde --classify --predict -v tr -V te -a 1 -b 1 --ks 10 --kt 10
–rvkde has a built-in function of parameter tuning (see --cv)
LIBSVM
–http://www.csie.ntu.edu.tw/~cjlin/libsvm/
–see the manual
–LIBSVM provides a script of parameter tuning (see grid.py)
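
The Iris and satimage files above use the LIBSVM sparse text format, where each line is a label followed by `index:value` pairs with 1-based indices and omitted indices standing for zero. A minimal reader might look like the sketch below; the function names and the example line are hypothetical, and the number of features must be supplied by the caller (iris.scale has four).

```python
def parse_libsvm_line(line, n_features):
    """Parse '<label> <index>:<value> ...' into (dense feature tuple, label)."""
    fields = line.split()
    label = fields[0]
    features = [0.0] * n_features                # omitted indices stay zero
    for item in fields[1:]:
        index, value = item.split(":")
        features[int(index) - 1] = float(value)  # indices are 1-based
    return tuple(features), label

def load_libsvm(path, n_features):
    """Read a whole file, skipping blank lines."""
    with open(path) as handle:
        return [parse_libsvm_line(line, n_features)
                for line in handle if line.strip()]

# Hypothetical example line, not copied from iris.scale.
print(parse_libsvm_line("2 1:0.5 3:-0.25", n_features=4))
```

The returned `(point, label)` pairs are in the shape a KNN implementation would typically consume.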

24 Machine Learning and Bioinformatics

25 Why these two fields?
From the biologists' view
There are abundant data to analyze
From the computer guys' view
The data are suitable (large and well-studied)
Biomedical problems are important
There are various computer science techniques for various bioinformatics applications

26 Circuit simulation & Computer graphics & Information retrieval
Network analysis

27 Applications, concepts and our approaches

28 Our online services
Secondary structure prediction
Catalytic site prediction
Protein-ligand docking

29 Secondary structure prediction
In biochemistry and structural biology, secondary structure (SSE) is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA)
[Figure: myoglobin structure]

30 Prote2S

31 Prote2S
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAE
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEH
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLT
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL

32 Concept of Prote2S
Machine: e.g. KNN
Training data: residues with known SSE
Testing data: a residue as a vector
Outcome: SSE of the query residue
Knowledge/Information
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG…

33 Feature encoding
Most classifiers deal with a vector space
Feature encoding means generating the vector representation of a real-world instance
An instance is represented by several important attributes, also called features
>1A0OB RQLALEAKGETPSAVTRLSVVA KSEPQDEQSRSQSPRRIILS…
[Diagram labels: PSSM, PSI-BLAST, feature vector]
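
To make "a residue as a vector" concrete, here is a small Python sketch that encodes each residue together with its neighbors in a sliding window. It uses a plain one-hot encoding as a stand-in; the slide's diagram mentions PSSM and PSI-BLAST, which this sketch does not compute. The window size, the padding symbol and all function names are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """20-dimensional indicator vector; padding or unknown residues give all zeros."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]

def encode_windows(sequence, half_width=1):
    """Return one feature vector per residue, built from a window centered on it."""
    pad = "-" * half_width
    padded = pad + sequence + pad
    vectors = []
    for i in range(len(sequence)):
        window = padded[i:i + 2 * half_width + 1]
        vector = []
        for residue in window:
            vector.extend(one_hot(residue))
        vectors.append(vector)
    return vectors

vectors = encode_windows("MTE", half_width=1)
print(len(vectors), len(vectors[0]))  # one vector per residue, 3 x 20 values each
```

Each vector can then be paired with the known SSE label of its central residue to form the training data for a classifier such as KNN.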

34 Similar applications to Prote2S
Disorder region
Conformational switch
Solvent accessibility
Protein-protein interaction

35 A family tree
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

36 Family tree represented as a table
The sister-of relation
Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

37 Catalytic site prediction
The catalytic site is usually a small pocket at the surface of the enzyme that contains residues responsible for the substrate specificity and catalytic residues which often act as proton donors or acceptors or are responsible for binding a cofactor
[Figure: enzyme-substrate complex]

38 E1DS

39 E1DS
>Paste your sequence in FASTA format to replace the sample FASTA here
MIFSVDAVRADFPVLSREVNGLPLAYLDSAASAQKPSQVIDAEAEFYRHGYAAVHAGAHTLSAQATEKMENVRKRASLFI
NARSAEELVFVRGTTEGINLVANSWGNSNVRAGDNIIISQMEHHANIVPWQMLCARVGAELRVIPLNPDGTLQLETLPTL
FDAATRLLAITHVSNVLGTENPLAEMITLAHQHGAKVLVDGAQAVMHHPVDVQALDCDFYVFSGHKLYGPTGIGILYVKE
ALLQEMPPWEGGGSMIATVSLSEGTTWTKAPWRFEAGTPNTGGIIGLGAALEYVSALGLNNIAEYEQNLMHYALSQLESV
PDLTLYGPQARLGVIAFNLGAHHAYDVGSFLDNYGIAVRTGHHCAMPLMAYYNVPAMCRASLAMYNTHEEVDRLVTGLQR
IHRLLG

40 Concept of E1DS

41 Concept of E1DS

42 Concept of E1DS

43 Allowing large flexible gaps
>1RPX:A
SRVDKFSKSDIIVSPSILSANFSKLGEQVKAIEQAGCDWIHVDVMDGRFVPNITIGPLVVDSL
RPITDLPLDVHLMIVEPDQRVPDFIKAGADIVSVHCEQSSTIHLHRTINQIKSLGAKAGVVLN
PGTPLTAIEYVLDAVDLVLIMSVNPGFGGQSFIESQVKKISDLRKICAERGLNPWIEVDGGVG
PKNAYKVIEAGANALVAGSAVFGAPDYAEAIKGIKTSKRPE
PROSITE pattern
[LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Q-x(0,1)-[FMK]-x(6)-[NKR]-[LIVMC]
Our pattern
H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G
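
Patterns written in this notation map naturally onto regular expressions: `x` is any residue, `[..]` is a set of allowed residues, and `(n)` or `(n,m)` is a repeat count, so a flexible gap such as `x(94,144)` becomes `.{94,144}`. The converter below is a sketch that handles only the syntax appearing on this slide; the function name `prosite_to_regex` is an assumption and this is not how E1DS itself is implemented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supports single residues, x, [..] sets, and (n) or (n,m) repeats only.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        match = re.fullmatch(
            r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, low, high = match.groups()
        regex = "." if symbol == "x" else symbol
        if low is not None:
            # (n) becomes {n}; (n,m) becomes {n,m}
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

ours = "H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q-x(22,32)-D-G-G"
print(prosite_to_regex(ours))

# A short hypothetical pattern and sequence to show matching.
short = prosite_to_regex("H-x-D-x(2,3)-[ST]")
print(re.search(short, "AAHQDKKSA") is not None)
```

The resulting expression can be passed to `re.search` together with a protein sequence such as the 1RPX:A entry above (with line breaks removed).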

44 Applications using pattern mining
Transcription factor binding site
Protein-protein interaction

45 Protein-ligand docking
The goal of protein-ligand docking is to predict the position and orientation of a ligand (a small molecule) when it is bound to a protein receptor

46 MEDock

47 MEDock

48 Concept of MEDock

49 Genetic algorithm
Crossover
Mutation
Natural selection
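
The three operators named on this slide can be put together in a very small generic genetic algorithm. The sketch below minimizes an arbitrary function over real-valued vectors; the parameter values, the one-point crossover, the Gaussian mutation and the elitism step are all assumptions chosen for illustration, and it is not the search procedure used by MEDock.

```python
import random

def genetic_minimize(fitness, dim, pop_size=30, generations=50, seed=0):
    """Minimize fitness over real vectors using selection, crossover and mutation."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)             # selection: lowest fitness first
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]    # only the better half reproduces
        children = [population[0][:]]            # elitism: keep the best unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, dim) if dim > 1 else 0
            child = mother[:cut] + father[cut:]  # one-point crossover
            i = rng.randrange(dim)
            child[i] += rng.gauss(0.0, 0.3)      # mutation: perturb one gene
            children.append(child)
        population = children
    population.sort(key=fitness)
    history.append(fitness(population[0]))
    return population[0], history

best, history = genetic_minimize(lambda v: sum(x * x for x in v), dim=3)
print(history[0], history[-1])  # best fitness before and after the run
```

In a docking setting the vector would encode something like the ligand's position and orientation, and the fitness would be an energy score; both are outside the scope of this sketch.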

50 Applications using optimization
Protein folding
Gene network
Microarray analysis

51 Quick summary
What is machine learning
A very simple classifier, KNN
Three real bioinformatics applications
–secondary structure prediction
–catalytic site prediction
–protein-ligand docking
The dependent techniques and related problems

52 Conclusion
Biologists identify new problems
Informaticians enhance algorithms
Bioinformaticians transform biomedical problems into computer science problems

53 Appendix

54 RAME (Rank-based Adaptive Mutation Evolutionary) algorithm
Current generation
Rank these samples
Generate reference Gaussian distributions according to sample rank
Generate the next generation according to these reference Gaussian distributions
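
The four steps on this slide can be illustrated with a loose Python sketch. The slide does not say how the Gaussian distributions depend on rank, so two choices here are pure assumptions: better-ranked samples get a smaller standard deviation, and parents compete with their offspring for a place in the next generation. The function name `rame_like_step` is likewise hypothetical, and this should not be read as the published RAME algorithm.

```python
import random

def rame_like_step(population, fitness, rng, sigma_min=0.05, sigma_max=1.0):
    """One generation: rank the samples, then sample around each of them
    with a Gaussian whose width depends on the rank."""
    ranked = sorted(population, key=fitness)     # rank 0 is the best sample
    n = len(ranked)
    offspring = []
    for rank, sample in enumerate(ranked):
        # Assumption: better-ranked samples get narrower reference Gaussians.
        sigma = sigma_min + (sigma_max - sigma_min) * rank / max(n - 1, 1)
        offspring.append([rng.gauss(x, sigma) for x in sample])
    # Assumption: parents compete with offspring, so the best never gets worse.
    return sorted(ranked + offspring, key=fitness)[:n]

def sphere(v):
    """Toy objective to minimize."""
    return sum(x * x for x in v)

rng = random.Random(1)
population = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(10)]
before = min(map(sphere, population))
for _ in range(20):
    population = rame_like_step(population, sphere, rng)
after = min(map(sphere, population))
print(before, after)
```

Under these assumptions, good samples are refined locally while poor samples are scattered more widely, which is one way to keep the sampling locations diverse.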

55 Diversity of sampling locations
[Figure panels: RAME algorithm; LGA in AutoDock]

56 Suppose that the distances from ? to its nearest O and X are equal
[Figure: O samples and X samples around a query "?", which lies at the same distance d from its nearest O and its nearest X]

57 It is more likely to be an X because of the density
[Figure: the same samples with the query drawn as X]

58 It looks like an outlier when being predicted as an O
[Figure: the same samples with the query drawn as O]
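
The density argument on slides 56 to 58 can be checked numerically. In the sketch below the query is equally far from its nearest O and its nearest X, so a 1-nearest-neighbor rule is tied, while a simple Gaussian kernel density score favors X because more X samples are close by. This is a plain kernel density illustration with an assumed bandwidth and hypothetical coordinates; it is not the RVKDE implementation.

```python
import math

def kernel_score(point, samples, bandwidth=1.0):
    """Sum of Gaussian kernels centered at the samples, evaluated at point."""
    return sum(
        math.exp(-math.dist(point, s) ** 2 / (2 * bandwidth ** 2))
        for s in samples
    )

def density_classify(point, samples_by_class, bandwidth=1.0):
    """Return the class whose samples contribute the largest kernel score."""
    return max(
        samples_by_class,
        key=lambda c: kernel_score(point, samples_by_class[c], bandwidth),
    )

# Hypothetical layout: O samples are spread out, X samples are packed together.
samples_by_class = {
    "O": [(-1.0, 0.0), (-3.0, 0.0), (-5.0, 0.0), (-7.0, 0.0)],
    "X": [(1.0, 0.0), (1.2, 0.3), (1.2, -0.3), (1.5, 0.0)],
}
query = (0.0, 0.0)

nearest = {c: min(math.dist(query, s) for s in pts)
           for c, pts in samples_by_class.items()}
print(nearest)                                   # same distance d for O and X
print(density_classify(query, samples_by_class))
```

Because both classes have the same number of samples here, comparing summed kernels is equivalent to comparing the estimated class densities at the query.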

