Presentation is loading. Please wait.

Presentation is loading. Please wait.

Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)

Similar presentations


Presentation on theme: "Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)"— Presentation transcript:

1 Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)

2 Final Project 2

3 Topic Sequence alignment Protein clustering Classification Other analysis techniques –association rule –frequent pattern –network 3

4 Must be 4 A web server

5 Sequence alignment First class –a novel sequence alignment algorithm –ClustalW http://www.ebi.ac.uk/Tools/clustalw2/index.html Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal-W Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680. Second class –an application using ClustalW –E1DS http://e1ds.ee.ncku.edu.tw/ Chien,T.Y., Chang,D.T.H., Chen,C.Y., Weng,Y.Z. and Hsu,C.M. (2008) E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res., 36, W291–W296. 5

6 Protein clustering First class –CD-HIT http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283. Second class –Protemot http://protemot.ee.ncku.edu.tw/ Chang,D.T.H., Weng,Y.Z., Lin,J.H., Hwang,M.J. and Oyang,Y.J. (2006) Protemot: prediction of protein binding sites with automatically extracted geometrical templates. Nucleic Acids Res., 34, W303–W309. 6

7 First class There might be some state-of-the-art packages –sequence alignment BLAST (1990), ClustalW, FASTA, HMMER (1998), HHpred/HHsearch (2005), PSI-BLAST (1997), T-coffee, SSEARCH and so on –overtaking them is very difficult, but there still some room, especially for special purpose alignment Abascal,F., Zardoya,R., and Telford,M.J. (2010) TranslatorX: multiple alignment of nucleotide sequences guided by amino acid translations. Nucleic Acids Res. Advance Access published on April 30, 2010. Some possible direction –add some constrains (special purpose), speed the algorithm –combine multiple tools, ex: domain-conserved alignment –instead of implementing from scratch manipulate the input to existing packages (preprocessing) start from the output of existing packages (postprocessing) 7

8 Second class The programming part is less challenging, but is still heavy and probably more niggling You need a good/interesting theme –predicting DNA-binding protein –predicting protein-protein interaction –mapping any ID to a specific database –connecting predicted TFBS to DNA/RNA sequences –…–… Implement a specific algorithm and web-lize it might be okay –http://nar.oxfordjournals.org/papbyrecent.dtl has many update-to-date web servershttp://nar.oxfordjournals.org/papbyrecent.dtl 8

9 9 In either class, you need to discuss with me http://www.flickr.com/photos/meteorry/3452536272/

10 10 Final project schedule http://www.sxc.hu/photo/544232

11 Discuss with me 11 Before 5/12 (a soft deadline)

12 12 What is machine learning?

13 13 A very trivial machine learning tool K-Nearest-Neighbors (KNN) The predicted class of the query sample depends on the voting among its k nearest neighbors O X X O O X O ? X X O O X X O

14 14 O X X O O X O O X X O k = 3

15 15 O X X O O X O X X X O O X X O k = 5

16 16 Although KNN is very trivial, it can Example: in vitro fertilization Given: embryos described by 60 features Problem: selection of embryos that will survive Data: historical records of embryos and outcome Given a set of known instances, predict outcome for newly coming instances So, KNN learnt something related to the definition of a good embryo

17 17 Although KNN is very trivial, it can Example: in vitro fertilization Given: embryos described by 60 features Problem: selection of embryos that will survive Data: historical records of embryos and outcome Given a set of known instances, predict outcome for newly coming instances So, KNN learnt something related to the definition of a good embryo?

18 18 Can machines really learn? Notice that here we call KNN a machine Definitions of learning from dictionary: To get knowledge of by study, experience, or being taught To become aware by information or from observation To commit to memory To be informed of, ascertain; to receive instruction Operational definition: Things learn when they change their behavior in a way that makes them perform better in the future Difficult to measure Trivial for computers Does a slipper learn?

19 19 Shortly speaking, machine learning is Machine E.g. KNN Training data A set of known instances Testing data A query instance Outcome Class of the query instance Knowledge/ Information

20 20 Furthermore, learning is Machine E.g. KNN Training data A set of known instances Testing data A query instance Outcome Class of the query instance Knowledge/ Information When training data increases It delivers better (e.g. higher accuracy) outcome

21 Classifier 21 Intwo sets of samples, tr and te Outaccuracy of using tr to predict te Requirement - implement KNN with a parameter k - invoke RVKDE - complexity/teamwork report - using Perl would be the best Bonus - invoke LIBSVM - a script to decide the best k in a range

22 Deadline 22 2010/5/11 23:59 Zip your code, step-by-step README, complexity analyses and anything worthy extra credit. Email to darby@ee.ncku.edu.tw. darby@ee.ncku.edu.tw

23 Materials for the exercise 9 Input sample (Iris) –http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scalehttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale –1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667 1 1:-0.666667 2:-0.166667 3:-0.864407 4:-0.916667 1 1:-0.777778 3:-0.898305 4:-0.916667. Test your program on satimage –http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.trhttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.tr –http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.thttp://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t RVKDE –http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkdehttp://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde –$ wget http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz $ tar zxvf rvkde-current-linux32.tgz $ rvkde-0.2.3-final/rvkde --classify --predict -v tr -V te -a 1 -b 1 --ks 10 --kt 10http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz –rvkde has a built-in function of parameter tuning (see --cv) LIBSVM –http://www.csie.ntu.edu.tw/~cjlin/libsvm/http://www.csie.ntu.edu.tw/~cjlin/libsvm/ –see the manual –LIBSVM provides a script of parameter tuning (see grid.py) 23

24 24 Machine Learning and Bioinformatics

25 25 Why these two fields? From biologists view There are abundant data to analyze From computer guys view The data are suitable (large and well-studied) Biomedical problems are important There are various computer science techniques for various Bioinformatics applications

26 26 Circuit simulation http://www.sophion.dk/sophion/Open-close2.jpg & http://alford.bios.uic.edu/Images/586%20images/circuit%20model Computer graphics http://healthbolt.net/wp-content/uploads/2006/09/cell-animation.jpg & http://www.osmosis.com.au/animate/images/bloodcells.jpg Information retrieval http://www.dashboardinsight.com/CMS/e01d472c-862e- 4e13-8b23-591f8938889a/text_mining340x220.png http://upload.wikimedia.org/wikipedia/commons/thumb/ 6/68/Social-network.svg/430px-Social-network.svg.png Network analysis

27 27 Applications, concepts and our approaches

28 28 Our online services Secondary structure prediction Catalytic site prediction Protein-ligand docking

29 29 Secondary structure prediction In biochemistry and structural biology, secondary structure (SSE) is the general three-dimensional form of local segments of biopolymers such as proteins and nucleic acids (DNA/RNA) http://upload.wikimedia.org/wikipedia/commons/thumb/ 6/60/Myoglobin.png/542px-Myoglobin.png

30 30 Prote2S http://prote2s.csie.ntu.edu.tw/

31 31 Prote2S http://prote2s.csie.ntu.edu.tw/ >SEQUENCE_1 MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAE LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEH IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLT MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL

32 32 Concept of Prote2S Machine E.g. KNN Training data Residues with known SSE Testing data A residue as a vector Outcome SSE of the query residue Knowledge/ Information MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG…

33 33 Feature encoding Most classifiers deal with a vector space Feature encoding means to generate the vector representation of an real world instance An instance is represented as several important attributes, or say, features >1A0OB RQLALEAKGETPSAVTRLSVVA KSEPQDEQSRSQSPRRIILS… PSSM PSI-BLAST Feature vector

34 34 Disorder region Conformational switch http://www.oup.com/uk/orc/bin/97801 99265114/resources/anim/figs/f2-9.gif Solvent accessibility http://stke.sciencemag.org/content/jbc/vol280/ issue7/images/large/zbc0060586670006.jpeg Protein-protein interaction Similar applications to Prote2S

35 35 A family tree Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

36 36 Family tree represented as a table The sister-of relation Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

37 37 Catalytic site prediction The catalytic site is usually a small pocket at the surface of the enzyme that contains residues responsible for the substrate specificity and catalytic residues which often act as proton donors or acceptors or are responsible for binding a cofactor http://fig.cox.miami.edu/~cmallery/ 255/255enz/ES_complex.jpg

38 38 E1DS http://e1ds.ee.ncku.edu.tw/

39 39 E1DS http://e1ds.ee.ncku.edu.tw/ >Paste your sequence in FASTA format to replace the sample FASTA here MIFSVDAVRADFPVLSREVNGLPLAYLDSAASAQKPSQVIDAEAEFYRHGYAAVHAGAHTLSAQATEKMENVRKRASLFI NARSAEELVFVRGTTEGINLVANSWGNSNVRAGDNIIISQMEHHANIVPWQMLCARVGAELRVIPLNPDGTLQLETLPTL FDAATRLLAITHVSNVLGTENPLAEMITLAHQHGAKVLVDGAQAVMHHPVDVQALDCDFYVFSGHKLYGPTGIGILYVKE ALLQEMPPWEGGGSMIATVSLSEGTTWTKAPWRFEAGTPNTGGIIGLGAALEYVSALGLNNIAEYEQNLMHYALSQLESV PDLTLYGPQARLGVIAFNLGAHHAYDVGSFLDNYGIAVRTGHHCAMPLMAYYNVPAMCRASLAMYNTHEEVDRLVTGLQR IHRLLG

40 40 Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg

41 41 Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg

42 42 Concept of E1DS http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg http://www.biomedcentral.com/conten t/figures/1471-2105-8-S5-S8-3-l.jpg

43 43 Allowing large flexible gaps >1RPX:A SRVDKFSKSDIIVSPSILSANFSKLGEQVKAIEQAGCDWIHVDVMDGRFVPNITIGPLVVDSL RPITDLPLDVHLMIVEPDQRVPDFIKAGADIVSVHCEQSSTIHLHRTINQIKSLGAKAGVVLN PGTPLTAIEYVLDAVDLVLIMSVNPGFGGQSFIESQVKKISDLRKICAERGLNPWIEVDGGVG PKNAYKVIEAGANALVAGSAVFGAPDYAEAIKGIKTSKRPE PROSITE pattern [LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Q- x(0,1)-[FMK]-x(6)-[NKR]-[LIVMC] Our pattern H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Q- x(22,32)-D-G-G

44 44 Applications using pattern mining http://www.cs.uiuc.edu/homes/sinhas/img/DAILYILLINI.jpg & http://upload.wikimedia.org/ wikipedia/en/thumb/8/8d/ChIP-on-chip_wet-lab.png/400px-ChIP-on-chip_wet-lab.png Transcription factor binding site http://stke.sciencemag.org/content/jbc/vol280/ issue7/images/large/zbc0060586670006.jpeg Protein-protein interaction

45 45 Protein-ligand docking The goal of protein-ligand docking is to predict the position and orientation of a ligand (a small molecule) when it is bound to a protein receptor http://www-ucc.ch.cam.ac.uk/research/ images/docking-small.jpg

46 46 MEDock http://medock.csie.ntu.edu.tw/

47 47 MEDock http://medock.csie.ntu.edu.tw/ http://gemdock.life.nctu.edu.tw/dock/images/1jff-sum.gif

48 48 Concept of MEDock http://www.mathworks.com/cms images/op_main_wl_3250.jpg

49 49 Genetic algorithm Crossover Mutation Nature selection http://www.mannosidosis.org/images/inheritance.jpg http://wellington.pm.org/archive/200505/oo-perl/images/mutation.jpg http://myconstructionphotos.smugm ug.com/photos/58510454-M-1.jpg

50 50 Protein folding http://cnx.org/content/m11461/latest/protein_folding.jpg Applications using optimization Gene network http://www.ehponline.org/members/2007/10358/fig2.jpg Microarray analysis http://research.microsoft.com/users/manuelrg/microarray.gif

51 51 Quick summary What is machine learning A very simple classifier, KNN Three real Bioinformatics applications Secondary structure prediction Catalytic site prediction Protein-ligand docking The dependent techniques and related problems

52 52 Conclusion Biologists identify new problems Informatists enhance algorithms Bioinformatists transform biomedical problems into computer science problems

53 53 Appendix

54 54 RAME (Rank-based Adaptive Mutation Evolutionary) algorithm Current generationRank these samples Generate reference Gaussian distributions according to sample rank Generate the next generation according to these reference Gaussian distributions

55 55 Diversity of sampling locations RAME algorithmLGA in AutoDock

56 56 Suppose that the distances from ? to its nearest O and X are equal O O O O O O O O O O O O O O X X X X X X X X X XX X X X ? d d

57 57 It is more likely to be a X because of the density O O O O O O O O O O O O O O X X X X X X X X X XX X X X X

58 58 It looks like an outlier when being predicted as a O O O O O O O O O O O O O O O X X X X X X X X X XX X X X O


Download ppt "Bioinformatics Programming 1 EE, NCKU Tien-Hao Chang (Darby Chang)"

Similar presentations


Ads by Google