HCS806 “Methods in Horticulture and Crop Science” Introduction to methods in Bioinformatics for plant science. David Francis (Coordinator) Ian Holford (Molecular and Cellular Imaging Center) Xiaodong Bai (Entomology)
Survey: HCS806SurveyPre.doc Goals: 1) Establish knowledge base and comfort level of students and staff. 2) Assess available equipment and internet capabilities.
The course is being taught under the “methods” number because it is intended to provide hands-on practical training. At the end of the class, participants (graduate students and staff) are expected to have gained:
  Familiarity with sequence databases and how data are stored.
  Skills needed to retrieve, organize, and store sequence data.
  Working knowledge of LINUX commands for manipulating sequence files.
  Working knowledge of stand-alone BLAST and running stand-alone BLAST in the UNIX environment.
  Working knowledge of BioPerl and its use to parse BLAST outputs.
Estimated Time Line (Week 1)
  Monday 7/13: Introduction to BioInformatics (David Francis); Distributed resources on the web (DF); Creating and downloading datasets (DF)
  Tuesday 7/14: Setting up your computer for the class (DF); Installing Unix emulation (Cygwin) for Windows (DF); Unix/Linux commands (Ian Holford)
  Wednesday 7/15: Installing stand-alone BLAST (IH); Formatting data for stand-alone BLAST (IH)
  Thursday 7/16: Stand-alone BLAST and interpreting BLAST outputs
Estimated Time Line (Week 2)
  Monday 7/20: Introduction to Perl lecture (Bai); BioPerl installation demonstration (Bai)
  Tuesday 7/21: BioPerl modules (Bai)
BioInformatics: Def. From Wikipedia “Application of information technology to the field of molecular biology” “Entails the creation … [and manipulation] of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data” BioInformatics data are most commonly in the form of DNA or Protein Sequence. Computer scientists refer to this type of data as a “string”. BioInformatics aims to facilitate: sequence analysis, genome annotation, evolutionary biology, biodiversity, analysis of gene expression, analysis of regulation, prediction of structure, etc…
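To make the “string” idea concrete, here is a minimal sketch in Python (the class itself uses Perl; Python is used here only for illustration). The example sequence and the function name reverse_complement are made up for this sketch and are not part of the course material.

```python
# Illustrative sketch: a DNA sequence handled as an ordinary string.
# Translation table mapping each base to its complement (assumes only A, C, G, T).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    # Complement every base, then reverse the resulting string.
    return seq.translate(COMPLEMENT)[::-1]


dna = "ATGGCCATTGTAATG"        # made-up example sequence
print(len(dna))                # length of the sequence
print(dna.count("G"))          # how many G bases it contains
print(dna.find("GCC"))         # 0-based position of the first "GCC"
print(reverse_complement(dna))
```

Ordinary string operations (length, counting, searching, slicing) are often all that is needed for simple sequence questions.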
Algorithm: a procedure or formula for solving a problem. An algorithm describes an explicit series of steps that can be used to solve a problem. In this class we want to encourage the algorithm as a way of thinking: Formulating the biological questions is up to us. We then need to design the algorithms to address the question. If these procedures are repetitive, they lend themselves to automation. http://www.cs.sunysb.edu/~alorith/
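As a small illustration of “explicit steps, then automate”, the sketch below computes GC content, a question chosen here only as an example. The steps are: count the G and C bases, divide by the sequence length, and repeat for every sequence. The sequence names and data are invented for the sketch.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    # Step 1: count G and C bases.
    gc = sum(1 for base in seq if base in "GC")
    # Step 2: divide by the total length.
    return gc / len(seq)


# Step 3: the procedure is repetitive, so apply it to every sequence in a loop.
sequences = {"seq1": "ATGC", "seq2": "GGCC", "seq3": "ATAT"}  # invented data
results = {name: gc_content(seq) for name, seq in sequences.items()}

for name, value in results.items():
    print(name, value)
```

Once the steps are written down this way, running them on ten sequences or ten thousand takes the same effort.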
Perhaps the most common tool used for sequence analysis is the Basic Local Alignment Search Tool (BLAST). BLAST finds regions of local similarity between sequences. The algorithm implemented by BLAST places an emphasis on speed rather than sensitivity.
For more information on what BLAST does see: http://en.wikipedia.org/wiki/BLAST
For more information on how to use BLAST see: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html
Know:
  The BLAST score reflects how many words overlap between the query and the matching sequence.
  Significance scores are based on a score distribution, base (or nucleotide) frequency, and database size.
  Alignments are used to look for similarity (form implies function).
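The “word” idea can be sketched as follows. This is only a toy version of the first (seeding) step, written for illustration: it finds exact shared words of length w between a query and a subject. It is not the BLAST implementation, which goes on to extend and score these hits and to estimate their significance. The function names and the word size of 3 are assumptions made for this sketch.

```python
from collections import defaultdict


def index_words(seq: str, w: int) -> dict:
    """Map every word of length w in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def word_hits(query: str, subject: str, w: int = 3) -> list:
    """Return (query_pos, subject_pos) pairs where the same word of length w occurs."""
    index = index_words(subject, w)
    hits = []
    for q in range(len(query) - w + 1):
        word = query[q:q + w]
        # .get avoids adding empty entries to the index for unseen words.
        for s in index.get(word, []):
            hits.append((q, s))
    return hits


print(word_hits("ACGT", "TTACGTT", w=3))
```

Indexing the words once and looking them up is much faster than comparing every position against every other position, which is the kind of shortcut behind the emphasis on speed.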
Where do I go for data?
General
  National Center for Biotechnology Information (NCBI): http://www.ncbi.nlm.nih.gov/
  UniProt (SWISS-PROT): http://www.uniprot.org/
  European Molecular Biology Laboratory (EMBL) nucleotide sequence database: http://www.ebi.ac.uk/embl/
Crop/family specific databases
  Solanaceae Genomics Network (SGN): http://sgn.cornell.edu/
  The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
  Gene indexes (formerly TIGR): http://compbio.dfci.harvard.edu/tgi/
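Sequences can also be retrieved from NCBI by a script rather than through the web pages. The sketch below is one possible approach, assuming NCBI's E-utilities efetch service; it is not something the slides themselves describe. The accession "ABC123" is a placeholder, not a real record, and the helper names are invented for this sketch.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service (assumption of this sketch).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL requesting one record as FASTA-formatted text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession: str, db: str = "nuccore") -> str:
    """Download the record as text (needs an internet connection)."""
    with urlopen(efetch_fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; nothing is downloaded until fetch_fasta is called.
print(efetch_fasta_url("ABC123"))  # "ABC123" is a placeholder accession
```

To download a real record, replace the placeholder with an actual accession number and call fetch_fasta; NCBI asks that scripts keep the number of requests modest.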
FASTA file format: FASTA is the standard format for sequence data. Each record begins with a “>” followed by a name/description of the sequence. Everything after the first line break is expected to be a string of nucleotide or protein sequence.
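The layout described above can be read with a few lines of code. The following is a minimal sketch in Python, assuming well-formed input; the function name parse_fasta and the two example records are invented for illustration.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence may be wrapped over several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 made-up example record
ATGGCC
ATTGTA
>seq2 another made-up record
GGGCCC
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Note that the sequence for one record is often wrapped across several lines, so the pieces have to be joined before the sequence is used.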
Descriptions of sequence databases:
  Nucleotide – Contains high-quality annotated sequences.
  EST – “Expressed Sequence Tag”. Derived from cDNA (mRNA) and therefore represents transcribed (expressed) sequences. Generally derived from “single pass” Sanger sequencing.
  GSS – “Genome Survey Sequences”. Similar to the EST archive, but contains genomic sequence, for example sequenced PCR products.
Other databases: The SWISS-PROT database contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases. On May 26, 2009, the SWISS-PROT database was merged into the UniProt database. http://www.uniprot.org/ European Molecular Biology Laboratory (EMBL) nucleotide sequence database: http://www.ebi.ac.uk/embl/
Other databases: Crop/family specific databases
  e.g. Solanaceae Genomics Network (SGN): http://sgn.cornell.edu/
  e.g. The Arabidopsis Information Resource (TAIR): http://www.arabidopsis.org/
  Gene indexes (formerly TIGR): http://compbio.dfci.harvard.edu/tgi/