Bioinformatics lectures at Rice University Li Zhang Lecture 1 Department of Bioinformatics and Computational Biology MD Anderson Cancer Center March-April, 2015
Contact information Li Zhang Phone: 713-563-4298 (office) 713-962-6661 (cell) Email: firstname.lastname@example.org@mdanderson.org URL: http://odin.mdacc.tmc.edu/~llzhang/RiceCourse/ Office location: FCT4.5034, Pickens Tower, 4th floor, MD Anderson Cancer Center.
Homework There will be 2-3 assignments posted online. All students are required to complete the assignments. Homework will be submitted at the beginning of class on the due date. If circumstances beyond the student’s control arise and an assignment cannot be submitted on the due date, an instructor should be contacted prior to the due date. With an instructor’s permission, late homework may be accepted within one week of the due date. All decisions will be made on an individual student basis and the final decision rests with the instructor assigning the homework. A penalty of 10 percentage points will be applied to late homework.
What is bioinformatics? Bioinformatics is the application of computer science and information technology to the fields of biology and medicine. Bioinformatics draws on algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, software engineering, data mining, image processing, modeling and simulation, signal processing, discrete mathematics, control and system theory, circuit theory, and statistics, both to generate new knowledge of biology and medicine and to improve and discover new models of computation (e.g. DNA computing, neural computing, evolutionary computing, immuno-computing, swarm computing, cellular computing). Commonly used software tools and technologies in this field include Java, XML, Perl, C, C++, Python, R, MySQL, NoSQL, CUDA, MATLAB, and Microsoft Excel.
Focus area of this course Reference book: Pierre Baldi’s “Bioinformatics: The Machine Learning Approach”, plus a few key papers. Introduction to the high-throughput technologies that provide the data. Machine learning algorithms and models to visualize and explore large datasets and to identify patterns and relationships. Computing languages: R/Perl. Database: non-relational (NoSQL) databases. Not focused on web applications; no structural biology.
Why should we study bioinformatics? Why is it important to study bioinformatics?
There are 187 billion bases in 171 million sequence records in the traditional GenBank as of February 2015.
Growth Chart Of GEO (RNA etc) Gene Expression Omnibus (GEO) database holds over 10 000 experiments comprising 300 000 samples, 16 billion individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000 manuscripts.
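To give a feel for the scale behind these GEO figures, the short sketch below derives two rough averages from them. The quoted numbers are lower bounds ("over 10 000", "over 300 000"), so the results are order-of-magnitude estimates only; the variable names are illustrative and not part of the lecture.

```python
# Rough averages from the GEO figures quoted on the slide.
# The inputs are the slide's rounded lower bounds, so treat the
# outputs as order-of-magnitude estimates, not exact statistics.
experiments = 10_000
samples = 300_000
measurements = 16_000_000_000

samples_per_experiment = samples / experiments
measurements_per_sample = measurements / samples

print(f"about {samples_per_experiment:.0f} samples per experiment")
print(f"about {measurements_per_sample:,.0f} measurements per sample")
```

On these numbers a typical experiment has a few dozen samples, and each sample carries tens of thousands of abundance measurements, which is why such datasets have far more variables than observations.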
Distribution of the number and types of selected studies released by GEO each year since inception. Tanya Barrett et al. Nucl. Acids Res. 2013;41:D991-D995 Published by Oxford University Press 2012.
Growth of PDB (Protein Structures) The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological molecules, such as proteins and nucleic acids. Most structures are determined by X-ray diffraction, but about 15% of structures are determined by NMR. Large scale organized efforts by Structural Genomics Initiative and International Structural Genomics Consortium have greatly accelerated the pace of growth.
A brief history of the big bang of the digital universe
The age of big data “ The story is similar in fields as varied as science and sports, advertising and public health — a drift toward data-driven discovery and decision-making. It’s a revolution. We’re really just getting under way. But the march of quantification, made possible by enormous new sources of data, will sweep through academia, business and government. There is no area that is going to be untouched.” -------- By Steve Lohr, “The Age of Big data”, The New York Times, 2012.
What is big data? The 3Vs of big data: high volume, high velocity, high variety (a definition of big data from Gartner, Inc.). Simply put, it is big and complex.
The big value of big data The value of big data is that its analysis can lead to (1) enhanced decision making, (2) insight discovery and (3) process optimization. In business, big data can help to identify unmet needs, customize advertising, and monitor and evaluate operations, which leads to big profits and big savings. In science, big data is a huge resource for scientific discovery.
The cost of sequencing has fallen about 100,000-fold in the past 12 years
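One way to read a 100,000-fold drop over 12 years is as a halving time. The sketch below assumes the cost fell at a constant exponential rate over the whole period, which is a simplification (the real decline was uneven); the function name `halving_time_months` is hypothetical.

```python
import math

def halving_time_months(fold_reduction: float, years: float) -> float:
    """Months per halving, assuming a constant exponential decline.

    fold_reduction: total factor by which the cost dropped (e.g. 100_000).
    years: length of the period over which the drop occurred.
    """
    halvings = math.log2(fold_reduction)  # number of times the cost halved
    return years * 12 / halvings

months = halving_time_months(100_000, 12)
print(f"cost halved roughly every {months:.1f} months")
```

Under this assumption the cost of sequencing halved a little more often than once every nine months.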
Data explosion in the era of genomics A series of breakthroughs in micro-electronics and nano-electronics has produced instruments that quantify and/or characterize large numbers of biological molecules in parallel using very small amounts of biomaterial. These technical advances have made it possible to comprehensively characterize and quantify the building blocks (DNA, RNA, protein) of a biological system.
Genome, genomics and post genomic era List of sequenced genomes of mammals ("-" marks a value not given on the slide):
Type                               Genome size   Year of completion
Cow                                3.0 Gb        2009
Dog                                2.4 Gb        2005
Guinea Pig                         3.4 Gb        -
Nine-banded Armadillo              3.0 Gb        -
Hedgehog-Tenrec                    -             -
Horse                              2.1 Gb        2007
Western European Hedgehog          -             -
Cat                                3 Gb          2007
Human                              3.2 Gb        2001
African Elephant                   3 Gb          -
Rhesus Macaque                     -             -
Gray Mouse Lemur                   -             -
Gray Short-tailed Opossum          3.5 Gb        2007
Mouse                              2.5 Gb        2002
Little Brown Bat                   -             -
American Pika                      -             -
Platypus                           -             -
Rabbit                             2.5 Gb        -
Small-eared Galago, or Bushbaby    -             -
Chimpanzee                         3.1 Gb        2005
Orangutan                          3.0 Gb        -
Rat                                2.8 Gb        2004
European Shrew                     3.0 Gb        -
Thirteen-lined Ground Squirrel     -             -
Domestic pig                       -             2009
Northern Tree Shrew                -             -
Large Projects TCGA: The Cancer Genome Atlas; 1000 Genomes Project; 1001 Genomes Project; ICGC: International Cancer Genome Consortium; The International HapMap Project; …
Data → Information → Knowledge/power. Bioinformatics provides tools to catalyze the transformations.