24 August 2012, Ganesha Associates. Basic reading, writing and informatics skills for biomedical research. Segment 4: Other types of database and browser.


1 Basic reading, writing and informatics skills for biomedical research. Segment 4: Other types of database and browser

2 Biological databases
A database is an indexed collection of information.
Some databases contain mainly text, but others contain image, sequence or structural data.
A browser is a means of visualising this information and the relationships between data elements.
There is a growing amount of information in publicly available databases. For example, in 2011 the Nucleic Acids Research online Molecular Biology Database Collection listed 1380 databases.
The National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) host some of the most important databases used for biomedical research.
Wikipedia also contains a list of biological databases.
Which databases are relevant to your project?

3 Data, data everywhere…
“Rapid release of prepublication data has served the field of genomics well.”
“With close to one million gene-expression data sets now in publicly accessible repositories, researchers can identify disease trends without ever having to enter a laboratory.”
“Most researchers agree that open access to data is the scientific ideal, so what is stopping it happening [in other fields]?”
“Earth scientists need better incentives, rewards and mechanisms to achieve free and open data exchange”

4 The database problem
Volume of digital data (both high throughput and text)
–One second of HD video = 2000 pages of text
Distributed systems and databases, lack of data standards, incompatible data formats
Costs of creation, curation and maintenance
Retrieval: semantic search, metadata, images…

5 The problem – biomedical research
[Diagram labels: Gene Expression Warehouse; data types: Protein, Disease, SNP, Enzyme, Pathway, Known Gene, Sequence Cluster, Affy Fragment, Sequence, NMR, Metabolite; sources: LocusLink, MGD, ExPASy SwissProt, PDB, OMIM, NCBI dbSNP, ExPASy Enzyme, KEGG, SPAD, UniGene, GenBank]

6 Cross-database search today - NCBI

7 The problem – biomedical research

8 The problem – biomedical research

9 The problem – healthcare

10 The problem - healthcare
Journal of the American Medical Association (JAMA), Vol 284, No 4, 26 July 2000:
–12,000 deaths/year from unnecessary surgery
–7,000 deaths/year from medication errors in hospitals
–20,000 deaths/year from other errors in hospitals
–80,000 deaths/year from infections in hospitals
–106,000 deaths/year from non-error, adverse effects of medications
These total 225,000 deaths per year in the US from iatrogenic causes, which ranks them as the third leading cause of death.
A death is iatrogenic when the patient dies as a direct result of treatment by a physician, whether from misdiagnosis of the ailment or from adverse reactions to the drugs used to treat it (drug reactions are the most common cause).

11 The problem - healthcare
17-year innovation adoption curve from discovery to accepted standards of practice
Even when a standard is accepted, patients have a 50:50 chance of receiving appropriate care and a 5-10% probability of incurring a preventable, anticipatable adverse event
Medical literature doubling every 19 years
–Doubles every 22 months for AIDS care
2 million facts needed to practice
Genomics and personalized medicine will increase the problem exponentially
Typical drug order today with decision support accounts for, at best: age, weight, height, labs, other active meds, allergies, diagnoses
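The doubling times quoted on this slide are easier to compare as annual growth rates. The sketch below assumes steady exponential growth, so that a quantity with doubling time T years grows by a factor of 2^(1/T) each year. The function name `annual_growth` is illustrative and does not come from the slides.

```python
def annual_growth(doubling_time_years):
    """Fractional growth per year for a quantity that doubles every
    `doubling_time_years` years, assuming steady exponential growth."""
    return 2 ** (1 / doubling_time_years) - 1


# Medical literature as a whole: doubling every 19 years.
print(f"{annual_growth(19):.1%}")       # roughly 3.7% per year

# AIDS care literature: doubling every 22 months.
print(f"{annual_growth(22 / 12):.1%}")  # roughly 46% per year
```

Under this assumption the AIDS care literature grows more than ten times faster per year than the medical literature overall.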

12 So how will we find things in databases?
A search engine collects, indexes, parses and stores data to facilitate fast and accurate information retrieval.
Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics (statistics), informatics, physics and computer science.

13 Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours.
The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

14 Inverted indexing
An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document or a set of documents. Here it is used to allow full-text search.
There are two main variants of inverted index:
–A record-level inverted index contains a list of references to documents for each word.
–A word-level inverted index additionally contains the positions of each word within a document.
The word-level form offers more functionality (such as phrase searches), but needs more time and space to create.

15 Example
The texts T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana" have the following inverted file index (the integers in braces refer to the subscripts of T0, T1 and T2):
–"a": {2}
–"banana": {2}
–"is": {0, 1, 2}
–"it": {0, 1, 2}
–"what": {0, 1}
A search for the terms "what", "is" and "it" gives the set {0, 1}, the documents that contain all three terms.
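The record-level index in this example can be reproduced with a few lines of Python. This is a minimal sketch: words are split on whitespace only, and the function names `build_record_index` and `search_all` are illustrative, not part of any library.

```python
from collections import defaultdict


def build_record_index(texts):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)


def search_all(index, terms):
    """Return the documents that contain every query term (boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


texts = ["it is what it is", "what is it", "it is a banana"]
record_index = build_record_index(texts)

print(record_index["banana"])                          # {2}
print(search_all(record_index, ["what", "is", "it"]))  # {0, 1}
```

The query never touches the texts themselves. It only intersects three small sets, which is why an indexed search stays fast as the corpus grows.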

16 Example (cont’d)
In the full inverted index, the pairs are document numbers and local word numbers. For example, "banana": {(2, 3)} means that the word "banana" is in the third document (T2), where it is the fourth word (position 3):
–"a": {(2, 2)}
–"banana": {(2, 3)}
–"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
–"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
–"what": {(0, 2), (1, 0)}
A phrase search for "what is it" finds all three words in both document 0 and document 1, but the words occur consecutively only in document 1.
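The word-level index and the phrase search can be sketched in the same way. As before, this is an illustration under simple assumptions (whitespace tokenisation, exact word matches), and the function names are made up for the example.

```python
from collections import defaultdict


def build_positional_index(texts):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


def phrase_search(index, phrase):
    """Return the documents in which the words of `phrase` occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    position_sets = [set(index.get(word, [])) for word in words]
    hits = set()
    # Try every occurrence of the first word as a possible start of the phrase.
    for doc_id, start in position_sets[0]:
        if all((doc_id, start + offset) in position_sets[offset]
               for offset in range(1, len(words))):
            hits.add(doc_id)
    return hits


texts = ["it is what it is", "what is it", "it is a banana"]
positional_index = build_positional_index(texts)

print(positional_index["is"])                       # [(0, 1), (0, 4), (1, 1), (2, 1)]
print(phrase_search(positional_index, "what is it"))  # {1}
```

Document 0 contains all three words but in the order "what it is", so only document 1 matches the phrase.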

17 Indexing algorithms
Semantic
–Stop words
–Stemming
–Synonyms
–Thesauri
–Ontologies
Syntactic
–Word order
–Word type
–Natural language processing
Statistical
–Word frequency
–Word proximity

18 PubMed Related Articles Algorithm (I)
The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.
To carry out such a program, one must first define what a word is. For us, a word is basically an unbroken string of letters and numerals with at least one letter of the alphabet in it. Words end at hyphens, spaces, new lines, and punctuation.
A list of 310 common, but uninformative, words (also known as stopwords) is eliminated from processing at this stage.
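The word definition above translates fairly directly into code. The sketch below is only an approximation of the description on this slide, not PubMed's implementation: lower-casing is an assumption, and `STOPWORDS` is a tiny illustrative set standing in for the 310-word list, which is not reproduced here.

```python
import re

# Illustrative stand-in only; the actual list has 310 entries.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into words: unbroken runs of letters and numerals that
    contain at least one letter, with stopwords removed."""
    # Hyphens, spaces, new lines and punctuation all end a run.
    runs = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [run for run in runs
            if re.search(r"[a-z]", run) and run not in STOPWORDS]


print(tokenize("Interleukin-2 binds the IL2 receptor in 1995."))
# ['interleukin', 'binds', 'il2', 'receptor']
```

"2" and "1995" are dropped because they contain no letter, while "il2" is kept because it contains at least one.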

19 PubMed Related Articles Algorithm (II)
Next, a limited amount of stemming of words is done.
Words from the abstract of a document are classified as text words. Words from titles are also classified as text words, but words from titles are added in a second time to give them a small advantage in the local weighting scheme.
MeSH terms are placed in a third category, and a MeSH term with a subheading qualifier is entered twice, once without the qualifier and once with it.
These three categories of words (or phrases in the case of MeSH) comprise the representation of a document. No other fields, such as Author or Journal, enter into the calculations.
See http://ii.nlm.gov/MTI/related.shtml for more info.
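A rough sketch of this document representation is shown below. It is an illustration of the description on the slide rather than the actual algorithm: stemming, stopword removal and the weighting scheme are omitted, the function names are invented, and writing a qualified MeSH term as "Heading/qualifier" is an assumption about the input format.

```python
import re
from collections import Counter


def words(text):
    """Lower-cased runs of letters and numerals containing at least one letter."""
    runs = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [run for run in runs if re.search(r"[a-z]", run)]


def represent(title, abstract, mesh_terms):
    """Build text-word and MeSH counts for one document."""
    text_words = Counter(words(abstract))
    title_words = words(title)
    text_words.update(title_words)  # title words count as text words...
    text_words.update(title_words)  # ...and are added a second time
    mesh = Counter()
    for term in mesh_terms:
        heading, _, qualifier = term.partition("/")
        mesh[heading] += 1          # once without the qualifier
        if qualifier:
            mesh[term] += 1         # and once with it
    return {"text": text_words, "mesh": mesh}


doc = represent(
    title="Gene therapy",
    abstract="Therapy trials of gene transfer",
    mesh_terms=["Neoplasms/genetics", "Humans"],
)
print(doc["text"]["gene"])   # 3
print(sorted(doc["mesh"]))   # ['Humans', 'Neoplasms', 'Neoplasms/genetics']
```

Author and journal fields are deliberately not passed in, matching the statement that no other fields enter into the calculations.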

20 Ontologies, thesauri and taxonomies
An ontology is a controlled vocabulary that describes objects and the relations between them in a formal way, and has a grammar for using the vocabulary terms to express something meaningful within a specified domain of interest.
A thesaurus is a controlled list of terms linked together by semantic, hierarchical, and associative or equivalence relationships.
A taxonomy is a set of interdependent concepts arranged in a lattice based on their relationships.

21 Semantic inference
Keywords, Dictionary, Controlled Vocabulary, Thesaurus, Taxonomy, Ontology
Discovery, Integration, Prediction

22 Semantic levels
Features (columns): Definition, Synonyms, Classification (is_a), Properties (has_a), Other relations
Vocabulary types (rows): Keywords, Dictionary, Controlled vocabulary, Thesaurus, Taxonomy, Ontology

23 The Medical Subject Headings classification
Controlled vocabulary, thesaurus.
MeSH terms are arranged in a hierarchy of "MeSH Tree Structures".
When PubMed searches a MeSH term, it automatically includes narrower terms in the search, if applicable. This is called "automatic explosion".
When you click Go, PubMed looks for a match in up to four lists. It looks first in the MeSH Translation Table. If it doesn't find a match, it looks in the Journals Translation Table, then in the Phrase List, and finally in the Author Index.
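Automatic explosion amounts to walking down the hierarchy from the searched term. The sketch below shows the idea on a small hand-written tree; `NARROWER` is an illustrative fragment, not a copy of the MeSH Tree Structures, and `explode` is a made-up name.

```python
# Illustrative fragment only: each term maps to its directly narrower terms.
NARROWER = {
    "Cardiovascular Diseases": ["Heart Diseases", "Vascular Diseases"],
    "Heart Diseases": ["Arrhythmias, Cardiac", "Heart Failure"],
}


def explode(term, narrower=NARROWER):
    """Return the term followed by all narrower terms beneath it."""
    found = [term]
    for child in narrower.get(term, []):
        found.extend(explode(child, narrower))
    return found


print(explode("Heart Diseases"))
# ['Heart Diseases', 'Arrhythmias, Cardiac', 'Heart Failure']
```

A search on the exploded list retrieves articles indexed with any of the narrower terms, so a search for the broad term also finds articles indexed only with a more specific one.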


27 The Gene Ontology organisation
The objective of GO is to provide controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.
These terms are to be used as attributes of gene products by collaborating databases, facilitating uniform queries across them.
The controlled vocabularies of terms are structured to allow both attribution and querying to be at different levels of granularity.
http://www.geneontology.org

28 Gene Ontology organisation
GO collaborators have developed three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner.
There are three separate aspects to this effort:
–They write and maintain the ontologies themselves
–They make cross-links between the ontologies and the genes and gene products in the collaborating databases
–They develop tools that facilitate the creation, maintenance and use of ontologies.
Useful links: http://www.amigo.org (10 April 2008)


32 Is_a and part_of relationships
[Figure: graph of terms linked by is_a and part_of relationships; Clark et al., 2005]

33 An example of annotation
Mitochondrial P450 (CC24 PR01238; MITP450CC24)
–GO cellular component term: mitochondrial inner membrane; GO:0005743
–GO molecular function term: monooxygenase activity; GO:0004497
–GO biological process term: electron transport; GO:0006118
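The combination of is_a and part_of links with annotations like the one above is what allows querying at different levels of granularity. The sketch below illustrates this with the annotation from this slide and a small, hand-written set of parent links. `PARENTS` is a toy fragment for illustration and may not match the real GO graph edge for edge; the function names are also invented.

```python
# Toy parent links written by hand for illustration; the real GO graph is far
# larger and its exact edges may differ.
PARENTS = {
    "mitochondrial inner membrane": [
        ("is_a", "organelle inner membrane"),
        ("part_of", "mitochondrial envelope"),
    ],
    "mitochondrial envelope": [("part_of", "mitochondrion")],
    "organelle inner membrane": [("is_a", "membrane")],
}

# Annotation taken from the slide: one gene product, three GO terms.
ANNOTATIONS = {
    "MITP450CC24": [
        "mitochondrial inner membrane",  # GO:0005743
        "monooxygenase activity",        # GO:0004497
        "electron transport",            # GO:0006118
    ],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following is_a or part_of links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_annotated_to(term, annotations=ANNOTATIONS):
    """Gene products annotated to `term` directly or via a more specific term."""
    hits = []
    for product, terms in annotations.items():
        if any(term == t or term in ancestors(t) for t in terms):
            hits.append(product)
    return hits


print(products_annotated_to("mitochondrion"))  # ['MITP450CC24']
```

The gene product was annotated only to the specific term "mitochondrial inner membrane", yet a query on the broader term "mitochondrion" still finds it.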


35 MicroArray data analysis with GO
[Figure labels: attacked, time, control; puparial adhesion, molting cycle, hemocyanin, defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

36 GoPubMed
GoPubMed is a knowledge-based search engine for biomedical texts.
The Gene Ontology (GO) and Medical Subject Headings (MeSH) serve as a "table of contents" to structure the millions of articles in the MEDLINE database.
GoPubMed is one of the first Web 2.0 search engines.
The system was developed at the Technical University of Dresden by Michael Schroeder and his team and at Transinsight.
http://www.gopubmed.org


38 Medline Cognition
Cognition's Semantic NLP understands:
–Word stems: the roots of words
–Words and phrases, with the individual meanings of ambiguous words and phrases listed out
–The morphological properties of each word or phrase, e.g. what type of plural it takes, what type of past tense, and how it combines with affixes like "re" and "ation"
–How to disambiguate word senses, which allows Cognition's technology to pick the correct meaning of an ambiguous word in context
–The synonym relations between word meanings
–The ontological relations between word meanings; one can think of this as a hierarchical grouping of meanings or a gigantic "family tree of English" with mothers, daughters, and cousins
–The syntactic and semantic properties of words. This is particularly useful with verbs: Cognition encodes the types of objects that different verb meanings can occur with.


40 iHOP
Information Hyperlinked over Proteins.
iHOP provides the network of genes and proteins as a natural way of accessing the millions of abstracts in PubMed.

41 iHOP
The minimal information view contains general information, such as the symbol, name and organism of a gene. It also provides:
–Useful links to external resources (e.g. UniProt, NCBI, OMIM)
–Links to other iHOP views on this gene
–Homologues
Other views contain all sentences found in the literature:
–That mention the main gene of a page together with other genes (gene B) with which it interacts
–That mention the main gene together with relevant biomedical terms such as lymphoma
Sentences are ranked by significance, so that screening a few sentences is usually sufficient to gain an idea of a gene's function.


43 GenMAPP
GenMAPP is a free computer application designed to visualize gene expression and other genomic data on maps representing biological pathways and groupings of genes.
Integrated with GenMAPP are programs to perform a global analysis of gene expression or genomic data in the context of hundreds of pathway MAPPs and thousands of Gene Ontology terms.

44 Automatic rendering of pathway interactions

45 Other ways to search – BLAST, PubChem, UCSC Genome Browser
By sequence – BLAST:
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA
TAAGGACGGACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTA
CCTATCCCATGGGAGCCATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCG
GGCTCCCCCACTCCGTTCCCTGATGAAGCCGGAGCCTTCCTGGGGCTGGGG
GGGGGCG
By structure – PubChem:
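The DinoDNA query on this slide is written in FASTA format: a header line starting with ">" followed by one or more lines of sequence. Before submitting such a query it can be useful to check that it parses cleanly. The sketch below is a minimal hand-written reader, not part of BLAST or any library; the names `parse_fasta` and `gc_fraction` are illustrative, and only the first sequence line from the slide is used in the demonstration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gc_fraction(sequence):
    """Fraction of bases in the sequence that are G or C."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


query = (
    ">DinoDNA from JURASSIC PARK p. 103 nt 1-1200\n"
    "GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGA\n"
)
for header, sequence in parse_fasta(query):
    print(header, len(sequence), round(gc_fraction(sequence), 2))
```

Sequence lines are joined into one string per record, so the line breaks in the original query do not affect the search.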

46 Example of BLAST search results

47 PC Compound Record

48 UCSC Genome Browser
The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide.
The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways.
Blat quickly maps your sequence to the genome.
The Table Browser provides convenient access to the underlying database.
VisiGene lets you browse through a large collection of in situ mouse and frog images to examine expression patterns.
Genome Graphs allows you to upload and display genome-wide data sets.


