
Literature Retrieval and Mining Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520.


1 Literature Retrieval and Mining Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

2 Outline
Introduction to PubMed
PubMed Related Articles
Search engines and Google features
H-index

3 PubMed
PubMed → NCBI → NLM → NIH
–Biomedical literature database
–> 21M citations from 4800 journals since 1948
–Entrez is the retrieval system
PubMed entry
–Citation (paper) published (recent papers may be indexed upon epub)
–Citation indexed in PubMed with PubMedID assigned
–Citation indexed with MeSH (Medical Subject Headings, like keywords) terms
For direct full article access: http://www.ncbi.nlm.nih.gov.ezp1.harvard.edu/sites/entrez?holding=hulib

4 PubMed Articles

5 Search by Author / Journal / Date
By author:
–Lastname FirstMiddleInitial [au]: Liu JS
–First author [1au], last author [lastau]
–Full name [fau]: Jun S Liu
By journal title: [ta]
–Full journal title or MEDLINE abbreviation: PNAS
–Index to get journal title “proceedings of the national”
By date:
–yyyy/mm/dd [dp]
–Date range (:) or “last x days/months/years”
Jun S Liu [au] AND "last 3 years" [dp]

6 Search Syntax and Tags
Boolean: AND, OR, NOT
Field tags
–First author [1AU], author [AU], full author [FAU], 1st author affiliation [AD]
–MeSH term [MH]
–Title [TI], title/abstract [TIAB], text words [TW]
–Publication date [DP]
–Publication type [PT], journal title [TA], language [LA]
Sorted by: Date, author, journal
Details: View and edit actual query
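The field tags and Boolean operators above can be combined programmatically. The following is a minimal sketch, not part of the original slides: the helper names (`tagged`, `build_query`, `esearch_url`) are hypothetical, and it assumes the query would be sent to the NCBI E-utilities ESearch endpoint. It only builds the query string and URL; nothing is fetched.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumed target; this sketch never calls it).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, tag):
    """Attach a PubMed field tag to a term, e.g. 'Liu JS' + 'au' -> 'Liu JS[au]'."""
    return f"{term}[{tag}]"


def build_query(*clauses, operator="AND"):
    """Join already-tagged clauses with one Boolean operator (AND, OR, NOT)."""
    return f" {operator} ".join(clauses)


def esearch_url(query, retmax=20):
    """Return the ESearch URL for a query string without performing the request."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Author AND journal AND relative date range, using the tags from the slides.
query = build_query(
    tagged("Liu JS", "au"),
    tagged("PNAS", "ta"),
    tagged('"last 3 years"', "dp"),
)
print(query)
print(esearch_url(query))
```

Keeping query construction separate from the HTTP request makes it easy to inspect the exact string first, much like the "Details" view in the PubMed interface.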

7 Advanced Search and PubMed Display
Advanced search
–Index: refine search before display
–History: keep most recent 100 queries for 8 hours, e.g. #5 AND #3
Displayed by:
–Summary
–Abstract
–Send To: text, file, email, Clipboard, RSS!
More on PubMed:
–http://www.nlm.nih.gov/bsd/disted/pubmed.html

8 Getting Information from Text

9 Literature Mining Terms
Corpus: Collection of documents. E.g. all papers in PubMed
Term frequency: Number of times a word appears in a document. E.g. “polymerase” appeared 41 times in a paper
Document frequency: Number of documents a word appears in. E.g. 1234x papers have the word “transcription”
Collection frequency: Total number of times a word appears in a corpus. E.g. “transcription” appeared 6789X times in all PubMed-indexed papers
Stop words: Words in the corpus that contribute little to meaning. E.g. to, is, an
Stemming: Group together different variations of the same word. E.g. activate vs. activated vs. activating
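The terms above can be made concrete on a toy corpus. This is a sketch under stated assumptions: the three "documents", the stop-word list, and the `crude_stem` helper are all invented for illustration, and the suffix stripping is far cruder than a real stemmer such as Porter's.

```python
from collections import Counter

# Hypothetical three-document corpus, invented for illustration.
corpus = [
    "transcription is activated by the polymerase",
    "the polymerase binds to an activating promoter",
    "transcription factors activate transcription",
]
STOP_WORDS = {"to", "is", "an", "the", "by"}


def crude_stem(word):
    """Very rough suffix stripping so activate/activated/activating collapse."""
    for suffix in ("ating", "ated", "ate"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    return [crude_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


docs = [tokenize(d) for d in corpus]

# Term frequency: counts of each word within one document.
term_freq = [Counter(doc) for doc in docs]
# Document frequency: number of documents containing each word at least once.
doc_freq = Counter(term for doc in docs for term in set(doc))
# Collection frequency: total occurrences of each word across the corpus.
coll_freq = Counter(term for doc in docs for term in doc)

print(term_freq[2]["transcription"], doc_freq["transcription"], coll_freq["transcription"])
```

Note that document frequency counts each document once (via `set(doc)`), while collection frequency counts every occurrence.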

10 Documents Represented as Vectors
A document is summarized as a vector of word counts.
Each dimension contains the number of times a word appears.
Can calculate similarity between two documents by comparing their vectors
“Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.”
acid 2
amino 2
analysis 1
comparison 1
control 1
environments 3
[…]
our 1
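A word-count vector like the one on this slide can be built in a few lines. This is a minimal sketch: the tokenizer (lowercase, alphabetic runs only) and the helper names are assumptions, and no stop-word removal or stemming is applied.

```python
import re
from collections import Counter

sentence = (
    "Our analysis includes comparison of amino acid environments with random "
    "control environments as well as with each of the other amino acid environments."
)


def word_counts(text):
    """Lowercase the text, keep alphabetic tokens, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(counts, vocabulary):
    """Project counts onto a fixed vocabulary; absent words get 0."""
    return [counts[word] for word in vocabulary]


counts = word_counts(sentence)
vocabulary = sorted(counts)          # one dimension per distinct word
vector = to_vector(counts, vocabulary)

for word, n in zip(vocabulary, vector):
    print(word, n)
```

To compare two documents, both would be projected onto the same shared vocabulary so that each dimension refers to the same word.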

11 Comparing Two Documents
Intuitive comparison between two papers → correlation coefficient of their word occurrence vectors
Correlation measures the strength of linear relationship between two random variables
a = c(1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1)
b = c(2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3)
c = c(2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8)
cor(a, b)   0.985615   Correlated
cor(b, c)   -0.110328  Not correlated
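The correlation on this slide is the Pearson coefficient (the default of R's `cor`). The sketch below recomputes it in plain Python from the definition; the function name `pearson` is an assumption, and the result should agree with the slide's values up to rounding.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Word occurrence vectors from the slide.
a = [1, 3, 5, 1, 8, 20, 0, 0, 0, 3, 1]
b = [2, 3, 4, 0, 10, 25, 1, 0, 2, 4, 3]
c = [2, 0, 1, 10, 2, 4, 7, 1, 5, 0, 8]

print(round(pearson(a, b), 6))
print(round(pearson(b, c), 6))
```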

12 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency

13 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: Fewer documents, more weight. E.g. progesterone vs gene
Local weight
–Term frequency

14 Term Weighting Considerations
Give different terms different weight
Global weight
–Document frequency: log(N / df)
Local weight
–Term frequency: 1 + log(tf)
–Document length
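The two formulas on this slide can be evaluated directly. This is a sketch under assumptions: natural logarithms are used (the slide does not give the base), the corpus size and document frequencies are made-up numbers, and document-length normalization is left out.

```python
import math


def global_weight(n_docs, doc_freq):
    """Global weight log(N / df): terms found in fewer documents weigh more."""
    if not 0 < doc_freq <= n_docs:
        raise ValueError("doc_freq must be between 1 and n_docs")
    return math.log(n_docs / doc_freq)


def local_weight(term_freq):
    """Local weight 1 + log(tf); a term absent from the document gets 0."""
    return 1.0 + math.log(term_freq) if term_freq > 0 else 0.0


# Hypothetical numbers: a rare term versus a very common one.
N = 1_000_000
print("progesterone:", global_weight(N, 1_000))
print("gene:        ", global_weight(N, 500_000))
print("tf=1:", local_weight(1), " tf=41:", local_weight(41))
```

The logarithm damps both effects: a word appearing 41 times does not count 41 times as much as a word appearing once.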

15 Related Citations
–Similarity between two documents: Σ over all terms (local wt1 × local wt2 × global wt)
–Pre-computed related articles for each citation
–Rank ordered by relevance, then date
How to evaluate:
–Tradeoff between precision and recall
–Precision = # relevant hits / # hits
–Recall = # relevant hits / # relevant
–Often # relevant is arbitrary, or sampled
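The similarity sum and the two evaluation measures can be sketched as follows. This is an illustration only, not PubMed's actual related-articles implementation: it assumes local weight 1 + log(tf) and global weight log(N / df), and all document data and identifiers are hypothetical.

```python
import math


def similarity(tf1, tf2, doc_freq, n_docs):
    """Sum over shared terms of local wt1 x local wt2 x global wt.

    tf1, tf2: dicts mapping term -> count (> 0) in each document.
    doc_freq: dict mapping term -> number of documents containing it.
    """
    score = 0.0
    for term in tf1.keys() & tf2.keys():   # only shared terms contribute
        local1 = 1 + math.log(tf1[term])
        local2 = 1 + math.log(tf2[term])
        global_wt = math.log(n_docs / doc_freq[term])
        score += local1 * local2 * global_wt
    return score


def precision_recall(hits, relevant):
    """Precision = relevant hits / hits; recall = relevant hits / relevant."""
    hits, relevant = set(hits), set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical term counts for two papers in a 1000-document corpus.
paper1 = {"gene": 1, "progesterone": 1}
paper2 = {"progesterone": 1, "receptor": 2}
df = {"gene": 900, "progesterone": 10, "receptor": 50}
print(similarity(paper1, paper2, df, n_docs=1000))

# Four returned hits, three truly relevant papers, two in common.
print(precision_recall(["p1", "p2", "p3", "p4"], ["p2", "p4", "p9"]))
```

Returning more hits tends to raise recall and lower precision, which is the tradeoff named on the slide.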

16 Jane
http://www.biosemantics.org/jane/
Have you recently written a paper, but you're not sure to which journal you should submit it? Or maybe you want to find relevant articles to cite in your paper? Or are you an editor, and do you need to find reviewers for a particular paper? Maybe you are a reviewer, and wonder whether the authors’ ‘novel’ approach or finding is really novel? Jane can help!

17 Search Engine Components
The crawler: visit all websites, traverse all links
The index
–Check keywords and full text
–Ignore stop words: e.g. is, at, from…
–Paid inclusion: not Google
–Google looks at semantics and logic
The search engine software: how to rank
–Location (front) and frequency (more)
–Off the page factors: how many pages link to this one
–Clickthrough measurement: lower the rank for search results not clicked
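The index and ranking ideas above can be illustrated with a toy inverted index. This is a sketch under heavy assumptions: the pages, the link graph, and the additive scoring formula are invented for illustration and do not describe how Google or any real engine ranks results.

```python
from collections import defaultdict

STOP_WORDS = {"is", "at", "from", "the", "a", "of"}

# Hypothetical pages and outgoing links, invented for illustration.
pages = {
    "a.example": "proteomics is the study of proteins",
    "b.example": "a tutorial from the proteomics course",
    "c.example": "genomics tutorial",
}
links = {"a.example": ["b.example"], "c.example": ["b.example", "a.example"]}


def build_index(pages):
    """Inverted index: word -> {url: [positions]}, skipping stop words."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word].setdefault(url, []).append(pos)
    return index


def inbound_counts(links):
    """Off-the-page factor: how many pages link to each URL."""
    counts = defaultdict(int)
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


def search(word, index, inbound):
    """Rank by frequency (more), location (earlier), and inbound links."""
    scored = []
    for url, positions in index.get(word.lower(), {}).items():
        score = len(positions) + 1.0 / (1 + positions[0]) + inbound[url]
        scored.append((score, url))
    return [url for score, url in sorted(scored, reverse=True)]


index = build_index(pages)
inbound = inbound_counts(links)
print(search("proteomics", index, inbound))
```

In this toy data the page with more inbound links outranks the page where the word appears first, which shows how an off-the-page factor can override location.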

18 Other Useful Google Features
http://www.google.com/intl/en/help/features.html
Conversion: 88 cm in inches or 10000 yen in USD
Time: time Beijing
Definition: define proteomics
Local search: CVS 02115
Travel info: united 134
Site search: training grant site:hsph.harvard.edu
Who links to this: link:www.ncbi.nlm.nih.gov
Filetype: comparative genomics filetype:ppt

19 H-index
Hirsch, PNAS 2005
Simultaneously measures productivity and impact of a scientist
A scholar with an index of h has published h papers each of which has been cited by others at least h times
Check citations at scholar.google.com
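The definition above translates directly into code. The following is a minimal sketch with made-up citation counts; the function name `h_index` is an assumption.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ranked = sorted(citations, reverse=True)   # most-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank        # the top `rank` papers all have >= rank citations
        else:
            break           # counts only decrease from here on
    return h


# Hypothetical citation counts for one scholar's five papers.
print(h_index([10, 8, 5, 4, 3]))
```

A single highly cited paper barely moves the index: `[100]` and `[1]` both give h = 1, which is why the measure rewards sustained productivity.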

20 H-index
For physical sciences: ~12 tenured associate prof, ~18 full prof, ~45 member of the National Academy of Sciences
A few biases:
–Does not consider where the author appears on a paper or the total number of authors
–Advantage for older, living scientists and sustained productivity
–Ignores context (e.g. wrong results)

21 Google Scholar
scholar.google.com
Set preference to access open source and Harvard Lib: http://scholar.google.com/scholar_preferences?hl=en
Where is a paper cited
Get free pdf without subscription
Get H-index “My Citations”
–Manually update papers after the initial creation

22 Want to Learn a Comp Bio Topic?
PubMed:
–Recent reviews in good journals
–“related articles”
Nature Biotechnology CompBio track
PLoS Computational Biology Collection
–http://collections.plos.org/ploscompbiol/index.php
–10 Simple rules, Educational
Search Google for topic, “lecture notes” or “tutorial”, filetype:ppt or pdf
Search http://www.wikipedia.org/ and Google definition
Try http://CompBio.pbwiki.com

23 Acknowledgement
Russ Altman
Soumya Raychaudhuri
John Quackenbush
Jeff Chang

