Samuel O’Malley Supervisor: Prof. Jiuyong Li Associate Supervisor: Dr. Jixue Liu


1 Samuel O’Malley (oymsj001@mymail.unisa.edu.au)
Supervisor: Prof. Jiuyong Li (jiuyong.li@unisa.edu.au)
Associate Supervisor: Dr. Jixue Liu (jixue.liu@unisa.edu.au)

2 Motivation Background Research Question Contribution Implementation References Do not remove this notice. Copyright Notice COMMONWEALTH OF AUSTRALIA Copyright Regulations 1969 WARNING This material has been produced and communicated to you by or on behalf of the University of South Australia pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

3 Overview
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.
• Motivation
• Background
• Research Question
• Contribution
• Implementation
• Examples
• References

4 Motivation
• microRNA research is increasing exponentially
• Databases cannot be curated fast enough to keep up
• A researcher cannot stay “current” in the field of microRNA
• Automatic curation tools exist for other areas of biomedical research

5 microRNA – What are they?
• microRNAs are small, non-coding lengths of RNA
• They inhibit the creation of proteins
Video from rossettagenomics.com

6 miRBase
• A database of microRNA sequences and annotations.
• Human microRNA 150 is also called MIR150, hsa-mir-150, MIRN150, etc.
• miRBase provides the human-readable name as well as a machine-readable ID.
• Example:
  ◦ hsa-mir-150 has an ID of MI0000479 and HGNC:MIR150
A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
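As a rough illustration of the name-to-ID lookup described on this slide, the Python sketch below maps several spellings of human microRNA 150 to the miRBase accession quoted above. The alias table and the `normalise_name` helper are hypothetical; a real table would be loaded from miRBase rather than typed in by hand.

```python
# Hypothetical alias table: only the aliases and the accession MI0000479
# come from the slide; the table layout itself is an assumption.
ALIASES = {
    "hsa-mir-150": "MI0000479",
    "mir150": "MI0000479",
    "mirn150": "MI0000479",
}


def normalise_name(name):
    """Return the miRBase accession for a known alias, or None if unknown."""
    # Lower-casing makes "MIR150", "mir150" and "hsa-miR-150" comparable.
    return ALIASES.get(name.strip().lower())
```

Keys are stored in lower case so that one lookup covers the capitalisation variants seen in abstracts.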

7 Disease Related Enzymes
• Finds occurrences of an enzyme and a disease mentioned in the same sentence
• Classifies their relationship using a Support Vector Machine
• Uses a training set of pre-classified sentences
• Example:
  ◦ “Chronic granulomatous disease (CGD) results from mutations of phagocyte NADPH oxidase.”
  ◦ Classified as “Causal Interaction”
C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
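The first step on this slide, finding sentences that mention both an enzyme and a disease, can be sketched in Python as below. This is only an illustration: the term lists, the naive sentence splitter and the `cooccurring_sentences` name are assumptions, and the Support Vector Machine classification step is not shown.

```python
import re

# Hypothetical term lists for illustration; real dictionaries would be far larger.
ENZYMES = ["NADPH oxidase"]
DISEASES = ["chronic granulomatous disease", "CGD"]


def cooccurring_sentences(text):
    """Yield (sentence, enzyme, disease) for sentences naming both kinds of term."""
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        for enzyme in ENZYMES:
            for disease in DISEASES:
                # Plain substring matching; short terms may match inside longer words.
                if enzyme.lower() in lowered and disease.lower() in lowered:
                    yield sentence, enzyme, disease
```

Each yielded triple would then be a candidate for relationship classification.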

8 Gene Name Disambiguation
• Genes can have many different names or variations
• Humans can understand “context”; for machines this is a challenge
• Example:
  ◦ Five sentences in the paper refer to different genes.
  ◦ Four of these refer to a human gene; the fifth is ambiguous and could be either a human gene or a fly gene.
C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.

9 LINNAEUS – Species Identification
• LINNAEUS uses a set of simple regular expressions to find indicators of which species a text is referring to.
• In my research I use a modified list that incorporates microRNA-specific domain knowledge.
• Example: these patterns all match words used when talking about humans (ID: 9606):
  ◦ [hH]umans? [pP]atients? [pP]articipants? [wW]oman [wW]omen [mM]en [gG]irls? [bB]oys? [pP]eoples? [Cc]hild(ren)? [Ii]nfants? [Pp]ersons?
Gerner, M, Nenadic, G & Bergman, C 2010, 'LINNAEUS: A species name identification system for biomedical literature', BMC Bioinformatics, vol. 11, no. 1, p. 85.
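A minimal Python sketch of how the human indicator patterns listed on this slide might be applied to a sentence. The patterns are copied from the slide; the word-boundary anchors, the `human_mentions` helper and its return format are assumptions added for illustration and are not taken from LINNAEUS itself.

```python
import re

# Patterns copied from the slide; all of them indicate human (ID 9606).
HUMAN_PATTERNS = [
    r"[hH]umans?", r"[pP]atients?", r"[pP]articipants?", r"[wW]oman",
    r"[wW]omen", r"[mM]en", r"[gG]irls?", r"[bB]oys?", r"[pP]eoples?",
    r"[Cc]hild(ren)?", r"[Ii]nfants?", r"[Pp]ersons?",
]

# Assumption: word boundaries are added so "men" cannot match inside "women".
HUMAN_RE = re.compile(r"\b(?:" + "|".join(HUMAN_PATTERNS) + r")\b")


def human_mentions(text):
    """Return (species_id, matched_word) pairs for each human indicator found."""
    return [(9606, match.group(0)) for match in HUMAN_RE.finditer(text)]
```

In a fuller version each species would carry its own pattern list, and the match position would be stored alongside the species ID.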

10 Research Question
What is the most suitable technique for discovering and classifying microRNA-gene relationships from biomedical literature?

11 Contribution
1. A normalisation and disambiguation technique for gene names will be adapted to fit the unique microRNA ontology.
2. Automatic curation of microRNA and gene relationships in biomedical literature. (Not completed yet)

12 MySQL Database Backend
Table Name           Columns
Abstracts            ID, Abstract, Title
Stop_Abstracts       ID, Abstract, Title
Species              ID, Name
Micro_Prefix         Prefix, Species_ID
Species_Mentions     Abstract_ID, Species_ID, Sentence_Num, Word_Num
MicroRNA_Mentions    Abstract_ID, Micro_ID, Sentence_Num, Word_Num
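For readers who want to experiment with this layout, here is a hedged Python sketch that creates the six tables in an in-memory SQLite database. SQLite stands in for the MySQL backend named on the slide, and every column type and key is an assumption, since the slide lists only table and column names.

```python
import sqlite3

# Table and column names follow the slide; types and primary keys are guesses.
SCHEMA = """
CREATE TABLE Abstracts         (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Stop_Abstracts    (ID INTEGER PRIMARY KEY, Abstract TEXT, Title TEXT);
CREATE TABLE Species           (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Micro_Prefix      (Prefix TEXT, Species_ID INTEGER);
CREATE TABLE Species_Mentions  (Abstract_ID INTEGER, Species_ID INTEGER,
                                Sentence_Num INTEGER, Word_Num INTEGER);
CREATE TABLE MicroRNA_Mentions (Abstract_ID INTEGER, Micro_ID TEXT,
                                Sentence_Num INTEGER, Word_Num INTEGER);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Storing Sentence_Num and Word_Num for both kinds of mention is what later allows a microRNA mention to be compared by position with nearby species mentions.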

13 Full Example – Original Abstract
• microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma.
• The Epstein-Barr virus (EBV) is an oncogenic human Herpes virus found in ~15% of diffuse large B-cell lymphoma (DLBCL). EBV encodes miRNAs and induces changes in the cellular miRNA profile of infected cells. MiRNAs are small, non-coding RNAs of ~19-26 nt which suppress protein synthesis by inducing translational arrest or mRNA degradation. Here, we report a comprehensive miRNA-profiling study and show that hsa-miR-424, -223, -199a-3p, -199a-5p, -27b, -378, -26b, -23a, -23b were upregulated and hsa-miR-155, -20b, -221, -151-3p, -222, -29b/c, -106a were downregulated more than 2-fold due to EBV-infection of DLBCL. All known EBV miRNAs with the exception of the BHRF1 cluster as well as EBV-miR-BART15 and -20 were present. A computational analysis indicated potential targets such as c-MYB, LATS2, c-SKI and SIAH1. We show that c-MYB is targeted by miR-155 and miR-424, that the tumor suppressor SIAH1 is targeted by miR-424, and that c-SKI is potentially regulated by miR-155. Downregulation of SIAH1 protein in DLBCL was demonstrated by immunohistochemistry. The inhibition of SIAH1 is in line with the notion that EBV impedes various pro-apoptotic pathways during tumorigenesis. The down-modulation of the oncogenic c-MYB protein, although counter-intuitive, might be explained by its tight regulation in developmental processes.

14 Full Example – Stopwords Removed
• Epstein-Barr virus EBV oncogenic human Herpes virus found 15 diffuse large B-cell lymphoma DLBCL
• …
• MiRNAs small non-coding RNAs 19-26 nt suppress protein synthesis inducing translational arrest mRNA degradation. we report comprehensive miRNA-profiling study show hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b upregulated hsa-miR-155 20b 221 151-3p 222 29b c 106a downregulated 2-fold due EBV-infection DLBCL
• …

15 Full Example – Stopwords Removed
• First normalise every full stop that is followed by whitespace to “. ” (a full stop plus a single space) and remove the final full stop:
  ◦ $abstract =~ s/([^\s])\.\s+/$1. /gm;
  ◦ $abstract =~ s/([^\s])\.\s*\Z/$1/gm;
  ◦ “Ph.D” will not be affected by this, because its full stop is not followed by whitespace
• Then split the text into word chunks:
  ◦ $abstract =~ /(([a-zA-Z0-9']+-)*[a-zA-Z0-9'\.]+)/g
  ◦ Remove each word that matches Lingua’s stopword list (James 2002).
  ◦ Essentially this algorithm splits the text into words but keeps hyphens, apostrophes and numbers.
  ◦ Most stopword algorithms remove numbers and hyphens, but they are essential for microRNA detection.
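The two steps on this slide can be sketched in Python roughly as follows. The regular expressions are direct ports of the Perl ones shown above; the tiny `STOPWORDS` set is a stand-in for Lingua's much larger list, and the function names are hypothetical.

```python
import re

# Stand-in for Lingua's stopword list, which is much larger.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "that", "the", "were"}

# Port of the slide's tokenising pattern: optional hyphen-joined parts,
# then a final part that may also contain full stops.
TOKEN_RE = re.compile(r"(?:[a-zA-Z0-9']+-)*[a-zA-Z0-9'.]+")


def normalise_full_stops(text):
    """Collapse whitespace after each full stop and drop the final full stop."""
    text = re.sub(r"(\S)\.\s+", r"\1. ", text)
    return re.sub(r"(\S)\.\s*\Z", r"\1", text)


def remove_stopwords(text):
    """Tokenise the text and drop stopwords, keeping hyphens and numbers."""
    tokens = TOKEN_RE.findall(normalise_full_stops(text))
    # A trailing full stop is ignored when comparing against the stopword set.
    return [t for t in tokens if t.lower().rstrip(".") not in STOPWORDS]
```

Because the token pattern allows hyphen-joined parts, names such as hsa-miR-424 and ranges such as 19-26 survive as single tokens.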

16 Full Example – Analysis
• These two lines from the text specify 17 different microRNAs:
  ◦ hsa-miR-424 223 199a-3p 199a-5p 27b 378 26b 23a 23b
  ◦ hsa-miR-155 20b 221 151-3p 222 29b c 106a
• The “hsa-” prefix confirms that this is a human sequence.
• If there are competing species in the same document, we use a distance function to decide which one to use, and keep the others as backups.
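The slides do not define the distance function, so the Python sketch below is only one plausible reading: positions are (sentence number, word number) pairs, and species are ranked by sentence gap first and word gap second. The `nearest_species` name and the metric itself are assumptions.

```python
def nearest_species(mention, species_mentions):
    """Rank species IDs by closeness to a microRNA mention.

    mention is a (sentence_num, word_num) pair; species_mentions is a list of
    (species_id, (sentence_num, word_num)) pairs. The first ID returned is the
    preferred species and the rest act as backups.
    """
    def distance(pos):
        # Assumed metric: sentence gap dominates, word gap breaks ties.
        return (abs(pos[0] - mention[0]), abs(pos[1] - mention[1]))

    best = {}
    for species_id, pos in species_mentions:
        d = distance(pos)
        # Keep only the closest mention of each species.
        if species_id not in best or d < best[species_id]:
            best[species_id] = d
    return sorted(best, key=best.get)
```

For example, a microRNA at sentence 3 with a human indicator in the same sentence and a mouse indicator two sentences earlier would rank human (9606) ahead of mouse (10090).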

17 Full Example – Detection
• This regular expression captures all microRNAs written in the standard format:
  ◦ m/^((([a-zA-Z]+-)?(mir|let)-?)[\d][\d\-a-z]*$)/mi
• For example:
  ◦ hsa-miR-27b
  ◦ hsa-miR-29b-1
  ◦ let-7b
  ◦ MIR298A
• It does not fully capture the following string:
  ◦ hsa-miR-424 -223
  ◦ It would only see the first microRNA, but miss 223
  ◦ My algorithm appends each number to the last-seen microRNA prefix if the number occurs immediately after a valid microRNA
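A hedged Python sketch of the detection step, including the prefix carry-over described in the last bullet. `MIRNA_RE` is a port of the slide's Perl pattern applied to one token at a time; `SUFFIX_RE`, the reset-on-any-other-token rule and the `detect_micrornas` name are assumptions about how "immediately after a valid microRNA" might be implemented.

```python
import re

# Port of the slide's pattern; group 2 is the prefix, e.g. "hsa-miR-" or "let-".
MIRNA_RE = re.compile(r"^((([a-zA-Z]+-)?(mir|let)-?)\d[\d\-a-z]*)$", re.IGNORECASE)

# Assumed shape of a bare suffix such as "223" or "199a-3p".
SUFFIX_RE = re.compile(r"^\d[\d\-a-z]*$", re.IGNORECASE)


def detect_micrornas(tokens):
    """Return full microRNA names, expanding bare suffixes with the last prefix."""
    found = []
    prefix = None
    for token in tokens:
        full = MIRNA_RE.match(token)
        if full:
            prefix = full.group(2)
            found.append(token)
        elif prefix is not None and SUFFIX_RE.match(token):
            found.append(prefix + token)
        else:
            # Any other token breaks the run, so later numbers are ignored.
            prefix = None
    return found
```

Under this rule a stray token such as the "c" left over from "29b/c" ends the run, so the sketch would miss the "106a" that follows it.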

18 Full Example – Real Detection
[Slide image not preserved in transcript. Caption: Missing Entries]

19 Full Example – Review
To review the effectiveness of this algorithm:
1. We will manually annotate a random selection of abstracts with correct microRNA information.
   • Pros: accurate; covers a wide selection of different types of writing
   • Cons: slow and laborious
2. We will do a reverse lookup from miRBase (which references PubMed IDs) and assume that each referenced abstract contains the microRNA from miRBase.
   • Pros: fast and automated
   • Cons:
     ◦ The microRNA might not be mentioned at all in the abstract (false negatives)
     ◦ The microRNAs are likely to be specified with their fully qualified names and so may not fully represent the target population.

20 Some Statistics
• There are 18,314 entries in my Abstracts table
  ◦ Of those, 17,231 have usable abstracts
• 48% of these abstracts contain species indicators.
• When the abstracts finished downloading (after 2 hours), there were already 16 new abstracts available.
• My database has 21,222 unique microRNAs listed from miRBase.
• There are 62,036 microRNA with no ambiguity in the abstracts. 53% of total detections were improved by the species detection.

21 References
• J. Imig, N. Motsch, J. Y. Zhu, S. Barth, M. Okoniewski, T. Reineke, M. Tinguely, A. Faggioni, P. Trivedi, G. Meister, C. Renner, and F. A. Grasser, “microRNA profiling in Epstein-Barr virus-associated B-cell lymphoma”, Nucleic Acids Res, vol. 39, no. 5, pp. 1880-1893, 2011.
• M. Gerner, G. Nenadic, and C. Bergman, “LINNAEUS: A species name identification system for biomedical literature”, BMC Bioinformatics, vol. 11, no. 1, p. 85, 2010.
• L. J. Jensen, J. Saric, and P. Bork, “Literature mining for the biologist: from information retrieval to biological discovery”, Nat Rev Genet, vol. 7, no. 2, pp. 119-129, 2006.
• A. Kozomara and S. Griffiths-Jones, “miRBase: integrating microRNA annotation and deep-sequencing data”, Nucleic Acids Research, vol. 39, no. suppl 1, pp. D152-D157, 2011.
• C. Sohngen, A. Chang, and D. Schomburg, “Development of a classification scheme for disease-related enzyme information”, BMC Bioinformatics, vol. 12, no. 1, p. 329, 2011.
• C.J. Sun, X.L. Wang, L. Lin, and Y.-C. Liu, “A multi-level disambiguation framework for gene name normalization”, Acta Automatica Sinica, vol. 35, no. 2, pp. 193-197, 2009.
• H. C. Wang, Y. H. Chen, H. Y. Kao, and S. J. Tsai, “Inference of transcriptional regulatory network by bootstrapping patterns”, Bioinformatics (Oxford, England), vol. 27, no. 10, pp. 1422-1428, 2011.

22 Questions
Any Questions?

