1 An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems. Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri. Presented by Thiago Pardo. USP NLP Group and UFSCar Database Group, São Carlos, BR

2 http://gbd.dc.ufscar.br Context and Motivation: Many electronic documents report experiments, covering the treatment adopted; patients with some kind of disease; the number of patients enrolled in the treatment; symptoms and risk factors; and positive and negative effects. There are several transactions and journals, e.g., American Journal of Hematology, Blood, and Haematologica. An Environment for Data Analysis - IEA-AIE2010 06/02/10 2/22

3 Context and Motivation: Nowadays, researchers and doctors are not able to process this huge number of documents.

4 Context and Motivation: These documents are in unstructured format, i.e., in plain textual form, especially in PDF. It is necessary to transform these data from unstructured to structured format in order to submit them to an automatic knowledge discovery process.

5 Goal: Development of an environment called IEDSS-Bio for analyzing data from the biomedical domain, i.e., Sickle Cell Anemia. The environment supports the expert in making decisions by extracting relevant information from biomedical documents; storing the information in a data warehouse (DW); and mining interesting knowledge from the DW.

6 Contributions. Theoretical: domain knowledge; methodology of information extraction. Practical: resources (collection of documents, dictionary, and rules); tools (Converter, Information Extraction, Data Warehouse, and Data Mining systems).

7 The Environment for Data Analysis. Examples shown on the slide: "How many patients had clinical improvement and were treated with the hydroxyurea drug?" and "A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression."

8 Converter Module

9 Converter Module

10 Information Extraction Module. Processed sections: Abstract, Results and Discussion (classes of positive and negative effects); all sections (class of patient).

11 Sentence Classification. Training: several files of complication sentences, several files of benefit sentences, and several files of other sentences. Test: a new text (TXT). Output: the set of sentences classified into the classes Positive Effect, Negative Effect, and Others.

12 Identification of Relevant Information (diagram labels: Dictionary; Biomedical Database)

13 Identification of Relevant Information (diagram labels: Identification of Information Pipeline; Example of Sentences; Relevant Information; Rules)
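The slides name a dictionary and a set of rules as the resources behind this step but do not show their contents. The following is only a minimal sketch of how dictionary lookup and a pattern rule could be combined. The dictionary entries, the labels, the regular expression, and the function name `extract` are all hypothetical and are not taken from IEDSS-Bio.

```python
import re

# Hypothetical dictionary: term -> label. Entries and labels are invented
# for illustration; the real dictionary is not shown in the slides.
DICTIONARY = {
    "hydroxyurea": "treatment",
    "marrow depression": "negative_effect",
}

# Hypothetical rule: an integer immediately followed by the word "patients".
PATIENT_COUNT_RULE = re.compile(r"\b(\d+)\s+patients\b", re.IGNORECASE)


def extract(sentence):
    """Return (label, value) pairs found by dictionary lookup and by the rule."""
    found = []
    lowered = sentence.lower()
    # Dictionary-based step: plain substring matching (it may also match
    # inside longer words; a real system would need proper tokenization).
    for term, label in DICTIONARY.items():
        if term in lowered:
            found.append((label, term))
    # Rule-based step: apply the pattern and keep the captured number.
    for match in PATIENT_COUNT_RULE.finditer(sentence):
        found.append(("number_of_patients", int(match.group(1))))
    return found


print(extract("Marrow depression was observed in 12 patients treated with hydroxyurea."))
```

Keeping the dictionary and the rules as data, separate from the matching code, would let a domain expert extend them without touching the extraction logic.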

14 Experiments: Sentence Classification. 1. How do humans manually perform sentence classification? 2. Is it feasible to automate the sentence classification task? 3. Which kind of classification algorithm performs best in this task?

15 Question 1: How do humans manually classify the sentences? Annotation agreement on 50 sentences, measured following Fleiss (1971).
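Fleiss (1971), cited on this slide, defines a kappa statistic for agreement among a fixed number of raters. As a rough sketch of that computation, assuming every sentence is labeled by the same number of annotators, the function below could be used. The function name and the toy labels are illustrative only and are not the authors' data.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`, a list of items where each item is the
    list of labels given by the raters (same number of raters per item)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of raters who assigned category c to item i
    counts = [Counter(item) for item in ratings]
    # Share of all assignments that went to each category
    p_cat = {
        c: sum(cnt[c] for cnt in counts) / (n_items * n_raters)
        for c in categories
    }
    # Observed agreement per item, then averaged over items
    p_item = [
        (sum(cnt[c] ** 2 for c in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(p_item) / n_items
    # Agreement expected by chance
    p_expected = sum(p ** 2 for p in p_cat.values())
    # Undefined (division by zero) when every rating falls in one category
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data (made up): 4 sentences, 3 annotators each
toy = [
    ["positive", "positive", "positive"],
    ["negative", "negative", "other"],
    ["other", "other", "other"],
    ["positive", "negative", "negative"],
]
print(fleiss_kappa(toy, ["positive", "negative", "other"]))
```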

16 Question 2: Is it feasible to automate this task?
Agreement over all the classes, by annotator group:
  3 experts: 0.63
  3 naïve subjects: 0.71
  experts + naïve subjects: 0.65
Agreement scale, Landis and Koch (1977):
  Poor: under 0
  Slight: 0 to 0.20
  Fair: 0.21 to 0.40
  Moderate: 0.41 to 0.60
  Substantial: 0.61 to 0.80
  Almost Perfect: 0.81 to 1
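To make the reading of this slide explicit, the Landis and Koch (1977) scale can be written as a small lookup. This is a sketch only: the function name is illustrative, and treating the bands as contiguous (for example, a value of 0.205 counted as Fair) is an assumption, since the scale lists bounds to two decimals.

```python
def landis_koch_label(kappa):
    """Map an agreement value to its band on the Landis and Koch (1977) scale."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost Perfect"


# The three agreement values reported on the slide
for group, value in [("3 experts", 0.63),
                     ("3 naive subjects", 0.71),
                     ("experts + naive subjects", 0.65)]:
    print(group, value, landis_koch_label(value))
```

All three reported values fall in the 0.61 to 0.80 band, i.e., Substantial agreement.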

17 Question 3: Which kind of classification algorithm performs best in this task? (table: distribution of classes for each sample)

18 Question 3: Sentence Classification Process, training and testing phase. Bag-of-words model. AVM configuration: minimum frequency = 2; attributes: 1- to 3-grams; attribute value 1 if the n-gram occurs in the sentence (present), 0 otherwise (absent). Not applied: stopword removal and stemming.
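The configuration on this slide (1- to 3-grams, binary presence values, minimum frequency of 2, no stopword removal or stemming) can be sketched roughly as follows. The whitespace tokenization, the lowercasing, and the reading of "minimum frequency" as total occurrences across the collection are assumptions; the slides do not specify them, and the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams of the token list for n in [n_min, n_max]."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocabulary(sentences, min_freq=2):
    """Keep n-grams occurring at least `min_freq` times in the collection."""
    freq = Counter()
    for sentence in sentences:
        # Assumed preprocessing: lowercase and split on whitespace only;
        # no stopword removal and no stemming, as on the slide.
        freq.update(ngrams(sentence.lower().split()))
    return sorted(gram for gram, count in freq.items() if count >= min_freq)


def to_binary_vector(sentence, vocabulary):
    """1 if the n-gram is present in the sentence, 0 if absent."""
    present = set(ngrams(sentence.lower().split()))
    return [1 if gram in present else 0 for gram in vocabulary]


# Toy sentences (made up for illustration)
train = ["hydroxyurea reduced pain", "hydroxyurea reduced crises"]
vocab = build_vocabulary(train)
print(vocab)
print(to_binary_vector("hydroxyurea caused pain", vocab))
```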

19 Question 3: Evaluation. Partitioning method: 10-fold cross-validation.
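As an illustration of the partitioning method named here, the sketch below splits sample indices into 10 folds and yields each fold once as the test set while the other nine form the training set. It is not stratified by class; whether the authors used stratification is not stated in the slides. The function name and the fixed seed are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the partition is reproducible
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample appears in exactly one test fold
for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

The reported figure for a classifier would then be the average of the per-fold scores over the 10 test sets.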

20 Conclusions. The proposed environment, Information Extraction and Decision Support System in Biomedical domain, aims at being a general environment for mining relevant information in the biomedical domain. The first experiments on sentence classification, one step of the whole process, gave very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA). The task of sentence classification in the SCA domain is well defined and can be automated.

21 Future Work. Investigate the identification of treatment and symptom information in scientific papers. Extract the relevant sentence pieces for populating our databases using IE approaches, e.g., rule-based and dictionary-based. Investigate the use of parallel processing to optimize the more time-consuming tasks, e.g., the application of data mining algorithms and analytical query processing. Other biomedical areas may also benefit from our text mining approach.

22 An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems. USP NLP Group and UFSCar Database Group, São Carlos, BR. Questions?

23 References
ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.
FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.
LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.
PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: http://sca.dc.ufscar.br/download/files/report.sca.pdf

