Ying He Wuhan University of Technology


1 Ying He Wuhan University of Technology
Bar charts detection and analysis in biomedical literature of PubMed Central
Session Title: Innovations in Information Retrieval
Session Number: S103
I am honored to be here to speak with you. The title of my presentation is “Bar charts detection and analysis in biomedical literature of PubMed Central”. First of all, I would like to introduce myself. My name is He Ying, and I come from Wuhan University of Technology.

2 Disclosure
My spouse/partner and I have no relevant relationships with commercial interests to disclose.
AMIA | amia.org

3 Learning Objectives
After participating in this session, the learner should be better able to:
understand how image informatics can help biomedical researchers
understand how to extract information from bar charts

4 Outline
The content is organized as follows:
Introduction
Method
Result
Conclusion

5 Introduction
An enormous increase in the amount of open-access heterogeneous biomedical image production and publication.
Biomedical image research includes:
Biomedical image retrieval
Biomedical image analysis
Biomedical image informatics
Fig.1 MEDLINE citation count.
Recently, there has been an enormous increase in the production and publication of open-access, heterogeneous biomedical images. This figure shows the growth of the MEDLINE citation count over the last six decades. The importance of information retrieval to the scientific community is well known, and biomedical image research, including image retrieval, analysis and informatics, is becoming an active area of study. Below, I present our approach to detecting and analyzing bar charts in biomedical publications.

6 Introduction
summarize experimental results
present multi-faceted data sets
most common type of subfigure
share common patterns
Fig.2 Four typical examples of bar charts.
Bar charts are crucial for summarizing experimental results and presenting multi-faceted data sets in biomedical publications. The bar chart is also the most common type of subfigure according to Kutn’s statistics, accounting for 12.4% of the entire set of images. Furthermore, bar charts share some common patterns. All these reasons reinforce our determination to study bar charts. This figure shows four typical bar chart images. Note the slight differences in the text and the bars: whether there is a horizontal grid, the direction of the characters, the direction of the bars, and the style of the legends.

7 Introduction
Fig.3 Two examples of bar charts from biomedical publications (PMC and PMC )
Figure 3 shows two common bar charts together with a table representation of the extracted information. Since the primary goal of our approach is to automatically extract the relation between the categorical data and the corresponding quantitative proportions, we focus on the text information as well as the relative lengths of the bars. A bar chart can be considered a kind of matrix with pictures of experimental artifacts as content. The tables to the right illustrate the semantic relations encoded in the bar charts. Each relation instance consists of a condition, a measurement and a result.

8 Method
Fig.4 The procedure of our bar charts mining method. (from PMC )
Our approach to image mining from bar charts consists of 6 components: figure extraction, image preprocessing, bar segment detection, in-image text recognition, panel segmentation, and quantitative information extraction.

9 Method
Figure extraction
the data interface (OA web service) is manually accessed in PubMed (PDF/NXML format articles, GIF/JPEG format images)
Image preprocessing
size
color
the number of connected domains
In the figure extraction step, documents in different formats are downloaded through the data interface in PubMed. We deal mainly with JPEG images. In the image preprocessing module, non-informative figures are removed to reduce computing cost. Features including size, color, and the number of connected domains are used to filter out non-informative figures, such as formulas and conference and journal logos.
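The preprocessing filter described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the 4-connectivity choice, and the thresholds `min_side` and `min_components` are all assumptions, and the color feature is left out because the talk does not say how it is used.

```python
from collections import deque


def count_connected_components(binary):
    """Count 4-connected foreground regions in a 2D list of 0/1 values."""
    rows = len(binary)
    cols = len(binary[0]) if binary else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # New region found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def is_informative(width, height, n_components,
                   min_side=100, min_components=2):
    """Keep a figure only if it is large enough and has enough regions.

    Both thresholds are hypothetical placeholders, not values from the talk.
    """
    if min(width, height) < min_side:
        return False  # tiny images are likely logos or inline formulas
    return n_components >= min_components
```

In practice the binary grid would come from thresholding the JPEG image; that step is omitted here to keep the sketch free of external dependencies.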

10 Method
Bar segment detection
Fig.5 hand-coded method
Table 1. CNN method
dataset          Training: 12000 (+), 4000 (-); Test: 3000 (+), 1500 (-)
vector size      28*28
learning rate    1
batch size       50
iterations       800
precision        93.80%
For bar segment detection, we use a procedure based on hand-coded rules and a convolutional neural network (CNN). As a baseline, we first propose a relatively simple hand-coded bar segment detection method. Bar segments typically consist of several bars of the same width distributed uniformly along the x-axis, which is the most distinctive feature of a bar chart. For this reason, a projection method is used to detect such bar segments. An alternative is to use a machine learning approach to image classification. This table shows the parameters we chose.
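The projection idea behind the hand-coded baseline can be illustrated with a short sketch. It assumes a binarized image given as a 2D list of 0/1 values with vertical bars; the helper names and the `width_tolerance` parameter are hypothetical, and the actual rules used in the talk may differ.

```python
def column_projection(binary):
    """Number of foreground pixels in each column of a 2D list of 0/1 values."""
    return [sum(column) for column in zip(*binary)]


def find_runs(profile, min_height=1):
    """Return (start, end) column ranges, end exclusive, where profile >= min_height."""
    runs = []
    start = None
    for x, value in enumerate(profile):
        if value >= min_height and start is None:
            start = x
        elif value < min_height and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def looks_like_bars(runs, min_bars=2, width_tolerance=1):
    """Heuristic: several runs whose widths agree within width_tolerance pixels."""
    if len(runs) < min_bars:
        return False
    widths = [end - start for start, end in runs]
    return max(widths) - min(widths) <= width_tolerance


def bar_heights(profile, runs):
    """Tallest column inside each run, as a proxy for the bar height in pixels."""
    return [max(profile[start:end]) for start, end in runs]
```

Projecting onto the x-axis turns a two-dimensional search into a one-dimensional one: equal-width runs separated by gaps are cheap to find, and the same profile also gives relative bar heights.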

11 Method
In-image text recognition
Text location (position of axes, character spacing and character size, and the character region)
OCR (open-source tool OCROPY)
Text correction (Levenshtein distance between in-image text and the caption)
Bar chart segmentation
A rectangle segmentation algorithm is used.
Quantitative information extraction
recognition of the x-axis and the height of each bar
For in-image text processing, we first locate the text. The coordinate axes are used to partition the region. Then the open-source tool OCROPY is used to recognize the image text. Finally, the Levenshtein distance between the in-image text and the caption is computed to correct the recognized text. The panel segmentation module then combines the results of the previous two modules, and a rectangle segmentation algorithm is used to extract the potential sub bar charts contained in the figures. For quantitative information extraction, we first remove all the graduated lines, including the x-axis, to separate the bars; the bars are then filled in to make them all solid. Finally, the projection method used in bar segment detection is applied again to obtain the number of bars and the corresponding quantitative information.
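The text correction step can be sketched as below. The talk only says that the Levenshtein distance between in-image text and the caption is used; matching each OCR token against individual caption words, lowercasing before comparison, and the `max_distance` cutoff are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def correct_token(token, caption_words, max_distance=2):
    """Replace an OCR token with the closest caption word if it is close enough.

    max_distance is a hypothetical threshold; the token is returned
    unchanged when no caption word is within that distance.
    """
    best = min(
        caption_words,
        key=lambda word: levenshtein(token.lower(), word.lower()),
        default=None,
    )
    if best is not None and levenshtein(token.lower(), best.lower()) <= max_distance:
        return best
    return token
```

For example, an OCR output such as `contro1` would be mapped to the caption word `control`, while a token with no near match in the caption is left as recognized.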

12 Result
To test our approach, we created a gold standard corpus of images.
300 open-access article folders, 1769 figures, an average of about 6 figures per article
534 (32.2%) images containing at least one bar chart, and 1659 sub bar charts in total
Table 2. Evaluation of our approach.
Task                     Method              Precision   Recall   F-measure
Bar segment detection    Hand-coded          0.9547      0.5908   0.7299
Bar segment detection    CNN                 0.8945      0.8545   0.8691
Bar segment detection    Hand-coded + CNN    0.9770      0.8115   0.8866
Panel segmentation                           0.9811      0.5273   0.6859
Information extraction                       0.8240      0.3981   0.5368
To test our approach, we created a gold standard corpus of images. We randomly selected a sample of 300 open-access article folders, which contain 1769 figures. In our manual annotation, 534 images contain bar charts, and the number of sub bar charts is 1659. The analysis demonstrates that bar charts are indeed a very important image type in the biomedical literature. The first part of the table shows the results of bar segment detection. The CNN classifier combined with the hand-coded algorithm achieves the highest precision and the best F-measure, while the CNN classifier alone achieves the highest recall. The next row shows the result of panel segmentation. We define a complete panel segmentation as one that segments the panel together with the x-label, the y-label and the legends. Our method achieves a precision of 98.11% for panel segmentation at a recall of 52.73%, which leads to an F-measure of 68.59%. The last row shows the result of information extraction. If the information extracted from a bar chart can fill all the columns of the corresponding table, it is counted as a correct extraction. About 39.81% of the bar panels are extracted, and 82.40% thereof are correct.
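The F-measure reported above appears to be the harmonic mean of precision and recall (the usual F1 definition, which the talk does not state explicitly). A short sketch under that assumption, using the panel segmentation figures quoted in the notes:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Panel segmentation: precision 98.11%, recall 52.73%
panel_f = f_measure(0.9811, 0.5273)
print(f"{panel_f:.4f}")  # prints 0.6859
```

Because the harmonic mean is dominated by the smaller value, the low recall pulls the F-measure well below the precision for panel segmentation and information extraction.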

13 Result
We obtain a subset for our future use case by searching the keyword combination of “soybean,” “gene” and “expression.”
Table 3. The results of running the pipeline on the open access subset of PubMed Central.
Total articles                           14596
Processed articles                       11973
Total figures from processed articles    80378
Processed figures                        61238
Detected bar charts                      44537
Table 3 shows the results of running the pipeline on the subset of PubMed Central. We start with 14596 article folders. About 18% of them are discarded because they contain no article in XML format or no JPG image. The remaining articles contain 80378 figures. To reduce computational cost, non-informative figures are filtered out in the image preprocessing step. We end up with 61238 figures, in which 44537 bar charts are detected.

14 Conclusion
We have developed a comprehensive system for automatically extracting information from bar charts.
The automatic analysis of vector diagrams seems to be an efficient way to extract such relations from existing publications in the future.
We have developed a comprehensive system for automatically extracting information from bar charts. The results show that the hand-coded algorithm and the CNN method we proposed can detect bar segments with high accuracy. We also show that relations and quantitative information can be extracted from bar charts with satisfactory precision. The low resolution of the images is the most important factor affecting the results. The automatic analysis of vector diagrams therefore seems to be an efficient way to extract such relations from existing publications in the future.

15 AMIA is the professional home for more than 5,400 informatics professionals, representing frontline clinicians, researchers, public health experts and educators who bring meaning to data, manage information and generate new knowledge across the research and healthcare enterprise.

16 Thank you!

