Presentation is loading. Please wait.

Presentation is loading. Please wait.

Developing systems for full-text search in biomedicine. Anna Divoli School of Information University of California, Berkeley 07 Aug 2007 University of.

Similar presentations


Presentation on theme: "Developing systems for full-text search in biomedicine. Anna Divoli School of Information University of California, Berkeley 07 Aug 2007 University of."— Presentation transcript:

1 Developing systems for full-text search in biomedicine. Anna Divoli School of Information University of California, Berkeley 07 Aug 2007 University of Manchester

2 outline motivation how to go about it first study biotext search engine future work current studies (brief)

3 motivation All prominent literature search systems are based on information from abstracts alone when: Researchers in the area of text-mining have started to investigate approaches for full-text analysis. PubMed Central has made available a large collection of full- text articles (Open Access), overcoming licensing restrictions from the publishers. Figure captions and figures can be especially useful for locating experimental results (e.g., 2002 KDD).

4 how to go about it Search engine interface that meets the unique needs of bioscientists  Provide flexible, useful, appealing search for bioscientists Provide abstract? Full text? Figures? Expansions? Links? How? User-centered search interface design. Simple design Usability studies Iteration Design Prototype Evaluate

5 hci principles Design for the user, not for the designers or the system Needs assessment:who users are what their goals are what tasks they need to perform Task analysis:characterize what steps users need to take create scenarios of actual use decide which users and tasks to support Iterate between: designing & evaluating

6 hci principles - cont. Make use of cognitive principles where available Important guidelines:Reduce memory load Speak the user’s language Provide helpful feedback Respect perceptual principles Prototypes: Get feedback on the design faster Experiment with alternative designs Fix problems before code is written Keep the design centered on the user

7 first study - goals Marti A. Hearst, Anna Divoli, Jerry Ye and Michael A. Wooldridge (2007) “Exploring the efficacy of caption search for bioscience journal search interfaces” ACL 2007 Workshop on BioNLP, Prague, Czech Republic Primary Goal: Determine whether biological researchers would find the idea of caption search and figure display to be useful or not ( evidence that figures and their captions are very informative). Secondary Goal: Should caption search and figure display be useful, how best to support these features in the interface.

8 first study - description When introducing a new search interface idea, great care must be taken to get the details right. Practiced user-centered design: first prototype, then test the results with potential users, then refine the design based on their responses, and repeat…  Tested a few designs. ~1hour sessions with participants. Standardized questions & open discussions (design & content).

9 full text first study - designs full text with figure caption & figure caption, figure & small thumbnails grid

10 first study - outcomes 7 out of 8 in favor of caption-search & figure-display. Different views serve different roles (for general concepts, abstract search would be more suitable but for a specific method, caption view would be better). Best to show all the thumbnails as the result of a full-text or abstract-text search. Info as few clicks away as possible with as few ‘distracting’ links & options. More metadata for the grid view. All participants favored the ability to browse all figures from a paper once they find the abstract or one of the figures relevant to their query.

11 first study - outcomes cont. Caption-Figure design x-axis: participant # y-axis: 7 = strongly agree 1 = strong disagree

12 biotext search engine Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge and Jerry Ye (2007) “BioText Search Engine: beyond abstract search” Bioinformatics; doi:10.1093/bioinformatics/btm301

13 biotext search engine - the views The interface is carefully designed according to usability principles and techniques. The BioText Search Engine allows users to search in: Abstracts - list view Captions - list view Captions - grid view Clicking on any figure opens a new window with a large version of the figure accompanied with its caption.

14 abstracts - list view Searches in:TITLES ABSTRACTS AUTHOR NAMES Returns in a list:TITLE CITATION ABSTRACT thumbnails of FIGURES from the paper HTML & PDF links to the paper link to the “Endgame view”

15

16 captions - list view Searches in:CAPTIONS Returns in a list:TITLE CITATION CAPTION corresponding FIGURE HTML & PDF links to the paper link to the “Endgame view”

17

18 captions - grid view Searches in:CAPTIONS Returns:corresponding FIGURES in a grid short CAPTION excerpt CITATION in tooltip link to the “Endgame view”

19

20 endgame view All views lead to the Endgame view. The “Endgame view” displays a summary of a paper that an abstract or caption originates. This summary comprises of: TITLE CITATION ABSTRACT all FIGURES and corresponding CAPTIONS of the paper in a list HTML & PDF links to the paper

21

22 technical details Indexes all Open Access articles available at PMC - collection consists of more than 150 journals, 20,000 articles, and 80,000 figures (new articles are downloaded and indexed daily). The figures are stored locally (to present thumbnails quickly). The Lucene open source search engine issued to index, retrieve, and rank the text (using default statistical ranking). Publication date is stored as a separate field and can also be used to sort the results. The interface is web-based and is implemented in python and PHP, logs and other information are stored using MySQL.

23 future work The search engine is a work in progress. More functionality will be added over time. We plan to: Provide full-text search. (Since the usability of different ranking functions for biology articles is still not well-understood, we plan to do usability testing, research how different sections should be weighted differently for different query types and investigate how best to show excerpts or summaries from full text before supporting this feature.) Augment the caption search by indexing the parts of the full text that refer to the caption. Provide search over table captions.

24 future work cont. Incorporate topical features such as genes/proteins and organisms. For the grid view, we plan to provide grouping according to categories that are of interest to biologists, such as “sequence alignments” and “phylogenetic trees”. (We are building a classifier for figures and their captions. We have developed an image annotation interface and are soliciting help with hand-labeling mated caption classifier.) Additional future developments on the BioText Search Engine will depend on feedback and requests we receive from users, and the results of extensive usability testing.

25 labeling interface

26

27

28

29 current study in brief (online surveys) First part: Biological Information Preferences Second part: Gene/Protein Name Expansion Preferences Whether or not bioscience literature searchers wish to see related term suggestions, in particular, gene and protein names We plan to assess presentation of other results of text analysis, such as the entities corresponding to diseases, pathways, gene interactions, localization information, function information, and so on. Assess the usability of one feature at a time, see how participants respond, and then test out other features

30 results of current study in brief (based on 38 responses, numerous specializations) Related Information Type Avg rating # selecting 1 or 2 Gene’s Synonyms4.4 2 Gene’s Synonyms refined by organism4.0 2 Gene’s Homologs3.75 Genes from same family: parents 3.4 7 Genes from same family: children 3.6 4 Genes from same family: siblings 3.2 9 Related Information Type Avg rating # selecting 1 or 2 Genes this gene interacts with3.74 Diseases this gene is associated with 3.46 Chemicals/drugs this gene is associated with 3.2 8 Localization information for this gene 3.73 1 2 3 45 (Do NOT want this) (Neutral) (REALLY want this)

31 results of current study in brief - cont. Strong desire for the search system to suggest information closely related to gene/protein names. Some interest in less closely related information. Most participants want to see organism names in conjunction with gene names. A majority of participants prefer to see term suggestions grouped by type. Split in preference between single-click hyperlink interaction and checkbox-style interaction. Need to experiment with hybrid designs, e.g., checkboxes for the individual terms and a link that immediately adds all terms in the group and executes the query. Adding more information will require a delicate balancing act between usefulness and clutter.

32 acknowledgements Marti Hearst Mike Wooldridge Jerry Ye Preslav Nakov Harendra Gututru Supported by NSF DBI-0317510 Available at: http://biosearch.berkeley.edu


Download ppt "Developing systems for full-text search in biomedicine. Anna Divoli School of Information University of California, Berkeley 07 Aug 2007 University of."

Similar presentations


Ads by Google