
1 Cis-Regulatory/Text Mining Interface Discussion

2 Questions
(1) What does ORegAnno want from text mining?
    – Curation queue
    – Document mark-up
    – Mapping to database IDs
(2) What does text mining need from ORegAnno?
(3) What can text mining provide?
    – What level of performance is needed?
(4) What is the right way to proceed?
    – Data sets for BioCreAtIvE?
    – Custom tools for individual “early adopters”?

3 Answers: (1) What Does ORegAnno Want from Text Mining?
Management of the curation queue
    – Ideally user-customized, so that each user annotates the documents of immediate interest to her/him
Document mark-up to highlight relevant passages
    – A workflow pipeline making either the HTML or PDF version of the document available, with the (potentially) relevant terms highlighted
    – Support for “cut and paste” transfer of relevant regions to the database comments fields
Mapping to IDs and ontology codes (see the record sketch below)
    – Gene, transcription factor (protein), organism, cell and tissue type, evidence types
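Taken together, these three wishes amount to a concrete record type that a text-mining tool would hand to a curator. The sketch below is purely illustrative: the class and field names are assumptions made for this discussion, not the actual ORegAnno schema.

    # Illustrative only: these field names are assumptions, not the real ORegAnno schema.
    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class CandidateAnnotation:
        """One text-mined suggestion queued for a curator to accept, edit, or reject."""
        pmid: str                                # source document identifier
        gene_id: Optional[str] = None            # e.g. an Ensembl/Entrez gene identifier
        tf_id: Optional[str] = None              # transcription factor (protein) identifier
        organism: Optional[str] = None           # e.g. an NCBI Taxonomy identifier
        cell_or_tissue: Optional[str] = None     # cell or tissue type term
        evidence_code: Optional[str] = None      # evidence type, as an ontology code
        evidence_passage: str = ""               # highlighted text the curator can cut and paste
        highlight_spans: List[Tuple[int, int]] = field(default_factory=list)  # character offsets in the document

    # A user-customized curation queue is then just an ordered list of these records,
    # filtered by the genes/organisms a given curator cares about.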

4 Answers: (2) What Does Text Mining Need from ORegAnno?
A significant quantity of reliably annotated data to train text mining systems
    – Annotated at a level useful for natural language processing (e.g., marked for evidence at the phrase, sentence, or passage level, depending on the task)
This requires that ORegAnno have:
    – A clear statement of the scope of the ORegAnno database and a stable set of annotation guidelines
    – Annotations with high inter-annotator agreement (see the sketch below)
    – Tracking of entries by annotator, including depth of annotation (different annotators will annotate to different levels of detail, depending on their interests)
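As one concrete reading of “high inter-annotator agreement”: for sentence-level evidence labels, agreement can be summarized with a chance-corrected statistic such as Cohen's kappa. The sketch below is a minimal illustration; the label set and example data are invented.

    # Minimal sketch: Cohen's kappa for two annotators labelling the same sentences.
    # The label set and example data are invented for illustration.
    from collections import Counter

    def cohen_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two equal-length label sequences."""
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
        return (observed - expected) / (1 - expected)

    # Sentence-level "is this sentence evidence for the annotation?" labels
    annotator_1 = ["evidence", "none", "evidence", "none", "none"]
    annotator_2 = ["evidence", "none", "none", "none", "none"]
    print(round(cohen_kappa(annotator_1, annotator_2), 2))  # ~0.55; 1.0 = perfect, 0 = chance level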

5 Answers: (3) What Can Text Mining Provide?
Curation queue management
    – Document classification approaches (e.g., from TREC Genomics or BioCreAtIvE) can be applied and evaluated, making use of new training data from pre-jamboree and jamboree annotation (see the classifier sketch below)
    – We can experiment with “user-defined” criteria, based on restrictions for gene, transcription factor, organism, tissue, etc.
Document mark-up
    – Users could be provided with a list of genes/transcription factors in a paper, with hot links into the paper to find the relevant passages
    – This would allow the annotator to drive the annotation process, selecting only those annotations that are correct and relevant; this in turn provides feedback, using ORegAnno annotations to validate and train the text mining
    – Such a tool should make it easy for the annotator to provide the underlying text passages as evidence for the annotation, yielding more training data
Mapping to unique identifiers/controlled vocabulary/ontology
    – For each entity type (gene, transcription factor, organism, tissue type, ...), a tool can propose a mapping to the correct identifier; where there is possible ambiguity, the tool could offer a ranked list for the annotator to choose from
    – A tool can also flag different evidence types, with suggested code(s)
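To make the first bullet concrete, a minimal triage classifier might look like the sketch below. It assumes a small set of abstracts already labelled relevant/irrelevant by curators; the training data shown is invented, and any real system would be trained and evaluated on the jamboree corpus.

    # Minimal sketch of curation-queue triage via document classification.
    # The labelled abstracts are invented; a real model would train on jamboree data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_abstracts = [
        "The promoter region binds transcription factor X and drives reporter expression.",
        "We report a randomized clinical trial of drug Y in patients with condition Z.",
    ]
    train_labels = ["relevant", "irrelevant"]  # curator-assigned triage labels

    triage = make_pipeline(TfidfVectorizer(stop_words="english"),
                           LogisticRegression(max_iter=1000))
    triage.fit(train_abstracts, train_labels)

    # Score new papers; the highest-probability "relevant" ones go to the front of the queue.
    new_abstracts = ["A cis-regulatory module upstream of gene G is bound by factor F."]
    print(triage.predict(new_abstracts))
    print(dict(zip(triage.classes_, triage.predict_proba(new_abstracts)[0])))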

6 Answers: (4) How to Proceed?
Stabilize the annotation guidelines and redo the inter-annotator agreement experiment (and write it up)
Prepare a gold-standard data set of expert-annotated data for training new annotators
Collect a sufficient amount of training data for the various tasks (queue management, document mark-up, automated mapping)
Develop an end-to-end pipeline (in the style of the FlySlip project) to capture whole documents in machine-readable form for mark-up; a skeleton of such a pipeline is sketched below
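The shape of that pipeline could be as simple as the skeleton below. Every function is a placeholder to be filled in; FlySlip is only the stylistic model named on the slide, and none of this code comes from it.

    # Skeleton only: every function is a placeholder for a real component.
    def fetch_fulltext(pmid: str) -> str:
        """Retrieve the HTML or PDF full text and convert it to plain text."""
        raise NotImplementedError  # e.g. publisher feed + PDF-to-text conversion

    def find_candidate_entities(text: str) -> list:
        """Tag candidate genes, transcription factors, organisms, and tissues."""
        raise NotImplementedError  # e.g. a tagger trained on the jamboree data

    def render_markup(text: str, entities: list) -> str:
        """Produce a highlighted view for the curator, with hot links per entity."""
        raise NotImplementedError

    def queue_for_curation(pmid: str) -> str:
        """Run the whole pipeline for one paper and return the mark-up to display."""
        text = fetch_fulltext(pmid)
        entities = find_candidate_entities(text)
        return render_markup(text, entities)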

7 Recommendations: Training Materials & Tools
Case studies and gold-standard annotated articles
On-line training
    – Perhaps with a way for new annotators to test themselves against a set of gold-standard annotations
    – This will require automated comparison of annotations for certain fields (see the sketch below)
Links to the best tools
Tools:
    – A copy mechanism for largely duplicated records
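That automated comparison could start as a field-by-field diff of a trainee's record against the gold standard. The sketch below reuses the illustrative field names from the earlier record sketch; both example records are invented.

    # Minimal sketch: field-by-field comparison of a trainee record against the gold standard.
    # Field names and both example records are illustrative, not real ORegAnno entries.
    def compare_annotations(trainee: dict, gold: dict,
                            fields=("gene_id", "tf_id", "organism", "evidence_code")):
        """Report, per field, whether the trainee matched the gold-standard value."""
        return {f: "match" if trainee.get(f) == gold.get(f)
                else f"expected {gold.get(f)!r}, got {trainee.get(f)!r}"
                for f in fields}

    trainee = {"gene_id": "ENSG00000139618", "tf_id": "TP53",
               "organism": "9606", "evidence_code": "transient transfection"}
    gold    = {"gene_id": "ENSG00000139618", "tf_id": "TP53",
               "organism": "9606", "evidence_code": "EMSA"}
    print(compare_annotations(trainee, gold))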

