BeeSpace Informatics Research


1 BeeSpace Informatics Research
ChengXiang (“Cheng”) Zhai, Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science, University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 22, 2009.

2 Overview of BeeSpace Technology
(Architecture diagram, from users down to raw text:) Users; Task Support (Gene Summarizer, Function Annotator) and Space Navigation (Space/Region Manager, Navigation Support); Search Engine, Text Miner, Relational Database; Words/Phrases, Entities; Content Analysis (Natural Language Understanding); Meta Data, Literature Text.

3 Part 1: Content Analysis

4 Natural Language Understanding
…We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to … (Annotated in the figure with noun phrases (NP), verb phrases (VP), and two gene mentions.)

5 Sample Technique 1: Automatic Gene Recognition
Syntactic clues: capitalization (especially acronyms); numbers (gene families); punctuation such as -, /, :, etc. Contextual clues: local (surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc.); global (the same noun phrase occurring several times in the same article).

6 Maximum Entropy Model for Gene Tagging
Given an observation (a token or a noun phrase) together with its context, denoted as x, predict y ∈ {gene, non-gene}. Maximum entropy model: P(y|x) = K exp(Σ_i λ_i f_i(x, y)), where K is a normalizing constant. Typical features f: y = gene & the candidate phrase starts with a capital letter; y = gene & the candidate phrase contains digits. The weights λ_i are estimated from training data.
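
To make the model concrete, here is a minimal sketch of a maximum-entropy (logistic-regression) gene/non-gene classifier over candidate phrases; the feature functions, toy training examples, and the use of scikit-learn are illustrative assumptions, not the actual BeeSpace tagger.

```python
# Minimal maximum-entropy (logistic regression) gene/non-gene classifier.
# Feature functions f_i(x, y) and training data are illustrative only.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(phrase, context):
    """Binary features for one candidate phrase and its surrounding words."""
    return {
        "starts_capitalized": phrase[0].isupper(),
        "contains_digit": any(c.isdigit() for c in phrase),
        "contains_hyphen": "-" in phrase,
        "next_word_is_gene": context.get("next") == "gene",
        "prev_word_is_encoding": context.get("prev") == "encoding",
    }

# Toy training data: (candidate phrase, context, label); 1 = gene, 0 = non-gene.
train = [
    ("AMUSP", {"prev": "encoding", "next": "gene"}, 1),
    ("cDNA", {"prev": "a", "next": "encoding"}, 0),
    ("Apis mellifera", {"prev": "sequenced", "next": "ultraspiracle"}, 0),
    ("period", {"prev": "the", "next": "gene"}, 1),
]

vec = DictVectorizer()
X = vec.fit_transform([features(p, c) for p, c, _ in train])
y = [label for _, _, label in train]

# L2-regularized logistic regression is a maximum entropy model; the learned
# coefficients play the role of the lambda_i weights estimated from training data.
model = LogisticRegression(max_iter=1000).fit(X, y)

print(model.predict_proba(vec.transform([features("AmE-94", {"next": "gene"})])))
```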

7 Domain overfitting problem
When a learning-based gene tagger is applied to a domain different from the training domain(s), performance tends to decrease significantly. The same problem occurs in other types of text, e.g., named entities in news articles.

Training domain   Test domain   F1
mouse             mouse         0.541
mouse             fly           0.281
Reuters           Reuters       0.908
Reuters           WSJ           0.643

8 Observation I Overemphasis on domain-specific features in the trained model. Example: fly gene names such as wingless, daughterless, eyeless, and apexless all end in -less, so the feature “suffix -less” is weighted high in the model trained from fly data.

9 Observation II Generalizable features: generalize well in all domains
…decapentaplegic and wingless are expressed in analogous patterns in each primordium of… (fly) …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. (mouse)

10 Observation II (continued)
The feature w_{i+2} = “expressed” (the word “expressed” appearing two tokens to the right of the candidate) is generalizable: it signals a gene name in both the fly and the mouse examples above.

11 Generalizability-based feature ranking
(Illustration.) Each training domain (e.g., fly, mouse, …) produces its own best-first ranking of features. A domain-specific feature such as “suffix -less” ranks near the top only in the fly ranking, while a generalizable feature such as “w_{i+2} = expressed” ranks high in every domain’s ranking. Combining each feature’s ranks across the training domains into a single generalizability score therefore places “expressed” above “-less” in the final feature ranking.

12 Adapting Biological Named Entity Recognizer
(Flowchart.) Features are first ranked separately within each training domain D_1, …, D_m, then re-ranked by generalizability to separate generalizable features from domain-specific ones. The final set of d features combines the top d_0 generalizable features with the top d_1, …, d_m domain-specific features, weighted as d = λ_0 d_0 + (1 − λ_0)(λ_1 d_1 + … + λ_m d_m). An entity recognizer is then learned from the training data using the selected features and applied to the test data T_1, …, T_m to produce the outputs O_1, …, O_m.
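
A minimal sketch of the generalizability-based re-ranking idea from slides 11-12, assuming a best-first feature ranking is already available from each training domain; the mean-reciprocal-rank combination and all feature names are illustrative assumptions, not necessarily the exact scoring used in BeeSpace.

```python
# Sketch of generalizability-based feature re-ranking: features that rank high
# in every training domain (e.g., "w+2=expressed") are promoted over
# domain-specific ones (e.g., "suffix=-less"). Rankings here are illustrative.
def generalizability_ranking(domain_rankings):
    """domain_rankings: list of per-domain feature lists, each sorted best-first.
    Returns features sorted by mean reciprocal rank across all domains."""
    reciprocal_ranks = {}
    for ranking in domain_rankings:
        for rank, feat in enumerate(ranking, start=1):
            reciprocal_ranks.setdefault(feat, []).append(1.0 / rank)
    # A feature missing from a domain's ranking contributes nothing for that domain.
    n_domains = len(domain_rankings)
    score = {f: sum(rr) / n_domains for f, rr in reciprocal_ranks.items()}
    return sorted(score, key=score.get, reverse=True)

fly_ranking = ["suffix=-less", "w+2=expressed", "contains_digit"]
mouse_ranking = ["w+2=expressed", "contains_digit", "prev=gene"]
print(generalizability_ranking([fly_ranking, mouse_ranking]))
# "w+2=expressed" comes out ahead of the fly-specific "suffix=-less"
```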

13 Effectiveness of Domain Adaptation
Exp      Method     Precision  Recall  F1
F+M→Y    Baseline   0.557      0.466   0.508
         Domain     0.575      0.516   0.544
         % Imprv.   +3.2%      +10.7%  +7.1%
F+Y→M    Baseline   0.571      0.335   0.422
         Domain     0.582      0.381   0.461
         % Imprv.   +1.9%      +13.7%  +9.2%
M+Y→F    Baseline   0.583      0.097   0.166
         Domain     0.591      0.139   0.225
         % Imprv.   +1.4%      +43.3%  +35.5%
Text data from BioCreAtIvE (Medline); 3 organisms (Fly, Mouse, Yeast).

14 Gene Recognition in V3
A variation of the basic maximum entropy approach. Classes: {Begin, Inside, Outside}. Features: syntactic features, POS tags, and the class labels of the previous two tokens. Post-processing is used to exploit global features. Leverages an existing toolkit: BMR.
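
As a small illustration of the Begin/Inside/Outside labeling scheme, the sketch below turns per-token B/I/O labels into gene mention strings; the function, tokens, and labels are illustrative assumptions, not part of the actual BMR-based V3 pipeline.

```python
# Sketch: recover gene mentions from per-token Begin/Inside/Outside labels.
# Tokens and labels are made-up examples, not output of the actual V3 tagger.
def bio_to_mentions(tokens, labels):
    mentions, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":                 # a new gene mention starts here
            if current:
                mentions.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:   # continue the current mention
            current.append(tok)
        else:                          # "O" (or a stray "I") closes any open mention
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["the", "Amfor", "gene", "encodes", "a", "cGMP-dependent", "protein", "kinase"]
labels = ["O",   "B",     "O",    "O",       "O", "O",              "O",       "O"]
print(bio_to_mentions(tokens, labels))   # ['Amfor']
```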

15 Part 2: Navigation Support

16 Space-Region Navigation
(Diagram.) Literature spaces (e.g., Bee, Bird, Fly, Behavior) and the user’s regions/topics (e.g., Bird Singing, Fly Rover, Bee Forager) are linked by two operations: EXTRACT derives topic regions from a space, and MAP projects a topic/region onto a space. Users can also SWITCH between spaces and combine spaces or regions with intersection, union, etc.
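
As a toy illustration of the intersection/union operations, the sketch below treats spaces and regions simply as sets of document IDs; this representation and all IDs are assumptions for illustration, not the actual BeeSpace data model.

```python
# Toy model: spaces and regions as sets of document IDs, so the diagram's
# intersection/union operations reduce to set operations. IDs are made up.
bee_space = {"doc1", "doc2", "doc3", "doc7"}
behavior_space = {"doc2", "doc3", "doc5"}
bee_forager_region = {"doc2", "doc7"}

union_space = bee_space | behavior_space                   # union of two spaces
overlap_space = bee_space & behavior_space                 # intersection of two spaces
region_in_behavior = bee_forager_region & behavior_space   # restrict a region to a space
print(union_space, overlap_space, region_in_behavior)
```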

17 MAP: Topic/Region → Space
MAP: use the topic/region description as a query to search a given space. Retrieval algorithm: estimate a query word distribution p(w|Q) and a document word distribution p(w|D), and score each document by the similarity of Q and D. Leverages existing retrieval toolkits: Lemur/Indri.
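
The scoring step might look like the following sketch, which assumes a query-likelihood / KL-divergence style formulation with Dirichlet smoothing (the family of ranking functions Lemur/Indri provides); the toy query model, collection model, and μ value are made up for illustration, and the actual BeeSpace configuration may differ.

```python
import math
from collections import Counter

def score(query_model, doc_tokens, collection_model, mu=2000):
    """Cross-entropy of p(w|Q) against a Dirichlet-smoothed p(w|D);
    rank-equivalent to negative KL divergence between the Q and D models."""
    counts = Counter(doc_tokens)
    dlen = len(doc_tokens)
    s = 0.0
    for w, p_wq in query_model.items():
        # Dirichlet-smoothed document language model p(w|D)
        p_wd = (counts[w] + mu * collection_model.get(w, 1e-9)) / (dlen + mu)
        s += p_wq * math.log(p_wd)
    return s

# Toy example: a topic/region description used as the query model p(w|Q)
query_model = {"foraging": 0.5, "nectar": 0.3, "dance": 0.2}
collection_model = {"foraging": 0.01, "nectar": 0.005, "dance": 0.002, "the": 0.05}
doc = "foragers perform the waggle dance to recruit nestmates to nectar".split()
print(score(query_model, doc, collection_model))
```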

18 EXTRACT: Space → Topic/Region
Assume k topics, each represented by a word distribution. Use a k-component mixture model to fit the documents in a given space (EM algorithm); the estimated k component word distributions are taken as the k topic regions. Likelihood of the space C under parameters Λ: log p(C|Λ) = Σ_{d∈C} Σ_w c(w,d) log Σ_{j=1..k} π_{d,j} p(w|θ_j). Maximum likelihood estimator: Λ* = arg max_Λ log p(C|Λ). Bayesian (MAP) estimator: Λ* = arg max_Λ [log p(C|Λ) + log p(Λ)].
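
A compact EM sketch for this kind of mixture model is shown below; the function name, the use of NumPy, and the pseudo-count prior mechanism (used again on slide 21) are assumptions made for illustration, and details such as a background component are omitted.

```python
import numpy as np

def extract_topics(doc_term, k, iters=100, seed=0, prior=None, mu=0.0):
    """Fit a k-component mixture to a document-term count matrix by EM.
    Returns (pi, theta): pi[d, j] are per-document topic proportions and
    theta[j] is the j-th topic word distribution. `prior` is an optional
    (k, n_words) array of prior word distributions added to the M-step as
    mu pseudo-counts (MAP estimation)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = doc_term.shape
    theta = rng.dirichlet(np.ones(n_words), size=k)    # p(w | theta_j)
    pi = np.full((n_docs, k), 1.0 / k)                 # pi_{d,j}
    for _ in range(iters):
        # E-step: p(z_{d,w} = j) for every document/word pair, shape (n_docs, k, n_words)
        joint = pi[:, :, None] * theta[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate mixing weights and topic word distributions
        weighted = doc_term[:, None, :] * post         # c(w,d) * p(z_{d,w}=j)
        pi = weighted.sum(axis=2)
        pi /= pi.sum(axis=1, keepdims=True)
        counts = weighted.sum(axis=0)                  # shape (k, n_words)
        if prior is not None:
            counts = counts + mu * prior               # prior as pseudo-counts
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta
```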

19 A Sample Topic & Corresponding Space
Word distribution (language model), top words: filaments, muscle, actin, z, filament, myosin, thick, thin, sections, er band, muscles, antibodies, myofibrils, flight, images. Meaningful labels: “actin filaments”, “flight muscle(s)”. Example documents: actin filaments in honeybee-flight muscle move collectively; arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections; identification of a connecting filament protein in insect fibrillar flight muscle; the invertebrate myosin filament: subfilament arrangement of the solid filaments of insect flight muscles; structure of thick filaments from insect flight muscle.

20 Incorporating Topic Priors
In either topic extraction or clustering, the user exploring the space usually has a preference, e.g., wanting one topic/cluster to be about foraging behavior. A prior can be used to guide topic extraction. The prior is expressed as a simple language model, e.g., forage 0.2; foraging 0.3; food 0.05; etc.

21 Incorporating a Topic Prior
The prior changes only the M-step re-estimation of the topic word distributions. Original EM (M-step): p(w|θ_j) ∝ Σ_d c(w,d) p(z_{d,w} = j). EM with prior (MAP estimation, prior added as pseudo-counts): p(w|θ_j) ∝ Σ_d c(w,d) p(z_{d,w} = j) + μ_j p(w|prior_j).
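
Continuing the extract_topics sketch from slide 18, here is how the example prior from slide 20 (forage 0.2; foraging 0.3; food 0.05) might be plugged in as pseudo-counts; the vocabulary, toy counts, and μ value are made up for illustration.

```python
import numpy as np

# Usage of the extract_topics sketch above with a topic prior (slide 20's example).
vocab = ["forage", "foraging", "food", "dance", "nectar"]
prior = np.zeros((2, len(vocab)))
prior[0] = [0.2, 0.3, 0.05, 0.0, 0.0]               # pulls topic 0 toward foraging words
prior[0] /= prior[0].sum()                           # renormalize to a word distribution
prior[1] = np.full(len(vocab), 1.0 / len(vocab))     # uninformative prior for topic 1

doc_term = np.array([[3, 2, 1, 0, 2],                # toy document-term counts
                     [0, 1, 0, 4, 1],
                     [2, 3, 2, 0, 1]])
pi, theta = extract_topics(doc_term, k=2, prior=prior, mu=5.0)
print(dict(zip(vocab, theta[0].round(3))))           # topic 0 biased toward foraging words
```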

22 Incorporating Topic Priors: Sample Topic 1
Prior: labor division 0.2. Resulting topic (top words): age, division, labor, colony, foraging, foragers, workers, task, behavioral, behavior, older, tasks, old, individual, ages, young, genotypic, social.

23 Incorporating Topic Priors: Sample Topic 2
Prior: behavioral 0.2, maturation 0.2. Resulting topic (top words): behavioral, age, maturation, task, division, labor, workers, colony, social, behavior, performance, foragers, genotypic, differences, polyethism, older, plasticity, changes.

24 Exploit Prior for Concept Switching
First topic (top words): foraging, foragers, forage, food, nectar, colony, source, hive, dance, forager, information, feeder, rate, recruitment, individual, reward, flower, dancing, behavior. Second topic, after concept switching (top words): foraging, nectar, food, forage, colony, pollen, flower, sucrose, source, behavior, individual, rate, recruitment, time, reward, task, sitter, rover, rovers.

25 Part 3: Task Support

26 Gene Summarization
Task: automatically generate a text summary for a given gene. Challenge: the summary needs to cover different aspects of a gene, and standard summarization methods would generate an unstructured summary. Solution: a new method for generating semi-structured summaries.

27 An Ideal Gene Summary
(Figure: an example gene summary organized into aspect sections labeled GP, EL, SI, GI, MP, WFPI.)

28 Semi-structured Text Summarization

29 Summary example (Abl)

30 A General Entity Summarizer
Task: given any entity and k aspects to summarize, generate a semi-structured summary. Assumption: training sentences are available for each aspect. Method: (1) train a recognizer for each aspect; (2) given an entity, retrieve sentences relevant to the entity; (3) classify each sentence into one of the k aspects; (4) choose the best sentences in each category.
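
The pipeline above can be sketched end-to-end as follows; the aspect names, the TF-IDF plus logistic-regression classifier, and the simple containment-based sentence retrieval are stand-in assumptions, not the actual BeeSpace summarizer components.

```python
# Sketch of the semi-structured entity summarizer pipeline described above:
# retrieve sentences mentioning the entity, classify each into one of k aspects,
# and keep the best-scoring sentences per aspect. Components are stand-ins.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

ASPECTS = ["expression location", "mutant phenotype", "interactions"]  # illustrative

def train_aspect_classifier(labeled_sentences):
    """labeled_sentences: list of (sentence, aspect_index) pairs; assumes every
    aspect index 0..len(ASPECTS)-1 appears at least once in the training data."""
    texts, labels = zip(*labeled_sentences)
    vec = TfidfVectorizer()
    clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)
    return vec, clf

def summarize(entity, corpus_sentences, vec, clf, per_aspect=2):
    # 1. Retrieve sentences relevant to the entity (simple containment match).
    hits = [s for s in corpus_sentences if entity.lower() in s.lower()]
    if not hits:
        return {}
    # 2. Classify each retrieved sentence into an aspect, keeping its confidence.
    probs = clf.predict_proba(vec.transform(hits))
    buckets = defaultdict(list)
    for sent, p in zip(hits, probs):
        aspect = int(p.argmax())
        buckets[aspect].append((p[aspect], sent))
    # 3. Choose the best sentences in each aspect category.
    return {ASPECTS[a]: [s for _, s in sorted(b, reverse=True)[:per_aspect]]
            for a, b in buckets.items()}
```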

31 Summary
All the methods we developed are general and scalable. The problems are hard, but good progress has been made in all directions. The V3 system has incorporated only the basic research results; more advanced technologies are available for immediate implementation: better tokenization for retrieval, domain adaptation techniques, automatic topic labeling, and a general entity summarizer. More research remains to be done in entity and relation extraction, graph mining/question answering, domain adaptation, and active learning.

32 Looking Ahead: X-Space…
(Same system architecture diagram as slide 2: Users; Task Support (Gene Summarizer, Function Annotator) and Space Navigation (Space/Region Manager, Navigation Support); Search Engine, Text Miner, Relational Database; Words/Phrases, Entities; Content Analysis (Natural Language Understanding); Meta Data, Literature Text.)

33 Thank You! Questions?

