Presentation is loading. Please wait.

Presentation is loading. Please wait.

Artificial Intelligence Research Laboratory

Similar presentations


Presentation on theme: "Artificial Intelligence Research Laboratory"— Presentation transcript:

1 Artificial Intelligence Research Laboratory
Department of Computer Science RECOMB 2009 Combining Abstraction and Super-structuring on Macromolecular Sequence Classification Adrian Silvescu, Cornelia Caragea, and Vasant Honavar Introduction: The choice of features that are used to describe the data presented to a learner, and the level of detail at which they describe the data, can have a major impact on the difficulty of learning, and the accuracy, complexity, and comprehensibility of the learned predictive model. The representation has to be rich enough to capture the distinctions that are relevant from the standpoint of learning, but not so rich as to make the task of learning infeasible. Constructing Abstractions over k-grams: Results: Comparison of super-structuring and abstraction (SS+ABS) with super-structuring and feature selection (SS+FSEL), super-structuring only (SS_ONLY), and unigram (UNIGRAM) on the Eukaryotes and Prokaryotes data sets. greedy agglomerative procedure initially map each abstraction to a k-gram recursively group pairs of abstractions until m abstractions are obtained, e.g., m=2 Eukaryotes 3-grams Prokaryotes 3-grams Eukaryotes 2-grams Prokaryotes 2-grams a1 a2 a3 a4 a5 a6 a7 a9 a8 a10 a11 SIN INQ NQK QKL KLA Problem: Predict the subcellular localization for a protein sequence. Example: Distance between Abstractions: Let The weighted Jensen-Shannon divergence is given by: X be a finite set, and two probability distributions over X Previous Approaches to Feature Construction: Super-structuring: generating k-grams Class distributions induced by one of the m abstractions, and the class distributions induced by three 3-grams sampled from the abstraction on the Eukaryotes 3-gram data set, where (a) m=10; and (b) m=1000. The number of classes is 4. INQ SIN NQK QKL KLA LVI LAL ALV SINQKLALVIKSGKYTLGYKSTVKSLRQGKSKLIIIAANTPVLRKSELEYYAMLSKTKVYYFQGGNNELGTAVGKLFRVGVVSILEAGDSDILTTLA 10 Abstractions 1000 Abstractions Then, distance between two abstractions is defined as follows: Abstraction: grouping similar features to generate more abstract features where Y is the class variable. Feature selection: alternative approach to reducing the number of k-grams to m k-grams we used mutual information between the class variable and k-grams to rank the k-grams Conclusions: We have shown that: combining super-structuring and abstraction makes it possible to construct predictive models that use significantly smaller number of features than those obtained using super-structuring alone. abstraction in combination with super-structuring yields better performing models than those obtained by feature selection in combination with super-structuring. Data sets: Our Approach: Eukaryotes contains 2,427 protein sequences classified into one of four classes Prokaryotes contains 997 protein sequences classified into one of three classes Combining super-structuring and abstraction to construct new features Acknowledgements: This work is supported in part by a grant from the National Science Foundation (NSF ) to Vasant Honavar.


Download ppt "Artificial Intelligence Research Laboratory"

Similar presentations


Ads by Google