1 Mining User Behavior
Eugene Agichtein
Mathematics & Computer Science, Emory University
2 The Big Picture: Intelligent Information Access
3 Text Mining for Patient Medical Care
with E. V. Garcia (Emory SoM) and A. Ram (Georgia Tech)
Rule Discovery from Medical Literature (MERLIN project):
–Identify articles containing useful clinical knowledge
–Extract new expert system rules, test/modify based on patient DB
Personalized diagnosis and care (PRETEX project):
–Extract relevant clinical variables from text in patient records
–Personalize expert system rules for a given patient or population
–Automatically identify harmful drug interactions and side effects
4 Mining Textual Data in Patient Electronic Medical Records
5 More info: Archana Bhattarai et al., poster at reception this evening
6 From Medical Literature to Structured Clinical Knowledge
Example rule: IF LV_stress_perfusion_is_abnormal THEN STRONG POSITIVE EVIDENCE THAT Diseased_coronary_is(LAD)
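One way to hold a rule of this shape in code is sketched below. This is only an illustration: the `Rule` class, its field names, and `apply_rule` are hypothetical and are not taken from the MERLIN or PRETEX systems.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for an IF/THEN rule with an evidence strength."""
    condition: str   # name of a boolean clinical finding
    strength: str    # e.g. "STRONG POSITIVE EVIDENCE"
    conclusion: str  # the hypothesis the evidence supports


# The example rule from the slide, stored as data.
EXAMPLE_RULE = Rule(
    condition="LV_stress_perfusion_is_abnormal",
    strength="STRONG POSITIVE EVIDENCE",
    conclusion="Diseased_coronary_is(LAD)",
)


def apply_rule(rule: Rule, findings: Dict[str, bool]) -> Optional[str]:
    """Return the rule's conclusion text if its condition holds, else None."""
    if findings.get(rule.condition, False):
        return f"{rule.strength} THAT {rule.conclusion}"
    return None
```

Calling `apply_rule(EXAMPLE_RULE, findings)` with a dictionary of findings for one patient returns the conclusion string when the condition is true and `None` otherwise.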
7 Baoli Li et al., poster at reception this evening
8 This study claims WHAT?!?
If it’s printed, must be true
–Published studies are never disproven
–Experimental study data is never massaged
Big Pharma funding overstated claims
R. Smith, 2005: Medical journals are an extension of the marketing arm of pharmaceutical companies, PLoS Medicine
How to evaluate quality/soundness of literature?
10 Challenges
Authority and trust
Privacy of contributors vs. authority
Many dimensions of quality
–Equipment sensitivity
–Recency (studies grow obsolete)
–Size of the clinical trial
–Correlational vs. controlled
–Randomization
–…
Work in progress
11 The Big Picture: Intelligent Information Access
12 Social media: Planetary-scale user behavior experiment
Real information needs and subjective relevance judgments
Traces of many interactions recorded
Allows shared, reproducible experiments
Some semantic organization (tags, categories)
13 Social Media (emerging)
14 Traditional vs. social media
26 Community
34 How to find relevant and high-quality content in social media?
35 Learning-based Approach
Diagram labels: content features; community interaction features; relevance; quality; unified ranking function
36 Ranking Algorithm – GBrank [Zheng 2007]
Start with an initial guess h_0; for k = 1, 2, …
Using h_{k-1} as the current approximation of h, we separate S into two disjoint sets
Fit a regression function g_k(x) using Gradient Boosting Tree [Friedman 2001] and the following training data
Form the new ranking function as
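The loop above can be sketched in Python as follows. The set definitions, regression targets, and update rule are not given in the text here, so the sketch assumes the forms usually attributed to GBrank [Zheng 2007]: for each preference pair (x, y) with x preferred to y, the pair counts as violated when h(x) < h(y) + tau; violated pairs contribute targets h(y) + tau for x and h(x) - tau for y; and the update is h_k = (k * h_{k-1} + eta * g_k) / (k + 1). A depth-1 regression stump stands in for the gradient boosting tree to keep the example dependency-free. The names `fit_stump`, `blend`, `tau`, `eta`, and `rounds` are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Ranker = Callable[[Vector], float]
Pair = Tuple[Vector, Vector]  # (preferred item x, less-preferred item y)


def fit_stump(points: List[Vector], targets: List[float]) -> Ranker:
    """Least-squares depth-1 regression tree (a stand-in for boosted trees)."""
    mean = sum(targets) / len(targets)
    best_err = sum((t - mean) ** 2 for t in targets)
    best_split = None
    for f in range(len(points[0])):
        for thr in sorted({p[f] for p in points}):
            left = [t for p, t in zip(points, targets) if p[f] <= thr]
            right = [t for p, t in zip(points, targets) if p[f] > thr]
            if not right:
                continue  # threshold at the maximum value splits nothing
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = (sum((t - lm) ** 2 for t in left)
                   + sum((t - rm) ** 2 for t in right))
            if err < best_err:
                best_err, best_split = err, (f, thr, lm, rm)
    if best_split is None:
        return lambda x: mean
    f, thr, lm, rm = best_split
    return lambda x: lm if x[f] <= thr else rm


def blend(prev: Ranker, g: Ranker, k: int, eta: float) -> Ranker:
    """Assumed update rule: h_k = (k * h_{k-1} + eta * g_k) / (k + 1)."""
    return lambda x: (k * prev(x) + eta * g(x)) / (k + 1)


def gbrank(pairs: List[Pair], rounds: int = 20,
           tau: float = 1.0, eta: float = 1.0) -> Ranker:
    h: Ranker = lambda x: 0.0  # initial guess h_0
    for k in range(1, rounds + 1):
        # Pairs that h_{k-1} fails to order with a margin of at least tau.
        violated = [(x, y) for x, y in pairs if h(x) < h(y) + tau]
        if not violated:
            break
        points: List[Vector] = []
        targets: List[float] = []
        for x, y in violated:
            points.append(x)
            targets.append(h(y) + tau)  # push the preferred item up
            points.append(y)
            targets.append(h(x) - tau)  # push the other item down
        h = blend(h, fit_stump(points, targets), k, eta)
    return h
```

For example, `gbrank([([3.0], [1.0]), ([2.0], [1.0]), ([3.0], [2.0])])` returns a scoring function that ranks `[3.0]` above `[2.0]` above `[1.0]`.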
37 Experimental Results
Figure legend: GBrank; baseline; removing textual features; removing community interaction features
38 Intelligent Information Access
39 User Behavior: The 3rd Dimension of the Web
Amount exceeds web content and structure
–Published: 4 Gb/day; Social Media: 10 Gb/day
–Page views: 100 Gb/day
[Andrew Tomkins, Yahoo! Search, 2007]
40 Clickthrough for Queries with Known Position of Top Relevant Result
Relative clickthrough for queries with known relevant results in position 1 and 3 respectively
Higher clickthrough at top non-relevant than at top relevant document
E. Agichtein, E. Brill, and S. Dumais, SIGIR 2006
41 Full Search Engine, User Behavior: NDCG, MAP
Method      MAP      Gain
RN          0.270
RN+ALL               19.13%
BM25
BM25+ALL             23.71%
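For reference, the two metrics named on this slide can be computed roughly as below. This is a sketch under stated assumptions, not the evaluation code behind the reported numbers: it uses the common (2^gain - 1) / log2(rank + 1) form of DCG, and average precision is normalized by the number of relevant results that appear in the ranked list. All function names are illustrative.

```python
import math
from typing import List, Sequence


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top k graded results."""
    return sum((2 ** g - 1) / math.log2(i + 2)
               for i, g in enumerate(gains[:k]))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG divided by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


def average_precision(relevant: Sequence[bool]) -> float:
    """Mean of precision@rank taken at each relevant result in the list."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs: List[Sequence[bool]]) -> float:
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in runs) / len(runs)
```

Each query contributes one ranked list: graded labels for `ndcg`, binary relevance flags for `average_precision`.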
42 User Behavior Complements Content and Web Topology
Method                       Score    Gain
RN (Content + Links)         0.632
RN + All (User Behavior)              10%
BM25
BM25+All                              31%
43 Fine grained behavior analysis
44 Data captured with Tobii eye tracker, courtesy Andy Edmonds
45 Preliminary results on using mouse trajectories to infer user intent
Q. Guo and E. Agichtein, to appear in SIGIR 2008