
Understanding User Intents in Online Health Forums Thomas Zhang, Jason H.D. Cho, Chengxiang Zhai Department of Computer Science University of Illinois.


1 Understanding User Intents in Online Health Forums
Thomas Zhang, Jason H.D. Cho, Chengxiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
Newport Beach, California
22nd September

2 Online Health Forums
Purpose: To provide a convenient platform to facilitate discussion among patients and professionals
Huge user base, and still growing!
In 2011, 80% of all web users searched for health information online, of whom 6% participated in health-related discussions
Forums contain valuable information
– Contain rich, often first-hand experiences

3 Deficiencies of Forums
Threads are scattered
Similar questions are asked again and again
Keyword search is inadequate
– Finding several keyword matches in a thread does not necessarily mean that the thread is relevant

4 Post about cholinergic urticaria in April 2004: received 3rd and final reply a week later
Post from March 2012: no replies as of July

5 Applications of Intents
Improving thread retrieval
– e.g. A thread whose original post matches both the keywords and the intent specified by the user is more likely to be helpful
Filtering threads
– e.g. To treat a condition, only look at posts asking about treatment
Understanding user behavior in forums
– i.e. Users of different forums have different intents

6 This Paper
Introduces the problem of identifying user intents in health forums as a classification problem
Derives the first taxonomy of user intents
Designs a set of novel features for use with machine learning to solve the problem
Creates the first dataset for evaluation, and conducts experiments to make empirical findings

7 Roadmap
1. Problem formulation
2. Intent taxonomy derivation
3. Methodology
– Support vector machines
– Hierarchical classification
– Feature design
4. Evaluation
– Dataset
– Experiments
– Results
5. Intents in MedHelp forums
6. Wrap-up

8 Problem Formulation

9 Taxonomy Derivation
No taxonomy exists for health forum intents
Solution: Create our own!
First, reduce the ten generic questions most commonly asked by doctors (Ely et al., 2000) into three intent classes
– Classes match the intents of users who search for health information online (Choudhury et al., 2014)
Next, introduce two additional intent classes that are specific to health forum posts

10 Taxonomy
Manage: How should I manage or treat condition X?
Cause: What is the cause of symptom/physical/test finding X?
Adverse: Can drug or treatment X cause adverse finding Y?
Combo: Combination (at least two of first three)
Story: Story telling, news, sharing or asking about experience, soliciting support, or others
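
The five classes can be written down as a small data structure. The sketch below is illustrative only: the `Intent` enum and the `resolve_label` helper are hypothetical names, and the rule that a post with no basic intent falls into Story is an assumption drawn from the "or others" wording.

```python
from enum import Enum


class Intent(Enum):
    MANAGE = "Manage"    # How should I manage or treat condition X?
    CAUSE = "Cause"      # What is the cause of symptom/physical/test finding X?
    ADVERSE = "Adverse"  # Can drug or treatment X cause adverse finding Y?
    COMBO = "Combo"      # At least two of the first three
    STORY = "Story"      # Story telling, news, experience, support, or others


BASIC_INTENTS = {Intent.MANAGE, Intent.CAUSE, Intent.ADVERSE}


def resolve_label(question_intents):
    """Map the basic intents found in a post to a single taxonomy class.

    Hypothetical helper: the slides define the classes, not this function.
    """
    found = set(question_intents) & BASIC_INTENTS
    if len(found) >= 2:
        return Intent.COMBO       # two or more basic intents in one post
    if len(found) == 1:
        return next(iter(found))  # exactly one basic intent
    return Intent.STORY           # assumed catch-all for everything else


print(resolve_label({Intent.CAUSE, Intent.MANAGE}).value)  # Combo
```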

11 Where are we?
1. Problem formulation
2. Intent taxonomy derivation
3. Methodology
– Support vector machines
– Feature Selection
– Hierarchical classification
4. Evaluation
– Dataset
– Experiments
– Results
5. Intents in MedHelp forums
6. Wrap-up

12 Support Vector Machines (SVM)
Main idea: Learn a hyperplane from examples to separate them into two classes
Use learned hyperplane to classify unseen examples
Capable of non-linear and multiclass classification
Shown to have good performance on high-dimensional data
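
To make the hyperplane idea concrete, here is a minimal sketch of a binary linear SVM trained by subgradient descent on the L2-regularized hinge loss. It is not the authors' implementation (the slides do not name a solver or library), it covers only the two-class linear case, and the hyperparameters and toy data are made up for illustration.

```python
def train_linear_svm(examples, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    Labels must be +1 or -1. Hyperparameter values are illustrative only.
    """
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # Example is misclassified or inside the margin: push it out.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the learned hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (made up for illustration).
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

Non-linear and multiclass variants mentioned on the slide (kernels, one-vs-rest and similar schemes) are outside the scope of this sketch.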

13 Post Representation
How should we represent posts?
– SVMs require examples to be represented as a vector of features
What are features?
– Some measurable property of the observed data
How should we select them?
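
One common way to turn a post into a feature vector is unigram TF-IDF weighting, which the talk later uses for its Word SVM. The sketch below assumes raw term frequency times log(N / document frequency); the slides do not say which TF-IDF variant the authors used, and `tfidf_vectors` is a hypothetical helper.

```python
import math
from collections import Counter


def tfidf_vectors(posts):
    """Turn tokenized posts into sparse {token: weight} vectors.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    n_posts = len(posts)
    doc_freq = Counter()
    for tokens in posts:
        doc_freq.update(set(tokens))  # count each token once per post
    vectors = []
    for tokens in posts:
        term_freq = Counter(tokens)
        vectors.append({
            tok: count * math.log(n_posts / doc_freq[tok])
            for tok, count in term_freq.items()
        })
    return vectors


# Two made-up posts, already lowercased and tokenized.
posts = [
    "what could this rash be".split(),
    "what does this drug do".split(),
]
vecs = tfidf_vectors(posts)
print(vecs[0]["rash"] > vecs[0]["what"])  # True: "rash" is rarer than "what"
```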

14 Feature Selection
A good feature should be:
1. Generic enough to be found in many posts
2. Sufficiently discriminative for different intents

15 Solution: Patterns!
Sequence of (possibly non-contiguous) tokens that represent recurring text patterns in sentences
Very generic
– Lowercasing, stemming
– POS tagging
– UMLS semantic group tagging
Very discriminative
– “What could X be…?” signifies Cause intent, but “What does X do…?” signifies Manage intent

16 Pattern Types
Each pattern falls under one of four types:
LSP: Lowercased + stemmed tokens only
– E.g. “…what can caus…”
POSP: LSP + POS tags
– E.g. “…how to …”
SGP: LSP + semantic group tags
– E.g. “…if works…”
ALL: All types of tokens and tags
– E.g. “… make feel…”
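
Matching a possibly non-contiguous pattern against a sentence amounts to an ordered-subsequence test. The sketch below shows that test and the resulting binary feature vector for LSP-style patterns. The patterns, the pre-stemmed tokens, and the function names are hypothetical, and whether the authors limit the gap between pattern tokens is not stated on the slides.

```python
def matches_pattern(pattern, tokens):
    """Return True if `pattern` occurs in `tokens` as an ordered,
    possibly non-contiguous subsequence."""
    remaining = iter(tokens)
    # `tok in remaining` consumes the iterator up to the first match,
    # so each later pattern token must appear after the earlier ones.
    return all(tok in remaining for tok in pattern)


def pattern_features(tokens, patterns):
    """Binary feature vector: 1 if the sentence matches a pattern, else 0."""
    return [1 if matches_pattern(p, tokens) else 0 for p in patterns]


# Hypothetical LSP-style patterns over lowercased, stemmed tokens.
PATTERNS = [
    ("what", "can", "caus"),
    ("how", "to", "treat"),
]

tokens = ["what", "food", "can", "caus", "hive"]  # assumed already stemmed
print(pattern_features(tokens, PATTERNS))  # [1, 0]
```

POSP, SGP, and ALL patterns would work the same way once POS tags or semantic group tags are added to the token stream.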

17 UMLS Semantic Groups
MetaMap labels text phrases with semantic group labels from the UMLS Metathesaurus

18 Caveat
Patterns possess limitations
– Difficult to achieve good coverage without sacrificing discriminative properties
– Impossible to extract for posts with large content variations (e.g. Story posts)
However, we still want complete coverage of our dataset!

19 Solution: Hierarchical Classification!
Two cascading SVM classifiers
– The first uses binary pattern features (Pattern SVM)
– The second uses unigram features with TF-IDF weighting (Word SVM)
Complete coverage allows comparison with unigram baseline
[Flowchart: Input Post → Match ≥ 1 pattern? → Yes: Pattern SVM / No: Word SVM → Output Class]
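
The cascade reduces to a simple routing rule: posts that match at least one pattern go to the pattern classifier, and all others fall back to the word classifier. In the sketch below the two classifiers are stub callables standing in for the trained Pattern SVM and Word SVM, and the pattern and function names are hypothetical.

```python
def is_subsequence(pattern, tokens):
    """True if `pattern` appears in `tokens` in order, gaps allowed."""
    remaining = iter(tokens)
    return all(tok in remaining for tok in pattern)


def classify_post(tokens, patterns, pattern_clf, word_clf):
    """Route a post to the pattern classifier if it matches at least one
    pattern; otherwise fall back to the word classifier."""
    features = [1 if is_subsequence(p, tokens) else 0 for p in patterns]
    if any(features):
        return pattern_clf(features)  # binary pattern features
    return word_clf(tokens)           # unigram features in the real system


# Stub classifiers and a made-up pattern, for illustration only.
patterns = [("what", "can", "caus")]
pattern_stub = lambda features: "Cause"
word_stub = lambda tokens: "Story"

print(classify_post(["what", "can", "caus", "hive"], patterns, pattern_stub, word_stub))
print(classify_post(["my", "journey", "so", "far"], patterns, pattern_stub, word_stub))
```

Because every post reaches one of the two classifiers, the cascade always produces a label, which is what makes a like-for-like comparison with the unigram baseline possible.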

20 Where are we?
1. Problem formulation
2. Intent taxonomy derivation
3. Methodology
– Support vector machines
– Hierarchical classification
– Feature design
4. Evaluation
– Dataset
– Experiments
– Results
5. Intents in MedHelp forums
6. Wrap-up

21 Dataset
No labeled dataset exists, since this is a new problem
So we create our own!
– 1,192 original HealthBoards posts, evenly divided among four topics: allergies, breast cancer, depression, and heart disease
Ideally want more posts, but labeling is expensive
Why the four topics?

22 Dataset Labeling
*Per Landis and Koch, 1977
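
The Landis and Koch (1977) reference points to the usual verbal scale for kappa agreement statistics. The slide's table did not survive, so the sketch below assumes Cohen's kappa between two annotators, which the source does not confirm. The labels are made up, and no figures from the paper are reproduced.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def landis_koch(kappa):
    """Verbal interpretation of kappa per Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Made-up labels from two hypothetical annotators.
annotator_1 = ["Manage", "Manage", "Cause", "Cause"]
annotator_2 = ["Manage", "Manage", "Cause", "Manage"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa, landis_koch(kappa))
```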

23 Experiments
What is the best-performing set of patterns?
– Try different type combinations of patterns
How does hierarchical classification compare with the baseline?
– Five-fold cross validation (CV)
Does performance suffer if we train on posts from three topics and test on the fourth?
– Four-fold forum CV
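
Four-fold forum CV can be sketched as a leave-one-topic-out split: each fold holds out every post from one topic and trains on the other three. The `forum_folds` generator and the sample posts below are hypothetical, and the slides do not describe how the authors implemented the split.

```python
def forum_folds(posts):
    """Yield (held_out_topic, train, test), holding out one topic per fold.

    `posts` is a list of (topic, text) pairs.
    """
    topics = sorted({topic for topic, _ in posts})
    for held_out in topics:
        train = [p for p in posts if p[0] != held_out]
        test = [p for p in posts if p[0] == held_out]
        yield held_out, train, test


# Made-up placeholder posts for the four topics named in the talk.
posts = [
    ("allergies", "a1"),
    ("breast cancer", "b1"),
    ("depression", "d1"),
    ("heart disease", "h1"),
    ("allergies", "a2"),
]
folds = list(forum_folds(posts))
for topic, train, test in folds:
    print(topic, len(train), len(test))
```

With four topics this yields four folds, and no topic ever appears in both the training and test side of the same fold.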

24 Selecting a Pattern Set

25 CV Takeaways
Patterns reach labeling agreement upper bound
Overall improvement is underwhelming, why?
Patterns give high precision but low recall
– Why is this acceptable?
Patterns generalize well across forum topics
[Tables: Hierarchical Classification Performance; Word Classifier (Baseline) Performance]
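
The precision and recall trade-off can be illustrated with a per-class computation. In the sketch below the gold labels and predictions are made up, with `None` marking posts where a hypothetical pattern-only classifier matched no pattern and made no prediction. How the authors aggregated their scores is not stated on the slides.

```python
def precision_recall(gold, predicted, target):
    """Precision and recall for one class; returns 0.0 when undefined."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up example: None means no pattern matched, so no label was assigned.
gold = ["Cause", "Cause", "Cause", "Manage"]
predicted = ["Cause", None, None, "Manage"]
print(precision_recall(gold, predicted, "Cause"))
```

Here every Cause prediction is correct, so precision is perfect, but two of the three Cause posts are missed, so recall is low. This is the pattern of behavior the slide describes.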

26 Intents in MedHelp Forums
We applied our Pattern SVM to 61,225 MedHelp posts split across allergies, breast cancer, depression, and heart disease

27 Concluding Remarks
Introduced the new problem of forum post intent analysis
Designed the first taxonomy and dataset for classification
Proposed a novel set of pattern features for SVMs
Showed empirically that patterns give high classification precision while generalizing well across forums

28 Future Work
Administer study of health forum user intents
Expand pattern feature set to improve recall
Handle classification of Story posts
Identify all intents from Combo posts
Further evaluation with larger datasets

29 Thank you! Questions? Comments?

