A Probabilistic Model for Classification of Multiple-Record Web Documents. June Tang, Yiu-Kai Ng.


1 A Probabilistic Model for Classification of Multiple-Record Web Documents
June Tang, Yiu-Kai Ng

2 Overview
Probabilistic Model
– Bayes decision theory
– Document and query representations
– Ranking-function construction
Multivariate Statistical Analysis

3 Approach
– Constructing a ranking function for a probabilistic model based on multivariate statistical analysis
– Minimizing the expected cost of misclassification
– Deriving a classification rule
– Deriving a linear classification rule
– Deriving a sample linear classification rule

4 Application Ontology

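5 Document Representation
Attributes: (Year, Make, Model, Mileage, Price, Feature, PhoneNr)
Total records: 60
Counts per attribute: (Year: 62) (Make: 58) (Model: 48) (Mileage: 12) (Price: 58) (Feature: 49) (PhoneNr: 33)
Count vector: (62, 58, 48, 12, 58, 49, 33)
Normalized vector (each count divided by the 60 records): (1.03, 0.97, 0.80, 0.20, 0.97, 0.82, 0.55)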
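The two vectors on this slide are related by simple arithmetic: each attribute count divided by the total number of records gives the corresponding normalized value. A minimal sketch of that step, assuming this is how the second vector was obtained (the function name `document_vector` is hypothetical, not from the slides):

```python
# Attribute order follows the slide.
ATTRIBUTES = ["Year", "Make", "Model", "Mileage", "Price", "Feature", "PhoneNr"]

def document_vector(counts, total_records):
    """Divide each attribute count by the record count, rounded to 2 places."""
    return [round(counts[name] / total_records, 2) for name in ATTRIBUTES]

# Counts and record total taken from the slide.
counts = {"Year": 62, "Make": 58, "Model": 48, "Mileage": 12,
          "Price": 58, "Feature": 49, "PhoneNr": 33}
vector = document_vector(counts, 60)
print(vector)  # [1.03, 0.97, 0.8, 0.2, 0.97, 0.82, 0.55]
```

A value above 1, such as 1.03 for Year, simply means the attribute was counted more often than there are records.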

6 Elementary Concepts
– Variables are things that we measure, control, or manipulate in research
– Multivariate analysis considers multiple variables together as a single unit
– The normal distribution represents one of the empirically verified elementary "truths about the general nature of reality"

7 Multivariate Statistical Analysis
Let
– A be an application ontology
– D be a set of Web documents
– R be the set of relevant documents
– R̄ be the set of irrelevant documents
– X = (X1, X2, …, Xp) represent a document
– Ω be the set of all possible values that X can take, with Ω = Ω1 ∪ Ω2

8 Expected Cost of Misclassification (ECM)
Here, f1 and f2 are the two density functions
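The formula on this slide did not survive in the transcript. For reference, the textbook two-class form of the ECM is sketched below. The notation is an assumption rather than the authors' own: class 1 is "relevant", class 2 is "irrelevant", $p_i$ are prior probabilities, $c(j \mid i)$ is the cost of assigning a class-$i$ document to class $j$, and $\Omega_1$, $\Omega_2$ are the regions of $\Omega$ assigned to each class.

```latex
\mathrm{ECM} = c(2 \mid 1)\, P(2 \mid 1)\, p_1 + c(1 \mid 2)\, P(1 \mid 2)\, p_2,
\qquad
P(2 \mid 1) = \int_{\Omega_2} f_1(\mathbf{x})\, d\mathbf{x},
\quad
P(1 \mid 2) = \int_{\Omega_1} f_2(\mathbf{x})\, d\mathbf{x}
```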

9 Classification Rule
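The rule itself is missing from the transcript. In the standard derivation, the regions that minimize the expected cost of misclassification are defined by a density ratio, as sketched below. The symbols are assumptions: $f_1$, $f_2$ are the class densities, $p_1$, $p_2$ the priors, and $c(j \mid i)$ the cost of assigning a class-$i$ document to class $j$.

```latex
\Omega_1:\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} \ge
  \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1},
\qquad
\Omega_2:\ \frac{f_1(\mathbf{x})}{f_2(\mathbf{x})} <
  \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1}
```

With equal costs and equal priors the right-hand side is 1, so a document is assigned to whichever class gives it the higher density.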

10 Multivariate Normal Density Functions
Assume that the density functions are normal
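The density on this slide was lost in extraction. The standard $p$-dimensional normal density for class $i$ is given below, where the mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ are the usual parameters (the symbol names are assumed, not taken from the slide).

```latex
f_i(\mathbf{x}) =
  \frac{1}{(2\pi)^{p/2}\, \lvert \boldsymbol{\Sigma}_i \rvert^{1/2}}
  \exp\!\left[ -\tfrac{1}{2}
    (\mathbf{x} - \boldsymbol{\mu}_i)^{\top}
    \boldsymbol{\Sigma}_i^{-1}
    (\mathbf{x} - \boldsymbol{\mu}_i) \right],
\qquad i = 1, 2
```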

11 Linear Classification Rule
Assume that the density functions are normal and Σ1, Σ2, and Σ are equal
Document x is classified as relevant if
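The inequality is missing from the transcript. Under normal densities with a common covariance matrix $\boldsymbol{\Sigma}$, taking the logarithm of the minimum-ECM density ratio gives the standard linear rule sketched here. Class 1 is assumed to be "relevant", with priors $p_i$ and costs $c(j \mid i)$ for assigning a class-$i$ document to class $j$.

```latex
(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{x}
 \;-\; \tfrac{1}{2}\,
 (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^{\top} \boldsymbol{\Sigma}^{-1}
 (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)
 \;\ge\;
 \ln\!\left[ \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1} \right]
```

The quadratic terms in $\mathbf{x}$ cancel because the two covariance matrices are equal, which is what makes the rule linear.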

12 Linear Discrimination Function Threshold: ?

13 Parameter Estimations
Suppose we have n1 relevant documents and n2 irrelevant documents such that n1 + n2 ≥ p, where p is the dimension of the vector x

14 Parameter Estimations (Cont.)
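The estimators themselves are not preserved in the transcript. A common choice for this kind of analysis, assumed here, is the sample mean of each class and a pooled covariance matrix with divisor n1 + n2 - 2. The sketch below uses hypothetical function names and toy two-dimensional data, not the authors' documents.

```python
def mean_vector(rows):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s

def pooled_covariance(relevant, irrelevant):
    """Pooled sample covariance (assumed divisor: n1 + n2 - 2)."""
    m1, m2 = mean_vector(relevant), mean_vector(irrelevant)
    s1 = scatter_matrix(relevant, m1)
    s2 = scatter_matrix(irrelevant, m2)
    dof = len(relevant) + len(irrelevant) - 2
    p = len(m1)
    return [[(s1[i][j] + s2[i][j]) / dof for j in range(p)] for i in range(p)]

# Toy data chosen so the arithmetic is easy to check by hand.
relevant = [[1.0, 2.0], [3.0, 4.0]]
irrelevant = [[0.0, 0.0], [2.0, 0.0]]
pooled = pooled_covariance(relevant, irrelevant)
print(mean_vector(relevant), mean_vector(irrelevant), pooled)
```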

15 Sample Classification Rule Document x is classified as relevant if
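The inequality for this slide is not in the transcript. A minimal sketch of the usual sample version of a linear rule follows, assuming the population means and covariance are replaced by the sample means and a pooled covariance matrix. All names, the `cost_ratio` and `prior_ratio` parameters, and the toy numbers are illustrative assumptions.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * a = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(map(float, row)) + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("pooled covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def is_relevant(x, mean1, mean2, pooled, cost_ratio=1.0, prior_ratio=1.0):
    """Linear rule: a.(x - midpoint) >= ln(cost_ratio * prior_ratio).

    cost_ratio stands for c(1|2)/c(2|1) and prior_ratio for p2/p1.
    """
    diff = [a - b for a, b in zip(mean1, mean2)]
    coeffs = solve(pooled, diff)  # a = S_pooled^{-1} (mean1 - mean2)
    midpoint = [(a + b) / 2 for a, b in zip(mean1, mean2)]
    score = sum(c * (xi - mi) for c, xi, mi in zip(coeffs, x, midpoint))
    return score >= math.log(cost_ratio * prior_ratio)

# Toy estimates (assumed values, not the authors' data).
mean1, mean2 = [2.0, 3.0], [1.0, 0.0]
pooled = [[2.0, 1.0], [1.0, 1.0]]
print(is_relevant([2.0, 3.0], mean1, mean2, pooled))  # True
print(is_relevant([1.0, 0.0], mean1, mean2, pooled))  # False
```

Solving the linear system avoids forming the inverse matrix explicitly, which is a design choice for the sketch rather than anything stated on the slides.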

16 Misclassification Probabilities Lachenbruch’s “holdout” procedure where
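Lachenbruch's holdout procedure leaves out one document at a time, rebuilds the classification rule from the remaining documents, and classifies the held-out document. The fraction misclassified in each class estimates that class's misclassification probability. The sketch below shows the procedure generically. To stay short it plugs in a nearest-mean rule as a stand-in classifier, which is not the rule from the slides, and it uses made-up data.

```python
def holdout_error_rates(relevant, irrelevant, fit):
    """Leave each document out once, refit, and classify the held-out one.

    Returns (share of relevant documents misclassified,
             share of irrelevant documents misclassified).
    """
    missed_relevant = 0
    for i, doc in enumerate(relevant):
        rule = fit(relevant[:i] + relevant[i + 1:], irrelevant)
        if not rule(doc):
            missed_relevant += 1
    missed_irrelevant = 0
    for i, doc in enumerate(irrelevant):
        rule = fit(relevant, irrelevant[:i] + irrelevant[i + 1:])
        if rule(doc):
            missed_irrelevant += 1
    return missed_relevant / len(relevant), missed_irrelevant / len(irrelevant)

def fit_nearest_mean(relevant, irrelevant):
    """Stand-in classifier: relevant if closer to the relevant-class mean."""
    m1 = [sum(c) / len(relevant) for c in zip(*relevant)]
    m2 = [sum(c) / len(irrelevant) for c in zip(*irrelevant)]

    def rule(x):
        d1 = sum((a - b) ** 2 for a, b in zip(x, m1))
        d2 = sum((a - b) ** 2 for a, b in zip(x, m2))
        return d1 <= d2

    return rule

# Made-up data; the last relevant document is a deliberate outlier.
relevant = [[1.0, 0.9], [0.9, 1.0], [0.8, 0.8], [0.2, 0.1]]
irrelevant = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
rates = holdout_error_rates(relevant, irrelevant, fit_nearest_mean)
print(rates)  # (0.25, 0.0)
```

Because `fit` is passed in as a function, the same loop could be reused with a linear discriminant rule in place of the stand-in.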

17 Precision Measure

18 Experimental Result (Relevant)

19 Experimental Result (Irrelevant)

20 Conclusion
– Precision: 85% (VSM: 77.5%)
– Multivariate Statistical Analysis
– Extendibility to Multiple Categorization Classification

