
1 6/11/2015 A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science, Brigham Young University. Quan Wang, November 2001

2 Overview
Probabilistic Retrieval Model
– Application ontology
– Document representations
– Ranking documents based on logistic regression analysis
Experimental Results

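3 Application Ontology
[Figure: car-ads application ontology. The object set Car is connected to Year, Price, Make, Model, Mileage, Feature, and PhoneNr. Participation constraints shown on the diagram: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]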

4 Document Representation
A set of pairs $A_1 : x_1, \ldots, A_n : x_n$; a density heuristic value $y$; a grouping heuristic value $z$.
Document $d \rightarrow (x_1, \ldots, x_n, y, z) = (V, y, z)$

5 Independence Assumption
Under the independence assumption, $P(R \mid x_1, \ldots, x_n, y, z)$ is broken down into the individual probabilities $P(R \mid x_1), \ldots, P(R \mid x_n)$, $P(R \mid y)$, and $P(R \mid z)$.

6 Logistic Regression
[Figure: scatter plots of $P(R \mid x)$ against $x$ and of $P(R \mid x_i)$ against $x_i$.]
$P(R \mid x) = \dfrac{1}{1 + \exp(-(C_0 + C_1 x))}$, $\ln O(R \mid x) = C_0 + C_1 x$.
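The two formulas on this slide can be checked against each other with a small sketch. The function names and the coefficient values below are assumptions made for illustration only; the slides do not report fitted values of $C_0$ and $C_1$ here.

```python
import math


def relevance_probability(x: float, c0: float, c1: float) -> float:
    """P(R|x) = 1 / (1 + exp(-(C0 + C1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


def log_odds(p: float) -> float:
    """ln O(R|x) = ln(P / (1 - P)), the log odds of relevance."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, not taken from the presentation.
C0, C1 = -2.0, 4.0
x = 0.75

p = relevance_probability(x, C0, C1)
# Taking the log odds of the logistic probability recovers the linear form.
assert math.isclose(log_odds(p), C0 + C1 * x)
```

The assertion shows why the two expressions are interchangeable: the logistic curve is exactly the function whose log odds are linear in $x$.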

7 Probabilistic Retrieval Based on Logistic Regression Analysis
– Data processing
– Data analysis
– Probabilistic retrieval on car-ads application ontology
– Correlation relations

8 Data Processing
The corresponding normalized vector $V' = (x_1', \ldots, x_n')$ is computed as $V' = \dfrac{|V|}{|u|} V$, where $V$ is a document vector and $u$ is an ontology vector.
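A minimal sketch of this normalization step, following the formula as it appears in the transcript. The slide does not say which norm $|\cdot|$ denotes; the Euclidean norm used here is an assumption, as are the function names and the example vectors.

```python
import math


def norm(vec: list[float]) -> float:
    """Euclidean length of a vector (assumed meaning of |.| on the slide)."""
    return math.sqrt(sum(c * c for c in vec))


def normalize(v: list[float], u: list[float]) -> list[float]:
    """Scale document vector v by |v| / |u|, as the slide's formula is written."""
    scale = norm(v) / norm(u)
    return [scale * c for c in v]


# Hypothetical document vector and ontology vector.
document_vector = [3.0, 4.0]   # |V| = 5
ontology_vector = [6.0, 8.0]   # |u| = 10
normalized = normalize(document_vector, ontology_vector)  # scale factor 0.5
```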

9 Data Distributions
[Figure: scatter plots of the data distributions.]

10 Logistic Regression-1

11 Logistic Regression-2
[Figure labels: regression coefficients, p-value.]

12 Statistical Information: P-Value
A p-value is a significance indicator. A large p-value indicates either a bad regression model or a statistically insignificant index term. We should keep only significant index terms.

13 Select Important Index Terms
The car-ads application ontology:

Index term   P-value
Features     .001
PhoneN       .034
Density      .052
Grouping     .012
Year         .679
Make         .002
Model        .074
Mileage      .002
Price        .001

[Figure label: double S-curve.]
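The selection rule (keep only significant index terms) can be sketched as a simple filter over the p-values listed for the car-ads application ontology. The cutoff of 0.05 is an assumption; the transcript does not state which significance level was used, so the resulting list is illustrative rather than the presentation's own selection.

```python
# P-values as listed on the slide for the car-ads application ontology.
P_VALUES = {
    "Features": 0.001,
    "PhoneN": 0.034,
    "Density": 0.052,
    "Grouping": 0.012,
    "Year": 0.679,
    "Make": 0.002,
    "Model": 0.074,
    "Mileage": 0.002,
    "Price": 0.001,
}

ALPHA = 0.05  # assumed significance level, not given in the slides


def significant_terms(p_values: dict[str, float], alpha: float) -> list[str]:
    """Return the index terms whose p-value is below alpha, sorted by name."""
    return sorted(term for term, p in p_values.items() if p < alpha)


kept = significant_terms(P_VALUES, ALPHA)
dropped = sorted(set(P_VALUES) - set(kept))
```

Under this assumed cutoff, Year (.679), Model (.074), and Density (.052) fall outside the kept set; a different cutoff would change the borderline cases.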

14 Probabilistic Retrieval Model
Decision on the log odds $\ln O(R \mid x_i)$, $\ln O(R \mid y)$, $\ln O(R \mid z)$: $> 0$ means relevant; $< 0$ means irrelevant.
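The sign test on the log odds is equivalent to asking whether the probability of relevance exceeds 0.5. The sketch below illustrates that equivalence for a single log-odds value. How the individual log odds are combined into one score is not given in the transcript, so the sketch does not attempt it; the function names and the handling of a log odds of exactly 0 are assumptions.

```python
import math


def probability_from_log_odds(log_odds: float) -> float:
    """Invert ln O = ln(P / (1 - P)) to recover P."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def classify(log_odds: float) -> str:
    """Positive log odds means relevant; otherwise irrelevant.

    Treating exactly 0 as irrelevant is an assumption; the slide only
    covers the > 0 and < 0 cases.
    """
    return "relevant" if log_odds > 0 else "irrelevant"


# Positive log odds correspond to P > 0.5, negative log odds to P < 0.5.
for value in (1.2, -0.3):
    p = probability_from_log_odds(value)
    assert (p > 0.5) == (classify(value) == "relevant")
```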

15 Correlation Relations
Correlation: there are strong positive correlations among document properties (e.g., Death Date and Birth Date in obituaries). Correlations are extra information implicitly contained in a document. Correlation relations handle “patterns”, e.g., a Birth Date-Death Date pair appearing in the obituaries application ontology.

16 Special Web Documents
– Multiple-record Web documents
– Similar content and format (e.g., items for sale)
– Same lexical object values (e.g., Honda makes cars and motorcycles)
– 8 documents (motorcycle, boat, snowmobile, bicycle) for car-ads, and 5 documents (death notice, bibliography for famous people, find a graveyard, politician died young, famous people died in car accident) for obituaries

17 Experimental Results
Recall: 100%
Precision: 83.3%* (car-ads), 83.3% (obituary)
Accuracy: 92.9% (car-ads), 92.0% (obituary)
*Ten out of eighteen negative documents are specially selected.
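For reference, the three measures can be computed from a confusion matrix as in the sketch below. The transcript does not give the underlying counts. The counts used here are hypothetical: they assume the eighteen negative documents in the footnote belong to the car-ads test set and that the 100% recall applies to it, and they were chosen because they reproduce the reported car-ads percentages.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of relevant documents that were classified as relevant."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Fraction of documents classified as relevant that are relevant."""
    return tp / (tp + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all documents that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, NOT taken from the presentation: 10 positive and
# 18 negative documents, with 2 negatives misclassified as relevant.
TP, FN, FP, TN = 10, 0, 2, 16

car_ads_recall = recall(TP, FN)
car_ads_precision = precision(TP, FP)
car_ads_accuracy = accuracy(TP, TN, FP, FN)
```

With these counts the measures round to 100%, 83.3%, and 92.9%, matching the car-ads figures, though other count assignments could match them as well.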

18 Conclusions
We propose a probabilistic model that is suitable for classifying multiple-record Web documents. The model's performance on a randomly chosen test document set could be better than the results we present in the thesis.

