7/16/2015 Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science.

1 Ontology-Based Binary-Categorization of Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Department of Computer Science, Brigham Young University. Q Wang, November 2000

2 Multiple-Record Web Documents-1
Acura Integra 1990 $4,000 (1/27/00) ACURA'90 Integra, AC, AM/FM cassette, cruise, new tires. Asking $4,000. (302) 226-5444.
Acura Integra 1992 $5,900 (1/27/00) ACURA'92 Integra RS, white, excellent condition. $5,900. 410-548-1353
Relevant document: a chunk of Car-sale Ads

3 Multiple-Record Web Documents-2
'97 HONDA ACE SHADOW 1100cc 4k. Customized. $7.5K/obo 410-465-0870
'97 HONDA CR250 Exc. cond. $3300/OBO. (410) 479-4499
Irrelevant document: a chunk of Motorcycle Ads

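4 Application Ontology
[Figure: application ontology for Car. Object sets: Year, Price, Make, Model, Mileage, Feature, PhoneNr. Participation constraints on the diagram, in transcript order: 1:*, 0:0.975:1, 0:0.8:1, 0:0.908:1, 0:1.15:*, 0:2.2:*, 0:0.925:1, 0:0.45:1.]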

5 Document Representation
A document d is represented by:
A set of pairs $A_1{:}x_1, \ldots, A_n{:}x_n$
A density heuristic value
A grouping heuristic value
$P(R \mid d)$ is estimated from $P(R \mid (x_1, \ldots, x_n))$, $P(R \mid \text{Density})$, and $P(R \mid \text{Grouping})$.

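6 Independence Assumption
Under the independence assumption, $P(R \mid (\text{Year}, \ldots, \text{Make}))$ is replaced by the per-term probabilities $P(R \mid \text{Year}), \ldots, P(R \mid \text{Make})$.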

7 Logistic Regression
Input (from training set data):
Prob.   1     1     …   0     0
Make    0.8   0.9   …   0.1   0.3
Logistic regression package
Output:
$C_0$     $C_1$      P-value
8.358     -1.606     0.002

8 Probability Estimation
For a test document, the term frequency of index term Make is $x_{\text{Make}} = 0.4514$.
$P(R \mid \text{Make}) = 1/(1+\exp(-(C_0 + C_1 x_{\text{Make}}))) = 1/(1+\exp(-(8.358 + (-1.606 \times 0.4514)))) = 0.9995$
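The estimate on this slide can be reproduced in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `relevance_probability` is an illustrative assumption, and the coefficients are the ones quoted for the index term Make.

```python
import math


def relevance_probability(c0: float, c1: float, x: float) -> float:
    """Logistic estimate P(R|x) = 1 / (1 + exp(-(c0 + c1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))


# Coefficients reported for Make and the test document's term frequency.
C0_MAKE, C1_MAKE = 8.358, -1.606
X_MAKE = 0.4514

p_make = relevance_probability(C0_MAKE, C1_MAKE, X_MAKE)
print(f"P(R|Make) = {p_make:.4f}")  # rounds to 0.9995
```

Because $C_0$ is large relative to $C_1 x$ for term frequencies between 0 and 1, the estimate stays close to 1 over that whole range with these coefficients.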

9 Probability Fitting Curve
[Plot: observed points $(x_i, P(R \mid x_i))$ and the fitted curve of $P(R \mid x)$ against $x$.]
$P(R \mid x) = 1/(1+\exp(-(C_0 + C_1 x)))$

10 Relevance Probability Calculation
For a Car Sale document in a test set, we have
$C_0$ = [.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
$C_1$ = [-.2, -1.6, -.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
X = [.26, .25, .14, .07, .23, .84, .26, .15, .33]
I = [1, 1, 1, 1, 1, 1, 1, 1, 1]
Index = [Ye, Ma, Mo, Mi, Pr, Fe, Ph, De, Gr]
$Y = C_0 I^T + C_1 X^T = 134.111$
$P(R \mid d) = 1/(1+\exp(-Y)) \approx 1$
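The calculation on this slide can be checked with a short Python sketch. It is illustrative only and rests on two assumptions: the eighth entry of $C_1$ is taken to be -10.1, so that there is one coefficient per index term, and the variable names are made up for the example. With the rounded coefficients shown on the slide, Y comes out near 134.02 rather than exactly 134.111, presumably because the slide's total was computed from unrounded coefficients.

```python
import math

INDEX = ["Ye", "Ma", "Mo", "Mi", "Pr", "Fe", "Ph", "De", "Gr"]
C0 = [0.6, 8.4, 3.7, 22.8, 15.5, 5.9, -2.5, 61.9, 29.2]
# Assumption: the eighth coefficient is -10.1 (one value per index term).
C1 = [-0.2, -1.6, -0.9, -1.7, -3.0, -2.5, 1.1, -10.1, -20.5]
X = [0.26, 0.25, 0.14, 0.07, 0.23, 0.84, 0.26, 0.15, 0.33]

# Every vector must carry exactly one entry per index term.
assert len(INDEX) == len(C0) == len(C1) == len(X)

# Y = C0 . I^T + C1 . X^T, where I is a vector of ones.
y = sum(C0) + sum(c1 * x for c1, x in zip(C1, X))

# Logistic transform of the summed score.
p_relevant = 1.0 / (1.0 + math.exp(-y))

print(f"Y = {y:.3f}")
print(f"P(R|d) = {p_relevant:.4f}")
```

With a score this large, exp(-Y) is negligible, so the probability is 1 to any displayed precision.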

11 Statistical Information: P-Value
A p-value is a significance indicator. A large p-value indicates either a bad regression model or a statistically insignificant index term. We should keep only significant index terms.
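Keeping only the significant index terms might look like the sketch below. The p-values are hypothetical apart from the 0.002 reported for Make, and the 0.05 cut-off is an assumed threshold; the slides do not state which one was used.

```python
# Hypothetical p-values per index term; only Make's 0.002 comes from the slides.
p_values = {"Make": 0.002, "Year": 0.31, "Model": 0.01, "Mileage": 0.04}

ALPHA = 0.05  # assumed significance threshold, not taken from the slides

# Keep a term only when its regression p-value is at or below the threshold.
significant_terms = sorted(
    term for term, p in p_values.items() if p <= ALPHA
)
print(significant_terms)  # ['Make', 'Mileage', 'Model']
```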

12 Dependent Relations
Dependent relations exist among index terms. The independence assumption oversimplifies the problem and causes distortion. For example, in the Car Ads application ontology, we expect Make and Model to appear together. Performance can be improved by including significant dependent relations in the relevance probability calculation.

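13 Estimation of relevance probability-2
[Diagram: $P(R \mid \text{Year}), \ldots, P(R \mid \text{Feature})$, $P(R \mid \text{Density})$, $P(R \mid \text{Grouping})$, and $P(R \mid \text{Correlation-1}), \ldots, P(R \mid \text{Correlation-n})$ are combined by multiplication to give $P(R \mid d)$.]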

14 Comparison
Evaluation             VSM      VSM & Machine Learning   Probabilistic
Car Sale   Precision   100% (one value listed for the row)
           Recall      85.7%    91%                      100%
Obituary   Precision   100%     91%                      100%
           Recall      100% (one value listed for the row)

15 Contribution
We propose a probabilistic model that can accurately classify multiple-record Web documents. We will study the impact of dependent relations on the performance of our model.

