Low/High Findability Analysis
Shariq Bashir, Vienna University of Technology
Seminar on 2nd February, 2009

Classifying Low/High Findable Documents
Data used in the experiment: USPC Class 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing) and USPC Class 423 (Chemistry of inorganic compounds).
Total documents: 54,353
Queries: three-term queries (753,682 in total), generated using the Frequent Terms Extraction concept (QG-FT).
Retrieval system used: TFIDF
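The slides do not spell out how the findability score r(d) is computed. One common formulation, assumed here purely for illustration, counts the queries for which a document is ranked within a cutoff. The function name `findability_scores`, the `cutoff` parameter, and the toy rankings are hypothetical.

```python
from collections import Counter

def findability_scores(ranked_results, cutoff):
    """Count, per document, the queries that rank it within the top `cutoff`.

    ranked_results maps each query string to a list of document ids,
    ordered best first. This counting rule is an assumption, not taken
    from the slides.
    """
    scores = Counter()
    for ranking in ranked_results.values():
        for doc_id in ranking[:cutoff]:
            scores[doc_id] += 1
    return scores

# Toy rankings for three 3-term queries (illustrative data only).
results = {
    "acid gas removal": ["d1", "d2", "d3"],
    "steam sterilizing chamber": ["d2", "d1", "d4"],
    "oxide catalyst bed": ["d2", "d3", "d1"],
}
scores = findability_scores(results, cutoff=2)
```

Sorting documents by this score would place rarely retrieved documents at the bottom and often retrieved documents at the top.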

Patents Extracted for Analysis
Next, I extract the bottom 173 patents (Low Findable documents) and the top 157 patents (High Findable documents) for analysis.

Features Extraction
Next, I extract features from these patents to see whether Low and High Findable documents can be classified with a classification model, without running the heavy Findability Measurement. The features that I considered useful are:
– Patent length (Claim section only).
– Number of two-term pairs in the Claim section that have support greater than 2.
– Two-term pair frequencies in the individual patent.
– Two-term pair frequencies in the whole collection.
– Two-term pair frequencies in the patent's 30 most similar patents.
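The first three features can be sketched as follows. The slides do not define "support", so this sketch assumes it is the number of claims in which both terms of a pair occur, and it tokenizes by whitespace for brevity. The names `pair_support` and `claim_features` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_support(claims):
    """Count, for each unordered term pair, the claims containing both terms."""
    support = Counter()
    for claim in claims:
        terms = sorted(set(claim.lower().split()))
        support.update(combinations(terms, 2))
    return support

def claim_features(claims, min_support=2):
    """Return (F1, F2, F3) for one patent's claims.

    F1: claim length in tokens.
    F2: number of term pairs with support greater than `min_support`.
    F3: average support of those pairs (0.0 if there are none).
    """
    support = pair_support(claims)
    frequent = [s for s in support.values() if s > min_support]
    f1 = sum(len(claim.split()) for claim in claims)
    f2 = len(frequent)
    f3 = sum(frequent) / f2 if f2 else 0.0
    return f1, f2, f3

# Toy claims (illustrative data only).
claims = ["acid gas scrubber", "acid gas absorber", "acid gas column", "steam column"]
features = claim_features(claims)
```

Here only the pair (acid, gas) occurs in more than two claims, so it is the single pair counted for the second and third features.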

Features Analysis
– Patent length (Claim section only). (First Feature)
Clearly, patent length alone cannot differentiate Low and High Findable documents. Some short patents are highly findable, and many longer patents have low findability.

Features Analysis
– Number of two-term pairs in the Claim section that have support greater than 2. (Second Feature)
Again, this feature alone clearly cannot differentiate Low and High Findable documents. However, for High Findable patents the support is slightly higher.

Features Analysis
– Two-term pair frequencies in individual patents, for pairs that have support greater than 2 in the Claim section. (Third Feature)
– The main aim of checking this feature was to analyze whether patent writers try to hide their information (from retrieval systems) by lowering the frequencies of terms.
– Since there can be many pairs in each patent, in the analysis I take the average of their support values.

Features Analysis
– The frequency is slightly higher for High Findable documents.
– However, some High Findable patents still have low frequencies, and some Low Findable patents have high frequencies.

Features Analysis
– Two-term pair frequencies in the whole collection. (Fourth Feature)
– The main aim of checking this feature was to analyze the presence of rare term pairs in individual patents.
– Since there can be many pairs in each patent, in the analysis I take the average of their support values.

Features Analysis
– The frequency is higher for High Findable documents.
– This means that Low Findable patents frequently use rare terms.
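A minimal sketch of the fourth feature, assuming that a pair's collection frequency is the number of patents whose terms include both members of the pair. The slides do not state the exact counting rule, and `average_collection_frequency` and the toy collection are hypothetical.

```python
def average_collection_frequency(pairs, collection):
    """Average, over `pairs`, the number of patents containing both terms.

    collection maps patent ids to the set of terms in that patent.
    Returns 0.0 when `pairs` is empty.
    """
    if not pairs:
        return 0.0
    total = 0
    for first, second in pairs:
        total += sum(
            1 for terms in collection.values()
            if first in terms and second in terms
        )
    return total / len(pairs)

# Toy collection of four patents (illustrative data only).
collection = {
    "p1": {"acid", "gas", "column"},
    "p2": {"acid", "gas"},
    "p3": {"steam", "column"},
    "p4": {"zeolite", "membrane"},
}
common = average_collection_frequency([("acid", "gas")], collection)
rare = average_collection_frequency([("zeolite", "membrane")], collection)
```

A patent whose pairs resemble `rare` would score low on this feature, which is the pattern the slides report for Low Findable patents.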

Features Analysis
– Two-term pair frequencies in each patent's 30 most similar patents. (Fifth Feature)
– In the previous rare-terms analysis, I used the whole collection, treating it as a single cluster.
– For this feature, I create a cluster for every patent using a K-NN approach.
– In K-NN, I consider only the 30 most similar patents.

Features Analysis
– The frequency is higher for High Findable documents.
– This means that the term pairs used in Low Findable patents cannot be found in their most similar patents.
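The neighbourhood-based feature can be sketched as below. The slides do not name the similarity measure, so cosine similarity over term-count vectors is an assumption, and the helper names are hypothetical. The slides use the 30 most similar patents; the toy example uses k=2 to stay small.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbours(target_id, vectors, k):
    """Return the ids of the k patents most similar to `target_id`."""
    scored = [
        (cosine(vectors[target_id], vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != target_id
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:k]]

def neighbour_pair_frequency(pairs, target_id, vectors, k=30):
    """Average, over `pairs`, the number of neighbours containing both terms."""
    if not pairs:
        return 0.0
    neighbours = nearest_neighbours(target_id, vectors, k)
    total = sum(
        1
        for first, second in pairs
        for n in neighbours
        if first in vectors[n] and second in vectors[n]
    )
    return total / len(pairs)

# Toy patents (illustrative data only).
vectors = {
    "p1": Counter("acid gas scrubber column".split()),
    "p2": Counter("acid gas absorber".split()),
    "p3": Counter("acid gas column tray".split()),
    "p4": Counter("steam sterilizing chamber".split()),
}
pairs = [("acid", "gas"), ("column", "scrubber")]
f5 = neighbour_pair_frequency(pairs, "p1", vectors, k=2)
```

In this toy case the pair (acid, gas) appears in both neighbours while (column, scrubber) appears in neither, so the average is 1.0.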

Putting It All Together
Classifying Low/High Findable documents without using Findability Measurement.
I used all of these patent features to train classification models. For training, I used the WEKA toolkit. The class attribute takes the values L (Low Findable) and H (High Findable).

Sample Dataset
#  r(d) F1 F2 F3 F4 F5  Class
1  64434.46152613.307694.538462215  H
2  238488.3333613.333335.33333397  H
3  171047.6148813.363642.613636285  L
4  101471.1251613.3754187  H
5  176496.6256413.3753.5153  H
6  341033.3969613.43754.333333266  L
7  19405.6251613.56.62572  L
F1: Patent length (Claim section only).
F2: Number of two-term pairs in the Claim section that have support greater than 2.
F3: Two-term pair frequencies in the individual patent.
F4: Two-term pair frequencies in the whole collection.
F5: Two-term pair frequencies in the patent's 30 most similar patents.
Class: L (Low Findable), H (High Findable)
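A dataset of this shape can be handed to WEKA as an ARFF file. The sketch below writes one under the assumption that each row holds the five numeric features followed by the class label. The function name `to_arff`, the relation name, and the feature values are hypothetical, not taken from the sample table.

```python
def to_arff(relation, rows):
    """Render rows of (F1, F2, F3, F4, F5, label) as ARFF text."""
    lines = ["@relation " + relation, ""]
    for name in ("F1", "F2", "F3", "F4", "F5"):
        lines.append("@attribute " + name + " numeric")
    lines.append("@attribute class {L,H}")
    lines.extend(["", "@data"])
    for row in rows:
        *features, label = row
        if len(features) != 5 or label not in ("L", "H"):
            raise ValueError("expected five features and a class of L or H")
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"

# Hypothetical feature rows (illustrative values only).
rows = [
    (120, 4, 3.5, 210.0, 12.0, "H"),
    (340, 2, 3.0, 15.5, 1.0, "L"),
]
arff_text = to_arff("findability", rows)
```

The score r(d) is left out of the attributes here, since the aim stated in the slides is to classify without the Findability Measurement.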

Multilayer Perceptron (with Cross-Validation 100)
Correctly Classified Instances         245      74.2424 %
Incorrectly Classified Instances        85      25.7576 %
Kappa statistic                          0.4848
Mean absolute error                      0.3238
Root mean squared error                  0.4309
Relative absolute error                 64.918  %
Root relative squared error             86.2466 %
Total Number of Instances              330
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.756    0.27     0.715      0.756   0.735      0.794     L
               0.73     0.244    0.77       0.73    0.749      0.794     H
Weighted Avg.  0.742    0.256    0.744      0.742   0.743      0.794
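The Kappa statistic can be checked against the reported accuracy with a short calculation. The confusion matrix below is not given in the slides. It is inferred from the reported TP rates, FP rates, and precision values, which are consistent with 156 instances of class L and 174 of class H, so treat it as an assumption. The function name `cohen_kappa` is hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(size)
    ) / (total * total)
    return (observed - expected) / (1 - expected)

# Inferred counts (assumption): rows are actual L, H; columns are predicted L, H.
confusion = [
    [118, 38],
    [47, 127],
]
correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row) for row in confusion)
accuracy = correct / total
kappa = cohen_kappa(confusion)
```

With these counts, 245 of 330 instances are correct and chance agreement is 0.5, so kappa is (0.7424 - 0.5) / (1 - 0.5), which matches the reported 0.4848.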

Accuracy with J48
Correctly Classified Instances         237      71.8182 %
Incorrectly Classified Instances        93      28.1818 %
Kappa statistic                          0.4364
Mean absolute error                      0.3592
Root mean squared error                  0.4722
Relative absolute error                 72.0151 %
Root relative squared error             94.5234 %
Total Number of Instances              330
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.731    0.293    0.691      0.731   0.71       0.663     L
               0.707    0.269    0.745      0.707   0.726      0.663     H
Weighted Avg.  0.718    0.281    0.72       0.718   0.718      0.663

Naïve Bayes
Correctly Classified Instances         220      66.6667 %
Incorrectly Classified Instances       110      33.3333 %
Kappa statistic                          0.3251
Mean absolute error                      0.4227
Root mean squared error                  0.4841
Relative absolute error                 84.7803 %
Root relative squared error             96.9639 %
Total Number of Instances              330
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.558    0.236    0.68       0.558   0.613      0.701     L
               0.764    0.442    0.658      0.764   0.707      0.701     H
Weighted Avg.  0.667    0.345    0.668      0.667   0.663      0.701

Some Other Features Could Be
– Frequency of term pairs in referenced or cited patents.
– Frequency of term pairs in similar USPC classes.
