Presentation is loading. Please wait.

Presentation is loading. Please wait.

Harvesting SSL Certificate Data to Identify Web-Fraud Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/10/04 1.

Similar presentations


Presentation on theme: "Harvesting SSL Certificate Data to Identify Web-Fraud Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/10/04 1."— Presentation transcript:

1 Harvesting SSL Certificate Data to Identify Web-Fraud Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/10/04 1

2 Mishari Al Mishari, Emiliano De Cristofaro, Karim El Defrawy, and Gene Tsudik. "Harvesting SSL Certificate Data to Identify Web-Fraud.",Submitted to ICDCS’10,http://arxiv.org/abs/0909.3688http://arxiv.org/abs/0909.3688 2 Conference

3  Introduction  X.509 certificates  Measurements and Analysis of SSL Certificates  Certificate-Based Classifier  Conclusion 3 Outline

4  Web-fraud is one of the most unpleasant features of today’s Internet.  Phishing, Typosquatting  Can we use the information in the SSL certificates to identify web-fraud activities such as phishing and typosquatting, without compromising user privacy?  This paper presents a novel technique to detect web- fraud domains that utilize HTTPS. 4 Introduction

5 5 Typosqatting

6  The classifier achieves a detection accuracy over 80% and, in some cases, as high as 95%.  Our classifier is orthogonal to prior mitigation techniques and can be integrated with other methods.  Note that the classifier only relies on data in the SSL certificate and not any other private user information. 6 Contributions

7 7 X.509 certificates

8  A. HTTPS Usage and Certificate Harvest  Legitimate  Phishing and Typosquatting  B. Certificate Analysis  Analysis of Certificate Boolean Features  Analysis of Certificate Non-Boolean Features 8 Measurements and Analysis of SSL Certificates

9 9 A. HTTPS Usage and Certificate Harvest

10  Legitimate and Popular Domain Data Sets.  Alexa: 100, 000 most popular domains according to Alexa. .com: 100, 000 random samples of.com domain zone file, collected from VeriSign. .net: 100, 000 random samples of.net domain zone file, collected from VeriSign.  We find that 34% of Alexa domains use HTTPS; 21% in.com and 16% in.net. (Commercial) 10 A. HTTPS Usage and Certificate Harvest

11  Phishing Data Set  We collected 2, 811 domains considered to be hosting phishing scams from the PhishTank web site.  30% of these phishing web sites employ HTTPS.  Typosquatting Data Set  we first identified the typo domains in our.com and.net data sets by using Google’s typo correction service.  We discovered that 9, 830 out of 38, 617 are parked domains. 11 A. HTTPS Usage and Certificate Harvest

12 FeatureNameType Used in Classifier Notes F1md5booleanYes The Signature Algorithm of the certificate is "md5WithRSAWncryoption" F2bogus subjectbooleanYes The subject section of the certificate has bogus values (e.g., -. Somestate, somecity) F3self-signedbooleanYesThe certificate is self-signed F4expiredbooleanYesThe certificate is expired F5 verification failed booleanNo The certificate passed the verification of OpenSSL 0.9.8k 25 Mar 2009 (for Debian Linux) F6 common certificate booleanYes The certificate of the given domain is the same as a certificate of another domain F7common serialbooleanYes The serical number of the certificate is the same as the serial of another one. F8 validity period > 3 yesrs booleanYesThe validity period is more than 3 years F9 issuer common name stringYesThe common name of the issuer F10 issuer organization stringYesThe organization name of the issuer F11issuer countrystringYesThe country name of the issuer F12subject countrystringYesThe country name of the subject F13 exact validity duration integerNo The number of days between the starting date and the expiration date F14 serial number length integerYesThe number of characters in the serial number F15 host-common name distance realYes The Jaccard distance value between host name and common name in the subject section 12 B. Certificate Analysis

13  Analysis of Certificate Boolean Features 13 B. Certificate Analysis

14 14 B.Certificate Analysis Fig : CDF of Serial Number Length of Alexa,.com.net (c) phishing (d) typosquatting F14 : Serial Number Length

15 15 Certificate Analysis F15 : Jaccard Distance

16  Around 20% of legitimate popular domains are still using the signature algorithm “md5WithRSAEncryption“ despite its clear insecurity.  A significant percentage (> 30%) of legitimate domain certificates are expired and/or self-signed.  Duplicate certificate percentages are very high in phishing domains.  For most features, the difference in distributions between Alexa and malicious sets is larger than that between.com/.net and malicious sets. 16 Summary of certificate Feature Analysis

17  A. Phishing Classifier  B. Typosquatting Classifier 17 Certificate-Based Classifier

18 ClassifierPositive RecallPositive Precision Random Forest0.740.77 SVM0.680.75 Decision Tree0.700.79 Bagging Decision Tree0.730.80 Boosting Decision Tree0.740.69 Decision Tree0.720.78 Nesrest Neighbor0.740.73 Neural Networks0.700.77 18 Phishing Classifier Table IV Performance of classifiers - Data set consists of (A)420 phishing certificates and (B)420 non-phishing certificates (Alexa,.COM and.NET) ClassifierPositive RecallPositive Precision Random Forest0.900.89 SVM0.910.87 Decision Tree0.860.89 Bagging Decision Tree0.900.88 Boosting Decision Tree0.890.81 Decision Tree0.900.84 Nesrest Neighbor0.870.86 Neural Networks0.870.89 Table V Performance of classifiers - Data set consists of (A)420 phishing certificates and (B)420 non-phishing certificates (Alexa)

19 19 Typosquatting Classifier ClassifierPositive RecallPositive Precision Random Forest0.860.88 SVM0.860.88 Decision Tree0.840.93 Bagging Decision Tree0.870.90 Boosting Decision Tree0.890.85 Decision Tree0.860.87 Nesrest Neighbor0.860.84 Neural Networks0.840.89 Table VI Preformance of classifiers - Data set consists of (A)486 typosquatting certificates and (B)486 non-typosquatting certificates (Top Alexa,.COM and.NET) ClassifierPositive RecallPositive Precision Random Forest0.950.93 SVM0.940.92 Decision Tree0.950.94 Bagging Decision Tree0.960.94 Boosting Decision Tree0.980.90 Decision Tree0.960.92 Nesrest Neighbor0.930.94 Neural Networks0.95 Table VII Preformance of classifiers - Data set consists of (A)486 typosquatting certificates and (B)486 popular domain certificates

20  We design and build a machine-learning-based classifier that identifies fraudulent domains using HTTPS based solely on their SSL certificates, thus also preserving user privacy.  We believe that our results may serve as a motivating factor to increase the use of HTTPS on the Web.  Use of HTTPS can help identifying web-fraud domains. 20 Conclusion


Download ppt "Harvesting SSL Certificate Data to Identify Web-Fraud Reporter : 鄭志欣 Advisor : Hsing-Kuo Pao 2010/10/04 1."

Similar presentations


Ads by Google