In the search of resolvers
Jing Qiao 乔婧, Sebastian Castro – NZRS
DNS-WG, RIPE 73, Madrid



Background
- Domain popularity ranking: derive domain popularity by mining DNS data
- DNS data is noisy by nature
- Certain source addresses represent resolvers; the rest show a variety of behaviour
- Can we pinpoint the resolvers?

Noisy DNS
- Long tail of addresses sending only a few queries on a given day

Data Collection
- To identify resolvers, we need some base curated data
- 836 known resolver addresses: local ISPs, Google DNS, OpenDNS
- 276 known non-resolver addresses: monitoring addresses from ICANN; addresses sending only NS queries

Exploratory Analysis
- Do all resolvers behave in a similar way? (resolvers-from-our-point-of-view-2/)
- Conclusions: there are some patterns
  - Primary/secondary address
  - Validating resolvers
  - Resolvers in front of mail servers

Supervised classifier
- Can we predict if a source address is a resolver?
- 14 features per day per address:
  - Fraction of A, AAAA, MX, TXT, SPF, DS, DNSKEY, NS, SRV and SOA queries
  - Fraction of NoError and NxDomain responses
  - Fraction of CD and RD queries
- Training data: extract 1 day of DNS traffic (653,232 unique source addresses)
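The 14-feature vector described above can be sketched as follows. This is a minimal illustration, not the presenters' Hadoop pipeline: the query-record keys (`qtype`, `rcode`, `cd`, `rd`) are hypothetical names for the fields one day of per-address DNS traffic would carry.

```python
from collections import Counter

QTYPES = ["A", "AAAA", "MX", "TXT", "SPF", "DS", "DNSKEY", "NS", "SRV", "SOA"]

def feature_vector(queries):
    """Build the 14 per-address features from one day of queries.

    `queries` is a list of dicts with hypothetical keys:
    qtype, rcode, cd (Checking Disabled flag), rd (Recursion Desired flag).
    """
    n = len(queries)
    qtype_counts = Counter(q["qtype"] for q in queries)
    rcode_counts = Counter(q["rcode"] for q in queries)
    features = [qtype_counts[t] / n for t in QTYPES]     # 10 query-type fractions
    features.append(rcode_counts["NOERROR"] / n)         # NoError fraction
    features.append(rcode_counts["NXDOMAIN"] / n)        # NxDomain fraction
    features.append(sum(q["cd"] for q in queries) / n)   # CD flag fraction
    features.append(sum(q["rd"] for q in queries) / n)   # RD flag fraction
    return features
```

Computing one such vector per source address per day yields the training matrix the classifier consumes.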

Training Model: LinearSVC
LinearSVC(C=1.0, class_weight='balanced', dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)
- train time: 0.003s
- cross-validating: accuracy 1.00 (+/- 0.00), CV time 0.056s
- test time: 0.000s
- dimensionality: 14
- classification report: precision / recall / f1-score / support
100% Accuracy! Success! Random Forest: 100% accuracy! K Neighbors: 100% accuracy!
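The training step shown on the slide can be reproduced roughly like this. The LinearSVC parameters mirror the slide's configuration; the feature matrix here is synthetic stand-in data (two well-separated classes), since the real per-address features are not in the transcript:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy stand-in for the real feature matrix: 14 features per address,
# label 1 = resolver, 0 = non-resolver.
rng = np.random.default_rng(0)
X_res = rng.uniform(0.4, 0.6, size=(100, 14))   # "resolver-like" rows
X_non = rng.uniform(0.0, 0.1, size=(100, 14))   # "non-resolver" rows
X = np.vstack([X_res, X_non])
y = np.array([1] * 100 + [0] * 100)

clf = LinearSVC(C=1.0, class_weight="balanced", dual=True, max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

On cleanly separable data like this, cross-validation reports perfect accuracy, which is exactly the warning sign the next slides unpack: the model may simply be fitting an overly specific pattern in the labels.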

Test the model
- Use the model with different days; a resolver is represented as 1, a non-resolver as 0

df = predict_result(model, " ")
df.isresolver_predict.value_counts()
(repeated for three different days)

- Very high proportion of resolvers?

Preliminary Analysis
- Most of the addresses are classified as resolvers
- The list of non-resolvers shows a very specific behaviour
- The model is fitting that specific behaviour
- Improve the training data to include different patterns

Unsupervised classifier
- What if we let a classifier learn the structure instead of imposing it?
- The same 14 features, 1 day's DNS traffic
- Ignore clients that send fewer than 10 queries, to reduce the noise
- Run the K-Means algorithm with K=6 (inspired by Verisign work from 2013)
- Calculate the percentage of clients distributed across clusters
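The clustering step above can be sketched with scikit-learn. Again the feature matrix and per-client query counts are synthetic stand-ins; only the K=6 choice and the minimum-query filter come from the slide:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the per-address feature matrix (14 features, as before).
rng = np.random.default_rng(1)
X = rng.random((500, 14))
query_counts = rng.integers(1, 100, size=500)  # queries sent per client

# Ignore clients sending fewer than 10 queries, to reduce the noise.
X_kept = X[query_counts >= 10]

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_kept)

# Percentage of clients distributed across the clusters.
labels, counts = np.unique(km.labels_, return_counts=True)
pct = 100.0 * counts / counts.sum()
for c, p in zip(labels, pct):
    print(f"cluster {c}: {p:.1f}%")
```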

K-Means Cost Curve
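The cost curve on this slide is the standard elbow plot: K-Means inertia (within-cluster sum of squares) computed over a range of K, with the "elbow" suggesting a reasonable K. A sketch of how such a curve could be produced, on stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.random((300, 14))  # stand-in for the per-address feature matrix

# K-Means cost (inertia) for a range of K; plot cost vs. K and look
# for the "elbow" where additional clusters stop paying off.
costs = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in range(2, 11)}
for k, cost in sorted(costs.items()):
    print(k, round(cost, 1))
```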

Query Type Profile per cluster

Rcode profile per cluster

Flag profile per cluster

Clustering accuracy
- How many known resolvers fall in the same cluster? How many known non-resolvers?
- Tested on both a week day and a weekend: 98%–99% of known resolvers fit in the same cluster
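The 98%–99% figure can be computed as the fraction of labelled addresses that land in the single most common cluster. A small helper, written here as an illustration (the function name and input shape are assumptions, not the presenters' code):

```python
from collections import Counter

def same_cluster_fraction(cluster_labels):
    """Fraction of addresses falling into the single most common cluster.

    `cluster_labels` is the list of cluster ids assigned to the known
    resolver (or known non-resolver) addresses.
    """
    if not cluster_labels:
        return 0.0
    _, top = Counter(cluster_labels).most_common(1)[0]
    return top / len(cluster_labels)

# e.g. 98 of 100 known resolvers landing in cluster 3:
print(same_cluster_fraction([3] * 98 + [1, 5]))  # 0.98
```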

Client persistence
- Another differentiating factor could be client persistence
- Within a 10-day rolling window, count the addresses seen on a specific number of days
- Addresses sending traffic all the time should fit the known resolver and monitoring roles
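The rolling-window count described above can be sketched as a histogram over "days seen". This is an illustrative helper, not the presenters' implementation; the input is assumed to be one set of source addresses per day:

```python
from collections import Counter

def persistence_histogram(daily_sets):
    """Given a window as a list of per-day address sets, count how many
    addresses appear on exactly n days, for each n up to the window size."""
    days_seen = Counter()
    for day in daily_sets:
        for addr in day:
            days_seen[addr] += 1
    return Counter(days_seen.values())

# Toy 10-day window: "a" is seen every day (resolver-like), "b" only once.
window = [{"a", "b"}] + [{"a"}] * 9
print(persistence_histogram(window))  # one address on all 10 days, one on 1 day
```

Addresses piling up at the full window length are the resolver-like and monitoring candidates; the long tail at one or two days is the noise.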

Client Persistence

Resolvers persistence
- Do the known resolver addresses fit the persistence hypothesis?
- What if we check their presence at different levels?

Resolvers persistence

Future work
- Identify unknown resolvers by checking membership of the "resolver-like" cluster
- Exchange information with other operators about known resolvers
- Potential uses: curated list of addresses, white-listing, others

Conclusions
- This analysis can be repeated by other ccTLDs using authoritative DNS data
- Uses open source tools: Hadoop + Python
- The analysis code will be made available
- Easily adaptable to use ENTRADA

Contact: