Overview: Humans are unique creatures. Everything we do differs slightly from what everyone else does, even though these differences are often so minute that they go unnoticed. The same is true of web browsing: the way each person browses the web is unique to that person. The websites they visit, as well as the order in which they visit them, are unique. Wouldn't it be nice if this uniqueness were not simply overlooked but actually used to benefit the user's browsing experience? In this research we compare different representations of browsing histories to find which one best captures this uniqueness. Then, using machine learning algorithms, this research attempts to create a fingerprint from which a user could be identified based on their web-browsing history alone.

Representing the browsing history to the computer: Every user's history is stored in a database file whose format depends on the browser they use. These database files contain a great deal of extra data beyond just the webpages visited.
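As a sketch of what recovering the ordered page list from such a database might look like, the following builds a tiny in-memory stand-in; the two-table layout is an assumption made for illustration only (real browser history databases, such as Firefox's places.sqlite, have many more tables and columns):

```python
import sqlite3

# A tiny in-memory stand-in for a browser history database.
# The real schema is far richer; this simplified layout is assumed here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE places (id INTEGER PRIMARY KEY, url TEXT);
    CREATE TABLE visits (place_id INTEGER, visit_time INTEGER);
    INSERT INTO places VALUES (1, 'http://site2.com/a'), (2, 'http://site3.com/b');
    INSERT INTO visits VALUES (1, 100), (2, 200), (1, 300);
""")

# Strip everything except the webpages themselves, ordered by visit time.
rows = conn.execute("""
    SELECT p.url FROM visits v
    JOIN places p ON p.id = v.place_id
    ORDER BY v.visit_time
""").fetchall()
history = [url for (url,) in rows]
```

The join discards all the extra metadata and keeps only the ordered list of visited pages, which is the input the rest of the pipeline works from.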
For the purposes of this research I have stripped all of this extra data away to create an ordered list of every webpage visited. The next step is to turn this list into a data set the computer can use. The simplest such data set would be similar to the one above, where each column represents a different "attribute" of the data. In this case each attribute represents a webpage in the set of all webpages. The last column contains the user names; this is the column that will be empty when we do not know who a browsing history belongs to. Each row contains the browsing history of one user. These rows are called instances.

Finding out what makes each history unique: To identify someone from their browsing history, there must be something that makes each browsing history unique. After careful analysis and a bit of common sense, I have deduced that there are three main features of a browsing history that make it unique.
These features are:
1. The websites that have been visited
2. The number of times each website has been revisited
3. The order in which the websites have been visited

Manipulating the dataset to represent the history's uniqueness: Representing which webpages have been visited, and how many times, was simple. In the simple data set shown below, every website the user visited is already represented. Representing the number of times each website was visited is also a simple task: instead of being a binary yes or no, each attribute value can be a number representing how many times the site that attribute stands for was visited.
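A minimal sketch of building such a count-valued dataset, with one attribute per site and the user name in the last column (the user names and sites here are invented for illustration):

```python
# Turn each user's ordered page list into one instance: one attribute
# per known site (value = visit count) plus the user name as the label.
histories = {
    "Bob":   ["site2", "site3", "site2"],
    "Alice": ["site1", "site2"],
}

# The attribute set is the union of all sites seen across all users.
sites = sorted({s for h in histories.values() for s in h})

def to_instance(user, pages):
    counts = [pages.count(site) for site in sites]  # revisit counts
    return counts + [user]                          # label goes last

dataset = [to_instance(u, h) for u, h in histories.items()]
```

Because every instance shares the same attribute columns, instances from different users can sit in the same dataset, which is exactly what a classifier needs.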
This leads to another question about what it means to revisit a website, because the history stores every "webpage" visited. To solve this problem my research counts both webpage and website visits and creates datasets for both (one at the website level and one at the webpage level).

Manipulating the dataset to represent the history's order: Creating one data set containing multiple users requires that every user have the same attributes, which makes it impossible to preserve each user's ordering directly. To solve this problem I employ a technique from natural language processing called n-grams. In NLP, n-grams are used to group words together and to help predict parts of speech. The "n" in n-gram stands for the number of grams grouped together; a gram can be any variable that exists in an ordered list.
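The grouping can be sketched as a short function; skip=0 yields ordinary n-grams, while a positive skip passes over that many sites between recorded grams (the site names are invented for illustration):

```python
def skip_ngrams(pages, n, skip=0):
    """Group every run of n sites, taking one site every (skip+1) steps.
    skip=0 gives ordinary consecutive n-grams."""
    step = skip + 1
    span = (n - 1) * step  # distance from first to last gram in a group
    return [tuple(pages[i + j * step] for j in range(n))
            for i in range(len(pages) - span)]

visits = ["Site2", "Site3", "Site5", "Site1", "Site4"]
trigrams = skip_ngrams(visits, n=3)  # 0-skip tri-grams
```

Each resulting tuple becomes one ordered feature, so some of the visit order survives even though all users share the same attribute space.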
In my research a gram is a visited site. In a tri-gram representation of the dataset, you can see in the first instance that Bob visited Site2, then Site3, then Site5. The n-gram technique also has another variable: skips. A skip represents the number of grams skipped over before recording another gram. A dataset for a 2-skip tri-gram would look exactly the same as the one above, except that no two recorded sites would have been next to each other in the original history. For example, in the first instance Bob would have visited Site2, then two more sites, then Site3, then two more sites, then Site5.

The Experiment:
1. Collect browsing histories from volunteers.
2. Strip the extra data out of the collected browsing histories.
3. Create data sets. This includes:
   1. A separate dataset for every combination of n-grams and skips, from a 0-skip bi-gram to a 50-skip 50-gram
   2. One of every previous dataset at both website and webpage specificity
   3. Splitting every dataset into two sets: an 80% training set and a 20% testing set
4. For every dataset, train a classifier and test it with its corresponding test set.
5. Evaluate the results to find which representation of the data yields the highest percentage of correct predictions.
6. Report on the findings.

Using the created datasets: After researching different techniques, I found that learning classifiers were best suited to this identification task, for four main reasons:
1. They use simple datasets that are easily manipulated.
2. Classification is a task similar to identification.
3. A great deal of classifier algorithms have already been developed and are readily available through the Weka library.
4. Tools are readily available to evaluate the correctness of a classifier's results on a dataset.
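The train-and-evaluate loop of the experiment can be sketched as below. The real study trains classifiers from the Weka library; the one-nearest-neighbour rule here is only a runnable placeholder, and the toy rows (visit counts with the user name in the last column) are invented:

```python
import random

def evaluate(dataset, train_frac=0.8, seed=0):
    """80/20 split, train a stand-in classifier, report test accuracy.
    1-NN is a placeholder for the Weka classifiers used in the study."""
    rows = dataset[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    train, test = rows[:cut], rows[cut:]

    def predict(features):
        # Nearest training instance by squared distance over attributes.
        best = min(train, key=lambda r: sum((a - b) ** 2
                                            for a, b in zip(r[:-1], features)))
        return best[-1]

    correct = sum(predict(row[:-1]) == row[-1] for row in test)
    return correct / len(test)

# Toy dataset: visit-count attributes, user name as the class label.
data = [[0, 2, 1, "Bob"], [0, 3, 1, "Bob"],
        [1, 1, 0, "Alice"], [2, 1, 0, "Alice"]]
```

Running this evaluation once per generated dataset, and comparing the resulting accuracies, is what identifies the representation that best fingerprints a user.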