JStylo: An Authorship-Attribution Platform and its Applications


Introduction

JStylo is a platform designed to conduct supervised stylometry experiments: authorship attribution using linguistic style. It uses NLP techniques to extract linguistic features from documents, and supervised machine learning methods to classify those documents based on the extracted features. The platform's feature extraction core is based on the JGAAP API [1], and the classifiers available include Weka [2] classifiers and an implementation of the Writeprints [3] classifier.

Source: https://github.com/psal/JStylo-Anonymouth

Motivation

- Important for research in history, literature and forensics.
- Impact on privacy and anonymity in online environments:
  - Reveal identity: users can use various tools to hide their location, but their writing style may still be exposed. JStylo provides a convenient platform for developing methods to reveal anonymous identities.
  - Preserve anonymity: on the other hand, JStylo can be used for developing and testing methods to secure anonymous communication, like Anonymouth [4].
- Stylometry research is useful not only for revealing identities, but also author characteristics, like age, gender, native language and personality type.

Novelty

- Cumulative feature-set analysis (vs. one feature at a time)
- Added feature extractors and processing tools:
  - Readability / complexity metrics
  - Regular-expression-based features
  - Counters (word / letter / regular expression)
- High feature-level customizability
- Factoring and normalization
- Uses Weka classifiers
- Provides an implementation of Writeprints

Platform Overview

Applications

Document Anonymization Using Anonymouth [4]

JStylo serves as the authorship-attribution engine used to evaluate the anonymization level.

[Figure: Anonymouth workflow. Inputs: a "blend-in" corpus (Author 1, Author 2, ..., Author N), the user's own documents ("My docs"), and the document to anonymize. Steps: learn styles, suggest changes, change document, check if anonymized (YES / NO). On YES: document anonymized.]

Problem Definition

[Figure: training documents by Author 1, Author 2, ..., Author N, and test documents of unknown authorship ("?").]
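The supervised setup described above (documents by known authors for training, a document of unknown authorship for testing) can be illustrated with a minimal sketch. This is not JStylo's, JGAAP's, or Weka's API: the feature list, the `FUNCTION_WORDS` set, the toy `TRAINING` corpus, and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained. It only shows the general shape of extracting several features at once, normalizing them by document length, and classifying on the resulting vector.

```python
import math
import re
from collections import Counter

# Hypothetical function-word list; not an actual JStylo feature set.
FUNCTION_WORDS = ["the", "and", "of", "to", "i"]


def extract_features(text):
    """Map a document to one vector holding several features at once."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    counts = Counter(words)
    # Complexity-style metric: average word length.
    avg_word_len = sum(len(w) for w in words) / n_words
    # Regular-expression-based counter: punctuation marks per word.
    punct_rate = len(re.findall(r"[,;:!?]", text)) / n_words
    # Word counters, normalized by document length.
    word_rates = [counts[w] / n_words for w in FUNCTION_WORDS]
    return [avg_word_len, punct_rate] + word_rates


def centroid(vectors):
    """Component-wise mean of equally sized feature vectors."""
    return [sum(column) / len(column) for column in zip(*vectors)]


def train(docs_by_author):
    """Build one centroid per author from that author's training documents."""
    return {
        author: centroid([extract_features(doc) for doc in docs])
        for author, docs in docs_by_author.items()
    }


def classify(model, text):
    """Attribute a document to the author with the nearest centroid."""
    vector = extract_features(text)
    return min(model, key=lambda author: math.dist(model[author], vector))


# Toy corpus, invented for illustration only.
TRAINING = {
    "alice": [
        "the cat and the dog sat on the mat",
        "the sun and the moon lit the sky",
    ],
    "bob": [
        "notwithstanding considerable difficulties; researchers persevered; ultimately succeeding",
        "extraordinary circumstances; necessitate extraordinary countermeasures; evidently",
    ],
}
```

Calling `classify(train(TRAINING), some_text)` returns the author label whose centroid is closest to the feature vector of `some_text`. A real experiment would substitute a richer feature set and a stronger classifier, such as the Weka classifiers the poster mentions.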
[Figure (Platform Overview): processing pipeline. Feature selection: a feature set f1, f2, f3, ..., fM, where each feature has document pre-processing, feature extraction, and feature post-processing (normalization, factoring). Classifier selection: c1, c2, c3, ..., cL. Analysis: training documents by authors A1, A2, ..., AN and a test document go through pre-processing, feature extraction, and post-processing (sample feature values shown: 12, 289, 5.61, 13.7 and 1.2, 5.78, 5, 41.1); classifiers c1, ..., cL are trained, with outputs labeled "Training Set CV Results" and "Results" (sample labels shown: A3, A15, A7).]

Personal Traits Identification: Native Language Using Language-Family Information

- Classify documents by native language.
- Set a threshold T on the classification probabilities.
- Use language-family reclassification for instances classified with probability p < T to improve language classification.

[Figure: the language classifier outputs a result directly when P > T. When P < T, a family classifier selects a family Fi among the families F1, F2, F3; the candidate languages Li1, Li2, Li3, ..., Lij of that family are passed to a language classifier, which outputs the result. Families shown: F1 (L11, L12, L13), F2 (L21, L22, L23), F3 (L31, L32, L33).]

Evaluation

A sample evaluation using the Writeprints feature set with the Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:

- 45 authors
- > 6,500 words per author, divided into ~500-word documents
- 10-fold cross-validation (results figure not preserved)

Stylometry-Based Authentication

- An attacker may have user credentials.
- Learn the legitimate user's writing style.
- Record user activity and use stylometry to verify that the user is who s/he says s/he is.

[Figure: a classifier is trained on the legitimate user's writing and tested on text entered under legitimate credentials. Example text from a malicious user: "…I AM A MALICIOUS USER, BEWARE…".]

References

[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A. and Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
[5] Brennan, M. and Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)
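The language-family reclassification step described under Personal Traits Identification can be sketched as follows. This is an illustration under stated assumptions, not the poster's implementation: the classifiers are passed in as plain callables returning probability dictionaries, the function and parameter names are hypothetical, and the second stage simply reuses the first-stage language probabilities restricted to the predicted family, whereas the poster's diagram shows a separate language classifier at that point. The poster gives only the cases P > T and P < T; treating P = T as confident is a further assumption.

```python
def classify_with_family_fallback(doc, language_clf, family_clf,
                                  family_members, threshold):
    """Return a language label, falling back to family information when
    the language classifier is not confident enough.

    language_clf(doc) -> {language: probability}   (hypothetical interface)
    family_clf(doc)   -> {family: probability}     (hypothetical interface)
    family_members    -> {family: [languages in that family]}
    """
    lang_probs = language_clf(doc)
    best_lang = max(lang_probs, key=lang_probs.get)

    # Confident prediction: keep the first-stage result.
    # (The equality case is an assumption; the source leaves it unspecified.)
    if lang_probs[best_lang] >= threshold:
        return best_lang

    # Low confidence: predict the language family first.
    family_probs = family_clf(doc)
    best_family = max(family_probs, key=family_probs.get)

    # Reclassify among the candidate languages of that family only.
    # Simplification: reuse first-stage probabilities instead of running
    # a dedicated per-family language classifier.
    candidates = family_members[best_family]
    return max(candidates, key=lambda lang: lang_probs.get(lang, 0.0))


# Stub classifiers with made-up probabilities, for illustration only.
def stub_language_clf(doc):
    return {"spanish": 0.40, "russian": 0.35, "italian": 0.25}


def stub_family_clf(doc):
    return {"romance": 0.30, "slavic": 0.70}


STUB_FAMILIES = {"romance": ["spanish", "italian"], "slavic": ["russian"]}
```

With the stub values, a threshold above 0.40 sends the document through the family stage, where the predicted family can override the weakly preferred first-stage label; a lower threshold keeps the first-stage label unchanged.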