JStylo: An Authorship-Attribution Platform and its Applications

Introduction
JStylo is a platform for conducting supervised stylometry experiments: authorship attribution based on linguistic style. It uses NLP techniques to extract linguistic features from documents, and supervised machine learning methods to classify those documents based on the extracted features. The platform's feature-extraction core is based on the JGAAP API [1]; the available classifiers include Weka [2] classifiers and an implementation of the Writeprints [3] classifier.
Source: https://github.com/psal/JStylo-Anonymouth

Motivation
- Important for research in history, literature and forensics.
- Impact on privacy and anonymity in online environments:
  - Reveal identity: users can use various tools to hide their location, but their writing style may still be exposed. JStylo provides a convenient platform for developing methods to reveal anonymous identities.
  - Preserve anonymity: JStylo can also be used to develop and test methods for securing anonymous communication, such as Anonymouth [4].
- Stylometry research is useful for revealing not only identities but also author characteristics such as age, gender, native language and personality type.

Novelty
- Cumulative feature-set analysis (vs. one feature at a time)
- Added feature extractors and processing tools
  - Readability / complexity metrics
  - Regular-expression-based features
  - Counters (word / letter / regular expression)
- High feature-level customizability
  - Factoring and normalization
- Uses Weka classifiers
- Provides an implementation of Writeprints

Platform Overview
1. Problem Definition: training documents grouped by author (Author 1, Author 2, ..., Author N) and test documents of unknown authorship.
2. Feature Selection: a feature set of features f1, f2, f3, ..., fM. Each feature is configured with document pre-processing, feature extraction and feature post-processing (normalization, factoring).
3. Classifiers Selection: classifiers c1, c2, c3, ..., cL.
4. Analysis: the training documents (authors A1, A2, ..., AN) and the test documents pass through document pre-processing, feature extraction and feature post-processing. Each classifier c1, ..., cL is then trained and tested. Outputs: training-set cross-validation (CV) results, and results for the test documents (for example A3, A15, A7).

Evaluation
A sample evaluation using the Writeprints feature set with the Weka SMO SVM classifier on the Extended Brennan-Greenstadt Adversarial corpus [5]:
- 45 authors
- More than 6,500 words per author, divided into documents of about 500 words
- 10-fold cross-validation: [result not preserved in the transcript]

Applications

Document Anonymization Using Anonymouth [4]
JStylo serves as the authorship-attribution engine that evaluates the anonymization level.
[Diagram: styles are learned from a "blend-in" corpus (Author 1, Author 2, ..., Author N), the user's own documents ("My docs") and the document to anonymize. Changes are suggested, the document is changed, and it is checked for anonymity. NO: the cycle repeats. YES: the document is anonymized.]

Personal Traits Identification: Native Language Using Language-Family Information
- Classify documents by native language.
- Set a threshold T on the classification probability.
- Use language-family reclassification for instances classified with probability p < T to improve language classification.
[Diagram: a language classifier chooses among the candidate languages L11, ..., L33, which are grouped into families F1, F2, F3. If p > T, its result L is kept. If p < T, a family classifier chooses a family Fi among the candidate families F1, F2, F3, and a language classifier restricted to the candidate languages Li1, Li2, Li3 of Fi returns Lij.]

Stylometry-Based Authentication
- An attacker may have user credentials.
- Learn the legitimate user's writing style.
- Record user activity and use stylometry to verify that the user is who they claim to be.
[Diagram: the classifier is trained on the legitimate user's writing and tested on text entered under the legitimate credentials, for example "…I AM A MALICIOUS USER, BEWARE…" typed by a malicious user.]

References
[1] Juola, P., et al.: JGAAP, a Java-Based, Modular, Program for Textual Analysis, Text Categorization, and Authorship Attribution (2009)
[2] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The Weka Data Mining Software: An Update (2009)
[3] Abbasi, A., Chen, H.: Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace (2008)
[4] McDonald, A., Afroz, S., Caliskan, A., Stolerman, A., Greenstadt, R.: Use Fewer Instances of the Letter "i": Toward Writing Style Anonymization (2012)
[5] Brennan, M., Greenstadt, R.: Practical Attacks Against Authorship Recognition Techniques (2009)
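
The four-step workflow in the Platform Overview (problem definition, feature selection, classifier selection, analysis) can be illustrated with a small, self-contained sketch. This is not JStylo's API: the function names, the feature choice (letter frequencies plus mean word length), the factor value, and the nearest-centroid classifier are all assumptions made for illustration, standing in for JStylo's configurable feature sets and its Weka / Writeprints classifiers.

```python
import math
import string
from collections import Counter

# Hypothetical factor applied to the mean-word-length feature so that it
# stays on a scale comparable to the letter frequencies ("factoring" is
# modeled here as multiplication by a constant, which is an assumption).
WORD_LENGTH_FACTOR = 0.1


def preprocess(text):
    """Document pre-processing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def extract_features(text):
    """Feature extraction plus post-processing for one pre-processed document.

    Returns 26 letter counts normalized by the total letter count,
    followed by the factored mean word length (27 values in total).
    """
    letters = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total_letters = sum(letters.values()) or 1  # avoid division by zero
    vector = [letters[ch] / total_letters for ch in string.ascii_lowercase]
    words = text.split()
    mean_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    vector.append(mean_word_length * WORD_LENGTH_FACTOR)
    return vector


def train(training_docs):
    """Train a nearest-centroid model: one mean feature vector per author.

    training_docs maps an author name to a list of that author's texts.
    """
    centroids = {}
    for author, texts in training_docs.items():
        vectors = [extract_features(preprocess(t)) for t in texts]
        centroids[author] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return centroids


def classify(centroids, text):
    """Attribute a test document to the author with the closest centroid."""
    vector = extract_features(preprocess(text))
    return min(centroids, key=lambda author: math.dist(vector, centroids[author]))


if __name__ == "__main__":
    # Toy training documents, invented purely for this example.
    docs = {
        "author_1": ["zzz zaz zizz zuz", "zaz zzz zoz"],
        "author_2": ["eee ebe eke", "ete eee epe"],
    }
    model = train(docs)
    print(classify(model, "zzz zez zaz"))
```

In a real experiment the toy features would be replaced by a full feature set such as Writeprints, and the attribution step would be evaluated with cross-validation on the training set, as in the sample evaluation described in the poster.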

