LDA-based Dark Web Analysis (2009 IEEE Symposium on Computational Intelligence in Cyber Security)

Presentation transcript:

Slide 1: LDA-based Dark Web Analysis
2009 IEEE Symposium on Computational Intelligence in Cyber Security

Slide 2: Outline
- What is the Dark Web?
- Why do we need to analyze it?
- How to analyze the Dark Web: our strategy
  - Web crawling
  - Topic discovery based on Latent Dirichlet Allocation (LDA)
  - Optimization process
- Conclusion

Slide 3: What is the Dark Web?
The Web is a global information platform accessible from different locations. It is a fast way to spread information anonymously or with few regulations, and its cost is relatively low compared with other media.
The Dark Web is the place where terrorist/extremist organizations and their sympathizers:
- exchange ideology
- spread propaganda
- recruit members
- plan attacks
An example of a Dark Web site: [figure not preserved in the transcript]

Slide 4: Why do we need to analyze it?
To find the hidden topics in the Dark Web community, which are:
- embedded in other large-scale online web sites
- information overloaded
- multilingual

Slide 5: How to analyze Dark Web: architecture of our strategy
- GS: Gibbs Sampling, a random walk in the sample space used to estimate the model's parameters
- LDA: Latent Dirichlet Allocation

Slide 6: How to analyze Dark Web: architecture of our strategy
- Use a web crawler to download text-based documents
  - Prune by removing:
    - all the HTML tags
    - irrelevant content such as images and navigation instructions
  - Format into a plain text file F:
      F := header {doc}
      header := a line containing the number of documents
      doc := {term_1}
- Feed the text file to the GibbsLDA analyzer to discover the latent topics
- Optimize topic discovery
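The pruning and formatting steps above can be sketched in a few lines of Python. This is only an illustration, not the authors' Web-harvest configuration: the names `TextExtractor`, `html_to_terms`, and `format_corpus` are hypothetical, the tokenizer (lowercase ASCII letters only) is an assumption that would not suit a multilingual corpus, and putting one document per line is an assumed reading of the grammar on the slide.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_terms(html):
    """Strip tags and return lowercase alphabetic terms (assumed tokenizer)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z]+", " ".join(parser.parts).lower())


def format_corpus(pages):
    """Build file F: a header line with the document count, then one doc per line."""
    docs = [" ".join(html_to_terms(page)) for page in pages]
    docs = [d for d in docs if d]  # drop pages with no text left after pruning
    return "\n".join([str(len(docs))] + docs) + "\n"
```

Calling `format_corpus` on a list of downloaded HTML strings returns the text that would be written to F and handed to the topic analyzer.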

Slide 7: Criteria to select web crawlers
- Able to parse ill-coded web pages
- Supports parameterized URLs
- Flexible enough to handle different web site structures
- Normalizes its output: the downloaded web pages will be read by a machine rather than a human, so the text corpus must be well formatted and readable
- Easy to maintain, with minimal hardware requirements
- Does not need to be very fast
- Does not introduce any intellectual property problems

Slide 8: Web-harvest vs. others

Slide 9: Web-harvest pipeline

Slide 10: Topic discovery based on LDA
- LDA is an Information Retrieval (IR) technique
- Information Retrieval (IR)
  - reduces information overload
  - preserves the essential statistical relationships
- Basic and traditional IR methods
  - tf-idf scheme: term-count pair => term-by-document matrix
  - LSI (latent semantic indexing)
  - pLSI (probabilistic LSI)
  - Clustering: divide the data set into subsets
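To make the tf-idf scheme concrete, here is a minimal sketch that turns tokenized documents into a term-by-document matrix. It assumes one common weighting (raw term count times log(N / document frequency)); the slides do not say which variant was used, and the function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one row per term and one column per document."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    n_docs = len(docs)
    matrix = []
    for term in vocab:
        df = sum(1 for c in counts if term in c)  # number of documents containing the term
        idf = math.log(n_docs / df)               # assumed idf variant: log(N / df)
        matrix.append([c[term] * idf for c in counts])
    return vocab, matrix
```

A term that occurs in every document gets weight zero in every column, which is how the scheme suppresses uninformative words before further analysis.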

Slide 11: Dirichlet Distribution
A multivariate generalization of the beta distribution.

Slide 12: Beta Distribution
A continuous probability distribution whose probability density function (pdf) is defined on the interval [0, 1].
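The formulas from these two slides did not survive in the transcript. For reference, the standard textbook densities are given below; the shape-parameter names a and b are chosen here for illustration so that they do not clash with the α and β used for the LDA hyper-parameters.

```latex
% Beta density with shape parameters a, b > 0
\[
f(x; a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1},
\qquad x \in [0, 1]
\]

% Dirichlet density over a K-dimensional probability vector \theta
\[
p(\theta \mid \alpha_1, \dots, \alpha_K)
= \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
  \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0,\ \sum_{i=1}^{K} \theta_i = 1
\]
```

Setting K = 2 with θ_1 = x and θ_2 = 1 - x recovers the beta density with a = α_1 and b = α_2, which is the sense in which the Dirichlet generalizes the beta distribution.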

Slide 13: LDA graph
Corpus level:
- α: Dirichlet prior hyper-parameter on the mixing proportions
- β: Dirichlet prior hyper-parameter on the mixture component distributions
- M: number of documents
Document level:
- θ: the document's mixture proportions
- φ: the mixture component of documents
- N: number of words in a document
Word level:
- ι: hidden topic variable
- ω: document variable
[H Zhang et al, 2007]
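The three levels of the graph can be read as a recipe for generating a corpus. The sketch below follows the standard LDA generative process under symmetric priors; it is an illustration rather than the authors' code. The function names are hypothetical, a fixed document length stands in for N, and per-topic word distributions are stored in `phi`.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, n_docs, doc_len, alpha, beta, seed=0):
    """Generate n_docs documents of doc_len words each from an LDA model."""
    rng = random.Random(seed)
    # Corpus level: phi[k] is the word distribution of topic k, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Document level: theta is this document's topic mixture, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            # Word level: pick a hidden topic, then a word from that topic.
            topic = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=phi[topic])[0])
        corpus.append((theta, words))
    return corpus
```

Inference runs this recipe in reverse: given only the words, a sampler such as Gibbs sampling estimates the hidden `theta` and `phi`.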

Slide 14: LDA vs. Clustering
- Clustering simply partitions the corpus: each document belongs to one category.
- LDA-based analysis allows one document to be classified into several categories because of its hierarchical structure.

Slide 15: Optimizing the results (1)
- LDA does not know how many topics there are; this value is set by the user.
- However, we can evaluate multiple "wild guesses" and choose the best one.
Notation used in the distance between two words x and y:
- f(x) is the number of documents that contain the word x
- f(y) is the number of documents that contain the word y
- f(x,y) is the number of documents that contain both word x and word y
- M is the total number of documents
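The formula itself is missing from the transcript. The four quantities defined here are exactly the ingredients of the Normalized Google Distance, so the following is offered as a plausible reconstruction only, under the assumption that this is the measure the authors used:

```latex
\[
d(x, y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x, y)}
               {\log M - \min\{\log f(x),\, \log f(y)\}}
\]
```

Under this measure, two words that always appear in the same documents have distance 0, and the distance grows as they co-occur less often.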

Slide 16: Optimizing the results (2)
For each topic discovery run, find the minimum of the average distance of each topic.
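One way to read this step: run topic discovery for several candidate topic counts, score each run by the average pairwise distance between the top words of its topics, and keep the count with the lowest score. The sketch below follows that reading. It assumes the Normalized Google Distance as the word distance (the slides do not preserve the formula), and the function names and the `candidates` structure are hypothetical.

```python
import math
from itertools import combinations


def ngd(x, y, doc_sets):
    """Assumed word distance: Normalized Google Distance over document frequencies."""
    fx = sum(1 for d in doc_sets if x in d)
    fy = sum(1 for d in doc_sets if y in d)
    fxy = sum(1 for d in doc_sets if x in d and y in d)
    if fxy == 0:
        return float("inf")  # the two words never co-occur
    lx, ly = math.log(fx), math.log(fy)
    denom = math.log(len(doc_sets)) - min(lx, ly)
    if denom == 0:
        return 0.0  # both words occur in every document
    return (max(lx, ly) - math.log(fxy)) / denom


def average_topic_distance(topics, doc_sets):
    """Mean pairwise distance inside each topic, averaged over the topics."""
    per_topic = []
    for words in topics:  # each topic needs at least two top words
        pairs = list(combinations(words, 2))
        per_topic.append(sum(ngd(x, y, doc_sets) for x, y in pairs) / len(pairs))
    return sum(per_topic) / len(per_topic)


def best_topic_count(candidates, docs):
    """candidates maps a topic count K to the top-word lists of its K topics."""
    doc_sets = [set(d) for d in docs]
    return min(candidates, key=lambda k: average_topic_distance(candidates[k], doc_sets))
```

A run whose topics group words that rarely share a document scores poorly, so the selected topic count is the one whose topics are most coherent under the chosen distance.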

Slide 17: Optimizing the results (3)
Results: four topics have the minimum average distance between words in each topic.

Slide 18: A topic list of discovered topics from
Discovering New Topics after Optimization

Slide 19: Conclusion
Web-harvest integrated with LDA is able to:
- discover the latent topics in dark web sites
- provide a more flexible and automated tool to counter terrorism
- support a measurable way to optimize the results of LDA
- provide a generic tool to analyze a variety of websites, such as financial and medical sites

Slide 20: References
- Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
- Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks. In Proceedings of IEEE Intelligence and Security Informatics, 2007.
- Yang, C. C., Shi, X., and Wei, C.-P. Tracing the Event Evolution of Terror Attacks from On-Line News. In Proceedings of IEEE Intelligence and Security Informatics.
- Xu, J., Chen, H., Zhou, Y., and Qin, J. On the Topology of the Dark Web of Terrorist Groups. In Proceedings of IEEE Intelligence and Security Informatics, 2006.