
Slide 1: LDA-based Dark Web Analysis (2009 IEEE Symposium on Computational Intelligence in Cyber Security)

Slide 2: Outline
- What is Dark Web? Why do we need to analyze it?
- How to analyze Dark Web: our strategy
  - Web crawling
  - Topic discovery based on Latent Dirichlet Allocation (LDA)
  - Optimization process
- Conclusion

Slide 3: What is Dark Web?
The Web is a global information platform accessible from many locations. It is a fast way to spread information anonymously or with few regulations, and its cost is relatively low compared with other media. The Dark Web is where terrorist/extremist organizations and their sympathizers:
- exchange ideology
- spread propaganda
- recruit members
- plan attacks
An example of a dark web site: www.natall.com

Slide 4: Why do we need to analyze it?
To find the hidden topics in the Dark Web community, which are:
- embedded in other large-scale online web sites
- information-overloaded
- multi-lingual

Slide 5: How to analyze Dark Web: architecture of our strategy
- GS: Gibbs Sampling, a random walk over the sample space used to approximate the maximum of the estimated posterior
- LDA: Latent Dirichlet Allocation

Slide 6: How to analyze Dark Web: architecture of our strategy
- Use a web crawler to download text-based documents
  - Prune by removing all HTML tags and irrelevant content such as images and navigation instructions
  - Format into a plain text file F:
    F := header {doc}
    header := a line containing the number of documents
    doc := {term}
- Feed the text file to the GibbsLDA analyzer to discover the latent topics
- Optimize topic discovery
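The pruning and formatting step on slide 6 can be sketched as follows. This is a minimal illustration, not the authors' crawler: the function names are invented here, and the exact file layout (a document count on the first line, then one whitespace-separated document per line) is assumed from the slide's grammar F := header {doc}.

```python
import re

def prune(html: str) -> str:
    # Remove script/style blocks first, then every remaining HTML tag,
    # leaving only the textual content of the page.
    text = re.sub(r"(?s)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    # Collapse whitespace so each document fits on one line.
    return " ".join(text.split())

def write_corpus(docs, path):
    # Assumed layout: first line = number of documents (the "header"),
    # then one pruned document per line (each a sequence of terms).
    with open(path, "w") as f:
        f.write(f"{len(docs)}\n")
        for d in docs:
            f.write(prune(d) + "\n")
```

A crawler would call write_corpus once per site, after which the file can be handed to the LDA analyzer.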

Slide 7: Criteria to select web crawlers
- Able to parse ill-coded web pages
- Supports parameterized URLs
- Flexible enough to handle different web site structures
- The downloaded pages will be read by machines rather than humans, so some normalization must be applied to ensure the text corpus is well formatted and readable
- Easy to maintain, with minimal hardware requirements
- Does not need to be especially fast
- Introduces no intellectual property problems

Slide 8: Web-harvest vs. others

Slide 9: Web-harvest pipeline

Slide 10: Topic discovery based on LDA
LDA is an Information Retrieval (IR) technique. IR:
- reduces information overload
- preserves the essential statistical relationships
Basic and traditional IR methods:
- tf-idf scheme: term-count pairs => term-by-document matrix
- LSI (Latent Semantic Indexing)
- pLSI (probabilistic LSI)
- Clustering: divides the data set into subsets
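The tf-idf scheme mentioned above weights each term by its in-document frequency times the log-inverse of its document frequency. A minimal sketch (using a common idf variant, log(N/df); the slides do not specify which variant the authors used):

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: list of token lists; returns one {term: weight} dict per document.
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    n = len(docs)
    out = []
    for d in docs:
        tf = Counter(d)                  # raw term count within the document
        out.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return out
```

Stacking these dicts column-wise gives the term-by-document matrix that LSI and pLSI then factorize.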

Slide 11: Dirichlet Distribution
A generalization of the beta distribution to K dimensions, with density proportional to the product of θ_i^(α_i - 1) over the probability simplex.

Slide 12: Beta Distribution
A continuous probability distribution with probability density function (pdf) defined on the interval [0, 1], proportional to x^(a-1) (1-x)^(b-1).
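The relationship between slides 11 and 12 can be checked numerically: a two-dimensional Dirichlet(a, b) sample (x, 1-x) has the same first coordinate distribution as a Beta(a, b) sample. A small NumPy sketch (the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 2.0, 5.0

# First coordinate of Dirichlet(a, b) draws vs. direct Beta(a, b) draws;
# both should concentrate around the mean a / (a + b).
dir_samples = rng.dirichlet([a, b], size=100_000)[:, 0]
beta_samples = rng.beta(a, b, size=100_000)
```

Both sample means come out near a / (a + b) ≈ 0.286, illustrating why the Dirichlet is called the multivariate generalization of the beta.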

Slide 13: LDA graph
Corpus level:
- α: Dirichlet prior hyperparameter on the topic mixing proportions
- β: Dirichlet prior hyperparameter on the mixture component (topic-word) distributions
- M: number of documents
Document level:
- θ: the document's topic mixture proportion
- φ: the mixture components of the document
- N: number of words in a document
Word level:
- z: hidden topic variable
- w: observed word variable
[Zhang et al., 2007]
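The model in this graph is typically inverted with the collapsed Gibbs sampling mentioned on slide 5: each word's topic assignment z is resampled from its full conditional given all other assignments. Below is a minimal self-contained sketch, not the GibbsLDA tool the slides use; nd accumulates the θ-side (document-topic) counts and nw the φ-side (topic-word) counts.

```python
import numpy as np

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).
    docs: list of lists of integer word ids; K: number of topics.
    Returns (doc_topic, topic_word) count matrices."""
    rng = np.random.default_rng(seed)
    V = max(max(d) for d in docs) + 1          # vocabulary size
    nd = np.zeros((len(docs), K))              # document-topic counts
    nw = np.zeros((K, V))                      # topic-word counts
    nwsum = np.zeros(K)                        # words per topic
    z = [rng.integers(K, size=len(d)) for d in docs]
    for m, d in enumerate(docs):               # random initialization
        for i, w in enumerate(d):
            k = z[m][i]
            nd[m, k] += 1; nw[k, w] += 1; nwsum[k] += 1
    for _ in range(iters):
        for m, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[m][i]                    # remove current assignment
                nd[m, k] -= 1; nw[k, w] -= 1; nwsum[k] -= 1
                # Full conditional p(z = k | everything else)
                p = (nd[m] + alpha) * (nw[:, w] + beta) / (nwsum + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[m][i] = k                    # record new assignment
                nd[m, k] += 1; nw[k, w] += 1; nwsum[k] += 1
    return nd, nw
```

Normalizing the rows of nd (with the α smoothing added) estimates θ for each document; normalizing the rows of nw estimates the topic-word distributions φ.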

Slide 14: LDA vs. Clustering
- Clustering simply partitions the corpus; each document belongs to one category.
- LDA-based analysis allows one document to be classified into several categories because of its hierarchical structure.

Slide 15: Optimizing the results (1)
LDA does not know how many topics there should be; this value is set by the user. However, we can evaluate multiple "wild guesses" and choose the best one, using a word-distance measure defined in terms of:
- f(x): the number of documents that contain the word x
- f(y): the number of documents that contain the word y
- f(x, y): the number of documents that contain both word x and word y
- M: the total number of documents

Slide 16: Optimizing the results (2)
For each run of topic discovery, compute the average distance between words within each topic, and select the number of topics that minimizes it.
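The distance formula itself appears only in a figure that is not reproduced in this transcript. One plausible instantiation built from exactly the quantities the slides define, f(x), f(y), f(x, y), and M, is the Normalized Google Distance; treat that choice as an assumption, not the authors' stated formula. The sketch below scores a topic by the average pairwise distance of its top words, so different topic counts can be compared as slide 16 describes.

```python
import math

def dist(x, y, docs):
    # Co-occurrence distance over the corpus (NGD form, assumed here).
    M = len(docs)
    fx = sum(x in d for d in docs)               # f(x)
    fy = sum(y in d for d in docs)               # f(y)
    fxy = sum(x in d and y in d for d in docs)   # f(x, y)
    if fxy == 0:
        return float("inf")                      # never co-occur: maximally far
    return ((max(math.log(fx), math.log(fy)) - math.log(fxy))
            / (math.log(M) - min(math.log(fx), math.log(fy))))

def avg_topic_distance(topic_words, docs):
    # Mean pairwise distance between a topic's top words; lower = tighter topic.
    pairs = [(a, b) for i, a in enumerate(topic_words)
             for b in topic_words[i + 1:]]
    return sum(dist(a, b, docs) for a, b in pairs) / len(pairs)
```

Running LDA for several candidate topic counts and keeping the run whose topics have the smallest average distance reproduces the selection rule of slides 16 and 17.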

Slide 17: Optimizing the results (3)
Results: four topics yields the minimum average distance between words within each topic.

Slide 18: Discovering New Topics after Optimization
A list of the topics discovered from www.natall.com.

Slide 19: Conclusion
Web-harvest integrated with LDA:
- is able to discover the hidden latent topics in dark web sites
- provides a more flexible and automated tool for countering terrorism
- supports a measurable way to optimize the results of LDA
- provides a generic tool for analyzing a variety of websites, such as financial and medical sites

Slide 20: References
- Blei, D. M., Ng, A. Y., and Jordan, M. I. "Latent Dirichlet Allocation." Journal of Machine Learning Research, 3:993-1022, Mar. 2003.
- Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., and Yen, J. "An LDA-based Community Structure Discovery Approach for Large-Scale Social Networks." In Proceedings of IEEE Intelligence and Security Informatics, 2007.
- Yang, C. C., Shi, X., and Wei, C.-P. "Tracing the Event Evolution of Terror Attacks from On-Line News." In Proceedings of IEEE Intelligence and Security Informatics, 2006.
- Xu, J., Chen, H., Zhou, Y., and Qin, J. "On the Topology of the Dark Web of Terrorist Groups." In Proceedings of IEEE Intelligence and Security Informatics, 2006.

