Detecting Terrorist Activities via Text Analytics: Eric Atwell, School of Computing, Faculty of Engineering


Detecting Terrorist Activities via Text Analytics. Eric Atwell, Language Research Group, School of Computing, Faculty of Engineering. I-AIBS: Institute for Artificial Intelligence and Biological Systems.

Overview: DTAct EPSRC initiative; recent research on terrorism informatics; ideas for future research.

Background: EPSRC DTAct. EPSRC: Engineering and Physical Sciences Research Council. DTAct: Detecting Terrorist Activities, a joint Ideas Factory Sandpit initiative supported by EPSRC, ESRC, the Centre for the Protection of National Infrastructure (CPNI), and the Home Office to develop innovative approaches to detecting terrorist activities; 3 projects to run.

DTAct aims: … Effective detection of potential threats before an attack can help to ensure the safety of the public with a minimum of disruption. It should come as far in advance of attack as possible … Detection may mean physiological, behavioural or spectral detection across a range of distance scales; remote detection; or detection of an electronic presence. DTAct may even develop or use an even broader interpretation of the concept. Distance may be physical, temporal, virtual or again an interpretation which takes a wider view of what it means for someone posing a threat to be separated from his or her target. … Effective detection of terrorist activities is likely to require a variety of sensing approaches integrated into a system. Sensing approaches might encompass any of a broad range of technologies and approaches. In addition to sensing technologies addressing chemical and physical signatures these might include animal olfaction; mining for anomalous electronic activity; or the application of behavioural science knowledge in detection of characterised behavioural attributes. Likewise, the integration element of this problem is very broad, and might encompass, but is not limited to: hardware; algorithms; video analytics; a broad range of human factors, psychology and physiology considerations (including understanding where humans and technology, respectively, are most usefully deployed); or operational research, analysis and modelling to understand the problem and explore optimum configurations (including choice and location of sensing components) …

How to use text analytics for DTAct? Terrorists may use phone/txt, websites, blogs … to recruit members, issue threats, communicate, plan … Also: surveillance and informant reports, police records, … So why not use NLP to detect anomalies in these sources? Maybe like other research at Leeds: Arabic text analytics; detecting hidden meanings in text; social and cultural text mining; detecting non-standard language variation; detecting hidden errors in text; plagiarism detection.

Recent research on DTAct: Engineering devices to detect at airport or on plane: too late? Terrorism Studies, e.g. MA at Leeds University (!): political and social background, but NOT detection of plots. Research papers with relevant-sounding titles, but very generic/abstract, with not much real NLP text analysis. Some examples:

Carnegie Mellon University: Fienberg S. Homeland insecurity: Datamining, Terrorism Detection, and Confidentiality. MATRIX: Multistate Anti-Terrorism Information Exchange, a system to store, analyze and exchange info in databases, but it doesn't say how to acquire DB info in the first place. TIA: Terrorism Information Awareness program, stopped 2003. PPDM: Privacy Preserving Data Mining; the big issue is privacy of data once captured, rather than how to acquire data.

University of Arizona: Qin J, Zhou Y, Reid E, Lai G, Chen H. Unraveling international terrorist groups' exploitation of the web. … we explore an integrated approach for identifying and collecting terrorist/extremist Web contents … the Dark Web Attribute System (DWAS) to enable quantitative Dark Web content analysis. Identified and collected 222,000 web-pages from 86 Middle East terrorist/extremist Web sites, and compared them with 277,000 web-pages from US Government websites. BUT only looked at HCI issues: technical sophistication, media richness, Web interactivity. NOT looking for terrorists or plots, NOT language analysis.

Uni of Negev, Uni South Florida: Last M, Markov A, Kandel A. Multi-lingual detection of terrorist content on the Web. Aim: to classify documents as terrorist v non-terrorist. Build a C4.5 Decision Tree using word subgraphs as decision-point features. Tested on a corpus of 648 Arabic web-pages, C4.5 builds a decision tree based on keywords in the document: if Zionist or Martyr or call of Al-Quds or Enemy, then terror; else non-terror. NOT looking for plots, NOT deep NLP (just keywords).

Springer, Information Systems: Chen H, Reid E, Sinai J, Silke A, Ganor B (eds). TERRORISM INFORMATICS: Knowledge Management and Data Mining for Homeland Security. Methodological issues in terrorism research (ch 1-10); terrorism informatics to support prevention, detection, and response (ch 11-24). Silke: U East London, UK, BUT sociology, not IS. 57 co-authors of chapters! Only 2 in UK: Horgan (psychology), Raphael (politics). Several impressive-sounding acronyms …

Terrorism Informatics: text analytics. U Arizona Dark Web analysis (not detecting plots); analysis of affect intensities in extremist group forums; extracting entity and relationship instances of terrorist events; data distortion methods and metrics: Terrorist Analysis System; content-based detection of terrorists browsing the web using the Advanced Terror Detection System (ATDS); text mining biomedical literature for bio-terrorism weapons; semantic analysis to detect anomalous content; threat analysis through cost-sensitive document classification; web mining and social network analysis in blogs.

Sheffield University: Abouzakhar N, Allison B, Guthrie L. Unsupervised Learning-based Anomalous Arabic Text Detection. Corpus of 100 samples (… words) from Aljazeera news. Randomly insert a sample of religious/social/novel text. Can detect the anomalous sample by average word length, average sentence length, frequent words, positive words, negative words, …
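
To make the surface features named on this slide concrete, here is a minimal sketch that computes average word length and average sentence length per sample. The function names and the "furthest from the mean" rule are assumptions for illustration only; the slide does not say how the Sheffield system combines its features.

```python
import re

def surface_features(text):
    """Return simple surface statistics for one text sample."""
    # Split on sentence-final punctuation (., !, ? and the Arabic question mark).
    sentences = [s for s in re.split(r"[.!?\u061F]+", text) if s.strip()]
    # \w+ matches runs of Unicode letters and digits, so it is not English-only.
    words = re.findall(r"\w+", text)
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sent_len": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
    }

def most_deviant(samples):
    """Index of the sample whose average word length is furthest from the mean.

    Assumed decision rule, not the published method.
    """
    values = [surface_features(s)["avg_word_len"] for s in samples]
    mean = sum(values) / len(values)
    return max(range(len(values)), key=lambda i: abs(values[i] - mean))
```

A real detector would presumably combine several such features rather than rely on one.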

Problems in Text Analytics for Detecting Terrorist Activities. Not just English: Arabic, Urdu, Persian, Malay, … Need a Gold Standard corpus of terror v non-terror texts. What linguistic features to use? Terrorists may use covert language: "the package".

Problems with other languages. Arabic: in the writing system, short vowels, which carry morphological features, can be left out, increasing ambiguity; complex morphology: root+affix(es)+clitic(s). Malay: the opposite problem: simple morphology, but a word can be used in almost any PoS grammatical function. Few resources (PoS-tagged corpora, lexical databases) for training PoS-taggers, Named Entity Recognition, etc.

Terror Corpus: We need to collect a corpus of suspicious e-text. Start with existing Dark Web and other collections. Human scouts look for suspicious websites, and a robot web-crawler uses seeds to find related web-pages. MI5, CPNI, Police etc. to advise and provide case data. Annotate: label terror v non-terror, plot, …

Linguistic Annotation: We don't know which features correlate with a terror plot. So: enrich with linguistic features (PoS, sentiment, …). Then we can use these in decision trees etc., based on deeper linguistic knowledge.

Covert language: If we have texts which are labelled "plot", look for words which are suspicious because they are NOT terror-words, e.g. a high log-likelihood of "package".

Text Analytics for Detecting Terrorist Activities: Making Sense. Claire Brierley and Eric Atwell, Leeds University. International Crime and Intelligence Analysis Conference, Manchester, 4 November 2011.

Making Sense: The Team. Funded by EPSRC/ESRC/CPNI. Multi-disciplinary: psychology; law; operations research; computational linguistics; visual analytics; machine learning and artificial intelligence; human computer interaction; computer science. Approximately 300 person months over 36 months (full economic cost: £2.6m).

What is Making Sense? An EPSRC consortium project in the field of Visual Analytics, with a remit to create an interactive, visualisation-based decision support assistant as an aid to intelligence analysts. Target user communities are law enforcement, military intelligence and the security services.
1. Involves automated approaches to gisting multimedia content
2. Integrating gists from different modalities: audio, visual, text
3. Identifying links/connections in fused data
4. Visualisation of results to support interactive query and search
Stages: data collection; fusion & inference; analysis of merged data; visualise results.

Nature of intelligence material. Task: to identify suspicious activity via multi-source, multi-modal data. Issues of quantity and quality: a DELUGE of multi-source, multi-modal data for target user groups to make sense of and act upon; a deluge of NOISY data. Nature of intelligence data and its critical features: It may be unreliable. The credibility of sources may be questionable. It's fragmented and partial. Text-based data may be non-standard (e.g. txt messages). It's from different modalities, and there's a lot of it! So it's easy to miss that needle in the haystack.

Text Extraction: methodologies available. There are various options for extracting actionable intelligence from text.
1. Google-type search and Information Retrieval (IR) to pull documents from the web in response to a query
2. Query formulation informed by domain expertise and human intelligence (HUMINT): another approach
3. Automatic Text Summarisation to generate summaries from regularities in well-structured texts
4. Information Extraction (IE), focussing on automatic extraction of entities (i.e. nouns, especially proper nouns), facts and events from text
5. Keyword Extraction (KWE), which uses statistical techniques to identify keywords denoting the "aboutness" of a text or genre

What is Leeds' approach? Making Sense proposal: ... the gist of a phone tap transcript might comprise: caller and recipient number; duration of call; statistically significant keywords and phrases; and potentially suspicious words and phrases ... Why use Keyword Extraction (KWE)? It can be implemented speedily over large quantities of ill-formed texts. It will uncover new and different material, such that we can undertake content analysis.

Newsreel word cloud 1980s BBC radio

Measuring deviation from the norm (diagram). Items compared: chosen text; texts by same author or different parts of same text; contemporary authors or similar genre; general reference corpus; chosen text or part of chosen text; chosen author or genre. Levels of DEVIATION: PRIMARY: norms of the language as a whole. SECONDARY: norms of contemporary or genre-specific composition. TERTIARY: internal norms of a text.

Verifying over-use apparent in relative frequencies via log likelihood statistic. Test set: 783 words. Reference set: 9672 words. Words compared across both sets: airport, security, aircraft, beirut, athens, hijackers, hijacking, baggage, screens, staff. Log likelihood scores: airport: 41.28; security: 33.36; aircraft: 16.80; athens: 12.83; beirut: 11.69; hijacking: 10.27; hijackers: 8.21; staff: 7.70; TWA: 7.70; screens: 7.70; baggage: 7.70; sometimes: 7.40; did: 6.70; an: 6.66.
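
The slide reports scores but not the formula. As a hedged illustration, the sketch below uses the two-corpus log likelihood formulation widely used in corpus linguistics; treating that as the statistic behind these numbers is an assumption. The corpus sizes (783 and 9672) come from the slide, while the word counts passed in the example are assumed.

```python
import math

def log_likelihood(freq_test, freq_ref, size_test, size_ref):
    """Log likelihood (G2) for one word observed in two corpora."""
    total = freq_test + freq_ref
    expected_test = size_test * total / (size_test + size_ref)
    expected_ref = size_ref * total / (size_test + size_ref)
    score = 0.0
    # A zero observed count contributes nothing to the sum.
    if freq_test > 0:
        score += freq_test * math.log(freq_test / expected_test)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

# Assumed counts: the word occurs 3 times in the test set and 3 times
# in the reference set. Only the corpus sizes are taken from the slide.
print(round(log_likelihood(3, 3, 783, 9672), 2))  # about 7.7
```

With these assumed counts the result is close to the 7.70 that the slide lists for several low-frequency words, which is at least consistent with this formulation.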

Habeas Corpus? The Text Analytics research paradigm: It uses a corpus of naturally-occurring language texts which capture empirical data on the phenomenon being studied. The phenomenon under scrutiny needs to be labelled in the corpus in order to derive training sets for machine learning. This labelled corpus constitutes a gold standard for iterative development and evaluation of algorithms. Therefore, our EPSRC proposal for Making Sense states that engagement with stakeholders and authentic datasets for simulation and evaluation are critical to the project. Problem: we do not have ANY data, never mind LABELLED data!

Survey Findings. Gaining access to relevant data is generally raised as an issue in academic publications for intelligence and security research. Relevant data is truth-marked data, essential to benchmarking. Research time and effort is thus spent on compiling synthetic data. So-called terror corpora have been compiled from documents in the public domain, often Western press. Design and content of synthetic datasets like VAST and the Enron dataset assume an IE approach to text extraction. Information Extraction is the dominant technique used in commercial intelligence analysis systems. Only one (British) company is using KWE, which they say is just as good a predictor [of suspiciousness] as IE.

Text Analytics: Style is countable. Text analytics is about pattern-seeking and counting things.
1. If we can characterise, for example, stylistic or genre-specific elements of a target domain via a set of linguistic features,
2. then we can measure deviation from linguistic norms via comparison with a (general) reference corpus.
3. Concept of KEYNESS: when whatever it is you're counting occurs in your corpus and not in the reference corpus, or significantly less in the reference corpus.
Leeds approach to genre classification and linking:
1. Derive keywords and phrases from a reliable terror corpus.
2. These lexical items can be said to characterise the genre, and they also constitute suspicious words and phrases.
3. Compare frequency distributions for designated suspicious items in new and unseen data relative to their counterparts in the terror corpus.
4. Similar distributional profiles for these items, validated by appropriate scoring metrics (e.g. log likelihood), will discover candidate suspect texts.

Applying Text Analytics Methodology 1. Leeds have been involved in collaborative prototyping of parts of our system with project partners Middlesex and Dundee for the VAST Challenges 2010 and 2011. VAST 2010: Keyword gists have been incorporated in the Dundee "Semantic Pathways" visualisation tool. VAST 2011 Mini Challenge 3: Text Extraction has been useful in gisting content from 4474 news reports of interest to intelligence analysts looking for clues to potential terrorist activity in the Vastopolis region. Each news report is a plaintext file containing a headline, the date of publication, and the content of the article. VAST 2011 Mini Challenge 1: A flu-like epidemic leading to several deaths has broken out in Vastopolis, which has about 2 million residents. Text Extraction has been useful in ascertaining the extent of the affected area and whether or not the outbreak is contained.

Mini Challenge 1: Tweet Dataset. We've said that KWE can be implemented speedily over large quantities of ill-formed texts. In this case, the ill-formed texts are tweets. Problem with text-based data: different datasets need cleaning in different ways, and tokenization is also problematic.
CSV format: ID, User ID, Date and Time, District, Message
11, 70840, 30/04/ :00, Westside, Be kind..If u step on ppl in this life u'll probably come bac as a cockroach in the next.#ummmhmm #karma
25, , 30/04/ :00, Lakeside, August 15th is 2weeks away :/! That's when Ty comes back! I miss him :(
44, , 30/04/ :01, Downtown, #NewTwitter #Rangers#TEAMfollowBACK #TFB #IReallyThink#becauseoftwitter #Mustfollow #MeMetiATerror #SHOUTOUT #justinbieber FOLLOW ME>
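
To illustrate the cleaning and tokenization problem, here is a minimal sketch of reading rows in the five-column layout above and tokenizing the message. The function names and the token pattern are assumptions, not the project's code, and the timestamp in the sample row is a placeholder because the transcript has lost the original value.

```python
import csv
import io
import re

# Hashtags are kept as single tokens; other words keep an internal apostrophe.
TOKEN_RE = re.compile(r"#\w+|[a-z0-9]+(?:'[a-z]+)?")

def tokenize(message):
    """Lowercase a tweet and split it into word and hashtag tokens."""
    return TOKEN_RE.findall(message.lower())

def read_tweets(csv_text):
    """Yield (district, tokens) from ID, user, time, district, message rows."""
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 5:
            continue  # skip malformed rows
        district = row[3].strip()
        # Unquoted commas inside a message create extra fields; rejoin them.
        message = ",".join(row[4:])
        yield district, tokenize(message)

# Message text from the slide; the date and time are placeholder values.
SAMPLE = (
    "11, 70840, 30/04/2011 00:00, Westside, Be kind..If u step on ppl in this "
    "life u'll probably come bac as a cockroach in the next.#ummmhmm #karma\n"
)
for district, tokens in read_tweets(SAMPLE):
    print(district, tokens[-3:])
```

The pattern also separates run-together hashtags such as "#Rangers#TEAMfollowBACK", one of the tokenization problems visible in the sample rows.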

Mini Challenge 1: Collocations. Used a subset of the dataset: the start date/time of the epidemic had already been established. Each tweet had been tagged with its city zone, so we created 13 tweet datasets, one for each zone. Built wordlists for each zone and converted each wordlist into a Text object. We were then able to call the object-oriented collocations() method on each Text object to emit key collocations (bigrams, or pairs of words) per zone. The collocations() method uses the log likelihood metric to determine whether a bigram occurs significantly more frequently than the counts for its component words would suggest.

Mini Challenge 1: Collocations
>>> smogtownTO.collocations()
Building collocations list
somewhere else; really annoying; getting really; stomach ache; bad diarrhea; vomitting everywhere; sick sucks; extremely painful; can't stand; terible chest; feeling better; short breath; chest pain; every minute; breath every; constant stream; bad case; flem coming; well soon; anyone needs
>>> riversideTO.collocations()
Building collocations list
declining health; best wishes; somewhere else; wishes going; can't stand; terible chest; atrocious cough; chest pain; constant stream; flem coming; get plenty; really annoying; getting really; doctor's office; short breath; every minute; office tomorrow; sore throat; laying down.; get well
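
For readers without the toolkit to hand, the idea behind log likelihood collocation scoring can be sketched in plain Python. This is an illustrative re-implementation over a 2x2 contingency table, not the library's own code; it omits any stopword or frequency filtering, and the function names are assumptions.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Score each adjacent word pair with the log likelihood ratio (G2)."""
    pairs = list(zip(tokens, tokens[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(w1 for w1, _ in pairs)
    second_counts = Counter(w2 for _, w2 in pairs)
    scores = {}
    for (w1, w2), k11 in pair_counts.items():
        # 2x2 table: rows = first word is / is not w1,
        # columns = second word is / is not w2.
        k12 = first_counts[w1] - k11
        k21 = second_counts[w2] - k11
        k22 = n - k11 - k12 - k21
        observed = [[k11, k12], [k21, k22]]
        rows = [k11 + k12, k21 + k22]
        cols = [k11 + k21, k12 + k22]
        g2 = 0.0
        for i in range(2):
            for j in range(2):
                if observed[i][j] > 0:
                    expected = rows[i] * cols[j] / n
                    g2 += observed[i][j] * math.log(observed[i][j] / expected)
        scores[(w1, w2)] = 2 * g2
    return scores

def top_collocations(tokens, k=5):
    """Return the k highest-scoring bigrams."""
    scores = bigram_scores(tokens)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A pair such as "chest pain" scores highly when its two words occur together far more often than their separate counts would predict.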

Mini Challenge 1: Keyword Gists. Also computed keywords (or statistically significant words) per city zone. This entails comparison of word distributions in 13 test sets (the tweets per zone) with distributions for the same words in a reference set: all tweets since the start of the outbreak. Build wordlists and frequency distributions for test and reference corpora. Apply a scoring metric (log likelihood) to determine significant overuse in a test set relative to the reference set. PLAINVILLE stomach: … diarrhea: …; DOWNTOWN stomach: …; UPTOWN stomach: …; SMOGTOWN stomach: 646, diarrhea: 540.
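
The per-zone comparison described here can be sketched as follows. This is an assumed implementation for illustration: the reference set is taken to be all zones pooled, as on the slide, while the function names, the overuse filter and the toy input used to exercise it are hypothetical.

```python
import math
from collections import Counter

def keyness(a, b, c, d):
    """Log likelihood for counts a, b in corpora of c and d tokens."""
    ll = 0.0
    for observed, size in ((a, c), (b, d)):
        expected = size * (a + b) / (c + d)
        if observed:
            ll += observed * math.log(observed / expected)
    return 2 * ll

def zone_keywords(zone_tokens, top_n=3):
    """Map each zone to its most overused words versus all zones pooled."""
    reference = Counter()
    for tokens in zone_tokens.values():
        reference.update(tokens)
    ref_size = sum(reference.values())
    result = {}
    for zone, tokens in zone_tokens.items():
        counts = Counter(tokens)
        size = len(tokens)
        scored = [
            (keyness(n, reference[w], size, ref_size), w)
            for w, n in counts.items()
            # Keep overuse only: relative frequency is higher in the zone.
            if n / size > reference[w] / ref_size
        ]
        result[zone] = [w for _, w in sorted(scored, reverse=True)[:top_n]]
    return result
```

Calling `zone_keywords` with a dictionary of zone name to token list returns a short keyword gist per zone.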

Text Extraction: Quran-as-Corpus. Research question: Can keywords derived from training data which exemplifies a target concept be used to classify unseen texts? Problems flagged up by survey: Non-availability of truth-marked evidential data is a problem in the intelligence and security domain. No machine learning can take place without exemplars and yardsticks for the concept or behaviour being studied. Solution:
1. Simulate the problem of finding a needle in a haystack on a real dataset: an English translation of the Quran
2. Can annotate a truth-marked (labelled) subset of verses associated with the target concept via the Leeds Qurany ontology browser
3. The target concept is NOT suspiciousness but is analogous in scope

Analogous in scope: skewed distribution.
1. The subset represents roughly 2% of the corpus
2. Judgment Day verses are scattered throughout the Quran
Important finding: the fact that the subset constitutes only 2% of the corpus has implications for evaluation. As many as 234 attribute-value sets (including class attribute). Prior probability for majority class: 0.98. Prior probability for minority class: 0.02.
Test set: 113 Judgment Day verses, 3680 words. Reference set: 6236 verses, … words.

Methodology: keyword extraction. Build wordlists and frequency distributions for test and reference corpora. Compute statistically significant words in the test set relative to the reference set. Table (values not preserved in the transcript): columns are Word; Quran subset: subset frequency; All Quran: frequency in reference set; Log likelihood statistic. Words listed: will, together, gather, day, return.

Training instances: attribute-value pairs, CSV format
location, all,gather,burdens,bearer,show,creation,back,one,brought,single,together,another,soul,trumpet,sepulchres,said,end,raise,laden,judgment,people,whereon,day,excuses,call,exempt,marshalled,hidden,tell,be,good,return,truth,do,shall,gathered,toiling,ye,bear,you,observe,besides,graves,beings,with,response,originates,revile,sounded,this,goal,resurrection,originate,up,us,later,will,knower,repeats,or, countKWs,countKeyBigrams,concept
Majority class:
6.149, 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0, 4,0,no
Minority class:
6.164, 1,0,2,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,1,2,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0, 16,5,yes

Skewed Data Problem. Results table (values not preserved in the transcript): columns are Classifier; Feature Set; Success Rate %; Recall (minority class); Confusion Matrix (TP, FN, TN, FP); rows are OneR, J, NB. Baseline performance doesn't leave much room for improvement. Classification accuracy is not the only metric, and it may not be the best one here because it assumes equal classification error costs. Better recall for the minority class is attained at the expense of classification accuracy, BUT we assume that capturing true positives is the most important thing, even though this has a knock-on effect on the false positive rate.
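
To see why accuracy alone is misleading on this data, consider a hypothetical classifier that always predicts the majority class. The verse counts (113 of 6236) are from the earlier slide; the classifier itself is an illustrative assumption, not one of the reported systems.

```python
def accuracy(tp, fn, tn, fp):
    """Fraction of all instances classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)

def recall(tp, fn):
    """Fraction of positive (minority class) instances that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Always answering "no": all 113 target verses are missed and
# all 6123 remaining verses are classified correctly.
tp, fn, tn, fp = 0, 113, 6236 - 113, 0
print(round(accuracy(tp, fn, tn, fp), 3))  # 0.982
print(recall(tp, fn))                      # 0.0
```

The high accuracy reflects only the class skew; the classifier finds none of the verses of interest.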

Extra Metrics: BCR and BER. Results table (values not preserved in the transcript): columns are Classifier; Feature Set; Success Rate %; Recall (minority class); Confusion Matrix (TP, FN, TN, FP); BCR; BER; rows are OneR, J, NB.
BCR = 0.5 * ((TP / total positive instances) + (TN / total negative instances))
BER = 1 - BCR
BCR is computed as the average of the true positive rate and the true negative rate, and thus considers relative class distributions: HIGHER IS BETTER. Question: How do our stakeholders view the trade-off between true positives and false alarms in the classification of suspicious data?
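
The two formulas translate directly into code. In this minimal sketch the function name is an assumption, the first call reuses the hypothetical always-majority classifier on the 113 / 6123 class split, and the second call uses invented counts.

```python
def balanced_rates(tp, fn, tn, fp):
    """Return (BCR, BER) from confusion-matrix counts."""
    positives = tp + fn  # total positive instances
    negatives = tn + fp  # total negative instances
    bcr = 0.5 * (tp / positives + tn / negatives)
    return bcr, 1 - bcr

# Always predicting the majority class: high accuracy, but BCR is only 0.5.
print(balanced_rates(0, 113, 6123, 0))    # (0.5, 0.5)
# Invented counts for a classifier that trades false alarms for recall.
print(balanced_rates(80, 33, 5500, 623))
```

Unlike plain accuracy, BCR rewards the second classifier for recovering minority-class instances even though it raises more false alarms.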

Applying Text Analytics Methodology 2. Leeds have used KWE Text Analytics methodology to: identify verses associated with a given concept in the Quran; ascertain the extent of spread of a flu-like epidemic from a (synthetic) corpus of tweets; gist the contents of (synthetic) news reports for intelligence analysts looking for clues to potential terrorist activity. We are planning to use it in Health Informatics, with real datasets: to classify cause of death in Verbal Autopsy reports; to derive linguistic correlates from free text data such as clinicians' notes for automatic prediction of the likely outcome of a given cancer patient pathway at a critical stage; to assist in recommending the optimal course of action for a patient: transfer to palliative care or further treatment. This entails careful scaling up via iterative development of clinical profiling algorithms.

Collaboration: We are keen to collaborate on other projects! Is a corpus of text messages etc. generated during the recent UK riots a potentially interesting dataset? KWE algorithms need fine-tuning so that they run in real time. We need labelled examples in the dataset of the phenomenon/behaviour of interest in order to develop and evaluate machine learning algorithms.

Summary: DTAct EPSRC initiative; recent research on terrorism informatics; ideas for future research. IF YOU HAVE ANY MORE IDEAS, PLEASE TELL ME!