Project Final Presentation – Dec. 6, 2012 CS 5604: Information Storage and Retrieval Instructor: Prof. Edward Fox GTA: Tarek Kanan ProjArabic Team: Ahmed Elbery

Presentation transcript:

Project Final Presentation – Dec. 6, 2012
CS 5604: Information Storage and Retrieval
Instructor: Prof. Edward Fox
GTA: Tarek Kanan
ProjArabic Team: Ahmed Elbery

Outline
Arabic documents classification: Motivation
Arabic documents classification: Challenges
Model
Model Details
Results and Evaluation

Arabic documents classification: Motivation
Rich set of Arabic documents
Now more than 65M Arabic-language Internet users
Arabic NLP is needed for the growing volume of Arabic Internet content

Arabic documents classification: Challenges
Techniques built for English language processing may not apply to Arabic because:
Arabic has a very rich and complex morphology
Arabic syntax and grammar are very different from English and difficult to process

Project model
Arabic Documents -> Preprocessing -> Classification -> Result
Preprocessing: Tokenizer -> Tokens -> Stemmers -> Stems -> Feature Extractor -> Top Terms
Classification: Naive Bayes, k-Nearest Neighbors, Decision tree, Support-vector machines

Data Set
100 docs about the Arab Spring, in two classes: Politics (50 docs) and Violence (50 docs)

Preprocessing
Tokenizer -> Tokens -> Stemmers -> Stems -> Feature Extractor -> Top Terms

Example
Doc P-1: Systems Politics nation area Liberty International Politics Government
Doc P-2: Politics Systems nation area Kill nation Politics Government
Doc V-1: Violence Systems Weapon Weapon Militias Violence Kill Government Burn
Doc V-2: Burn Systems Weapon Militias Violence Kill Kill Government
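As a rough illustration of how "Top Terms" could be picked from the example documents above, the sketch below counts term frequencies over the four toy documents and keeps the most frequent ones. The slides do not say how the feature extractor ranks terms, so ranking by raw frequency, the whitespace tokenizer, and the names `DOCS`, `tokenize`, and `top_terms` are all assumptions made for this example.

```python
from collections import Counter

# Toy corpus copied from the example slide: (class label, text).
# The real project works on Arabic text; these are the English glosses from the slide.
DOCS = [
    ("P", "Systems Politics nation area Liberty International Politics Government"),
    ("P", "Politics Systems nation area Kill nation Politics Government"),
    ("V", "Violence Systems Weapon Weapon Militias Violence Kill Government Burn"),
    ("V", "Burn Systems Weapon Militias Violence Kill Kill Government"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for the real tokenizer and stemmer)."""
    return text.lower().split()


def top_terms(docs, k):
    """Return the k most frequent terms across all documents (assumed ranking rule)."""
    counts = Counter()
    for _label, text in docs:
        counts.update(tokenize(text))
    # Sort by descending count, then alphabetically, so ties are resolved deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _count in ranked[:k]]


if __name__ == "__main__":
    print(top_terms(DOCS, 4))
```

Terms such as "systems" and "government" rank high here even though they occur in every document, which is one reason a weighting such as tf-idf is usually applied afterwards.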

Example (cont.)

Preprocessing
The output matrix: one row per document (Doc1, Doc2, Doc3, ...), one column per term (term1, term2, term3, ...) holding tf-idf values, plus a Class column
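The document-term matrix described above can be sketched in a few lines. The slides do not state which tf-idf variant was used, so this example assumes the plain form tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The function name `tfidf_matrix` and the toy documents (taken from the example slide) are illustrative only.

```python
import math
from collections import Counter

# Toy documents from the example slide, already reduced to their terms.
DOCS = [
    "Systems Politics nation area Liberty International Politics Government",
    "Politics Systems nation area Kill nation Politics Government",
    "Violence Systems Weapon Weapon Militias Violence Kill Government Burn",
    "Burn Systems Weapon Militias Violence Kill Kill Government",
]


def tfidf_matrix(docs):
    """Build a document-term matrix of tf-idf weights.

    Assumed weighting: raw term count times log(N / document frequency).
    Returns the sorted vocabulary and one row of weights per document.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([tf[term] * math.log(n_docs / df[term]) for term in vocab])
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = tfidf_matrix(DOCS)
    for i, row in enumerate(rows, start=1):
        weights = {t: round(w, 2) for t, w in zip(vocab, row) if w > 0}
        print(f"Doc{i}: {weights}")
```

Under this weighting, terms that occur in every document (here "systems" and "government") get weight 0, while class-specific terms such as "politics" or "weapon" keep a positive weight.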

Classifier
Training Set -> Classification Algorithm -> Classifier (Model)
Test Set -> Classifier (Model) -> predicted Class for each Doc (e.g. Doc 1: P, Doc 2: V, Doc 3: V, ...)
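The train-then-classify flow above can be illustrated with one of the listed algorithms. The sketch below is a minimal multinomial Naive Bayes with add-one smoothing, written from scratch on term counts; it is not the project's implementation, and the names `train_nb` and `predict_nb`, the choice of smoothing, and the split of the toy documents into training and test sets are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

# Toy split of the example-slide documents: two for training, two for testing.
TRAIN = [
    ("P", "Systems Politics nation area Liberty International Politics Government"),
    ("V", "Violence Systems Weapon Weapon Militias Violence Kill Government Burn"),
]
TEST = [
    "Politics Systems nation area Kill nation Politics Government",
    "Burn Systems Weapon Militias Violence Kill Kill Government",
]


def train_nb(labeled_docs):
    """Collect per-class document counts and term counts (the 'model')."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    for label, text in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(text.lower().split())
    vocab = {term for counts in term_counts.values() for term in counts}
    return class_docs, term_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    # Terms never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in vocab]
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for t in tokens:
            score += math.log((term_counts[label][t] + 1) / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train_nb(TRAIN)
    for i, doc in enumerate(TEST, start=1):
        print(f"Doc {i}: {predict_nb(model, doc)}")
```

The same two-function interface (train on labeled documents, predict a label for an unseen one) would also fit k-nearest neighbors, decision trees, or support-vector machines.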

Results and Evaluation: Accuracy
Data: 100 docs (50+50)
Protocol: repeated 10 times, each time with 80% of the docs for training and 20% for testing
[Chart: accuracy]
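The evaluation protocol above (10 repetitions of an 80% training / 20% test split) can be sketched as a small harness. The slides do not say whether the splits were random or stratified, so this example assumes plain random shuffling. The data and the keyword-overlap classifier are synthetic placeholders used only to exercise the harness, and every name here (`repeated_holdout`, `train_overlap`, `predict_overlap`) is hypothetical; the numbers it prints are not the project's results.

```python
import random


def repeated_holdout(docs, train_fn, predict_fn, runs=10, train_ratio=0.8, seed=0):
    """Run `runs` random train/test splits and return the accuracy of each run."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(runs):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_ratio)
        train, test = shuffled[:cut], shuffled[cut:]
        model = train_fn(train)
        correct = sum(predict_fn(model, text) == label for label, text in test)
        accuracies.append(correct / len(test))
    return accuracies


def train_overlap(labeled_docs):
    """Placeholder model: the set of terms seen for each class."""
    terms = {}
    for label, text in labeled_docs:
        terms.setdefault(label, set()).update(text.lower().split())
    return terms


def predict_overlap(model, text):
    """Pick the class whose term set overlaps most with the document."""
    tokens = set(text.lower().split())
    return max(sorted(model), key=lambda label: len(tokens & model[label]))


# Synthetic stand-in for the 100-document data set (50 + 50); not the real corpus.
DOCS = [("P", "politics government")] * 50 + [("V", "violence weapon")] * 50

if __name__ == "__main__":
    accs = repeated_holdout(DOCS, train_overlap, predict_overlap)
    print("per-run accuracy:", accs)
    print("average accuracy:", sum(accs) / len(accs))
```

Fixing the seed makes the ten splits reproducible, which helps when comparing several classifiers on the same splits.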

Results and Evaluation
[Chart: accuracy and correlation coefficient]

Results and Evaluation
[Chart: average accuracy]

Results and Evaluation: Time
[Chart: average time]

Future work
Test the different parameters of the classifier:
Feature ratio
Feature selection parameters
Classifier parameters
Statistically analyze the results

Ahmed Elbery