Feature Extraction Based on Isolated Labels: Application for Automatic News Categorization
Presented by: Prof. Ali Jaoua

Agenda
- Introduction
- Financial Watch Project
- Related Work
  - Problems: reducing the dimensionality of the space of words; exploring pertinent features
  - χ²-statistic, Document Frequency Thresholding, Information Gain
- Proposed Approach
  - Basic aspects & preprocessing phase (stop words, segmentation, stemming)
  - Conceptual approach & mathematical background
- Conclusion & Future Work

Financial Watch Project
Objectives:
- Develop a financial news text mining platform (sources: web pages, blogs and forum postings, email messages, etc.)
- Collect, analyze and extract relevant concepts and events from Arabic and English financial news
- Trigger alerts to potential users of the system, assisting them in taking decisions
- Identify and alert decision-makers
- Identify the best tools to be integrated into the FinancialWatch platform

Why should news be processed?
Applications:
- Health care delivery IE systems
- Intelligence gathering IE systems
- IE for disaster response
- Financial Watch System

Financial Watch Project

Automatic Text Categorization
Definition and history:
- Classify a set of documents into one or more predefined categories or classes
- Definition of a set of alerting systems
Approach: supervised text categorization
- Definition of a financial news ontology
- Semantics characterizing the concept labels
- Use of a Support Vector Machine (SVM) for evaluation

Proposed Approach
- Semantic filtering
  - Exploring pertinent features
  - Segmentation
  - Stop word removal
  - Synonym and stemming process
    - Synonyms in English text: load WordNet 3 into a database to process synonyms
    - Stemming process: Arabic stemmer, English stemmer
- Exploring pertinent features
- Conceptual approach & mathematical background
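A minimal sketch of the English preprocessing steps, assuming NLTK's stop word list, Porter stemmer, and WordNet interface as stand-ins (the slides do not specify which English stemmer is used or how WordNet is loaded into the database):

```python
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer

# One-time corpus downloads (quiet if already present).
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Segment an English sentence, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

def synonyms(word):
    """Collect WordNet synonyms of a word, e.g. to merge equivalent features."""
    return {lemma.name() for syn in wordnet.synsets(word) for lemma in syn.lemmas()}

print(preprocess("Net profits of the bank rose by 10% during the second quarter"))
print(synonyms("profit"))
```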

Algorithm of the Text Categorization System (Training Phase)
Pipeline: Training Set → Elimination of stop words → Stemming → Global Feature List → TF.IDF Weighting → Document Representation → Training Vectors → SVM Classifier (Classifier 1 … Classifier k(k-1)/2)
Step 2: Obtain training vectors
- Perform stop word elimination and stemming
- Apply TF.IDF weighting
- Represent each document as a vector
Step 3: Apply the SVM classifier
- Apply an SVM classifier for each pair of classes of training vectors, obtaining k(k-1)/2 classifiers
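A hedged sketch of this training pipeline with scikit-learn; the toy documents, category names and kernel choice are placeholders, and decision_function_shape="ovo" makes the one-vs-one scheme with k(k-1)/2 pairwise classifiers explicit:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training documents, assumed already preprocessed (stop words removed, stemmed).
train_docs = [
    "profit bank rise quarter",         # Performance
    "net profit increase year",         # Performance
    "board appoint new chief officer",  # Management Change
    "company acquire stake firm",       # Transaction
]
train_labels = ["Performance", "Performance", "Management Change", "Transaction"]

# TF.IDF document representation followed by pairwise (one-vs-one) SVM classification.
model = make_pipeline(
    TfidfVectorizer(),
    SVC(kernel="linear", decision_function_shape="ovo"),
)
model.fit(train_docs, train_labels)

print(model.predict(["bank profit grow quarter"]))  # expected: ['Performance']
```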

Formal Context Coverage Based on Isolated Labels: Algorithm
1. Map the textual document into a binary relation R.
2. Compute the fringe relation Rd ∩ R∘R⁻¹∘R associated with the binary context R; for each pair (x, y) in Rd, only distinct generated concepts are added to the coverage C.
3. If R is still not covered by C, extend the coverage with additional composed properties (labels).
4. Repeat steps 2 and 3 until a complete coverage of the binary relation R is obtained.

FCA Background: Formal Information Representation (1/2)
- A binary relation R between two finite sets D and T is a subset of the Cartesian product D×T.
- A rectangle of R is a Cartesian product of two sets (A, B) such that A⊆D, B⊆T and A×B⊆R.
- If A×B ⊆ A′×B′ ⊆ R implies A = A′ and B = B′, the rectangle is maximal: it is a concept.
- In the lattice, concepts c1 and c2 are connected iff c1 ≤ c2 and there is no concept c3 such that c1 ≤ c3 ≤ c2.
Example formal context I = (A, B, R):
R  | B1 B2 B3 B4
A1 |  1  1
A2 |  1  1
A3 |     1  1
A4 |     1  1  1
A5 |        1  1
Concepts of context I (intent → extent), forming the lattice of context I:
{} → {A1, A2, A3, A4, A5}
{B2} → {A1, A2, A3, A4}
{B3} → {A3, A4, A5}
{B1, B2} → {A1, A2}
{B2, B3} → {A3, A4}
{B3, B4} → {A4, A5}
{B2, B3, B4} → {A4}
{B1, B2, B3, B4} → {}

FCA Background: Formal Information Representation (2/2)
Same example context I = (A, B, R) and lattice as the previous slide.
- Minimal coverage: {C4, C5, C6}
- Pseudo-concept (A4, B3): the union of all concepts containing (A4, B3): PS = I(B3.R⁻¹) ∘ R ∘ I(A4.R)
- The set of all concepts of a formal context, together with their partial ordering, can be represented graphically as a concept lattice: its nodes represent the concepts, and the nodes for concepts c1 and c2 are connected if and only if c1 ≤ c2 and there is no other concept c3 such that c1 ≤ c3 ≤ c2.
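To make the closure behind these concepts concrete, here is a small Python sketch that enumerates the concepts (maximal rectangles) of the example context I from the background slides; intents are found as closed attribute sets, and the output reproduces the eight intent/extent pairs listed above:

```python
from itertools import combinations

# Example context I from the background slides: objects A1..A5, attributes B1..B4.
R = {
    "A1": {"B1", "B2"},
    "A2": {"B1", "B2"},
    "A3": {"B2", "B3"},
    "A4": {"B2", "B3", "B4"},
    "A5": {"B3", "B4"},
}
attributes = {"B1", "B2", "B3", "B4"}

def extent(intent_set):
    """Objects having every attribute of the intent."""
    return {o for o, attrs in R.items() if intent_set <= attrs}

def intent(extent_set):
    """Attributes shared by every object of the extent."""
    return set.intersection(*(R[o] for o in extent_set)) if extent_set else set(attributes)

# An attribute set B is an intent iff it is closed: intent(extent(B)) == B.
concepts = set()
for r in range(len(attributes) + 1):
    for combo in combinations(sorted(attributes), r):
        B = set(combo)
        if intent(extent(B)) == B:
            concepts.add((frozenset(extent(B)), frozenset(B)))

for ext, itt in sorted(concepts, key=lambda c: (len(c[1]), sorted(c[1]))):
    print(sorted(itt), "->", sorted(ext))
```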

Formal Context Coverage Based on Isolated Labels: Example
D = {x0: a b c d e, x1: a b e, x2: a b e, x3: a b c d e, x4: b d, x5: a b c d e, x6: a b c d e, x7: d f g, x8: b f g}
First iteration:
- Isolated points: {(x1, a), (x1, e), (x2, a), (x2, e), (x3, c)}
- Concepts:
  - C0 = {x0, x1, x2, x5, x6} × {a, b, e}; labels: a, e
  - C1 = {x0, x3, x5, x6} × {b, c}; label: c

Formal Context Coverage Based on Isolated Labels: Example (cont.)
Next iteration (coverage is not yet reached): expand with composed labels and repeat the same steps until full coverage of R is obtained.
Concepts:
- C0 = {x0, x1, x2, x5, x6} × {a, b, e}; labels: a, e
- C1 = {x0, x3, x5, x6} × {b, c}; label: c
- C2 = {x0, x4, x5, x6} × {b, d}; label: b.d
- C3 = {x7} × {d, f, g}; label: d.f
- C4 = {x8} × {b, f, g}; label: b.f
Features: {a, e, c, b.d, d.f, b.f}
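A sketch of the mapping from this toy document set to a binary relation, together with the extent of each extracted label (composed labels such as b.d are read here as term conjunctions); the x0..x8 indexing simply follows the order of the sentences in D:

```python
# Toy document set D from the slide, indexed x0..x8 in order of appearance.
docs = {
    "x0": "a b c d e", "x1": "a b e", "x2": "a b e", "x3": "a b c d e",
    "x4": "b d", "x5": "a b c d e", "x6": "a b c d e",
    "x7": "d f g", "x8": "b f g",
}

# Binary relation R between documents and terms.
R = {doc_id: set(text.split()) for doc_id, text in docs.items()}

def extent(label):
    """Documents containing every term of a (possibly composed) label such as 'b.d'."""
    terms = set(label.split("."))
    return sorted(doc_id for doc_id, term_set in R.items() if terms <= term_set)

# Features extracted on the slide: simple labels from isolated points,
# plus composed labels added to complete the coverage.
for feature in ["a", "e", "c", "b.d", "d.f", "b.f"]:
    print(feature, "->", extent(feature))
```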

Formal Context Coverage Based on Isolated Labels: Arabic Example
Original text (English gloss: "Net profits of the Saudi Al Rajhi Bank rose by 10% during the second quarter of the current year compared with the same period of last year. The company's net profit during the third quarter reached 24 million riyals, against a net profit of 21.8 million riyals for the corresponding quarter of the previous year. Gross profit during the third quarter reached 27 million riyals against 24 million riyals for the same period of the previous year, an increase of 12.5%. Gross operating profit during the third quarter reached 24 million riyals against 21 million riyals for the corresponding quarter of the previous year."):
إرتفعت الأرباح الصافية لمصرف الراجحي السعودي بنسبة 10% خلال الربع الثاني من العام الحالي مقارنة مع الفترة ذاتها من العام الماضي. وبلغ صافي ربح الشركة خلال الربع الثالث 24 مليون ريال مقابل صافي ربح بمقدار 21.8 مليون ريال للربع المماثل من العام السابق. كما بلغ إجمالي الربح خلال الربع الثالث 27 مليون ريال مقابل 24 مليون ريال عن نفس الفترة من العام السابق أي بارتفاع قدره 12.5%. وبلغ إجمالي الربح التشغيلي خلال الربع الثالث 24 مليون ريال مقابل 21 مليون ريال للربع المماثل عن نفس الفترة من العام السابق.
After stop word elimination and stemming:
رفع ربح صفا صرف رجح سعد نسب ربع ثني عام حال قرن فتر عام مضي. بلغ صفا ربح شرك ربع ثلث مليون ريال قبل صفا ربح قدر مليون ريال ربع مثل عام سبق. بلغ جمل ربح ربع ثلث مليون ريال مليون ريال فتر عام سبق رفع. بلغ جمل ربح شغل ربع ثلث مليون ريال قبل مليون ريال ربع مثل فترعام سبق.
Labels (features): {رفع.ربح (rise.profit)، صرف (bank)، رجح (Rajhi)، سعد (Saudi)، نسب (ratio)، قرن (compare)، مضي (past)، ثني (second)، شرك (company)، قدر (amount)}
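For the Arabic side, a comparable preprocessing sketch using NLTK's ISRI root stemmer; the slides do not name the Arabic stemmer actually used in the project, so ISRIStemmer is only an illustrative stand-in and the roots it produces may differ from those shown above:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download("stopwords", quiet=True)  # NLTK ships an Arabic stop word list

AR_STOP = set(stopwords.words("arabic"))
stemmer = ISRIStemmer()

def preprocess_arabic(text):
    """Segment an Arabic sentence, drop stop words, and reduce tokens to roots."""
    tokens = re.findall(r"[\u0600-\u06FF]+", text)
    return [stemmer.stem(t) for t in tokens if t not in AR_STOP]

print(preprocess_arabic("إرتفعت الأرباح الصافية لمصرف الراجحي السعودي بنسبة 10% خلال الربع الثاني"))
```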

Illustrative example: original text

Illustrative example (cont.): textual document after preprocessing; features extracted

Illustrative example (cont.): stemmed words

Methodology
Algorithms:
- Find a minimal coverage
- Use composed properties to reach a coverage
- Find pertinent information of good quality with reasonable efficiency
News categories: Management Change, Transaction, Performance, Others

Evaluation – English Texts

Evaluation – Arabic Texts

Conclusion & Future Work
- New methods were proposed to handle dimensionality reduction of the space of words
- FCA is used for its simplicity and effectiveness as an analysis tool
- Domain information is used to reduce structuring noise and improve the labeling process
- Applications such as corpus organization, text summarization and feature extraction were experimented with in the context of the FWATCH project
- Evaluation results show the significance of the new methods for incrementally updating a stable information store

Thank You Q & A