Ceyhun Karbeyaz Çağrı Toraman Anıl Türel Ahmet Yeniçağ 12.05.2010 Text Categorization For Turkish News.


Text categorization
- Classify text into predefined categories
- Supervised learning
- Labeled corpus
- Used in:
  - Indexing (e.g. libraries)
  - News articles
  - Spam filtering
1 / 22 11.5.2010


Bilkent News Portal
- Gathers news from different news providers
- Makes news more accessible
- New event detection and tracking
- Novelty detection
- Duplicate elimination
- Personalization
- News categorization

News categories
- Aktuel, En Çok Okunanlar, Anasayfa, Spor, Politika, Çevre, Tüm Haberler, Son Dakika, ...

Motivation
- News are categorized from RSS
- 24 good categories
- A few bad categories
- Bad: Anasayfa, EnCokOkunanlar, Gundem, SonDakika, Tum_Haberler, Yazarlar
- Good: Aktuel, Avrupa_Futbol, BilimTeknoloji, Bilisim, Cevre, DisHaberler, Dunya, Ege, Egitim, Ekonomi, Formula1, Hava_Yol, Ispanya, Italya, KulturSanat, Politika, Saglik, Siyaset, Spor, Televizyon, Turkiye, Yasam, Yazarlar, YurtHaberler

Data Set
- Categories are skewed (news is not evenly distributed across categories)

Approach
- Classifiers (found to be best [1])
  - K Nearest Neighbour (kNN)
  - Support Vector Machines (SVM)
- Training set: news with good categories
- Test set: news with good categories (for evaluation)
- Evaluation: test with already categorized news

Methodology
- Cleaning noise
- Preprocessing
- Indexing
- Document classification

Cleaning Noise
- News documents coming from different RSS feeds generally contain noise such as advertisements, hypertext, etc.
- Noise increases the similarity between documents that contain the same or similar noise.
- This decreases the performance of systems such as Bilkent News Portal, which rely on similarity between documents.

Cleaning Noise (cntd.)
- Noise such as hypertext is easily cleaned by removing the sentences that contain it.
- Its pattern does not change across news documents coming from different RSS feeds.
- E.g. hypertexts, which contain links to other documents, are defined as “ ”.

Cleaning Noise (cntd.)
- Each RSS feed attaches specific advertisements to its news documents.
- There is no general pattern that holds for all news documents.
- After a while, even the same RSS feed changes the advertisement attached to its news documents.

Cleaning Noise (cntd.)
- Compare two consecutive news documents from the same RSS feed sentence by sentence (each sentence is compared with every sentence of the consecutive news document).
- Calculate the similarity between each pair of sentences using cosine similarity.
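The sentence-by-sentence comparison can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sentence splitter, the bag-of-words weighting, the `threshold` value and the function names are all assumptions, and the slides do not state what cut-off (if any) was used to decide that a sentence is noise.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on '.', '!' and '?'; an assumption, not the authors' tokenizer.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def cosine_similarity(a, b):
    # Cosine of the angle between two bag-of-words term-frequency vectors.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_sentences(doc_a, doc_b, threshold=0.9):
    # Sentences of doc_a that closely match some sentence of doc_b.
    # threshold is a hypothetical cut-off chosen only for illustration.
    sents_b = split_sentences(doc_b)
    return [s for s in split_sentences(doc_a)
            if any(cosine_similarity(s, t) >= threshold for t in sents_b)]
```

Sentences returned by `shared_sentences` for two consecutive documents of one feed are candidates for feed-specific noise such as an attached advertisement.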

Preprocessing
- Stemming: the Zemberek API is used.
- Stop word list comparison: frequently occurring words are not taken into consideration.
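A rough sketch of the stop word step is given below. The stop word set is a tiny illustrative subset rather than the list used in the project, and the `stem` parameter is a hypothetical hook standing in for a real stemmer such as Zemberek, whose API is not reproduced here.

```python
# Illustrative subset of common Turkish function words; not the authors' list.
STOP_WORDS = {"ve", "bir", "bu", "için", "ile", "de", "da"}


def preprocess(text, stem=lambda word: word):
    # Lowercase, split on whitespace, drop stop words, then stem what remains.
    # Note: str.lower() does not apply Turkish casing rules for dotted and
    # dotless i, so a real pipeline would need locale-aware lowercasing.
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

By default `stem` leaves tokens unchanged, so the function can be tried without any stemmer installed.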

Document Indexing
- Creating a vector space model from index terms plus feature selection may be a costly process.
- Consistency with Bilkent News Portal
- Lemur [2] is used for the document indexing operation.
- Lemur can only index predefined formats
  - TREC text
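As an illustration of the format constraint, a cleaned news document could be wrapped in TREC text markup along these lines. The function name and the document identifier scheme are assumptions; the slides do not show how the conversion was actually done.

```python
def to_trec_text(doc_id, text):
    # Wrap one news document in the DOC / DOCNO / TEXT tags of the TREC text format.
    # doc_id is a hypothetical identifier such as "news-001".
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def to_trec_collection(docs):
    # Concatenate (doc_id, text) pairs into a single collection string.
    return "".join(to_trec_text(doc_id, text) for doc_id, text in docs)
```

The resulting string can be written to a file and handed to the indexer as one collection.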

Document Classification
Two different approaches, to assess which one performs better:
- K-Nearest Neighbor
- Support Vector Machines

K Nearest Neighbor
- Given training data D (categorized news in our case), the goal is to assign a test point X (news with unknown category in our case) the label of its closest neighbors in D.
- To find the k nearest news, we again used Lemur in place of a separate distance function.
- Lemur can also retrieve documents according to a score calculation.
- Lemur is quite fast at retrieving the k most similar documents.
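The voting step of kNN can be sketched as below. In the project the k nearest news are retrieved with Lemur; here a plain cosine similarity over term counts is swapped in as a stand-in so that the example is self-contained. The function names, the value of `k` and the toy training data are assumptions.

```python
import math
from collections import Counter


def cosine(va, vb):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(train, text, k=3):
    # train: list of (text, category) pairs with known categories.
    query = Counter(text.lower().split())
    # Rank training documents by similarity to the query, most similar first.
    ranked = sorted(
        train,
        key=lambda item: cosine(Counter(item[0].lower().split()), query),
        reverse=True,
    )
    # Majority vote over the categories of the k nearest documents.
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


train = [
    ("takım maçı kazandı gol", "Spor"),
    ("futbol maçı gol attı", "Spor"),
    ("meclis bütçe oylaması", "Politika"),
    ("hükümet bütçe açıkladı", "Politika"),
]
```

Calling `knn_classify(train, "maçı gol futbol")` ranks the two sports documents first, so the majority vote among the three nearest neighbors yields Spor.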

Support Vector Machines
- Support Vector Machines (SVMs) have been applied to various problems.
- Data lies in a k-dimensional space.
- Find a separating hyperplane (i.e. a subset with k-1 dimensions).
- Several hyperplanes are possible.

Support Vector Machines (cntd.)
[Figure 2. Possible hyperplanes in a sample space. Labels: margin of u2, support vectors.]

Support Vector Machines (cntd.)
- Aim: find a hyperplane that classifies correctly with maximal margin.
- Only the support vectors are effective in determining the hyperplane.
- Represent a hyperplane:
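For reference, a hyperplane and the maximal-margin objective are commonly written as follows. The symbols w, b, x_i and y_i (with labels y_i in {-1, +1}) are the usual textbook notation and are an assumption here, not necessarily the notation used on the slide.

```latex
\[
  \mathbf{w} \cdot \mathbf{x} + b = 0,
  \qquad \mathbf{w} \in \mathbb{R}^{k},\; b \in \mathbb{R}
\]
\[
  \min_{\mathbf{w},\, b} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \quad \text{subject to} \quad
  y_i \,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 \quad \text{for all } i
\]
```

Under this formulation the margin is 2 / ||w||, and the training points for which the constraint holds with equality are the support vectors.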

Support Vector Machines (cntd.)
[Figure 3. A sample linear SVM.]

Support Vector Machines (cntd.)
[Figure 4. A sample non-linear SVM.]
[Figure 5. Mapping with kernel function.]
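The idea of mapping with a kernel function can be illustrated with a small numeric check. The degree-2 polynomial kernel and the explicit feature map below are a standard textbook example chosen as an assumption; the slides do not say which kernel was used.

```python
import math


def dot(a, b):
    # Ordinary dot product of two equal-length vectors.
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(x, y):
    # Degree-2 polynomial kernel on 2-dimensional inputs: K(x, y) = (x . y)^2.
    return dot(x, y) ** 2


def feature_map(x):
    # Explicit mapping into 3 dimensions whose dot product equals poly_kernel.
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, y)
mapped_value = dot(feature_map(x), feature_map(y))
```

Both values agree up to floating-point rounding, which is why a linear separator in the mapped space can be found without ever computing the mapping explicitly.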

Conclusion
- The necessity of Turkish news categorization is covered.
  - Bilkent News Portal: RSS feeds may lack category information or have unrealistic categories such as Last Minute, Main Page, Agenda, etc.
- A categorization methodology for Turkish news is proposed.
- Finding the correct category is done by both kNN (base classifier) and SVM, to evaluate which one performs better.
- kNN over 100 experiments: 60% success.

References
[1] Yang, Y. and Liu, X. A re-examination of text categorization methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999.
[2] http://www.lemurproject.org/

Questions? Thank you for listening…
