Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO.



Content Classification Analysis based on LDA Topic Model
Web crawler: retrieving web news; Chinese parsing and extraction
Advanced TF-IDF: content processing; adding content-based tests; finding the best parameters on small data
Testing parameters: testing on big data; comparing to the content-based algorithm

Web crawler: Retrieved nearly 17,000 web news articles through the Sougou Database. The text includes HTML characters, but only to an insignificant extent.

Web crawler: Used ICTCLAS to parse the Chinese text and extract words, excluding stop words, conjunctions, prepositions and numerals.

Advanced TF-IDF: Each news article is split into TITLE, BEGIN, CONTENT and END sections, each with a different weight. Using TF-IDF to calculate the top 5 keywords gives an accuracy of 81% compared with the sorted database.
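The section weighting described on this slide can be sketched as follows. This is a minimal illustration, not the project's code: the weight values in SECTION_WEIGHTS, the function names, and the assumption that each article is already segmented into word lists per section are all hypothetical, since the slides do not give them.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not state the actual values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 2.0}


def weighted_tf(doc):
    """Term frequency where each occurrence counts by its section weight."""
    tf = Counter()
    for section, words in doc.items():
        for word in words:
            tf[word] += SECTION_WEIGHTS[section]
    return tf


def top_keywords(docs, index, k=5):
    """Return the k highest-scoring words of docs[index] by weighted TF-IDF."""
    n_docs = len(docs)
    # Document frequency: number of articles that contain each word.
    df = Counter()
    for doc in docs:
        seen = set()
        for words in doc.values():
            seen.update(words)
        df.update(seen)
    tf = weighted_tf(docs[index])
    scores = {w: f * math.log(n_docs / df[w]) for w, f in tf.items()}
    # Sort by descending score, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

With this plain IDF, a word that occurs in every article scores zero no matter which section it appears in, so the section weights only change the ranking among the more distinctive words.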

Advanced TF-IDF: Adding the content-based algorithm changes the accuracy only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change. We conclude that the semantics are useless in this circumstance.

Advanced TF-IDF: Searching for the best parameters on small data (fewer than 2,000 news articles), taking accuracy and time efficiency into account.
Testing set = 30% of the whole data
Training set = 70% of the whole data
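A 70/30 split like the one on this slide could be produced as below. The shuffling, the fixed seed and the function name are assumptions; the slides do not say how the articles were assigned to each set.

```python
import random


def split_train_test(items, test_fraction=0.3, seed=0):
    """Shuffle a copy of items and return (training set, testing set)."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

For example, `split_train_test(range(2000))` returns 1,400 training items and 600 testing items.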

Advanced TF-IDF: The number of keywords in the training sets equals the number in the testing sets.
[Table: keywords number, error score, accuracy; row: ALL]
Result: unstable.

Advanced TF-IDF: Using all keywords in the training sets.
[Table: keywords number, error score, accuracy]
Result: extremely low speed.

Advanced TF-IDF: Using all keywords in the testing sets.
[Table: keywords number, error score, accuracy]
With 10 keywords in the training sets, the accuracy, error score and time efficiency are all at their best.
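The slides do not say how keywords are turned into a category decision. One simple possibility, shown here purely as an assumption, is to keep the k most frequent training keywords per category and assign an article to the category that shares the most keywords with it. The function names and the overlap rule are hypothetical.

```python
from collections import Counter


def category_keywords(training_docs, k=10):
    """Map each category to the set of its k most frequent training keywords.

    training_docs is an iterable of (category, keyword list) pairs.
    """
    counts = {}
    for category, keywords in training_docs:
        counts.setdefault(category, Counter()).update(keywords)
    return {
        category: {word for word, _ in counter.most_common(k)}
        for category, counter in counts.items()
    }


def classify(keywords, profiles):
    """Return the category whose keyword set overlaps most with the article.

    Ties are broken alphabetically by category name.
    """
    article = set(keywords)
    return max(sorted(profiles), key=lambda c: len(article & profiles[c]))
```

Keeping k small bounds the size of every category profile, which is one plausible reason a setting such as 10 keywords would be faster than using all keywords.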

Testing parameters: Testing on big data. As the training set in every section increases gradually to 200, 450, 750 and finally 1,343 (all words), the accuracy changes as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
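The two figures quoted on this slide, overall accuracy and accuracy with one section left out, can be computed as in the sketch below. The function name and the toy labels in the usage note are illustrative only and are not the project's data.

```python
def accuracy(pairs, exclude=None):
    """Fraction of (true, predicted) pairs that match.

    If exclude is given, articles whose true section equals it are skipped.
    """
    kept = [(true, pred) for true, pred in pairs if true != exclude]
    if not kept:
        raise ValueError("no articles left to evaluate")
    return sum(true == pred for true, pred in kept) / len(kept)
```

Calling `accuracy(pairs)` gives the overall figure and `accuracy(pairs, exclude="culture")` gives the figure without the culture section.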

Testing parameters: Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower.

Summary
Partial encoding and decoding problems
Errors in keyword parsing lead to classification faults
Partially repeated passages lead to errors in accuracy
The algorithm is successful in general

Thanks