Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO
Outline: Web crawler (achieving web news; Chinese parsing & extracting). Advanced TF-IDF (content processing; adding content-based tests; finding best parameters in small data). Testing parameters (testing in big data; comparing to content-based algorithm).
Web crawler: Collected nearly 17,000 web news articles from the Sougou Database. The crawled text still contains some HTML characters, but only an insignificant amount.
Web crawler: Used ICTCLAS to segment the text and extract Chinese words, excluding stop words, conjunctions, prepositions, and numerals.
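The filtering step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the segmenter has already produced (word, tag) pairs, that tags follow a PKU-style tagset where `c`, `p`, and `m` mark conjunctions, prepositions, and numerals, and it uses a tiny hypothetical stop-word list. The names `filter_tokens`, `STOP_WORDS`, and `EXCLUDED_POS_PREFIXES` are made up for this example.

```python
# Assumption: tokens arrive as (word, pos_tag) pairs from a segmenter such as ICTCLAS.
# Assumption: tags follow a PKU-style tagset (c = conjunction, p = preposition, m = numeral).
EXCLUDED_POS_PREFIXES = ("c", "p", "m")
STOP_WORDS = {"的", "了", "是"}  # hypothetical, deliberately tiny stop-word list


def filter_tokens(tagged_tokens, stop_words=STOP_WORDS):
    """Keep only words that are neither stop words nor excluded parts of speech."""
    kept = []
    for word, pos in tagged_tokens:
        if word in stop_words:
            continue  # drop stop words regardless of tag
        if pos.startswith(EXCLUDED_POS_PREFIXES):
            continue  # drop conjunctions, prepositions, numerals
        kept.append(word)
    return kept


# Hand-tagged sample sentence, for illustration only.
sample = [("中国", "ns"), ("和", "c"), ("美国", "ns"), ("在", "p"),
          ("三", "m"), ("的", "u"), ("经济", "n")]
print(filter_tokens(sample))
```

Matching on tag prefixes rather than exact tags means sub-tags of the same class are dropped as well, if the tagset uses them.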
Advanced TF-IDF: Each news article is split into TITLE, BEGIN, CONTENT, and END sections, each with a different weight. TF-IDF is then used to compute the top 5 keywords; the accuracy is 81% compared with the sorted database.
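One way to read the section-weighted TF-IDF is sketched below. The slides do not give the actual weights or the exact IDF formula, so the values in `SECTION_WEIGHTS` and the smoothed IDF used here are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

# Assumption: illustrative weights only; the slides do not state the real values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 2.0}


def weighted_tf(doc):
    """Sum term counts across sections, scaling each occurrence by its section weight."""
    tf = Counter()
    for section, words in doc.items():
        weight = SECTION_WEIGHTS[section]
        for word in words:
            tf[word] += weight
    return tf


def document_frequency(docs):
    """Count in how many documents each word appears (any section)."""
    df = Counter()
    for doc in docs:
        vocab = set()
        for words in doc.values():
            vocab.update(words)
        df.update(vocab)
    return df


def top_keywords(doc, df, n_docs, k=5):
    """Rank words by weighted TF times a smoothed IDF and return the top k."""
    tf = weighted_tf(doc)
    scores = {
        word: freq * (math.log((1 + n_docs) / (1 + df[word])) + 1.0)
        for word, freq in tf.items()
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Two toy documents, already segmented and split into sections.
d1 = {"TITLE": ["football"], "BEGIN": ["match"],
      "CONTENT": ["team", "match", "news"], "END": ["news"]}
d2 = {"TITLE": ["stock"], "BEGIN": ["market"],
      "CONTENT": ["stock", "news"], "END": ["news"]}
df = document_frequency([d1, d2])
print(top_keywords(d1, df, n_docs=2, k=5))
```

In this toy corpus, "news" appears in both documents, so its IDF is lower and it ranks below the topic-specific words even though it has the same weighted count.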
Advanced TF-IDF: Adding the content-based algorithm. The accuracy moves only from 81% to 82% as the semantic weight goes from 1.0 to 0.0, which is not a significant change. We conclude that semantics is not useful in this circumstance.
Advanced TF-IDF: Searching for the best parameters on small data (fewer than 2,000 news articles), considering accuracy and time efficiency. Training set = 70% of the whole data; testing set = 30% of the whole data.
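The 70/30 split can be sketched as a seeded shuffle followed by a cut. Whether the original split was random or stratified by news category is not stated in the slides, so the random shuffle here is an assumption, and `split_train_test` is a hypothetical helper name.

```python
import random


def split_train_test(items, train_ratio=0.7, seed=0):
    """Shuffle a copy of the items and split it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder article ids stand in for the real news articles.
train, test = split_train_test(range(10))
print(len(train), len(test))
```

Fixing the seed keeps the split identical across parameter runs, so differences in accuracy come from the parameters rather than from a different partition.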
Advanced TF-IDF: Number of keywords in the training set equal to that in the testing set. [Figure: error score and accuracy versus number of keywords] Result: all unstable.
Advanced TF-IDF: Using all keywords in the training set. [Figure: error score and accuracy versus number of keywords] Result: extremely low speed.
Advanced TF-IDF: Using all keywords in the testing set. [Figure: error score and accuracy versus number of keywords] Result: with 10 keywords in the training set, the accuracy, error score, and time efficiency are the best.
Testing parameters: Testing on big data. As the training set in every section increases gradually to 200, 450, 750, and finally 1343 (all words), the accuracy changes as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.
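Reporting accuracy both overall and with one section excluded can be sketched as below. The label pairs are made-up toy data, not the project's results, and `accuracy` is a hypothetical helper.

```python
def accuracy(pairs, exclude=()):
    """Fraction of (true_label, predicted_label) pairs that match,
    ignoring articles whose true label is in `exclude`."""
    kept = [(true, pred) for true, pred in pairs if true not in exclude]
    if not kept:
        return 0.0
    return sum(true == pred for true, pred in kept) / len(kept)


# Toy predictions for illustration only.
pairs = [("sports", "sports"), ("finance", "finance"), ("culture", "finance"),
         ("culture", "culture"), ("sports", "sports")]
print(accuracy(pairs))
print(accuracy(pairs, exclude={"culture"}))
```

Filtering on the true label, rather than the predicted one, keeps misclassifications of other sections into the excluded one in the score.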
Testing parameters: Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower.
Summary: Partial encoding and decoding problems. Errors in keyword parsing lead to classification faults. Partially repeated passages lead to errors in accuracy. The algorithm is successful in general.
Thanks