Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO.

Presentation on theme: "Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO."— Presentation transcript:

1 Content Classification Analysis based on LDA Topic Model PROJECT LEADER: HONGBO ZHAO

2 Content Classification Analysis based on LDA Topic Model
Web crawler: obtaining web news; Chinese parsing & extracting
Advanced TF-IDF: contents processing; adding content-based tests; finding best parameters in small data
Testing parameters: testing in big data; comparing to content-based algorithm

3 Web crawler: Obtained nearly 17,000 web news articles through the Sougou Database. The text includes HTML characters, but only an insignificant amount.

4 Web crawler: Used ICTCLAS to parse and extract Chinese words, excluding stop words, conjunctions, prepositions and numerals.
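The filtering step on this slide can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the segmenter has already produced (word, part-of-speech tag) pairs, that the tags follow the common ICTCLAS/PKU letters (c = conjunction, p = preposition, m = numeral), and it uses a tiny made-up stop-word list. The name filter_tokens is hypothetical.

```python
# Hypothetical input: (word, pos_tag) pairs as a segmenter such as ICTCLAS might emit.
# Assumed tag letters: c = conjunction, p = preposition, m = numeral.
EXCLUDED_TAGS = {"c", "p", "m"}
STOP_WORDS = {"的", "了", "是"}  # illustrative stop-word list, not the project's


def filter_tokens(tagged_tokens):
    """Keep words that are not stop words and whose tag is not excluded."""
    kept = []
    for word, tag in tagged_tokens:
        if word in STOP_WORDS:
            continue
        # Compare only the first letter so sub-tags such as "mq" are dropped too.
        if tag[:1] in EXCLUDED_TAGS:
            continue
        kept.append(word)
    return kept


tagged = [("中国", "ns"), ("和", "c"), ("在", "p"),
          ("三", "m"), ("的", "u"), ("经济", "n")]
print(filter_tokens(tagged))
```

Only the place name and the noun survive the filter in this example; the conjunction, preposition, numeral and stop word are removed.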

5 Advanced TF-IDF: Split each news article into TITLE, BEGIN, CONTENT and END sections with different weights. Using TF-IDF to calculate the top 5 keywords, the accuracy is 81% compared with the sorted database.
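One way to read the section-weighted TF-IDF described here is sketched below. The slides do not give the actual weights or the exact TF-IDF variant, so the values in SECTION_WEIGHTS, the plain log(N/df) inverse document frequency, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical section weights; the slides do not state the real values.
SECTION_WEIGHTS = {"TITLE": 3.0, "BEGIN": 2.0, "CONTENT": 1.0, "END": 1.5}


def weighted_tf(doc):
    """doc maps a section name to its token list; returns weighted term counts."""
    tf = Counter()
    for section, tokens in doc.items():
        weight = SECTION_WEIGHTS[section]
        for token in tokens:
            tf[token] += weight
    return tf


def top_keywords(docs, index, k=5):
    """Return the k highest-scoring TF-IDF terms of docs[index]."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update({t for tokens in doc.values() for t in tokens})
    tf = weighted_tf(docs[index])
    total = sum(tf.values())
    scores = {
        term: (count / total) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    # Sort by score (descending), then alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    {"TITLE": ["stock"], "BEGIN": ["market"],
     "CONTENT": ["news", "market"], "END": ["news"]},
    {"TITLE": ["football"], "BEGIN": ["match"],
     "CONTENT": ["news"], "END": ["news"]},
]
print(top_keywords(docs, 0, k=2))
```

A term that occurs in every document (here "news") gets an inverse document frequency of zero, so it never outranks terms specific to one article, regardless of the section weights.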

6 Advanced TF-IDF: Adding the content-based algorithm (the accuracy goes from 81% to 82% as the semantic weight goes from 1.0 to 0.0) produces no significant change. We conclude that semantics is useless in this circumstance.
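The slides do not say how the semantic weight enters the score. A minimal sketch, assuming a simple linear blend between a TF-IDF score and a semantic (content-based) score, shows what sweeping the weight from 1.0 to 0.0 would look like. Both function names and the blend itself are hypothetical.

```python
def blended_score(tfidf_score, semantic_score, semantic_weight):
    """Assumed linear blend: weight 1.0 uses only semantics, 0.0 only TF-IDF."""
    if not 0.0 <= semantic_weight <= 1.0:
        raise ValueError("semantic_weight must be in [0, 1]")
    return (semantic_weight * semantic_score
            + (1.0 - semantic_weight) * tfidf_score)


def sweep(tfidf_score, semantic_score, steps=5):
    """Evaluate the blend for weights going from 1.0 down to 0.0."""
    weights = [1.0 - i / (steps - 1) for i in range(steps)]
    return [(w, blended_score(tfidf_score, semantic_score, w)) for w in weights]


for weight, score in sweep(tfidf_score=0.8, semantic_score=0.2):
    print(f"semantic weight {weight:.2f} -> score {score:.3f}")
```

Under this assumption, an accuracy that barely moves across the whole sweep means the semantic term contributes almost nothing beyond what TF-IDF already captures, which matches the conclusion on the slide.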

7 Advanced TF-IDF: Testing for the best parameters on small data (fewer than 2000 news articles), considering accuracy and time efficiency. Testing set = 30% of the whole data; training set = 70% of the whole data.
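The 70% / 30% split can be sketched as below. The slides do not say whether the split was random or done per category, so the seeded random shuffle and the name split_train_test are assumptions.

```python
import random


def split_train_test(items, test_ratio=0.3, seed=0):
    """Shuffle a copy of items and split it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Article ids stand in for the news articles themselves.
train, test = split_train_test(range(2000))
print(len(train), len(test))
```

Fixing the seed keeps the same articles in the test set across runs, so accuracy figures for different keyword counts stay comparable.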

8 Advanced TF-IDF: The number of keywords in the training sets equals the number in the testing sets.

keywords number   error score   accuracy
5                 2256          0.4983277591973244
10                1579          0.5735785953177257
15                1335          0.6304347826086957
20                1276          0.6789297658862876
ALL               1720          0.7190635451505016

Result: unstable.

9 Advanced TF-IDF: Using all keywords in the training sets.

keywords number   error score   accuracy
5                 1877          0.6471571906354515
10                1423          0.7073578595317725
15                1457          0.7006688963210702
20                1474          0.7056856187290970

Result: extremely low speed.

10 Advanced TF-IDF: Using all keywords in the testing sets.

keywords number   error score   accuracy
5                 1378          0.6321070234113713
10                1413          0.7040133779264214
15                1333          0.7107023411371237
20                1468          0.7257525083612040

Result: with 10 keywords in the training sets, the accuracy, error score and time efficiency give the best overall result.

11 Testing parameters: Testing on big data. As the training set in every section increases gradually to 200, 450, 750 and finally 1343 (all words), the accuracy changes as shown in the figure. The final accuracy reaches 82.5%, or 85.1% excluding the culture section. The results support the parameters we selected.

12 Testing parameters: Compared to the content-based algorithm, the accuracy is greater; however, the time efficiency is lower.

13 Summary: partial encoding & decoding problems; errors in keyword parsing lead to classification faults; partially repeated passages lead to errors in accuracy; the algorithm is successful in general.

14 Thanks Content Classification Analysis based on LDA Topic Model

