1 MACHINE LEARNING CLASSIFICATION OF USER INTERESTS ACROSS LANGUAGES AND SOCIAL NETWORKS
Elena Mikhalkova, Nadezhda Ganzherli, Yuri Karyakin, Dmitriy Grigoryev Tyumen State University

2 Assumption: machine-learning classification of user interests does not depend on the language, the social network (...and, probably, not on the interest itself)

3 Dataset: https://github.com/evrog/TSAAP
Inter-annotator agreement: Krippendorff's α = 0.82 (above the 0.8 reliability threshold).

No. of pages         Football  Rock Music  Vegetarianism
Vkontakte Russian    39        109         127
Twitter Russian      33        37          32
Twitter English      97        96          100

4 Normalization & Lemmatization
In-house tweet preprocessing software: English texts are lemmatized with the NLTK lemmatizer; Russian texts with Pymystem3. Stop-words are not excluded.
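The preprocessing software itself is not shown on the slide; below is a minimal sketch of the normalization step, assuming typical tweet cleaning (the function name and regular expressions are illustrative, not the authors' code; lemmatization with NLTK or Pymystem3 would run on the result):

```python
import re

def normalize_tweet(text):
    # Hypothetical normalization step: lowercase, strip URLs and @mentions,
    # keep hashtag words without the '#'. Lemmatization (NLTK for English,
    # Pymystem3 for Russian) would be applied to the cleaned text.
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.replace("#", " ")              # keep the hashtag's word
    return " ".join(text.split())              # collapse whitespace

print(normalize_tweet("Check out #Football news @BBC https://t.co/xyz"))
# → check out football news
```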

5 Interclass classification
Cross-validation: 200 texts of varying length per class–network pair (1,800 texts in total); average F1-score over 5 folds.
Algorithms: Support Vector Machine, Neural Network, Naive Bayes, Logistic Regression, Decision Trees, k-Nearest Neighbors.
Parameter settings in Scikit-Learn:
- SVM: four kernel functions (linear, polynomial, Radial Basis Function, sigmoid);
- Naive Bayes: Bernoulli, Multinomial, and Gaussian variants;
- Multi-layer Perceptron (NN): 1 hidden layer of 100 neurons, two solvers (lbfgs and adam);
- three data models (next slide).
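The evaluation loop above can be sketched roughly as follows. This is a hedged illustration on a toy two-class corpus, not the authors' code: the real experiment uses the 1,800 collected texts, all six algorithm families, and the full parameter grid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC

# Toy two-class corpus standing in for the real user pages.
texts = (["goal match striker league score"] * 10 +
         ["guitar concert album drummer riff"] * 10)
labels = [0] * 10 + [1] * 10

# Bernoulli data model: absence/presence of a word (binary features).
X = CountVectorizer(binary=True).fit_transform(texts)

results = {}
for name, clf in [("LogReg", LogisticRegression()),
                  ("BernoulliNB", BernoulliNB()),
                  ("MultinomialNB", MultinomialNB()),
                  ("SVM-linear", SVC(kernel="linear"))]:
    # Average F1 over 5 folds, as on the slide.
    results[name] = cross_val_score(clf, X, labels, cv=5,
                                    scoring="f1_macro").mean()

for name, f1 in results.items():
    print(f"{name}: {f1:.3f}")
```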

6 Data Models
- Bernoulli: absence/presence of a word (0 or 1);
- Frequency distribution: presence of a word denoted by its frequency in the training vocabulary (an integer in [0; +∞));
- Normalized frequency: presence of a word denoted by its normalized frequency in the training vocabulary, in the interval [0; 1].
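The three data models can be sketched in a few lines of scikit-learn (a minimal illustration with made-up documents; the slide does not specify the exact vectorization code):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the match the goal", "rock concert rock"]

# Frequency distribution: raw word counts (integers in [0; +inf)).
counts = CountVectorizer().fit_transform(docs).toarray()

# Bernoulli: absence/presence of a word (0 or 1).
bernoulli = (counts > 0).astype(int)

# Normalized frequency: counts scaled into [0; 1] per document.
normalized = counts / counts.sum(axis=1, keepdims=True)

print(bernoulli)
print(normalized)
```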

7

8 Results-1
- Lemmatization slightly increases performance: summed F1-scores rise by about 3%.
- Effectiveness of the Bernoulli model: by mode, 18 perfect F1-scores of 1.0 versus 4 for the frequency models; by mean, it also outperforms the plain and normalized frequency models.
- Best combinations: Logistic Regression with the Bernoulli model; Neural Network (lbfgs) with the Bernoulli model (no need to add layers); Multinomial Naive Bayes with plain frequencies (∑ F1-scores = 17.5).

9 Summed and averaged F1-scores

MaI  Total   Vk Ru   T Ru    T En    Vk, xAVE  T, xAVE  Ru, xAVE  En, xAVE
Normalized texts
F    33.976  10.24   11.826  11.91   0.853     0.989    0.919     0.993
R    33.138  10.064  11.334  11.74   0.839     0.961    0.892     0.978
V    32.906  9.808   11.302  10.796  0.817     0.962    0.88      0.983
Lemmatized texts
F    34.282  10.43   11.932  11.92   0.869     0.994    0.932
R    33.942  10.398  11.624  11.754  0.867     0.974    0.918     0.98
V    33.708  10.272  11.622  11.814  0.856     0.977    0.912     0.985

10 Results-2: Mann-Whitney U
Difference by network: Vkontakte-Russian scores lower than Twitter-English (p-value = 1.0, alternative "greater") and Twitter-Russian (p-value = 0.99, alternative "greater").
Difference by interest: Vegetarianism and Rock Music are very likely to score lower than Football (p-value = 0.99, alternative "greater", in both comparisons).
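A test of this shape can be run with SciPy's `mannwhitneyu` (the per-fold scores below are made-up numbers for illustration, not the paper's data):

```python
from scipy.stats import mannwhitneyu

# Hypothetical per-fold F1 scores (illustrative only).
vk_ru = [0.82, 0.84, 0.81, 0.83, 0.80]   # Vkontakte-Russian
tw_en = [0.99, 0.98, 1.00, 0.97, 0.98]   # Twitter-English

# H1: Vkontakte-Russian scores are greater. A p-value near 1 under this
# one-sided alternative means the opposite holds: Vkontakte-Russian
# scores lower, as reported on the slide.
stat, p = mannwhitneyu(vk_ru, tw_en, alternative="greater")
print(round(p, 3))
```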

11 Thank you!

