1 USE RECIPE INGREDIENTS TO PREDICT THE CATEGORY OF CUISINE Group 7 – MEI, Yan & HUANG, Chenyu

2 Outline Background & Challenge; Preprocessing: Lemmatization, TF-IDF; Classification: k-NN, Naive Bayes, Linear Support Vector Classification, Logistic Regression Classification, Random Forest; Evaluation; Conclusion

4 Background A problem from Kaggle: predict the category of cuisine from the recipe ingredients, e.g. pasta -> Italian, kimchi -> Korean, curry -> Indian.

5 Challenge Multi-class classification with 6715 ingredient features: if we used a binary feature for every ingredient in each recipe, the training data would be too large. There is a huge number of class labels to train, quite different from a simple ‘Yes’ or ‘No’ label. The data is also class-imbalanced: Italian and Indian recipes dominate the dataset, while a cuisine such as “cajun_creole” appears only rarely.

7 Outline Background & Challenge; Preprocessing: Lemmatization, TF-IDF; Classification: k-NN, Naive Bayes, Linear Support Vector Classification, Logistic Regression Classification, Random Forest; Evaluation; Conclusion

8 Lemmatization Some characters are hard to handle in the data set: ™ and ® are deleted, since they do not influence the result; French characters (é, ù) are replaced by a similar English character, making sure each word is still unique among the features after the replacement. Plural forms such as eggs and egg are collapsed using NLTK (Natural Language Toolkit), which lemmatizes each word according to the dictionary in the toolkit.
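A minimal sketch of this cleanup in Python, assuming the ingredients are plain strings; the accent map and the function name are hypothetical, since the slide does not list the exact replacements used:

```python
from nltk.stem import WordNetLemmatizer  # requires the WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Hypothetical accent map; the slide only gives é and ù as examples.
ACCENT_MAP = str.maketrans("éù", "eu")

def clean_ingredient(text):
    text = text.translate(ACCENT_MAP)                        # replace French characters
    text = text.replace("\u2122", "").replace("\u00ae", "")  # delete the TM and (R) symbols
    # Lemmatize each word so plural forms collapse, e.g. "eggs" -> "egg".
    return " ".join(lemmatizer.lemmatize(w) for w in text.lower().split())

print(clean_ingredient("Eggs™"))  # -> "egg"
```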

9 TF-IDF The problem is similar to labeling a document according to its content. Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. The raw term frequency TF(t) is the number of times that term t occurs in the content; the inverse document frequency down-weights terms that appear in many documents. After lemmatization and TF-IDF, we reduce the features from 6715 to 2774.
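A minimal sketch with scikit-learn's TfidfVectorizer, using hypothetical toy recipes (each recipe's ingredient list joined into one string):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: one "document" per recipe.
recipes = [
    "egg basil tomato olive oil",
    "kimchi rice sesame oil",
    "curry powder onion garlic",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(recipes)   # sparse matrix: recipes x ingredient terms

print(X.shape)                          # (3, number of distinct terms)
print(vectorizer.get_feature_names_out())
```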

10 Outline Background & Challenge; Preprocessing: Lemmatization, TF-IDF; Classification: k-NN, Naive Bayes, Linear Support Vector Classification, Logistic Regression Classification, Random Forest; Evaluation; Conclusion

11 k-NN scikit-learn implements two different nearest neighbors classifiers: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user; RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user. We choose the first classifier and set k = 1. Its result serves as the baseline for all classifiers' performance.
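A minimal sketch of this baseline, with hypothetical toy data standing in for the real TF-IDF matrix and cuisine labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data standing in for the real recipes.
recipes = ["basil tomato olive oil", "kimchi rice sesame", "curry onion garlic", "pasta tomato basil"]
cuisines = ["italian", "korean", "indian", "italian"]

X = TfidfVectorizer().fit_transform(recipes)
knn = KNeighborsClassifier(n_neighbors=1)  # k = 1, as chosen on the slide
knn.fit(X, cuisines)
print(knn.predict(X[:1]))                  # -> ['italian']
```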

12 Naive Bayes The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification); in practice, fractional counts such as TF-IDF may also work. Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution, which in turn helps to alleviate problems stemming from the curse of dimensionality. It performed better than expected: ingredient attributes are relatively independent compared with word vectors in ordinary text.

13 Parameters The default smoothing parameter is alpha = 1. We set alpha = 0.01, because the TF-IDF values are fractions much smaller than 1, far below raw word counts, so a much smaller smoothing term is appropriate.
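A minimal sketch with scikit-learn, again on hypothetical toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

recipes = ["basil tomato olive oil", "kimchi rice sesame", "curry onion garlic", "pasta tomato basil"]
cuisines = ["italian", "korean", "indian", "italian"]

X = TfidfVectorizer().fit_transform(recipes)
nb = MultinomialNB(alpha=0.01)  # smaller smoothing to match the small TF-IDF values
nb.fit(X, cuisines)
print(nb.predict(X[:1]))
```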

14 Linear Support Vector Classification The advantages of support vector machines are: effective in high-dimensional spaces; memory-efficient, since the decision function uses only a subset of the training points (the support vectors); versatile, since different kernel functions can be specified for the decision function (common kernels are provided, but it is also possible to specify custom kernels). The disadvantages include: if the number of features is much greater than the number of samples, the method is likely to give poor performance; and SVMs do not directly provide probability estimates, which are calculated using an expensive five-fold cross-validation.

15 Linear Support Vector Classification Multiclass support is handled according to a one-vs-all scheme. LinearSVC uses a linear kernel; a Radial Basis Function kernel would require the general SVC classifier instead.

16 Parameters Default parameters: the penalty parameter C of the error term is 1.0, and dual = True. We set C = 0.8 and dual = False.
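A minimal sketch with these settings, again on hypothetical toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

recipes = ["basil tomato olive oil", "kimchi rice sesame", "curry onion garlic", "pasta tomato basil"]
cuisines = ["italian", "korean", "indian", "italian"]

X = TfidfVectorizer().fit_transform(recipes)
svc = LinearSVC(C=0.8, dual=False)  # the slide's settings; defaults are C=1.0, dual=True
svc.fit(X, cuisines)
print(svc.predict(X[:1]))
```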

17 Logistic Regression Classification Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

18 Parameters We use GridSearchCV to find the best parameters.
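A minimal sketch of such a search; the parameter grid and the toy data are hypothetical, since the slide does not list the values searched:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data with two samples per class so a 2-fold stratified split works.
recipes = ["basil tomato oil", "pasta tomato basil", "kimchi rice sesame",
           "kimchi cabbage garlic", "curry onion garlic", "curry lentil cumin"]
cuisines = ["italian", "italian", "korean", "korean", "indian", "indian"]

X = TfidfVectorizer().fit_transform(recipes)
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=2)
grid.fit(X, cuisines)
print(grid.best_params_)  # the C value that scored best in cross-validation
```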

19 Random Forest A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. A diverse set of classifiers is created by introducing randomness in the classifier construction; the prediction of the ensemble is given as the averaged prediction of the individual classifiers.

20 Parameters By default, the number of trees in the forest is 10. We set it to 100: more trees cover more features, and the larger the number the better the result, but the longer the computation takes.
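A minimal sketch with this setting, again on hypothetical toy data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

recipes = ["basil tomato olive oil", "kimchi rice sesame", "curry onion garlic", "pasta tomato basil"]
cuisines = ["italian", "korean", "indian", "italian"]

X = TfidfVectorizer().fit_transform(recipes)
rf = RandomForestClassifier(n_estimators=100)  # 100 trees instead of the slide-era default of 10
rf.fit(X, cuisines)
print(rf.predict(X[:1]))
```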

21 Outline Background & Challenge; Preprocessing: Lemmatization, TF-IDF; Classification: k-NN, Naive Bayes, Linear Support Vector Classification, Logistic Regression Classification, Random Forest; Evaluation; Conclusion

22 Evaluation Setup: Python 3.3 for Windows, with two libraries, NLTK (Natural Language Toolkit) and scikit-learn. Evaluation metrics: accuracy and running time.
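A minimal sketch of how accuracy and time can be measured together; the classifier and the train/test split are assumed to come from the steps above:

```python
import time
from sklearn.metrics import accuracy_score

def evaluate(clf, X_train, y_train, X_test, y_test):
    # One wall-clock measure covering both training and prediction.
    start = time.time()
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    elapsed = time.time() - start
    return accuracy_score(y_test, predictions), elapsed
```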

23 Accuracy

24 Time The running time for 1-NN is longer than 5 hours.

25 Outline Background & Challenge; Preprocessing: Lemmatization, TF-IDF; Classification: k-NN, Naive Bayes, Linear Support Vector Classification, Logistic Regression Classification, Random Forest; Evaluation; Conclusion

26 Conclusion The preprocessing step dramatically saves execution time. Different parameters significantly affect the results. Considering both accuracy and time, Linear SVC is the best choice.

