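The First China Big Data Technology Innovation and Entrepreneurship Competition: Keyword Industry Classification. Team ThuFit: Zhou Xinyu (周昕宇), Wu Yuxin (吴育昕), Ren Jie (任杰), Wang Yuqi (王禺淇), Luo Hongyin (罗鸿胤). Advisors: Fang Zhanpeng (方展鹏), Tang Jie (唐杰). Tsinghua University, Future Internet interest group.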


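1 The First China Big Data Technology Innovation and Entrepreneurship Competition
Keyword Industry Classification
Team ThuFit: Zhou Xinyu, Wu Yuxin, Ren Jie, Wang Yuqi, Luo Hongyin
Advisors: Fang Zhanpeng, Tang Jie
Tsinghua University, Future Internet interest group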

2 Task
Given: partially labeled keywords, the first 10 search results for each keyword, and keyword-buyer relationships.
Goal: predict the classes of the unlabeled keywords.

3 Data summary
keyword_class.txt: 10,787,584 keywords, of which 1,143,928 (10.6%) are labeled; 9,963,062 unique keywords; 33 classes.
keyword_users.txt: 23,942,643 entries; each entry is a keyword-buyer pair.
keyword_titles.txt: 21,575,166 entries, of which only 10,787,583 are non-empty; each entry consists of a keyword and its first 10 Baidu search results.

4 Approach
Preprocessing: keyword segmentation.
Feature extraction: keyword segments, keyword-buyer relation, keyword-segment relation, search result utilization.
Model: liblinear.

5 Keyword segmentation

6 Feature Extraction - segment
Sparse representation of segments
Smoothed TFIDF-based feature
N-gram
“End-gram”

7 Feature Extraction - TFIDF

8 Feature Extraction - N-gram

9 Feature Extraction - End-gram
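
The segment features named on slides 6 to 9 can be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not give the smoothing formula, so an add-one smoothed IDF is assumed, and "end-gram" is not defined in the transcript, so it is guessed here to mean the last n segments of a keyword. The function names and the toy keywords are hypothetical.

```python
import math
from collections import Counter


def ngrams(segments, n):
    """Contiguous n-grams over a keyword's segment list, joined with '_'."""
    return ["_".join(segments[i:i + n]) for i in range(len(segments) - n + 1)]


def end_grams(segments, max_n=2):
    """ASSUMPTION: an 'end-gram' is the last n segments of a keyword (n = 1..max_n)."""
    top = min(max_n, len(segments))
    return ["END:" + "_".join(segments[-n:]) for n in range(1, top + 1)]


def smoothed_tfidf(docs):
    """docs: list of token lists. Returns one sparse {token: weight} dict per doc.

    ASSUMPTION: idf = ln((1 + N) / (1 + df)) + 1, i.e. add-one smoothing.
    """
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            tok: (cnt / len(doc)) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok, cnt in tf.items()
        })
    return vectors


# Toy, already-segmented keywords (hypothetical data).
keywords = [["stainless", "steel", "pipe"], ["steel", "pipe", "price"], ["limit", "switch"]]
tokens = [segs + ngrams(segs, 2) + end_grams(segs) for segs in keywords]
features = smoothed_tfidf(tokens)
```

Each keyword becomes a sparse dictionary, so segments that occur in many keywords receive a lower weight than rare ones.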

10 Feature Extraction

11 Keyword-buyer/segment relation
[Figure: buyers B0-B3, keywords K0-K3, segments S0-S3, classes C0-C3]

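12 Keyword-buyer/segment relation
[Figure: buyers B0-B3, keywords K0-K3, segments S0-S3, classes C0-C3]
Classes shown per segment: S0: C2; S1: C3; S2: (none); S3: C2 C3
Classes shown per keyword: K0: C2; K1: (none); K2: (none); K3: C3
Classes shown per buyer: B0: C2 C3; B1: (none); B2: (none); B3: (none)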

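13 Keyword-buyer/segment relation
[Figure: buyers B0-B3, keywords K0-K3, segments S0-S3, classes C0-C3]
Classes shown per segment: S0: C2; S1: C3 C3; S2: C0; S3: C2 C3 C0 C3
Classes shown per keyword: K0: C2 C0; K1: (none); K2: C3; K3: C3
Classes shown per buyer: B0: C2 C3; B1: C0 C3; B2: (none); B3: (none)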

14 Keyword-buyer relation
Assumption: a buyer tends to buy keywords of similar classes.
On the labeled data, obtain the distribution over classes of the keywords each buyer buys, so that each buyer has a 33-dimensional feature vector.
The feature vector of a keyword is the average of the feature vectors of the buyers who buy it.
Using this feature alone, we get an accuracy of 0.82.
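
A minimal sketch of the keyword-buyer feature described on this slide, assuming the keyword-buyer pairs and the known labels are already loaded into memory. The function names, the in-memory layout, and the toy data (which uses 3 classes instead of 33 to stay readable) are hypothetical.

```python
from collections import defaultdict

NUM_CLASSES = 33  # number of classes given in the data summary


def buyer_class_distributions(keyword_buyers, keyword_labels, num_classes=NUM_CLASSES):
    """Per buyer: normalized class histogram over the labeled keywords they bought."""
    counts = defaultdict(lambda: [0.0] * num_classes)
    for keyword, buyer in keyword_buyers:
        label = keyword_labels.get(keyword)
        if label is not None:  # unlabeled keywords contribute nothing
            counts[buyer][label] += 1.0
    return {b: [c / sum(v) for c in v] for b, v in counts.items()}


def keyword_features(keyword_buyers, buyer_dist, num_classes=NUM_CLASSES):
    """Per keyword: average of the class distributions of its buyers."""
    sums = defaultdict(lambda: [0.0] * num_classes)
    n_buyers = defaultdict(int)
    for keyword, buyer in keyword_buyers:
        dist = buyer_dist.get(buyer)
        if dist is None:  # buyer never bought a labeled keyword
            continue
        sums[keyword] = [s + d for s, d in zip(sums[keyword], dist)]
        n_buyers[keyword] += 1
    return {k: [s / n_buyers[k] for s in v] for k, v in sums.items()}


# Toy data: (keyword, buyer) pairs and labels for some keywords.
pairs = [("k0", "b0"), ("k1", "b0"), ("k2", "b0"), ("k2", "b1"), ("k3", "b1")]
labels = {"k0": 0, "k1": 0, "k3": 2}
dist = buyer_class_distributions(pairs, labels, num_classes=3)
feats = keyword_features(pairs, dist, num_classes=3)
```

In this sketch a labeled keyword's own label also enters its buyers' distributions, which is one way such a feature can end up biased toward the training data.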

15 [Figure: buyers B0-B3, keywords K0-K3, segments S0-S3, classes C0-C3]

16 Keyword-buyer relation
We also tried modeling buyers by the segments of the keywords they bought, and modeling keyword-keyword relationships through the segments that keywords share.
Buyer -> Keyword -> Segment => Buyer -> Segment
We further introduced higher-order relations between buyers and keywords, but the improvements were small.
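
The composition Buyer -> Keyword -> Segment => Buyer -> Segment can be illustrated with a small sketch. It assumes keywords are already segmented; the function name and the toy data are hypothetical, and the slides do not say how the resulting counts were weighted.

```python
from collections import Counter, defaultdict


def compose_buyer_segment(keyword_buyers, keyword_segments):
    """Collapse Buyer -> Keyword -> Segment into per-buyer segment counts."""
    buyer_segments = defaultdict(Counter)
    for keyword, buyer in keyword_buyers:
        # Every segment of a purchased keyword counts once for the buyer.
        buyer_segments[buyer].update(keyword_segments.get(keyword, []))
    return buyer_segments


# Toy data: (keyword, buyer) pairs and segmented keywords.
pairs = [("k0", "b0"), ("k1", "b0"), ("k1", "b1")]
segments = {"k0": ["steel", "pipe"], "k1": ["steel", "price"]}
buyer_segments = compose_buyer_segment(pairs, segments)
```

Each buyer is then described by a bag of segments rather than by a list of whole keywords.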

17 Keyword-segment relation
Reverse the link between segments and keywords.
Keyword -> Segment => Segment -> Keyword
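
Reversing the Keyword -> Segment link amounts to building an inverted index. The sketch below shows one way to do that and to look up keywords that share a segment; the function names and toy data are hypothetical, and the slides do not specify how the reversed links were turned into features.

```python
from collections import defaultdict


def invert(keyword_segments):
    """Turn Keyword -> Segment lists into a Segment -> Keyword index."""
    index = defaultdict(set)
    for keyword, segments in keyword_segments.items():
        for seg in segments:
            index[seg].add(keyword)
    return index


def related_keywords(keyword, keyword_segments, index):
    """Keywords that share at least one segment with the given keyword."""
    related = set()
    for seg in keyword_segments[keyword]:
        related |= index[seg]
    related.discard(keyword)
    return related


# Toy segmented keywords.
segments = {"k0": ["steel", "pipe"], "k1": ["steel", "price"], "k2": ["limit", "switch"]}
index = invert(segments)
```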

18 [Figure: buyers B0-B3, keywords K0-K3, segments S0-S3, classes C0-C3]

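19 Search Result Utilization
Some weird keywords appear, matching /^[0-9a-zA-Z\-_]{1,}$/
Their search results:
Nationwide suppliers [IC37 affiliated site] price | PDF...
IC chip brands, prices, PDF specifications - electronic product data - 买卖IC网
PIC16C57-XT/SP145 IC, diode, and transistor lookup; purchase PIC16C57-XT/SP...
Original imported connectors, TYCO, in stock; Tyco integrated circuits, connectors, plug connectors, AMP
European-style backplane connectors; 崧晔达 prices, quality 崧晔达 wholesale / purchasing - Alibaba
PVC supply, connectors, PVC wholesale, PVC supply - Alibaba
Shanghai Jinqing Electronic Technology Co., Ltd., limit switches
Fuzhou Fuming Instruments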
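
The pattern on this slide can be applied directly in Python to flag such keywords. The helper name and the sample keywords below are illustrative only.

```python
import re

# Pattern from the slide: only ASCII letters, digits, '-' and '_', at least one character.
WEIRD_KEYWORD = re.compile(r"^[0-9a-zA-Z\-_]{1,}$")


def is_weird(keyword):
    """True if the keyword looks like a serial number or code rather than natural language."""
    return WEIRD_KEYWORD.match(keyword) is not None


# Illustrative samples (not taken from the competition data).
samples = ["PIC16C57-XT", "限位开关", "limit switch", "a_b-1"]
flags = [is_weird(s) for s in samples]
```

Keywords flagged this way are the ones for which search result titles would be the main source of semantic information.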

20 Search Result Utilization
For normal keywords, the keyword itself carries semantic meaning.
Keywords with less semantic information are usually product serial numbers or domain-specific terminology, e.g. chemical element names.
The supplementary information from search results yields more accurate predictions on “weird” keywords.
However, such keywords did not seem to be included in the online test.

21

22 Feature Statistics
Dimensionality: 200,000
Lower dimensionality gives better generalization ability.

23 Implementation
Life is short, you need Python

24 Model
Liblinear: A Library for Large Linear Classification
L2-loss logistic regression
33 one-vs-all classifiers, one per class.
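
To illustrate the one-vs-all setup, here is a small pure-Python sketch: one binary logistic regression per class, with the highest-scoring class chosen at prediction time. It is a stand-in and does not use liblinear; the SGD training loop, the hyperparameters, and the toy data are all assumptions made for this example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_binary(X, y, dim, epochs=200, lr=0.5, l2=1e-3):
    """SGD logistic regression. X: sparse {index: value} rows; y: 0/1 targets.

    The L2 penalty is applied only to weights of features active in the current row.
    """
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = b + sum(w[i] * v for i, v in x.items())
            g = sigmoid(z) - target  # gradient of the log loss w.r.t. z
            for i, v in x.items():
                w[i] -= lr * (g * v + l2 * w[i])
            b -= lr * g
    return w, b


def train_one_vs_all(X, labels, num_classes, dim):
    """One binary classifier per class: that class versus all the others."""
    return [train_binary(X, [1 if l == c else 0 for l in labels], dim)
            for c in range(num_classes)]


def predict(models, x):
    """Return the class whose classifier gives the highest score."""
    scores = [b + sum(w[i] * v for i, v in x.items()) for w, b in models]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy problem: 3 classes, each marked by one feature.
X = [{0: 1.0}, {1: 1.0}, {2: 1.0}]
labels = [0, 1, 2]
models = train_one_vs_all(X, labels, num_classes=3, dim=3)
```

With 33 classes the same loop would train 33 classifiers over the sparse feature vectors.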

25 Experiments and Results
We split the labeled data into a training set and a validation set.
All of the following results are local results. Online test results are higher because more training data is used.
Because of the complexity of migrating our code to the Hadoop platform (mainly because we used third-party non-Java libraries), not all of the features above are employed in our final submission.
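
A local evaluation of this kind can be sketched as below. The split ratio and the random seed are assumptions, since the slides do not state them, and the helper names are hypothetical.

```python
import random


def split_labeled(items, validation_fraction=0.2, seed=0):
    """Shuffle labeled items and split them into (training, validation) lists.

    ASSUMPTION: a random 80/20 split; the slides do not state the ratio used.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * validation_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted class equals the true class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Toy labeled data: (keyword, class) pairs.
labeled = [("k%d" % i, i % 3) for i in range(10)]
train, valid = split_labeled(labeled)
```

Accuracy measured on the validation part corresponds to the local results referred to on this slide.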

26 Feature vector constituents | Accuracy
Keyword-buyer relation |
Keyword-segment relation |

27 Analysis

28 Limitations
Two types of feature:
Relation feature: utilizes prior knowledge of class labels; low dimension; may be biased toward the training data.
TFIDF feature: uses no class label information; high dimension; robust, with good generalization ability.
A simple combination of the two does not work well. Ensemble methods may work around this problem.

29 Thanks!

