Social Data Toby Segaran Author, Programming Collective Intelligence Data Magnate, Metaweb Technologies.

1 Social Data Toby Segaran Author, Programming Collective Intelligence Data Magnate, Metaweb Technologies

2 Data mining? Sorting through data* to identify patterns and establish relationships (* usually a lot of data)

3 Where and why? Methods and examples

4 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

5 Google ad

6 Facebook ad

7 This is strange... Google just has text. Facebook knows more about me. But it's taking a few cues...

8 Status: engaged

9 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

10 Real Amazon Products

11 Netflix Prize

12 Strands Contest

13 Custom News

14

15

16 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

17 Ranking algorithms The now-incredibly-famous paper

18 Ranking algorithms

19 Google begins tracking clicks in 2005; MSN search claims neural network; AOL Data Scandal; Learning behavior

20 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

21 In Biology

22 Page Grouping

23 Resumes Can resumes be grouped into career paths?

24 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

25 The obvious: spam SpamBayes

26 Other email uses

27 Web documents As you add information to Twine, it is automatically tagged so that you and others can find it more easily

28 Where and why? Targeted Advertising; Recommendations; Search Results; Group Discovery; Filtering of Documents; Theme Extraction

29 What is the buzz?

30 Customer Community

31 Where and why? Methods and examples

32 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

33 Bayesian Filtering

34

35

36

37

38 school work algorithm

39 Bayesian Filtering school work algorithm v1agra trades associate
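The word lists on slides 38 and 39 suggest a word-based Bayesian filter. The following is a minimal sketch of a naive Bayes classifier with add-one smoothing. The training messages, the labels "ham" and "spam", and the function names are hypothetical; the slides do not show the actual implementation or training data.

```python
import math
from collections import Counter

def train(examples):
    """Count word occurrences per label from (words, label) pairs."""
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training messages
    for words, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    return word_counts, label_counts

def classify(words, word_counts, label_counts):
    """Return the label with the highest log posterior (add-one smoothing)."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior: share of training messages carrying this label.
        score = math.log(label_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for w in words:
            # Log likelihood of each word given the label.
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages built from the words on the slide.
examples = [
    (["school", "work", "algorithm"], "ham"),
    (["work", "school"], "ham"),
    (["v1agra", "trades", "associate"], "spam"),
    (["v1agra", "trades"], "spam"),
]
word_counts, label_counts = train(examples)
```

Calling classify(["v1agra", "associate"], word_counts, label_counts) with this toy data returns "spam", because those words were only seen in the spam examples.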

40 Craigslist personals

41 Analysis Five Cities W4M Personal Ads

42 Results New York: Mets, Lounges, Offense, Desires, Musical, Submissive, Create, Song, Oral; Boston: Pink, Sox, Poetry, Intellectually, Punk, Appreciation, Exercise, Winter, Education; Chicago: Cubs, Burbs, Bears, Girlie, Insecure, Cheat, Importance, Blunt, Mouth

43 Results Los Angeles: Excellent, Vegas, Meaningful, Star, Lame, Industry, Heat, Fitness, Entertainment, Latino; San Francisco: Tee, Employment, Picnic, STD, Tasting, Hikes, French.com, Kayaking, Cycling

44 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

45 Preference distance Sarah Marshall Leatherheads 3 3 2 3 1 5 2 5

46 Preference distance [plot: both axes run from 1 to 5]

47 [same plot, with distances marked: 1 and 2.23]

48 For recommendations [same plot] Prom Night: 5; Prom Night: 2; ? (distances 1 and 2.23)

49 For recommendations [same plot] Prom Night: 5; Prom Night: 2; 4.1
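Slides 48 and 49 show two known ratings for Prom Night (5 at distance 1, 2 at distance 2.23) and a predicted rating of 4.1. The slides do not state the formula. The sketch below assumes a weighted average with weights of 1/distance, which happens to reproduce that figure; the function name is hypothetical.

```python
def predict_rating(neighbors):
    """Weighted average of neighbor ratings.

    neighbors: list of (distance, rating) pairs, with every distance > 0.
    Assumption: each neighbor is weighted by 1 / distance, so closer
    people count for more.
    """
    weights = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * rating for w, (_, rating) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# Values from the slide: rating 5 at distance 1, rating 2 at distance 2.23.
prediction = predict_rating([(1.0, 5), (2.23, 2)])
print(round(prediction, 1))
```

Other weighting schemes (for example 1 / (1 + distance)) are also common and would give a different number.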

50 Linguistic distance The Six Degrees Hypothesis Experienced It Is When You Travel

51 Linguistic distance "The Six Degrees Hypothesis Experienced It Is When You Travel"; keywords: Six, Degrees, Hypothesis, Experienced, Travel; counts: Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6

52 Linguistic distance
                      china  kids  music  travel  yahoo
   Gothamist              0     3      3       3      0
   GigaOM                 6     0      1       4      2
   Quick Online Tips      0     2      2       0     12
   O'Reilly Radar         1     0      3       6      4

53 Linguistic distance
                      china  kids  music  yahoo
   Gothamist              0     3      3      0
   GigaOM                 6     0      1      2
   Quick Online Tips      0     2      2     12
   Euclidean ("as the crow flies") = 12 (approx)
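As a rough illustration of the Euclidean distance on slide 53, the sketch below treats each blog's word counts as a point and measures the straight-line distance between two of them. The slide does not say which pair of blogs its "12 (approx)" refers to; Gothamist and Quick Online Tips are used here only because their distance comes out near 12.

```python
import math

# Word counts from slide 53, in the order: china, kids, music, yahoo.
counts = {
    "Gothamist": [0, 3, 3, 0],
    "GigaOM": [6, 0, 1, 2],
    "Quick Online Tips": [0, 2, 2, 12],
}

def euclidean(a, b):
    """Straight-line distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d = euclidean(counts["Gothamist"], counts["Quick Online Tips"])
print(round(d, 2))
```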

54 Article/blog similarity Valleywag - Huffington > Slashdot - Wired

55 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

56 Hierarchical Clustering [plot: both axes run from 1 to 5]

57 [same plot]

58 [same plot]

59 [same plot]
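The plots for these slides did not survive in the transcript, so here is a minimal sketch of agglomerative (bottom-up) hierarchical clustering for orientation. The points are hypothetical, and merging on centroid distance is an assumption; other linkage rules (single, complete, average) are equally valid.

```python
import math

def centroid(cluster):
    """Mean position of the 2-D points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(2))

def hierarchical_cluster(points):
    """Repeatedly merge the two clusters whose centroids are closest.

    Returns the merges in the order they happened, as pairs of clusters.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical points on a 1-to-5 grid, like the axes on the slides.
points = [(1, 1), (1.5, 1), (4, 4), (4, 5), (5, 1)]
merges = hierarchical_cluster(points)
```

The sequence of merges is what a dendrogram draws: the earliest merges join the most similar items, and the last merge joins everything into one group.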

60

61 Grouping bloggers

62

63

64 Grouping articles

65

66

67

68 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

69 Decision Trees

70 CART Algorithm
   Brand      Type  Life (hrs)
   Duracell   C     4
   Energizer  C     5
   Duracell   AA    2
   Energizer  AA    2.2
   From any dataset...

71 CART Algorithm
   Brand      Type  Life (hrs)
   Duracell   C     4
   Energizer  C     5
   Duracell   AA    2
   Energizer  AA    2.2
   ... find the best split...
   Type is C?
     Yes: Avg=4.5
     No:  Avg=2.1

72 CART Algorithm
   Brand      Type  Life (hrs)
   Duracell   C     4
   Energizer  C     5
   Duracell   AA    2
   Energizer  AA    2.2
   ... and repeat.
   Type is C?
     Yes: Duracell?
       Yes: 4
       No:  5
     No: Duracell?
       Yes: 2
       No:  2.2
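The "find the best split" step on slides 71 and 72 can be sketched as follows, using the battery table from those slides. The scoring rule (sum of squared errors around each side's mean) is an assumption, since the slides do not name the criterion, and the function names are hypothetical.

```python
rows = [
    {"Brand": "Duracell", "Type": "C", "Life": 4.0},
    {"Brand": "Energizer", "Type": "C", "Life": 5.0},
    {"Brand": "Duracell", "Type": "AA", "Life": 2.0},
    {"Brand": "Energizer", "Type": "AA", "Life": 2.2},
]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared differences from the mean (lower = purer group)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, features, target="Life"):
    """Try every 'feature == value?' question and keep the lowest-error one."""
    best = None
    for feature in features:
        for value in sorted({r[feature] for r in rows}):
            yes = [r[target] for r in rows if r[feature] == value]
            no = [r[target] for r in rows if r[feature] != value]
            if not yes or not no:
                continue  # a question that separates nothing is useless
            score = sse(yes) + sse(no)
            if best is None or score < best[0]:
                best = (score, feature, value, mean(yes), mean(no))
    return best

score, feature, value, yes_avg, no_avg = best_split(rows, ["Brand", "Type"])
print(feature, value, round(yes_avg, 1), round(no_avg, 1))
```

On this table the winning question is about Type, with branch averages of 4.5 and 2.1, matching slide 71. Repeating the same search inside each branch gives the tree on slide 72.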

73 Hot or Not

74

75 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

76 A network A B C D E F

77 PageRank A B C D E F 1.0

78 PageRank A B C D E F 1.0 D = 0.15 + 0.85 * E/1 + 0.85 * F/2 + 0.85 * B/1 = 2.275

79 PageRank A B C D E F 0.58 1.0 2.275 1.0 0.15

80 PageRank A B C D E F 0.58 2.08 1.56 0.3 0.15

81 PageRank A B C D E F 1.03 1.48 1.56 0.3 0.15

82 PageRank A B C D E F 0.78 1.48 1.34 0.3 0.15

83 CI FOO participants

84 Science papers The paper attempts to provide an alternative method for measuring the importance of scientific papers based on the Google's PageRank. The method is a meaningful extension of the common integer counting of citations and is then experimented for bringing PageRank to the citation analysis in a large citation network. It offers a more integrated picture of the publications' influence in a specific field. Bringing PageRank to the citation analysis

85 Clustering coefficient How many of each person's friends are friends with each other?

86 Clustering coefficient A B C D E F Low clustering coefficient

87 Clustering coefficient A B C D E F High clustering coefficient small world graph
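The question on slide 85 translates directly into code: for one person, count the pairs of friends who are also friends with each other, and divide by the number of possible pairs. The friendship graph below is hypothetical, since the edges of the slide's network are not in the transcript.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of a node's friend pairs that are themselves connected."""
    friends = graph[node]
    if len(friends) < 2:
        return 0.0  # fewer than two friends means no pairs to check
    pairs = list(combinations(sorted(friends), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)

# Hypothetical undirected friendship graph (each edge listed both ways).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

print(clustering_coefficient(graph, "A"))  # one of three friend pairs is linked
print(clustering_coefficient(graph, "B"))  # B's two friends know each other
```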

88

89

90

91 Twitter!

92

93 Methods and Examples: Bayesian Filtering; Distance Metrics; Clustering; Decision Trees; Network Analysis; Feature Extraction

94 Independent Features

95 Message boards

96

97 Matrix Factorization
   Features Matrix
         Msg1  M2  M3  M4  M5
   F1       1   0   2   3   0
   F2       0   2   1   1   3
   F3       1   0   2   0   0
   x
   Weight Matrix
               F1  F2  F3
   Gym          0   1   2
   Calorie      2   0   1
   Weigh        2   2   1
   Carbs        1   0   3
   Treadmill    0   1   2
   Current Guess
               Msg1  Msg2  Msg3  Msg4  Msg5
   Gym            1     3     3     0     1
   Calorie        0     2     4     1     3
   Weigh          2     3     1     0     1
   Carbs          0     1     1     0     2
   Treadmill      3     2     0     2     2

98 Matrix Factorization
   Features Matrix
         Msg1  M2  M3  M4  M5
   F1       1   0   2   3   0
   F2       0   2   1   1   3
   F3       1   0   2   0   0
   x
   Weight Matrix
               F1  F2  F3
   Gym          0   1   2
   Calorie      2   0   1
   Weigh        2   2   1
   Carbs        1   0   3
   Treadmill    0   1   2
   Target Result
               Msg1  Msg2  Msg3  Msg4  Msg5
   Gym            2     0     0     3     0
   Calorie        0     2     1     1     3
   Weigh          1     0     2     0     0
   Carbs          0     3     0     0     2
   Treadmill      1     0     0     2     0
   Current Guess
               Msg1  Msg2  Msg3  Msg4  Msg5
   Gym            1     3     3     0     1
   Calorie        0     2     4     1     3
   Weigh          2     3     1     0     1
   Carbs          0     1     1     0     2
   Treadmill      3     2     0     2     2

99 Matrix Factorization
   Features Matrix
         Msg1  M2  M3  M4  M5
   F1       2   0   0   1   0
   F2       0   2   0   1   3
   F3       1   0   1   0   0
   x
   Weight Matrix
               F1  F2  F3
   Gym          1   0   0
   Calorie      0   1   1
   Weigh        0   0   2
   Carbs        0   1   0
   Treadmill    1   0   0
   Target Result
               Msg1  Msg2  Msg3  Msg4  Msg5
   Gym            2     0     0     3     0
   Calorie        0     2     1     1     3
   Weigh          1     0     2     0     0
   Carbs          0     3     0     0     2
   Treadmill      1     0     0     2     0
   Current Guess
               Msg1  Msg2  Msg3  Msg4  Msg5
   Gym            2     0     0     3     0
   Calorie        0     2     1     1     3
   Weigh          1     0     2     0     0
   Carbs          0     3     0     0     2
   Treadmill      1     0     0     2     0
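Slides 97 to 99 show a guess being refined until the product of the weight and features matrices approaches the target. One common way to do this is non-negative matrix factorization with multiplicative updates, sketched below in plain Python. The update rule, the iteration count, the random starting values and all function names are assumptions; the slides do not say which procedure was used. The target is the word-by-message matrix from the slides.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(target, guess):
    """Sum of squared differences between target and current guess."""
    return sum((t - g) ** 2 for tr, gr in zip(target, guess) for t, g in zip(tr, gr))

def factorize(target, n_features, iterations=200, seed=0):
    """Find non-negative weights and features with weights x features ~ target."""
    rng = random.Random(seed)
    rows, cols = len(target), len(target[0])
    weights = [[rng.random() for _ in range(n_features)] for _ in range(rows)]
    features = [[rng.random() for _ in range(cols)] for _ in range(n_features)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iterations):
        # features <- features * (W^T V) / (W^T W H), element by element
        wt = transpose(weights)
        num = matmul(wt, target)
        den = matmul(matmul(wt, weights), features)
        features = [[f * n / (d + eps) for f, n, d in zip(fr, nr, dr)]
                    for fr, nr, dr in zip(features, num, den)]
        # weights <- weights * (V H^T) / (W H H^T), element by element
        ft = transpose(features)
        num = matmul(target, ft)
        den = matmul(weights, matmul(features, ft))
        weights = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
                   for wr, nr, dr in zip(weights, num, den)]
    return weights, features

# Target Result from the slides: rows are words, columns are messages.
target = [
    [2, 0, 0, 3, 0],  # Gym
    [0, 2, 1, 1, 3],  # Calorie
    [1, 0, 2, 0, 0],  # Weigh
    [0, 3, 0, 0, 2],  # Carbs
    [1, 0, 0, 2, 0],  # Treadmill
]
weights, features = factorize(target, n_features=3)
print(cost(target, matmul(weights, features)))
```

The printed cost is the remaining gap between the current guess and the target; it shrinks as the iterations proceed, although three features may not reproduce a 5 x 5 target exactly.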

100 Interpreting Features
   Features Matrix
         Msg1  M2  M3  M4  M5
   F1       2   0   0   1   0
   F2       0   2   0   1   3
   F3       1   0   1   0   0
   Weight Matrix
               F1  F2  F3
   Gym          1   0   0
   Calorie      0   1   1
   Weigh        0   0   2
   Carbs        0   1   0
   Treadmill    1   0   0
   Theme 1: Gym, Treadmill
   Theme 2: Calorie, Carbs
   Theme 3: Weigh, Calorie
   Msg1: Theme 1; Msg2: Theme 2; Msg3: Theme 3; etc.
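Reading themes off the weight matrix on slide 100 amounts to listing, for each feature, the words with the largest weights. A minimal sketch, using the weight matrix from the slide; the function name and the choice of two words per theme are assumptions.

```python
words = ["Gym", "Calorie", "Weigh", "Carbs", "Treadmill"]

# Weight Matrix from slide 100: one row per word, one column per feature.
weights = [
    [1, 0, 0],  # Gym
    [0, 1, 1],  # Calorie
    [0, 0, 2],  # Weigh
    [0, 1, 0],  # Carbs
    [1, 0, 0],  # Treadmill
]

def top_words(weights, words, feature, n=2):
    """Return the n words with the largest weight for one feature."""
    ranked = sorted(zip(words, weights), key=lambda pair: -pair[1][feature])
    return [word for word, _ in ranked[:n]]

themes = [top_words(weights, words, f) for f in range(3)]
for number, theme in enumerate(themes, start=1):
    print(f"Theme {number}: {', '.join(theme)}")
```

The same idea applied to the rows of the features matrix tells you which messages belong most strongly to each theme.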

101 Diet & Body themes Atkins Induction South Beach Carbs Chocolate Black Coffee Olive Broccoli Gym Weights Exercise Running Injured Cook Recipe Fried Home Money Organic Want Best Calories Weight Fats Protein Cholesterol

102 Wikipedia people: she, her, after, when, father, women; series, television, show, which, radio, bbc; league, major, baseball, season, played, with; olympics, competed, won, summer, medal, athlete; university, professor, received, science, research, born

103 We're just getting started...

104 Homepage http://kiwitobes.com Freebase http://freebase.com

105 Questions?

