Series-O-Rama: Search & Recommend TV series with SQL. Guillaume Cabanac, March 27th, 2012.


1 Series-O-Rama: Search & Recommend TV series with SQL
http://bit.ly/series-o-rama2012
Guillaume Cabanac, guillaume.cabanac@univ-tlse3.fr
March 27th, 2012

2 Toulouse: A Picture is Worth a Thousand Words
Toulouse: population 437 000, students 97 000
Aberdeen: population 210 400, students ?? ???
Capbreton: 3h ride
Collioure: 2h30 ride
Ax-les-Thermes: 1h40 ride

3 Telly Addicts Need Help to Find TV Series
Main topics of Grey's Anatomy? -> Text mining, visualization
Series about "plane crash island"? -> Search engine
What should I watch next? -> Recommender system
Sources shown on the slide: en.wikipedia.org, amazon.com

4 Text Mining: Let's Crunch Subtitles
Main topics of Grey's Anatomy? -> Text mining, visualization
Series about "plane crash island"? -> Search engine
What should I watch next? -> Recommender system
Series shown: Cold Case, Grey's Anatomy

5 What's in a Subtitle File?
File name: Title – Season – Episode – Language.srt
1 episode = 1 plain text file
Each entry: synchronization (start --> stop), then dialogue
We can easily extract words:
[a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
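
The word extraction described on this slide can be sketched in a few lines of Python. This is only an illustration, not the talk's Java code: the sample subtitle text, the `TIMING` pattern and the `extract_words` helper are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical subtitle excerpt in SRT layout: cue number, timing line, dialogue.
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I'm hungry. I love Cuban food.

2
00:00:04,000 --> 00:00:06,000
Pork sandwiches are my favorite!
"""

# Matches a synchronization line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return a Counter of lowercase words found in the dialogue lines."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and synchronization lines.
        # (A dialogue line made only of digits would be skipped too.)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        counts.update(re.findall(r"[a-z]+", line.lower()))
    return counts

words = extract_words(SAMPLE_SRT)
```

Splitting on letters only turns "I'm" into the two tokens "i" and "m", which matches the isolated "m" and "s" entries in the word list above.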

6 DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle

7 DB technology at Work! [Search engine]
Ranked list of results

8 DB technology at Work! [Infos]
Most popular terms
Most related series

9 DB technology at Work! [Recommendations]

10 DB technology at Work! [Recommendations]
I liked / I disliked
What should I watch next?

11 DB technology at Work! [Recommendations]
Ranked list of recommendations

12 How Does this Work?

13 Architecture and Data Model
Offline: subtitles, indexing
Online: GUI for searching, browsing, recommending
Storage: DB
Series = {idS, name}
  12 | Lost
  45 | Dexter
  45 | ????
Dict = {idT, term}
  8  | plane
  27 | killer
  29 | crash
Posting = {idT*, idS*, nb}
  27 | 45 | 89
  8  | 45 | 3
  8  | 12 | 90
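
One possible way to express this data model in SQL is sketched below. It runs on SQLite through Python's sqlite3 module as a stand-in for the Oracle database used in the talk. The column types, the constraints and the reading of the asterisks as foreign keys are assumptions, and the `45 | ????` row is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Oracle here
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL REFERENCES Series(idS),
    nb  INTEGER NOT NULL,          -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
INSERT INTO Series  VALUES (12, 'Lost'), (45, 'Dexter');
INSERT INTO Dict    VALUES (8, 'plane'), (27, 'killer'), (29, 'crash');
INSERT INTO Posting VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);
""")

# Resolve the identifiers back to names: which term occurs how often where?
rows = conn.execute("""
    SELECT s.name, d.term, p.nb
    FROM Posting p
    JOIN Series s ON s.idS = p.idS
    JOIN Dict   d ON d.idT = p.idT
    ORDER BY p.nb DESC
""").fetchall()
```

Read this way, the sample rows say that "plane" occurs 90 times in Lost, while "killer" occurs 89 times and "plane" 3 times in Dexter.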

14 Theory: Text Indexing Pipeline
Sample text shown on the slide: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon's Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon's College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
1. Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
2. Stopword removal: [plane, crashed, ..., planes, ...]
3. Stemming with Porter's Stemmer (1980), demo at http://qaa.ath.cx/porter_js_demo.html: [plane, crash, ..., plane, ...]
4. Counting: {(plane, 48), (crash, 15), ...}
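
The four pipeline steps can be sketched as follows. Note the assumptions: the stopword list is a tiny illustrative one, and `toy_stem` is a crude suffix stripper used in place of Porter's stemmer, which is considerably more elaborate.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def toy_stem(word):
    """Crude suffix stripping: a stand-in for Porter's stemmer, not an implementation of it."""
    for suffix in ("ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_text(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # 1. tokenization + lowercase
    content = [t for t in tokens if t not in STOPWORDS]   # 2. stopword removal
    stems = [toy_stem(t) for t in content]                # 3. stemming
    return Counter(stems)                                 # 4. counting

counts = index_text("The plane crashed. The planes... the plane is")
```

On this short input, "plane" and "planes" are merged into one entry and "crashed" is reduced to "crash", mirroring the slide's example.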

15 Theory: Similarity of Paired Series
Dice's Coefficient (1945), based on set theory
Example: let us model a series as a set of terms
House = {hospital, doctor, crazy, psycho}
Grey's = {doctor, care, hospital}
A big limitation: the distribution of terms among series is ignored. It makes no difference whether a term occurs 1 time or 1,000,000 times.
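
As a worked example, Dice's coefficient in its usual set form, 2|A ∩ B| / (|A| + |B|), can be applied to the two sets above. The formula itself did not survive in this transcript, so the standard definition is assumed here, and the `dice` function name is chosen for illustration.

```python
def dice(a, b):
    """Dice's coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))

house = {"hospital", "doctor", "crazy", "psycho"}
greys = {"doctor", "care", "hospital"}

# Two shared terms (hospital, doctor): 2 * 2 / (4 + 3) = 4/7
score = dice(house, greys)
```

Because the inputs are sets, a term said once and a term said a million times contribute the same, which is exactly the limitation noted on the slide.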

16 Theory: Vector Space Model, Term Weighting
[Figure: term vectors over the vocabulary, with the term "survive" and each series' max highlighted]
Raw TF: dexter > lost
Normalization, TF / max(TF): dexter < lost

17 Theory: Best Match Retrieval
1 TV series = 1 vector
[Figure: a vector of term counts with n components]
Now, we know how to:
Find the most popular terms for a TV series
Compute similarity between TV series
Find TV series matching a query

18 Theory: More on Term Weighting
1 TV series = 1 vector
[Figure: a vector of term counts with n components]
All terms are supposed to be equally representative …
… but "survive" is way more unusual than "people"
"survive" better represents Lost than "people" does
IDF: Inverse Document Frequency

19 Theory: The Big Picture: TF*IDF
1 TV series = 1 vector
TF: the term is frequent in S. IDF: the term is globally unusual.
An important term for series S is frequent in S and globally unusual.
Some limitations:
Term positions? e.g., "ice truck killer" in Dexter
Stemming? e.g., christmas
Mixture of languages? e.g., amusant (FR) vs. fun (EN)
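
A minimal sketch of TF*IDF weighting follows, under stated assumptions. The term counts are invented for illustration, TF is normalized by the most frequent term of the series, and IDF is taken as the natural logarithm of (number of series / number of series containing the term). The transcript does not show which IDF variant the talk used.

```python
import math

# Hypothetical term counts per series (not taken from the slides).
counts = {
    "Lost":   {"people": 400, "survive": 120, "plane": 90},
    "Dexter": {"people": 300, "killer": 89, "plane": 3},
    "House":  {"people": 350, "doctor": 200},
}

def tf(series, term):
    """Term frequency normalized by the most frequent term of the series."""
    terms = counts[series]
    return terms.get(term, 0) / max(terms.values())

def idf(term):
    """Assumed variant: ln(number of series / number of series containing the term)."""
    df = sum(1 for terms in counts.values() if term in terms)
    if df == 0:
        return 0.0  # unknown term: no weight
    return math.log(len(counts) / df)

def tf_idf(series, term):
    return tf(series, term) * idf(term)

lost_people = tf_idf("Lost", "people")    # frequent in Lost, but present everywhere
lost_survive = tf_idf("Lost", "survive")  # less frequent, but globally unusual
```

With these made-up counts, "people" appears in every series, so its IDF is zero and "survive" ends up as the more representative term for Lost.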

20 Theory … and Practice
Series = {idS, name, maxNb}
  12 | Lost   | 540
  45 | Dexter | 125
Dict = {idT, term, idf}
  8  | plane  | 1.25
  27 | killer | 2.87
  29 | crash  | 3.07
Posting = {idT*, idS*, nb, tf}
  27 | 45 | 89 | 0.71
  8  | 45 | 3  | 0.02
  8  | 12 | 90 | 0.16
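
The `tf` column appears to be derived as nb / maxNb, and the sample rows are consistent with that reading. A possible SQL formulation is sketched below on SQLite (a stand-in for Oracle). The column types are assumptions, and maxNb is inserted as given on the slide rather than recomputed, since the full Posting table is not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Oracle here
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                      PRIMARY KEY (idT, idS));
INSERT INTO Series  VALUES (12, 'Lost', 540), (45, 'Dexter', 125);
INSERT INTO Posting (idT, idS, nb) VALUES (27, 45, 89), (8, 45, 3), (8, 12, 90);

-- tf = nb / maxNb; the 1.0 factor forces real-number division.
UPDATE Posting
SET tf = 1.0 * nb / (SELECT maxNb FROM Series WHERE Series.idS = Posting.idS);
""")

# Collect the computed weights keyed by (idT, idS).
tf = {(idT, idS): value
      for idT, idS, value in conn.execute("SELECT idT, idS, tf FROM Posting")}
```

The computed values are 0.712, 0.024 and about 0.167, against 0.71, 0.02 and 0.16 on the slide, so the slide values look truncated to two decimals.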

21 Description of a TV Series
Example: Lost
Many surnames need to be filtered out

22 Retrieval of TV Series: queries with 1 term
Query: survive
Importance of normalization:
Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
Blade: nb/maxNb = 9/163 = 0.05521
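
The effect of normalization can be checked with the slide's own numbers. The variable names below are illustrative only.

```python
# (nb, maxNb) for the query term "survive", as given on the slide.
postings = {"Stargate Atlantis": (63, 1116), "Blade": (9, 163)}

raw = {name: nb for name, (nb, max_nb) in postings.items()}
normalized = {name: nb / max_nb for name, (nb, max_nb) in postings.items()}

# How far ahead is Stargate Atlantis under each weighting?
raw_ratio = raw["Stargate Atlantis"] / raw["Blade"]
norm_ratio = normalized["Stargate Atlantis"] / normalized["Blade"]
```

On raw counts Stargate Atlantis leads by a factor of 7, while after dividing by maxNb the two series are nearly tied, with a ratio of roughly 1.02.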

23 Retrieval of TV Series: queries with n terms
Query: survive mulder
67 | The Vampire Diaries
  survive | tf 0.028 | idf 0.107 | 0.028 * 0.107 = 0.003
  mulder  | tf 0.007 | idf 3.977 | 0.007 * 3.977 = 0.028
  score = 0.003 + 0.028 = 0.031
18 | X-Files
  survive | tf 0.014 | idf 0.107 | 0.014 * 0.107 = 0.001
  mulder  | tf 1.000 | idf 3.977 | 1.000 * 3.977 = 3.977
  score = 0.001 + 3.977 = 3.978
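
The scoring shown here, a sum of tf * idf over the query terms, can be written as a single SQL aggregate. The sketch below uses SQLite as a stand-in for Oracle and loads only the four postings from the slide. The `idT` values 1 and 2 are made up, since the slide does not show them, and the query text is an assumption about how such a ranking could be expressed, not the talk's actual SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for Oracle here
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL, PRIMARY KEY (idT, idS));
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
-- idT values 1 and 2 are made up; the slide does not show them.
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")

query_terms = ["survive", "mulder"]
placeholders = ", ".join("?" for _ in query_terms)  # one bind variable per term

# Score of a series = sum of tf * idf over the query terms it contains.
ranking = conn.execute(f"""
    SELECT s.name, SUM(p.tf * d.idf) AS score
    FROM Posting p
    JOIN Dict   d ON d.idT = p.idT
    JOIN Series s ON s.idS = p.idS
    WHERE d.term IN ({placeholders})
    GROUP BY s.idS, s.name
    ORDER BY score DESC
""", query_terms).fetchall()
```

X-Files comes out far ahead because "mulder" has both a high tf in that series and a high idf overall, which reproduces the ranking on the slide.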

24 Similar to House? Computing Similarities Among TV Series 1/2
First, let's compute the numerator, where:
A_i = terms from House
B_i = terms from another TV series

25 Similar to House? Computing Similarities Among TV Series 2/2
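
The similarity formula itself is missing from this transcript. The A_i / B_i notation and the "numerator first" approach suggest cosine similarity between term-weight vectors, so the sketch below assumes that measure. The weights and the `cosine` function are hypothetical and only illustrate the computation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    # Numerator: sum of A_i * B_i over the terms present in both vectors.
    numerator = sum(weight * b[term] for term, weight in a.items() if term in b)
    # Denominator: product of the two vector lengths.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Hypothetical TF*IDF weights, for illustration only.
house = {"hospital": 0.9, "doctor": 0.8, "psycho": 0.1}
greys = {"hospital": 0.7, "doctor": 0.9, "care": 0.3}
dexter = {"killer": 0.9, "blood": 0.6}

sim_greys = cosine(house, greys)
sim_dexter = cosine(house, dexter)
```

Only terms shared by both series contribute to the numerator, so two series with no term in common get a similarity of 0.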

26 Thank you http://www.irit.fr/~Guillaume.Cabanac

