Series-O-Rama: Search & Recommend TV series with SQL. Guillaume Cabanac, February 15th, 2011.


1 Series-O-Rama: Search & Recommend TV series with SQL
http://bit.ly/dMh7kb
Guillaume Cabanac, cabanac@irit.fr
February 15th, 2011

2 Toulouse: A Picture is Worth a Thousand Words
Toulouse: population 437 000, students 97 000
Aberdeen: population 210 400, students ?? ???
Map labels: Capbreton (3h ride), Collioure (2h30 ride), Ax-les-Thermes (1h40 ride)

3 Telly Addicts Need Help to Find TV Series
Main topics of Grey’s Anatomy? → Text mining, visualization
Series about ‘plane crash island’? → Search engine
What should I watch next? → Recommender system
Screenshots: en.wikipedia.org, amazon.com

4 Text Mining: Let’s Crunch Subtitles
Main topics of Grey’s Anatomy? → Text mining, visualization
Series about ‘plane crash island’? → Search engine
What should I watch next? → Recommender system
Examples shown: Cold Case, Grey’s Anatomy

5 What’s in a Subtitle File?
File name: Title – Season – Episode – Language.srt → 1 episode = 1 plain text file
Synchronization: start --> stop
Dialogue: we can easily extract words
[a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
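As a rough illustration of the word extraction described on this slide, here is a minimal Python sketch. The sample subtitle text, the `TIMING` pattern, and the `extract_words` helper are all assumptions made for this example; the slides state that the original tool is written in Java, and its actual parsing code is not shown.

```python
import re
from collections import Counter

# Hypothetical .srt excerpt, made up for illustration (not taken from the slides).
SAMPLE_SRT = """\
1
00:00:01,000 --> 00:00:03,500
I love Miami at night.

2
00:00:04,000 --> 00:00:06,000
<i>Something</i> is going to happen tonight.
"""

# Matches the "start --> stop" synchronization line of a cue.
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def extract_words(srt_text):
    """Return the lowercase words found in the dialogue lines of an .srt text."""
    words = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue counters and timing lines.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        line = re.sub(r"<[^>]+>", " ", line)  # drop markup such as <i>...</i>
        words.extend(re.findall(r"[a-z]+", line.lower()))
    return words


counts = Counter(extract_words(SAMPLE_SRT))
```

Splitting on anything that is not a letter turns "I'm" into `i` and `m`, which would explain isolated tokens such as `m` and `s` in the word list above.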

6 DB technology at Work! [Home]
7 527 files = 337 MB
100% Java and Oracle

7 DB technology at Work! [Search engine]
Ranked list of results

8 DB technology at Work! [Infos]
Most popular terms
Most related series

9 DB technology at Work! [Recommendations]

10 DB technology at Work! [Recommendations]
I liked
I disliked
What should I watch next?

11 DB technology at Work! [Recommendations]
Ranked list of recommendations

12 How Does this Work?

13 Architecture and Data Model
Offline: subtitles → indexing → DB
Online: DB → searching, browsing, recommending → GUI
Series = {idS, name}
  12  Lost
  45  Dexter
  45  ????
Dict = {idT, term}
  8   plane
  27  killer
  29  crash
Posting = {idT*, idS*, nb}
  27  45  89
  8   45  3
  8   12  90
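The three relations can be sketched as tables. The block below uses Python's built-in `sqlite3` module as a stand-in for Oracle, and the column types, the primary keys, and the reading of `*` as a foreign key are assumptions; only the relation names, the column names, and the sample rows come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL REFERENCES Series(idS),
    nb  INTEGER NOT NULL,            -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
""")

# Sample rows from the slide.
conn.executemany("INSERT INTO Series VALUES (?, ?)", [(12, "Lost"), (45, "Dexter")])
conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                 [(8, "plane"), (27, "killer"), (29, "crash")])
conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                 [(27, 45, 89), (8, 45, 3), (8, 12, 90)])

# A second series with idS = 45 violates the primary key and is rejected.
try:
    conn.execute("INSERT INTO Series VALUES (45, '????')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Joining the three tables turns the numeric postings back into readable rows.
rows = conn.execute("""
    SELECT s.name, d.term, p.nb
    FROM Posting p
    JOIN Series s ON s.idS = p.idS
    JOIN Dict d   ON d.idT = p.idT
    ORDER BY p.nb DESC
""").fetchall()
```

Read this way, the postings say that "plane" occurs 90 times in Lost and 3 times in Dexter, while "killer" occurs 89 times in Dexter.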

14 Theory → Text Indexing Pipeline
Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon’s Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon’s College. This school also began to hold day and evening classes for boys girls and adults in primary secondary mechanical and other subjects …
Tokenization + lowercase → [the, plane, crashed, ..., planes, ..., is]
Stopwords removal → [plane, crashed, ..., planes, ...]
Stemming with Porter’s Stemmer (1980), http://qaa.ath.cx/porter_js_demo.html → [plane, crash, ..., plane, ...]
Counting → {(plane, 48), (crash, 15), ...}
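The four pipeline steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the stopword list is a tiny made-up sample, and `toy_stem` is a crude suffix stripper standing in for Porter's stemmer, not an implementation of it.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list is much longer.
STOPWORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "on"}


def toy_stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemmer, NOT the real algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index(text):
    """Tokenize + lowercase, remove stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    stems = [toy_stem(t) for t in kept]
    return Counter(stems)


counts = index("The plane crashed on the island. The planes!")
```

Here "plane" and "planes" collapse into the same stem and "crashed" becomes "crash", mirroring the example lists on the slide.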

15 Theory → Vector Space Model, Term Weighting
Figure labels: Vocabulary, max, survive ?
Raw TF: dexter > lost
→ Normalization: TF / max(TF)
Normalized TF: dexter < lost

16 Theory → Best Match Retrieval
1 TV series = 1 vector
[Figure: example term vector]
Now, we know how to:
→ Find most popular terms for a TV series
→ Compute similarity between TV series
→ Find TV series matching a query

17 Theory → More on Term Weighting
1 TV series = 1 vector
[Figure: example term vector]
All terms are supposed to be equally representative … but ‘survive’ is way more unusual than ‘people’
→ ‘survive’ better represents Lost than ‘people’ does
IDF: Inverse Document Frequency

18 Theory → The Big Picture: TF*IDF
1 TV series = 1 vector
An important term for series S is frequent in S and globally unusual.
Some limitations:
Term positions? e.g., “ice truck killer” in Dexter
Stemming? e.g., ananas, christmas
Mixture of languages? e.g., amusant (FR) vs. fun (EN)
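The weighting summarized on this slide can be written out as follows. The normalized TF matches the nb/maxNb ratio used elsewhere in the deck; the IDF formula itself is not in the transcript, so the logarithmic form below is the common textbook definition and should be read as an assumption (the author's exact variant and log base are unknown). Here N is the number of TV series and df(t) the number of series whose subtitles contain term t.

```latex
\[
\mathrm{tf}(t, S) = \frac{nb(t, S)}{\max_{t'} nb(t', S)},
\qquad
\mathrm{idf}(t) = \log \frac{N}{df(t)},
\qquad
w(t, S) = \mathrm{tf}(t, S) \cdot \mathrm{idf}(t)
\]
```

A term gets a high weight w(t, S) only when both factors are high: it is frequent in S (high tf) and appears in few series overall (high idf).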

19 Theory … and Practice
Series = {idS, name, maxNb}
  12  Lost    540
  45  Dexter  125
Dict = {idT, term, idf}
  8   plane   1.25
  27  killer  2.87
  29  crash   3.07
Posting = {idT*, idS*, nb, tf}
  27  45  89  0.71
  8   45  3   0.02
  8   12  90  0.16

20 Description of a TV Series
Example: Lost ⋈
→ Many surnames need to be filtered out

21 Retrieval of TV Series → queries with 1 term
Query: survive ⋈
Importance of normalization:
Stargate Atlantis: nb/maxNb = 63/1116 = 0.05645
Blade: nb/maxNb = 9/163 = 0.05521
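The effect of normalization on these two series can be checked with a short Python sketch. The figures come from the slide; the `tf` helper and the variable names are assumptions for illustration.

```python
def tf(nb, max_nb):
    """Normalized term frequency: occurrences divided by the series' largest count."""
    return nb / max_nb


# (series, nb, maxNb) for the query term "survive", as given on the slide.
rows = [("Stargate Atlantis", 63, 1116), ("Blade", 9, 163)]

# Rank by normalized frequency rather than by raw count.
ranked = sorted(rows, key=lambda r: tf(r[1], r[2]), reverse=True)

for name, nb, max_nb in ranked:
    print(f"{name}: nb={nb}, tf={tf(nb, max_nb):.5f}")
```

The raw counts differ by a factor of seven (63 vs. 9), yet the normalized values are almost equal, so the two series end up next to each other in the ranking.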

22 Retrieval of TV Series → queries with n terms
Query: survive mulder ⋈
67 | The Vampire Diaries
  survive | 0.028 | 0.107 → 0.028 * 0.107 = 0.003
  mulder  | 0.007 | 3.977 → 0.007 * 3.977 = 0.028
  score = 0.003 + 0.028 = 0.031
18 | X-Files
  survive | 0.014 | 0.107 → 0.014 * 0.107 = 0.001
  mulder  | 1.000 | 3.977 → 1.000 * 3.977 = 3.977
  score = 0.001 + 3.977 = 3.978
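A multi-term query of this kind can be expressed as a join plus a `SUM(tf * idf)` per series. The sketch below is an assumption about how such a query might look, not the author's SQL, which is not in the transcript. It uses `sqlite3` instead of Oracle, leaves out the `nb` and `maxNb` columns, and the `idT` values 1 and 2 and the `search` helper are hypothetical; the tf and idf values are the ones on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
CREATE TABLE Posting (idT INTEGER, idS INTEGER, tf REAL, PRIMARY KEY (idT, idS));
INSERT INTO Series  VALUES (67, 'The Vampire Diaries'), (18, 'X-Files');
INSERT INTO Dict    VALUES (1, 'survive', 0.107), (2, 'mulder', 3.977);
INSERT INTO Posting VALUES (1, 67, 0.028), (2, 67, 0.007),
                           (1, 18, 0.014), (2, 18, 1.000);
""")


def search(conn, terms):
    """Rank series by the sum of tf * idf over the query terms they contain."""
    placeholders = ", ".join("?" for _ in terms)
    sql = f"""
        SELECT s.name, SUM(p.tf * d.idf) AS score
        FROM Posting p
        JOIN Dict d   ON d.idT = p.idT
        JOIN Series s ON s.idS = p.idS
        WHERE d.term IN ({placeholders})
        GROUP BY s.idS, s.name
        ORDER BY score DESC
    """
    return [(name, round(score, 3)) for name, score in conn.execute(sql, terms)]


results = search(conn, ["survive", "mulder"])
```

With these values X-Files scores 3.978 and The Vampire Diaries 0.031, matching the totals on the slide: the rare term "mulder" (idf 3.977) dominates the common term "survive" (idf 0.107).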

23 Similar to House? Computing Similarities Among TV Series 1/2
First, let’s compute the numerator, where:
A_i = terms from House
B_i = terms from another TV series
⋈

24 Similar to House? Computing Similarities Among TV Series 2/2
⋈ ⋈ ⋈
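The similarity formula itself did not survive in the transcript; only the numerator over A_i and B_i is mentioned. Assuming the measure is cosine similarity, which is the usual choice in the vector space model, a minimal Python sketch looks like this. The weights below are made up for illustration and the `cosine` helper is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight vectors."""
    # Numerator: sum of A_i * B_i over the terms the two series share
    # (the counterpart of a join on the term identifier).
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)


# Made-up weights, for illustration only.
house = {"patient": 3.0, "diagnosis": 4.0}
other_medical = {"patient": 6.0, "diagnosis": 8.0}
crime_show = {"killer": 5.0}

sim_medical = cosine(house, other_medical)  # same direction: similarity 1.0
sim_crime = cosine(house, crime_show)       # no shared term: similarity 0.0
```

Only shared terms contribute to the numerator, so two series with no vocabulary in common get a similarity of 0 whatever their individual weights are.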

25 Thank you http://www.irit.fr/~Guillaume.Cabanac

