Series-O-Rama Search & Recommend TV series with SQL Guillaume Cabanac February 15th, 2011.


1 Series-O-Rama Search & Recommend TV series with SQL Guillaume Cabanac February 15th, 2011

2 Toulouse: A Picture is Worth a Thousand Words
Toulouse: population, students. Aberdeen: population, students (?? ???).
Capbreton: 3h ride. Collioure: 2h30 ride. Ax-les-Thermes: 1h40 ride.

3 Telly Addicts Need Help to Find TV Series
Main topics of Grey's Anatomy? → Text mining, visualization
Series about 'plane crash island'? → Search engine
What should I watch next? → Recommender system
(en.wikipedia.org, amazon.com)

4 Text Mining: Let's Crunch Subtitles
Main topics of Grey's Anatomy? → Text mining, visualization
Series about 'plane crash island'? → Search engine
What should I watch next? → Recommender system
Examples: Cold Case, Grey's Anatomy

5 What's in a Subtitle File?
File name: Title – Season – Episode – Language.srt → 1 episode = 1 plain text file
Synchronization → start --> stop
Dialogue
We can easily extract words: [a, again*2, and, but, com, cuban, different, favorite, food, for*2, forum, going, great, happen*2, has, hungry, i*2, is, it, love, m, my, nice, night*2, miami, now, pork, s*2, sandwiches, something, the, to*2, tonight, town, www]
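The word extraction on this slide can be sketched in a few lines of Python. This is only an illustration, not the talk's Java code: the sample cues in SAMPLE_SRT are made up, and extract_words is a hypothetical helper that skips cue numbers and "start --> stop" lines, then keeps lowercase alphabetic runs (which is why fragments such as "m" and "s" appear in the word list).

```python
import re

# Hypothetical SRT excerpt; a real file holds one whole episode.
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,500
I love Miami.

2
00:00:04,000 --> 00:00:06,000
I'm hungry for Cuban pork sandwiches!
"""

# Matches the "start --> stop" synchronization line of a cue.
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def extract_words(srt_text):
    """Return the lowercase words of the dialogue lines, in reading order."""
    words = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers, and timing lines.
        # (A dialogue line made only of digits would be skipped too.)
        if not line or line.isdigit() or TIMING.match(line):
            continue
        words.extend(re.findall(r"[a-z]+", line.lower()))
    return words
```

Calling extract_words(SAMPLE_SRT) yields the dialogue words only; the apostrophe in "I'm" splits it into "i" and "m".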

6 DB technology at Work! [Home] files = 337 MB. 100% Java and Oracle.

7 DB technology at Work! [Search engine] Ranked list of results

8 DB technology at Work! [Infos] Most popular terms. Most related series.

9 DB technology at Work! [Recommendations]

10 DB technology at Work! [Recommendations] I liked / I disliked. What should I watch next?

11 DB technology at Work! [Recommendations] Ranked list of recommendations

12 How Does this Work?

13 Architecture and Data Model
Offline: subtitles → indexing → DB
Online: DB → searching, browsing, recommending → GUI
Series = {idS, name}: (12, Lost), (45, Dexter), …
Dict = {idT, term}: (8, plane), (27, killer), (29, crash)
Posting = {idT*, idS*, nb}
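As a rough sketch of this data model, the three relations can be declared as SQL tables. The talk uses Oracle; SQLite (through Python's sqlite3 module) is swapped in here only so the example runs anywhere. The column types, the constraints, and the nb values in Posting are assumptions made for illustration; the Series and Dict rows are the ones shown on the slide.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT NOT NULL UNIQUE);
CREATE TABLE Posting (
    idT INTEGER NOT NULL REFERENCES Dict(idT),
    idS INTEGER NOT NULL REFERENCES Series(idS),
    nb  INTEGER NOT NULL,          -- occurrences of the term in the series
    PRIMARY KEY (idT, idS)
);
"""

def build_db():
    """Create an in-memory database holding the slide's sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Series VALUES (?, ?)",
                     [(12, "Lost"), (45, "Dexter")])
    conn.executemany("INSERT INTO Dict VALUES (?, ?)",
                     [(8, "plane"), (27, "killer"), (29, "crash")])
    # Posting counts are invented for this example.
    conn.executemany("INSERT INTO Posting VALUES (?, ?, ?)",
                     [(8, 12, 48), (29, 12, 15), (27, 45, 30)])
    return conn
```

Posting is the link table: one row per (term, series) pair, so every later query is a join between Series, Posting, and Dict.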

14 Theory → Text Indexing Pipeline
Sample text: In 1720 Robert Gordon retired to Aberdeen having amassed a considerable fortune in Poland. On his death 11 years later he willed his entire estate to build a residential school for educating young boys. In the summer of 1750 the Robert Gordon's Hospital was born. In 1881 this was converted into a day school to be known as Robert Gordon's College. This school also began to hold day and evening classes for boys, girls and adults in primary, secondary, mechanical and other subjects …
Tokenization + lowercase: [the, plane, crashed, ..., planes, ..., is]
Stopwords removal: [plane, crashed, ..., planes, ...]
Stemming with Porter's Stemmer (1980): [plane, crash, ..., plane, ...]
Counting: {(plane, 48), (crash, 15), ...}
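The four pipeline stages can be sketched as follows. Two parts are stand-ins and should be read as assumptions: STOPWORDS is a tiny illustrative list, and crude_stem only strips a few suffixes, so it is not Porter's stemmer (a real implementation handles many more cases).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one is much longer.
STOPWORDS = {"the", "is", "a", "and", "to", "of", "in"}

def crude_stem(word):
    """Stand-in for Porter's stemmer: strips a few common suffixes only."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index(text):
    """Tokenize + lowercase, drop stopwords, stem, then count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return Counter(crude_stem(t) for t in kept)
```

For example, index("The plane crashed. The planes!") counts "plane" twice and "crash" once, mirroring the slide's progression from [the, plane, crashed, planes] to stemmed counts.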

15 Theory → Vector Space Model, Term Weighting
Each TV series is a vector over the vocabulary.
Raw TF for 'survive': dexter > lost
Normalization → TF / max(TF): dexter < lost
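A minimal sketch of the TF / max(TF) normalization, with made-up counts: the two dictionaries below are hypothetical series, chosen so that the ordering on one term flips once each count is divided by that series' largest count.

```python
def normalized_tf(counts):
    """Map {term: nb} to {term: nb / max nb} for a single series."""
    if not counts:
        return {}
    max_nb = max(counts.values())
    return {term: nb / max_nb for term, nb in counts.items()}

# Invented counts: series_a has far more dialogue than series_b.
series_a = {"survive": 60, "people": 600}
series_b = {"survive": 40, "people": 100}

tf_a = normalized_tf(series_a)
tf_b = normalized_tf(series_b)
```

With raw counts, 'survive' looks more prominent in series_a (60 > 40); after normalization it is more prominent in series_b, because series_a's count is small relative to its own most frequent term.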

16 Theory → Best Match Retrieval
1 TV series = 1 vector (n dimensions)
Now, we know how to:
→ Find most popular terms for a TV series
→ Compute similarity between TV series
→ Find TV series matching a query

17 Theory → More on Term Weighting
1 TV series = 1 vector (n dimensions)
→ All terms are supposed to be equally representative
… but 'survive' is way more unusual than 'people'
→ 'survive' better represents Lost than 'people' does
IDF: Inverse Document Frequency
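The slide names IDF without giving a formula. The sketch below assumes the common textbook form idf(t) = ln(N / df(t)), where N is the number of series and df(t) the number of series containing t; the logarithm base and any smoothing used in the talk are not known. The term sets in corpus are invented.

```python
import math

def idf(series_terms):
    """series_terms: {series name: set of terms}. Returns {term: ln(N / df)}."""
    n = len(series_terms)
    df = {}
    for terms in series_terms.values():
        for term in terms:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}

# Invented term sets: 'people' occurs everywhere, 'survive' in one series.
corpus = {
    "Lost": {"survive", "people"},
    "Dexter": {"people", "killer"},
    "House": {"people"},
    "Cold Case": {"people"},
}
weights = idf(corpus)
```

A term found in every series gets an IDF of 0, while a term found in a single series gets the maximum, ln(N), which matches the slide's point that 'survive' says more about Lost than 'people' does.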

18 Theory → The Big Picture: TF*IDF
1 TV series = 1 vector
TF: the term is frequent in S. IDF: the term is globally unusual.
An important term for series S is frequent in S and globally unusual.
Some limitations:
→ Term positions? e.g., "ice truck killer" in Dexter
→ Stemming? e.g., ananas, christmas
→ Mixture of languages? e.g., amusant (FR) vs. fun (EN)

19 Theory … and Practice
Series = {idS, name, maxNb}: (12, Lost, 540), (45, Dexter, 125)
Dict = {idT, term, idf}: plane (idT 8), killer, crash (idf 3.07)
Posting = {idT*, idS*, nb, tf}

20 Description of a TV Series
Example: Lost, described with a join (⋈) query.
→ Many surnames need to be filtered out
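One way such a description query could look is sketched below: join Series, Posting, and Dict, then rank the terms of one series by tf * idf. The table layout follows slide 19, but the query itself, the helper names top_terms and demo_db, and all nb and idf values are assumptions. SQLite stands in for Oracle, whose row-limiting syntax differs from LIMIT.

```python
import sqlite3

def top_terms(conn, series_name, k=3):
    """Return the k highest tf*idf terms for one series, best first."""
    sql = """
        SELECT d.term, p.tf * d.idf AS weight
        FROM Series s
        JOIN Posting p ON p.idS = s.idS
        JOIN Dict d    ON d.idT = p.idT
        WHERE s.name = ?
        ORDER BY weight DESC, d.term
        LIMIT ?
    """
    return conn.execute(sql, (series_name, k)).fetchall()

def demo_db():
    """In-memory database with invented nb and idf values."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
        CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT, idf REAL);
        CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER, tf REAL,
                              PRIMARY KEY (idT, idS));
    """)
    conn.execute("INSERT INTO Series VALUES (12, 'Lost', 540)")
    conn.executemany("INSERT INTO Dict VALUES (?, ?, ?)",
                     [(1, "people", 0.1), (8, "plane", 2.0), (29, "crash", 3.0)])
    # tf is stored precomputed as nb / maxNb (maxNb = 540 for Lost).
    conn.executemany("INSERT INTO Posting VALUES (?, 12, ?, ? / 540.0)",
                     [(1, 540, 540), (8, 108, 108), (29, 54, 54)])
    return conn
```

In this toy data 'people' is the most frequent term, yet it ranks last because its idf is low. Character surnames would score high for the opposite reason (frequent in one series, rare elsewhere), which is why the slide says they need filtering.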

21 Retrieval of TV Series → queries with 1 term
Query: survive (⋈)
Importance of normalization:
Stargate Atlantis: nb/maxNb = 63/1116 ≈ 0.056
Blade: nb/maxNb = 9/163 ≈ 0.055
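A one-term query can be sketched as a join that ranks series by nb / maxNb. The nb and maxNb values below are the ones on the slide; the idS and idT values, the helper names, and the query text are assumptions, and SQLite again stands in for Oracle.

```python
import sqlite3

def demo_db():
    """In-memory database holding the slide's two series for 'survive'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Series  (idS INTEGER PRIMARY KEY, name TEXT, maxNb INTEGER);
        CREATE TABLE Dict    (idT INTEGER PRIMARY KEY, term TEXT);
        CREATE TABLE Posting (idT INTEGER, idS INTEGER, nb INTEGER);
        -- idS and idT are placeholders; nb and maxNb come from the slide.
        INSERT INTO Series  VALUES (1, 'Stargate Atlantis', 1116), (2, 'Blade', 163);
        INSERT INTO Dict    VALUES (1, 'survive');
        INSERT INTO Posting VALUES (1, 1, 63), (1, 2, 9);
    """)
    return conn

def search_one_term(conn, term):
    """Rank the series containing `term` by normalized frequency nb / maxNb."""
    sql = """
        SELECT s.name, 1.0 * p.nb / s.maxNb AS tf
        FROM Dict d
        JOIN Posting p ON p.idT = d.idT
        JOIN Series s  ON s.idS = p.idS
        WHERE d.term = ?
        ORDER BY tf DESC
    """
    return conn.execute(sql, (term,)).fetchall()
```

The raw counts differ by a factor of seven (63 vs. 9), but once divided by each series' maxNb the two scores are nearly tied, which is the point the slide makes about normalization.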

22 Retrieval of TV Series → queries with n terms
Query: survive mulder (⋈)
Each matching term contributes tf * idf:
67 | The Vampire Diaries | survive | tf 0.028 | idf 0.107
67 | The Vampire Diaries | mulder | tf 0.007 | idf 3.977
   | X-Files | survive | tf 0.014 | idf 0.107
   | X-Files | mulder | tf 1.000 | idf 3.977
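The scoring for a multi-term query can be sketched with the tf and idf values from this slide. The per-term products are tf * idf as shown; combining them by summation is an assumption here, since the slide's totals did not survive. The names score and rank are hypothetical.

```python
# tf and idf values as shown on the slide.
IDF = {"survive": 0.107, "mulder": 3.977}
TF = {
    "The Vampire Diaries": {"survive": 0.028, "mulder": 0.007},
    "X-Files": {"survive": 0.014, "mulder": 1.000},
}

def score(series_tf, query_terms, idf):
    """Sum tf * idf over the query terms; a missing term contributes 0."""
    return sum(series_tf.get(t, 0.0) * idf.get(t, 0.0) for t in query_terms)

def rank(query_terms):
    """Return (series, score) pairs, best match first."""
    scored = [(name, score(tf, query_terms, IDF)) for name, tf in TF.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For the query "survive mulder", X-Files comes out far ahead: 'mulder' is both its most frequent term (tf = 1.000) and globally rare (idf = 3.977), whereas 'survive' has a low idf and barely moves either score.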

23 Similar to House? Computing Similarities Among TV Series 1/2
First, let's compute the numerator (⋈), where:
A_i = terms from House
B_i = terms from another TV series

24 Similar to House? Computing Similarities Among TV Series 2/2 ⋈ ⋈ ⋈
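The similarity formula itself is not preserved in the transcript. Given the vector space model and the mention of a numerator over A_i and B_i, the sketch below assumes cosine similarity, with the numerator as the sum of A_i * B_i over shared terms. It is written in plain Python rather than SQL, and the term weights are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity of two {term: weight} vectors (0.0 if either is empty)."""
    # Numerator: sum of A_i * B_i over the terms both series share.
    numerator = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return numerator / (norm_a * norm_b)

# Invented weights for illustration only.
house = {"patient": 3.0, "diagnosis": 4.0}
medical = {"patient": 4.0, "surgery": 3.0}
crime = {"killer": 5.0}
```

Only shared terms contribute to the numerator, which is why a SQL version can compute it with a self-join of Posting on idT; the two norms then need one aggregate per series.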

25 Thank you

