Research Internships Advanced Research and Modeling Research Group.

1 Research Internships Advanced Research and Modeling Research Group

2 ADReM – What? Research group that deals with computational aspects of data: databases, data mining, information retrieval

3 ADReM – Who?
DB/DM/IR: Floris Geerts, Bart Goethals, Martin Theobald
Bioinformatics: Kris Laukens, Tim Van den Bulcke
+ PhD students and postdoctoral researchers

4 Internships – What?
2 research internships (15 credits each) and an MSc thesis (30 credits).
Goal: an internship is an initiation to research and is carried out in collaboration with researchers in ADReM.
15 credits is a lot, so an internship is time-consuming! 1 credit = 15 hours of work…
Balance your course load and internship well.
Internships are not necessarily related to your MSc thesis (but they can be).
In an MSc thesis, your ability to do research independently plays an important role.

5 Internships – Who? Everyone who follows the research option in the database MSc program

6 Research
In an internship you need to:
1. Understand a specific problem
2. Implement an (existing) method for solving the problem
3. Test and evaluate
4. Write a report
(MSc thesis: you have to solve the problem as well, by designing new methods…)

7 Internships in a company
You are allowed to do an internship in a company, but you have to ask permission.
Also, you have to find the company yourself and convince us that there is research involved.
You can’t receive any money from the company during your internship.

8 Databases, data mining, information retrieval
These are not separate research domains.
The topics for internships that each of us will present next are usually at the intersection of these areas.
Let’s see some example topics…

9 Bart Goethals

10 Recommender Systems
– Implement state-of-the-art recommenders
– Pattern mining for better recommendations
– Interactive recommendation
– Explaining recommendations
– Test recommenders on real data

11 Visual Instant Interactive Pattern Mining
– Study visualizations enabling interactive pattern mining
– Implement and experiment with novel instant mining methods

12 Pattern-based Clustering
Implement and evaluate different techniques for clustering-based pattern mining and for pattern-based clustering

13 Data Mining for Cleaning Study and experiment with data mining methods for data cleaning.

14 Martin Theobald

15 Information Extraction (I): Wikipedia Infoboxes

16 Information Extraction (I): Infoboxes
Example facts: bornOn(Jeff, 09/22/42), gradFrom(Jeff, Columbia), hasAdvisor(Jeff, Arthur), hasAdvisor(Surajit, Jeff), knownFor(Jeff, Theory)
YAGO/DBpedia et al.: >120 M facts for YAGO2 (mostly from Wikipedia infoboxes)

17 Information Extraction (II): Wikipedia Categories

18 ?

19 RDF Knowledge Bases
[Figure: example RDF graph.
Classes: Entity, Person, Scientist, Physicist, Biologist, Politician, Location, City, Country, State, Organization.
Entities: Max_Planck, Erwin_Planck, Angela Merkel, Kiel, Schleswig-Holstein, Germany, Nobel Prize, Max_Planck Society.
Edge labels: subclass, instanceOf, means, bornOn, diedOn, bornIn, FatherOf, hasWon, citizenOf, locatedIn.
Literals: “Max Planck”, “Max Karl Ernst Ludwig Planck”, “Angela Merkel”, “Angela Dorothea Merkel”, Apr 23, 1858, Oct 4, 1947, Oct 23, 1944.]
3 Mio. entities, 120 Mio. facts, 100 relations, 200k classes; accuracy 95%

20 Linked Open Data
As of Sept. 2011: > 200 sources, > 30 billion RDF triples, > 400 million links

21 As of Sept. 2011: > 5 million owl:sameAs links between DBpedia/YAGO/Freebase

22 IBM Watson: Deep Question Answering
– 99 cents got me a 4-pack of Ytterlig coasters from this Swedish chain
– This town is known as "Sin City" & its downtown is "Glitter Gulch"
– William Wilkinson's "An Account of the Principalities of Wallachia and Moldavia" inspired this author's most famous novel
– As of 2010, this is the only former Yugoslav republic in the EU
Figure labels: YAGO, knowledge back-ends, question classification & decomposition
D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI Magazine, Fall 2010.

23 A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field? Jeopardy!

24 Structured Knowledge Queries
A big US city with two airports, one named after a World War II hero, and one named after a World War II battle field?
  Select Distinct ?c Where {
    ?c type City . ?c locatedIn USA .
    ?a1 type Airport . ?a2 type Airport .
    ?a1 locatedIn ?c . ?a2 locatedIn ?c .
    ?a1 namedAfter ?p . ?p type WarHero .
    ?a2 namedAfter ?b . ?b type BattleField .
  }
Use manually created templates for mapping sentence patterns to structured queries.
Works for factoid and list questions.
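
To make the structured query on this slide concrete, here is a minimal Python sketch that evaluates the same ten triple patterns against a tiny in-memory triple store by backtracking. The facts in TRIPLES and the helper name match are illustrative assumptions, not data or code from the slides, and a real engine would use indexes and join ordering rather than this naive scan.

```python
# Toy triple store. All facts below are illustrative assumptions.
TRIPLES = {
    ("Chicago", "type", "City"), ("Chicago", "locatedIn", "USA"),
    ("OHare", "type", "Airport"), ("Midway", "type", "Airport"),
    ("OHare", "locatedIn", "Chicago"), ("Midway", "locatedIn", "Chicago"),
    ("OHare", "namedAfter", "Butch_OHare"), ("Butch_OHare", "type", "WarHero"),
    ("Midway", "namedAfter", "Battle_of_Midway"),
    ("Battle_of_Midway", "type", "BattleField"),
    ("Boston", "type", "City"), ("Boston", "locatedIn", "USA"),
    ("Logan", "type", "Airport"), ("Logan", "locatedIn", "Boston"),
}

# The triple patterns of the query; terms starting with "?" are variables.
PATTERNS = [
    ("?c", "type", "City"), ("?c", "locatedIn", "USA"),
    ("?a1", "type", "Airport"), ("?a2", "type", "Airport"),
    ("?a1", "locatedIn", "?c"), ("?a2", "locatedIn", "?c"),
    ("?a1", "namedAfter", "?p"), ("?p", "type", "WarHero"),
    ("?a2", "namedAfter", "?b"), ("?b", "type", "BattleField"),
]


def match(patterns, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in TRIPLES:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, new)


# "Select Distinct ?c": project onto ?c and remove duplicates.
cities = sorted({b["?c"] for b in match(PATTERNS)})
print(cities)
```

On this toy data only Chicago satisfies all patterns; Boston is rejected because its single airport has no namedAfter fact.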

25 Mining Rules from RDF Knowledge Bases
Goal: Inductively learn (soft) rules: livesIn(x,y) :- bornIn(x,y)
[Figure: three overlapping sets.
G = ground truth for livesIn (only partially known)
KB = knowledge base for livesIn (known positive examples)
R = facts produced by the rule (only partially correct)]
– Closed World Assumption: strongly penalizes the rule
– Specificity: avoid producing overly general rules (overly general: refine by types)
Our solution:
– Confidence instead of accuracy: do not penalize the rule for unseen entities
– Use a combination of statistical measures
– A-priori-style pre-filtering of low-support join patterns
– Dynamic programming ILP algorithm
– Learning with constants and type constraints
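
The difference between a closed-world score and a confidence that does not penalize unseen entities can be sketched in a few lines of Python. This is one possible reading of the slide, not the group's actual measure: here a prediction livesIn(x,y) is only judged when the knowledge base already knows some livesIn fact for x. The toy facts and the function name rule_stats are assumptions made for illustration.

```python
# Hypothetical toy data; none of these facts come from the slides.
BORN_IN = {("anna", "Antwerp"), ("ben", "Berlin"), ("carl", "Kiel"), ("dora", "Delft")}
LIVES_IN_KB = {("anna", "Antwerp"), ("ben", "Munich")}  # known positive examples only


def rule_stats(body_facts, head_kb):
    """Score the rule livesIn(x,y) :- bornIn(x,y) against a partial knowledge base."""
    predicted = set(body_facts)           # the rule copies every bornIn fact
    support = len(predicted & head_kb)    # predictions confirmed by the KB
    # Closed-world view: every prediction missing from the KB counts as wrong.
    closed_world = support / len(predicted) if predicted else 0.0
    # Assumed alternative: only judge predictions about entities the KB has seen.
    known_subjects = {x for x, _ in head_kb}
    judged = {(x, y) for x, y in predicted if x in known_subjects}
    confidence = support / len(judged) if judged else 0.0
    return support, closed_world, confidence


support, closed_world, confidence = rule_stats(BORN_IN, LIVES_IN_KB)
print(support, closed_world, confidence)
```

With these facts the rule makes four predictions and one is confirmed, so the closed-world score is 0.25. Only anna and ben appear in the knowledge base, so the alternative confidence is 0.5.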

26 Rule-based Reasoning
(Soft) Deduction Rules vs. (Hard) Consistency Constraints
Soft deduction rules:
  People may live in more than one place
  livesIn(x,y) ⇐ marriedTo(x,z) ∧ livesIn(z,y) [0.8]
  livesIn(x,y) ⇐ hasChild(x,z) ∧ livesIn(z,y) [0.5]
Hard consistency constraints:
  People are not born in different places/on different dates
  bornIn(x,y) ∧ bornIn(x,z) ⇒ y=z
  People are not married to more than one person (at the same time, in most countries?)
  marriedTo(x,y,t1) ∧ marriedTo(x,z,t2) ∧ y≠z ⇒ disjoint(t1,t2)
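
As a rough illustration of the two kinds of rules, the Python sketch below applies one soft deduction rule (marriage implies a shared place of residence, tagged with a weight) and checks one hard constraint (a person has a single birthplace). The facts, the weight 0.8, and the function names are hypothetical; a real reasoner would combine weights probabilistically instead of just attaching them.

```python
# Hypothetical facts for illustration only.
MARRIED_TO = {("alice", "bob")}
LIVES_IN = {("bob", "Antwerp")}
BORN_IN = {("alice", "Ghent"), ("alice", "Kiel"), ("bob", "Antwerp")}


def derive_lives_in(married_to, lives_in, weight):
    """Soft rule: marriedTo(x,z) and livesIn(z,y) suggest livesIn(x,y) with the given weight."""
    return {(x, y, weight) for x, z in married_to for z2, y in lives_in if z == z2}


def born_in_violations(born_in):
    """Hard constraint: bornIn(x,y) and bornIn(x,z) require y = z."""
    places = {}
    for person, place in born_in:
        places.setdefault(person, set()).add(place)
    return {person: sorted(ps) for person, ps in places.items() if len(ps) > 1}


derived = derive_lives_in(MARRIED_TO, LIVES_IN, weight=0.8)
violations = born_in_violations(BORN_IN)
print(derived)
print(violations)
```

The soft rule only adds a weighted candidate fact, while a violation of the hard constraint signals that at least one of the conflicting facts must be discarded.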

27 Probabilistic RDF Database
Base Facts
  graduatedFrom(Surajit, Princeton) [0.7]
  graduatedFrom(Surajit, Stanford) [0.6]
  graduatedFrom(David, Princeton) [0.9]
  hasAdvisor(Surajit, Jeff) [0.8]
  hasAdvisor(David, Jeff) [0.7]
  worksAt(Jeff, Stanford) [0.9]
  type(Princeton, University) [1.0]
  type(Stanford, University) [1.0]
  type(Jeff, Computer_Scientist) [1.0]
  type(Surajit, Computer_Scientist) [1.0]
  type(David, Computer_Scientist) [1.0]
Rules
  hasAdvisor(x,y) ∧ worksAt(y,z) ⇒ graduatedFrom(x,z) [0.4]
  graduatedFrom(x,y) ∧ graduatedFrom(x,z) ⇒ y=z
Query: graduatedFrom(Surajit, y)
Lineage variables:
  A = graduatedFrom(Surajit, Princeton) [0.7]
  B = graduatedFrom(Surajit, Stanford) [0.6]
  C = hasAdvisor(Surajit, Jeff) [0.8]
  D = worksAt(Jeff, Stanford) [0.9]
Answers:
  Q1 = graduatedFrom(Surajit, Princeton), lineage A ∧ ¬(B ∨ (C ∧ D))
  Q2 = graduatedFrom(Surajit, Stanford), lineage ¬A ∧ (B ∨ (C ∧ D))
Probabilities:
  P(C ∧ D) = 0.8 × 0.9 = 0.72
  P(B ∨ (C ∧ D)) = 1 − (1 − 0.72) × (1 − 0.6) = 0.888
  P(Q1) = 0.7 × (1 − 0.888) = 0.078
  P(Q2) = (1 − 0.7) × 0.888 = 0.266
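
The numbers on this slide can be reproduced with a short Python calculation. The sketch assumes that the four base facts are independent and that the lineage of the Princeton answer is A ∧ ¬(B ∨ (C ∧ D)), with A, B, C, D standing for graduatedFrom(Surajit, Princeton), graduatedFrom(Surajit, Stanford), hasAdvisor(Surajit, Jeff) and worksAt(Jeff, Stanford). That assignment is inferred from the probabilities that survive on the slide and is not stated there explicitly.

```python
# Base-fact probabilities taken from the slide; independence is an assumption.
p = {"A": 0.7, "B": 0.6, "C": 0.8, "D": 0.9}

# Conjunction of independent events: multiply.
p_c_and_d = p["C"] * p["D"]

# Disjunction of independent events: 1 minus the probability that both fail.
p_b_or_cd = 1 - (1 - p["B"]) * (1 - p_c_and_d)

# Q1: Princeton holds and there is no (direct or derived) support for Stanford.
q1 = p["A"] * (1 - p_b_or_cd)

# Q2: Princeton does not hold and Stanford is supported directly or via the rule.
q2 = (1 - p["A"]) * p_b_or_cd

print(round(p_c_and_d, 3), round(p_b_or_cd, 3), round(q1, 3), round(q2, 3))
```

The two answer probabilities do not sum to 1 because both answers are false in worlds where neither or both of the competing facts hold.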

28 Temporal Knowledge

29 Probabilistic-Temporal Consistency Reasoning
Base Facts: playsFor(Beckham, Real, T1), playsFor(Ronaldo, Real, T2)
[Figure: timelines for T1 and T2 with year ticks between ’00 and ’07]
Rule: playsFor(Beckham, Real, T1) ∧ playsFor(Ronaldo, Real, T2) ∧ overlaps(T1,T2) ⇒ teamMates(Beckham, Ronaldo, t3)
Derived Facts (state relation): teamMates(Beckham, Ronaldo, T3)
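
The temporal part of this derivation comes down to intersecting validity intervals. The Python sketch below shows only that step and ignores the probabilistic side of the slide. The concrete year ranges are assumptions chosen for illustration, since the timeline figure did not survive extraction, and the function name overlap is hypothetical.

```python
def overlap(t1, t2):
    """Return the intersection of two half-open year intervals, or None if disjoint."""
    start, end = max(t1[0], t2[0]), min(t1[1], t2[1])
    return (start, end) if start < end else None


# Assumed validity intervals for the two playsFor facts.
T1 = (2003, 2007)  # playsFor(Beckham, Real, T1)
T2 = (2002, 2007)  # playsFor(Ronaldo, Real, T2)

# teamMates holds exactly while both playsFor facts hold.
T3 = overlap(T1, T2)
print(T3)
```

If the intervals do not overlap, the function returns None and no teamMates fact would be derived.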

30 Topics for Internships & Master Theses
Research Internships
– Preparation & Integration of Linked Data Sources for Scientific Experiments (SQL/Java/Python)
– Mining Association Rules from Linked Data (Java/C++)
– Visualization Frontend for Linked Data (ActionScript & Adobe Flash)
Master Theses
– Implementation of a distributed rule-based query engine for RDF data (C++ & Message Passing Interface)
– Implementation of a distributed factor graph model for correlated RDF facts (C++ & Message Passing Interface)
– Faceted Search and Interactive Browsing for Linked Data

31 Floris Geerts

32 RDBMS-based recommendation systems
Example: find top-3 flights from EDI to NYC with at most one stop
– Items: flights
– Selection criteria: relational queries
– Utility function: in terms of price and duration (for ranking)
Items can also be books, music, news, Web sites, research papers, …
[Figure: top-k item selection from a set of items, driven by selection criteria and a utility function]
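
A minimal Python sketch of top-k item selection follows: filter the items with the selection criteria, then rank the survivors with a utility function. The flight data, the weight of 50 per hour in the cost function, and all names are hypothetical, and an RDBMS-based system would push the selection into a relational query rather than filter in memory.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Flight:
    number: str
    origin: str
    dest: str
    stops: int
    price: float
    hours: float


# Hypothetical items.
FLIGHTS = [
    Flight("F1", "EDI", "NYC", 1, 900.0, 8.0),
    Flight("F2", "EDI", "NYC", 1, 500.0, 12.0),
    Flight("F3", "EDI", "NYC", 1, 450.0, 15.0),
    Flight("F4", "EDI", "NYC", 2, 300.0, 20.0),
    Flight("F5", "EDI", "NYC", 1, 720.0, 10.0),
    Flight("F6", "GLA", "NYC", 0, 400.0, 8.0),
]


def cost(flight):
    """Assumed utility: price plus 50 per hour of travel; lower is better."""
    return flight.price + 50.0 * flight.hours


def top_k(flights, k):
    """Apply the selection criteria, then keep the k best items by utility."""
    candidates = (
        f for f in flights
        if f.origin == "EDI" and f.dest == "NYC" and f.stops <= 1
    )
    return heapq.nsmallest(k, candidates, key=cost)


best = [f.number for f in top_k(FLIGHTS, 3)]
print(best)
```

F4 and F6 are removed by the selection criteria before ranking, so a cheap but invalid item never reaches the top-k list.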

33 Query relaxation
Query for 5-day holiday:
  Q(f#, name, type, ticket, time) = ∃ DT, AT, AD, x_To ( flight(f#, EDI, x_To, DT, 5/19/2012, AT, AD, Pr) ∧ POI(name, x_To, type, ticket, time) ∧ x_To = NYC )
  E = { EDI, NYC, 4/1/2012 }, X = { x_To }
There is no direct flight from EDI to NYC
Relaxation: cities within 15 miles of EDI or NYC are acceptable
  Q1(f#, name, type, ticket, time) = ∃ DT, AT, AD, u_To, w_Edi, w_NYC, w_DD ( flight(f#, w_Edi, x_To, DT, w_DD, AT, AD, Pr) ∧ x_To = w_NYC ∧ POI(name, u_To, type, ticket, time) ∧ w_DD = 5/19/2012 ∧ dist(w_NYC, NYC) ≤ 15 ∧ dist(w_Edi, EDI) ≤ 15 ∧ x_To = u_To )
Further relaxation: departure dates within 3 days of 5/19/2012 are acceptable
  dist(w_DD, 5/10/2012) ≤ 3
Q1 is a valid query relaxation
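
The idea of replacing equality with a distance bound can be sketched in Python as follows. The constants EDI and NYC and the date 5/19/2012 come from the slide. The flights, the distance table, and the function names are assumptions for illustration. The exact query corresponds to bounds of zero, and each relaxation step loosens one bound.

```python
from datetime import date

# Hypothetical flights: (flight number, origin, destination, departure date).
FLIGHTS = [
    ("F10", "EDI", "EWR", date(2012, 5, 19)),
    ("F11", "EDI", "BOS", date(2012, 5, 19)),
    ("F12", "EDI", "NYC", date(2012, 5, 21)),
]

# Hypothetical distances in miles; unknown pairs are treated as infinitely far.
DIST = {("EWR", "NYC"): 9.0}


def dist(a, b):
    if a == b:
        return 0.0
    return DIST.get((a, b), DIST.get((b, a), float("inf")))


def query(max_miles=0.0, max_days=0):
    """Flights from (near) EDI to (near) NYC departing on (or near) 5/19/2012."""
    target = date(2012, 5, 19)
    return [
        number for number, origin, dest, day in FLIGHTS
        if dist(origin, "EDI") <= max_miles
        and dist(dest, "NYC") <= max_miles
        and abs((day - target).days) <= max_days
    ]


exact = query()                                  # original query: equality everywhere
relaxed_city = query(max_miles=15.0)             # cities within 15 miles
relaxed_both = query(max_miles=15.0, max_days=3) # plus dates within 3 days
print(exact, relaxed_city, relaxed_both)
```

Each relaxation can only add answers, which is why a system can relax step by step and stop as soon as enough items are found.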

34 Topics
– Top-k query answering algorithm on top of RDBMS
– Query relaxation approaches and query completion

35 Data quality
– Detecting and correcting inconsistencies
– Finding duplicates
– Finding most up-to-date information
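
As a small illustration of duplicate finding, the Python sketch below groups records whose names become identical after normalization (lowercasing, dropping punctuation, collapsing whitespace). The records and function names are hypothetical, and this is only the simplest form of the task: real duplicate detection typically also needs approximate string matching and comparison of several attributes.

```python
import re
from collections import defaultdict


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())


def find_duplicates(records):
    """Group record ids whose normalized names are identical."""
    groups = defaultdict(list)
    for record_id, name in records:
        groups[normalize(name)].append(record_id)
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records: (id, company name).
RECORDS = [(1, "Yahoo! Inc."), (2, "yahoo inc"), (3, "Nasdaq"), (4, "YAHOO,  Inc")]
duplicates = find_duplicates(RECORDS)
print(duplicates)
```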

36 Semantic errors
[Figure: stock quote pages from Yahoo! Finance and Nasdaq, with fields labeled “Day’s Range” and “wk Range”]

37 Instance ambiguity

38 Out-of-Date Data 4:05 pm 3:57 pm

39 Unit errors 76,821, B

40 Topics
– Fast inconsistency detection
– Duplicate elimination algorithms
– Automated repairing algorithms
– Mining of “data quality rules”

