Download presentation
Presentation is loading. Please wait.
1
Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer http://UniqueRecall.com Database group University of Washington Version April 21, 2011 33 rd European Conference on Information Retrieval (ECIR'11)
2
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Information Acquisition from Redundant Data Pareto principle (80-20 rule)20% causes80% effect e.g. businessclientssales e.g. softwarebugserrors e.g. health carepatientsHC resources Information acquisition? 20% data (instances) ? 80% information (concepts) e.g. words in a corpusall wordsdifferent words e.g. used first namesindividual namesdifferent names e.g. web harvestingweb dataweb information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data?
3
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3 Information Acquisition from Redundant Data AuAu ABBuBu "Unique" information Available Data Retrieved and extracted data Acquired information Information Integration Information Retrieval, Information Extraction Information Dissemination Information Acquisition Redundancy distribution k Recall r Expected sample distribution k Expected unique recall r u Three assumptions no disambiguity in data random sampling w/o replacement very large data sets
4
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4 Outline A horizontal sampling model The role of power-laws Real data & Discussion
5
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A Simple Balls-and-Urn Sampling Model Redundancy Distribution k k=(6,3,3,2,1) Information i Redundancy k i Data 12345 1 5 6 4 3 2 frequency of i-th most often appearing information (color) (# colors: a u =5) (# balls: a=15) Redundancy k i 12345 1 5 6 4 3 2 Sampled Data (# balls: b=3) Sampled Information i (# Colors: b u =2) Recall r r=3/15=0.2 Unique recall r u =2/5=0.4 Sample Redundancy distribution k=(2,1) 5
6
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A model for sampling in the Limit of large data sets 12345 1 5 6 4 3 2 6 13579 1 5 6 4 3 2 246810 k=(6,3,3,2,1)k=(6,6,3,3,3,3,2,2,1,1) 12345 1 5 6 4 3 2 =(0.2,0.2,0.4,0,0,0.2) a=15 a∞a∞ a=30 k 2-3 =3 3 =0.4 vertical perspectivehorizontal perspective k 3-6 =3 3 =0.4...
7
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com The Intuition for constant redundancy k 01 r u = 1-(1-0.5) 3 =0.875 ruru 1 2 3 k=const=3 7 Unique recall 22 Expected sample distribution 2 =(3 choose 2)0.5 2 (1-0.5) 1 =0.375 Binomial distribution Indep. sampling with p=r lim a∞a∞ r=0.5
8
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com The Intuition for arbitrary redundancy distributions 3 =0.4 6 = 0.2 2 = 0.2 1 = 0.2 1 2 3 5 6 4 Redundancy k 0.20.600.81 8 r u = 6 [ 1-(1-r) 6 ] + 3 [ 1-(1-r) 3 ] + 3 [ 1-(1-r) 2 ] + 1 [ 1-(1-r) 1 ] Stratified sampling lim a∞a∞ r=0.5 22
9
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A horizontal Perspective for Sampling 9 a u =5a u =20 a u =100a u =1000 Horizontal layer of redundancy 1 =0.8 Expected sample redundancy k 1 =3 r=0.5 b u =800
10
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10 Unique Recall for arbitrary redundancy distributions 10
11
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11 Outline A horizontal sampling model The role of power-laws Real data & Discussion
12
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12 Three formulations of Power law distributions 12 Stumpf et al. [PNAS’05] Mitzenmacher [Internet Math.’04] Adamic [TR’00] Mitzenmacher [IM’04] redundancy k i redundancy frequency k complementary cumulative frequency k Zipf-MandelbrotParetoPower-law probability Zipf [1932] Commonly assumed to be different represen- tations of the same distribution
13
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Unique Recall with Power laws 13 For =1 log-log plot
14
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Unique Recall with Power laws 14 Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information For =1
15
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Invariants under Sampling 15 Given our model: Which redundancy distribution remains invariant under sampling?
16
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16 Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling 16 r uk = k / k
17
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Also, the power law tail breaks in 17 Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows
18
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18 Outline A horizontal sampling model The role of power-laws Real data & Discussion
19
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19 Sampling from Real Data Incoming links for one domain Tag distribution on delicious.com Rule of Thumb 1: not good! Rule of Thumb 1: perfect! Rule of Thumb 2: works, but only for small area Rule of Thumb 2: works!
20
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Theory on Sampling from Power-Laws Stumpf et al. [PNAS’05] "Here, we show that random subnets sampled from scale- free networks are not themselves scale-free." This paper "Here, we show that there is one power- law family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too.
21
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Some other related work Population sampling Downey et al. [IJCAI’05] Ipeirotis et al. [Sigmod’06] Stumpf et al. [PNAS’05] urn model to estimate the probability that extracted information is correct. random sampling with replacement show that random subnets sampled from scale-free networks are not scale-free decision framework to search or to crawl random sampling without replacement with known population sizes goal: estimate size of population e.g. mark-recapture animal sampling sampling of small fraction w/ replacement Bar-Yossef, Gurevich [WWW’06] biased methods to sample from search engine's index to estimate index size
22
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22 Summary 1/2 AuAu ABBuBu InformationAvailable Data Retrieved Data Acquired information Inf. IntegrationIR & IE Inf. Dissemination A simple model of information acquisition from redundant data Full analytic solution Inf. Acquisition -no disambiguity -random sampling w/o replacement -large data r uk = k / k
23
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23 Summary 2/2 Rule of thumb 1: Rule of thumb 2: 80/20 40/20 -power-law core remains invariant r uk r -sensitive to exact power-law root http://uniqueRecall.com
24
24 backup
25
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Geometric interpretation of k ( , k, r) 25 BACKUP
26
Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26 Information Acquisition from Redundant Data 3 pieces of data, containing 2 pieces of (“unique”) information* Capurro, Hjørland [ARIST’03] e.g. words of a corpus: word appearances / vocabulary e.g. used first names in a group individual names / different names e.g. web harvesting: web data / web information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? Data (instances) Information (concepts) *data interpreted as redundant representation of information
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.