Presentation is loading. Please wait.

Presentation is loading. Please wait.

Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer Database group University of Washington.

Similar presentations


Presentation on theme: "Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer Database group University of Washington."— Presentation transcript:

1 Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer http://UniqueRecall.com Database group University of Washington Version April 21, 2011 33 rd European Conference on Information Retrieval (ECIR'11)

2 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Information Acquisition from Redundant Data Pareto principle (80-20 rule)20% causes80% effect e.g. businessclientssales e.g. softwarebugserrors e.g. health carepatientsHC resources Information acquisition? 20% data (instances) ? 80% information (concepts) e.g. words in a corpusall wordsdifferent words e.g. used first namesindividual namesdifferent names e.g. web harvestingweb dataweb information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data?

3 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 3 Information Acquisition from Redundant Data AuAu ABBuBu "Unique" information Available Data Retrieved and extracted data Acquired information Information Integration Information Retrieval, Information Extraction Information Dissemination Information Acquisition Redundancy distribution k Recall r Expected sample distribution k Expected unique recall r u Three assumptions no disambiguity in data random sampling w/o replacement very large data sets

4 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 4 Outline A horizontal sampling model The role of power-laws Real data & Discussion

5 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A Simple Balls-and-Urn Sampling Model Redundancy Distribution k k=(6,3,3,2,1) Information i Redundancy k i Data 12345 1 5 6 4 3 2 frequency of i-th most often appearing information (color) (# colors: a u =5) (# balls: a=15) Redundancy k i 12345 1 5 6 4 3 2 Sampled Data (# balls: b=3) Sampled Information i (# Colors: b u =2) Recall r r=3/15=0.2 Unique recall r u =2/5=0.4 Sample Redundancy distribution k=(2,1) 5

6 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A model for sampling in the Limit of large data sets 12345 1 5 6 4 3 2 6 13579 1 5 6 4 3 2 246810 k=(6,3,3,2,1)k=(6,6,3,3,3,3,2,2,1,1) 12345 1 5 6 4 3 2  =(0.2,0.2,0.4,0,0,0.2) a=15 a∞a∞ a=30 k 2-3 =3  3 =0.4 vertical perspectivehorizontal perspective k 3-6 =3  3 =0.4...

7 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com The Intuition for constant redundancy k 01 r u = 1-(1-0.5) 3 =0.875 ruru 1 2 3 k=const=3 7 Unique recall 22 Expected sample distribution  2 =(3 choose 2)0.5 2 (1-0.5) 1 =0.375  Binomial distribution  Indep. sampling with p=r lim a∞a∞ r=0.5

8 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com The Intuition for arbitrary redundancy distributions  3 =0.4  6 = 0.2  2 = 0.2  1 = 0.2 1 2 3 5 6 4 Redundancy k 0.20.600.81 8 r u =  6 [ 1-(1-r) 6 ] +  3 [ 1-(1-r) 3 ] +  3 [ 1-(1-r) 2 ] +  1 [ 1-(1-r) 1 ] Stratified sampling lim a∞a∞ r=0.5 22

9 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com A horizontal Perspective for Sampling 9 a u =5a u =20 a u =100a u =1000 Horizontal layer of redundancy  1 =0.8 Expected sample redundancy k 1 =3 r=0.5 b u =800

10 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 10 Unique Recall for arbitrary redundancy distributions 10

11 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 11 Outline A horizontal sampling model The role of power-laws Real data & Discussion

12 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 12 Three formulations of Power law distributions 12 Stumpf et al. [PNAS’05] Mitzenmacher [Internet Math.’04] Adamic [TR’00] Mitzenmacher [IM’04] redundancy k i redundancy frequency  k complementary cumulative frequency  k Zipf-MandelbrotParetoPower-law probability Zipf [1932] Commonly assumed to be different represen- tations of the same distribution

13 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Unique Recall with Power laws 13 For  =1 log-log plot

14 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Unique Recall with Power laws 14 Rule of Thumb 1: When sampling 20% of data from a Power-law distribution, we expect to learn less than 40% of the information For  =1

15 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Invariants under Sampling 15 Given our model: Which redundancy distribution remains invariant under sampling?

16 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 16 Invariant is a power-law! Hence, the tail of all power laws remains invariant under sampling 16 r uk =  k /  k

17 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Also, the power law tail breaks in 17 Rule of Thumb 2: When sampling data from a Power-law then the core of the sample distribution follows

18 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 18 Outline A horizontal sampling model The role of power-laws Real data & Discussion

19 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 19 Sampling from Real Data Incoming links for one domain Tag distribution on delicious.com Rule of Thumb 1: not good! Rule of Thumb 1: perfect! Rule of Thumb 2: works, but only for small area Rule of Thumb 2: works!

20 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Theory on Sampling from Power-Laws Stumpf et al. [PNAS’05] "Here, we show that random subnets sampled from scale- free networks are not themselves scale-free." This paper "Here, we show that there is one power- law family that is invariant under sampling, and the core of other power-law function remains invariant under sampling too.

21 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Some other related work Population sampling Downey et al. [IJCAI’05] Ipeirotis et al. [Sigmod’06] Stumpf et al. [PNAS’05] urn model to estimate the probability that extracted information is correct. random sampling with replacement show that random subnets sampled from scale-free networks are not scale-free decision framework to search or to crawl random sampling without replacement with known population sizes goal: estimate size of population e.g. mark-recapture animal sampling sampling of small fraction w/ replacement Bar-Yossef, Gurevich [WWW’06] biased methods to sample from search engine's index to estimate index size

22 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 22 Summary 1/2 AuAu ABBuBu InformationAvailable Data Retrieved Data Acquired information Inf. IntegrationIR & IE Inf. Dissemination A simple model of information acquisition from redundant data Full analytic solution Inf. Acquisition -no disambiguity -random sampling w/o replacement -large data r uk =  k /  k

23 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 23 Summary 2/2 Rule of thumb 1: Rule of thumb 2:  80/20  40/20 -power-law core remains invariant r uk  r  -sensitive to exact power-law root http://uniqueRecall.com

24 24 backup

25 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com Geometric interpretation of  k ( , k, r) 25 BACKUP

26 Wolfgang Gatterbauer Rules of Thumb for Information Acquisition from Redundant Data http://uniqueRecall.com 26 Information Acquisition from Redundant Data 3 pieces of data, containing 2 pieces of (“unique”) information* Capurro, Hjørland [ARIST’03] e.g. words of a corpus: word appearances / vocabulary e.g. used first names in a group individual names / different names e.g. web harvesting: web data / web information Motivating question: Can we learn 80% of the information, by looking at only 20% of the data? Data (instances) Information (concepts) *data interpreted as redundant representation of information


Download ppt "Rules of Thumb for Information Acquisition from Large and Redundant Data Wolfgang Gatterbauer Database group University of Washington."

Similar presentations


Ads by Google