1 Modern Machine Learning: New Challenges and Connections Maria-Florina Balcan

2 Machine Learning is Changing the World. "A breakthrough in machine learning would be worth ten Microsofts" (Bill Gates, Microsoft). "Machine learning is the hot new thing" (John Hennessy, President, Stanford). "Web rankings today are mostly a matter of machine learning" (Prabhakar Raghavan, VP of Engineering at Google).

3 The World is Changing Machine Learning: new applications, explosion of data.

4 Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data (billions of webpages, images, protein sequences); only a tiny fraction can be annotated by human experts.

5 Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data. Semi-supervised learning and (inter)active learning: techniques that best utilize the data, minimizing the need for expert/human intervention.

6 Modern ML: New Learning Approaches. Modern applications: massive amounts of data distributed across multiple locations.

7 Modern applications: massive amounts of data distributed across multiple locations, e.g., scientific data, video data. Key new resource: communication.

8 The World is Changing Machine Learning. Resource constraints, e.g.: computational efficiency (noise-tolerant algorithms), communication, human labeling effort, statistical efficiency, privacy/incentives. New approaches, e.g.: semi-supervised learning, distributed learning, interactive learning, multi-task/transfer learning, never-ending learning, community detection.

9 The World is Changing Machine Learning. My research: foundations, robust algorithms. Applications: vision, computational biology, social networks. New approaches, e.g.: semi-supervised learning, distributed learning, interactive learning, never-ending learning, community detection, multi-task/transfer learning.

10 Machine Learning is Changing the World. My research: machine learning theory (MLT) as a way to model and tackle challenges in other fields, e.g., game theory, approximation algorithms, matroid theory, discrete optimization, mechanism design, control theory.

11 Outline of the talk: Modern Learning Paradigms (interactive learning); Machine Learning Lenses in Other Areas (submodularity, with implications for matroid theory, algorithmic game theory, and optimization); Discussion, Other Exciting Directions.

12 Supervised Learning. E.g., which emails are spam and which are important (spam vs. not spam). E.g., classify objects as chairs vs. non-chairs.

13 Statistical / PAC learning model. Data source: distribution D on X; target concept c* : X → {0,1}. The learning algorithm sees labeled examples (x_1, c*(x_1)), …, (x_m, c*(x_m)), with the x_i drawn i.i.d. from D (the expert/oracle supplies the labels), performs optimization over this sample, and outputs a hypothesis h : X → {0,1} in C. Goal: h has small error, err(h) = Pr_{x ~ D}[h(x) ≠ c*(x)]. If c* ∈ C, this is the realizable case; otherwise the agnostic case.
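
To make the model concrete, here is a minimal ERM-style sketch in Python for the class of threshold functions on [0,1]; the distribution, target, and grid of candidate thresholds are illustrative assumptions, not part of the talk.

```python
import random

def erm_thresholds(sample, thresholds):
    """Return the threshold t whose classifier h_t(x) = 1[x >= t] makes the fewest mistakes on the sample."""
    def empirical_error(t):
        return sum((x >= t) != y for x, y in sample) / len(sample)
    return min(thresholds, key=empirical_error)

# Hypothetical realizable setup: D = uniform on [0,1], target c*(x) = 1[x >= 0.3].
target = lambda x: x >= 0.3
sample = [(x, target(x)) for x in (random.random() for _ in range(200))]
t_hat = erm_thresholds(sample, thresholds=[i / 100 for i in range(101)])
print("learned threshold:", t_hat)
```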

14 Two Main Aspects in Classic Machine Learning. Algorithm design: how to optimize? Automatically generate rules that do well on observed data, e.g., Boosting, SVM, etc. Generalization guarantees, sample complexity: confidence that a rule will be effective on future data.

15 Interactive Machine Learning: active learning; learning with more general queries; connections.

16 Active Learning. [Figure: the learning algorithm receives raw, unlabeled data, requests labels (e.g., face vs. not face) from an expert labeler, and outputs a classifier.]

17 Active Learning in Practice. Text classification: active SVM (Tong & Koller, ICML 2000), e.g., request the label of the example closest to the current separator. Video segmentation (Fathi-Balcan-Ren-Rehg, BMVC 11).

18 Provable Guarantees, Active Learning. Canonical theoretical example [CAL92, Dasgupta04]: learning a threshold w on the line. Passive supervised learning needs Ω(1/ε) labels to find an ε-accurate threshold. Active algorithm: sample O(1/ε) unlabeled examples and do binary search, requesting only O(log 1/ε) labels. Exponential improvement.
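
A minimal sketch of the binary-search active learner for thresholds (illustrative Python; the pool size, label budget, and target threshold are assumed for the example):

```python
import random

def active_threshold(pool, label, budget):
    """Binary search over sorted unlabeled points, using ~log2(len(pool)) label queries.

    pool   -- unlabeled points drawn i.i.d. from D
    label  -- oracle returning the true label (0 or 1) of a point
    budget -- cap on the number of label requests
    """
    pts = sorted(pool)
    lo, hi = 0, len(pts) - 1
    queries = 0
    while lo < hi and queries < budget:
        mid = (lo + hi) // 2
        queries += 1
        if label(pts[mid]) == 0:   # target is 1[x >= t]; label 0 means the threshold lies to the right
            lo = mid + 1
        else:
            hi = mid
    return pts[lo], queries

# Hypothetical target threshold t* = 0.42.
pool = [random.random() for _ in range(10_000)]
t_hat, used = active_threshold(pool, label=lambda x: int(x >= 0.42), budget=20)
print(f"estimated threshold {t_hat:.3f} using {used} label queries")
```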

19 Provable Guarantees, Active Learning. My work: first noise-tolerant algorithm [Balcan, Beygelzimer, Langford, ICML06], generic (any class), adversarial label noise; first computationally efficient algorithms [Balcan-Long COLT13] [Awasthi-Balcan-Long STOC14]. Substantial follow-up work and lots of exciting activity in recent years [Hanneke07, DasguptaHsuMonteleoni07, Wang09, Fridman09, Koltchinskii10, BHW08, BeygelzimerHsuLangfordZhang10, Hsu10, Minsker12, Ailon12, …].

20 Margin Based Active Learning [Balcan-Long COLT13] [Awasthi-Balcan-Long STOC14]. Learning linear separators when D is log-concave in R^d. Realizable: exponential improvement in label complexity over passive learning [only O(d log 1/ε) labels to find w of error ε]. Agnostic: poly-time active learning algorithm outputs w with err(w) = O(η), where η = err(best linear separator). Unexpected implications for passive learning! Computational complexity: improves significantly on the best guarantees known for passive learning [KKMS05], [KLS09]. Sample complexity: resolves an open question about ERM.

21 Margin Based Active Learning, Realizable Case. Draw m_1 unlabeled examples, label them, add them to W(1). Iterate k = 2, …, s: find a hypothesis w_{k-1} consistent with W(k-1); set W(k) = W(k-1); sample m_k unlabeled examples x satisfying |w_{k-1}^T · x| ≤ γ_{k-1}; label them and add them to W(k).
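
A rough sketch of this loop (illustrative Python; the consistent-learner subroutine, the per-round label budget, and the margin schedule γ_k are placeholders the caller must supply, and this is a simplification of the procedure above rather than the authors' exact algorithm):

```python
import numpy as np

def margin_based_al(pool, label, fit_consistent, rounds, m_per_round, gammas, seed=0):
    """Sketch of margin-based active learning (realizable case).

    pool           -- unlabeled points, an (n, d) array assumed drawn i.i.d. from D
    label          -- oracle returning the true +/-1 label of a point
    fit_consistent -- routine returning a separator w consistent with a labeled set
    gammas         -- margin thresholds gamma_k, one per round, shrinking over time
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pool), m_per_round, replace=False)
    W = [(pool[i], label(pool[i])) for i in idx]             # W(1): random labeled sample
    w = fit_consistent(W)
    for k in range(1, rounds):
        band = [x for x in pool if abs(w @ x) <= gammas[k]]  # points inside the margin band
        if not band:
            break
        pick = rng.choice(len(band), min(m_per_round, len(band)), replace=False)
        W += [(band[i], label(band[i])) for i in pick]       # label only band points
        w = fit_consistent(W)                                # re-fit on the enlarged set W(k+1)
    return w
```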

22 Margin Based Active Learning, Realizable Case. Log-concave distributions: the log of the density function is concave; a wide class, e.g., the uniform distribution over any convex set, Gaussians, etc. Theorem: if D is log-concave in R^d, then after s = O(log 1/ε) rounds of active learning, using O(d log 1/ε) label requests in total and poly(d, 1/ε) unlabeled examples, err(w_s) ≤ ε; passive learning requires many more labels.

23 Analysis: Aggressive Localization. Inductive invariant: all w consistent with W(k) have err(w) ≤ 1/2^k.

24 Analysis: Aggressive Localization. Inductive invariant: all w consistent with W(k) have err(w) ≤ 1/2^k. [Figure: the margin band around w_{k-1}, contrasted with a suboptimal localization.]

25 Analysis: Aggressive Localization. Inductive invariant: all w consistent with W(k) have err(w) ≤ 1/2^k. [Figure: the region outside the margin band around w_{k-1} contributes error ≤ 1/2^{k+1}.]

26 Analysis: Aggressive Localization. Inductive invariant: all w consistent with W(k) have err(w) ≤ 1/2^k. Enough to ensure that the error contributed inside the margin band around w_{k-1} is also ≤ 1/2^{k+1}; for this, only O(d) labels (up to log factors) are needed in round k.

27 Margin Based Active Learning, Agnostic Case. Draw m_1 unlabeled examples, label them, add them to W. Iterate k = 2, …, s: find w_k in the ball B(w_{k-1}, r_{k-1}) of small τ_{k-1} hinge loss w.r.t. W (localization in concept space); clear the working set W; sample m_k unlabeled examples x satisfying |w_k · x| ≤ γ_k, label them, and add them to W (localization in instance space). End iterate. Analysis: exploit localization and a variance analysis to control the gap between the hinge loss and the 0/1 loss in each round.
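
A rough sketch of one round of this agnostic variant, assuming a generic constrained optimizer from SciPy; the radius r, hinge scale τ, and the labeled margin-band sample are treated as given inputs, and this is a simplification rather than the exact algorithm from [ABL14]:

```python
import numpy as np
from scipy.optimize import minimize

def agnostic_round(w_prev, labeled_band, r, tau):
    """One localization round: minimize the tau-scaled hinge loss over the ball B(w_prev, r).

    labeled_band -- (x, y) pairs with x inside the margin band of w_prev and y in {-1, +1}
    """
    X = np.array([x for x, _ in labeled_band])
    y = np.array([lab for _, lab in labeled_band])

    def hinge(w):
        # tau-scaled hinge loss, a convex surrogate for the 0/1 loss
        return np.mean(np.maximum(0.0, 1.0 - y * (X @ w) / tau))

    ball = {"type": "ineq", "fun": lambda w: r - np.linalg.norm(w - w_prev)}
    res = minimize(hinge, w_prev, constraints=[ball])   # stay within B(w_prev, r)
    w = res.x
    return w / np.linalg.norm(w)                        # keep a unit-length separator
```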

28 Improves over Passive Learning too! [Table: noise tolerance under malicious and agnostic noise, comparing passive learning (prior work [KLS09] vs. our work) and active learning, for which no prior guarantees existed (NA).]

29 Improves over Passive Learning too! [Table: under agnostic and malicious noise, our passive-learning guarantees are information-theoretically optimal, improving on the prior work [KKMS05] and [KLS09]; no prior active-learning guarantees (NA).] Slightly better results for the uniform distribution case.

30 Localization is both an algorithmic tool and an analysis tool, useful for active and passive learning!

31 Important direction: richer interactions between the learning algorithm and the expert, for fewer queries, natural interaction, and better accuracy.

32 New Types of Interaction: class conditional queries and mistake queries. [Figure: learning algorithm, raw data, expert labeler, and classifier, with example classes dog, cat, penguin, wolf.]

33 Class Conditional & Mistake Queries. Used in practice, e.g., Faces in iPhoto, but lacking theoretical understanding. Realizable case (folklore): far fewer queries than label requests. Our work [Balcan-Hanneke COLT12]: tight bounds on the number of CCQs needed to learn in the presence of noise (agnostic and bounded noise).

34 Important direction: richer interactions between the learning algorithm and the expert, for fewer queries, natural interaction, and better accuracy.

35 Interaction for Faster Algorithms. E.g., clustering protein sequences by function, where the expert is an expensive computer program: BLAST computes the similarity of one sequence to all other objects in the dataset (a one-vs-all query). Voevodski-Balcan-Röglin-Teng-Xia [UAI 10, SIMBAD 11, JMLR 12]: strong guarantees under a natural stability condition, producing accurate clusterings with only k one-vs-all queries. Better accuracy than state-of-the-art techniques using limited information on standard datasets (Pfam and SCOP).
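
To convey the flavor of clustering with k one-vs-all queries (a hedged sketch, not the authors' exact Landmark-Clustering procedure): choose k landmark objects, spend one expensive one-vs-all similarity query per landmark, and assign every object to its most similar landmark.

```python
import random

def landmark_cluster(objects, one_vs_all, k, seed=0):
    """Cluster using only k one-vs-all similarity queries (one per landmark).

    one_vs_all(o) -- expensive oracle (e.g., one BLAST run) returning {other: similarity} for o
    """
    rng = random.Random(seed)
    landmarks = rng.sample(objects, k)
    sims = {l: one_vs_all(l) for l in landmarks}   # the k expensive queries
    clusters = {l: [] for l in landmarks}
    for o in objects:
        nearest = max(landmarks, key=lambda l: sims[l].get(o, float("-inf")))
        clusters[nearest].append(o)                # assign o to its most similar landmark
    return clusters
```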

36 Interaction for Faster Algorithms. Voevodski-Balcan-Röglin-Teng-Xia [UAI 10, SIMBAD 11, JMLR 12]: better accuracy than state-of-the-art techniques using limited information on standard datasets (Pfam and SCOP).

37 Interactive Learning. Summary: first noise-tolerant, poly-time, label-efficient algorithms for high-dimensional cases [BL13] [ABL14], with cool implications for the sample and computational complexity of passive learning; learning with more general queries [BH12]. Related work: active and differentially private learning [Balcan-Feldman, NIPS13]; communication complexity and distributed learning [Balcan-Blum-Fine-Mansour, COLT12, Best Paper Runner-Up].

38 Learning Theory Lenses on Submodularity

39 Submodular functions: discrete functions that model laws of diminishing returns. Arise in optimization and operations research; algorithmic game theory [Lehmann-Lehmann-Nisan01], …; machine learning [Bilmes03], [Guestrin-Krause07], …; social networks [Kempe-Kleinberg-Tardos03].

40 New Lenses on Submodularity: Learning Submodular Functions [Balcan & Harvey, STOC 2011 & Nectar Track, ECML-PKDD 2012]. Novel structural results with implications for algorithmic game theory, economics, matroid theory, and discrete optimization. Applications: economics, social networks, …

41 Submodular functions. V = {1, 2, …, n}; a set function f : 2^V → R is submodular if for all T ⊆ S and x ∉ S, f(T ∪ {x}) − f(T) ≥ f(S ∪ {x}) − f(S). E.g., adding x to the smaller set T yields a large improvement, while adding it to the larger set S yields a small improvement.
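
As a sanity check on the definition, the following sketch verifies the diminishing-returns inequality for a small coverage function (a classic submodular example; the ground set and covered regions are made up for illustration):

```python
from itertools import combinations

def coverage_fn(areas):
    """Coverage function f(S) = size of the union of the regions covered by the elements of S."""
    def f(S):
        covered = set()
        for e in S:
            covered |= areas[e]
        return len(covered)
    return f

# Made-up ground set V = {0, 1, 2}, each element covering a small region.
areas = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d", "e"}}
f, V = coverage_fn(areas), list(areas)

# Check f(T ∪ {x}) - f(T) >= f(S ∪ {x}) - f(S) for every T ⊆ S and x ∉ S.
ok = True
for r in range(len(V) + 1):
    for S in map(set, combinations(V, r)):
        for q in range(len(S) + 1):
            for T in map(set, combinations(sorted(S), q)):
                for x in set(V) - S:
                    ok &= f(T | {x}) - f(T) >= f(S | {x}) - f(S)
print("diminishing returns holds on all checked pairs:", ok)
```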

42 Submodular functions, more examples. Concave functions: let h : R → R be concave; for each S ⊆ V, let f(S) = h(|S|). Vector spaces: let V = {v_1, …, v_n} with each v_i ∈ R^n; for each S ⊆ V, let f(S) = rank(V[S]).

43 Learning submodular functions. Influence functions in social networks: V = set of nodes, f(S) = expected influence when S is the originating set of nodes in the diffusion. Valuation functions in economics: V = all the items you sell, f(S) = valuation of the set of items S.

44 PMAC (Probably Mostly Approximately Correct) model for learning real-valued functions. Data source: distribution D on 2^[n]; target f : 2^[n] → R_+. The algorithm sees labeled examples (S_1, f(S_1)), …, (S_m, f(S_m)), with the S_i drawn i.i.d. from D, and outputs g : 2^[n] → R_+. Requirement: with probability ≥ 1 − δ, Pr_S[g(S) ≤ f(S) ≤ α·g(S)] ≥ 1 − ε.
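
To make the success criterion concrete, here is a tiny sketch that empirically estimates the multiplicative guarantee of a learned g on fresh samples; sampler, f, and g are placeholders supplied by the caller:

```python
def pmac_success_rate(sampler, f, g, alpha, trials=10_000):
    """Estimate Pr_S[g(S) <= f(S) <= alpha * g(S)] over fresh sets S ~ D drawn via sampler()."""
    hits = 0
    for _ in range(trials):
        S = sampler()                       # draw S from D
        if g(S) <= f(S) <= alpha * g(S):    # multiplicative alpha-approximation "sandwich"
            hits += 1
    return hits / trials
```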

45 Learning submodular functions [BH11]. Theorem (general lower bound): no algorithm can PMAC-learn the class of submodular functions with approximation factor õ(n^{1/3}), even if value queries are allowed, and even for rank functions of matroids. Corollary: matroid rank functions do not have a concise, approximate representation, a surprising answer to an open question in economics of Noam Nisan and Paul Milgrom. Theorem (general upper bound): poly-time algorithm for PMAC-learning (w.r.t. an arbitrary distribution) with approximation factor α = O(n^{1/2}).

46 A General Lower Bound. Theorem: no algorithm can PMAC-learn the class of non-negative, monotone, submodular functions with an approximation factor õ(n^{1/3}). Proof idea: construct a hard family of matroid rank functions, a vast generalization of partition matroids, with L = n^{log log n} sets A_1, A_2, …, A_L, each of which can be forced to have either low rank (log^2 n) or high rank (n^{1/3}).

47 Lipschitz Functions, Product Distributions. Algorithm: let μ = (1/m) Σ_{i=1}^m f(x_i) and let g be the constant function with value μ. For any Lipschitz function f with x drawn from a product distribution, classic tail bounds give concentration of f around its mean, but this yields a constant multiplicative error only if E[f] = Ω(√n). We can strengthen this if f is also submodular: approximation factor α = O(log(1/ε)) on a 1 − ε fraction of points.
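
A minimal sketch of this constant-function learner (purely illustrative; following the slide, g is simply the empirical mean of f on the training sample):

```python
def learn_constant(samples, f):
    """Output g == empirical mean of f; for Lipschitz submodular f under a product distribution,
    this constant is an O(log(1/eps))-approximation on a 1 - eps fraction of points."""
    mu = sum(f(S) for S in samples) / len(samples)
    return lambda S: mu
```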

48 Exploit Additional Properties. E.g., product distributions: [BH11] for Lipschitz submodular functions, [FV13] for general submodular functions. Learning valuation functions from AGT and economics. Learning influence functions in social networks.

49 Learning Valuation Functions [Balcan-Constantin-Iwata-Wang, COLT 2012]. Target-dependent learnability for classes from algorithmic game theory and economics: Additive ⊆ OXS ⊆ GS ⊆ Submodular ⊆ XOS ⊆ Subadditive. Theorem (poly number of XOR trees): O(n^ε) approximation in time O(n^{1/ε}). [Figure: an example XOS valuation written as an OR of XOR trees over priced item bundles.]

50 Learning the Influence Function [Du, Liang, Balcan, Song 2014]. The influence function in social networks is a coverage function. Maximum-likelihood approach; promising performance on the MemeTracker dataset (blog cascades: "apple and jobs", "tsunami earthquake", "william kate marriage").

51 New Lenses on Submodularity: Learning Submodular Functions [Balcan & Harvey, STOC 2011 & Nectar Track, ECML-PKDD 2012]. Novel structural results with implications for algorithmic game theory, economics, matroid theory, and discrete optimization. Follow-up work in algorithmic game theory, machine learning, discrete optimization, and privacy.

52 Discussion. Other Exciting Directions

53 Themes of My Research: foundations for modern ML and applications; a learning-theory lens on other fields and connections, e.g., game theory, approximation algorithms, matroid theory, discrete optimization, mechanism design, control theory.

54 Foundations for Modern Paradigms. Resource constraints, e.g.: computational efficiency (noise-tolerant algorithms), communication, human labeling effort, statistical efficiency, privacy/incentives. New approaches, e.g.: semi-supervised learning, distributed learning, interactive learning, never-ending learning, community detection, multi-task/transfer learning.

55 Analysis Beyond the Worst Case. Identify ways to capture the structure of real-world instances; analysis beyond the worst case. E.g., circumvent NP-hardness barriers for objective-based clustering (e.g., k-means, k-median). Perturbation resilience: small changes to the input distances shouldn't move the optimal solution by much. Approximation stability: near-optimal solutions shouldn't look too different from the optimal solution. [BBG09] [BB09] [BRT10] [BL12]

56 Analysis Beyond the Worst Case. Major direction: identify ways to capture the structure of real-world instances; analysis beyond the worst case. Understand the characteristics of practical tasks that make them easier to learn, and develop learning tools that take advantage of such characteristics.

57 Understanding & Influencing Multi-agent Systems. What do we expect to happen? Traditional: analyze equilibria (a static concept). More realistic: view agents as adapting, learning entities. Major question: can we guide or nudge dynamics to high-quality states quickly [using minimal intervention]?

58 Themes of My Research: foundations for modern ML and applications; a learning-theory lens on other fields and connections, e.g., game theory, approximation algorithms, matroid theory, discrete optimization, mechanism design, control theory.

