1 Smart Software with F# Joel Pobar Language Geek


1 Smart Software with F#
Joel Pobar, Language Geek
http://callvirt.net/blog

2 Agenda
What is it?
F# Intro
Algorithms: Search, Fuzzy Matching, Classification (SVM), Recommendations
Q&A

3 All This in 45 mins?
This is an awareness session!
Lots of content, very broad, very fast
You’ll get all demos, pointers, and slide deck to take offline and digest
Two takeaways:
  F# is a great language for data
  Smart algorithms aren’t hard: use them, explore more!

4 F# is...
a functional, object-oriented, imperative and explorative programming language for .NET
What is Functional Programming?
http://callvirt.net/jaoo.zip

5 What is Functional Programming?
Wikipedia: “A programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data”
-> Emphasizes functions
-> Emphasizes shapes of data, rather than implementation
-> Modeled on lambda calculus
-> Reduced emphasis on imperative
-> Safely raises level of abstraction

6 Motivation for Functional
Simplicity in life is good: cheaper, easier, faster, better.
We typically achieve simplicity in software in two ways:
  By raising the level of abstraction (OO was one design to raise abstraction)
  By increasing modularity
Increasing signal to noise is another good strategy:
  Communicate more in less time with more clarity
Better composition and modularity == reuse

7 Functional Programming
Safer, while still being useful
[Figure: languages plotted on two axes, Unsafe to Safe and Not Useful to Useful. Labels: C#, C++, …; V.Next#; Haskell; F#]

8 What is F# for?
F# is a general-purpose language
  Can be used for a broad range of programming tasks
  Superset of imperative and dynamic features
  Great for learning FP concepts
Some particularly important domains:
  Financial modeling and analysis
  Data mining
  Scientific data analysis
  Domain-specific modeling
  Academic

9 Let
‘Let’ binds values to identifiers
  let helloWorld = "Hello, World"
  printfn "%s" helloWorld
  let myNum = 12
  let myAddFunction x y =
      let sum = x + y
      sum
Type inference: the static typing of C# with the succinctness of a scripting language

10 Tuples
Simple, and most useful data structure
  let site1 = ("msdn.com", 10)
  let site2 = ("abc.net.au", 12)
  let site3 = ("news.com.au", 22)
  let allSites = (site1, site2, site3)
  let fst (a, b) = a
  let snd (a, b) = b

11 Lists, Arrays, Seq and Options
Lists & Arrays are first-class citizens
Options provide a some-or-nothing capability
  let list1 = ["Joel"; "Luke"]
  let array = [|2; 3; 5|]
  let myseq = seq [0; 1; 2]
  let option1 = Some("Joel")
  let option2 = None

12 Records
Simple concrete type definition
  type Person =
      { Name: string
        DateOfBirth: System.DateTime }
  // DateOfBirth is a System.DateTime, so it cannot be given as the string "13/04/81"
  let n = { Name = "Joel"; DateOfBirth = System.DateTime(1981, 4, 13) }

13 Immutability (by default)
Values may not be changed
Data is immutable by default

14 Discriminated Unions
Great for representing the structure of data
  type Make = string
  type Model = string
  type Transport =
      | Car of Make * Model
      | Bicycle
  let me = Car ("Holden", "Barina")
  let you = Bicycle
Both of these identifiers are of type “Transport”

15 Functions
Functions: like delegates + unified and simple
Deep type inference
  (fun x -> x + 1)
  let myFunc x = x + 1
  val myFunc : int -> int
  let rec factorial n = if n > 1 then n * factorial (n - 1) else 1
  let data = [5; 3; 4; 4; 5]
  // List.sortWith is the variant that takes a comparison function
  List.sortWith (fun x y -> x - y) data

16 Pattern Matching
  open System
  let (fst, _) = ("first", "second")
  Console.WriteLine(fst)
  let switchOnType (a: obj) =
      match a with
      | :? Int32 -> printfn "int!"
      | :? Transport -> printfn "Transport"
      | _ -> printfn "Everything Else!"
Very important part of F#
Helps deal with the ‘teasing apart’ of data
Works best with Discriminated Unions & Records
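
The slide says pattern matching works best with discriminated unions, so here is a minimal sketch of that case using the Transport type from slide 14. The `describe` function is a hypothetical helper for illustration and is not part of the original deck.

```fsharp
type Make = string
type Model = string

type Transport =
    | Car of Make * Model
    | Bicycle

// Hypothetical helper (not from the deck): tease a Transport value apart.
let describe (t: Transport) : string =
    match t with
    | Car (make, model) -> sprintf "Car: %s %s" make model
    | Bicycle -> "Bicycle"

printfn "%s" (describe (Car ("Holden", "Barina")))
printfn "%s" (describe Bicycle)
```

Because the match lists every case of the union, the compiler can warn if a new case is added later and not handled.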

17 Lists, Types, Interactive

18 Search
Given a search term and a large document corpus, rank and return a list of the most relevant results…

19 Blog Crawler

20 Search
Words
  Stemming?
  Tokenize? E.g. ‘Python/Ruby’
Markup
  Title, Author, Date
  Headings (h1, h2 etc.)
  Paragraphs
Links
  A sign of strength?
Let’s explore something simple…

21 Search
Simplify:
  For easy machine/language manipulation
  … and most importantly, easy computation
Vectors: nature’s own quality data structure
  Convenient machine representation (lists/arrays)
  Lots of existing vector math algorithms
Example text: “After a loving incubation period, moonlight 2.0 has been released. source code FireFox binaries … after”
Term counts: 2 after, 1 incubation, 1 loving, 6 moonlight, 4 firefox, 6 linux, 2 binaries

22 Term Count
Document1 (Linux post): 9 the, 1 incubation, 1 crazy, 6 moonlight, 4 firefox, 6 linux, 2 penguin
Document2 (Animal post): 2 the, 2 crazy, 1 dog, 5 penguin
Vector space:
  Term        Document1  Document2
  the         9          2
  incubation  1          0
  crazy       1          2
  moonlight   6          0
  firefox     4          0
  linux       6          0
  dog         0          1
  penguin     2          5

23 Term Count Issues
Query: ‘the dog penguin’
  Linux: 9 + 0 + 2 = 11
  Animal: 2 + 1 + 5 = 8
‘the’ is overweight
Enter TF-IDF: Term Frequency Inverse Document Frequency
  A weight to evaluate how important a word is to a document in a corpus
  i.e. if ‘the’ occurs in 98% of all documents, we shouldn’t weight it very highly in the total query
Vectors over (the, incubation, crazy, moonlight, firefox, linux, dog, penguin):
  Linux = (9, 1, 1, 6, 4, 6, 0, 2)
  Animal = (2, 0, 2, 0, 0, 0, 1, 5)
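
The raw term-count scoring that this slide criticises can be sketched as follows. The counts come from the Term Count slide; the names `linuxPost`, `animalPost` and `termCountScore` are hypothetical and chosen only for this illustration.

```fsharp
// Term counts taken from the Term Count slide.
let linuxPost =
    Map.ofList
        [ ("the", 9); ("incubation", 1); ("crazy", 1); ("moonlight", 6)
          ("firefox", 4); ("linux", 6); ("penguin", 2) ]

let animalPost =
    Map.ofList [ ("the", 2); ("crazy", 2); ("dog", 1); ("penguin", 5) ]

// Hypothetical helper: sum the raw counts of each query term in a document.
// Terms that do not occur in the document count as 0.
let termCountScore (doc: Map<string, int>) (query: string) : int =
    query.Split(' ')
    |> Array.sumBy (fun term -> defaultArg (Map.tryFind term doc) 0)

let query = "the dog penguin"
printfn "Linux: %d" (termCountScore linuxPost query)   // 9 + 0 + 2 = 11
printfn "Animal: %d" (termCountScore animalPost query) // 2 + 1 + 5 = 8
```

The Linux post wins a query about dogs and penguins purely because it contains ‘the’ nine times, which is the problem the slide points out.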

24 TF-IDF
Normalise the term count:
  tf = termCount / docWordCount
Measure importance of term:
  idf = log(|D| / termDocumentCount)
  where |D| is the total number of documents in the corpus
  tfidf = tf * idf
A high weight is reached by a high term frequency and a low document frequency
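
A minimal sketch of the three formulas above, assuming each document is already tokenized into a list of words. The sample corpus is made up for illustration, and returning 0.0 for a term that appears in no document is an assumption of this sketch, not something the slide specifies.

```fsharp
// A tiny illustrative corpus (made up for this sketch): each document is a list of words.
let corpus : string list list =
    [ [ "the"; "linux"; "penguin"; "the"; "moonlight" ]
      [ "the"; "dog"; "penguin" ]
      [ "the"; "firefox"; "linux" ] ]

// tf = termCount / docWordCount
let tf (term: string) (doc: string list) : float =
    let termCount = doc |> List.filter (fun w -> w = term) |> List.length
    float termCount / float (List.length doc)

// idf = log (|D| / termDocumentCount), using the natural logarithm.
// Assumption: a term found in no document gets an idf of 0.0 instead of dividing by zero.
let idf (term: string) (docs: string list list) : float =
    let termDocumentCount = docs |> List.filter (List.contains term) |> List.length
    if termDocumentCount = 0 then 0.0
    else log (float (List.length docs) / float termDocumentCount)

// tfidf = tf * idf
let tfidf (term: string) (doc: string list) (docs: string list list) : float =
    tf term doc * idf term docs

printfn "the: %f" (tfidf "the" corpus.[0] corpus)
printfn "dog: %f" (tfidf "dog" corpus.[1] corpus)
```

In this corpus ‘the’ occurs in every document, so its idf is log 1 = 0 and it no longer contributes to a query score, while ‘dog’ keeps a positive weight.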

25 Search Engine in under 10 mins

26 Fuzzy Matching
String similarity algorithms:
  Soundex; Metaphone; Jaro-Winkler distance; cosine similarity; Sellers; Euclidean distance; …
We’ll look at the Levenshtein distance algorithm
Defined as: the minimum number of edit operations that transform string1 into string2

27 Fuzzy Matching
Edit costs:
  In-place copy: cost 0
  Delete a character in string1: cost 1
  Insert a character in string2: cost 1
  Substitute a character for another: cost 1
Transform ‘kitten’ into ‘sitting’:
  kitten -> sitten (cost 1: replace k with s)
  sitten -> sittin (cost 1: replace e with i)
  sittin -> sitting (cost 1: add g)
Levenshtein distance: 3
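
The edit costs above can be sketched as a standard two-row dynamic programming routine. This is an illustrative implementation, not necessarily the one used in the talk's demo, and the function name `levenshtein` is an assumption.

```fsharp
let levenshtein (s1: string) (s2: string) : int =
    // prev.[j] is the distance between the processed prefix of s1 and the first j chars of s2.
    // Initially the prefix of s1 is empty, so the distance is j insertions.
    let mutable prev = Array.init (s2.Length + 1) id
    for i in 1 .. s1.Length do
        let curr = Array.zeroCreate<int> (s2.Length + 1)
        curr.[0] <- i // deleting the first i chars of s1
        for j in 1 .. s2.Length do
            // In-place copy costs 0, substitution costs 1.
            let substitution = prev.[j - 1] + (if s1.[i - 1] = s2.[j - 1] then 0 else 1)
            let deletion = prev.[j] + 1
            let insertion = curr.[j - 1] + 1
            curr.[j] <- min substitution (min deletion insertion)
        prev <- curr
    prev.[s2.Length]

printfn "%d" (levenshtein "kitten" "sitting") // 3, matching the slide
```

Working on integer arrays rather than building intermediate strings keeps allocations small, which matters when many words are compared.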

28 Fuzzy Matching
Estimated string similarity computation costs:
  Hard on the GC: lots of temporary strings are created and thrown away, so use arrays if possible.
  Levenshtein can be computed in O(kl) time, where ‘l’ is the length of the shortest string, and ‘k’ is the maximum distance.
  Parallelisable: split the set of words to compare across n cores.
  Can do approximately 10,000 compares per second on a standard single-core laptop.

29 Did You Mean?

30 Classification
Support Vector Machines (SVM)
  Supervised learning for binary classification
  Training inputs: ‘in’ and ‘out’ vectors
  SVM will then find a separating ‘hyperplane’ in an n-dimensional space
Training is costly, but classification is cheap
Can retrain on the fly in some cases

31 SVM Classification

32 SVM Issues
Classification on 2 dimensions is easy, but most input is multi-dimensional
Some ‘tricks’ are needed to transform the input data

33 SVM Classifier

34 F# and Algorithms
Netflix Demo
Netflix Prize: $1 million USD
  Must beat Netflix prediction algorithm by 10%
  480k users
  100 million ratings
  18,000 movies
Great example of deriving value out of large datasets
Earns Netflix loads and loads of $$$!

35 Nearest Neighbour
Find neighbours who like what I like
  MovieId  CustomerId  Rating
  Clerks   444444      5
  Clerks   2093393     4
  Clerks   999         5
  Clerks   8668478     1
  Dogma    2432114     3
  Dogma    444444      5
  Dogma    999         5
  ...

36 Netflix Data Format
Netflix Demo
  MovieId  CustomerId  Rating
  Clerks   444444      5
  Clerks   2093393     4
  Clerks   999         5
  Clerks   8668478     1
  Dogma    2432114     3
  Dogma    444444      5
  Dogma    999         5
  ...

37 Nearest Neighbour Algorithm
Find all my neighbours’ movies
Find the best movies my neighbours agree on
[Table: ratings matrix with one row per CustomerId and one column per movie; the cell values are not recoverable from the transcript]
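
The two steps on this slide can be sketched in a simplified form. This is not the demo's implementation: the ratings below are made up, the "rated 4 or higher" threshold is an assumption, neighbours are chosen by simple overlap rather than by a distance metric, and the names `liked`, `neighbours` and `recommend` are hypothetical.

```fsharp
// Illustrative ratings (made up for this sketch): movie, customerId, rating 1-5.
let ratings : (string * int * int) list =
    [ ("Clerks", 1, 5); ("Dogma", 1, 5)
      ("Clerks", 2, 5); ("Dogma", 2, 4); ("Mallrats", 2, 5)
      ("Clerks", 3, 4); ("Mallrats", 3, 4); ("Titanic", 3, 2)
      ("Titanic", 4, 5); ("Clerks", 4, 1) ]

// Movies a customer rated 4 or higher (the threshold is an assumption).
let liked (customer: int) : Set<string> =
    ratings
    |> List.filter (fun (_, c, r) -> c = customer && r >= 4)
    |> List.map (fun (m, _, _) -> m)
    |> Set.ofList

// Neighbours: other customers who like at least one movie I like.
let neighbours (me: int) : int list =
    let mine = liked me
    ratings
    |> List.map (fun (_, c, _) -> c)
    |> List.distinct
    |> List.filter (fun c -> c <> me && not (Set.isEmpty (Set.intersect mine (liked c))))

// Recommend movies I have not rated, best average neighbour rating first.
let recommend (me: int) : string list =
    let seen =
        ratings
        |> List.filter (fun (_, c, _) -> c = me)
        |> List.map (fun (m, _, _) -> m)
        |> Set.ofList
    let ns = Set.ofList (neighbours me)
    ratings
    |> List.filter (fun (m, c, _) -> ns.Contains c && not (seen.Contains m))
    |> List.groupBy (fun (m, _, _) -> m)
    |> List.map (fun (m, rs) -> (m, List.averageBy (fun (_, _, r) -> float r) rs))
    |> List.sortByDescending snd
    |> List.map fst

printfn "%A" (recommend 1)
```

For customer 1, customers 2 and 3 qualify as neighbours, and the unseen movie they agree on most is ranked first.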

38 Netflix Recommendations

39 A Short Stop-over at Vector Math
[Figure: points A (x1, y1), B (x2, y2) and C (x0, y0)]
If we want to calculate the distance between A and B, we call on Euclidean distance.
We can represent the points in the same way using vectors: magnitude and direction.
Having this vector representation allows us to work in ‘n’ dimensions, yet still achieve Euclidean distance/angle calculations.
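
A minimal sketch of the distance and angle calculations mentioned above, for vectors of any dimension represented as float arrays. The function names are assumptions for illustration; both vectors are assumed to have the same length, and `Array.map2` raises an exception if they do not.

```fsharp
// Euclidean distance between two points given as n-dimensional vectors.
let euclideanDistance (a: float[]) (b: float[]) : float =
    Array.map2 (fun x y -> (x - y) * (x - y)) a b |> Array.sum |> sqrt

// Dot product of two vectors.
let dot (a: float[]) (b: float[]) : float =
    Array.map2 (fun x y -> x * y) a b |> Array.sum

// Magnitude (length) of a vector.
let magnitude (a: float[]) : float = sqrt (dot a a)

// Cosine of the angle between two non-zero vectors.
let cosineSimilarity (a: float[]) (b: float[]) : float =
    dot a b / (magnitude a * magnitude b)

printfn "%f" (euclideanDistance [| 0.0; 0.0 |] [| 3.0; 4.0 |]) // 3-4-5 triangle
printfn "%f" (cosineSimilarity [| 1.0; 0.0 |] [| 0.0; 1.0 |])  // perpendicular vectors
```

The same functions work unchanged on the term-count vectors from the search slides, which is why the vector representation is convenient.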

40 Q & A
Any questions?
http://callvirt.net/
joelpobar@gmail.com
THANKS!

