Utilising software to enhance your research. Eamonn Hynes, 5th November 2012.


1 Utilising software to enhance your research. Eamonn Hynes, 5th November 2012

2 Basic statistics and some parallel computing

3 Basic statistics
Probability
Mean
Standard deviation
Simple examples:
- Probability of just one six from three throws of a die?
- Probability of winning the Lotto
Tougher problems:
- Transcribing speech into words
- Poker robot that plays optimally
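The two simple examples can be worked out with a few lines of Python. This is a minimal sketch: the die question is a binomial probability, and the Lotto figure assumes a hypothetical 6-from-45 draw, since the slide does not say which format is meant. The function names are illustrative only.

```python
from fractions import Fraction
from math import comb


def exactly_k_successes(n, k, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def jackpot_probability(pool, picks):
    """Chance that one ticket matches all `picks` numbers drawn from `pool`."""
    return Fraction(1, comb(pool, picks))


# Exactly one six in three throws of a fair die: 3 * (1/6) * (5/6)^2
one_six = exactly_k_successes(3, 1, Fraction(1, 6))
print(one_six, float(one_six))  # 25/72, roughly 0.347

# ASSUMPTION: a 6-from-45 lottery; change the arguments for other formats.
print(jackpot_probability(45, 6))  # 1/8145060
```

Using `Fraction` keeps the answers exact, which makes it easy to check them by hand.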

4 Hands on

32.23637092   97.28901502   17.18193342   54.07139129   79.94693169
10.75847347   9.181475797   74.21629289   21.54456163   68.44443827
53.21263023   48.81160488   111.6626886   3.235296945   27.0289035
31.8835086    0.420613252   116.4116864   118.3075385   95.714701
36.87424971   27.3471604    27.41005975   53.36123774   28.25817912
52.6501485    0.161171659   59.81456149   119.1601726   45.9590348
14.77263893   23.44389918   46.59679683   94.48144298   110.4074521
61.35066728   17.65719223   64.90891075   0.910675387   106.1256412
87.54080043   33.22104676   32.82406797   84.27309998   49.8711702
30.18458359   21.67332212   8.471099335   87.48423244   39.41019714

Mean of column 1? Mean of row 4? Standard deviation of column 3?
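The three questions can be answered with Python's standard `statistics` module. This sketch assumes the slide's numbers form a grid of ten rows by five columns; that split is inferred from the flattened digits, not confirmed by the source. The slide does not say whether the population or the sample standard deviation is wanted, so both are computed.

```python
import statistics

# ASSUMPTION: ten rows of five values, as read from the slide.
GRID = [
    [32.23637092, 97.28901502, 17.18193342, 54.07139129, 79.94693169],
    [10.75847347, 9.181475797, 74.21629289, 21.54456163, 68.44443827],
    [53.21263023, 48.81160488, 111.6626886, 3.235296945, 27.0289035],
    [31.8835086, 0.420613252, 116.4116864, 118.3075385, 95.714701],
    [36.87424971, 27.3471604, 27.41005975, 53.36123774, 28.25817912],
    [52.6501485, 0.161171659, 59.81456149, 119.1601726, 45.9590348],
    [14.77263893, 23.44389918, 46.59679683, 94.48144298, 110.4074521],
    [61.35066728, 17.65719223, 64.90891075, 0.910675387, 106.1256412],
    [87.54080043, 33.22104676, 32.82406797, 84.27309998, 49.8711702],
    [30.18458359, 21.67332212, 8.471099335, 87.48423244, 39.41019714],
]


def column(grid, number):
    """Return column `number` (1-based, as on the slide) as a list."""
    return [row[number - 1] for row in grid]


col1_mean = statistics.mean(column(GRID, 1))
row4_mean = statistics.mean(GRID[3])  # row 4 is index 3
col3_pstdev = statistics.pstdev(column(GRID, 3))  # population form
col3_stdev = statistics.stdev(column(GRID, 3))    # sample form (n - 1)

print("Mean of column 1:", col1_mean)
print("Mean of row 4:", row4_mean)
print("Std dev of column 3 (population, sample):", col3_pstdev, col3_stdev)
```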

5 Standard deviation 13.6%

6 A billion numbers? Single-core Multi-core Eight cores Single core Memory
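One way to picture the single-core versus multi-core split is to compute the mean from per-chunk partial sums. The sketch below is an assumption about how the work might be divided, not the presenter's code: each chunk yields a (sum, count) pair, and the pairs are combined at the end. The `map_fn` parameter is a hypothetical hook so the same function runs on one core with the built-in `map` or across cores with a process pool's `map`.

```python
def partial_sum(chunk):
    """Work done independently on each chunk: its sum and its length."""
    return sum(chunk), len(chunk)


def split(values, n_chunks):
    """Cut `values` into at most `n_chunks` contiguous slices."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def chunked_mean(values, n_chunks=8, map_fn=map):
    """Mean computed from per-chunk partial results.

    With the default `map_fn=map` everything runs on a single core.
    Passing the `map` method of a concurrent.futures.ProcessPoolExecutor
    would hand each chunk to a separate worker process instead.
    """
    if not values:
        raise ValueError("cannot take the mean of an empty sequence")
    partials = list(map_fn(partial_sum, split(values, n_chunks)))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


print(chunked_mean(list(range(1, 101)), n_chunks=8))  # 50.5
```

Because only the small (sum, count) pairs are combined, the chunks could also be read from disk one at a time when a billion numbers do not fit in memory.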

7 More interesting example
Again, a large sequence of numbers
Speech signal
~56 different sounds
Task is to calculate the most likely sequence of words
Over 50 years of research

8 Moore’s Law

9 Demise of Moore’s Law Reality

10 Moore’s Law
The solution:
– Parallel architectures
– Hybrid architectures
– New software (harder to write)
– New programming paradigms
– Dedicated hardware
– Beyond silicon

11 Amdahl’s Law
Limitations on parallel code
– Thankfully, a large number of problems are parallel in nature (rendering 3D graphics, weather prediction, image processing, DNA matching)
– But many problems are sequential in nature, e.g. a card game, a legal process, ordering a laptop
– For those, there is nothing we can do except increase the clock rate!
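Amdahl's Law puts a number on this limit: if a fraction p of a program can run in parallel on n cores, the overall speedup is 1 / ((1 - p) + p / n). A minimal sketch, with illustrative function and parameter names:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Speedup predicted by Amdahl's Law.

    parallel_fraction: share of the run time that can be parallelised (0..1)
    n_cores: number of cores the parallel part is spread over
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# The serial part caps the gain no matter how many cores are added.
for p in (0.5, 0.9, 0.99):
    print(f"p={p}: 8 cores -> {amdahl_speedup(p, 8):.2f}x, "
          f"1000 cores -> {amdahl_speedup(p, 1000):.2f}x")
```

A fully sequential task (p = 0) gets no benefit from extra cores, which is the case the last bullet describes.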

12 Clustering

13 Categorise data into groups
Important in many fields: speech, medical statistics, data mining, etc.
Very loose algorithm (k-means clustering):
– Let each point be a cluster centroid
– Pick a random point
– Get point closest to this chosen point
– Calculate centroid
– Repeat until just k centroids
Big limitation: k must be specified in advance…
Example
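The loose steps above can be sketched directly in Python. Note that this follows the slide's merge-style description (start with every point as a centroid and merge until k remain), which differs from the textbook iterative k-means procedure; the weighting of merged centroids by cluster size is an assumption added here so that each centroid stays the mean of its points.

```python
import math
import random


def merge_to_k_centroids(points, k, seed=0):
    """Merge clusters pairwise until only k centroids are left.

    points: list of coordinate tuples, all of the same dimension
    k: number of centroids wanted (must be chosen in advance)
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)
    # Each cluster is (centroid, number of points it represents).
    clusters = [(tuple(p), 1) for p in points]
    while len(clusters) > k:
        # Pick a random cluster ...
        centroid_a, size_a = clusters.pop(rng.randrange(len(clusters)))
        # ... find the cluster whose centroid is closest to it ...
        nearest = min(range(len(clusters)),
                      key=lambda i: math.dist(centroid_a, clusters[i][0]))
        centroid_b, size_b = clusters.pop(nearest)
        # ... and replace the pair by their size-weighted centroid.
        total = size_a + size_b
        merged = tuple((a * size_a + b * size_b) / total
                       for a, b in zip(centroid_a, centroid_b))
        clusters.append((merged, total))
    return [centroid for centroid, _ in clusters]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(merge_to_k_centroids(points, k=2))
```

Because the starting point of each merge is random, different seeds can give different groupings, and the result still depends on the k supplied.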

14 Clustering
Not just for points on a 2D surface
Pixels of an image
Example

15 Support Vector Machines
Support vector machines (SVMs)
– Popular in the 1990s/2000s (Vapnik et al. 1992)
– Non-linear classification
– Beautiful maths
Find a nonlinear boundary between k sets of points
Example

16 Text analysis

17 Searching documents task
Naïve search:
– SQL query: “SELECT * FROM articles WHERE body LIKE '%$keyword%';”
– Works fine for small document collections
Large databases:
– Better to index all documents
– tf-idf
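The naïve search can be tried with Python's built-in `sqlite3` module. This is a small sketch: the `articles` table and `body` column come from the slide, while the sample rows and the in-memory database are made up for illustration. The keyword is passed as a bound parameter rather than pasted into the SQL string, which avoids SQL injection.

```python
import sqlite3


def naive_search(conn, keyword):
    """Return every article whose body contains `keyword` (full table scan)."""
    pattern = f"%{keyword}%"
    cursor = conn.execute(
        "SELECT * FROM articles WHERE body LIKE ?", (pattern,)
    )
    return cursor.fetchall()


# Illustrative data only; a real collection would live in a database file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (body TEXT)")
conn.executemany(
    "INSERT INTO articles (body) VALUES (?)",
    [("A rose is a rose",), ("Tulips in spring",), ("The rose garden",)],
)

hits = naive_search(conn, "rose")
print(hits)
```

A leading `%` wildcard forces the database to read every row, which is why this approach stops scaling once the collection grows.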

18 Text analysis
Process each document
Calculate the frequency of each word
Store the index, not the entire document
Much faster document retrieval
Intuitive to pick document with highest term count
Must weight each document by the inverse document frequency
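A word-frequency index of this kind can be sketched with a dictionary that maps each word to the documents containing it. The tokenizer, the function names and the sample documents below are assumptions for illustration; ranking here uses the raw term count only, without the inverse document frequency weighting.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each word to {doc_id: number of occurrences in that document}.

    documents: dict of doc_id -> full text. Only the counts are stored.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][doc_id] = count
    return index


def lookup(index, word):
    """Documents containing `word`, highest term count first."""
    postings = index.get(word.lower(), {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)


docs = {
    "doc1": "A rose is a rose is a rose",
    "doc2": "The rose garden opens in spring",
    "doc3": "Tulips bloom in spring",
}
index = build_index(docs)
print(lookup(index, "rose"))  # [('doc1', 3), ('doc2', 1)]
```

A lookup is now a single dictionary access instead of a scan over every document.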

19 Text analysis Example: Simple Boolean logic Searching for “rose” If word appears, then document is relevant

20 Text analysis Taking term frequencies into account

21 Text analysis
TFIDF = TF * IDF, where:
TF = C/T, where C = number of times a given word appears in a document and T = total number of words in the document
IDF = D/DF, where D = total number of documents in the corpus and DF = number of documents containing the given word
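The formula translates almost line for line into Python. The sketch below uses the slide's symbols (C, T, D, DF) as variable names. The `log_idf` switch is an addition not on the slide: many implementations take the logarithm of D/DF to damp the weight of very rare words. The toy corpus is illustrative.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus_tokens, log_idf=False):
    """TFIDF = (C / T) * (D / DF) for one word in one document.

    doc_tokens: list of words in the document being scored
    corpus_tokens: list of token lists, one per document in the corpus
    log_idf: if True, use log(D / DF) instead (not part of the slide)
    """
    c = Counter(doc_tokens)[word]   # times the word appears in the document
    t = len(doc_tokens)             # total words in the document
    d = len(corpus_tokens)          # total documents in the corpus
    df = sum(1 for tokens in corpus_tokens if word in tokens)
    if t == 0 or df == 0:
        return 0.0
    idf = d / df
    if log_idf:
        idf = math.log(idf)
    return (c / t) * idf


corpus = [["rose", "red", "rose"], ["tulip", "red"], ["daisy"]]
print(tf_idf("rose", corpus[0], corpus))  # "rose" is frequent here and rare elsewhere
print(tf_idf("red", corpus[0], corpus))   # "red" appears in two documents, so it scores lower
```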

22 Text analysis Natural language follows a Zipfian distribution

23 Finally

24 Deep belief networks
Given a document, how to find similar documents?
Deep belief networks (DBNs)
State-of-the-art in machine learning
More advanced than Latent Semantic Analysis (LSA), Principal Component Analysis (PCA) and clustering

25 Deep belief networks
2000 most common word stems fed into base layer
Gradual reduction in number of neurons
Left with a 30-digit binary representation of a document with 2000-dimension feature vector
Super fast document retrieval (“semantic hashing”)
Images from G. Hinton, Science (2006)
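Once every document has a short binary code, finding similar documents reduces to comparing bit patterns. The sketch below covers only that retrieval step and assumes the codes already exist (it does not train a network). It scans all codes and keeps those within a small Hamming distance, which is a simplification: semantic hashing proper treats the code as a memory address and probes nearby addresses. The names and the 4-bit example codes are illustrative.

```python
def hamming_distance(code_a, code_b):
    """Number of bit positions in which two integer codes differ."""
    return bin(code_a ^ code_b).count("1")


def similar_documents(query_code, codes, max_distance=2):
    """Ids of documents whose code is within `max_distance` bits of the query.

    codes: dict of doc_id -> binary code stored as an int
    Results are ordered from closest to furthest.
    """
    scored = sorted(
        (hamming_distance(query_code, code), doc_id)
        for doc_id, code in codes.items()
    )
    return [doc_id for distance, doc_id in scored if distance <= max_distance]


# Short 4-bit codes for readability; the slide describes 30-bit codes.
codes = {"doc_a": 0b1010, "doc_b": 0b1011, "doc_c": 0b0101}
print(similar_documents(0b1010, codes, max_distance=1))  # ['doc_a', 'doc_b']
```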
