HPC in linguistic research
Andrew Meade, University of Reading
a.meade@reading.ac.uk

HPC use in linguistic research
  Linguistic and biological models
  Phylogenies
  Linguistic data
  Models of evolution
  Parallelism
  Scaling
  Results
  Ongoing work
  Key challenges

Linguistic and biological systems
  Attribute                         Genetics                                  Linguistics
  Discrete units                    nucleotides, codons, genes, individuals   words, grammar, syntax
  Replication                       transcription                             teaching, learning, imitation
  Dominant mode(s) of inheritance   parent-offspring, clonal                  parent-offspring, peer groups, teaching
  Horizontal transmission           many mechanisms                           borrowing
  Mutation                          many mechanisms: SNPs, mobile DNA         mistakes, vowel shifts, innovation
  Selection                         fitness differences among alleles         ?

Inferring evolutionary histories from linguistic data
  Evolutionary histories (phylogenies)
    Tools for understanding evolution
    Depict relationships between languages
    Identify groups that share a common ancestor
    Calculate the timing of events
    Account for lack of independence in the data
  Inferred from data taken from different languages
  Using an explicit statistical model of evolution
  Problem is NP-hard: the number of possible trees grows as a double factorial of the number of languages
  Markov chain Monte Carlo search methods, heuristic search, hill climbers
  Product of data + model
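To give a feel for the double-factorial growth, here is a minimal sketch that counts rooted, bifurcating, leaf-labelled trees using the standard (2n - 3)!! formula. The function names are illustrative, and the taxa counts 16, 87 and 400 are taken from the problem sizes quoted later in the talk.

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1 or 2 (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_tree_count(n_taxa):
    """Number of rooted, bifurcating, leaf-labelled trees: (2n - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    return double_factorial(2 * n_taxa - 3)


# Print how many decimal digits the tree count has for each problem size.
for n in (4, 16, 87, 400):
    print(n, "taxa:", len(str(rooted_tree_count(n))), "digits")
```

Exhaustive search is hopeless well before 16 languages, which is why the talk relies on MCMC and heuristic searches.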

[Figure: language phylogeny with groups labelled Greek, Indo-Iranian, Slavic, Germanic, Celtic and Romance]

The data
  Swadesh list (Morris Swadesh, 1940 onwards)
  200 meanings, present in (almost) all languages
  Chosen to be stable, slowly evolving and resistant to borrowing
  Somewhat of a language “gene”

Cognate classes
  Words with a common evolutionary ancestry and meaning
  Fish                     Ryba
  English     Fish         Czech       Ryba
  Danish      Fisk         Russian     Ryba
  Dutch       Visch        Bulgarian   Riba
  23 other languages       34 other languages

Data coding: cognates
  Cognates: words and meanings that are derived from a common ancestor
  Languages evolve by a process of descent with modification
  Language   when     water
  English    when     water
  German     wann     wasser
  French     quand    eau
  Italian    quando   acqua
  Greek      qote     nero
  Hittite    kuwapi   watar
  Binary coding, one column per cognate class:
  Language   when   water
  English    1      1 0 0
  German     1      1 0 0
  French     1      0 1 0
  Italian    1      0 1 0
  Greek      1      0 0 1
  Hittite    1      1 0 0
  “Water”: 3 cognates; “When”: 1 cognate
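The coding step can be sketched as follows. This is an illustration only: the class labels "A", "B" and "C" and the function name `code_cognates` are made up here, and a language with no entry for a meaning is simply coded 0, which a real analysis would more likely treat as missing data.

```python
def code_cognates(cognate_sets):
    """Turn {meaning: {language: class_label}} into a 0/1 presence matrix.

    Returns (columns, matrix): columns is a list of (meaning, class_label)
    pairs and matrix maps each language to its row of 0/1 values.
    """
    # One column per cognate class within each meaning.
    columns = []
    for meaning, assignment in cognate_sets.items():
        for label in sorted(set(assignment.values())):
            columns.append((meaning, label))

    # Keep languages in order of first appearance.
    languages = []
    for assignment in cognate_sets.values():
        for language in assignment:
            if language not in languages:
                languages.append(language)

    matrix = {
        language: [
            1 if cognate_sets[meaning].get(language) == label else 0
            for meaning, label in columns
        ]
        for language in languages
    }
    return columns, matrix


# Cognate judgements from the slide; class labels are hypothetical.
cognate_sets = {
    "when": {"English": "A", "German": "A", "French": "A",
             "Italian": "A", "Greek": "A", "Hittite": "A"},
    "water": {"English": "A", "German": "A", "French": "B",
              "Italian": "B", "Greek": "C", "Hittite": "A"},
}
columns, matrix = code_cognates(cognate_sets)
for language, row in matrix.items():
    print(language, row)
```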

Continuous-time Markov model
  Two states: 0 = non-cognate, 1 = cognate
  Q01: rate at which cognates are gained (0 to 1)
  Q10: rate at which cognates are lost (1 to 0)
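For reference, the two rates define the rate matrix of a two-state continuous-time Markov chain, and its transition probabilities over a branch have a standard closed form. The symbols q_{01} and q_{10} stand for the slide's Q01 and Q10; the branch length t and the matrix layout (rows are the starting state) are notation introduced here, not taken from the slides.

```latex
Q =
\begin{pmatrix}
 -q_{01} & q_{01} \\
  q_{10} & -q_{10}
\end{pmatrix},
\qquad
P(t) = e^{Qt} =
\frac{1}{q_{01} + q_{10}}
\begin{pmatrix}
 q_{10} + q_{01}\, e^{-(q_{01}+q_{10})t} & q_{01}\bigl(1 - e^{-(q_{01}+q_{10})t}\bigr) \\
 q_{10}\bigl(1 - e^{-(q_{01}+q_{10})t}\bigr) & q_{01} + q_{10}\, e^{-(q_{01}+q_{10})t}
\end{pmatrix}
```

As t grows, each row of P(t) tends to the stationary distribution (q_{10}, q_{01}) / (q_{01} + q_{10}).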

The likelihood model
  Product over the model: 1 to 12 categories
  Product over the data: 200 to 100,000 sites
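The slide's equation did not survive, so the following is only a sketch of how such a likelihood is commonly computed, not the author's code. It uses Felsenstein's pruning algorithm on a fixed tree with the two-state gain/loss model, multiplies over sites (summing logs), and assumes the model categories are rate multipliers averaged with equal weight. The tree encoding, function names, branch lengths and example data are all hypothetical.

```python
import math


def transition_matrix(q01, q10, t):
    """Closed-form P(t) for the two-state chain; rows are the starting state."""
    total = q01 + q10
    decay = math.exp(-total * t)
    return [
        [(q10 + q01 * decay) / total, q01 * (1.0 - decay) / total],
        [q10 * (1.0 - decay) / total, (q01 + q10 * decay) / total],
    ]


def partial_likelihoods(node, site, q01, q10):
    """Pruning step: likelihood of the data below `node` for each node state.

    A leaf is a language name; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if state == site[node] else 0.0 for state in (0, 1)]
    partial = [1.0, 1.0]
    for child, branch_length in node:
        p = transition_matrix(q01, q10, branch_length)
        below = partial_likelihoods(child, site, q01, q10)
        for state in (0, 1):
            partial[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return partial


def log_likelihood(tree, sites, q01, q10, rate_multipliers=(1.0,)):
    """Sum of per-site log-likelihoods, averaging over rate categories."""
    total_rate = q01 + q10
    root_prior = [q10 / total_rate, q01 / total_rate]  # stationary distribution
    total = 0.0
    for site in sites:
        site_likelihood = 0.0
        for multiplier in rate_multipliers:
            partial = partial_likelihoods(tree, site, q01 * multiplier, q10 * multiplier)
            site_likelihood += (root_prior[0] * partial[0] + root_prior[1] * partial[1])
        total += math.log(site_likelihood / len(rate_multipliers))
    return total


# Hypothetical three-language tree and two binary sites.
tree = [([("English", 0.1), ("German", 0.1)], 0.2), ("French", 0.3)]
sites = [
    {"English": 1, "German": 1, "French": 1},
    {"English": 1, "German": 1, "French": 0},
]
print(log_likelihood(tree, sites, q01=0.4, q10=0.6, rate_multipliers=(0.5, 1.0, 2.0)))
```

The work is a nested loop over sites and categories, which is what the two parallel methods later in the talk divide up.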

Levels of parallelism
  Data: analysis of multiple datasets (3-5)
  Model: test a range of models (10-20)
  Run: the process is stochastic, so multiple runs are needed (5-10)
  Code: an individual run can still take years
  Data, model and run levels: trivially parallel
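The three outer levels multiply into a grid of independent jobs. A minimal sketch, using the ranges on the slide and an illustrative function name, enumerates that grid; each tuple could be submitted to a scheduler as a separate job with no communication between jobs.

```python
from itertools import product


def job_grid(n_datasets, n_models, n_runs):
    """Enumerate every independent (dataset, model, run) combination."""
    return list(product(range(n_datasets), range(n_models), range(n_runs)))


smallest = job_grid(n_datasets=3, n_models=10, n_runs=5)
largest = job_grid(n_datasets=5, n_models=20, n_runs=10)
print(len(smallest), "to", len(largest), "independent jobs")
```

None of this shortens a single run, which is why the code level needs the MPI and OpenMP work described next.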

The problem
  Year   Taxa   Sites    Model complexity
  2003   16     125      1 x
  2005   87     2,450    4 x
  2007   400    34,440   100 x
  Complexity has grown 700,000x, 5-6 orders of magnitude
  4.8 years per run; typically 5 publication-quality runs + 10 model tests
  4.8 years > attention span of academics; results are required in days
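The 700,000x figure can be reproduced with simple arithmetic if one assumes that cost is proportional to taxa x sites x model factor. That proportionality is an assumption made here for illustration; the slide does not say how the figure was derived.

```python
import math

# (taxa, sites, model factor) for each year, as listed on the slide.
problems = {
    2003: (16, 125, 1),
    2005: (87, 2450, 4),
    2007: (400, 34440, 100),
}


def relative_cost(taxa, sites, model_factor):
    # Assumption: cost is proportional to taxa * sites * model factor.
    return taxa * sites * model_factor


baseline = relative_cost(*problems[2003])
growth = {year: relative_cost(*size) / baseline for year, size in problems.items()}

for year, factor in growth.items():
    print(year, factor, "orders of magnitude:", round(math.log10(factor), 2))
```

Under that assumption the 2007 problem is 688,800 times the 2003 one, which rounds to the quoted 700,000x.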

Parallel method 1: distribute the data (MPI)
  [Figure: binary data matrix of languages by cognates, divided into blocks of columns assigned to Core 1, Core 2 and Core 3]
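The decomposition can be sketched without MPI. The code below runs serially and only emulates the layout: in an MPI program each block of sites would live on its own rank and the partial sums would be combined with a reduction. The function names and the per-site values are hypothetical.

```python
import math


def site_blocks(n_sites, n_cores):
    """Split site indices into contiguous, near-equal blocks, one per core."""
    base, extra = divmod(n_sites, n_cores)
    blocks, start = [], 0
    for core in range(n_cores):
        size = base + (1 if core < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def distributed_log_likelihood(site_log_likelihoods, n_cores):
    """Each 'core' sums its own block; the partial sums are then reduced."""
    blocks = site_blocks(len(site_log_likelihoods), n_cores)
    partial_sums = [sum(site_log_likelihoods[i] for i in block) for block in blocks]
    return sum(partial_sums)


# Hypothetical per-site log-likelihoods for ten sites.
values = [-0.5 * (i + 1) for i in range(10)]
print([len(block) for block in site_blocks(10, 3)])
print(distributed_log_likelihood(values, 3))
```

Because sites are independent given the tree and model, the only communication is the final sum, which is why this method scales well until each core has very few sites.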

Parallel method 2: distribute the model (OpenMP)
  [Figure: each of Core 1 to Core 4 works on the full data set, performing Pass 1 to Pass 4 respectively]
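As a rough illustration of distributing the model, the sketch below gives each worker one model category and lets it make a full pass over the data. Python threads stand in for OpenMP threads and will not give a real speed-up because of the interpreter lock. The per-site likelihood is a placeholder, and averaging the categories with equal weight is an assumption; neither comes from the slides.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def pass_over_data(site_patterns, rate):
    """One pass: evaluate every site under a single model category.

    The formula is a stand-in, not a real phylogenetic likelihood.
    """
    return [math.exp(-rate * (1 + sum(pattern))) for pattern in site_patterns]


def log_likelihood(site_patterns, rates, workers=None):
    """Run one pass per category in parallel, then combine per site."""
    with ThreadPoolExecutor(max_workers=workers or len(rates)) as pool:
        passes = list(pool.map(lambda rate: pass_over_data(site_patterns, rate), rates))
    total = 0.0
    for per_site in zip(*passes):  # one tuple of category values per site
        total += math.log(sum(per_site) / len(rates))
    return total


patterns = [(0, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 1)]
rates = [0.25, 0.5, 1.0, 2.0]  # four hypothetical categories, one per pass
print(log_likelihood(patterns, rates))
```

The number of useful workers is capped by the number of categories (1 to 12 in the talk), so this method alone cannot use hundreds of cores.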

Distribute the data and the model (MPI + OpenMP)
  [Figure: four passes over the data (Pass 1 to Pass 4), each shared between two cores: Cores 1-2, Cores 3-4, Cores 5-6 and Cores 7-8]
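One way to read the combined layout is as a two-level mapping from core number to a (pass, data block) pair. The sketch below reproduces the 8-core, 4-pass arrangement in the figure; the function name and the 0-based data block index are illustrative, and the slides do not say which cores hold which part of the data.

```python
def core_layout(n_cores, n_passes):
    """Map 1-based core numbers to (pass number, data block index)."""
    if n_cores % n_passes != 0:
        raise ValueError("cores must divide evenly between passes")
    cores_per_pass = n_cores // n_passes
    layout = {}
    for core in range(1, n_cores + 1):
        pass_number = (core - 1) // cores_per_pass + 1  # which model category
        data_block = (core - 1) % cores_per_pass        # which slice of the sites
        layout[core] = (pass_number, data_block)
    return layout


for core, (pass_number, data_block) in core_layout(8, 4).items():
    print(f"Core {core}: pass {pass_number}, data block {data_block}")
```

The usable core count becomes categories x data blocks, which is how the runs reported later reach hundreds of cores.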

[Figure: run time in seconds (log10 scale) by number of cores]

[Figure: parallel efficiency by number of cores]

Results
  Runtime reduced from 4.8 years to:
    Cores   Days
    60      31.5
    150     14.5
    300     8.5
    600     6
  Good scaling, but not sustainable
  HPC has allowed for the accurate analysis of large, complex data sets with statistically justifiable models
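The timings can be turned into speed-up and efficiency figures. This sketch takes the 60-core run as the baseline, which is a choice made here because no smaller run is reported; the function name is illustrative.

```python
# Cores -> days, as reported on the slide.
timings = {60: 31.5, 150: 14.5, 300: 8.5, 600: 6.0}


def scaling(timings):
    """Return (cores, speed-up, efficiency) relative to the smallest run."""
    base_cores = min(timings)
    base_days = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_days / timings[cores]
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


for cores, speedup, efficiency in scaling(timings):
    print(f"{cores:4d} cores: speed-up {speedup:.2f}, efficiency {efficiency:.0%}")
```

On this baseline, efficiency falls from about 87% at 150 cores to about 53% at 600 cores, consistent with "good scaling, but not sustainable".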

Current work
  Phoneme data
    Modelling sound utterances
    Better resolution than cognacy data
    Relevant linguistic patterns are emerging
    120 phonemes, 2 cognacy judgements
    Another 3 orders of magnitude of complexity
  Accelerator implementation: CUDA / OpenCL
  Language   Word   Cognacy   Phoneme
  English    Fish   1
  Danish     Fisk   1
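One plausible reading of the extra three orders of magnitude, offered here as an assumption rather than the author's derivation, is that the inner loop of a pruning-style likelihood scales with the square of the number of states. Moving from 2 cognacy states to 120 phoneme states then costs roughly:

```python
import math


def relative_state_cost(n_states, baseline_states=2):
    """Assumed cost ratio when per-branch work scales as states squared."""
    return (n_states / baseline_states) ** 2


factor = relative_state_cost(120)
print(factor, "times the work, about", round(math.log10(factor), 1), "orders of magnitude")
```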

Scalable computing
  Last 10 years: 5-6 orders of magnitude increase in complexity
  Code is reasonably scalable, but a redesign is needed
  Need to change the how, not the what
    What: statistical framework, realistic models
    How: algorithm, language, parallelisation method, hardware
  Scalable algorithms

[Figure: MCMC run divided into a serial burn-in phase and, after convergence, a parallel phase]

Parallel sampling using multiple chains
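A simple cost model shows why burn-in limits this approach. It assumes every chain must complete the same burn-in before its samples count, and that the post-convergence samples divide evenly between chains; both are simplifying assumptions, and the function names and iteration counts are hypothetical.

```python
def multi_chain_speedup(burn_in, samples, chains):
    """Speed-up of `chains` parallel chains over one long chain.

    Every chain pays the full burn-in; only the sampling phase is shared.
    """
    serial_time = burn_in + samples
    parallel_time = burn_in + samples / chains
    return serial_time / parallel_time


def speedup_limit(burn_in, samples):
    """Upper bound on speed-up as the number of chains grows without limit."""
    return (burn_in + samples) / burn_in


for chains in (1, 9, 90, 900):
    print(chains, "chains:", round(multi_chain_speedup(1000, 9000, chains), 2))
print("limit:", speedup_limit(1000, 9000))
```

With a burn-in of 10% of the run, no number of chains can deliver more than a tenfold speed-up, so shortening the serial burn-in matters as much as adding chains.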

Key challenges
  Computing is a rate-limiting step
    Treading water / drowning
    Widening gap between computing power and data and model complexity
    Data set size and model complexity are restricted
    20-30 year old methods, which are less accurate and non-statistical, are returning
  Connecting researchers with results, not HPC
    HPC is a nuisance in science
    Steep learning curve
    High cost: hardware, running costs and personnel
    Access and flexibility
  Not a one-off activity: thousands of data sets are produced each year, 3000+ published in 2011

Acknowledgments
  Mark Pagel
