1 Unsupervised and Knowledge-Free Natural Language Processing in the Structure Discovery Paradigm
Chris Biemann, University of Leipzig, Germany
Talk at the University of Wolverhampton, September 7, 2007
2 Outline
Review of traditional approaches: knowledge-intensive vs. knowledge-free; degrees of supervision
A new approach: the Structure Discovery Paradigm
Graph-based SD procedures: graph models for language processing; Chinese Whispers graph clustering
Structure Discovery processes: language separation; unsupervised POS tagging; word sense induction
3 Knowledge-Intensive vs. Knowledge-Free
In traditional automated language processing, knowledge is involved wherever humans manually tell machines how to process language (explicit knowledge) or how a task should be solved (implicit knowledge). Knowledge can be provided by means of: dictionaries, e.g. thesauri, WordNet, ontologies, ...; (grammar) rules; annotation.
4 Degrees of Supervision
Supervision means providing positive and negative training examples to Machine Learning algorithms, which use them as a basis for building a model that reproduces the classification on unseen data. Degrees: Fully supervised (classification): learning is carried out only on a fully labeled training set. Semi-supervised: unlabeled examples are also used for building a data model. Weakly supervised (bootstrapping): a small set of labeled examples is grown, and classifications are used for re-training. Unsupervised (clustering): no labeled examples are provided.
5 Structure Discovery Paradigm SD: Analyze raw data and identify regularities Statistical methods, clustering Knowledge-free, unsupervised Structures: as many as can be discovered Language-independent, domain-independent, encoding-independent Goal: Discover structure in language data and mark it in the data
6 Example: Discovered Structures
Annotation on various levels. Similar labels denote similar properties as found by the SD algorithms; similar structures in the corpus are annotated in a similar way. Example: „Increased interest rates lead to investments in banks." is segmented as In=creas-ed interest rate-s lead to investment-s in bank-s, together with word-class labels (e.g. POS=298) per word.
7 Consequences of Working in SD
The only input allowed is raw text data. Machines are told how to algorithmically discover structure. Self-annotation process: regularities are marked in the data. The Structure Discovery process is iterated. (diagram: SD algorithms find regularities in the text data by analysis, annotate the data with them, and further SD algorithms operate on the enriched data.)
8 Pros and Cons of Structure Discovery Advantages : Cheap: only raw data needed Alleviation of acquisition bottleneck Language and domain independent No data-resource mismatch (all resources leak) Disadvantages : No control over self-annotation labels Congruence to linguistic concepts not guaranteed Much computing time needed
9 What Is It Good For? How Do I Know?
Many structural regularities can be thought of; some are interesting, some are not. Structures discovered by SD algorithms will not necessarily match the concepts of linguists. Working in the SD paradigm means over-generating structure acquisition methods and checking whether they are helpful. Methods for telling helpful from useless SD procedures: The „look at my nice clusters"-approach: examine the data by hand. While useful in the initial testing phase, this is inconclusive (choice of clusters, coverage, ...). Task-based evaluation: use the obtained labels as features in a Machine Learning scenario and measure the contribution of each label type. This involves supervision and is indirect.
10 Graph Models for SD Procedures
Motivation for the graph representation: graphs are an intuitive and natural way to encode language units as nodes and their similarities as edges (other representations are possible as well). Graph clustering with Chinese Whispers can efficiently perform abstraction by grouping units into homogeneous sets. Some graphs on basic units: word co-occurrence (neighbour-based or sentence-based, weighted by significance, higher orders); word context similarity based on global context vectors; sentence/document similarity based on common words.
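As an illustration, here is a minimal sketch of a sentence-based co-occurrence graph in Python. It weights edges by raw co-occurrence frequency with a minimum-count cutoff; the actual procedures weight edges with a statistical significance measure, so the weighting here is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_count=2):
    """Sentence-based word co-occurrence graph.

    sentences: iterable of token lists.
    Returns a dict mapping node -> {neighbour: weight}.
    Edge weights are raw co-occurrence counts here; a significance
    measure would replace them in a full implementation.
    """
    edge_counts = Counter()
    for sent in sentences:
        for w1, w2 in combinations(sorted(set(sent)), 2):
            edge_counts[(w1, w2)] += 1

    graph = {}
    for (w1, w2), count in edge_counts.items():
        if count >= min_count:
            graph.setdefault(w1, {})[w2] = count
            graph.setdefault(w2, {})[w1] = count
    return graph
```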
11 Graph Clustering
Find groups of nodes in undirected, weighted graphs. Hierarchical clustering vs. flat partitioning. (figure: small example graph with edge weights)
12 Desired Outcomes?
Colours symbolise partitions. (figure: the example graph with possible partitionings)
13 Chinese Whispers Algorithm
Nodes have a class and communicate it to their adjacent nodes. A node adopts the majority class in its neighbourhood. Nodes are processed in random order for some iterations.
Algorithm:
initialize: forall vi in V: class(vi) = i;
while changes:
  forall v in V, randomized order:
    class(v) = highest ranked class in neighbourhood of v;
(figure: example graph with nodes A-E, current labels L1-L4, edge weights and node degrees)
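The pseudocode above translates fairly directly into Python. A minimal sketch, assuming the graph representation from the previous sketch (node -> {neighbour: weight}); the fixed iteration cap and the seeded shuffling are assumptions of this sketch, not part of the original formulation.

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """Chinese Whispers graph clustering.

    graph: dict mapping node -> {neighbour: edge_weight}, undirected.
    Returns a dict mapping node -> class label.
    """
    rng = random.Random(seed)
    classes = {node: node for node in graph}    # initialize: own class per node

    for _ in range(iterations):
        nodes = list(graph)
        rng.shuffle(nodes)                      # randomized processing order
        changed = False
        for node in nodes:
            # rank classes in the neighbourhood by summed edge weight
            scores = defaultdict(float)
            for neighbour, weight in graph[node].items():
                scores[classes[neighbour]] += weight
            if not scores:
                continue
            best = max(scores, key=scores.get)  # ties broken arbitrarily
            if best != classes[node]:
                classes[node] = best
                changed = True
        if not changed:                         # no convergence guarantee,
            break                               # so stop after a fixed cap
    return classes
```

For the co-occurrence graph sketched above, chinese_whispers(cooccurrence_graph(sentences)) would return one cluster label per word.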
14 Properties of CW
PRO: Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but real-world graphs are much sparser. Parameter-free: this includes the number of clusters.
CON: Non-deterministic, due to random-order processing and possible ties with respect to the majority. Does not converge (see the tie example). Formally hard to analyse.
However, the CONs are not severe for real-world data.
15 Application: Language Separation
Cluster the co-occurrence graph of a multilingual corpus. Use the words of the same class as the lexicon of one language in a language identifier. Almost perfect performance.
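A minimal sketch of the identification step, assuming the cluster labels come from the Chinese Whispers sketch above; the minimum lexicon size and the word-overlap scoring are illustrative assumptions.

```python
def build_lexicons(classes, min_size=100):
    """Group words by cluster label; each sufficiently large cluster is
    treated as the lexicon of one (as yet unnamed) language."""
    lexicons = {}
    for word, label in classes.items():
        lexicons.setdefault(label, set()).add(word)
    return {lbl: words for lbl, words in lexicons.items() if len(words) >= min_size}

def identify_language(sentence, lexicons):
    """Assign a sentence to the cluster whose lexicon covers most of its tokens."""
    scores = {lbl: sum(token in words for token in sentence)
              for lbl, words in lexicons.items()}
    return max(scores, key=scores.get) if scores else None
```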
16 “Look at my nice languages!” Cleaning CUCWeb Latin: In expeditionibus tessellata et sectilia pauimenta circumferebat. Britanniam petiuit spe margaritarum: earum amplitudinem conferebat et interdum sua manu exigebat.. Scripting: @echo @cd $(TLSFDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TLSFSRC) … @echo @cd $(TOOLSDIR);$(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TOOLSSRC).. Hungarian: A külügyminiszter a diplomáciai és konzuli képviseletek címjegyzékét és konzuli … Köztestületek, jogi személyiséggel és helyi jogalkotási jogkörrel. Esperanto: Por vidi ghin kun internacia kodigho kaj kun kelkaj bildoj kliku tie chi ) La Hispana.. Ne nur pro tio, ke ghi perdigis la vivon de kelk-centmil hispanoj, sed ankau pro ghia efiko.. Human Genome: 1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt acttcaccaa ctcgtacatg 61 ggggcggaca tgcatcatgg gcactacccg ggcaacgggg tcaccgacct ggacgcccag 121 cagatgcacc … Isoko (Nigeria): (1) Kọ Ileleikristi a rẹ rọwo inọ Ọghẹnẹ yọ Esanerọvo? (5) Kọ Jesu o whu evaọ uruwhere?
17 Unsupervised POS Tagging
Given: an unstructured monolingual text corpus. Goal: induction of POS tags for all words.
Motivation: POS information is a preprocessing step in a variety of NLP applications such as parsing, IE and indexing. POS taggers need a considerable amount of hand-tagged training data, which is expensive and only available for major languages. Even for major languages, POS taggers are suited to well-formed texts and do not cope well with domain-dependent issues such as those found in e-mail or spoken corpora.
18 Steps in Unsupervised POS Tagging
Unlabelled text: ..., sagte der Sprecher bei der Sitzung. ..., rief der Vorsitzende in der Sitzung. ..., warf in die Tasche aus der Ecke.
Pipeline: distributional vectors for high-frequency words form Graph 1; neighbouring co-occurrences of medium-frequency words form Graph 2; both graphs are clustered with Chinese Whispers into Partitioning 1 and Partitioning 2; these are combined into a maxtag lexicon that yields partially labelled text; a trigram Viterbi tagger trained on it produces fully labelled text.
Induced clusters: C1: sagte, warf, rief; C2: Sprecher, Vorsitzende, Tasche; C3: in; C4: der, die.
Partially labelled text: ..., sagte|C1 der|C4 Sprecher|C2 bei der|C4 Sitzung. ..., rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung. ..., warf|C1 in|C3 die|C4 Tasche|C2 aus der|C4 Ecke.
Fully labelled text: ..., sagte|C1 der|C4 Sprecher|C2 bei|C3 der|C4 Sitzung|C2. ..., rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung|C2. ..., warf|C1 in|C3 die|C4 Tasche|C2 aus|C3 der|C4 Ecke|C2.
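A minimal sketch of the lexicon and partial-tagging steps of this pipeline, assuming the two partitionings are word-to-cluster dictionaries as produced by the Chinese Whispers sketch; how the two partitionings are merged and how unknown words are left for the Viterbi tagger are assumptions of this sketch.

```python
def maxtag_lexicon(partitioning1, partitioning2):
    """Combine the two partitionings into a word -> cluster-ID lexicon.
    Partitioning 1 (high-frequency words) takes precedence here; the exact
    combination strategy is an assumption of this sketch."""
    lexicon = dict(partitioning2)
    lexicon.update(partitioning1)
    return lexicon

def tag_partially(sentence, lexicon):
    """Label every token found in the lexicon; tokens not covered stay
    untagged (None) and are filled in later by the trigram Viterbi tagger
    trained on the partially labelled text."""
    return [(token, lexicon.get(token)) for token in sentence]
```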
19 Partitioning 1 Example
Collect global contexts for the top 10,000 words. The cosine between context vectors serves as similarity (the similarity matrix is not sparse!). Construct a graph and cluster it with Chinese Whispers: no predefined number of clusters, efficient, one parameter (a threshold on the similarity value), singletons are excluded.
Example sentences: ... _KOM_ sagte der Sprecher bei der Sitzung _ESENT_ ... _KOM_ rief der Vorsitzende in der Sitzung _ESENT_ ... _KOM_ warf in die Tasche aus der Ecke _ESENT_
Features: der(1), die(2), bei(3), in(4), _ESENT_(5), _KOM_(6), counted at positions -2, +1, +2.
(table: feature-count vectors for sagte, rief, warf, Sprecher, Vorsitzende, Tasche)
(table: pairwise cosine similarities, e.g. cos(Vorsitzende, Sprecher) = 0.666, cos(Tasche, Sprecher) = 0.408, cos(Tasche, Vorsitzende) = 0.333)
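A minimal sketch of this step, using simple count vectors over the feature words at the given offsets; the cosine threshold of 0.3 is a hypothetical value (the threshold is a tunable parameter in the original).

```python
import math
from collections import defaultdict

def context_vectors(sentences, target_words, feature_words, offsets=(-2, 1, 2)):
    """Global context vectors: for each target word, count how often each
    feature word occurs at the given relative positions."""
    vectors = {w: defaultdict(int) for w in target_words}
    for sent in sentences:
        for i, word in enumerate(sent):
            if word not in vectors:
                continue
            for off in offsets:
                j = i + off
                if 0 <= j < len(sent) and sent[j] in feature_words:
                    vectors[word][sent[j]] += 1
    return vectors

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_graph(vectors, threshold=0.3):
    """Connect word pairs whose cosine exceeds the threshold; words without
    any edge (singletons) are excluded from clustering."""
    words = list(vectors)
    graph = {w: {} for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            sim = cosine(vectors[w1], vectors[w2])
            if sim >= threshold:
                graph[w1][w2] = sim
                graph[w2][w1] = sim
    return {w: nbrs for w, nbrs in graph.items() if nbrs}
```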
20 Partitioning 2 Example
Clustering more than 10,000 words based on context vectors is not viable (time-quadratic), and the vectors get sparser for lower-frequency words. Instead, use neighbouring co-occurrences to compute similarity between medium- and low-frequency words (excluding the top 2,000 words); a sketch follows below. Clustering with CW: medium- and low-frequency words are arranged in POS-motivated clusters.
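A minimal sketch of this similarity, assuming each word comes with the set of its significant left/right neighbours (with the most frequent words already removed); counting shared neighbours and the cutoff value are illustrative assumptions.

```python
def neighbour_similarity_graph(neighbour_sets, min_shared=4):
    """Graph over medium-/low-frequency words whose edge weight is the
    number of significant neighbouring co-occurrences two words share.

    neighbour_sets: dict mapping word -> set of its significant neighbours.
    min_shared is an illustrative cutoff, not a value from the original work.
    """
    words = list(neighbour_sets)
    graph = {w: {} for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            shared = len(neighbour_sets[w1] & neighbour_sets[w2])
            if shared >= min_shared:
                graph[w1][w2] = shared
                graph[w2][w1] = shared
    return {w: nbrs for w, nbrs in graph.items() if nbrs}
```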
21 unsuPOS: Ambiguity Example
Word | cluster ID | cluster members (size)
I | 166 | I (1)
saw | 2 | past tense verbs (3818)
the | 73 | a, an, the (3)
man | 1 | nouns (17418)
with | 13 | prepositions (143)
a | 73 | a, an, the (3)
saw | 1 | nouns (17418)
. | 116 | . ! ? (3)
22 unsuPOS: Medline Tagset
1 (13721) recombinogenic, chemoprophylaxis, stereoscopic, MMP2, NIPPV, Lp, biosensor, bradykinin, issue, S-100beta, iopromide, expenditures, dwelling, emissions, implementation, detoxification, amperometric, appliance, rotation, diagonal,
2 (1687) self-reporting, hematology, age-adjusted, perioperative, gynaecology, antitrust, instructional, beta-thalassemia, interrater, postoperatively, verbal, up-to-date, multicultural, nonsurgical, vowel, narcissistic, offender, interrelated,
3 (1383) proven, supplied, engineered, distinguished, constrained, omitted, counted, declared, reanalysed, coexpressed, wait,
4 (957) mediates, relieves, longest, favor, address, complicate, substituting, ensures, advise, share, employ, separating, allowing,
5 (1207) peritubular, maxillary, lumbar, abductor, gray, rhabdoid, tympanic, malar, adrenal, low-pressure, mediastinal,
6 (653) trophoblasts, paws, perfusions, cerebrum, pons, somites, supernatant, Kingdom, extra-embryonic, Britain, endocardium,
7 (1282) acyl-CoAs, conformations, isoenzymes, STSs, autacoids, surfaces, crystallins, sweeteners, TREs, biocides, pyrethroids,
8 (1613) colds, apnea, aspergilloma, ACS, breathlessness, perforations, hemangiomas, lesions, psychoses, coinfection, terminals, headache, hepatolithiasis, hypercholesterolemia, leiomyosarcomas, hypercoagulability, xerostomia, granulomata, pericarditis,
9 (674) dysregulated, nearest, longest, satisfying, unplanned, unrealistic, fair, appreciable, separable, enigmatic, striking,
10 (509) differentiative, ARV, pleiotropic, endothermic, tolerogenic, teratogenic, oxidizing, intraovarian, anaesthetic, laxative,
13 (177) ewe, nymphs, dams, fetuses, marmosets, bats, triplets, camels, SHR, husband, siblings, seedlings, ponies, foxes, neighbor, sisters, mosquitoes, hamsters, hypertensives, neonates, proband, anthers, brother, broilers, woman, eggs,
14 (103) considers, comprises, secretes, possesses, sees, undergoes, outlines, reviews, span, uncovered, defines, shares,
15 (87) feline, chimpanzee, pigeon, quail, guinea-pig, chicken, grower, mammal, toad, simian, rat, human-derived, piglet, ovum,
16 (589) dually, rarely, spectrally, circumferentially, satisfactorily, dramatically, chronically, therapeutically, beneficially, already,
18 (124) 1-min, two-week, 4-min, 8-week, 6-hour, 2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h, 8-h, ten-year, overnight, 120-
21 (12) July, January, May, February, December, October, April, September, June, August, March, November
23 (13) acetic, retinoic, uric, oleic, arachidonic, nucleic, sialic, linoleic, lactic, glutamic, fatty, ascorbic, folic
25 (28) route, angle, phase, rim, state, region, arm, site, branch, dimension, configuration, area, Clinic, zone, atom, isoform,
247 (6) P<0.001, P<0.01, p<0.001, p<0.01, P<.001, P<0.0001
391 (119) alcohol, ethanol, heparin, cocaine, morphine, cisplatin, dexamethasone, estradiol, melatonin, nicotine, fibronectin,
23 Task-Based unsuPOS Evaluation
UnsuPOS tags are used as features; performance is compared to using no POS and supervised POS. The tagger was induced in one CPU-day from the BNC.
Kernel-based WSD: better than noPOS, equal to suPOS. POS tagging: better than noPOS. Named entity recognition: no significant differences. Chunking: better than noPOS, worse than suPOS.
24 Application: Word Sense Induction
Co-occurrence graphs of ambiguous words can be partitioned; the focus word itself is left out. The resulting clusters contain context words that can be used for disambiguation.
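A minimal sketch of this procedure, reusing the cooccurrence_graph and chinese_whispers functions from the earlier sketches; the simple word-overlap voting used for disambiguation is an assumption.

```python
from collections import defaultdict

def sense_clusters(sentences, focus, min_count=2):
    """Word sense induction: build the co-occurrence graph over the contexts
    of the focus word, leave out the focus word itself, and partition the
    graph with Chinese Whispers. Each cluster approximates one sense."""
    contexts = [[w for w in sent if w != focus]
                for sent in sentences if focus in sent]
    graph = cooccurrence_graph(contexts, min_count=min_count)
    return chinese_whispers(graph)

def disambiguate(context_words, classes):
    """Assign the sense cluster that shares the most words with the context."""
    votes = defaultdict(int)
    for w in context_words:
        if w in classes:
            votes[classes[w]] += 1
    return max(votes, key=votes.get) if votes else None
```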
25-28 hip (figures only: the co-occurrence graph of the ambiguous word "hip" and its partition into sense clusters)
29 Future Work
Grammar induction: use unsupervised POS labels and sequences for inducing tree-like structures. Word sense induction and disambiguation with hierarchical sense models, evaluated in the SEMEVAL framework. Structure Discovery Machine: putting many SD procedures together, including morphology induction, multiword expression identification, relation finding, etc. Semi-supervised SD: a framework for connecting SD processes to application-based settings.
30 Summary
Structure Discovery Paradigm contrasted to traditional approaches: no manual annotation, no resources (cheaper); language- and domain-independent; iteratively enriching structural information by finding and annotating regularities.
Graph-based SD procedures: Chinese Whispers for language separation, unsupervised POS tagging and word sense induction.
31 Questions? THANKS FOR YOUR ATTENTION!
32 Emergent Language Generation Models: From SD to NLG
Structure Discovery allows us to process as yet undescribed languages. Generation models capture only certain aspects of real natural language. SD processes can identify what is common and what is different between generated and real language, and generation models can be improved to meet the criteria of SD processes. Understanding the emergence of natural language is a third way of approaching the understanding of natural language structure (besides classical linguistics and bottom-up NLP).
33 Graph-Based Generation Model (figure: word generator and sentence generator)
34 Comparison: Generated and Real
35 Computational Linguistics and Statistical NLP
CL: implementing linguistic theories with computers; rule-based approaches; rules found by introspection, not data-driven; explicit knowledge; goal: understanding language itself.
Statistical NLP: building systems that perform language processing tasks; Machine Learning approaches; models are built by training on annotated datasets; implicit knowledge; goal: building robust systems with high performance.
There is a continuum rather than a sharp dividing line.
36 Building Blocks in SD
Hierarchical levels of basic units in text data: letters, words, sentences, documents. These are assumed to be recognizable in what follows. SD allows for an arbitrary number of intermediate levels that group basic units into complex units, but these have to be found by SD procedures.
37 Similarity and Homogeneity
To determine which units share structure, a similarity measure for units is needed. Two kinds of features are possible: internal features compare units based on the lower-level units they contain; context features compare units based on other units, of the same or another level, that surround them. A clustering based on unit similarity yields sets of units that are homogeneous with respect to structure. This is an abstraction process: units are subsumed under the same label. A small sketch of an internal-feature similarity follows below.
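A minimal sketch of an internal-feature similarity for words, using character n-gram overlap; the choice of n-grams and the Jaccard measure are illustrative assumptions, since SD does not prescribe a particular measure.

```python
def char_ngrams(word, n=3):
    """Internal features of a word: the character n-grams it contains."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(max(len(padded) - n + 1, 0))}

def internal_similarity(w1, w2, n=3):
    """Jaccard overlap of the two words' character n-gram sets."""
    a, b = char_ngrams(w1, n), char_ngrams(w2, n)
    return len(a & b) / len(a | b) if a or b else 0.0
```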