1 XML and Linguistic Annotation
Chris Brew, Ohio State University (credit to Marc Moens, Henry Thompson, David McKelvie, all Language Technology Group, University of Edinburgh)
Copyright © 2000 Chris Brew

2 Summer School, July 2000: XML and Linguistic Annotation
XML topics
- What is XML?
  - HTML, XML and SGML
  - Wider context of XML
- Data Description
  - DTDs, Schemas
- Query Languages
  - XML Query, XQL, Quilt, LORE, LT QUERY
- Style Languages
  - CSS, XSL

3 What is XML?
- It is a markup language used for annotating text
- It is concerned with logical structure
  - to identify sections, titles, section headers, chapters, paragraphs, ...
- It is not concerned with appearance
  - you say 'this is a subtitle', not 'this is in bold, 14pt, centered'
  - you say 'this is an example', not 'this is in verbatim, indented by 5pts, ragged right'
- Derived from SGML

4 Why is XML a big deal?
- It is a W3C standard
- It is vendor-independent, platform-independent, application-independent, ...
  - unlike Word documents, RTF documents, PDF documents, PostScript documents, ...
- It is human readable
  - ditto (for most values of 'human')
- The Web interchange format

5 Who is in charge of XML?
- XML is a W3C Recommendation
- The W3C is the World Wide Web Consortium, a voluntary association of companies and non-profit organizations. Membership costs serious money and confers voting rights. Procedures are complex, with the Director (Tim Berners-Lee) holding all the high cards, but the big vendors (e.g. Microsoft, Adobe, Netscape) have a lot of power.
- The recommendation was written by the W3C's XML Working Group.

6 XML as a career move?
- Most of the big computer and entertainment companies believe XML is the solution.
- Exactly what was the problem?
  - Presenting a parts database over the Internet
  - Running an on-line job market (flipdog.com)
  - Usually not corpus creation.
- Scholars win and lose
  - SGML was a minority interest where we had serious influence on what facilities were used
  - XML is mainstream. We're the minority now.
  - This year's dot-coms are busily hiring people who understand ontologies, NLP and web technology.

7 Does it live up to the hype?
- Of course not, but...
- The basic idea is simple labeled brackets. Lisp showed the power of this idea in knowledge representation.
- Knowledge representation is inherently hard. Lisp made it easier to state the problem, but it wasn't itself the solution. XML won't solve your knowledge representation problems either, but it will let you state them.
- Labeled brackets++
  - Labeled brackets, but designed for information exchange, with sophisticated input (and political pressures) from many interest groups.

8 Does it live up to the hype?
- Yes. XML and allied standards (XSLT, XML Query) give us a framework for data interchange.
[Diagram labels: Data: Weather Reports; XML; Transformation: XSL; End Users: Browser, Day Planner, Weather Model]

9 Transformation
- End users will differ in which parts of the weather reports they need, so the middle stage is the crux.
- One XML format defines the available data
- Transformations map this format into what is needed by the different applications, leaving out bits that they don't need.
- One common transformation is to HTML, for browsers (easy).
- Another is to printed paper, for efficient random access (difficult, because our quality expectations are so high).

10 Representing knowledge in text
- Unformatted text
- Formatted text
- Structured markup

11 Unformatted text
United Kingdom
Geography
Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World
Area: total area: 244,820 km2 land area: 241,590 km2 comparative area: slightly smaller than Oregon note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km
Coastline: 12,429 km

12 Formatted text
United Kingdom
Geography
Location: Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France
Map references: Europe, Standard Time Zones of the World
Area:
    total area: 244,820 km2
    land area: 241,590 km2
    comparative area: slightly smaller than Oregon
    >> note: includes Rockall and Shetland Islands
Land boundaries: total 360 km, Ireland 360 km
Coastline: 12,429 km

13 XML marked up text
[Element tags not preserved; remaining content:] United Kingdom Geography Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France Europe, Standard Time Zones of the World 244,820 km2 241,590 km2 slightly smaller than Oregon note: includes Rockall and Shetland Islands total 360 km, Ireland 360 km
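A minimal sketch of what the marked-up record buys you, assuming hypothetical element names (`country`, `geography`, `area`, `total`, `land`); the slide's own tag set is not preserved. Once the fields are labeled, a program can pick them out by name instead of guessing from layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the record above; element and attribute names are
# illustrative assumptions, not the tags used on the original slide.
RECORD = """
<country name="United Kingdom">
  <geography>
    <location>Western Europe, bordering on the North Atlantic Ocean and the North Sea, between Ireland and France</location>
    <area>
      <total unit="km2">244820</total>
      <land unit="km2">241590</land>
    </area>
    <coastline unit="km">12429</coastline>
  </geography>
</country>
"""

root = ET.fromstring(RECORD)

# Fields are addressed by path, not by position in the text.
total = int(root.findtext("geography/area/total"))
land = int(root.findtext("geography/area/land"))
water = total - land  # area not counted as land, in km2
print(root.get("name"), water)
```

The same lookup against the unformatted version would need a hand-written parser for "total area: 244,820 km2".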

14 The syntax...
But aren't all those angle brackets still terribly cumbersome and complicated?
- Yes: XML is simple only relative to SGML. But...
  - There are tools that allow you to add XML annotation without the need to know XML
  - There are tools that allow you to search XML annotation without the need to know XML
  - XML is no more complex than other annotation schemes
  - If you roll your own scheme, you'll have to write (and maintain) the tools. If you use XML, part or all of your tool set will be provided by the mainstream computer industry.

15 RTF Format
{\rtf1\ansi
\pard\plain\s1\fs36\ppscheme-3\lang2057 {\f1\lang1033 Formatted text\par }
\pard\plain\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs32\ppscheme-6\lang1033 United Kingdom}{\f1\fs20\lang1033 }{\f1\fs16\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\b\f1\fs24\lang1033 Geography}{\f1\fs12\lang1033 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Location: Western Europe, bordering on the North Atlantic Ocean \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 and the North Sea, between Ireland and France\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Map references: Europe, Standard Time Zones of the World \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Area: total area: 244,820 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 land area: 241,590 km2 \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 comparative area: slightly smaller than Oregon\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 >> note: includes}{\f1\fs20\lang1033 Rockall}{\f1\fs20\lang1033 and Shetland Islands\par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Land boundaries: total 360 km, Ireland 360 km \par }
\pard\s2\li270\fi-270\fs28\ppscheme-1\lang2057\li0\fi0 {\f1\fs20\lang1033 Coastline: 12,429 km\par}}

16 SGML/XML for computational linguists
- HTML is a use of SGML
  - Also derived from SGML, but an application, not a subset
- SGML/XML let you define new types of document
- HTML gives you a language to write document instances
- But browsers typically aren't SGML parsers and don't enforce the syntax :-(
  - Hard-wired to a particular tag set (often with proprietary extensions, e.g. frames)
  - Hard-wired to a particular typographic format, with limited style sheets (e.g. you can't generate different versions of an HTML page: different order, tailored content)
- XHTML is to XML as HTML is to SGML

17 SGML/XML for computational linguists
What is XML?
- SGML Lite
  - Simpler to write
  - Simpler to parse
- HTML Heavy
  - New user-definable tags
  - Not (just) about browsing
  - Data interchange
- Heavily legislated syntax

18 What is XML?
- XML is just labeled brackets. You get elements with a start tag, some content, and an end tag.
[Memo example; tags not preserved:] Marc Moens Henry, David confidential GGP Contract The GGP contract is ready for signature. Please sign the contract as well as the NDA.

19 XML is SGML made simple
- SGML is labeled brackets too. You get elements with an optional start tag, some content,
[Memo example; tags not preserved:] Marc Moens Henry, David confidential GGP Contract The GGP contract is ready for signature.

20 XML Basics
Document Type Definition (DTD)
- Describes what can (and can't) be in a particular type of document
- E.g. a memo DTD might specify that every memo has: sender (name), recipients (list of names), date (default: today), subject, message, status (confidential or unrestricted)
Document Instance
- Identifies the document type and contains the marked-up text
- E.g. a memo document instance refers to the memo DTD and contains text marked up in conformance with that DTD

21 XML and document structure
XML is used to make the structure of documents explicit and machine readable.
[Example with callouts 'Document content' and 'SGML tags'; tags not preserved:] Marc Moens This is the first paragraph. It has some text. This is the second paragraph with some more text.

22 XML markup
[Example with callout 'XML tags'; tags not preserved:] Marc Moens This is the first paragraph. It has some text. This is the second paragraph with some more text and an embedded element.
- Elements consist of start tags, content (e.g. Marc Moens) and end tags
- Elements mark up text to indicate structure and function of text (as opposed to appearance)
- tag name = element type
- Elements can have attributes
- Elements and attributes are defined in the Document Type Definition

23 XML markup: for structure and function
[Example; tags not preserved:] He shouted: 'Come here now, Mr Banks.'
Encodes structure information to support rendering as well as data handling
- Data handling, e.g. search for all quotes inside sentences but not in footnotes; search for every mention of someone called Banks without finding the Banks of Scotland [use an XML-aware query tool]
- Rendering, e.g. emphasis should be bold underline; quotes should be in italics [use a stylesheet]

24 XML: Relevance for Linguists
- Simplify and standardize appeal to context
  - E.g. build a tokenizer which specifically works for headlines of newspaper articles: we need to be able to tell the tokenizer where the headline starts and ends
- Annotate text with interesting linguistic information
  - E.g. use XML tags to record the results of a tokenizer or part-of-speech tagger, or of a human annotator
- Allow sharing of results between research efforts
  - without having to write a new parser every time you get new material from somewhere

25 XML: Relevance for Linguists (example)
cat text | lttok -q '.*/P' -m W | ltpos -q '.*/W' -m C
- Use the tokeniser lttok on all paragraphs in the text and mark the resulting words as W elements
- Then run the part-of-speech tagger ltpos over the text and POS-tag all the W elements, putting the result in attribute C
[Example output; tags not preserved:] said the director of Russian Bear Ltd..
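The two pipeline stages can be imitated in a few lines. This is a toy sketch, not how lttok or ltpos work internally: the regex tokenizer, the `LEXICON` table and its tag names are all assumptions made for illustration. It only handles paragraphs that contain plain text.

```python
import re
import xml.etree.ElementTree as ET

# Toy lexicon; the tag names are illustrative, not the tag set ltpos uses.
LEXICON = {"said": "VBD", "the": "DT", "director": "NN", "of": "IN"}

def mark_words(root, query_tag="P", mark="W"):
    """Stage 1: wrap every token inside each <P> in a <W> element."""
    for para in list(root.iter(query_tag)):
        tokens = re.findall(r"\w+|[^\w\s]", para.text or "")
        para.text = None
        for token in tokens:
            ET.SubElement(para, mark).text = token

def tag_words(root, query_tag="W", attr="C"):
    """Stage 2: record a part-of-speech guess in attribute C of each <W>."""
    for word in root.iter(query_tag):
        word.set(attr, LEXICON.get(word.text.lower(), "UNK"))

doc = ET.fromstring("<TEXT><P>said the director of Russian Bear Ltd.</P></TEXT>")
mark_words(doc)
tag_words(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Because the second stage only asks for W elements, it does not need to know how the first stage found them; that is the sense in which the markup standardizes the hand-off between tools.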

26 Associated Standards
- XSLT
  - Transforming documents
- XML Query
  - Find bits of documents
- XML Schema
  - Use element syntax for DTDs
- Namespaces
  - Ensure that [tag] and [tag] both get processed correctly.

27 Infrastructure standards
- XPath
  - Referring to parts of documents
- XPointer
  - Pointing at documents and parts of documents
- DOM
  - Uniform programmer's interface to document trees (abstracts away from some details)
- SAX
  - Stream-based document interface (essential for big documents)
- Information Set
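To make the stream-based style concrete, here is a small sketch using Python's standard `xml.sax` module. The handler class `TagCounter` and the sample input are made up for illustration; the point is that the handler sees one event at a time and never holds the whole tree.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count start tags as they stream past, without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

handler = TagCounter()
xml.sax.parseString(b"<S><W>The</W><W>cat</W><W>sat</W></S>", handler)
print(handler.counts)
```

Memory use here depends on the number of distinct tag names, not on document size, which is why the slide calls this style essential for big documents.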

28 XML in detail
- Well-formedness and validity
- DTDs
- XML tools
- XSLT
- XML Query

29 Well-formed and Valid documents
- Well-formed XML
  - Each start tag has an end tag
  - XML content is rooted in a single "document element"
  - Valid encoding declaration
- Valid
  - Well-formed
  - All elements mentioned in DTD
  - All entities defined
  - All parent-child relations as described in DTD
  - All attributes used as described in DTD
  - All element IDs unique
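A sketch of the well-formedness half of this distinction, using Python's standard `xml.etree.ElementTree`. That parser is non-validating: it reports well-formedness errors but does not check a document against a DTD, so validity would need a separate validating parser. The helper name `is_well_formed` is an assumption for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<memo><to>Henry</to></memo>"))  # matching tags, one root
print(is_well_formed("<memo><to>Henry</memo>"))       # <to> is never closed
print(is_well_formed("<memo/><memo/>"))               # two document elements
```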

30 Why well-formedness?
- A simpler standard for documents to meet
- Can be determined without reference to a DTD
- Simplifies the parser
- Retains the "standalone" property of HTML, which was a big win.
- Non-validating XML systems can thus still be conformant, provided they check well-formedness
- If you have a DTD (or a Schema) you can do more refined processing.

31 DTDs
- Document Type Definitions: the grammar of a document family
- Elements
- Attributes & values
- Entities & parameter entities
- Comments

32 DTD: Elements
- Elements are used to structure a document. Element types are declared in the DTD:

33 DTD: Attribute declarations
- Attributes specify properties of elements. The attributes which may appear on elements of a given type are also declared in the DTD.

34 DTD: Entity declarations
- Entities provide short names for commonly used strings, and are also declared in the DTD.

35 DTD: IDs
- IDs are rigid designators for particular elements in the document. They are declared using type ID.
- Potentially, IDs allow processors to provide fast random access to parts of documents.
- IDs must be unique. Checking might be onerous.

36 XML tools
- XML Parser
- LT XML Toolkit
- XSLT: xt and Saxon

37 XML Parser
- probably the most important single bit of XML software
- uses the DTD to check if a document instance is valid

38 Example:
>> cat memo.xml
[DTD and tags not preserved:] This is the text of a very short article, with very little internal structure. Here is a reference to the &ltg; entity.

39 Add correct output
Example:
>> xmlnorm -V memo.xml
Entity reference has been replaced with entity text by parser

40 Exercise
- Practice using xmlnorm to check your documents
  - Add some new entities to the memo. Experience some of xmlnorm's error messages
- Begin to think about DTD design
- Practice using Web browsers to look at XML files
- Get a glimpse of what XSL is about

41 DTD: Comments

42 Element type declaration details
[Callouts on an element type declaration; the declaration itself is not preserved]
- keyword
- element type: starts with a-z; may contain hyphens, numbers, stops; not case sensitive in SGML (XML names are case sensitive)
- can be more than one
- content model: an unambiguous regular expression

43 Element types: Content model
+    at least one, possibly more
?    optional
*    zero or more
,    all occur, in that order
|    exclusive or
XML eradicated SGML's neat & (all occur, any order)
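Since the operators are the regular-expression ones, a content model can be checked with an ordinary regex engine. The sketch below is an illustration under assumptions: the helper names are invented, only element content is handled (no #PCDATA, EMPTY or ANY), and real parsers additionally require the model to be unambiguous.

```python
import re

def content_model_to_regex(model):
    """Turn a model such as '(title?, para+)' into a regex over child names."""
    # ',' means plain sequence, which is regex concatenation, so drop it.
    compact = model.replace(" ", "").replace(",", "")
    # Each element name becomes one unit 'name;' so that ?, + and * apply to it.
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:%s;)" % re.escape(m.group(0)),
                     compact)
    return re.compile(pattern)

def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + ";" for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None

print(children_match("(title?, para+)", ["title", "para", "para"]))
print(children_match("(title?, para+)", ["para", "title"]))
```

The model `(title?, para+)` is the one used for the `%section;` parameter entity later in the deck.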

44 Element types: Content model options
- EMPTY
  - no content
  - no end tag
  - point semantics: attributes may specialise
- (#PCDATA)
  - text only
- ANY
  - no constraint: sub-elements and/or text
- ((#PCDATA|emph)*)
  - 'mixed content'

45 Element grammar
- Since the content model is a regular expression, the markup grammar is context free
- Except for one thing: the ANY keyword
- Note that any realistic application interprets the markup tree. The interpretation could be anything. All bets are off...

46 SGML/XML for computational linguists
Example:
>> nsgmls exa2a.sgm
nsgmls:exa2a.sgm:7:42:E: element "PI" undefined
nsgmls:exa2a.sgm:8:24:E: general entity "T." not defined and no default entity
(ARTICLE
(PARA
-Here is some text with an inequality: a
(PI
-2 and an abbreviation: AT
)PI
)PARA
)ARTICLE

47 Escaping special characters
- There are several ways around the problem of introducing XML's meta-syntax characters into documents
- Use numeric character references: AT&#38;T
- Use CDATA marked sections: <![CDATA[... is data &not markup]]>
- XML provides built-in definitions for amp, lt, gt, quot and apos
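The three escaping routes can be tried out with Python's standard library. This is a small sketch; the sample strings are invented for the example. `xml.sax.saxutils.escape` replaces `&`, `<` and `>` with the built-in entities, and the parser turns all three forms back into the same plain characters.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "AT&T says a < b"

# Route 1: built-in entities (amp, lt, gt).
escaped = escape(raw)
print(escaped)
via_entities = ET.fromstring("<para>" + escaped + "</para>").text

# Route 2: a numeric character reference for the ampersand.
via_numeric = ET.fromstring("<para>AT&#38;T</para>").text

# Route 3: a CDATA marked section, where & and < are plain data.
via_cdata = ET.fromstring("<para><![CDATA[a < b & c]]></para>").text

print(via_entities, via_numeric, via_cdata, sep=" | ")
```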

48 SGML/XML for computational linguists
Example:
>> nsgmls exa2b.sgm
(ARTICLE
(PARA
-Here is some text with an inequality: a

49 DTD: Comments
- double hyphens act as comment delimiters

50 DTD: Attributes

51 DTD Attribute declarations: syntax
[Callouts on an attribute declaration; the declaration itself is not preserved]
- keyword
- element type
- attribute name
- attribute type
- default type: #REQUIRED, #IMPLIED (= optional), #FIXED

52 Attribute Value types (contd)
CDATA     valid SGML characters
ENTITY    declared entity name
ID        unique name
IDREF     reference to a unique name

53 Cross-references
[DTD and tags not preserved:] Here is some text. In section [reference] we showed you how to create cross-references.

54 IDs and IDREFs
- In a valid SGML/XML document
  - IDs are unique
  - IDREFs are discharged
- Applications may interpret IDREF/ID connections
- Links from elsewhere may target IDs
  - cf. HTML 'name' attribute as the target for #...
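The two validity conditions can be checked by hand when no validating parser is available. This is a sketch under assumptions: the function name `check_ids` is invented, the attribute names `id` and `refid` are borrowed from the examples in the attribute table, and the caller has to say which attributes are of type ID and IDREF because no DTD is consulted. Only single-valued IDREF is handled, not IDREFS.

```python
import xml.etree.ElementTree as ET

def check_ids(root, id_attr="id", ref_attr="refid"):
    """Return (duplicate IDs, references that point at no ID)."""
    seen, duplicates = set(), set()
    for element in root.iter():
        value = element.get(id_attr)
        if value is None:
            continue
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    refs = {e.get(ref_attr) for e in root.iter() if e.get(ref_attr) is not None}
    return duplicates, refs - seen

doc = ET.fromstring(
    '<doc><section id="s1"/><section id="s1"/>'
    '<xref refid="s1"/><xref refid="s9"/></doc>'
)
duplicates, dangling = check_ids(doc)
print(duplicates, dangling)
```

Every ID has to be remembered until the end of the document, which is the cost the earlier slide alludes to when it says checking might be onerous.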

55 Attribute value types: list
CDATA        valid SGML characters         author='Robin Hood'
ENTITY/IES   declared entity name(s)       figs='pict2 pict7'
ID           unique name                   id='foo37'
IDREF(S)     reference(s) to an ID         refid='foo2 foo37'
NMTOKEN(S)   name(s) w/o i.c. restraint    code='96-mm01 98-a'
NOTATION     data content notation         encoding='eps'

56 Enumerated attribute values
- Attribute values can also be constrained to be one of a finite set of allowed values
[Examples not preserved; one was labelled 'Not valid']

57 Elements vs Attributes
- Elements: content is unconstrained; order will be enforced
- Attributes: content is constrained; order is unconstrained

58 DTD: Entities
[DTD and tags not preserved:] The &ltg; carries out application-oriented research in language engineering. The &ltg; is based within the HCRC.
- Each occurrence of &ltg; in the text is replaced by Language Technology Group during parsing.
- Entities can be nested.
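Entity replacement can be watched happening with a non-validating parser. In this sketch the entity name `ltg` and its replacement text come from the slide; the `para` document element is an assumption, since the original DTD is not preserved. Python's `xml.etree.ElementTree` expands internal general entities declared in the document's own DOCTYPE.

```python
import xml.etree.ElementTree as ET

# The entity declaration sits in the internal DTD subset of the document.
DOC = """<!DOCTYPE para [
  <!ENTITY ltg "Language Technology Group">
]>
<para>The &ltg; carries out application-oriented research in language engineering.</para>"""

para = ET.fromstring(DOC)
print(para.text)  # the reference is gone; the entity text is in its place
```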

59 DTD: Parameter Entities
- Like entities, except within the DTD
- Each time the parser finds %section; in the DTD, it will replace it with (title?, para+)

60 DTD
- That's almost all there is to it
- For more detail, see the XML standard
  - Which, as Michael Kay puts it, is like tax legislation
- DTD syntax differs from element syntax
  - Harder to learn/use
  - XML Schema
- Also, DTDs were designed to be used by document designers, not for distributed data interchange
  - XML can use a DTD, but doesn't assume one.
  - Composite documents entail composite DTDs, but these don't exist.
  - Namespace prefixes add extra complexity

61 "Problems" with XML
- Uses complex and weird terminology
  - Yes. But so does the ANSI C standard. So do most fields...
- Not convenient for specifying graphs (as opposed to trees)
  - This is a point about graphs, not XML. Unification grammar notations get unwieldy too.
- Not as convenient as plain text
  - True for some tasks, but the extra structure of XML lets you do things that you wouldn't even try with plain text.

62 Simple SGML tools
- Simple equivalents of UN*X tools are available (for free) to do simple SGML processing
- We'll introduce them using examples, and give details at the end

63 sggrep
- LT XML program for searching for structure and text in XML files
- sggrep -q query -s subquery -t regexp in.xml
- Options
  - -d DTD : Specify a DTD explicitly. File is an XML file
  - -r : Attribute values in queries are regular expressions.
  - -v : Invert sense of sub-query + regexp.
  - Other options

64 LT XML query language
- Two-dimensional regular expressions
- First dimension is over tree paths
  - Based on file path analogy: DIV/PARA/W matches Ws inside PARAs inside (toplevel) DIVs
- Second dimension is regular expressions over text content of leaf nodes
  - Select Ss containing Ws whose text is it's or its: -q S -s './W' -t "^(it's|its)$"
  - Full UTZOO (Henry Spencer) regular expression support
- Influential, slightly dated now.

65 sggrep: examples of use
- sggrep -q ".*/P/S" -s "./W[TAG=NN]"
  - find all S elements occurring inside a P element at any depth which immediately contain a W element with attribute TAG="NN".
- sggrep -q ".*/P/S/W[TAG=NN]"
  - find those W elements themselves
- sggrep -q ".*/S/W[0]" -t "^[a-z]"
  - find all sentence-initial words starting with a lower case letter.
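For readers without LT XML to hand, the three queries can be approximated with the XPath subset in Python's `xml.etree.ElementTree`. This is a rough analogue, not a reimplementation of sggrep: the sample corpus and tag values are invented, and where LT XML counts positions from 0 the sketch simply takes the first child of each S.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample; tag values are illustrative only.
CORPUS = """<TEXT><P>
<S><W TAG="AT">The</W><W TAG="NN">cat</W><W TAG="VBD">sat</W></S>
<S><W TAG="PP">it</W><W TAG="VBD">slept</W></S>
</P></TEXT>"""
root = ET.fromstring(CORPUS)

# Roughly: sggrep -q ".*/P/S" -s "./W[TAG=NN]"
sentences = [s for s in root.findall(".//P/S")
             if s.find("./W[@TAG='NN']") is not None]

# Roughly: sggrep -q ".*/P/S/W[TAG=NN]"
nouns = [w.text for w in root.findall(".//P/S/W[@TAG='NN']")]

# Roughly: sggrep -q ".*/S/W[0]" -t "^[a-z]"
lower_initial = [s[0].text for s in root.iter("S")
                 if len(s) and s[0].tag == "W"
                 and re.match(r"[a-z]", s[0].text or "")]

print(len(sentences), nouns, lower_initial)
```

The split between a path query (which elements) and a text regex (which content) mirrors the two dimensions of the LT XML query language.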

66 sgmltrans
- converts XML into different formats: sgmltrans -r rulefile file.nsg > file.txt
- sample rule file:
.*/W            matches W
""              what to print at start tag
"/$TAG\n"       what to print at end tag: value of TAG attribute
.*/W/#          matches text inside W
" " --> ""      text replacement: eliminate space if any
.*/S            matches S
""              start tag: nothing
"\n"            end tag: make each S on separate line
.*              matches other markup

67 sgmltrans: example of use
The previous rule file would do this:
Input [tags not preserved]: The cat sat. on the mat.
Output: The/A cat/B sat/ on/A the/B mat/
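The effect of such a rule file can be imitated directly. This sketch is an assumption-laden stand-in, not sgmltrans: the input markup is invented (the slide's tagged input is not preserved), and the function `word_slash_tag` hard-codes the "word/TAG, one S per line" layout instead of reading rules.

```python
import xml.etree.ElementTree as ET

# Invented input: W elements with an optional TAG attribute inside S elements.
SAMPLE = ('<P><S><W TAG="A">The</W> <W TAG="B">cat</W> <W>sat</W></S>'
          '<S><W TAG="A">on</W> <W TAG="B">the</W> <W>mat</W></S></P>')

def word_slash_tag(root):
    """Render each S on its own line as word/TAG pairs, dropping other markup."""
    lines = []
    for sentence in root.iter("S"):
        pairs = ["%s/%s" % ((w.text or "").strip(), w.get("TAG", ""))
                 for w in sentence.iter("W")]
        lines.append(" ".join(pairs))
    return "\n".join(lines)

rendered = word_slash_tag(ET.fromstring(SAMPLE))
print(rendered)
```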

68 sgrpg: SGML report generator
- Program for making more complex queries of normalised SGML and for transforming SGML.
- Provides nested subqueries and sequencing
- Usage:
  - sgrpg query sub-query regexp out-fmt oargs file.txt
  - sgrpg -f pat-file file.txt
- This now looks like a design study for XSLT and XML Query.
- Has one advantage: designed (from the outset) for big documents

69 The British National Corpus
- 2 gigabytes of contemporary English
- Marked up to word level with part-of-speech tags
- Extract data:
  - zcat medium.xml.gz | sggrep -q ".*/W[TYPE=NN1]"
  - gives all singular nouns in a part of the corpus, e.g. part meeting while funeral loss meeting time

70 The BNC: an example (2)
zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" \
  -t "^[Rr]ight$"
gives sentences containing non-adjectival uses of the word 'right', e.g. Yes that was, that was right...

71 The BNC: an example (3)
Format the output into a more readable form:
zcat medium.xml.gz | \
  sggrep -q ".*/S" -s "./W[TYPE!=AJ0]" -t "^[Rr]ight$" | \
  sgmltrans -r test.rule
Yes/ITJ that/DT0 was/VBD, that/DT0 was/VBD right/AV0 erm/UNC there/EX0 was/VBD a/AT0 limit/NN1 to/PRP how/AVQ much/AV0 you/PNP could/VM0 spend/VVI aswell/AV0 was/VBD n't/XX0 there/EX0 ?
He/PNP goes/VVZ into/PRP a/AT0 restaurant/NN1 and/CJC he/PNP says/VVZ oh/ITJ the/AT0 waiter/NN1 erm/UNC let/VVB me/PNP see/VVI the/AT0 menu/NN1 and/CJC he/PNP looks/VVZ at/PRP the/AT0 menu/NN1 and/CJC said/VVD right/AV0, he/PNP said/VVD.

72 An extended example: Noun Compounds
- Noun compounds in the British National Corpus
  - What is a noun compound? Too hard.
  - Simple approximation? Sequence of tags matching NN...
  - BNC uses a version of the Brown tags, where NN0, NN1, ... are all variants of Noun
- A pipeline of SGML-aware tools will do the job
  - sgrpg | sggrep [ | ...]
  - Use sgrpg to wrap such tag sequences in <G>...</G>
  - Use sggrep to filter the output.
  - Use further tools to tabulate, format, etc.

73 An extended example: The pipe
- Step by step through the pipe
- sgrpg -r -f np-pat.xml | ...
  - Group the sequences
  - -r use regexp matching
  - -f script file
- ... sggrep -d groups.xml -q '.*/G'
  - extract the sequences
  - -d DTD
  - -q query (selects groups)
- Result [tags not preserved]: Local government districts...

74 An extended example: filtering
- Find all words with unresolved tags, e.g. AJ0-NN1
  - use regexp matching, which is unanchored by default
  - ...| sggrep -r -q './W[TYPE="-"]' |...
- Find all words in second position
  - ...| sggrep -q './W[1]' |...
- Find all words with unresolved tags in second position
  - ...| sggrep -r -q './W[1 TYPE="-"]' |...

75 An extended example: counting
- Count all words in second position
  - ...| sggrep -q './W[1]' | sgcount
- Count all words with unresolved tags in second position
  - ...| sggrep -r -q './W[1 TYPE="-"]' | sgcount
- Results:
  - all 2nd place W: 23283
  - 2nd place W with unresolved tag: 5066

76 An extended example: long compounds
- Long compounds including 'government'
- Use subquery to select <G>s with 'government':
  - sggrep -q G -s './W' -t government
- Next step, discard short ones:
  - sggrep -q G -s './W[2]'
- Then sgmltrans for neater format
- Results:
  official/AJ0-NN1 government/NN0 report/NN1-VB
  Local/AJ0-NN1 government/NN0 districts/NN2
  ...

77 An extended example: left context
- select for 'government' in 2nd place
  - ... | sggrep -q G -s './W[1]' -t government |
- pull words from first place
  - sggrep -q './W[0]' |
- remove markup
  - textonly |
- use UN*X for the rest
  - sort | uniq -c | sort -nr | head -4
- Result:
  6 French
  5 German
  4 interim
  4 Chinese
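The same left-context count can be sketched in one Python pass. The sample groups below are invented, so the counts differ from the BNC figures above; the sketch also tests for exact equality with 'government', whereas the sggrep regexp is unanchored. `Counter.most_common` plays the role of `sort | uniq -c | sort -nr | head`.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Invented stand-in for the <G> groups produced by the sgrpg stage.
GROUPS = """<GROUPS>
<G><W>French</W><W>government</W><W>policy</W></G>
<G><W>German</W><W>government</W></G>
<G><W>French</W><W>government</W></G>
<G><W>local</W><W>council</W></G>
</GROUPS>"""
root = ET.fromstring(GROUPS)

left_context = Counter()
for group in root.iter("G"):
    words = group.findall("W")
    # 'government' in second place (index 1): count the word in first place.
    if len(words) > 1 and words[1].text == "government":
        left_context[words[0].text] += 1

print(left_context.most_common(4))
```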

78 British International Corpus?
- We are more francophone than we think!
- Longest 'noun-phrase' in 10% of BNC is: serai/NN1 mentionné/NN1 dans/NN2 le/NN1 rapport/NN1-VB qui/NN1 te/NN1 sera/NN1 remis/NN1
- No disgrace that the part-of-speech tagger gave up here.
- Tools can't be better than their input allows

79 XML Conclusions
- XML is the wave of the future
  - Both Microsoft and Netscape have endorsed it
  - Both Mozilla and IE5 have XML support built in
  - Very good free software is available
  - Microsoft seem to be serious about standards compliance
  - The W3C have made it clear that all subsequent W3C standards for web distribution of information will be based on XML (cf. SMIL, SVG and RDF)
- Issues
  - XSLT efficiency: space and time.

80 To read
- Robin Cover's SGML/XML Web Page
  - includes many pointers to SGML tutorials, overviews, publications
- The Whirlwind Guide to SGML & XML Tools and Vendors
- The XML FAQ
  - An excellent introduction to XML with pointers to useful resources for newcomers to the standard

81 SGML/XML for Linguistics
- 2.1 Programs for querying/modifying SGML
  - an example
  - what is needed
  - available tools
- 2.2 SGML marked-up corpora
  - some existing resources
- 2.3 Related developments
  - SSTML
  - SGML for X-waves

82 An example
- You want to build a system that performs a particular LE task
- You have a corpus of texts for
  - analysis (detecting textual regularities)
  - system training
  - system testing
- Use XML
  - Why?
  - How?

83 Why use XML?
- Use structure of text to fine-tune certain tools
  - e.g. build a tokeniser which specifically works for headlines of newspaper articles
- Annotate text with linguistic information
  - e.g. use SGML tags to record the results of a tokeniser or part-of-speech tagger, so that other tools can make use of this information
- Ensure that others (and you two years from now :-) will have easy access to your results
  - No special-purpose parser required
  - Simple retrieval and tabulation with existing free tools
  - DTD provides some self-documentation

84 What is needed to use XML?
- XML is text
  - Therefore: you can use any UNIX text manipulation program, e.g. grep, sed, awk, perl, etc.
- XML is annotated text
  - Therefore: needed: versions of these tools that are XML-aware

85 What is needed to use XML?
- SGML reflects the hierarchical structure of a text
- You want to be able to tell tools to operate on a particular part of the SGML-annotated text, for example:
  - all WORD elements with attribute POS set to JJ (i.e. all adjectives)
  - occurring within the first PARAGRAPH of the main BODY of an ARTICLE;
  - or occurring within the HEADLINE of an ARTICLE
- Needed: a query language over XML structures

86 What is needed to use XML?
- XML-aware versions of text processing tools
- Query language
- In fact sggrep is just a simple wrapper round our query language. Our query language and interface are designed to work with big files, so they don't read the whole document into memory unless absolutely necessary. Most competitors do read it all in.

87 XML tools: the LT XML library
- sggrep is part of an SGML toolset, called LT XML
- Developed by the Language Technology Group (Edinburgh)
  - see:
- XML Library with
  - Command-line tools
  - Application Programming Interface (API)
- Available for WIN32, UN*X (and Mac)
- LT XML processes XML or nSGML
  - nSGML now looks like a design study for XML

88 LT XML: Command-line tools
- sggrep: retrieving context-sensitive data
- sgmltrans: transforming information
- sgrpg: more complex queries/reformatting
- textonly: strips out SGML markup
- sgcount: counts SGML tags
- knit: resolves XML-link links
- others

89 LT NSL: APIs
- LT NSL Application Program Interfaces: procedure calls to help you write your own programs to process nSGML
  - C language API
  - Python language API

90 C API for specialised access
- Write your own programs to read/write SGML/XML
- LT XML provides a rich API
- Both event and tree views of the document stream
- The distribution includes two heavily commented example programs.

91 91 Summer School, July 2000 XML and Linguistic Annotation Python language API for LT XML nExperimental integration of the LT XML API into Python (free portable object-oriented scripting language) nUses the Tk portable widget library for the graphical UI nReflects the document stream as Python objects

92 92 Summer School, July 2000 XML and Linguistic Annotation Specialised XML editors nUsing the Python API we have written a number of specialised processors: nA WYSIWYG XML instance editor (XED) nSeveral specialised annotation tools, e.g. PoS correctors, span coders nLimited set of operations nPreserve validity nHide structure from the user

93 93 Summer School, July 2000 XML and Linguistic Annotation Dataflow in LT NSL programs [Diagram: dataflow through mknsg and unknit; labels include nSGML, NSL, C(++) program, stream, API, parser, DDB file, file1.sgm, file2.sgm, ...]

94 94 Summer School, July 2000 XML and Linguistic Annotation The Edinburgh MapTask Corpus nContents n128 task oriented spontaneous Scottish dialogues nsmall corpus, but very dense and detailed SGML markup. nAvailability: nTranscripts and digitized speech on 8 CD-ROMS: or from the LDC nWhat is its markup like? n(early) TEI-compliant nTurns, pointers into the speech, identification of non-words. nWord-level transcripts with timing markup available soon via the Internet

95 95 Summer School, July 2000 XML and Linguistic Annotation HCRC Maptask: an example mknsg q1ec1.turns.sgm | sggrep -q ".*/W[TAG=at]" a an the

96 96 Summer School, July 2000 XML and Linguistic Annotation Parsed HCRC Maptask : an example mknsg q1ec1.g.syn.sgm | sggrep -q ".*/NP" | sgmltrans -r mt.rule we a caravan park we an old mill on the right hand side an old mill on the right you...

97 97 Summer School, July 2000 XML and Linguistic Annotation The MLCC corpus nContents nFinancial Newspaper texts: Dutch, English, French, German, Italian, Spanish nParallel texts: The Journal of the European Commission, Written Questions (1993). Corpus of European Parliamentary debates ( ). (languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish ). Markup nAvailable nfrom ELRA:

98 98 Summer School, July 2000 XML and Linguistic Annotation The MLCC Corpus: an example zcat exp.joc en.01.tei.gz |\ mknsg | \ sggrep -q ".*/DIV4[TYPE=Q]/HEAD" Subject: The staffing in the Commission of the European Communities Subject: Supplies of military equipment to Iraq Subject: Commission plans to liberalize the postal sector and to abolish the State monopoly Subject: New industries in Attika...
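
For comparison, the same selection can be sketched with ElementTree's limited XPath support. This assumes the sggrep pattern means "a HEAD directly under a DIV4 whose TYPE is Q, at any depth", which is my reading of the query rather than a documented equivalence. The sample fragment is hypothetical (including the TYPE="A" division); only the DIV4, TYPE and HEAD names and the Subject line come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for illustration.
SAMPLE = """<TEXT>
  <DIV4 TYPE="Q">
    <HEAD>Subject: Supplies of military equipment to Iraq</HEAD>
    <P>(question text)</P>
  </DIV4>
  <DIV4 TYPE="A">
    <HEAD>(answer heading)</HEAD>
  </DIV4>
</TEXT>"""


def question_heads(root):
    """Return the text of each HEAD that is a child of a DIV4 with TYPE=Q."""
    return [head.text for head in root.findall(".//DIV4[@TYPE='Q']/HEAD")]


root = ET.fromstring(SAMPLE)
heads = question_heads(root)
print(heads)
```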

99 99 Summer School, July 2000 XML and Linguistic Annotation The same example for French zcat exp.joc fr.01.tei.gz |\ mknsg | \ sggrep ".*/DIV4[TYPE=Q]/HEAD" "" Objet: Organigramme de la Commission Objet: Livraisons de matériel militaire à l'Irak Objet: Projets de la Commission visant à libéraliser et à abolir le monopole d'État dans le secteur des postes Objet: Nouvelles industries en Attique Corresponds to the English data: suitable input for multilingual alignment experiments.
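
As a first step towards such alignment experiments, headings extracted from the two language versions can be paired up. The sketch below assumes that the headings come out in the same order in both files, which the slides suggest but which should be checked on real data. The function name `pair_heads` is hypothetical.

```python
def pair_heads(english, french):
    """Pair headings by position; refuse if the two lists differ in length."""
    if len(english) != len(french):
        raise ValueError("head counts differ; positional pairing is not safe")
    return list(zip(english, french))


# Two of the headings shown on the slides, in their original order.
english = [
    "Subject: Supplies of military equipment to Iraq",
    "Subject: New industries in Attika",
]
french = [
    "Objet: Livraisons de matériel militaire à l'Irak",
    "Objet: Nouvelles industries en Attique",
]

pairs = pair_heads(english, french)
for en, fr in pairs:
    print(en, "<->", fr)
```

Positional pairing is the crudest possible aligner; it is useful mainly as a sanity check before running anything more sophisticated.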

100 100 Summer School, July 2000 XML and Linguistic Annotation The Text Encoding Initiative (TEI) nThe TEI is a large and well documented DTD for textual markup. nUse it if you can nNow has an XML version nLarge and comprehensive hardcopy documentation available nhttp://www.uic.edu/orgs/tei/ nDTDs available there as well

101 101 Summer School, July 2000 XML and Linguistic Annotation The Linguistic Data Consortium nLDC - based in Pennsylvania USA nDistributes text corpora nSee: nSGML Corpora include: nThe European Language Newspaper Text corpus French (100 million words), German (90 million words) and Portuguese (15 million words). SGML. nTIPSTER Information Retrieval Text Research Collection 3 gigabytes. SGML-like. Various English texts. nUnited Nations Parallel Text Corpus (English, French, Spanish) Fully-compliant SGML, 2.5 gigabytes

102 102 Summer School, July 2000 XML and Linguistic Annotation Tutorials nXML: far too many to mention nXSL: nXSL specification nRobin Cover's guide

103 103 Summer School, July 2000 XML and Linguistic Annotation Resources nLT-XML nFull-text search Witten, Moffat and Bell's Managing Gigabytes

104 104 Summer School, July 2000 XML and Linguistic Annotation Corpus Tools nStuttgart Corpus Workbench nBirmingham Qwick nThe MATE Workbench (NB: prototype)

105 105 Summer School, July 2000 XML and Linguistic Annotation Bibliography nMcKelvie, Brew, Thompson: Using SGML as a Basis for Data-Intensive Natural Language Processing, Computers and the Humanities, 31(5): , 1997 nSinclair, Mason, Ball, Barnbrook: Language Independent Statistical Software for Corpus Exploration, Computers and the Humanities, 31(3): , 1998 nReferences on McKelvie's MATE workbench page nWelty and Ide: Using the right tools: enhancing retrieval from marked-up documents, Computers and the Humanities, 33(10): nAlignment graphs (and much else): Steven Bird's Linguistic Annotation Page

106 106 Summer School, July 2000 XML and Linguistic Annotation Annotation topics _Item annotations nWords, Parts-of-speech, lemmas nSimple annotations (one data stream) nBoundaries, Spans, Partitions nComplex annotations (multiple data streams) nSequences, Graphs, Overlaps nData models for annotation access nStreams, Trees, Graphs, Databases _Human factors in annotation nWriting instructions, Measuring and improving reliability

107 107 Summer School, July 2000 XML and Linguistic Annotation XML topics nData formats nHTML, XML and SGML nData Description Formalisms nDTDs, XML Schema nStyle Languages nXSLT nQuery Languages nAnnotation Graphs, XML Query, XQL, Quilt, LORE

108 108 Summer School, July 2000 XML and Linguistic Annotation Exercises On average, these exercises should take about one hour to complete. Try not to spend longer. nCreate an XML document nCreate a very simple memo nSimple annotation nDisambiguate parts-of-speech nCompare results with those made by a partner. nStyle nCreate an XML DTD and an XSL style sheet for displaying POS-tagged text in a browser.
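
When comparing your tags with a partner's, it helps to put a number on the agreement. The sketch below computes raw agreement and Cohen's kappa, a widely used chance-corrected measure; the slides do not prescribe a particular statistic, so treat the choice of kappa as my assumption. The two tag sequences are made up for illustration.

```python
from collections import Counter


def observed_agreement(a, b):
    """Proportion of items on which the two annotators chose the same tag."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Agreement corrected for chance, given each annotator's tag frequencies.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(a)
    p_o = observed_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same tag independently.
    p_e = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical annotations of the same five tokens.
mine = ["DT", "JJ", "NN", "VB", "NN"]
partner = ["DT", "NN", "NN", "VB", "NN"]

print(observed_agreement(mine, partner))
print(round(cohen_kappa(mine, partner), 3))
```

Raw agreement flatters annotators when one tag dominates the data, which is why the chance-corrected figure is usually the one reported.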

109 109 Summer School, July 2000 XML and Linguistic Annotation Exercises nMore complex annotation nsyntactic annotation in Penn tree bank style. nAs before, compare results nSearch nExercise XML search tools on the newly annotated texts

110 110 Summer School, July 2000 XML and Linguistic Annotation Projects These are open-ended projects hard enough to merit write-up in a research paper. I’d willingly supervise these. nDesign a DTD and an XSL stylesheet for tree bank style syntactic annotations. Implement a convenient interface allowing these annotations to be edited over the Web. nInvestigate the corpus search tools provided at the LDC website. What do they do? Could they and should they use XML/XSL technology for the same purpose? (Easiest if your institution has an LDC membership).

111 111 Summer School, July 2000 XML and Linguistic Annotation Projects (contd) nCritical review of the Talkbank tools (www.talkbank.org) nDesign an XML query language that works well with very big documents nWhat sort of annotation structure for dialog? (cf. MATE) nDesign an optimizing compiler for XSLT (cf. Sun’s very recent XSL compiler) nDoes XSLT support language modeling and statistical computation? (If you put XSLT and Splus into a closed box and shake vigorously, what emerges?)

112 112 Summer School, July 2000 XML and Linguistic Annotation In Summary nPhew!

