Ting Qian, Human Language Processing Lab, Brain and Cognitive Sciences


1 Ting Qian, Human Language Processing Lab, Brain and Cognitive Sciences

2 Dr. T. Florian Jaeger; my father; my friends who voluntarily gave me their Chinglish essays; people at the HLP lab

3 1) Meanwhile, Brent crude hit an all-time peak of $ before falling back. 2) Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected. 3) US light, sweet crude oil rose to a fresh high of $ before slipping back to $

4 US light, sweet crude oil rose to a fresh high of $ before slipping back to $ Meanwhile, Brent crude hit an all-time peak of $ before falling back. Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected.

5 Humans as rational agents who optimize the flow of information in language production
If humans try to communicate in the most efficient way, they should produce language:
◦ Action: by putting less information into words or sentences with little prior context, and more later on
◦ Goal: to ensure that the increase of information is uniform

6 Uniform Information Density (UID)

7 An engineering perspective: the most efficient way of communicating through a noisy channel is to send information at a constant rate (information theory; Shannon, 1948).

8 No good models of the information of a sentence in context exist. Methods from natural language processing provide reasonably good estimates of the out-of-context information of sentences.

9 Intuitively, less contextual information is available at the beginning of a discourse. If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners). The out-of-context information at the beginning of a discourse should therefore be lower than later in the discourse.
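As a rough illustration of what "out-of-context information" could mean here, the sketch below treats the information of a sentence as the sum of -log2 p over its words and averages it per word. The per-word averaging and both probability lists are assumptions made for illustration; they are not values from the talk or from any corpus.

```python
import math

def sentence_information(word_probs):
    """Total information of a sentence in bits: sum of -log2 p over its words."""
    return sum(-math.log2(p) for p in word_probs)

def per_word_information(word_probs):
    """Average information per word, in bits."""
    return sentence_information(word_probs) / len(word_probs)

# Hypothetical word probabilities, invented for this example only.
predictable = [0.5, 0.5, 0.25, 0.5]        # stands in for an early, easy sentence
surprising = [0.125, 0.25, 0.125, 0.0625]  # stands in for a later, harder sentence

early_bits = per_word_information(predictable)  # (1 + 1 + 2 + 1) / 4 = 1.25
late_bits = per_word_information(surprising)    # (3 + 2 + 3 + 4) / 4 = 3.0
```

Under the prediction on this slide, sentences near the start of a discourse would tend to look like the first list and later sentences like the second.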

11 Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse. They found that:
◦ The information of sentences increases with sentence number in a discourse.
◦ The increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.

12 Evaluate UID on Chinese written corpora by measuring information content. Evaluate UID on a Chinese English (Chinglish) corpus. Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?

14 Four corpora are used:
◦ XIN – Beijing Xinhua News
◦ SINO – Taiwan Sinorama Magazine
◦ HK – Hong Kong News (too little data)
◦ VOA – Voice of America Chinese News
We build n-gram language models to measure the (un)predictability of written Chinese sentences.
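A minimal sketch of how a trigram language model can turn a segmented sentence into an unpredictability score in bits per word. The talk does not say which smoothing method or toolkit was used; add-one smoothing, the padding symbols, and the two-sentence toy corpus below are assumptions chosen only to keep the example short.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols

def pad(tokens):
    """Add two start symbols and one end symbol around a token list."""
    return [BOS, BOS] + list(tokens) + [EOS]

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = pad(tokens)
        vocab.update(padded[2:])  # every symbol the model may predict
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, vocab

def per_word_bits(tokens, tri, bi, vocab):
    """Average -log2 P(word | two previous words), with add-one smoothing."""
    padded = pad(tokens)
    v = len(vocab) + 1  # one extra slot reserves mass for unseen words
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        p = (tri[context + (padded[i],)] + 1) / (bi[context] + v)
        total += -math.log2(p)
    return total / (len(padded) - 2)

# Toy corpus, invented for illustration (not from XIN, SINO, HK, or VOA).
corpus = [
    ["许多", "家庭", "拥有", "电话"],
    ["许多", "家庭", "拥有", "梦想"],
]
tri, bi, vocab = train_trigram_counts(corpus)
seen = per_word_bits(["许多", "家庭", "拥有", "电话"], tri, bi, vocab)
unseen = per_word_bits(["电话", "拥有", "家庭"], tri, bi, vocab)
```

A sentence that matches the training data (`seen`) receives fewer bits per word than a scrambled one (`unseen`), which is the sense in which the score measures unpredictability.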

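15 二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. (Twenty years ago, the dream of many Chinese families was to own a telephone.)
Trigrams: 二十 年 前 | 年 前 , | 前 , 许多 | , 许多 中国 | …... | 部 电话 。
P( 二十 年 前 ) = 0.1%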
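The sliding trigram window on this slide can be sketched as follows, together with the conversion of the slide's example probability of 0.1% into bits. The helper name `ngrams` is hypothetical, and treating 0.1% as a plain probability to be passed through -log2 is an assumption about how the number was meant.

```python
import math

# The segmented example sentence from the slide, split on spaces.
tokens = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()

def ngrams(seq, n=3):
    """Return all contiguous n-token windows of seq, in order."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

trigrams = ngrams(tokens)

# Information, in bits, carried by an event with probability 0.1%.
bits = -math.log2(0.001)
```

The 15 tokens yield 13 trigrams, starting with (二十, 年, 前) and ending with (部, 电话, 。), and a probability of 0.1% corresponds to just under 10 bits.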

16 Lexicalized part-of-speech n-gram: 二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
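One way to work with tagged text of this kind is to split each word_TAG unit into its word and its part-of-speech tag, so that n-grams can be built over the joined units, over words alone, or over tags alone. The sketch below only does the splitting; the function name `split_tagged` is hypothetical, and how the talk actually combined words and tags in its models is not stated on the slide.

```python
# The tagged example sentence from the slide, one word_TAG unit per token.
tagged = ("二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG "
          "梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU")

def split_tagged(text):
    """Split 'word_TAG' units into (word, tag) pairs, using the last underscore."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

pairs = split_tagged(tagged)
words = [w for w, _ in pairs]   # lexical sequence
tags = [t for _, t in pairs]    # part-of-speech sequence

# Tag-only trigrams, which ignore word identity altogether.
tag_trigrams = [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]
```

Comparing models built over `words` with models built over `tags` is one plausible way to separate lexical from non-lexical contributions, in the spirit of the distinction attributed to Genzel & Charniak earlier in the talk.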

17 With respect to an entire document:
◦ Sentence effect in a document
◦ Paragraph effect in a document

20 With respect to the immediate containing domain of the linguistic unit in question.
Predictors:
1. Sentence position in paragraph
2. Paragraph position in document
3. Word position in sentence
Multiple regression on the above three predictors.
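A multiple regression on these three predictors can be sketched with ordinary least squares. The version below uses only the standard library and solves the normal equations directly; the data are synthetic and noise-free, generated from made-up coefficients, so the fit simply recovers them. Nothing here reflects the talk's actual data, software, or estimates.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_ols(rows, y):
    """Least-squares fit of y on the predictors in rows, plus an intercept."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic observations: (sentence position in paragraph,
# paragraph position in document, word position in sentence).
rows = [(s, p, w) for s in range(1, 4) for p in range(1, 4) for w in range(1, 4)]
# Made-up response in bits, built from hypothetical coefficients.
y = [4.0 + 0.30 * s + 0.20 * p + 0.10 * w for s, p, w in rows]

intercept, b_sentence, b_paragraph, b_word = fit_ols(rows, y)
```

A positive coefficient on a position predictor would correspond to information rising with that position while the other two are held fixed.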

21 Sentence position in paragraph

22 Limited amount of context information available. Information goes up and converges (after removal of early words).

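23 二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. (Twenty years ago, the dream of many Chinese families was to own a telephone.)
Trigrams: 二十 年 前 | 年 前 , | 前 , 许多 | , 许多 中国 | …... | 部 电话 。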

24 We replicated Genzel & Charniak's study on Chinese corpora.
◦ Sentence effect within documents is not found.
◦ However:
  ◦ Paragraph effect within documents is consistent with UID.
  ◦ Sentence effect within paragraphs is also found.
Due to the size of the data, effects are observable only early in discourse (viable cut-offs are low).

25 We are the first to look at the effect of word position within sentences.
◦ Information content increases with word position.
◦ Context estimation leads to early convergence.
Does the increase of information only occur locally in Chinese?
◦ Current data seem to support this idea.

26 Writing style? Could be.
◦ Chinese – Summarization & Expansion
◦ English – Narrative style

27 A collection of English essays written by native Chinese speakers.
◦ Corpus of English as a Second Language (CESL)
We trained a language model on the Brown Corpus (American English) and used the model to measure the information content of Chinese English sentences.

28 XIN: - p<0.001*** CESL: - p= *

29 The average information content is much higher in Chinese English (8.2 to 8.4) than in Chinese (4.5 to 5.0). It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).

30 Chinese, English, and Chinglish
◦ Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID.
◦ Further studies are needed to discover more properties of Chinglish.
Possible reasons why Chinglish is harder to understand:
◦ Higher information content
◦ Again, writing style

31 Questions?

