CH.4 PROBABILITY AND TEXT SAMPLING (2011.10.19) Data Mining Lab, 이아람

4.5 THE BAG-OF-WORDS MODEL
Only word frequencies are analyzed; word order is irrelevant.
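A minimal sketch of the idea in Python (the tokenization and the example phrases are my own illustrations, not taken from the chapter). Because only per-word counts are kept, reordering the words leaves the representation unchanged:

```python
from collections import Counter

def bag_of_words(text):
    # Keep only per-word counts; word order is discarded.
    return Counter(text.lower().split())

print(bag_of_words("the cat ate the bird"))
# Counter({'the': 2, 'cat': 1, 'ate': 1, 'bird': 1})

# Same words, different order -> same bag
print(bag_of_words("the bird the cat ate") == bag_of_words("the cat ate the bird"))
# True
```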

4.6 THE EFFECT OF SAMPLE SIZE
How the number of types is related to the number of tokens as the sample size increases.

4.6.1 TOKENS vs TYPES
Tokens: every word is counted, including repetitions.
Types: each distinct word is counted only once; repetitions are ignored.
Example: "The cat ate the bird." contains 5 tokens but only 4 types, because "the" occurs twice.

Notation
N = the size of the text sample (the number of tokens)
V(N) = the number of types
w_i = the word labeled i (the i-th word type)
f(w_i, N) = the frequency of the word w_i in a text of size N
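These quantities can be computed directly for the example sentence above; a short sketch (the lowercase, alphabetic-only tokenization is a simplifying assumption):

```python
import re

text = "The cat ate the bird."
tokens = re.findall(r"[a-z]+", text.lower())  # crude tokenization: lowercase alphabetic strings

N = len(tokens)              # number of tokens        -> 5
V_N = len(set(tokens))       # V(N), number of types   -> 4
f_the = tokens.count("the")  # f("the", N)             -> 2

print(N, V_N, f_the)
```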

TOKENS vs TYPES

Tokens vs Types (Figure 4.5)

Tokens vs Tokens/Types (Figure 4.6)

Tokens vs Tokens/Types (2) (Figure 4.7)
"The Black Cat": 3.17 tokens per type.
"The Unparalleled Adventures of One Hans Pfaall": 5.61 tokens per type.
Computed for sample sizes N of roughly 1,000 to 4,000 tokens.
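A rough sketch of how a tokens-per-type ratio at a given sample size N could be computed (the tokenization and the file names are illustrative assumptions, not the book's code):

```python
import re

def tokens_per_type(text, N):
    # Ratio N / V(N) over the first N tokens of a text.
    tokens = re.findall(r"[a-z]+", text.lower())[:N]
    return len(tokens) / len(set(tokens))

# Hypothetical usage, comparing two texts at the same sample size:
# print(tokens_per_type(open("black_cat.txt").read(), 4000))
# print(tokens_per_type(open("hans_pfaall.txt").read(), 4000))
```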

Size of sample
In corpus linguistics, samples of equal size are taken. Each sample is smaller than the individual texts, so every text can be analyzed in the same fashion. Corpora such as the Brown Corpus use this approach.
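A hedged sketch of drawing equal-size samples from a collection of texts (the 2,000-token sample size mirrors the roughly 2,000-word samples of the Brown Corpus; the tokenization and `corpus_paths` are illustrative assumptions):

```python
import re

SAMPLE_SIZE = 2000  # the Brown Corpus, for instance, uses samples of about 2,000 words

def equal_size_sample(text, size=SAMPLE_SIZE):
    # Take the first `size` tokens so that every text contributes
    # a sample of the same length; skip texts that are too short.
    tokens = re.findall(r"[a-z]+", text.lower())
    return tokens[:size] if len(tokens) >= size else None

# Hypothetical usage over a list of file paths:
# samples = [equal_size_sample(open(p).read()) for p in corpus_paths]
```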