LABELING TURKISH NEWS STORIES WITH CRF. Prof. Dr. Eşref Adalı, ISTANBUL TECHNICAL UNIVERSITY, COMPUTER ENGINEERING
PURPOSE of STUDY: As the internet grows dramatically, the number of electronic text documents increases considerably, and information extraction becomes correspondingly more important. This study introduces an approach to information extraction that extracts the main subject, main predicate, main location and main date of a text document and labels the document with them for use in semantic web applications.
PURPOSE of STUDY:
LABEL: MEANING
SUBJECT: The most important person, place, thing, or idea in the document
PREDICATE: Actual doing or being of the main subject
LOCATION: Location of the main predicate
DATE: Date of the main predicate
PURPOSE of STUDY: The most pronounced difference between key phrase extraction studies and this labeling study is that the labeling study extracts the most significant phrases together with their functions in the document. The extracted labels give the reader an idea of the document's main topic at a glance.
SCOPE of STUDY: The documents inspected are written in Turkish. They are gathered from news distributors. Each document contains 50-300 words.
LABELING by ANNOTATORS: The data set is composed of 1000 raw news stories gathered from the RSS feeds of Turkish news distributors and then labeled by annotators. The manually labeled documents are used for the training and test phases of the CRF model.
Manual Labeling Process:
1. Capturing RSS feeds from news distributors
2. Arranging the captured news in XML format
3. Reading of the news by human annotators and manual labeling
FIRST STEP of STUDY: Because Turkish is an agglutinative language, the input file is converted to a file that includes the stems, the inflectional suffixes and the parser results of the raw news stories. This conversion uses:
Morphological analyzer
Morphological disambiguator
Dependency parser
Morphological Analyzer: Each word in a raw news story is morphologically analyzed using Oflazer's morphological analyzer. The analyzer outputs one or more possible analyses for each word.
MORPHOLOGICAL DISAMBIGUATOR: The most probable analysis must be selected from the output of the morphological analyzer. The morphological disambiguator developed by Sak et al. is used for this. At this point the roots or stems are available.
DEPENDENCY PARSER: The dependency parser determines the attribute (the syntactic role) of each word in a sentence. A multilingual dependency parser is used for this purpose.
CONSTRUCTING THE MODEL: We first developed a rule-based model using the features provided by the morphological analyzer, the disambiguator and the dependency parser. Because its success rates were not high enough for practical use, we developed a new model with machine learning techniques. In our case a label may consist of a single word but generally spans more than one word, so we treat the problem as a sequence classification problem.
CONSTRUCTING THE MODEL: Each word in the document belongs to one of five classes: subject, predicate, location, date, or none of them.
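As an illustration of the per-word classes, the following minimal sketch assigns one class to every token of a short sentence. The function name `label_tokens`, the span format and the uppercase class names are assumptions made for this example; the slides do not say how the annotations are encoded.

```python
# The five classes named on the slide; the uppercase spelling is an assumption.
LABELS = ("SUBJECT", "PREDICATE", "LOCATION", "DATE", "NONE")


def label_tokens(tokens, spans):
    """Assign one class to every token; tokens outside every span get NONE.

    spans maps a label to a (start, end) token range, with end exclusive.
    This span format is hypothetical, chosen only for the illustration.
    """
    classes = ["NONE"] * len(tokens)
    for label, (start, end) in spans.items():
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        for i in range(start, end):
            classes[i] = label
    return list(zip(tokens, classes))


# "Mustafa Kemal Atatürk went to Ankara."
tokens = ["Mustafa", "Kemal", "Atatürk", "Ankara'ya", "gitti", "."]
spans = {"SUBJECT": (0, 3), "LOCATION": (3, 4), "PREDICATE": (4, 5)}
labeled = label_tokens(tokens, spans)
for token, cls in labeled:
    print(token, cls)
```

A multi-word label such as the subject here becomes a run of consecutive tokens with the same class, which is why the task fits a sequence model.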
Rule-based features: Because the experimental data set consists of news stories, the main subject of a text is assumed to be a proper noun phrase. This assumption was reached after inspecting all manually annotated subject labels. To obtain proper noun phrases in Turkish, all words that start with a capital letter are gathered first. However, a capital letter alone is not a reliable signal, because other words may also start with a capital letter, such as the first word of a sentence, titles, and month or day names in dates.
Rule-based features:
Rule 1: If a word is the first word of a sentence and it is a proper name, it is a possible candidate for a proper noun phrase.
Rule 2: If a word starts with a capital letter and is not the first word of a sentence, select it as a possible candidate for a proper noun phrase.
Rule 3: If a conjunction stands between two possible candidates for proper noun phrases, select the conjunction as well.
These rules alone are not enough to divide the selected words into separate proper noun phrases. For instance, "Mustafa Kemal Atatürk Ankara'ya gitti." is a sample Turkish sentence. In this sentence "Mustafa Kemal Atatürk" and "Ankara" are two different proper noun phrases. However, the rules above select "Mustafa Kemal Atatürk Ankara'ya" as a single proper noun phrase. So new boundary rules are defined.
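Rules 1 to 3 can be sketched roughly as follows, to show how the sample sentence ends up as one merged candidate. The conjunction list and the `KNOWN_PROPER_NAMES` set are hypothetical stand-ins: in the study the proper-name decision would presumably come from the morphological tools, which are not reproduced here.

```python
# Hypothetical, hand-written stand-ins for resources the slides do not specify.
CONJUNCTIONS = {"ve", "ile"}      # Turkish "and", "with"
KNOWN_PROPER_NAMES = {"Mustafa"}  # stands in for the analyzer's proper-name tag


def is_capitalized(word):
    return word[:1].isupper()


def candidate_flags(sentence):
    """Apply Rules 1-3 to one tokenized sentence; return one bool per token."""
    flags = []
    for i, word in enumerate(sentence):
        if i == 0:
            # Rule 1: a sentence-initial word counts only if it is a proper name.
            flags.append(word in KNOWN_PROPER_NAMES)
        else:
            # Rule 2: any later word that starts with a capital letter.
            flags.append(is_capitalized(word))
    # Rule 3: a conjunction between two candidates is selected as well.
    for i in range(1, len(sentence) - 1):
        if sentence[i].lower() in CONJUNCTIONS and flags[i - 1] and flags[i + 1]:
            flags[i] = True
    return flags


def candidate_phrases(sentence):
    """Group consecutive selected tokens into candidate phrases."""
    phrases, current = [], []
    for word, flag in zip(sentence, candidate_flags(sentence)):
        if flag:
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


sentence = ["Mustafa", "Kemal", "Atatürk", "Ankara'ya", "gitti", "."]
print(candidate_phrases(sentence))
```

With these rules alone the four capitalized words form a single candidate, which is the over-merging that motivates the boundary rules.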
Rule-based features:
Boundary Rule 1: If a possible candidate for a proper noun phrase ends with a punctuation mark such as a quotation mark or a comma, this word is the last word of the proper noun phrase.
Boundary Rule 2: If a possible candidate carries the suffix P3sg, this word is the last word of the proper noun phrase.
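The two boundary rules can be sketched as a pass over one run of candidate words. This is only an illustration: the `(surface_form, morph_tags)` input format is an assumption, the tags in the usage example are assigned by hand rather than produced by a real disambiguator, and the punctuation set goes beyond the two marks the slide names.

```python
import string

# Trailing punctuation that closes a phrase (Boundary Rule 1). The slide names
# quotation marks and commas; the rest of this set is an assumption.
CLOSING_PUNCTUATION = set(string.punctuation) | {"”", "’"}
_STRIP_CHARS = "".join(CLOSING_PUNCTUATION)


def split_candidate(tokens):
    """Split one run of candidate tokens into proper noun phrases.

    tokens is a list of (surface_form, morph_tags) pairs, where morph_tags is
    the set of tags a morphological disambiguator is assumed to provide.
    """
    phrases, current = [], []
    for surface, tags in tokens:
        ends_with_punct = surface[-1:] in CLOSING_PUNCTUATION  # Boundary Rule 1
        has_p3sg = "P3sg" in tags                              # Boundary Rule 2
        current.append(surface.rstrip(_STRIP_CHARS))
        if ends_with_punct or has_p3sg:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-assigned tags, for illustration only.
run = [("İstanbul", set()), ("Teknik", set()),
       ("Üniversitesi", {"P3sg"}), ("Ankara", set())]
print(split_candidate(run))
print(split_candidate([("Atatürk,", set()), ("İnönü", set())]))
```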
Other Features:
Morphological features: Outputs of the morphological disambiguator.
Syntactic features: Output of the dependency parser.
Structural features:
The sequence number of the document in the data set, which identifies the document a word belongs to.
The sequence number of the sentence in the document, which distinguishes sentences.
The term frequency of the word in the document.
The sequence number of the sentence in which the word is first observed in the document.
Whether the first letter of the word is a capital letter.
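The structural features listed above could be computed along these lines. The function and key names are hypothetical, and counting words by their lowercased surface form is an assumption; the study may well count stems instead.

```python
from collections import Counter


def structural_features(doc_no, sentences):
    """Build one feature dict per token for a document of tokenized sentences."""
    # Term frequency over lowercased surface forms (an assumption, see above).
    term_freq = Counter(w.lower() for s in sentences for w in s)

    # Sentence number in which each word is first observed.
    first_seen = {}
    for sent_no, sentence in enumerate(sentences):
        for w in sentence:
            first_seen.setdefault(w.lower(), sent_no)

    features = []
    for sent_no, sentence in enumerate(sentences):
        for w in sentence:
            key = w.lower()
            features.append({
                "doc_no": doc_no,
                "sent_no": sent_no,
                "term_freq": term_freq[key],
                "first_sent_no": first_seen[key],
                "is_capitalized": w[:1].isupper(),
            })
    return features


doc = [
    ["Mustafa", "Kemal", "Atatürk", "Ankara'ya", "gitti", "."],
    ["Atatürk", "geldi", "."],
]
feats = structural_features(doc_no=0, sentences=doc)
print(feats[6])
```

In a full pipeline these dictionaries would be merged with the morphological and syntactic features of the same token before being passed to the CRF.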
TRAINING CRF SYSTEM: The manually annotated documents are used together with the features of each word. 950 news stories are used as the training set, and the CRF model is generated as shown in Figure 1.
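For reference, a linear-chain CRF in its standard textbook form defines the probability of a label sequence y given the word sequence x as below. The slides do not state which CRF variant, toolkit or feature functions were used, so the symbols here (feature functions f_k, weights λ_k, sequence length T) are generic notation rather than the study's own.

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Training estimates the weights λ_k from the annotated documents, with each y_t ranging over the five word classes.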
TESTING CRF SYSTEM: To measure the success of the system, the remaining 50 manually annotated documents are evaluated with the generated CRF model.
EVALUATION: In this study the main concerns are precision and recall: how many of the suggested labels are correct (precision), and how many of the manually assigned labels are found (recall).
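The two measures can be computed as in the sketch below. Treating a suggested label as correct only when both its type and its phrase exactly match a manual annotation is an assumption; the slides do not state the matching criterion. The example sets are made up for illustration and are not results from the study.

```python
def precision_recall(predicted, gold):
    """Compare predicted and gold labels, each a set of (label, phrase) pairs.

    A prediction counts as correct only on an exact match (an assumption).
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only.
gold = {
    ("SUBJECT", "Mustafa Kemal Atatürk"),
    ("LOCATION", "Ankara"),
    ("PREDICATE", "gitti"),
}
predicted = {
    ("SUBJECT", "Mustafa Kemal Atatürk"),
    ("PREDICATE", "gitti"),
    ("LOCATION", "Atatürk"),
    ("DATE", "gitti"),
}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four suggested labels are correct, and two of the three manual labels are found.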
Conclusion: Factors that affect the success rate:
Human annotators are not 100% reliable; humans make mistakes.
Spell checking is needed, because spelling errors also affect the results of the morphological analyzer.
Errors of the morphological analyzer
Errors of the morphological disambiguator
Errors of the dependency parser
Size and scope of the training set