CRF Homework Gaining familiarity with a standard NLP toolkit, and NLP tasks
Goals for this homework
1. Learn how to use Conditional Random Fields (CRFs), a standard framework for sequence labeling in NLP
2. Apply CRFs to two standard datasets
3. Learn how to use the CRF++ package and the Stanford POS tagger
Please read the whole homework before beginning. Some steps will be really quick, but others, especially 7 and 9, may take a while.
Step 1: Install software
Download and install CRF++ on a machine that you will be able to use for a while. Lab machines will not have this, but if you don’t have access to another machine, let me know, and I will try to get it installed for you on a lab machine. The CRF++ package comes with two primary executables that you will need to use from the command line: crf_learn and crf_test. If you installed a source package, you will need to compile it to get these executables. If you installed a binary, they should have come with the download. You may also want to add these two executables to your path.
Step 1b: Install Perl, if you don’t already have it
If you’re on Linux, you probably have Perl, or you can use your package manager to get it. If you’re on Windows and don’t have Perl installed already, you can get an MSI for ActivePerl (free open source) from: If you’re on a Mac, sorry, you’re on your own … You don’t need to develop in Perl for this assignment, but one part of the software for the assignment (see next slide) is written in Perl, and needs the Perl interpreter to be installed in order to run.
Step 1c: Download conlleval.pl
Create a working directory somewhere, and download the Perl script conlleval.pl to it. The script is available at: Rename the file from “conlleval.txt” to “conlleval.pl”. This script is used only to measure how accurate the CRF’s predictions are. It requires Perl (from the previous step) to be installed on your machine.
Step 1d: Install the Stanford POS tagger
Download and unzip the Stanford POS tagger: Also, log on to the Blackboard site for the course. Under “course content”, navigate to the folder called “programming-assignment-crfs”. Download “TagAndFormatForCRF.java” from there, and place it in the same directory as the Stanford tagger jar file (the directory you just unzipped). Compile the Java file:
javac -cp stanford-postagger.jar TagAndFormatForCRF.java
You won’t use this until step 9, but don’t forget about it!
Step 2: Get the main dataset
1. Log on to the Blackboard site for this course
2. Navigate to the “course content” section, and the folder called “programming-assignment-crfs”
3. Download the zip file called “conll-2003-dataset.zip” to your working directory.
4. Unzip the archive.
This archive contains three files:
- eng.train
- eng.testa
- eng.testb
These files are already formatted for CRF++ to use. You will use the eng.train file during training (crf_learn), and the eng.testa file to measure how well your system is doing (crf_test) while you are developing your model. Finally, once your system is fully developed and debugged, you will test it on eng.testb (crf_test again).
Step 2b: Get the initial “template” file In the same place on blackboard, you should also see a file called “initial-template.txt”. Download this to your working directory. This file contains an initial set of “features” for the CRF model. You will use this as an input to crf_learn.
Step 3: Run an initial test to verify everything is working
If you have everything installed properly, then you should now be able to run crf_learn, crf_test, and conlleval.pl. From your working directory:
1. /bin/crf_learn TEMPLATE_FILE TRAIN_FILE MODEL_FILE
2. /bin/crf_test -m MODEL_FILE TEST_FILE > PREDICTIONS_FILE
3. conlleval.pl -d \t < PREDICTIONS_FILE
On a Windows machine, use crf_learn.exe instead of crf_learn, and crf_test.exe instead of crf_test. For instance, on my machine:
1. CRF \crf_learn.exe initial-template.txt eng.train eng.model (this takes a little over 2 minutes and 206 iterations on my machine)
2. CRF \crf_test.exe -m eng.model eng.testa > eng.testa.predicted
3. conlleval.pl -d \t < eng.testa.predicted
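The three commands above can also be chained from a small script so that each later experiment is a single call. This is a minimal Python sketch, not part of the assignment materials: it assumes crf_learn, crf_test, and perl are on your path and that conlleval.pl sits in the working directory. The function names build_commands and run_pipeline are made up for illustration.

```python
import subprocess


def build_commands(template, train, model, test):
    """Return argument lists for the training, tagging, and scoring steps."""
    learn = ["crf_learn", template, train, model]
    tag = ["crf_test", "-m", model, test]
    # Assumption: conlleval.pl is in the current directory and run through perl.
    score = ["perl", "conlleval.pl", "-d", r"\t"]
    return learn, tag, score


def run_pipeline(template, train, model, test, predicted):
    """Train a model, tag the test file, and return the conlleval.pl report."""
    learn, tag, score = build_commands(template, train, model, test)
    subprocess.run(learn, check=True)
    # Equivalent of "crf_test ... > predicted"
    with open(predicted, "w") as out:
        subprocess.run(tag, stdout=out, check=True)
    # Equivalent of "conlleval.pl -d \t < predicted"
    with open(predicted) as inp:
        result = subprocess.run(
            score, stdin=inp, capture_output=True, text=True, check=True
        )
    return result.stdout
```

For the initial test, the call would be run_pipeline("initial-template.txt", "eng.train", "eng.model", "eng.testa", "eng.testa.predicted"), and printing the returned string shows the evaluation report.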
Step 3: Run an initial test to verify everything is working
If everything is working properly, you should get output like this from conlleval.pl:
processed tokens with 5942 phrases; found: 5647 phrases; correct:
accuracy: 96.83%; precision: 84.70%; recall: 80.49%; FB1:
LOC: precision: 84.42%; recall: 84.05%; FB1:
MISC: precision: 91.26%; recall: 71.37%; FB1:
ORG: precision: 81.81%; recall: 74.79%; FB1:
PER: precision: 84.34%; recall: 85.67%; FB1:
The “accuracy” is misleading; ignore it. This output says that the current system found 84.05% of the true location (LOC) named entities (the “recall” for LOC), and 84.42% of the entities that the system predicted to be LOCs were in fact LOCs (the “precision” for LOC). FB1 is a kind of average (called the “harmonic mean”) of the precision and recall numbers. The FB1 scores for the 4 categories are the important numbers.
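To make the harmonic mean concrete, here is a short Python sketch that computes FB1 from a precision and recall pair, using the LOC numbers quoted above. The function name f_beta1 is chosen for illustration; conlleval.pl does its own computation internally.

```python
def f_beta1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when nothing was found
    return 2 * precision * recall / (precision + recall)


# LOC precision and recall from the sample conlleval.pl output
loc_fb1 = f_beta1(84.42, 84.05)
print(f"LOC FB1: {loc_fb1:.2f}")
```

The harmonic mean always lies between the two inputs and is pulled toward the smaller one, so a system cannot get a high FB1 by doing well on only precision or only recall.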
Step 3: Run an initial test to verify everything is working
If everything is working properly, the first few lines of eng.testa.predicted should look something like this:
-DOCSTART-      -X-  O     O       O
CRICKET         NNP  I-NP  O       O
-               :    O     O       O
LEICESTERSHIRE  NNP  I-NP  I-ORG   I-PER
TAKE            NNP  I-NP  O       I-PER
OVER            IN   I-PP  O       O
AT              NNP  I-NP  O       O
TOP             NNP  I-NP  O       O
AFTER           NNP  I-NP  O       O
INNINGS         NNP  I-NP  O       O
VICTORY         NN   I-NP  O       O
.               .    O     O       O
LONDON          NNP  I-NP  I-LOC   I-LOC
                CD   I-NP  O       O
West            NNP  I-NP  I-MISC  I-MISC
Indian          NNP  I-NP  I-MISC  I-MISC
all-rounder     NN   I-NP  O       O
Phil            NNP  I-NP  I-PER   I-PER
Simmons         NNP  I-NP  I-PER   I-PER
The last column contains the system’s predictions. The second-to-last column contains the correct answers.
Step 4: Add a feature to the template
In the directory where you placed the CRF++ download, you will find a documentation file at doc/index.html. Read the section called “Usage” (mainly focus on “Training and Test file formats” and “Preparing feature templates”). In your working directory, copy “initial-template.txt” to a new template file called “template1.txt”. Add a new unigram feature to this template file that includes the previous and current part of speech (POS) tags. A subscript of [0,0] refers to the current token, column 0 (column 0 contains words, column 1 contains POS tags in our dataset). So to add this new feature, we will insert the following line into the template file:
U15:%x[-1,1]/%x[0,1]
The U15 just says this is the 15th unigram feature; the number doesn’t really matter so long as it is different from all of the other unigram features. The [-1,1] subscript refers to the previous token and column 1, so the previous POS tag. The [0,1] subscript refers to the current token and column 1, so the current POS tag. Save template1.txt with this new line added to it.
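To see what a template line such as U15:%x[-1,1]/%x[0,1] turns into for a particular token, the following Python sketch mimics the substitution. It is an illustration only, not CRF++ code: the function name expand is made up, and the "_PAD" placeholder for positions outside the sentence is an assumption (CRF++ uses its own boundary markers).

```python
import re

# Matches macros of the form %x[row,col], where row is relative to the current token
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line, sentence, position):
    """Replace each %x[row,col] macro with the value it points at."""
    def lookup(match):
        row = position + int(match.group(1))  # absolute token index
        col = int(match.group(2))             # column index (0 = word, 1 = POS)
        if 0 <= row < len(sentence):
            return sentence[row][col]
        return "_PAD"  # assumed placeholder for out-of-sentence positions
    return MACRO.sub(lookup, template_line)


# Three tokens taken from the sample data: word in column 0, POS tag in column 1
sentence = [
    ["West", "NNP"],
    ["Indian", "NNP"],
    ["all-rounder", "NN"],
]
print(expand("U15:%x[-1,1]/%x[0,1]", sentence, 2))  # prints U15:NNP/NN
```

For the token “all-rounder”, the previous POS tag is NNP and the current one is NN, so the feature string combines the two.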
Step 4: Add a feature to the template Once you’ve created the new template file, re-run crf_learn, crf_test, and conlleval.pl. Store the results of conlleval.pl in a file called “results-template1.txt”. Notice that the results have changed. By adding more features, you’re allowing the NER system to search through more information during training, to try to find helpful patterns for detecting named-entities. But adding more features can actually hurt performance, if it tricks the CRF into learning a pattern that it thinks is a good way to find named-entities, but actually isn’t.
Step 5: Add more features to the templates
Copy “template1.txt” to a file called “template2.txt”. Add the following features to “template2.txt”:
1. Unigram feature: the POS tag for the current token and the next token.
2. Unigram feature: the POS tag for the previous token, the current token, and the next token.
3. Bigram feature: the POS tag for the current token and the next token.
Re-run training, testing on eng.testa, and evaluation. Store the results of conlleval.pl in a file called “results-template2.txt”. Note that as we add more features, not only do the accuracies change, but training also takes longer … so we can’t go too crazy with adding in features, or it will take too long.
Step 6: Add a new column to the data So far, we have been changing the predictor by giving it new template features. Another (often better) way of improving the predictor is to give it totally new information, so long as that information is relevant to the task (in this case, finding named entities). To show an example of that, in this step you will add a new column to the data (all three files, eng.train, eng.testa, and eng.testb) that indicates information about capitalization of the word.
Step 6: Add a new column to the data
In any programming language, write a script that will take one of the data files as input, and output a very similar one, but with one extra column. The new column should be second-to-last: the last column has to be the labels for the task in order for CRF++ to work properly. And if we keep the first two columns the same (word and POS tag), then the template files we used before will still work properly. In the new, second-to-last column, put a 3 if the word is all-caps, 2 if it has a capital letter anywhere besides the first letter (e.g., “eBay”), 1 if it starts with a capital letter and has no other capital letters, and 0 if it has no capital letters. Apply this script to eng.train to produce cap.train. Do the same for the test files to get cap.testa and cap.testb.
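One possible shape for such a script is sketched below in Python. It is only a starting point under stated assumptions: the function names cap_code and add_cap_column are made up, columns are split on any whitespace and re-joined with tabs, and a single capital letter such as “A” is treated as all-caps (the rules above leave that case open).

```python
def cap_code(word: str) -> int:
    """Map a word to the capitalization code described above."""
    if word.isupper():
        return 3  # all-caps (assumption: also covers one-letter words like "A")
    if any(ch.isupper() for ch in word[1:]):
        return 2  # capital letter somewhere after the first character
    if word[:1].isupper():
        return 1  # only the first letter is capitalized
    return 0      # no capital letters


def add_cap_column(line: str) -> str:
    """Insert the capitalization code as the second-to-last column."""
    line = line.rstrip("\n")
    if not line.strip():
        return line  # keep blank lines, which separate sentences
    cols = line.split()
    cols.insert(len(cols) - 1, str(cap_code(cols[0])))
    return "\t".join(cols)
```

Applying add_cap_column to every line of eng.train and writing the results to cap.train (and likewise for the two test files) would complete the step.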
Step 6: Add a new column to the data
Copy “template2.txt” to “template3.txt”. Add the following features to “template3.txt”:
1. Unigram feature: the current capitalization
2. Unigram feature: the previous capitalization and the current capitalization
3. Bigram feature: the previous capitalization and the current capitalization
Re-run training, testing, and evaluation. Store the results of conlleval.pl in a file called “results-template3.txt”.
Step 7: See how good you can get
Copy “template3.txt” to a file called “template4.txt”. Add to, remove from, and edit the template file however you think will work best. You may edit as much or as little as you like, but you must make at least one change from previous templates. You may also add or modify the columns (except for the labels) in the data as much as you want. You may use new columns to incorporate external data (e.g., a column that indicates whether a token matches an external database of city names), or just functions of the existing data. Note: you may not add any new columns that are functions of the labels! Save your new data files as eng.mod.train, eng.mod.testa, and eng.mod.testb. Save your results from conlleval.pl in a file called “results-template4.txt”. The TA will award the student with the best results 4 points of extra credit. Second-best will get 2 points of extra credit.
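As an illustration of the external-data idea, the Python sketch below adds a column that says whether a token appears in a list of city names. Everything here is hypothetical: the tiny CITY_NAMES set stands in for a real external database, and the function names are made up. Note that the new column depends only on the word in column 0, never on the label.

```python
# Hypothetical stand-in for an external database of city names
CITY_NAMES = {"london", "paris", "tokyo"}


def in_gazetteer(word, names=CITY_NAMES):
    """Return "1" if the word (ignoring case) is a known city name, else "0"."""
    return "1" if word.lower() in names else "0"


def add_gazetteer_column(lines, names=CITY_NAMES):
    """Insert the city-name flag as the second-to-last column of each token line."""
    out = []
    for line in lines:
        cols = line.split()
        if not cols:
            out.append("")  # keep blank lines, which separate sentences
            continue
        flag = in_gazetteer(cols[0], names)
        out.append("\t".join(cols[:-1] + [flag, cols[-1]]))
    return out
```

A template feature that refers to the new column index can then let the CRF learn how strongly a city-list match predicts a LOC label.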
Step 8: Final evaluation With any system that involves learning, it is a good idea to keep a final dataset that is completely separate from the rest of your data, for the final evaluation. So far, we’ve been using eng.testa for all of our development, debugging, and testing. Now, run crf_test and conlleval.pl using eng.mod.testb rather than eng.mod.testa, to produce the final evaluation. Store the results in “results-template4-eng.testb.txt”. Note that the results for final evaluation are typically a bit worse than the results on eng.testa. This is expected. That’s because we essentially designed the model to be good on eng.testa, but what works for eng.testa won’t necessarily generalize to eng.testb.
Step 9: Evaluating on new domains Your NER system can now detect named entities in text. You can apply it to any text, and automatically determine where all the names are. However, it was trained on news text. One thing people have noticed is that systems trained on one kind of text tend not to be quite as accurate on other types of text. For your final task, you will collect some sentences from non-news (or news-like) documents, and see how well your NER system works on them.
Step 9: Evaluating on new domains
Create a file called “new-domain.txt”. Collect a set of approximately 50 English sentences. You may collect these from any public source (e.g., the Web) that is not a news outlet or provider of news coverage. For instance, blogs or magazine articles work great, but really any non-news source that mentions named entities will do. The less news-like the better, and the more named entities, the better. I do not recommend Twitter for this, since tweets are so different from the English found in news articles that they will be basically impossible for your system to process. Save your 50 sentences to your file new-domain.txt. You do not need to format this file in any special way (i.e., you don’t need to split sentences onto different lines, or words onto different lines). Just make sure that there aren’t any really unusual characters in the text.
Step 9: Evaluating on new domains
Next, you will need to format and add POS tags to this file. We will use the Stanford POS tagger to do this. Run the Java program that you compiled in step 1d. Change to the directory of TagAndFormatForCRF.java, then run:
java -cp .;stanford-postagger.jar TagAndFormatForCRF /new-domain.txt > /new-domain.tagged.txt
On Linux or a Mac, separate the classpath entries with : instead of ;. Note the output redirection (>) after new-domain.txt; it’s easy to miss.
Step 9: Evaluating on new domains
Next, you will need to add NER tags to your file. Open new-domain.tagged.txt (from the previous step) in a text editor. For each line, add one of these labels:
- I-PER (if it’s a person’s name)
- I-LOC (if it’s the name of a location, like a country or river)
- I-ORG (if it’s the name of an organization, like “United Nations” or “Red Cross of America”)
- I-MISC (if it’s the name of something besides a person, location, or organization)
- O (if it’s not a name at all)
Make sure to separate the labels from the POS tags with a tab character. If you’re unsure about how to label any of your sentences, you can check the annotation guidelines at:
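Hand labeling is error-prone, so a quick check of the labeled file can save a confusing crf_test run later. The Python sketch below reports lines whose last tab-separated field is not one of the five labels. It assumes each token line holds at least a word, a POS tag, and a label; the actual output format of TagAndFormatForCRF is not shown here, so adjust the minimum column count if yours differs. The function name find_bad_lines is made up.

```python
VALID_LABELS = {"I-PER", "I-LOC", "I-ORG", "I-MISC", "O"}


def find_bad_lines(lines):
    """Return 1-based numbers of token lines with a missing or invalid label."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped.strip():
            continue  # blank lines separate sentences and need no label
        fields = stripped.split("\t")
        # Assumption: word, POS tag, and label are all present and tab-separated
        if len(fields) < 3 or fields[-1] not in VALID_LABELS:
            bad.append(number)
    return bad
```

Reading new-domain.tagged.txt with open(...) and passing the file object to find_bad_lines gives the list of line numbers to revisit in the text editor.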
Step 9: Evaluating on new domains
Finally, run your scripts for adding new columns to this dataset, then run crf_test and conlleval.pl. Save the output of crf_test in a file called “new-domain.predicted.txt”. Save the output of conlleval.pl in a file called “results-new-domain.txt”.
To turn in
You should turn in a single zip archive called.zip. It should contain:
- template1.txt and results-template1.txt
- template2.txt and results-template2.txt
- template3.txt, cap.train, cap.testa, cap.testb, and results-template3.txt
- template4.txt, eng.mod.train, eng.mod.testa, eng.mod.testb, and results-template4.txt
- results-template4-eng.testb.txt
- The fully annotated version of new-domain.tagged.txt (after adding labels and all of your new columns, including capitalization)
- new-domain.predicted.txt and results-new-domain.txt
or otherwise transfer this zip file to your TA.