Alex Meng Chunshi Jin Elliott Conant Jonathan Fung.

1 Alex Meng Chunshi Jin Elliott Conant Jonathan Fung

2
• What is Over9k?
• Architecture
• Crawler
• Postprocessor
• Extractor
• Web Service
• Summary

3
• Original goal: a system to predict a stock's future volatility based on news and information gathered from the Internet.
• Current goal: a system that crawls news sites for articles, identifies which companies are affected, and extracts events from the articles. All information is stored in a database that is accessed through our web service.

4

5
• Web crawler: Nutch
• Domains we crawl:
  ◦ www.cnbc.com
  ◦ www.reuters.com
  ◦ www.marketwatch.com
  ◦ … (6 total)
• Nutch's successes
• Nutch's failures

6
• Components:
  ◦ NBClassifier: classifies articles using Naive Bayes
  ◦ DateParser: parses dates using regular expressions
  ◦ PageGetter: retrieves training data from RSS feeds
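The slides do not show how NBClassifier or DateParser were implemented. The following is only a minimal sketch of the two techniques they name, assuming a multinomial Naive Bayes model with add-one smoothing and a single "Month D, YYYY" date pattern. The class and function names, the labels "finance" and "other", and the training sentences are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from datetime import date


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents seen per label
        self.word_counts = defaultdict(Counter)   # word frequencies per label
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# Assumed pattern: "March 5, 2010" or "Mar. 5, 2010". Real sites vary widely.
DATE_RE = re.compile(r"\b([A-Za-z]{3})[a-z]*\.?\s+(\d{1,2}),\s*(\d{4})\b")


def parse_date(text):
    """Return the first valid 'Month D, YYYY' date found in text, else None."""
    for match in DATE_RE.finditer(text):
        month = MONTHS.get(match.group(1).lower())
        if month is None:
            continue
        try:
            return date(int(match.group(3)), month, int(match.group(2)))
        except ValueError:   # e.g. "Feb 31, 2010"
            continue
    return None


nb = NaiveBayes()
nb.train("shares fell after earnings missed estimates", "finance")
nb.train("stock price rises on strong quarterly profit", "finance")
nb.train("team wins championship game in overtime", "other")
nb.train("new movie opens to strong reviews", "other")
```

In a setup like the one on the slide, the training texts would come from RSS feeds (the PageGetter role) rather than from hard-coded strings.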

7
• Tried several systems for information extraction (IE):
  ◦ GATE
  ◦ OpenCalais
  ◦ CRF++

8
• OpenCalais:
  ◦ Web service. Easy to use.
  ◦ Not extensible. No machine learning process.
  ◦ Has usage quotas.
• GATE:
  ◦ ANNIE (A Nearly-New Information Extraction system): tokenizer, sentence splitter, POS tagger, gazetteer, named-entity (NE) recognition
  ◦ JAPE: GATE's rule engine.
  ◦ Extensible with JAPE. Easy to use because of its regex-like syntax. Behavior is almost deterministic.
  ◦ High precision on defined patterns; low recall when sentences do not match any defined pattern.

9
• CRF++
  ◦ Needs tools to preprocess content:
    - HTML to text
    - POS tagging and NE tagging (Stanford NLP library)
    - Extract other features when necessary
    - Convert the file to the train/test format required by CRF++
  ◦ A template file defines the dependencies between features and labels.
  ◦ Needs a large training set.
  ◦ Labeling the training set is laborious.
  ◦ Fairly good precision/recall. "Intelligence" may emerge.
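The conversion step above can be sketched as follows. CRF++ expects one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The function name, the choice of a POS column as the only extra feature, and the BIO-style labels are assumptions made for illustration; the slides do not say which features or labels the project used.

```python
def to_crfpp(sentences):
    """Render tagged sentences in the CRF++ train/test file format.

    sentences: list of sentences, each a list of (token, pos, label) tuples.
    Tokens must not contain whitespace, since CRF++ splits columns on it.
    """
    lines = []
    for sentence in sentences:
        for token, pos, label in sentence:
            lines.append(f"{token}\t{pos}\t{label}\n")
        lines.append("\n")  # a blank line marks the end of a sentence
    return "".join(lines)


# A small feature template in CRF++ syntax: %x[row, col] refers to the value
# at a relative row offset and an absolute column (0 = token, 1 = POS here).
TEMPLATE = """\
# Unigram features
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram feature over adjacent output labels
B
"""

# Hypothetical example sentence with BIO labels.
example = [[("Acme", "NNP", "B-ORG"), ("shares", "NNS", "O"), ("fell", "VBD", "O")]]
training_text = to_crfpp(example)
```

With files written this way, training and tagging are typically run as `crf_learn template train.data model` and `crf_test -m model test.data`; the file names here are placeholders.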

10
• Technologies used:
  ◦ YUI Toolkit
  ◦ PHP
  ◦ Apache
  ◦ CSS
  ◦ JavaScript
• Layout description

11
• A realistic goal is critical.
• The right tools are important.
• Communication is key.
• Future improvements:
  ◦ Controlled crawling
  ◦ Improve feature extraction quality (POS tagger, NE, etc.)
  ◦ Develop a model to predict volatility

12 Q&A Thanks!

