Core Information Processing Technologies. Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute), Achim Klein (University of Hohenheim).


1 Core Information Processing Technologies: Technical Presentation & Demos. Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute), Achim Klein (University of Hohenheim). Luxembourg, November 2011

2 Technical WPs (FIRST Y1 Review Meeting, Luxembourg, Nov 2011). [Work-package diagram.] Components: Architecture, Integration & Scaling Strategy; Management; Dissemination & Exploitation; Data Acquisition; Ontology Infrastructure; Information Extraction; Sentiment Analysis; Decision Support Infrastructure; Domain-independent GUI (Open Source); Information Integration; Data, Information & Knowledge Base; UC#1 Market Surveillance; UC#2 Reputational Risk Management; UC#3 Online Retail Brokerage. Work-package labels shown: WP1 & WP8, WP2 & WP7, WP3, WP4, WP5, WP6, WP9, WP10. We are here: Data Acquisition.

3 Data acquisition pipeline (Dacq). [Pipeline diagram.] One RSS reader per site; load balancing distributes documents across parallel processing pipelines. Each pipeline: Boilerplate remover → Language detector → Duplicate detector → Natural language preproc. → Semantic annotator → ZeroMQ emitter.
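The slide's structure (readers feeding several identical pipelines of chained stages) can be sketched as follows. This is only an illustration under assumptions: the `Document` type, the `Stage` signature, the convention that a stage returns `None` to drop a document, and the round-robin balancing policy are all hypothetical and not taken from the Dacq implementation. The ZeroMQ emitter is left out so the sketch needs only the standard library.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Sequence


@dataclass
class Document:
    """Hypothetical document record passed between stages."""
    url: str
    text: str
    annotations: dict = field(default_factory=dict)


# Assumption: a stage returns the (possibly modified) document,
# or None to drop it (e.g. a detected duplicate).
Stage = Callable[[Document], Optional[Document]]


def run_pipeline(doc: Document, stages: Sequence[Stage]) -> Optional[Document]:
    """Pass one document through the stages in order; stop if a stage drops it."""
    current: Optional[Document] = doc
    for stage in stages:
        current = stage(current)
        if current is None:
            return None
    return current


def dispatch(docs: Iterable[Document],
             pipelines: Sequence[Sequence[Stage]]) -> List[Document]:
    """Round-robin load balancing across identical pipelines (assumed policy)."""
    results = []
    for i, doc in enumerate(docs):
        processed = run_pipeline(doc, pipelines[i % len(pipelines)])
        if processed is not None:
            results.append(processed)
    return results
```

In a real deployment each pipeline would run in its own thread or process; the sequential loop here only shows how documents are assigned to pipelines and how a stage can filter a document out.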

4 Data acquisition pipeline (Dacq). Demo video (3:20)

5 Data acquisition pipeline. [Pipeline diagram as on slide 3.]

6 Boilerplate removal. Demo video (1:30)

7 Data acquisition pipeline. [Pipeline diagram as on slide 3.]

8 Language detection. Motivation: language-specific text analysis components. A relatively simple problem. Solutions are based on word or character sequences (language models). Side effects: removes “garbage” and can be used to identify the code page. Our implementation is based on frequencies of character sequences. Demo video (0:45)
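A minimal sketch of language detection from character-sequence frequencies is given below. It is not the project's implementation: the trigram length, the cosine similarity measure, the `min_score` cut-off and the tiny training sentences are all assumptions chosen to keep the example self-contained. Text that matches no profile is rejected, which mirrors the "removes garbage" side effect mentioned on the slide.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count character n-grams of whitespace-normalised, lower-cased text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def detect_language(text, profiles, min_score=0.05):
    """Return the best-matching language, or None if nothing matches well enough."""
    query = ngram_profile(text)
    best_lang, best_score = None, 0.0
    for lang, profile in profiles.items():
        score = cosine(query, profile)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang if best_score >= min_score else None


# Toy training data (assumption); real profiles need far larger corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the market is rising this week",
    "de": "der schnelle braune fuchs springt über den faulen hund und der markt steigt diese woche",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}
```

For example, `detect_language("the stock market is falling", PROFILES)` selects the English profile, while a string of digits and symbols shares no trigrams with either profile and yields `None`.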

9 Data acquisition pipeline. [Pipeline diagram as on slide 3.]

10 Near-duplicate detection. Why is this a difficult problem? We are dealing with millions of documents, so we cannot afford to compare every document with every other document. We are also looking for near-duplicates, not only exact matches. Overlooked boilerplate “produces” false near-duplicates. Demo video (1:00)

11 Near-duplicate detection. Existing approaches include SimHash, shingling and sketching, SpotSigs… Apart from SpotSigs, they require “clean” documents. The similarity value is hard to interpret (how many characters, words, sentences?). We are developing a novel solution that removes boilerplate and detects duplicates [with clear interpretation] in the same framework.
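To make the existing approaches concrete, here is a small sketch of word shingling with Jaccard similarity. It illustrates one of the baseline techniques named on the slide, not the novel solution under development; the shingle length `k=3` and the `threshold=0.5` are arbitrary assumptions.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lower-cased)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts contribute a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.5, k=3):
    """Flag two texts as near-duplicates if their shingle overlap is high."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Two nine-word sentences that differ only in the last word share 6 of 8 distinct shingles, giving a similarity of 0.75. That number is a ratio of shingle sets rather than a count of changed words or sentences, which is the interpretability issue the slide points out. Comparing all pairs this way is also quadratic in the number of documents, which is why sketching techniques are used at scale.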

12 Technical WPs. [Work-package diagram as on slide 2.] We are here: Ontology Infrastructure, Information Extraction.

13 FIRST ontology. Class hierarchy:
SentimentObject
  FinancialInstrument
    Index
      Stock_Index
    Stock
  Company
  Country
Instances: seed indices, constituents (stocks), companies, countries.

14 FIRST ontology. Example instances:
:NASDAQ_100 a :Stock_Index ; rdfs:label "NASDAQ-100" .
:MICROSOFT a :Stock ; rdfs:label "MICROSOFT CORP COM USD " ; :memberOf :NASDAQ_100 .
:MICROSOFT_CORP a :Company ; rdfs:label "Microsoft Corp." ; :issues :MICROSOFT .
:USA a :Country ; rdfs:label "USA" .
:MICROSOFT_CORP :locatedIn :USA .
:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .
:MICROSOFT_CORP_Gazetteer :hasTerm "Microsoft Corp" ; :hasTerm "Microsoft Corporation" ; :hasStopWord "CORP" ; :hasStopWord "CORPORATION" ; a :Gazetteer .
Example text (Microsoft Corp): Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.
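The gazetteer entry above suggests how text can be linked to ontology instances. The sketch below shows one plausible reading, which the slide does not confirm: stop words are removed from each term before matching, so a bare "Microsoft" also maps to `MICROSOFT_CORP`, while a stop word on its own never triggers a match. The dictionary layout and function names are hypothetical.

```python
import re

# Hypothetical in-memory form of the gazetteer triples shown on the slide.
GAZETTEER = {
    "MICROSOFT_CORP": {
        "terms": ["Microsoft Corp", "Microsoft Corporation"],
        "stop_words": {"CORP", "CORPORATION"},
    },
}


def tokenize(text):
    """Split text into word tokens."""
    return re.findall(r"\w+", text)


def core_tokens(term, stop_words):
    """Upper-cased tokens of a term with stop words removed (assumed semantics)."""
    return tuple(t.upper() for t in tokenize(term) if t.upper() not in stop_words)


def annotate(text, gazetteer):
    """Return (instance, token_index) pairs for every gazetteer match in the text."""
    tokens = [t.upper() for t in tokenize(text)]
    hits = []
    for instance, entry in gazetteer.items():
        patterns = {core_tokens(term, entry["stop_words"]) for term in entry["terms"]}
        patterns.discard(())  # a term made only of stop words matches nothing
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == pattern:
                    hits.append((instance, i))
    return sorted(hits, key=lambda hit: hit[1])
```

Under this reading, both "Microsoft Corporation is engaged in …" and "Microsoft also designs …" are annotated with `MICROSOFT_CORP`, whereas a sentence containing only "corp" is not.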

15 FIRST ontology. [Ontology diagram.] Classes: Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, Volatility, Price, Reputation, Correlation Definition, Orientation Phrase. Properties: objectHasCorrelationDefinition, featureHasCorrelationDefinition, indicatorHasCorrelationDefinition, correlationDefinitionInfluencesObject, correlationDefinitionInfluencesFeature, correlationDefinitionInfluencesIndicator.

16 Annotation pipeline. [Pipeline diagram as on slide 3.] Ontology-based semantic annotation. Demo video (3:00)

17 Technical WPs. [Work-package diagram as on slide 2.] We are here: Sentiment Analysis.

18 Sentiment Analysis. Object: sentiment in financial web texts. Problem: classification of sentiment orientation with respect to the expected future price change of financial instruments, volatility change of financial instruments, and reputation change of companies. Approach: knowledge-based sentiment classification, starting at the sentence level and specific to features of objects (e.g., reputation of a company).

19 Example. Ambiguity: “The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007]. Challenges: identification and differentiation of objects (and features); relationships of indicators (e.g., earnings) and objects.
Short term: uptrend. Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.
Long term: bear market. The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of

20 Manual Sentiment Annotation. [Figure; visible label: Topic]

21 FIRST Knowledge-based Sentiment Analysis Approach. Example input: Support for the SPX remains at 848 and then 789, with resistance at 912 …
1. Identify all sentiment objects and features (rules, ontology).
2. Extract all sentiments in one document (rules, ontology).
3. Classify sentiment orientation ∈ {positive, negative} for all sentence-level sentiments.
4. Aggregate: all sentence-level sentiments → scoring → document-level sentiment score; all sentiment scores for a given day → averaging → Sentiment Index ∈ [-1,1].
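The aggregation step can be illustrated with a short sketch. The slide does not give the scoring formula, so the document score used here, (positive − negative) / (positive + negative), is an assumption chosen because it stays within [-1, 1]; the function names and the (day, labels) input format are likewise hypothetical.

```python
from collections import defaultdict
from statistics import mean


def document_score(sentence_labels):
    """Score one document from its sentence-level labels.

    Assumed formula: (positive - negative) / (positive + negative),
    which lies in [-1, 1]; documents without sentiment score 0.0.
    """
    pos = sentence_labels.count("positive")
    neg = sentence_labels.count("negative")
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def sentiment_index(documents):
    """Average the document scores per day.

    `documents` is an iterable of (day, sentence_labels) pairs.
    Returns a dict mapping each day to an index in [-1, 1].
    """
    scores_by_day = defaultdict(list)
    for day, labels in documents:
        scores_by_day[day].append(document_score(labels))
    return {day: mean(scores) for day, scores in scores_by_day.items()}
```

With two documents on the same day, one scoring -1/3 and one scoring -1, the daily index is their average, -2/3. Whether documents without any sentiment should count towards the average is a design choice the slide leaves open; this sketch includes them with a score of 0.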

22 Sentence-level Sentiment Classification. a) Directly. Example: “I expect the S&P 500 to rise” → positive sentiment → addressed by rules. b) Indirectly, via an indicator. Example: “I think U.S. interest rates will rise” → negative sentiment → addressed by ontology.
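The two cases on this slide can be mimicked with a toy classifier. Everything in it is an assumption made for illustration: the word lists stand in for the rules, and the `INDICATOR_INFLUENCE` table stands in for the ontology's correlation definitions (here, rising interest rates are taken to be negative for prices). The project's actual rules and ontology are far richer.

```python
import re

# Stand-ins for rule vocabulary (assumption).
RISE_WORDS = {"rise", "rises", "rising", "increase", "climb"}
FALL_WORDS = {"fall", "falls", "falling", "decline", "drop"}

# Sentiment objects the classifier knows about (assumption).
OBJECTS = {"s&p 500"}

# Stand-in for ontology knowledge: how a rise in an indicator
# influences price (+1 positive, -1 negative). Hypothetical values.
INDICATOR_INFLUENCE = {"interest rates": -1}


def direction(sentence):
    """Return +1 for an expected rise, -1 for a fall, 0 if neither is stated."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    if words & RISE_WORDS:
        return 1
    if words & FALL_WORDS:
        return -1
    return 0


def classify(sentence):
    """Classify a sentence as 'positive', 'negative', or None (no sentiment found)."""
    text = sentence.lower()
    move = direction(text)
    if move == 0:
        return None
    if any(obj in text for obj in OBJECTS):
        polarity = move  # a) direct statement about the object
    else:
        influences = [sign for name, sign in INDICATOR_INFLUENCE.items() if name in text]
        if not influences:
            return None
        polarity = move * influences[0]  # b) indirect, via an indicator
    return "positive" if polarity > 0 else "negative"
```

Applied to the slide's examples, "I expect the S&P 500 to rise" is classified directly as positive, and "I think U.S. interest rates will rise" becomes negative through the indicator table.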

23 Example Text: S&P 500. Oct 4, 2011 – 3:24 PM ET. The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis. Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high. Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.

24 Sentiment Sentences on Price Change of S&P 500. Negative sentiment about the future price change of the S&P 500.

25 Sentiment Sentences on Volatility Change of S&P 500. Positive sentiment about the future volatility change of the S&P 500.

26 Document-level Sentiment with respect to multiple objects/features

27 Initial Experiment Results. Accuracy of knowledge-based sentiment classification vs. standard machine learning methods: small manually classified corpus; result: 7% more accurate. Portfolio selection experiment: use sentiment to select Dow Jones stocks; result: excess returns seem possible. More information in the paper “Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach”, published in the IEEE CEC 2011 conference proceedings.

28 Main Y1 Achievements. Data acquisition software running. Sentiment analysis for web texts: sentence-level and specific to features of objects; initial experiment results are promising. Ontology available (~4000 instances). Sentence-level annotated corpus available (900 documents and growing). Outcomes: delivered as D3.1 and D4.1; book chapter on data acquisition in preparation; paper on sentiment extraction (best paper award at CEC 2011 conference).

29 Next Steps. Improve ontology and gazetteers. Use corpus to improve sentiment classification. Increase throughput of sentiment extraction.

30 Thank you

