Published by Gianni Mandery. Modified over 9 years ago.
1
Core Information Processing Technologies
Technical Presentation & Demos
Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute)
Achim Klein (University of Hohenheim)
Luxembourg, November 2011
2
Technical WPs
[Diagram: work-package overview. Boxes shown: Management; Architecture, Integration & Scaling Strategy; Dissemination & Exploitation; Data Acquisition; Ontology Infrastructure; Information Extraction; Sentiment Analysis; Decision Support Infrastructure; Domain-independent GUI (Open Source); Information Integration; Data, Information & Knowledge Base; UC#1 Market Surveillance; UC#2 Reputational Risk Management; UC#3 Online Retail Brokerage. Work-package labels shown: WP1 & WP8, WP2 & WP7, WP3, WP4, WP5, WP6, WP9, WP10.]
We are here: Data Acquisition
FIRST Y1 Review Meeting
3
Data acquisition pipeline (Dacq)
[Diagram: RSS readers (one reader per site) feed, via load balancing, several parallel processing pipelines. Each pipeline consists of the same stages: Boilerplate remover, Language detector, Duplicate detector, Natural language preprocessing, Semantic annotator, ZeroMQ emitter.]
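The fan-out on this slide can be illustrated with a minimal sketch. It is not the Dacq implementation: the stage functions are hypothetical placeholders, round-robin is only one possible load-balancing policy (the slide does not name one), and a plain list stands in for the ZeroMQ emitter.

```python
from itertools import cycle

def make_pipeline(stages):
    """Chain stages in order; a stage returns None to drop the document."""
    def run(doc):
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                return None
        return doc
    return run

def dispatch(docs, pipelines):
    """Assumed policy: hand documents to identical pipelines in round-robin order."""
    results = []
    for doc, pipeline in zip(docs, cycle(pipelines)):
        out = pipeline(doc)
        if out is not None:
            results.append(out)
    return results

# Hypothetical placeholder stages (real stages would do actual analysis).
def remove_boilerplate(doc):
    return {**doc, "text": doc["text"].strip()}

def detect_language(doc):
    return {**doc, "lang": "en"}  # placeholder: always tags English

seen_texts = set()  # shared by all pipelines so duplicates are caught globally

def drop_exact_duplicates(doc):
    if doc["text"] in seen_texts:
        return None
    seen_texts.add(doc["text"])
    return doc

emitted = []  # stands in for the ZeroMQ emitter

def emit(doc):
    emitted.append(doc)
    return doc

stages = [remove_boilerplate, detect_language, drop_exact_duplicates, emit]
pipelines = [make_pipeline(stages) for _ in range(3)]
docs = [{"text": " Markets fell. "}, {"text": "Markets fell."}, {"text": "Rates rose."}]
processed = dispatch(docs, pipelines)
```

Here the second document is dropped as an exact duplicate of the first once boilerplate (whitespace, in this toy case) has been stripped, so two documents reach the emitter.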
4
Data acquisition pipeline (Dacq)
Demo video (3:20)
5
Data acquisition pipeline
[Diagram repeated from slide 3: RSS reader feeding parallel processing pipelines.]
6
Boilerplate removal
Demo video (1:30)
7
Data acquisition pipeline
[Diagram repeated from slide 3: RSS reader feeding parallel processing pipelines.]
8
Language detection
- Motivation: language-specific text analysis components
- Relatively simple problem
- Solutions based on word or character sequences (language models)
- Side effects: removes "garbage" and can be used to identify the code page
- Our implementation is based on frequencies of character sequences
Demo video (0:45)
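As an illustration of detection based on frequencies of character sequences, here is a minimal sketch. It is not the project's implementation: the two training sentences, the trigram size, the cosine comparison and the `min_score` threshold are all assumptions chosen to keep the example small. Returning `None` for low scores mirrors the "removes garbage" side effect.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Toy training data (assumption); real profiles would be built from large corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the market is open today",
    "de": "der schnelle braune fuchs springt über den faulen hund und der markt ist heute offen",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in SAMPLES.items()}

def detect_language(text, profiles=PROFILES, min_score=0.2):
    """Return the best-matching language, or None if nothing matches well enough."""
    grams = char_ngrams(text)
    lang, score = max(
        ((name, cosine(grams, profile)) for name, profile in profiles.items()),
        key=lambda pair: pair[1],
    )
    return lang if score >= min_score else None
```

With these toy profiles, `detect_language("the market is open")` selects English, `detect_language("der markt ist heute offen")` selects German, and a string of symbols is rejected.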
9
Data acquisition pipeline
[Diagram repeated from slide 3: RSS reader feeding parallel processing pipelines.]
10
Near-duplicate detection
Why is this a difficult problem?
- We are dealing with millions of documents, so we cannot afford to compare every document with every other document
- We are also looking for near-duplicates, not only exact matches
- Overlooked boilerplate "produces" false near-duplicates
Demo video (1:00)
11
Near-duplicate detection
- Existing approaches include SimHash, shingling and sketching, and SpotSigs
- Apart from SpotSigs, they require "clean" documents
- The similarity value is hard to interpret (how many characters, words, or sentences?)
- We are developing a novel solution that removes boilerplate and detects duplicates (with a clear interpretation) in the same framework
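To make one of the existing approaches concrete, here is a minimal sketch of word-level shingling with Jaccard similarity. It illustrates shingling only, not the novel solution mentioned on the slide; the shingle size `k=3` and the `threshold` value are assumptions. It also compares documents pairwise, so it does not address the scale problem raised on the previous slide.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences from lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Share of shingles common to both sets, between 0 and 1."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def is_near_duplicate(text_a, text_b, threshold=0.7):
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold

doc_a = "the central bank raised interest rates on monday"
doc_b = "the central bank raised interest rates on tuesday"
doc_c = "shares of the retailer fell after weak earnings"

similarity_ab = jaccard(shingles(doc_a), shingles(doc_b))  # 5 shared of 7 distinct shingles
similarity_ac = jaccard(shingles(doc_a), shingles(doc_c))  # no shared shingles
```

The resulting value (5/7 here) shows the interpretation problem the slide mentions: it counts shared shingles, which does not translate directly into a number of differing words or sentences.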
12
Technical WPs
[Diagram: work-package overview, repeated from slide 2.]
We are here: Ontology Infrastructure, Information Extraction
13
FIRST ontology
[Diagram: ontology classes SentimentObject, FinancialInstrument, Index, Stock_Index, Stock, Company, Country. Diagram labels: Seed indices, Constituents (stocks), Companies, Countries.]
14
FIRST ontology
:NASDAQ_100 a :Stock_Index ;
    rdfs:label "NASDAQ-100" .
:MICROSOFT a :Stock ;
    rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
    :memberOf :NASDAQ_100 .
:MICROSOFT_CORP a :Company ;
    rdfs:label "Microsoft Corp." ;
    :issues :MICROSOFT .
:USA a :Country ;
    rdfs:label "USA" .
:MICROSOFT_CORP :locatedIn :USA .
:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .
:MICROSOFT_CORP_Gazetteer :hasTerm "Microsoft Corp" ;
    :hasTerm "Microsoft Corporation" ;
    :hasStopWord "CORP" ;
    :hasStopWord "CORPORATION" ;
    a :Gazetteer .

Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.
Microsoft Corp
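A minimal sketch of how gazetteer terms like those above could be matched in text. The dictionary layout and the `annotate` function are hypothetical, and matching is assumed to be case-insensitive on whole words. The `:hasStopWord` entries are deliberately left out because the slide does not define how they are used.

```python
import re

# Terms copied from the :hasTerm statements on the slide; the dict layout is an assumption.
GAZETTEER = {
    "MICROSOFT_CORP": ["Microsoft Corp", "Microsoft Corporation"],
}

def annotate(text, gazetteer):
    """Return sorted (start, end, entity, matched_text) tuples for gazetteer terms in text."""
    hits = []
    for entity, terms in gazetteer.items():
        # Try longer terms first so "Microsoft Corporation" is preferred over "Microsoft Corp".
        alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
        pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), entity, match.group(0)))
    return sorted(hits)

text = "Microsoft Corporation is engaged in developing software. Microsoft also sells hardware."
annotations = annotate(text, GAZETTEER)
```

Note that the bare mention "Microsoft" in the second sentence is not matched by these two terms alone.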
15
FIRST ontology
[Diagram: classes shown: Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, Volatility, Price, Reputation, Correlation Definition, Orientation Phrase. Relations shown: objectHasCorrelationDefinition, featureHasCorrelationDefinition, indicatorHasCorrelationDefinition, correlationDefinitionInfluencesObject, correlationDefinitionInfluencesFeature, correlationDefinitionInfluencesIndicator.]
16
Annotation pipeline
[Diagram repeated from slide 3: RSS reader feeding parallel processing pipelines.]
Ontology-based semantic annotation
Demo video (3:00)
17
Technical WPs
[Diagram: work-package overview, repeated from slide 2.]
We are here: Sentiment Analysis
18
Sentiment Analysis
Object: Sentiment in financial web texts
Problem: Classification of sentiment orientation with respect to expected future …
- price change of financial instruments
- volatility change of financial instruments
- reputation change of companies
Approach: Knowledge-based sentiment classification
- Starting at the sentence level
- Specific to features of objects (e.g., reputation of a company)
19
Example
- Ambiguity: “The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007].
- Identification and differentiation of objects (and features)
- Relationships of indicators (e.g., earnings) and objects

Short term: uptrend
Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.

Long term: bear market
The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of 1929-1932.

http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry
20
Manual Sentiment Annotation
[Figure not recoverable; visible label: Topic]
21
FIRST Knowledge-based Sentiment Analysis Approach
1. Identify all sentiment objects and features (rules, ontology)
2. Extract all sentiments in one document (rules, ontology)
3. Classify sentiment orientation {positive, negative} for all sentence-level sentiments
4. Aggregate
   - All sentence-level sentiments: scoring gives a document-level sentiment score
   - All sentiment scores for a given day: averaging gives the Sentiment Index [-1,1]
Example input: Support for the SPX remains at 848 and then 789, with resistance at 912 …
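The aggregation step can be sketched as follows. The slide says only "scoring" and "averaging", so the document-level formula used here (the mean of sentence orientations coded as +1 and -1) is an assumption, as are the function names and the sample data.

```python
from statistics import mean

def document_score(orientations):
    """Assumed scoring: mean of sentence orientations (+1 positive, -1 negative)."""
    if not orientations:
        return None  # no sentiment sentences found in the document
    return sum(orientations) / len(orientations)

def sentiment_index(documents_by_day):
    """Average the document-level scores of each day; results lie in [-1, 1]."""
    index = {}
    for day, documents in documents_by_day.items():
        scores = [document_score(doc) for doc in documents]
        scores = [s for s in scores if s is not None]
        if scores:
            index[day] = mean(scores)
    return index

# Hypothetical input: per day, one list of sentence orientations per document.
documents_by_day = {
    "2011-10-03": [[+1, -1, -1, -1], [+1, +1]],  # document scores -0.5 and 1.0
    "2011-10-04": [[-1, -1], []],                # the empty document is skipped
}
index = sentiment_index(documents_by_day)
```

Because every document score is itself within [-1, 1], the daily average stays in the range stated on the slide.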
22
Sentence-level Sentiment Classification
a) directly
   Example: "I expect the S&P 500 to rise" → positive sentiment
   Addressed by rules
b) indirectly, via an indicator
   Example: "I think U.S. interest rates will rise" → negative sentiment
   Addressed by ontology
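The two cases can be sketched in a few lines. This is a toy illustration, not the project's rule set or ontology: the direction lexicon, the object list and the indicator correlation table are hypothetical, and the negative sign for interest rates is taken from the slide's own example.

```python
import re

# Hypothetical mini-lexicons for illustration only.
DIRECTION = {"rise": +1, "increase": +1, "fall": -1, "decline": -1}
OBJECTS = ["s&p 500"]
INDICATOR_CORRELATION = {"interest rates": -1}  # rising rates taken as negative, per the slide

def contains(text, phrase):
    """Whole-word, case-insensitive phrase match."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text, re.IGNORECASE) is not None

def classify(sentence):
    """Return 'positive', 'negative', or None if no rule applies."""
    direction = next((s for word, s in DIRECTION.items() if contains(sentence, word)), None)
    if direction is None:
        return None
    if any(contains(sentence, obj) for obj in OBJECTS):
        sign = direction  # case a) direct statement about the object
    else:
        correlation = next(
            (c for ind, c in INDICATOR_CORRELATION.items() if contains(sentence, ind)), None
        )
        if correlation is None:
            return None
        sign = direction * correlation  # case b) indirect, via an indicator
    return "positive" if sign > 0 else "negative"

direct = classify("I expect the S&P 500 to rise")
indirect = classify("I think U.S. interest rates will rise")
```

In the indirect case the orientation comes from multiplying the expected direction of the indicator by its assumed correlation with the object, which is the kind of relationship a correlation definition in the ontology could supply.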
23
Example Text: S&P 500
http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/
Oct 4, 2011 – 3:24 PM ET
The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis. Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high. Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.
24
Sentiment Sentences on Price Change of S&P 500
Negative sentiment about the future price change of the S&P 500
25
Sentiment Sentences on Volatility Change of S&P 500
Positive sentiment about the future volatility change of the S&P 500
26
Document-level Sentiment with respect to multiple objects/features
27
Initial Experiment Results
Accuracy of knowledge-based sentiment classification vs. standard machine learning methods
- Small manually classified corpus
- Result: 7% more accurate
Portfolio selection experiment
- Use sentiment to select Dow Jones stocks
- Result: Excess returns seem possible
More information in the paper "Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach", published in the IEEE CEC 2011 conference proceedings
28
Main Y1 Achievements
- Data acquisition software running
- Sentiment analysis for web texts
  - Sentence-level, and specific to features of objects
  - Initial experiment results are promising
- Ontology available (~4000 instances)
- Sentence-level annotated corpus available (900 documents and growing)
- Delivered as D3.1 and D4.1
- Book chapter on data acquisition in preparation
- Paper on sentiment extraction (best paper award at CEC 2011 conference)
29
Next Steps
- Improve ontology and gazetteers
- Use corpus to improve sentiment classification
- Increase throughput of sentiment extraction
30
Thank you