Ian Koenig Chief Architect – Thomson Financial

Presentation on theme: "Ian Koenig Chief Architect – Thomson Financial"— Presentation transcript:

1 Ian Koenig Chief Architect – Thomson Financial
April 6, 2019 C4. Case Study: Event Processing as a Core Capability of Your Content Distribution Fabric 20 September 2007 Ian Koenig Chief Architect – Thomson Financial

2 A framework for discussing Complex Event Processing
Agenda
- A framework for discussing Complex Event Processing
- “Elementized” News as a data source
- The “fabric” for distributing content and its emerging capabilities for stream processing
- Enabling new types of data sources, creating new “opportunities”
- Hinting at a larger pattern for distributing content and the role that complex event processing will play

3 The CEP Framework model
[Diagram: Content Sources (Level 1, Level 2, News, Other), Content Streams, Stream Adapters, CEP Engine (Event Proc), Stream Agents, Application Logic, User Interface, New Content Streams]
- Stream adapters describe the incoming data streams so that they can be processed by the CEP engine.
- Stream agents read the streams and inject new data.
- SQL (with extensions) is the de facto language used.
- The primary types of analysis are on a single stream “in time” or across two or more streams. (“Don’t cross the streams” – Ghostbusters)
- Application logic is another type of “agent” that generates either a new stream to go outside the box or a user interface.
- New streams can in turn be new CEP inputs or, in the case of algorithmic trading, a “FIX” order stream.
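The two analysis types named above, a single stream “in time” and a cross of two streams, can be sketched as follows. This is a minimal illustration in plain Python rather than in the SQL dialect of any CEP engine; the function names, tuple layouts and the `max_lag_seconds` parameter are assumptions made for the example, not part of the presentation.

```python
from collections import deque


def moving_average(ticks, window_seconds):
    """Single stream "in time": average price over a sliding time window.

    ticks is an iterable of (timestamp, price) pairs in time order.
    """
    window = deque()
    total = 0.0
    for ts, price in ticks:
        window.append((ts, price))
        total += price
        # Evict ticks that have fallen out of the time window.
        while window[0][0] <= ts - window_seconds:
            _, old_price = window.popleft()
            total -= old_price
        yield ts, total / len(window)


def join_news_with_prices(news_events, price_ticks, max_lag_seconds):
    """Crossing two streams: pair each news event with the first price tick
    for the same symbol that arrives within max_lag_seconds after it."""
    matches = []
    for news_ts, symbol, headline in news_events:
        for tick_ts, tick_symbol, price in price_ticks:
            if tick_symbol == symbol and 0 <= tick_ts - news_ts <= max_lag_seconds:
                matches.append((headline, price))
                break
    return matches
```

A real engine would evaluate both incrementally over unbounded streams; the nested loop in the join is kept only for readability.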

4 Thinking Outside the CEP Box
[Diagram: the CEP box (Stream Adapters, CEP Engine with Event Proc, Stream Agents, Application Logic, User Interface, New Content Streams) and, labeled “Outside The CEP Box”, Content Sources and Content Streams (Level 1, Level 2, News, Other)]
Most of the rest of this conference focuses on the CEP box. This discussion focuses “out of the box”.

5 Thinking Outside the Box
The Classic 9 Dots Puzzle: Connect all 9 dots with 4 straight lines without ever taking the pencil off the paper.
As an aside, this puzzle is the genesis of the phrase “thinking outside the box”: to solve the problem, you have to “think outside the box”.

6 Content Sources and Content Distribution
[Diagram: Content Sources and Content Streams (Level 1, Level 2, News, Other), Content Distribution Fabric, CEP Engine (Stream Adapters, Event Stream Agents, Application Logic)]
A lot of focus exists on market data sources as inputs to CEP. By market data, we mean Level 1 and Level 2 data: pricing, bids/asks, orders and depth.
The next content source of interest, and there is quite a bit of press about this (no pun intended), is News.

7 Elementized News Moves Markets …
Why News? News Moves Markets …
[Sample story, annotated with callouts for Aboutness, Genre, Facts and Sentiment]
SEC will Allow Companies to use the Internet to Improve Investor-Management Communications
From CFO.com - August 16, 2007
According to SEC chairman Christopher Cox, the commission will allow companies to use the Internet to improve investor-management communications. As currently proposed by the commission, a company interested in offering this venue to shareholders would alert them via …
News is important to algorithmic trading because news “moves markets”. To make news “process-able” by computers we need to know four main things:
- Aboutness: What the news is about within a well-defined concept system (or ontology). Aboutness is defined by Entities and Subjects: the Companies, People, Markets, Places, Events, the Economy, Human Interest, etc. that the news is “about”. In database “lingo”, Entities and Subjects are all entities, but we differentiate those that link other content sets (in our universe) together (like Companies and People) from those that do not. Those that do are “Entities”; those that do not are “Subjects”.
- Genre (or Type): Is this a story, feature, market report, blog, opinion, rumor, etc.?
- Sentiment: Is the story generally positive, negative or neutral about the subject(s)?
- Facts: Specific numbers, such as economic releases or company earnings, that are tagged with XML elements.
Aboutness, Genre, Sentiment and Fact “tags” are added both at the document level and marked up in-line. This is called Elementized News. Computers can aid the traditionally human process. This is called auto-categorization.

8 The Metaverse
The Metadata Universe (or Metaverse) is the set of Categories (Entities and Subjects) that provide semantic understanding for text and data.
[Diagram: Entity Model + Categorization Hierarchy]
Entity types shown:
- Geography: Regions, Countries, Physical Features
- Industry: Sector Hierarchy (Multiple Schemes)
- Market: Equity, Commod, FI, et al
- Organization: Gov’t, Agency, Company, NGO
- Market Participant
- Person (Multiple Roles)
- Instrument: Security, Future, Derivative, et al
- Index: Financial Indexes
- Indicator: Economics, Market Stats
- Event: Corp. Action, Meeting, et al
- Quote: Quote, Trade, IOI, Advertisement, Order
Relationship labels shown: Operates within; Is grouped by; Analyst For; Subsidiary of; Officer of; Mkt. Part. – Provides Quotes; Listed (Market Participant); Index For; Issues; Indicator For; Has Quotes

9 Categorization Mark-up Example
Entity: Pharmaceuticals – An Industry Entity
Entity: Schering-Plough (SGP-US) – An Organization Entity of type: Company
Entity: Merck KGAA (MRK-US) – An Organization Entity of type: Company
Auto-categorization and Entity Extraction use case:
The company Merck (which is a type of Organization) is referenced. Once it is identified, we know the ticker symbol and can therefore bring in prices. The same is true for Schering-Plough. Both are in the Pharmaceuticals industry, which we can deduce both from the text of the story and from the associated industry of the referenced companies.
Finally, you may notice that we do not identify drugs, which we would if we were “Thomson Scientific”. Drugs are not “yet” part of the entity model for Thomson Financial.

10 NewsML Mark-up example
Document Level Mark-up (Categories only)
...
<subject type="type:subject" qcode="CategoryId: " creator="org:thomson"/>
<subject type="type:subject" qcode="CategoryId: " creator="sys:care" why="why:machine-generated" confidence="70" relevance="65"/>
In-line Markup (Categories + Facts)
<body>
...
<p>The <toc:Category xsi:type="toc:Indicator" IndicatorId=" ">unemployment rate</toc:Category> fell one-tenth of a percentage point to 5.4 percent, the lowest rate since October 2001, primarily because 152,000 adults dropped out of the labor force.</p>
<p>"We were encouraged to see the headline payroll number meet expectations after two months of disappointments," said <toc:Category xsi:type="toc:Organization" OrganizationId=" ">SunTrust Robinson Humphrey</toc:Category> analyst <toc:Category xsi:type="toc:Person" PersonId="122456">Tobey Sommer</toc:Category>. The report, he said, "is likely to improve investor sentiment on employment-related stocks."</p>
<p> <toc:Category xsi:type="toc:Quote" QuoteId=" ">Manpower (MAN-US)</toc:Category> shares led the gainers, rising 2.5 percent to $ <
</body>
Sample NewsML
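As a rough sketch of how a downstream consumer might pull the in-line entities out of mark-up like the sample above, the following uses Python's standard `xml.etree.ElementTree`. The `toc` namespace URI and the `IND-1` identifier are placeholders invented for the example (the slide elides the real values), and the sample body is shortened.

```python
import xml.etree.ElementTree as ET

TOC_NS = "urn:example:toc"  # hypothetical namespace URI, not from the slide
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Shortened stand-in for the in-line mark-up; IND-1 is a placeholder id.
SAMPLE = f"""<body xmlns:toc="{TOC_NS}" xmlns:xsi="{XSI_NS}">
  <p>The <toc:Category xsi:type="toc:Indicator" IndicatorId="IND-1">unemployment rate</toc:Category> fell to 5.4 percent.</p>
  <p>Analyst <toc:Category xsi:type="toc:Person" PersonId="122456">Tobey Sommer</toc:Category> commented.</p>
</body>"""


def extract_entities(xml_text):
    """Return (entity type, surface text) for every in-line Category tag."""
    root = ET.fromstring(xml_text)
    entities = []
    for element in root.iter(f"{{{TOC_NS}}}Category"):
        kind = element.get(f"{{{XSI_NS}}}type", "")
        # "toc:Person" -> "Person"
        entities.append((kind.split(":")[-1], element.text))
    return entities
```

Calling `extract_entities(SAMPLE)` yields the tagged indicator and person in document order, which is the information a CEP application would need to correlate the story with other streams.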

11 Auto-categorization Technology
- Much financial, legal and medical information exists in the form of textual documents.
- Traditional “editorial” processes to tag/index documents can now be augmented by algorithms that achieve very high precision (~95%) against very large ontologies (tens of thousands of terms).
- Thomson employs a technology called CaRE (Categorization and Recommendations Engine) to do this, which originated in the Thomson Legal and Regulatory division.
- CaRE uses a set of statistics-based algorithms that are trained to understand a specific ontology as a concept scheme.

12 Elementized News – Summary
In-line News vs. Document Level Mark-up
Each News story is tagged at three levels:
- Document Level: The overall story lists all the category metadata (Entities + Subjects + Genre + Sentiment) for the story.
- In-line Entities: Each initial reference to an Entity is marked up “in-line” in the document for additional context.
- In-line Facts: Specific numeric elements (e.g. US GDP or Thomson Q3 Revenue) are tagged using XML elements.
The Value of News Elements
- Sentiment tags (e.g. positive earnings or negative rating) and Subject tags provide semantic understanding of the news story.
- Numeric Facts (when Elementized) are directly process-able by algorithms.
- Entity tags (e.g. Company references) allow news to be linked and correlated to market data streams by CEP engines, for example, to make trading decisions.
Entities are the well-defined business objects that link financial content sets together.
Subjects (or topics) include Entity groupings, Entity types, and other classifiers of interest that are not distinct “entity nouns”.
Subjects + Entities = Categories

13 Content Sources and Content Distribution
[Diagram: Content Sources and Content Streams (Level 1, Level 2, News, Other), Content Distribution Fabric, CEP Engine (Stream Adapters, Event Stream Agents, Application Logic)]
Lead-in to the content distribution fabric:
So we’ve talked a bit about market data streams and now about news data streams. We know about the complex event processing logic that consumes these streams. So let’s talk about the distribution fabric that lies between the content sources and the CEP applications.

14 The Content Distribution Fabric
[Diagram: Content Aware Network with three panels. Intermediation: Service Provider, Service Consumer, Service Broker, Service Contract (Register, Find, Bind). Initialization: Service Provider, Service Consumer. Synchronization: Service Provider, Service Consumer.]
The Fabric itself provides three main functions:
- Intermediation – The process by which we connect service providers (content sources) and service consumers (CEP applications). These must be “loosely coupled”, so we provide a metadata-driven service broker to connect the two. By loose coupling we mean simply the ability to make a breaking (i.e. non-backwards-compatible) change to the interface without breaking the application, so that we can move from the previous version to the next by upgrading provider and consumer independently, one at a time and in either order. The Service Broker provides “Register, Find and Bind” (explain).
- Initialization – The process by which we get an empty application up to a known “synchronization checkpoint” as defined
- Synchronization
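One way to picture “Register, Find and Bind” together with the loose-coupling requirement is a broker keyed by service name and contract version. The class, the method signatures and the two `news` providers below are hypothetical, chosen only to show a provider and a consumer upgrading independently; they are not the fabric's actual interface.

```python
class ServiceBroker:
    """Toy metadata-driven registry connecting providers and consumers."""

    def __init__(self):
        self._providers = {}  # (service name, contract version) -> provider

    def register(self, name, version, provider):
        """Provider side: publish a service under a versioned contract."""
        self._providers[(name, version)] = provider

    def find(self, name):
        """Consumer side: list the contract versions currently on offer."""
        return sorted(v for (n, v) in self._providers if n == name)

    def bind(self, name, version):
        """Consumer side: obtain the provider for one contract version."""
        try:
            return self._providers[(name, version)]
        except KeyError:
            raise LookupError(f"no provider for {name} v{version}") from None


def news_v1():
    return {"headline": "Payrolls meet expectations"}


def news_v2():
    # Breaking change: "headline" was renamed, so v1 consumers cannot read it.
    return {"title": "Payrolls meet expectations", "genre": "market report"}


broker = ServiceBroker()
broker.register("news", 1, news_v1)
broker.register("news", 2, news_v2)    # the provider upgrades first
old_consumer = broker.bind("news", 1)  # the existing consumer keeps working
new_consumer = broker.bind("news", 2)  # an upgraded consumer moves when ready
```

Because both contract versions stay registered side by side, the breaking change in version 2 does not disturb a consumer still bound to version 1.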

15 “X” Marks the spot
XML as the enabling technology [Diagram labels: XML, XPATH, XSLT]
XPATH: The XML Path Language (XPath) is an expression language for addressing parts of an XML document. XPath models an XML document as a tree in which elements, attributes and contents are nodes. Its name derives from its use of a path notation, with slash separators, to descend the branches of the tree. An XPath expression “selects” a node-set (one or more nodes), a string, a number or a Boolean.
XSLT: XSLT is typically used to transform an XML document into another XML document. It does this with stylesheets. XSLT stylesheets are similar to Cascading Style Sheets, with two differences: XSLT stylesheets are written in XML, and XPath is used to determine which parts of the original document to transform.
But XML has always been “too big” and “too slow” except for niche use. Or has it?

16 Content-Aware Hardware Infrastructure
[Diagram: Mobile Devices, Applications and Databases connected through a Content-Aware Network. Modules: IP/MPLS Routing Module, Transformation Module, Advanced Interface Module, Assured Delivery Module. Figures shown: 500,000 routes; 1000’s xforms/sec; >1MM msgs/sec; active/active fail-over.]
We use hardware from Solace Systems to provide the underlying content-aware network that lays the foundation for the synchronization capability. We call this the “Message Bus”.
- XML as the message format, which is now “enabled” by advances in hardware-based acceleration, compression and routing
- XPATH as the subscription language
- XSLT to make messages smaller (fewer fields)
- Ultra-low latency across the bus (0.7 ms transit for a 4K XML message)
- JMS + HTTP + “native” interfaces for access
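The idea of XPath as a subscription language can be illustrated in software: each subscriber registers an XPath expression, and a message is delivered to every subscriber whose expression selects at least one node. This sketch uses Python's standard `xml.etree.ElementTree`, which implements only a subset of XPath; the message layout and subscriber names are invented for the example and say nothing about how the hardware evaluates subscriptions.

```python
import xml.etree.ElementTree as ET


def matches(message_xml, xpath):
    """True if the XPath expression selects at least one node in the message."""
    root = ET.fromstring(message_xml)
    return root.find(xpath) is not None


def route(message_xml, subscriptions):
    """Return the names of all subscribers whose XPath matches the message."""
    return [name for name, xpath in subscriptions.items()
            if matches(message_xml, xpath)]


# Hypothetical message and subscriptions, for illustration only.
MESSAGE = """<news>
  <entity type="Company" ticker="MAN-US"/>
  <genre>market report</genre>
</news>"""

SUBSCRIPTIONS = {
    "manpower-desk": ".//entity[@ticker='MAN-US']",
    "merck-desk": ".//entity[@ticker='MRK-US']",
    "all-market-reports": "./genre",
}
```

Here `route(MESSAGE, SUBSCRIPTIONS)` selects the first and third subscribers; the routing decision depends on message content rather than on a fixed topic name.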

17 New Streaming Content Sources
[Diagram: Content Sources and Content Streams (Level 1, Level 2, News, Research, Briefings, Filings, Deals (M&A), Financials, Estimates), Content Distribution Fabric (Intermediation, Initialization, Synchronization), Complex Event Processing Applications (Stream Adapters, CEP Engine, Event Stream Agents, Application Logic)]
Now that we have the basic “event streaming capability”, we can use it to create data streams for content sets that we do not traditionally think of as “streaming”, such as:
- Research
- Estimates
- SEC Filings
- Company Financials
- M&A
- Lots more
In essence, we can learn lessons from market data that are transferable to all other content sets. By moving to an event-oriented mindset we can not only increase business value but also reduce network traffic and system load. Even though XML is “big” and “slow”, we have techniques to manage that, and sending only the data that changes has a clear “mathematical” advantage on big, slow-moving data sets.
We had one case where a 5 GB database was “replicating” 100 GB/day. Moving to an eventing model will reset that to 10 MB/day, because that is all the data that actually changes. That is a 10,000-fold decrease in aggregate network traffic.
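Taking the quoted figures at face value (a 5 GB database, 100 GB/day replicated, 10 MB/day of actual change), the arithmetic behind the “mathematical” advantage can be checked in a few lines. Decimal units (1 GB = 10^9 bytes) are an assumption here; the ratio is the same with binary units.

```python
GB = 10 ** 9  # decimal units assumed
MB = 10 ** 6


def reduction_factor(replicated_bytes_per_day, changed_bytes_per_day):
    """How many times less traffic an event stream carries than replication."""
    return replicated_bytes_per_day / changed_bytes_per_day


database_size = 5 * GB
replicated = 100 * GB
changed = 10 * MB

# Replication moves the equivalent of the whole database 20 times a day.
copies_per_day = replicated / database_size
# Sending only the changes cuts daily traffic by a factor of 10,000.
factor = reduction_factor(replicated, changed)
```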

18 The Entity Model vs the Relational Model
[Diagram labels: Relational Data, Physical Table, Logical Entity, Business Entities, Canonical Business Entities, XML Transform]
Relational models are great for storing data. They are physical models; they are not how we “use” the data. When people use RDBs they often have to restructure the data into logical business entities or objects to make it more usable to the application logic.
We use the Entity model (encoded in XML) as the canonical data model for transporting data across the Enterprise. This model is very powerful. In particular, it forces the unit of data transfer to be logical elements that mean something to the business (not tables encoded as files with fragile foreign keys). The data is structured hierarchically, which is very different from the way it is structured in the RDB and much more object-oriented. What comes out “feels” more like financial business objects and financial business events.
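To make the contrast concrete, the sketch below folds rows from two hypothetical relational tables (a company table and a quote table joined by a foreign key) into one hierarchical XML entity. The element names, column names and sample rows are illustrative assumptions, not the Thomson canonical model.

```python
import xml.etree.ElementTree as ET


def company_entity(company_row, quote_rows):
    """Fold a company row and its related quote rows into one XML entity."""
    company = ET.Element("Organization", type="Company", id=str(company_row["id"]))
    ET.SubElement(company, "Name").text = company_row["name"]
    quotes = ET.SubElement(company, "Quotes")
    for row in quote_rows:
        # The foreign key is resolved here, so the consumer never sees it.
        if row["company_id"] == company_row["id"]:
            ET.SubElement(quotes, "Quote", ticker=row["ticker"])
    return ET.tostring(company, encoding="unicode")


# Hypothetical rows standing in for two physical tables.
COMPANY = {"id": 1, "name": "Manpower"}
QUOTES = [
    {"company_id": 1, "ticker": "MAN-US"},
    {"company_id": 2, "ticker": "SGP-US"},
]
```

The consumer receives one self-contained business object per company instead of two tables that it would have to join itself.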

19 Changed Data Capture as an Enabling technology
[Diagram labels: Content Source; (1) Changed Data Capture (publish); (2) Table Triggers; (3) Log Mining of the Transaction Log; (4) Transform to XML; Content Distribution Fabric]
Changed Data Capture technology as an enabler of Event Streams
But to get these traditional databases to stream, you need new “enabling technology”. Enter “Changed Data Capture”.
Changed Data Capture technology watches the transaction logs of traditional databases and can create messages when transactions are committed. Furthermore, CDC tools have transformation capability, so those messages need not follow the relational data model of the database. Instead, in our case they follow an Entity or Object model.
Some CDC tools are better than others at transformation. Some will work in conjunction with a more traditional ETL tool where sophisticated transformations are needed. This ETL process is what we use to create the initialization file.
1: Publishing pipeline – For databases built using a “publishing pipeline pattern”, events can be generated directly.
2: Database Triggers – Database triggers can be used to generate events, but this is not recommended.
3: Log Mining – A technique that watches the transaction log that modern databases use to capture all changes as they are made.
4: Transformation – The final step is transforming the transactional changes made to the databases into XML messages that capture the “business event” and are process-able downstream.
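The log-mining and transformation steps can be sketched as a small generator: it reads transaction-log records, holds each transaction's changes until it commits, and then emits a single XML message shaped as a business event rather than as table rows. The log record format, the `EstimateRevised` event name and the column names are assumptions made for the example; real CDC tools read vendor-specific log formats.

```python
import xml.etree.ElementTree as ET


def capture_changes(transaction_log):
    """Yield one XML business event per committed transaction."""
    pending = {}  # transaction id -> list of uncommitted change records
    for record in transaction_log:
        txn = record["txn"]
        if record["op"] == "COMMIT":
            changes = pending.pop(txn, [])
            if changes:
                yield to_business_event(txn, changes)
        elif record["op"] == "ROLLBACK":
            pending.pop(txn, None)  # rolled-back work is never published
        else:
            pending.setdefault(txn, []).append(record)


def to_business_event(txn, changes):
    """Transform row-level changes into a message in the entity model."""
    event = ET.Element("EstimateRevised", transaction=str(txn))  # hypothetical event
    for change in changes:
        field = ET.SubElement(event, "Field", name=change["column"])
        field.text = str(change["new_value"])
    return ET.tostring(event, encoding="unicode")


# Hypothetical log: transaction 1 commits, transaction 2 rolls back.
LOG = [
    {"txn": 1, "op": "UPDATE", "column": "eps_estimate", "new_value": 1.25},
    {"txn": 2, "op": "UPDATE", "column": "eps_estimate", "new_value": 9.99},
    {"txn": 2, "op": "ROLLBACK"},
    {"txn": 1, "op": "COMMIT"},
]
```

Only the committed transaction produces a message, which matches the point that messages are created when transactions are committed.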

20 Content Distribution Pattern
[Diagram labels: Content Source(s); Ingest Interface (Feeds + Authoring); The Content Master; Metadata; Data Interface (Content Distribution); Content Distribution Fabric (Intermediation, Initialization, Synchronization); Canonical Data Model (in XML); The Application Database; Service Interface; Human Interface; The Enterprise Database]
Lots of the literature refers to the Enterprise Database as an amorphous “mass” of uncoordinated data, with SOA as the white knight to slay the dragon of chaos. Maybe it is. But the real answer is that any technique like SOA or OO or MDM or “pick your favorite acronym” brings a degree of architectural rigor that, if properly applied both technologically and organizationally, allows the dragon of chaos to be slain.
This model learns the lessons from market data and datafeeds and applies those lessons widely across the world of content. What we learn is that the Enterprise Database is not a single thing. In fact we see two distinct classes of databases:
- Content masters, which exist to house the single version of the truth. They are highly normalized, highly optimized for update, and never delete data.
- Application databases, which exist to make predictable queries fast. They are highly denormalized and optimized for retrieval.
Between them sit the Data Interfaces: initialization files and content streams.
The Content Distribution Fabric provides the concrete implementation for:
- Intermediating publishers and subscribers
- Initializing
- Staying synchronized with changes
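The interplay between an initialization file and a content stream can be sketched as two steps: load a snapshot up to its checkpoint, then apply only the events published after that checkpoint. The dictionary layout, the `seq` sequence numbers and the sample values are assumptions made for illustration; the presentation does not specify the file or message formats.

```python
def initialize(initialization_file):
    """Bring an empty application database up to the file's checkpoint."""
    state = dict(initialization_file["records"])
    return state, initialization_file["checkpoint"]


def synchronize(state, checkpoint, content_stream):
    """Apply only the events published after the checkpoint."""
    for event in content_stream:
        if event["seq"] <= checkpoint:
            continue  # already reflected in the initialization file
        state[event["key"]] = event["value"]
        checkpoint = event["seq"]
    return state, checkpoint


# Hypothetical snapshot and stream for one application database.
INIT_FILE = {"checkpoint": 2, "records": {"MAN-US": 10.0}}
STREAM = [
    {"seq": 2, "key": "MAN-US", "value": 9.0},   # older than the snapshot
    {"seq": 3, "key": "MAN-US", "value": 11.0},
    {"seq": 4, "key": "SGP-US", "value": 30.0},
]
```

Skipping events at or before the checkpoint is what lets a consumer join the stream at any time without double-applying changes.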

21 And if you Squint just a little tiny bit …
[Diagram labels: Content Source(s), Ingest Interface (Ripping), Content Master Database, Metadata, Data Interface (Content Distribution), Application Database, Service Interface, Human Interface]
And if you squint just a little tiny bit, you might have seen this pattern once before.
TF Information Architecture v0.7

22 The World of Event-Oriented Content
[Diagram: Content Sources and Content Streams (Level 1, Level 2, News, Orders, IOIs, Research, Briefings, Filings, Estimates, Financials, Deals (M&A), and more), Content Distribution Fabric (Intermediation, Initialization, Synchronization), Complex Event Processing Applications (Stream Adapters, CEP Engine, Event Stream Agents, Application Logic)]
The new world: the only constancy is “change”. All data “moves”; it’s only a matter of “how fast”.
Key technologies are stream-oriented databases, content distribution fabrics and a general awareness of the rules around event stream processing.
In this new world, all content has the potential to change “transactionally”. We have lots of interesting new content streams for CEP-aware applications and a content distribution fabric that itself has event stream processing capabilities.

23 The End Q & A

24 Appendix

25 Thomson Master Categories: Sample Structure
Canonical Terms are mapped to presentation terms at the most precise level of the hierarchy. The presentation hierarchy contributes to search relevance.
[Diagram: Canonical categories mapped to a Presentation hierarchy]
Canonical: Geography (3353 Categories); Industry (2482 Categories); Market (354 Categories); Instrument (Security, Future, Derivative, et al); Indicator (Economics, Market Stats); Event (Corp Action, et al)
Presentation terms shown: Surveys & Cyclical Indexes; Consumer Surveys; Business Surveys; Cyclical & Activity Indexes; Activity Index; Leading Indicator; Economics & Trade; National Accounts; GDP by Expenditure; GDP by Industry; Incomes; Exports; Imports; Investment Capital; Money & Finance; Money Supply

26 Intelligent Network Hardware Performance
Metric                                         Software Infrastructure     Hardware Infrastructure
Messaging Throughput (msgs/sec)                Tens of thousands           >Million
Messaging Latency (at 50% of peak load)        Milliseconds                Microseconds
Transformations (sustained throughput)         MB/sec                      GB/sec
Content Routing (number of rules)              Small number of thousands   Hundreds of thousands
Content Routing Latency (with content rules)   Seconds                     Microseconds
Persistent Messaging (msgs/second)             A few thousand              Many tens of thousands

