Download presentation
Presentation is loading. Please wait.
Published byЂоко Кораћ Modified over 6 years ago
1
Reading Between the Lines: Mining Domain-Specific Informal Texts with Mixed Registers and Languages
Dr. Shmuel Bar
2
IntuView Text Analytics
IntuView is a developer and provider of “meaning mining” in unstructured text, including: Idea Extraction, Object-Attribute Focused Sentiment Identification, Entity Recognition, Entity Recognition, Analysis and Resolution, and Entity Matching. IntuView currently fully supports English, Spanish, French, Arabic, Urdu (Pakistani) and Indonesian with Russian, Farsi, Hebrew and Hindi in the works. 2/17/2019
3
The Need To extract information from large amounts of “Big Data” relating to areas of interest in order to discover the “unknown unknown”. To separate the “wheat from the chaff” and to extract meaning from texts. To identify non-evident links between persons, organizations, places, objects and ideas. To do all of that in a polyglot global environment in which each piece of the puzzle may be in a different language or “sub-language”.
4
IntuView – Technology overview – the Idea
Imitation of human " fast thinking” (Daniel Kahneman). ”Fast thinking" is not transparent to the individual and is perceived as “intuition”, it is associative and incorporates social and cultural “memes”. It allows us to: Process new information by identifying patterns and features from familiar texts seen in the past, including through relevant “extra-linguistic” information. Create impressions, feelings and tendencies, “jump” to conclusions following a familiar pattern identification, produce “intuitive” links, to create a coherent pattern of ideas and associations, detect anomalies, reduce ambiguity, and to identify the" unknown unknown ”. Confidential – Intuview proprietary February 17, 2019February 17, 2019
5
Language Register 2/17/2019
6
Identification of the “register” of the language
IntuView – Technology overview – the Underlying Concept As a result of these insights, the process of disambiguation of meaning in texts is based on a number of stages: Identification of the “register” of the language Statistical categorization of a document as belonging to a certain domain, topic, or cultural or religious context in order to reduce ambiguity. Using the lexical tokens directly preceding or following the token or tokens in question – in order to provide additional disambiguating information. Matching the lexical tokens after disambiguation to unequivocal instances in an ontology, adding the ontological relationships and features to those of the lexical token. Confidential – Intuview proprietary February 17, 2019February 17, 2019
7
Ontology Hierarchy 2/17/2019
8
IntuView Technology – Key Functionalities
Language Identification - identifies the language and the “Sub-language” (Register/ Dialect) and processes texts in any of the supported languages. . The register of the language may represent a certain period of the language), dialects, social strata etc. Domain Detection – determines the general domain or genre of the text. Categorization - determines the relevance of the text to the user’s requirement, ideological/political affiliation, document type, etc. Entity Recognition - recognizes, analyses and aggregates names (persons, places, organizations etc.) in documents or in input lists, regardless of their orthography, and extracts implicit and contextual information on them, such as: ethnicity, gender, affiliations, nicknames, positions, titles and possible relations with other identified entities. Name Matching - validates, disambiguates and matches names in input texts or lists against an existing watch list or a database of entities extracted from documents. Idea Extraction – based on domain detection and domain-specific ontologies with associated lexicons, extracts implicit ideas and the “hermeneutics” (implicit meaning) of the text. Sentiment Analysis – identifies nuanced opinion vis-à-vis attributes of entities or ideas to generate a comprehensive picture of sentiment. Report - provides a report of the key elements in the text and enables semantic search in large amounts of processed documents. Queries on the database to find combinations of entities or pieces of information.
9
IntuView – Technology overview
The process Confidential – Intuview proprietary February 17, 2019February 17, 2019
10
Ontological Disambiguation
Court 2/17/2019 Court of law Royal Court Athletics Court Legal Domain Sports Domain Ontological context relating to Legal Affairs Ontological context relating to monarchy Ontological context relating to Sport Disambiguation
11
Ontological Disambiguation
Court בית משפט חצר מגרש 2/17/2019 Court of law Royal Court Athletics Court Legal Domain Sports Domain Legal Context Political Context Sport Context Disambiguation
12
Idea Mining “Go to the mattresses” = in US English - fight (The Godfather) “Fight on the beaches” = in UK English - long struggle, sacrifices, ultimate victory (Churchill). إن الله اشترى من المؤمنين أنفسهم وأموالهم - (Allah has purchased from the believers their souls and their money - Koran): In Jihadi texts - legitimacy of suicide attacks In moderate texts – duty to give charity 2/17/2019
13
Implicit Information Shmuel Bar, Bouali Balahsan and Nigel Smith-Jones met at Penn Station = Israeli, Tunisian and Briton met in Manhattan/New York/USA or in Baltimore/Maryland. The Popular Front for the Liberation of Ahwaz: Popular Front – marxist-organisation-type Liberation Front – terrrorist-organisation-type Ahwaz – province in Iran = Organization – Nationality= Iran, Ideology=marxist, Organization type = terrorist 2/17/2019 A name of an entity may provide implicit culturally-coded or country-specific information on: gender, status, ethnicity, religion, age, possible affiliations and more.
14
Entity Matching The Problem
“They spell it Vinci and pronounce it Vinchy; foreigners always spell better than they pronounce” - Mark Twain How to quickly, effectively and precisely identify, vet, and match names of individuals; to extract contextual and implicit information such as gender, ethnicity, status, family- tribal relations and to overcome inconsistency of data due to different cultural and linguistic conventions and errors in transmitting phonetic information. The Solution IntuScan identifies the linguistic origin of the name, reverses it to its source orthography, applies cultural naming conventions and statistical models to generate name variants and discovers information such as ethnic origin, gender, religion/sect, status, family/tribal links etc.
15
IntuView Sentiment Analysis
IntuView Sentiment Analysis seeks to identify: Who is the “holder” of the sentiment and who is the “object”? What is the attribute/property of the object towards which the sentiment is expressed (e.g. “honesty”), where is the sentiment on a positive-negative scale (e.g. corrupt” ranks higher on the negative scale of “Honesty- negative” than “disingenuous”)? Are there multiple sentiments that can be aggregated as “public opinion” (e.g. multiple expressions on a high negative scale of “Honesty” of the same person)? What is the context of the sentiment or the theme of the discourse in which it occurs (e.g. negative sentiment towards the Prime Minister in the context of the price of “Milky” or in the context of Qassam rockets)? Is there sentiment expressed towards entities who are not mentioned explicitly in the text but alluded to through entities that are mentioned explicitly . Confidential – Intuview proprietary February 17, 2019February 17, 2019
16
IntuView – Sentiment February 17, 2019February 17, 2019
Identification of the ontological instances linked to the lexical tokens from different languages. “dishonest”/”malhonnête”/לא ישר= dishonest. “smart”/”bright”/” doué”/״מבריק״ = intelligent Identification of the semantic parents of the ontological instances dishonest ∈ Honesty-negative intelligent ∈ Intelligence-positive A: Y is dishonest = Honesty-negative B: Y is corrupt = Honesty-negative C: Y is smart = Intelligence-positive D: Y is brilliant = Intelligence-positive Agreement: Y is negative in honesty and positive in intelligence. Aggregation of references from different sentiment holders towards attributes of each object. Aggregation of information regarding sentiment towards specific entities to a general aggregated sentiment towards their “parent” entities A: W is dishonest = Honesty-negative B: X is corrupt = Honesty-negative C: Y is disingenuous = Honesty-negative D: Z is shifty= Honesty-negative W, X, Y, Z∈ K (ontologically pre-identified group) Agreement : K is negative in honesty Confidential – Intuview proprietary February 17, 2019February 17, 2019
17
Applications February 17, 2019February 17, 2019
Military - Real time field DOCEX for fighting forces, scanning of websites, social media, SIGINT, “Content Forensics” of seized digital storage, etc. Legal Research - Entity recognition, resolution and discovery of relationships, sentiment and implicit information of entities in the analyzed documents. Advertising and Political Campaigns - analysis of social networks to identify subgroups, extraction of information about their different areas of interest, identification of the sentiment of group members towards entities, products and issues, and identification of sectors and entities that may be compatible with a product for intelligent lead generation and building targeted commercial messages to well-defined social sectors. Financial Markets Research - Extraction of all relevant information from a large quantity of financial data (research reports, and public filings) and generation of a structured representation of sentiment towards attributes of identified entities, and then applying the data to discover trends and correlations between financial entities. Confidential – Intuview proprietary February 17, 2019February 17, 2019
18
Thank you
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.