A semantic based methodology to classify and protect sensitive data in medical records Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano, Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II.


A semantic based methodology to classify and protect sensitive data in medical records Flora Amato, Valentina Casola, Antonino Mazzeo, Sara Romano Dipartimento di Informatica e Sistemistica, Università degli Studi di Napoli Federico II, Naples, Italy

Rationale
– Introduction to challenges in e-health;
– Motivation and open challenges;
– Proposal of access control policies;
– A methodology to extract the relevant information to protect and to apply the proper security policy;
– A case study;
– Conclusions and future work.

The Electronic Health
E-Health challenges:
– to provide value-added services to the healthcare actors (patients, doctors, etc.);
– to enhance the efficiency and reduce the costs of complex information systems.
The term e-health encloses many meanings; we focus on those aspects of telemedicine that involve not only technological aspects but also procedural ones. In particular, we are witnessing a gradual adoption of innovative IT solutions for e-health but, at present, the major open issue is the coexistence of two different domains:

The coexistence of old and new systems from a security point of view:
1) modern e-health systems are designed to enforce fine-grained access control policies, and their medical records are a priori well structured so that the different fields can be properly managed; but...
2) e-health is also applied in contexts where new information systems have not been developed yet and "documental systems" are, in some way, introduced. This means that today documental systems give users the possibility to access a digitized version of a medical record without having previously classified its critical parts.

Unstructured medical record data and actors
Actors are not aware that structuring data is important for data elaboration and protection.
Security problem: private data (the critical parts) can be accessed by unauthorized actors, and it is not possible to enforce fine-grained access control on digitized unstructured documents.
Solution: extract the relevant information from the records and enforce access control policies on it.

Motivation and our proposal
The problem: "documental systems" allow access to the digitized version of a medical record (unstructured data) without having previously classified its critical parts.
We propose a semantic-based method to locate the resource being accessed and to associate the proper security rule to apply. The access control model is still based on fine-grained data classification.

Semantic method for resource classification
Knowledge is extracted by means of several text analysis methodologies, applied in steps and illustrated on a running example.

Step 1 - Text Preprocessing: Tokenization and Normalization
Goal: extraction of the relevant units of lexical elements.
Text tokenization: segmentation of a sentence into minimal units of analysis (tokens):
– disambiguation of punctuation marks, aiming at token separation;
– separation of continuous strings (i.e., strings that are not separated by blank spaces) to be considered as independent tokens: for example, the Italian string "c'era" contains two independent tokens (c' + era).
This segmentation can be performed by special tools, called tokenizers, which include glossaries of well-known expressions to be regarded as medical-domain tokens and mini-grammars containing heuristic rules that regulate token combinations.
Text normalization: variations of the same lexical expression should be reported in a unique way, e.g.: (i) words that assume a different meaning depending on whether they are written in lower or upper case; (ii) acronyms and abbreviations ("USA" vs. "U.S.A.").
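The two preprocessing operations can be sketched as follows. This is a minimal illustration, not the tokenizer used in the paper: the regular expressions stand in for the mini-grammar, and the abbreviation table stands in for the glossary.

```python
import re

# Stand-in glossary mapping variant spellings to one canonical form
ABBREVIATIONS = {"u.s.a.": "USA", "usa": "USA"}

def tokenize(sentence):
    """Split a sentence into minimal units of analysis (tokens),
    separating punctuation and elided forms such as "c'era" -> c' + era."""
    # Split elided forms at the apostrophe, keeping it with the first part
    sentence = re.sub(r"(\w')(\w)", r"\1 \2", sentence)
    # Separate punctuation marks from adjacent words
    return re.findall(r"\w+'?|[^\w\s]", sentence)

def normalize(tokens):
    """Report variants of the same lexical expression in a unique way."""
    return [ABBREVIATIONS.get(t.lower(), t) for t in tokens]
```

A real tokenizer for the medical domain would extend the glossary with multi-word domain expressions and add heuristic rules for dates, dosages, and codes.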

Step 2 - Morpho-syntactic analysis: POS tagging and Lemmatization
Goal: extraction of word categories.
Part-of-speech (POS) tagging: assignment of a grammatical category (noun, verb, etc.) to each lexical unit; word-category disambiguation: the vocabulary of the documents of interest is compared with an external lexical resource; Key-Word In Context (KWIC) analysis.
Lemmatization: reduction of the inflected forms to their respective lemma.
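A lexicon-lookup sketch of this step, with a toy lexicon in place of the external lexical resource the slide mentions (the entries and the NOUN fallback are illustrative assumptions):

```python
# Toy lexical resource: surface form -> (POS tag, lemma)
LEXICON = {
    "fractures": ("NOUN", "fracture"),
    "reported": ("VERB", "report"),
    "patient": ("NOUN", "patient"),
}

def pos_tag_and_lemmatize(tokens):
    """Assign each token a grammatical category and reduce it to its lemma
    by comparing it with the lexical resource; unknown tokens fall back
    to NOUN with the lowercased surface form as lemma."""
    return [(t,) + LEXICON.get(t.lower(), ("NOUN", t.lower())) for t in tokens]
```

In practice the lookup is combined with contextual disambiguation (e.g. KWIC analysis) when a form belongs to more than one category.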

Step 3 - Relevant Terms Recognition
Goal: identification of the terms useful to characterize the sections of interest.
TF-IDF (Term Frequency - Inverse Document Frequency): relevant lexical items are frequent and concentrated in few documents.
w_{t,d} = tf_{t,d} · log(N / df_t)
– term frequency (tf_{t,d}): the number of times term t occurs in resource d;
– inverse document frequency (idf): concerns the term distribution over all the sections (N in total) of the medical records; it relies on the principle that the importance of a term is inversely proportional to the number of documents df_t in the corpus in which it occurs.
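The weighting above can be computed directly from the definitions (a minimal sketch; sections of the medical records are represented simply as token lists):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term t in each document d with w = tf * log(N / df),
    where N is the corpus size and df the number of documents containing t."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each document counts a term at most once
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]
```

Terms occurring in every section get weight 0, while terms frequent in one section and rare elsewhere score highest, matching the principle stated on the slide.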

Step 4 - Identification of Concepts of Interest
Goal: cluster the relevant terms into synsets (sets of semantically equivalent terms) in order to associate each cluster with a semantic concept.
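A sketch of the concept-association step. The synsets below are hypothetical examples, not the paper's actual resource; in practice they would come from a domain ontology or a WordNet-like lexicon:

```python
# Hypothetical synsets: each concept of interest groups equivalent terms
SYNSETS = {
    "diagnosis": {"diagnosis", "finding", "assessment"},
    "therapy": {"therapy", "treatment", "medication"},
}

def concepts_for(terms):
    """Associate relevant terms with the semantic concepts
    whose synset contains them."""
    return {c for t in terms for c, syn in SYNSETS.items() if t.lower() in syn}
```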

Security Policies
At the end of the semantic analysis process, a medical record can be seen as composed of several sections (resources) that can be properly protected.
A security policy is a set of rules structured as ACL triples ⟨s_j, a_i, r_k⟩ where:
– s_j ∈ S = {s_1 … s_m}, the set of actors;
– a_i ∈ A = {a_1 … a_h}, the set of actions;
– r_k ∈ R = {r_1 … r_n}, the set of resources.
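The ACL structure can be represented directly as a set of triples (actor, action, resource). The actor and resource names below are illustrative, not taken from the paper's use case:

```python
# Policy as a set of ACL triples <actor, action, resource>
POLICY = {
    ("doctor", "read", "diagnosis"),
    ("doctor", "write", "diagnosis"),
    ("nurse", "read", "therapy"),
}

def is_allowed(actor, action, resource, policy=POLICY):
    """An access is granted only if its triple appears in the policy."""
    return (actor, action, resource) in policy
```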

Medical Record Policy (Use Case)
[Figure: policy table relating actors, resources, and actions]

Action-actors identification
Given the policy and a resource r* ∈ R, it is easy to locate the set of all rules that apply to it:
L_{r*} = { ⟨s_j, a_i, r*⟩ | r* ∈ R, a_i ∈ A* ⊆ A, s_j ∈ S* ⊆ S }
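With the triple representation, collecting L_{r*} is a simple filter over the policy. The sample policy is again illustrative:

```python
# Illustrative policy: ACL triples <actor, action, resource>
POLICY = {
    ("doctor", "read", "diagnosis"),
    ("doctor", "write", "diagnosis"),
    ("nurse", "read", "therapy"),
}

def rules_for(resource, policy=POLICY):
    """Compute L_{r*}: the subset of policy triples whose resource is r*."""
    return {rule for rule in policy if rule[2] == resource}
```

Once the semantic classification has labeled a record section with a concept (e.g. "diagnosis"), this lookup yields exactly the actors and actions allowed on it.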

System behavior: an example

Conclusions and Future work
We have proposed a semantic approach for classifying document parts (resources) from a security point of view; it is useful to associate a set of security rules with the resources. It is a promising method that can strongly help in facing the security issues that arise once data are made available for new potential applications.
Future work:
– to prove the methodology in other e-government fields;
– to implement a system to extract, classify, and enforce fine-grained policies on-line with acceptable performance.