Large-Scale Record Linkage Support for Cloud Computing Platforms Yuan Xue, Bradley Malin, Elizabeth Durham EECS Department, Biomedical Informatics Department,


Large-Scale Record Linkage Support for Cloud Computing Platforms
Yuan Xue, Bradley Malin, Elizabeth Durham
EECS Department, Biomedical Informatics Department, Vanderbilt University

Background

Record linkage is the process of comparing records from multiple sources to aggregate information about the same real-world entity. It has many applications across industry sectors and government agencies.

Healthcare: It is difficult to track a patient's record across healthcare providers because healthcare information systems are severely fragmented. This can hinder primary care and limit biomedical research. Record linkage systems can mitigate the negative effects of fragmentation.

Counter-terrorism: Record linkage is applied to records held by multiple data owners to detect aliases, or to combine information about an individual in order to learn about their actions or co-conspirators.

Cloud computing platforms can enable cost-efficient, high-performance, large-scale record linkage. Cloud platforms provide massive distributed computing resources. Record linkage tasks are usually performed infrequently, so the ability to pay for computing resources on a short-term basis as needed, and to release them after use, can achieve great cost efficiency.

Research Objective

Build an end-to-end solution that enables record linkage as a core service on cloud computing platforms and supports a transition toward cloud-based data management for distributed services.

1) Flexible usage: for both novice users (service users) and expert users (service component developers using high-level programming primitives).
2) Cost and performance awareness: users can estimate the time, the expense, and the linkage quality, and thus choose the appropriate linkage method and configuration.

Such a service would facilitate information exchange in many domains, most notably a national health information exchange network.
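The cost and performance awareness objective can be illustrated with a back-of-the-envelope estimate. The sketch below is not the authors' estimation model; it is a minimal illustration under stated assumptions: records spread evenly across blocks, a fixed comparison rate per worker, and per-started-hour billing. The function name `estimate_linkage` and every parameter value (comparison rate, worker count, price) are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class LinkageEstimate:
    comparisons: int  # record pairs that must be compared
    hours: float      # wall-clock time with all workers busy
    cost: float       # rental cost in currency units


def estimate_linkage(size_a, size_b, num_blocks=1,
                     pairs_per_second=1_000_000, workers=10,
                     price_per_worker_hour=0.10):
    """Rough time and cost estimate; all default values are hypothetical."""
    # With k evenly sized blocks, each block holds (size_a / k) * (size_b / k)
    # pairs, so the k blocks together hold about size_a * size_b / k pairs.
    comparisons = (size_a * size_b) // num_blocks
    hours = comparisons / (pairs_per_second * workers) / 3600
    # Assumption: each worker is billed per started hour.
    cost = math.ceil(hours) * workers * price_per_worker_hour
    return LinkageEstimate(comparisons, hours, cost)


# Two record sets of one million records each, without and with blocking.
full = estimate_linkage(1_000_000, 1_000_000)
blocked = estimate_linkage(1_000_000, 1_000_000, num_blocks=1000)
print(full)
print(blocked)
```

The sketch covers only time and expense. Linkage quality, the third quantity named in the objective, is not modeled here: blocking reduces the number of comparisons, but it can also separate records that truly match.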
[Figure: record linkage workflow. Labels: record set A, record set B, data preparation, encoded record set A, encoded record set B, Cloud, blocking, record pair comparison, field comparison, record pair classification, matched pairs, non-matched pairs.]

Challenge

Massive amounts of data are now being collected. In 2007, for instance, it was estimated that over 281 exabytes of new data were generated, and the quantity of data is growing at an exponential rate.

Real-world data is dirty. Sophisticated linkage techniques are needed to deal with noise and semantic errors. These techniques require expensive, detailed comparisons of fields (or attributes) between pairs of records, which form a performance bottleneck.

Record linkage over large-scale data sources is extremely time- and resource-intensive. The challenge worsens as the quantity of data and the number of sources grow.

Current Research Efforts

Privacy-preserving data encoding protects sensitive fields through a set of well-designed encoding functions. Besides protecting the confidentiality of the data, the encoding must still allow the encoded values to be compared for similarity later, so that records for the same entity can be linked. It is essential to develop a model that can evaluate these encoding schemes in terms of their linkage accuracy, computational complexity, and security.

Blocking (data partitioning) enables parallel execution of record linkage. It determines the groups of records that are most likely to match, retrieves these records, and creates partitions of the record set within which records are compared and linked independently of other partitions. An optimal blocking model can be used to determine the data partitions for parallel execution of record linkage, with linkage quality, execution time, and resource requirements as the optimization objectives.
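The encoding, blocking, comparison, and classification steps described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' system. It assumes a Bloom-filter-style encoding of character bigrams, which is one scheme commonly used for privacy-preserving comparison but is not named in the poster. It also assumes a keyed hash of the birth year as the blocking key and a Dice-coefficient threshold of 0.8 for classification. All function names, field names, the shared key, and the sample records are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import product

SECRET = "shared-secret"  # hypothetical key agreed on by the data owners


def bigrams(text):
    """Return the set of character bigrams of a padded, lowercased string."""
    padded = f"_{text.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode_field(text, num_bits=256, num_hashes=4):
    """Map each bigram to a few bit positions using keyed hashes."""
    bits = set()
    for gram in bigrams(text):
        for seed in range(num_hashes):
            message = f"{SECRET}|{seed}|{gram}".encode()
            bits.add(int(hashlib.sha256(message).hexdigest(), 16) % num_bits)
    return frozenset(bits)


def prepare(record):
    """Data preparation: keep an id, a hashed block key, and an encoded name."""
    message = f"{SECRET}|{record['birth_year']}".encode()
    block = hashlib.sha256(message).hexdigest()
    return {"id": record["id"], "block": block,
            "name": encode_field(record["name"])}


def dice(a, b):
    """Dice coefficient of two bit-position sets (1.0 means identical)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def link(encoded_a, encoded_b, threshold=0.8):
    """Compare records only within the same block, then classify each pair."""
    blocks = defaultdict(lambda: ([], []))
    for rec in encoded_a:
        blocks[rec["block"]][0].append(rec)
    for rec in encoded_b:
        blocks[rec["block"]][1].append(rec)

    matched, non_matched = [], []
    for left, right in blocks.values():
        for ra, rb in product(left, right):
            score = dice(ra["name"], rb["name"])
            target = matched if score >= threshold else non_matched
            target.append((ra["id"], rb["id"], score))
    return matched, non_matched


set_a = [prepare(r) for r in [
    {"id": "a1", "name": "Jonathan Smith", "birth_year": 1980},
    {"id": "a2", "name": "Maria Garcia", "birth_year": 1975},
]]
set_b = [prepare(r) for r in [
    {"id": "b1", "name": "Jonathan Smith", "birth_year": 1980},
    {"id": "b2", "name": "Wei Chen", "birth_year": 1980},
    {"id": "b3", "name": "Maria Garcia", "birth_year": 1990},
]]
matched, non_matched = link(set_a, set_b)
```

Only two pairs are compared here instead of six, because records in different blocks never meet. The sample data also shows the trade-off: `a2` and `b3` carry the same name but differ in birth year, so blocking keeps them apart and the pair is never scored. Each block is independent of the others, which is what makes the comparison step straightforward to distribute across workers.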
Record linkage is a multi-step process.

Future Research Plan

1) Develop a high-level parallel programming model and run-time support tailored to the semantics of privacy-preserving record linkage, exploring the multi-level parallelism in record linkage.
2) Develop estimation and optimization techniques that enable user-aware, cost-optimal record linkage in this multi-dimensional space.
3) Develop a security analysis framework and new cryptographic models and methods for privacy-preserving record linkage in the cloud, and analyze the security properties of record linkage in the cloud by considering a wide variety of threats.

