Indexing and Binning Large Databases

Name: Indexing and Binning Large Databases
Uploaded: 2017-10-14T16:13:48+00:00
Duration: PTM23S25
Description: Indexing and Binning Large Databases

Abstract
Problems with large databases
  Biometric identification (1:N matching) does not scale well with database size.
  There is no established way to organize high-dimensional biometric data.
Proposed solution
  Reduce the search space before 1:N matching.
  Divide the database using clustering techniques.
Contributions
  We analyze the effect of a binning scheme on search performance and accuracy.
  We present binning and pruning approaches using multiple biometrics.
  Using hand geometry and signature, we achieve a 95% reduction in search space at an FRR of 0%.

Background
Only biometric identification (1:N matching) can prevent duplicate enrollments and double dipping.
Biometrics are being deployed for immigration and national ID applications:
  US-VISIT program
  Voter ID and national ID programs [3]
  Potential database sizes that can run into the millions
Current research is focused only on accuracy. At this scale, scalability, speed, and efficiency also become important.
Conventional security systems rely on PINs, passwords, and other token- or key-based methods to authenticate and identify users. Although these systems are easy to use, they are insecure because the tokens can be lost, stolen, or used by more than one person. With each service requiring a different form and means of identification, the multiplicity of authentication schemes becomes difficult to manage.

Challenges
Textual/numeric data
  Data is scalar (1-D).
  Textual/numeric data can be linearly ordered and is therefore easily indexed.
Biometric data
  Biometric templates are high dimensional.
  No linear ordering or sorting method exists for biometric data.

Search space analysis
As the number of stored templates increases, the template density (TD) also increases.
Definitions
  Nc = number of clusters.
  Dic = average intra-cluster distance. For every cluster, compute the distance of each point from the mean of the cluster it belongs to, then average all of these distances. The average is taken over the number of users.
  Template density: TD = Nc / Dic. It quantifies the tightness of each cluster, that is, the closeness of samples within the same cluster.
[Figure: two colored scatter plots; the right plot adds extra elements to the plot on the left.]
The plots show that as the number of templates increases, the probability of finding two templates very close to each other also increases.
Why TD increases: as the number of templates grows, the optimal number of clusters needed to cluster them grows, so Nc rises. With more clusters, the templates within each cluster lie very close to each other, so the average intra-cluster distance Dic falls. Because the numerator Nc increases and the denominator Dic decreases, TD = Nc / Dic increases.
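The template density definition above can be sketched in a few lines. This is only an illustration: it assumes NumPy, Euclidean distance, one template per user, and cluster labels that are already known. The function name template_density and the toy points are hypothetical.

```python
import numpy as np

def template_density(templates, labels):
    """Return (Nc, Dic, TD) for templates that are already assigned to clusters."""
    templates = np.asarray(templates, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    n_clusters = len(cluster_ids)            # Nc
    total_distance = 0.0
    for c in cluster_ids:
        members = templates[labels == c]
        mean = members.mean(axis=0)          # cluster mean
        # Euclidean distance of every member from its own cluster mean
        total_distance += np.linalg.norm(members - mean, axis=1).sum()
    d_ic = total_distance / len(templates)   # averaged over all users
    return n_clusters, d_ic, n_clusters / d_ic

# Toy data (assumption): four 2-D templates in two clusters.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]])
nc, dic, td = template_density(points, [0, 0, 1, 1])
print(nc, dic, td)
```

Splitting the same points into more, tighter clusters raises Nc and lowers Dic, which is the effect the slide describes.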

Identification problem
The number of false positives grows geometrically with the size of the database.
Let FAR and FRR be the False Acceptance Rate and False Reject Rate (both probabilities) for 1:1 matching. For 1:N matching, the false accept rate is built up as follows:
  FAR = probability of a false accept in one 1:1 comparison
  (1 - FAR) = probability of no false accept in one 1:1 comparison
  (1 - FAR)^N = probability of no false accept in N 1:1 comparisons
  1 - (1 - FAR)^N = probability of at least one false accept in N 1:1 comparisons
  $FAR_N = 1 - (1 - FAR)^N$
For small FAR this is approximately N * FAR.
A false reject happens when the template belonging to the same person is not recognized as belonging to that person. Since the database holds only one corresponding template per user, a false reject occurs exactly when that one template is missed. This is the same event in verification and in identification, so
  $FRR_N = FRR$
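A small numeric sketch of the 1:N error rates may help here. The function names are hypothetical, and the 1:1 FAR of 1e-6 is an illustrative value chosen for the example, not a figure from the slides.

```python
def identification_far(far, n):
    """Probability of at least one false accept in n independent 1:1 comparisons."""
    return 1.0 - (1.0 - far) ** n

def identification_frr(frr):
    """With one enrolled template per user, the 1:N FRR equals the 1:1 FRR."""
    return frr

# Illustrative 1:1 FAR (assumption), evaluated for growing database sizes.
for n in (1_000, 100_000, 1_000_000):
    print(n, identification_far(1e-6, n))
```

The printed values show how quickly the identification FAR approaches 1 as N grows, even when the 1:1 FAR is very small.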

State of the Art
Biometric | State of the art | Research problems
Fingerprint | 0.15% FRR at 1% FAR (FVC 2002) | Fingerprint enhancement; partial fingerprint matching
Face Recognition | 10% FRR at 1% FAR (FRVT 2002) | Improving accuracy; face alignment variation; handling lighting variations
Hand Geometry | 4% FRR at 0% FAR (Transport Security Administration Tests); 2.6% FRR at 0.02% FAR (CUBS, SUNY-Buffalo) | Developing reliable models; identification problem
Signature Verification | 1.5% (IBM Israel) | Developing offline verification systems; handling skillful forgeries
Voice Verification | <1% FRR (Current Research) | Handling channel normalization; user habituation; text and language independence

Identification problem (contd.)
Even if FAR = %, false accepts = 1 in 10 for N = 100000 (lower bound) in the identification case.
No single biometric is capable of meeting this security requirement individually.
Ways to reduce identification errors:
  Reduce FAR
    FAR is limited by the feature representation and the recognition algorithm.
    It cannot be reduced indefinitely.
  Reduce N
    Classify or index the biometric database (e.g., the Henry classification system for fingerprints).
    Index the records based on metadata.
Can we do better?

Fingerprint Features
Fingerprints can be classified based on the ridge flow pattern.
Fingerprints can be distinguished based on the ridge characteristics.
65% of fingerprints belong to the Loop class.
The corrugated surface of the fingerprint is made up of ridges and valleys that cover the entire palmar surface of the hand. The flow pattern of these ridges and valleys is unique to each individual, and these patterns are used for identification and authentication. The image below shows a fingerprint along with the distinguishing features on the print.
The flow of the ridges has been classified into 5 broad classes. This classification is used to catalog fingerprints and also in authenticating two prints. Henry systems follow an elaborate classification scheme for cataloging and filing forensic prints. Fig. 1 shows the different classes of ridge flow. This method cannot be used to distinguish between two fingerprints.
Fig. 2 shows the distinguishing features on the fingerprint. These features are discontinuities or anomalies in the normal flow of ridges on the surface of the finger and are termed minutiae (small details). There are eighteen different types of minutiae. Fig. 1 shows the most commonly encountered ones and their names.
Fig. 2 shows a thumbprint captured on paper. This is typically the kind of image that forensic AFIS (Automatic Fingerprint Identification Systems) have to deal with. The quality of the print deteriorates not only during capture but also during storage, and hence AFIS systems are more sophisticated than their biometric counterparts.

Henry Classification of Fingerprints
[Ratha et al., 1996] used Henry classification on a database of 1800 templates, tested on 100 templates. Search space: 25%; FRR: 10%.
[Jain, Pankanti, 2000] ran a similar experiment on a database of 700 templates and achieved FRR: 7.4% (focus on classification only).
The state-of-the-art fingerprint classification system [Cappelli, Maio, Maltoni, Nanni, 2003] has an FRR of 4.8% for the 5-class problem and 3.7% for the 4-class problem.
Even though natural classes exist, classification is still non-trivial.
Natural classes do not exist for biometrics such as hand geometry.
More sophistication is needed for partitioning the database.

Analysis of search space reduction
We can improve performance by reducing the search space during identification.
Let PSYS be the penetration rate, a value between 0.0 and 1.0. The penetration rate is the average fraction of the database searched during identification.
Effective size = N * PSYS
For 1:N matching, replacing N with the effective size in the identification false accept rate gives
  $FAR_N = 1 - (1 - FAR)^{N \cdot P_{SYS}}$
State-of-the-art fingerprint systems have PSYS = 0.5. This is the PSYS obtained at 0% FRR; the fingerprint system on the previous slide achieved a PSYS of 0.25, but at a high FRR.
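The effect of the penetration rate on the identification FAR can be sketched as follows. It assumes the false accept probability is obtained by substituting the effective size N * PSYS for N in 1 - (1 - FAR)^N; the function names and the input values are illustrative only.

```python
def effective_size(n, p_sys):
    """Average number of records actually searched per query."""
    return n * p_sys

def binned_identification_far(far, n, p_sys):
    """1:N false accept probability when only a fraction p_sys of the database is searched."""
    return 1.0 - (1.0 - far) ** effective_size(n, p_sys)

# Illustrative values (assumptions): 1:1 FAR of 1e-6, database of 100,000 records.
for p_sys in (1.0, 0.5, 0.25, 0.05):
    print(p_sys, effective_size(100_000, p_sys), binned_identification_far(1e-6, 100_000, p_sys))
```

With p_sys = 1.0 the function reduces to the unbinned identification FAR, which is a quick sanity check on the substitution.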

Effect of binning on accuracy
For PSYS < 0.2, the number of false accepts is almost constant.
Query response time scales with PSYS, since only N * PSYS records are searched.
Capabilities of a low-FAR system
  Will allow us to screen immigrants at airports.
  Will make biometric systems more user-friendly by eliminating the need to remember PINs and IDs.

Binning
Binning can be used to achieve a smaller PSYS.
  Partition the feature space.
  Each bin is represented by a cluster center CK.
  Records are compared with only NB cluster centers.
  Bin representatives are computed offline during training.
Challenges
  How to handle clustering of large databases?
  How to handle additions and deletions?

Tradeoff
Although binning reduces the search space, it introduces another source of identification error: the bin miss.
If the bin in which the user's record exists is not searched, a false reject is generated no matter how good the matcher is.
Let P(B) be the probability of searching the correct bin. Then
  (1 - P(B)) = probability of missing the bin
  P(B) * FRR = probability of a false reject even after reaching the correct bin
  $FRR_{N \cdot P_{SYS}} = (1 - P(B)) + P(B) \cdot FRR$
Binning therefore increases the probability of false rejects, which is not tolerable in security and screening applications.
Solution
  Use K-means clustering to find K bins.
  Check the Ns nearest bins for the record, such that P(B) = 1.
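The K-means binning and Ns-nearest-bin search described above could look roughly like the sketch below. It is not the authors' implementation: it assumes NumPy, Euclidean distance, a plain Lloyd-style K-means, and synthetic Gaussian data in place of real templates. All function names are hypothetical.

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd-style K-means; returns (centers, label of each row)."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)

def nearest_bins(query, centers, n_s):
    """Indices of the n_s bins whose centers are closest to the query."""
    return np.argsort(np.linalg.norm(centers - query, axis=1))[:n_s]

def evaluate(queries, true_ids, centers, labels, n_s):
    """Return (penetration rate, fraction of queries whose true bin was searched)."""
    searched, hits = 0, 0
    for q, uid in zip(queries, true_ids):
        candidates = np.isin(labels, nearest_bins(q, centers, n_s))
        searched += candidates.sum()
        hits += bool(candidates[uid])
    return searched / (len(queries) * len(labels)), hits / len(queries)

# Synthetic stand-in data (assumption): 200 enrolled templates, noisy re-captures as queries.
rng = np.random.default_rng(1)
enrolled = rng.normal(size=(200, 5))
queries = enrolled + rng.normal(scale=0.05, size=enrolled.shape)
ids = np.arange(len(enrolled))
k = 6
centers, labels = kmeans(enrolled, k)
for n_s in range(1, k + 1):
    pen, p_b = evaluate(queries, ids, centers, labels, n_s)
    print(f"Ns={n_s}: penetration={pen:.2f}, P(B)={p_b:.2f}")
```

Increasing Ns trades penetration for bin-hit probability: the smallest Ns at which P(B) reaches 1 on held-out data is the operating point the slide asks for.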

Formal definition of Binning
In general, a biometric template may be represented as a vector.
The vectors are partitioned into N distinct clusters, each represented by a "code book vector".
The code book vectors divide the feature space into N distinct Voronoi regions.
Every template is closest to the mean (code book vector) of the region it belongs to.
The regions are non-overlapping: their intersection is the null set.
The union of the regions forms the entire data space.
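The properties listed above can be written out as follows. This is a reconstruction from the bullet points, not the slide's own formulas: it assumes templates are vectors in R^d, uses the Euclidean norm, writes c_i for the i-th code book vector, and uses N for the number of clusters as on this slide.

```latex
\[
  R_i = \{\, x \in \mathbb{R}^d : \lVert x - c_i \rVert \le \lVert x - c_j \rVert
        \ \text{for all } j \ne i \,\}, \qquad i = 1, \dots, N
\]
\[
  R_i \cap R_j = \emptyset \quad (i \ne j), \qquad
  \bigcup_{i=1}^{N} R_i = \mathbb{R}^d
\]
```

For the intersection to be strictly empty, points that are equidistant from two code book vectors have to be assigned to exactly one region, for example the one with the smaller index.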

Search Space Partition: Voronoi Regions

Hand Geometry Template
Feature extraction stages
  Image capture
  Binarization
  Contour extraction
  Noise removal
35 features are extracted
  25 directly measured features
  10 ratio and perimeter features

Signature Template
11 features extracted
  Regression constants b0, b1
  Compactness
  Signature length
  Major stroke length
  Major stroke angle
  Connected components
  Hole count
  Hole area
  Stroke count
  Signing time

Results
11-dimensional signature data
  Best penetration: 35.57% for 6 bins
  FRR = 0%
35-dimensional hand geometry data
  Best penetration: 35.8% for 6 bins
  FRR = 0%
Dataset: 250 in the training set and 250 in the testing set

Multi-modal approach
Resulting bins have very high template densities.
A different biometric modality should be used to classify templates within a bin.
Multimodal biometrics
  Using multiple biometrics improves accuracy.
  It is difficult to forge multiple biometrics.
  Composite templates reduce template density.
Statistical independence ensures that individual binning results are diverse.
The search space (the intersection of the bins) is reduced because the individual binning results have little in common.
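The intersection idea can be illustrated with a toy example. The bin assignments below are invented for illustration; they are not the experimental data, and the helper name candidate_set is hypothetical.

```python
def candidate_set(bin_of, searched_bins):
    """User ids whose template falls in one of the searched bins."""
    return {uid for uid, b in bin_of.items() if b in searched_bins}

# Toy bin assignments for 8 enrolled users (assumption): one binning per modality.
hand_bin = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
sig_bin = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1}

hand_candidates = candidate_set(hand_bin, {0})   # bins searched for the hand query
sig_candidates = candidate_set(sig_bin, {1})     # bins searched for the signature query
combined = hand_candidates & sig_candidates      # only users present in both results

penetration = len(combined) / len(hand_bin)
print(sorted(combined), penetration)
```

When the two binnings are close to independent, the combined penetration is roughly the product of the individual penetration rates, which is why the intersection is much smaller than either candidate set.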

Multi-Modal Approach

Multi-Modal Approach
Search space: 5% of the original database size; FRR: 0%

Results of Combination
Best combined penetration rate: 5%
Dataset: 250 in the training set and 250 in the testing set

Binning v/s Indexing
Applications can have frequent insertions of new templates.
Binning works well when the database is static; insertions require re-partitioning the entire database.
Indexing can be used in both static and dynamic database scenarios.
  Trees are commonly used for indexing.
  The idea is to extend the concept of indexing relational databases to indexing biometric databases.
  This is much more challenging: no concept of a primary key exists for biometric templates.

Pyramid Technique (spatial hashing)
Determine the pyramid (i) within which the template lies.
Determine the height (h) of the template from the apex.
The 1-D value = pyramid number (i) + height (h).
Indexing is done using B+ trees.
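The mapping from a template to its 1-D value can be sketched as below. The sketch assumes every feature has been normalized to [0, 1], so the apex of all pyramids is the center point (0.5, ..., 0.5), and it uses a sorted Python list with bisect as a stand-in for a B+ tree. The class and function names are hypothetical.

```python
import bisect

def pyramid_value(point):
    """Map a d-dimensional point in [0, 1]^d to the 1-D value i + h."""
    d = len(point)
    # The dimension that deviates most from the center decides the pyramid.
    j_max = max(range(d), key=lambda j: abs(0.5 - point[j]))
    # Pyramids 0..d-1 lie below the center in that dimension, d..2d-1 above it.
    i = j_max if point[j_max] < 0.5 else j_max + d
    h = abs(0.5 - point[j_max])   # height measured from the apex (the center)
    return i + h

class PyramidIndex:
    """Keeps templates ordered by pyramid value (sorted list instead of a B+ tree)."""
    def __init__(self):
        self._keys = []
        self._ids = []

    def insert(self, template_id, point):
        key = pyramid_value(point)
        pos = bisect.bisect_left(self._keys, key)
        self._keys.insert(pos, key)
        self._ids.insert(pos, template_id)

    def range_by_key(self, lo, hi):
        """Ids of templates whose 1-D value lies in [lo, hi]."""
        a = bisect.bisect_left(self._keys, lo)
        b = bisect.bisect_right(self._keys, hi)
        return self._ids[a:b]

index = PyramidIndex()
index.insert("a", (0.1, 0.6))
index.insert("b", (0.5, 0.9))
print(index.range_by_key(3.0, 3.5))
```

Because h never exceeds 0.5, the values of pyramid i all fall in [i, i + 0.5], so a single ordered structure can hold every pyramid without collisions between them. Translating a similarity query into per-pyramid key ranges is not shown here.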

Various Indexing Techniques
Techniques: Grid Files, KD Tree, R Tree, R* Tree, R+ Tree, X Tree, Pyramid Technique
Limitations
  All structures: insertion-order dependent
  Grid File: number of cells increases super-linearly with the size of the data; not effective for range querying
  KD Tree: insertion-order dependent; not dynamic, so useful only if the database is non-changing; range querying is difficult
  R Tree: overlapping intermediate nodes
  R* Tree: overlapping intermediate nodes
  R+ Tree: longer trees, poor space utilization
  X Tree: insertion-order dependent

Comparative Study
[Table: methods (Grid File, R Tree, R* Tree, R+ Tree, KD Tree, X Tree, Pyramid Technique) compared on Scalable, Order Invariant, Dynamic, Range Query, and No Overlap; the Y/N entries are not preserved in the transcript]

Results of Indexing
35-dimensional hand geometry data
  Best penetration: 27%
  FRR = 0%
Dataset: 450 in the training set and 450 in the testing set
Parallel combination with signature will further reduce the search space.

Multimodal Biometrics

2D Biometric: Signature & Fingerprint Fusion
[Figure: scatter plot of impostor score pairs and true match score pairs]

Optimal Fusion Algorithm: Signature Fused With Fingerprint
[Figure: optimal fusion ROC; x-axis: False Accept Rate (FAR); y-axis: Fusion Algorithm Accuracy (1-FRR); regions: Unrealizable Performance Area, Suboptimal Performance Area; labels: True Match Score Pairs, Impostor Score Pairs]
The ROC is the boundary between the unrealizable performance area and the suboptimal performance area.

Optimal Fusion Algorithm Decision Regions
99.04% at a specified FAR of 1 in a million
[Figure: Match Zone and No-Match Zone; axes: 1st Biometric Score, 2nd Biometric Score]
The decision region boundary is irregular because of the finite sample size; the more data, the smoother the boundaries.

RSS Fusion
RSS Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: RSS fusion ROC plotted against the optimal ROC; x-axis: False Accept Rate (FAR); y-axis: RSS Fusion Accuracy (1-FRR); labels: True Match Score Pairs, Impostor Score Pairs]

RSS Fusion Decision Regions
96.11% at a specified FAR of 1 in a million
[Figure: Match Zone and No-Match Zone; axes: 1st Biometric Score, 2nd Biometric Score]

OR Fusion
OR Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: OR fusion ROC plotted against the optimal ROC; x-axis: False Accept Rate (FAR); y-axis: OR Fusion Accuracy (1-FRR); labels: True Match Score Pairs, Impostor Score Pairs]

OR Fusion Decision Regions
96.85% at a specified FAR of 1 in a million
[Figure: Match Zone and No-Match Zone; axes: 1st Biometric Score, 2nd Biometric Score]

AND Fusion
AND Fusion Algorithm for Fingerprint & Signature Provides a Suboptimal Performance ROC
[Figure: AND fusion ROC plotted against the optimal ROC; x-axis: False Accept Rate (FAR); y-axis: AND Fusion Accuracy (1-FRR); labels: True Match Score Pairs, Impostor Score Pairs]

AND Fusion Decision Regions
62.91% at a specified FAR of 1 in a million
[Figure: Match Zone and No-Match Zone; axes: 1st Biometric Score, 2nd Biometric Score]
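The three fixed fusion rules compared on these slides can be sketched as simple decision functions over a pair of match scores. Several things here are assumptions rather than details from the slides: scores are treated as similarities (higher means a better match), RSS is read as root-sum-square of the two scores, and all thresholds are hypothetical values that would in practice be tuned to the specified FAR.

```python
import math

def and_fusion(s1, s2, t1, t2):
    """Match only if both biometrics exceed their thresholds."""
    return s1 >= t1 and s2 >= t2

def or_fusion(s1, s2, t1, t2):
    """Match if either biometric exceeds its threshold."""
    return s1 >= t1 or s2 >= t2

def rss_fusion(s1, s2, t):
    """Match if the root-sum-square of the two scores exceeds one threshold."""
    return math.hypot(s1, s2) >= t

# Hypothetical score pair: strong fingerprint score, weak signature score.
s_fp, s_sig = 0.9, 0.2
print(and_fusion(s_fp, s_sig, 0.5, 0.5))
print(or_fusion(s_fp, s_sig, 0.5, 0.5))
print(rss_fusion(s_fp, s_sig, 0.8))
```

Each rule carves a fixed shape out of the score plane (a quadrant for AND, an L-shaped region for OR, the outside of a circular arc for RSS), which is one way to read why they fall short of the data-driven optimal decision regions.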

Thank You
