*I was hoping for the cross apply to only fuzzy against items ***close** to the length of the parameter in @objectmodel.

To measure proximity use ABS(a-b). If you want strings of similar(close) lengths it would be:

```
ABS(LEN(string1) - LEN(string2))
```

If you wanted strings that, say, no more than 5 characters longer/shorter than the other, the WHERE clause would look like

```
WHERE ABS(LEN(string1) - LEN(string2)) <= 5
```

*Can I add a clause like so to get a more precise data set to fuzzy, or will this happen AFTER the cross apply already runs and only affect the final result.*

SQL Server Logic query Processing (https://images.app.goo.gl/qTQpdg2NsC7M4eUj9) dictates that the WHERE clause is evaluated second (after FROM). Scalar UDFs in WHERE clauses change this. If dbo.FuzzyControlMatch is a T-SQL scalar UDF and is used in the WHERE clause then it will be processed first and you will be forced to evaluate everything. That does not appear to be the case with what you posted though.

**How to improve performance here?**

What I would do for starters is compute the string length ahead of time. You could use a persisted computed column then add an index ON (stringlength, whatever that TOP 5 is to be ordered by). Then include the other columns used by that query for INCLUDE columns.

Alternatively you could pre-filter using a temp table or table variable and apply your function there instead.

Either way, I have a function and a couple algorithms that will change your life if you deal with string similarity often. I'll post those later tonight when I have more time.

**CONTINUED....**

I'm not familiar with your similarity function as you have not provided any DDL but I do know how to measure string similarity https://itnext.io/string-similarity-the-basic-know-your-algorithms-guide-3de3d7346227 using algorithms such as Levenshtein, Damerau-Levenshtein, The Longest Common Substring,
and the Longest Common Subsequence (LCSQ). Levenshtein can be solved in O(m*n) time. Damerau-Levenshtein in O(m*n*(Max(m,n)) time. The Longest Common Substring in O(n*m) or O(n)+O(n*k) with a generalized suffix tree. The Longest Common Subsequence between two strings is NP-Hard.

Levenshtein & Damerau-Levenshtein are Distance metrics https://en.wikipedia.org/wiki/String_metric and can measure similary like so: take two strings `S1`

and `S2`

, `S1`

always being shorter or equal to `S2`

; `L1`

as the length of S1, `L2`

the length of S2 and E as your Edit Distance... The formula for the similarity between (S1,S2) is: (L1-E)/L2. Consider the words, "Theirs" and "Thiers". Both strings are 6 characters long and the Levenshein Distance is 2. (6-2)/6 = .67; according to Levenshtein these guys are 67% similar. The Damerau-Levenshtein Distance between these strings is 1; (6-1)/6 = .83; per DLD these strings have a similarity score of 83%.

With the longest common substring or subsequence (length of both as `LS`

) the similarity `LS`

/`L2`

. E.g. The longest common substring between 'ABC123' and "ABC12.3" is "ABC12"; LS=5, L2=7, 5/7 = 71% similar. The longest Common Subsequence between the two is "ABC123". LS=6, SELECT 6./7 = 86% similarity.

Introducing the *Bernie Distance*; it's simple: When S1 is a Substring of S2 the Bernie Distance (BD) is L2-L1 and Similarity can be measured by L1/L2. For example: BD("xxx","yyy") returns NULL; BD("Kangaroo", "Kangarooo") = 1... L2=9, L1=8, L2-L1=1. Here, Bernie gives us a distance of 1 and a similarity score of SELECT .88 (88%). These two metrics are computed in O-nastyfast time - they're basically free. The Bernie Mertics will often be NULL; when that's the case you've lost nothing and gained nothing. When bernie *isn't* NULL, however, you have just accomplished something special.... *You have solved Levenshtein (LD), Damerau-Levenshtein (DLD), the Longest Common Substring & Subsequence (LCSS) and many, many more. * When Bernie (B) is NOT NULL then LD=B, DLD=B and LCSS=L1. *I would not be surprised if you could apply bernie to your similarity function* ;^) This is known as reduction:

Included at the end of this post is bernie8K (VARCHAR(8000)). In addition to the Bernie Distance and Similary you can use bernie to calculate the *Maximum Similarity* (MS). Ex: MS = L1/L2. The MS("ABCD","ABCXYZ") is 67%. In other words, when L1=4 and L2=6, the two strings cannot be any more than 67% (4/6=0.6666). Armed with this information you can create a *Minimum Similarity* parameter that allows you to drastically reduce the number of comparisons. Now the Demo.

**Problem:**

I once had a large client with 1000's of employees. The DB they inherited 100's of job duplicate job titles that had been entered by hand such as "Loan Officer" and "Loan Officr". Reports would report that they had 2005 Loan Officers and 16 Loan Officrs. In reality they had 2021 Loan Officers (16 with misspelled job titles.) The task was to identify (and de-dupe) these job titles. This example is a scaled down version of the problem. Note my comments.

```
-- Sample data.
DECLARE @jobs TABLE
(
JobId INT IDENTITY PRIMARY KEY,
JobCat VARCHAR(100) NOT NULL
);
INSERT @jobs(JobCat)
VALUES('Editing Department'),('Director'),('Producer'),('Actor'),
('Film Editing Department'),('Producers'),('Directer');
-- without any pre-filtering I would need to compare 21 pairs of strings "strings pairs"....
SELECT j1.JobCat, j2.JobCat
FROM @jobs AS j1
CROSS JOIN @jobs AS j2
CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS b
WHERE j1.JobId < j2.JobId;
```

**Returns:**

```
JobCat JobCat
---------------------------------- ---------------------------------
Editing Department Director
Editing Department Producer
...
Director Directer
Producer Actor
...
```

Now we'll leverage the Bernie Distance to get our answer and exclude unnecessary comparisons. String pairs where B is not NULL have been solved, string pairs where the MS < @MinSim have been removed. *We just reduced our work from 21 comparisons to 5 and identified 2 duplicates very quickly.*

```
DECLARE @MinSim DEC(6,4) = .8;
SELECT j1.JobId, j2.JobId, b.S1, b.S2, b.L1, b.L2, b.B, b.MS, b.S
FROM @jobs AS j1
CROSS JOIN @jobs AS j2
CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS b
WHERE j1.JobId < j2.JobId
AND (b.MS >= @MinSim OR b.B IS NOT NULL);
```

**Returns:**

```
JobId JobId S1 S2 L1 L2 B MS S
----------- ----------- --------------------- -------------------------- ---- --- ----- -------- -------
1 5 Editing Department Film Editing Department 18 23 5 0.7826 0.7826
2 3 Director Producer 8 8 NULL 1.0000 NULL
2 6 Director Producers 8 9 NULL 0.8889 NULL
2 7 Director Directer 8 8 NULL 1.0000 NULL
3 6 Producer Producers 8 9 1 0.8889 0.8889
3 7 Producer Directer 8 8 NULL 1.0000 NULL
6 7 Directer Producers 8 9 NULL 0.8889 NULL
```

This reduction stuff is cool! Let's bring a couple more algorithms to the party. First we'll grab a copy ngrams8k and create a function to compute the Hamming Distance for similarity. Hamming (HD) can be computed in O(n) time; the similarity as (L1-HD)/L2. *Note that, when HD=1, then LD=1, DLD=1, LCSS=L1-1 and we probably computed your similarity too.*

```
-- Sample data.
DECLARE @jobs TABLE
(
JobId INT IDENTITY PRIMARY KEY,
JobCat VARCHAR(100) NOT NULL
);
INSERT @jobs(JobCat)
VALUES('Editing Department'),('Director'),('Producer'),('Actor'),
('Film Editing Department'),('Producers'),('Directer');
DECLARE @MinSim DECIMAL(6,4) = .8;
WITH br AS
(
SELECT b.*
FROM @jobs AS j1
CROSS JOIN @jobs AS j2
CROSS APPLY samd.bernie8k(j1.JobCat, j2.JobCat) AS b
WHERE j1.JobId < j2.JobId
AND (b.MS >= @MinSim OR b.B IS NOT NULL)
)
SELECT br.S1, br.S2, br.L1, br.L2, br.D, S = h.MinSim
FROM br
CROSS APPLY samd.HammingDistance8k(br.S1, br.S2) AS h
WHERE br.B IS NULL
AND h.MinSim >= @MinSim
UNION ALL
SELECT br.S1, br.S2, br.L1, br.L2, br.D, br.S
FROM br
WHERE br.B IS NOT NULL;
```

**Returns:**

```
S1 S2 L1 L2 D S
---------------------- ------------------------- ----------- ----------- ----------- --------------
Director Directer 8 8 0 0.87500000000
Editing Department Film Editing Department 18 23 5 0.78260000000
Producer Producers 8 9 1 0.88890000000
```

**Summary:**

We started with 21 string pairs to compare. Using Bernie we reduced that number to 5 (2 solved, 14 excluded) Using Hamming we picked off another. Only four to go!

**The Functions:**

```
CREATE FUNCTION samd.bernie8K
(
@s1 VARCHAR(8000),
@s2 VARCHAR(8000)
)
/*****************************************************************************************
[Purpose]:
This function allows developers to optimize and simplify how they fuzzy comparisons
between two strings (@s1 and @s2).
bernie8K returns:
S1 = short string - LEN(S1) will always be <= LEN(S2); The formula to calculate S1 is:
S1 = CASE WHEN LEN(@s1) > LEN(@s2) THEN @s2, ELSE @s1 END;
S2 = long string - LEN(S1) will always be <= LEN(S2); The formula to calculate S1 is:
S2 = CASE WHEN LEN(@s1) > LEN(@s2) THEN @s1, ELSE @s2;
L1 = short string length = LEN(S1)
L2 = long string length = LEN(S2)
D = distance = L2-L1; how many characters needed to make L1=L2; D tells us:
1. D is the *minimum* Levenshtein distance between S1 and S2
2. L2/D is the *maximum* similarity between S1 and S2
I = index = CHARINDEX(S1,S2);
B = bernie distance = When B is not NULL then:
1. B = The Levenshtein Distance between S1 and S2
2. B = The Damarau-Levenshtein Distance bewteen S1 and S2
3. B = The Longest Common Substring & Longest Common Subsequence of S1 and S2
4. KEY! = The similarity between L1 and L2 is L1/l2
MS = Max Similarity = Maximum similarity
S = Minimum Similarity = When B isn't null S is the same Similarity value returned by
mdq.Similarity: https://msdn.microsoft.com/en-us/library/ee633878(v=sql.105).aspx
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2005+, Azure SQL Database, Azure SQL Data Warehouse & Parallel Data Warehouse
[Parameters]:
@s1 = varchar(8000); First of two input strings to be compared
@s2 = varchar(8000); Second of two input strings to be compared
[Returns]:
S1 = VARCHAR(8000); The shorter of @s1 and @s2; returns @s1 when LEN(@s1)=LEN(@s2)
S2 = VARCHAR(8000); The longer of @s1 and @s2; returns @s2 when LEN(@s1)=LEN(@s2)
L1 = INT; The length of the shorter of @s1 and @s2 (or both when they're of equal length)
L2 = INT; The length of the longer of @s1 and @s2 (or both when they're of equal length)
D = INT; L2-L1; The "distance" between L1 and L2
I = INT; The location (position) of S1 inside S2; Note that when 1>0 then S1 is both:
1. a substring of S2
2. a subsequence of S2
B = INT; The Bernie Distance between @s1 and @s1; When B is not null then:
1. B = The Levenshtein Distance between S1 and S2
2. B = The Damarau-Levenshtein Distance bewteen S1 and S2
3. B = The Longest Common Substring & Longest Common Subsequence of S1 and S2
4. KEY! = The similarity between L1 and L2 is L1/l2
MS = DECIMAL(6,4); Returns the same simlarity score as mdq.Similarity would if S1 where a
substring of S2
S = DECIMAL(6,4); When B isn't null then S is the same Similarity value returned by
mdq.Similarity
For more about mdq.Similarity visit:
https://msdn.microsoft.com/en-us/library/ee633878(v=sql.105).aspx
[Syntax]:
--===== Autonomous
SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
FROM samd.bernie8K('abc123','abc12') AS b;
--===== CROSS APPLY example
SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
FROM dbo.SomeTable AS t
CROSS APPLY samd.bernie8K(t.S1,t.S2) AS b;
[Dependencies]:
N/A
[Developer Notes]:
X. Bernie ignores leading and trailing spaces trailing, and returns trimmed strings!
1. When @s1 is NULL then S2 = @s2, L2 = LEN(@s2);
When @s2 is NULL then S1 = @s1, L1 = LEN(@s1)
2. bernie8K ignores leading and trailing whitespace on both input strings (@s1 and @s2).
In other words LEN(@s1)=DATALENGTH(@s1), LEN(@s2)=DATALENGTH(@s2)
3. bernie8K is deterministic; for more about deterministic and nondeterministic
functions see https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--==== 1. BASIC USE:
-- 1.1. When b.I > 0
SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
FROM samd.bernie8K('abc1234','bc123') AS b;
-- 1.2. When b.I = 0
SELECT b.TX, b.S1, b.S2, b.L1, b.L2, b.D, b.I, b.B, b.MS, b.S
FROM samd.bernie8K('abc123','xxx') AS b;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20180708 - Inital Creation - Alan Burstein
Rev 01 - 20181231 - Added Boolean logic for transpositions (TX column) - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
TX = base.TX, -- transposed? boolean - were S1 and S2 transposed?
S1 = base.S1, -- short string >> IIF(LEN(@s1)>LEN(@s2),@s2,@s1)
S2 = base.S2, -- long string >> IIF(LEN(@s1)>LEN(@s2),@s1,@s2)
L1 = base.L1, -- short string length >> IIF(LEN(@s1)>LEN(@s2),LEN(@s2),LEN(@s1))
L2 = base.L2, -- long string length >> IIF(LEN(@s1)>LEN(@s2),LEN(@s1),LEN(@s2))
D = base.D, -- bernie string distance >> # of characters needed to make L1=L2
I = iMatch.idx, -- bernie index >> position of S1 within S2
B = bernie.D, -- bernie distance >> IIF(CHARINDEX(S1,S2)>0,L2-L1,NULL)
MS = maxSim.D, -- maximum similarity
S = similarity.D -- (minimum) similarity
FROM
(
SELECT
TX = CASE WHEN ls.L=1 THEN 1 ELSE 0 END,
S1 = CASE WHEN ls.L=1 THEN s.S2 ELSE s.S1 END,
S2 = CASE WHEN ls.L=1 THEN s.S1 ELSE s.S2 END,
L1 = CASE WHEN ls.L=1 THEN l.S2 ELSE l.S1 END,
L2 = CASE WHEN ls.L=1 THEN l.S1 ELSE l.S2 END,
D = ABS(l.S1-l.S2)
FROM (VALUES(LEN(LTRIM(@s1)),LEN(LTRIM(@s2)))) AS l(S1,S2) -- LEN(S1,S2)
CROSS APPLY (VALUES(RTRIM(LTRIM(@S1)),RTRIM(LTRIM(@S2)))) AS s(S1,S2) -- S1 and S2 trimmed
CROSS APPLY (VALUES(SIGN(l.S1-l.S2))) AS ls(L) -- LeftLength
) AS base
CROSS APPLY (VALUES(ABS(SIGN(base.L1)-1),ABS(SIGN(base.L2)-1))) AS blank(S1,S2)
CROSS APPLY (VALUES(CHARINDEX(base.S1,base.S2))) AS iMatch(idx)
CROSS APPLY (VALUES(CASE WHEN SIGN(iMatch.idx|blank.S1)=1 THEN base.D END)) AS bernie(D)
CROSS APPLY (VALUES(CAST(CASE blank.S1 WHEN 1 THEN 1.*blank.S2
ELSE 1.*base.L1/base.L2 END AS DECIMAL(6,4)))) AS maxSim(D)
CROSS APPLY (VALUES(CAST(1.*NULLIF(SIGN(iMatch.idx),0)*maxSim.D
AS DECIMAL(6,4)))) AS similarity(D);
GO
CREATE FUNCTION dbo.rangeAB
(
@low BIGINT, -- (start) Lowest number in the set
@high BIGINT, -- (stop) Highest number in the set
@gap BIGINT, -- (step) Difference between each number in the set
@row1 BIT -- Base: 0 or 1; should RN begin with 0 or 1?
)
/****************************************************************************************
[Purpose]:
Creates a lazy, in-memory...
[Author]: Alan Burstein
[Compatibility]:
SQL Server 2008+ and Azure SQL Database
[Syntax]:
SELECT r.RN, r.OP, r.N1, r.N2
FROM dbo.rangeAB(@low,@high,@gap,@row1) AS r;
[Parameters]:
@low = BIGINT; represents the lowest value for N1.
@high = BIGINT; represents the highest value for N1.
@gap = BIGINT; represents how much N1 and N2 will increase each row. @gap also
represents the difference between N1 and N2.
@row1 = BIT; represents the base (first) value of RN. When @row1 = 0, RN begins with 0,
when @row = 1 then RN begins with 1.
[Returns]:
Inline Table Valued Function returns:
RN = BIGINT; a row number that works just like T-SQL ROW_NUMBER() except that it can
start at 0 or 1 which is dictated by @row1. If you are returning the numbers:
(0 or 1) Through @high, then use RN as your "N" value, otherwise use N1.
OP = BIGINT; returns the "opposite number" that relates to RN. When RN begins with 0 the
first number in the set will be 0 for RN, the last number in will be 0 for OP. When
RN is 1 to 10, the numbers 1 to 10 are retrurned in ascending order for RN and in
descending order for OP.
Given the Numbers 1 to 3, 3 is the opposite of 1, 2 the opposite of 2, and 1 is the
opposite of 3. Given the numbers -1 to 2, the opposite of -1 is 2, the opposite of 0
is 1, and the opposite of 1 is 0.
N1 = BIGINT; This is the "N" in your tally table/numbers function. this is your *Lazy*
sequence of numbers starting at @low and incrimenting by @gap until the next number
in the sequence is greater than @high.
N2 = BIGINT; a lazy sequence of numbers starting @low+@gap and incrimenting by @gap. N2
will always be greater than N1 by @gap. N2 can also be thought of as:
LEAD(N1,1,N1+@gap) OVER (ORDER BY RN)
[Dependencies]:
N/A
[Developer Notes]:
1. The lowest and highest possible numbers returned are whatever is allowable by a
bigint. The function, however, returns no more than 531,441,000,000 rows (8100^3).
2. @gap does not affect RN, RN will begin at @row1 and increase by 1 until the last row
unless its used in a subquery where a filter is applied to RN.
3. @gap must be greater than 0 or the function will not return any rows.
4. Keep in mind that when @row1 is 0 then the highest RN value (ROWNUMBER) will be the
number of rows returned minus 1
5. If you only need is a sequential set beginning at 0 or 1 then, for best performance
use the RN column. Use N1 and/or N2 when you need to begin your sequence at any
number other than 0 or 1 or if you need a gap between your sequence of numbers.
6. Although @gap is a bigint it must be a positive integer or the function will
not return any rows.
7. The function will not return any rows when one of the following conditions are true:
* any of the input parameters are NULL
* @high is less than @low
* @gap is not greater than 0
To force the function to return all NULLs instead of not returning anything you can
add the following code to the end of the query:
UNION ALL
SELECT NULL, NULL, NULL, NULL
WHERE NOT (@high&@low&@gap&@row1 IS NOT NULL AND @high >= @low AND @gap > 0)
This code was excluded as it adds a ~5% performance penalty.
8. There is no performance penalty for sorting by rn ASC; there is a large performance
penalty for sorting in descending order WHEN @row1 = 1; WHEN @row1 = 0
If you need a descending sort the use OP in place of RN then sort by rn ASC.
9. For 2012+ systems, The TOP logic can be replaced with:
OFFSET 0 ROWS FETCH NEXT
ABS((ISNULL(@high,0)-ISNULL(@low,0))/ISNULL(@gap,0)+ISNULL(@row1,1)) ROWS ONLY
Best Practices:
--===== 1. Using RN (rownumber)
-- (1.1) The best way to get the numbers 1,2,3...@high (e.g. 1 to 5):
SELECT r.RN
FROM dbo.rangeAB(1,5,1,1) AS r;
-- (1.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 0 to 5):
SELECT r.RN
FROM dbo.rangeAB(0,5,1,0) AS r;
--===== 2. Using OP for descending sorts without a performance penalty
-- (2.1) The best way to get the numbers 5,4,3...@high (e.g. 5 to 1):
SELECT r.OP
FROM dbo.rangeAB(1,5,1,1) AS r
ORDER BY R.RN ASC;
-- (2.2) The best way to get the numbers 0,1,2...@high-1 (e.g. 5 to 0):
SELECT r.OP
FROM dbo.rangeAB(1,6,1,0) AS r
ORDER BY r.RN ASC;
--===== 3. Using N1
-- (3.1) To begin with numbers other than 0 or 1 use N1 (e.g. -3 to 3):
SELECT r.N1
FROM dbo.rangeAB(-3,3,1,1) AS r;
-- (3.2) ROW_NUMBER() is built in. If you want a ROW_NUMBER() include RN:
SELECT r.RN, r.N1
FROM dbo.rangeAB(-3,3,1,1) AS r;
-- (3.3) If you wanted a ROW_NUMBER() that started at 0 you would do this:
SELECT r.RN, r.N1
FROM dbo.rangeAB(-3,3,1,0) AS r;
--===== 4. Using N2 and @gap
-- (4.1) To get 0,10,20,30...100, set @low to 0, @high to 100 and @gap to 10:
SELECT r.N1
FROM dbo.rangeAB(0,100,10,1) AS r;
-- (4.2) Note that N2=N1+@gap; this allows you to create a sequence of ranges.
-- For example, to get (0,10),(10,20),(20,30).... (90,100):
SELECT r.N1, r.N2
FROM dbo.rangeAB(0,90,10,1) AS r;
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20140518 - Initial Development - AJB
Rev 01 - 20151029 - Added 65 rows. Now L1=465; 465^3=100.5M. Updated comments - AJB
Rev 02 - 20180613 - Complete re-design including opposite number column (op)
Rev 03 - 20180920 - Added additional CROSS JOIN to L2 for 530B rows max - AJB
Rev 04 - 20190306 - Added inline aliasing function(f):
f.R=(@high-@low)/@gap, f.N=@gap+@low - AJB
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS
(
SELECT 1
FROM (VALUES
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
(0),(0)) T(N) -- 90 values
),
L2(N) AS (SELECT 1 FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c),
iTally AS (SELECT RN = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM L2 a CROSS JOIN L2 b)
SELECT r.RN, r.OP, r.N1, r.N2
FROM
(
SELECT
RN = 0,
OP = (@high-@low)/@gap,
N1 = @low,
N2 = @gap+@low
WHERE @row1 = 0
UNION ALL
SELECT TOP (ABS((ISNULL(@high,0)-ISNULL(@low,0))/ISNULL(@gap,0)+ISNULL(@row1,1)))
RN = i.RN,
OP = (@high-@low)/@gap+(2*@row1)-i.RN,
N1 = (i.rn-@row1)*@gap+@low,
N2 = (i.rn-(@row1-1))*@gap+@low
FROM iTally AS i
ORDER BY i.RN
) AS r
WHERE @high&@low&@gap&@row1 IS NOT NULL AND @high >= @low
AND @gap > 0;
GO
CREATE FUNCTION samd.NGrams8k
(
@string VARCHAR(8000), -- Input string
@N INT -- requested token size
)
/*****************************************************************************************
[Purpose]:
A character-level N-Grams function that outputs a contiguous stream of @N-sized tokens
based on an input string (@string). Accepts strings up to 8000 varchar characters long.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+, Azure SQL Database
[Syntax]:
--===== Autonomous
SELECT ng.position, ng.token
FROM samd.NGrams8k(@string,@N) AS ng;
--===== Against a table using APPLY
SELECT s.SomeID, ng.position, ng.token
FROM dbo.SomeTable AS s
CROSS APPLY samd.NGrams8K(s.SomeValue,@N) AS ng;
[Parameters]:
@string = The input string to split into tokens.
@N = The size of each token returned.
[Returns]:
Position = BIGINT; the position of the token in the input string
token = VARCHAR(8000); a @N-sized character-level N-Gram token
[Dependencies]:
1. dbo.rangeAB (iTVF)
[Revision History]:
------------------------------------------------------------------------------------------
Rev 00 - 20140310 - Initial Development - Alan Burstein
Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added
conversion to bigint in the TOP logic to remove implicit conversion
to bigint - Alan Burstein
Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less
than the length of @string. Updated comment section. - Alan Burstein
Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N
parameters to prevent a NULL string or NULL @N from causing "an
improper value" being passed to the TOP clause. - Alan Burstein
Rev 05 - 20171228 - Small simplification; changed:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@N,1)-1)),0)))
to:
(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))+1-ISNULL(@N,1)),0)))
Rev 06 - 20180612 - Using CHECKSUM(N) in the to convert N in the token output instead of
using (CAST N as int). CHECKSUM removes the need to convert to int.
Rev 07 - 20180612 - re-designed to: Use dbo.rangeAB - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
position = r.RN,
token = SUBSTRING(@string, CHECKSUM(r.RN), @N)
FROM dbo.rangeAB(1, LEN(@string)+1-@N,1,1) AS r
WHERE @N > 0 AND @N <= LEN(@string);
GO
CREATE FUNCTION samd.hammingDistance8K
(
@s1 VARCHAR(8000), -- first input string
@s2 VARCHAR(8000) -- second input string
)
/*****************************************************************************************
[Purpose]:
Purely set-based iTVF that returns the Hamming Distance between two strings of equal
length. See: https://en.wikipedia.org/wiki/Hamming_distance
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT h.HD
FROM samd.hammingDistance8K(@s1,@s2) AS h;
--===== Against a table using APPLY
SELECT t.string, S2 = @s2, h.HD
FROM dbo.someTable AS t
CROSS
APPLY samd.hammingDistance8K(t.string, @s2) AS h;
[Parameters]:
@s1 = VARCHAR(8000); the first input string
@s2 = VARCHAR(8000); the second input string
[Dependencies]:
1. samd.NGrams8K
[Examples]:
--===== 1. Basic Use
DECLARE @s1 VARCHAR(8000) = 'abc1234',
@s2 VARCHAR(8000) = 'abc2234';
SELECT h.HD, h.L, h.minSim
FROM samd.hammingDistance8K(@s1,@s2) AS h;
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20180800 - Initial re-design - Alan Burstein
Rev 01 - 20181116 - Added L (Length) and minSim
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT H.HD, H.L, minSim = 1.*(H.L-H.HD)/H.L
FROM
(
SELECT LEN(@s1)-SUM(CHARINDEX(ng.token,SUBSTRING(@S2,ng.position,1))),
CASE LEN(@s1) WHEN LEN(@s2) THEN LEN(@s1) END
FROM samd.NGrams8k(@s1,1) AS ng
WHERE LEN(@S1)=LEN(@S2)
) AS H(HD,L);
GO
```

3more comments