SQL Function - Fuzzy Matching with Levenshtein Distance Algorithm - Return Lowest Value Only(SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值)
问题描述
问题:需要 SQL 函数使用 Levenshtein 算法返回最低"匹配值.
代码:
<预><代码>创建函数 ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))返回整数作为开始声明 @s1_len int, @s2_len int声明 @i int、@j int、@s1_char nchar、@c int、@c_temp int声明@cv0 varbinary(8000), @cv1 varbinary(8000)选择@s1_len = LEN(@s1),@s2_len = LEN(@s2),@cv1 = 0x0000,@j = 1,@i = 1,@c = 0而@j <= @s2_lenSELECT @cv1 = @cv1 + CAST(@j AS binary(2)),@j = @j + 1当@i <= @s1_len开始选择@s1_char = SUBSTRING(@s1, @i, 1),@c = @i,@cv0 = CAST(@i AS binary(2)),@j = 1当@j <= @s2_len开始设置@c = @c + 1SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +当@s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END如果@c >@c_temp SET @c = @c_tempSET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1如果@c >@c_temp SET @c = @c_tempSELECT @cv0 = @cv0 + CAST(@c AS binary(2)),@j = @j + 1结尾选择@cv1 = @cv0,@i = @i + 1结尾返回@c结尾如果 OBJECT_ID('tempdb..#ExistingCustomers') 不是 NULL删除表#ExistingCustomers;创建表#ExistingCustomers(客户 VARCHAR(255),身份证号码)INSERT #ExistingCustomers SELECT 'Ed''s Barbershop', 1002INSERT #ExistingCustomers SELECT 'GroceryTown', 1003INSERT #ExistingCustomers SELECT 'Candy Place', 1004INSERT #ExistingCustomers SELECT 'Handy Man', 1005如果 OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') 不是 NULL下降表#潜在客户;创建表#POTENTIALCUSTOMERS(客户VARCHAR(255));插入 #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'INSERT #POTENTIALCUSTOMERS SELECT '杂货城'插入 #POTENTIALCUSTOMERS 选择糖果店"INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'插入 #POTENTIALCUSTOMERS 选择苹果农场"插入 #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'选择 A. 客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15;返回:
我想退货:
说明:结果是 Levenshtein 算法的最低"值.有两行 Levenshtein 分数相同 The Apple Farm 和 Ride-a-Long Bikes,在这种情况下,任何值都可以,只要它是一种价值.
参考资料:
SQL 模糊连接 - MSSQL
http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
如果您按潜在客户进行分区并使用 ValueLev 对结果进行排序,则可以使用 CTE 来获得您想要的结果:
;with CTE AS(SELECT RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,一个客户,出价,b.客户作为客户,dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) 作为 ValueLev来自#POTENTIALCUSTOMERS aLEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) <15)选择客户,MIN(ID) 作为 ID,MIN(cust) AS cust,价值等级从 CTE哪里 CTE.RowNbr = 1按客户分组,ValueLev由于您不介意在重复 ValueLev 的情况下返回哪个结果,请使用 GROUP BY 和 MIN 来缩放结果每个潜在客户最多一个.
输出:
Customer ID cust ValueLev糖果店 1004 糖果店 0杂货镇 1003 杂货镇 0Eds 理发店 1002 Ed 的理发店 1勤杂工 1005 勤杂工 2苹果农场 1004 Candy Place 9Ride-a-Long Bikes 1003 Candy Place 14Problem: Need SQL function to return the 'lowest' matching value using the Levenshtein algorithm.
Code:
CREATE FUNCTION ufn_levenshtein(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int
DECLARE @i int, @j int, @s1_char nchar, @c int, @c_temp int
DECLARE @cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT
@s1_len = LEN(@s1),
@s2_len = LEN(@s2),
@cv1 = 0x0000,
@j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT
@s1_char = SUBSTRING(@s1, @i, 1),
@c = @i,
@cv0 = CAST(@i AS binary(2)),
@j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
IF OBJECT_ID('tempdb..#ExistingCustomers') IS NOT NULL
DROP TABLE #ExistingCustomers;
CREATE TABLE #ExistingCustomers
(
Customer VARCHAR(255),
ID INT
)
INSERT #ExistingCustomers SELECT 'Ed''s Barbershop', 1002
INSERT #ExistingCustomers SELECT 'GroceryTown', 1003
INSERT #ExistingCustomers SELECT 'Candy Place', 1004
INSERT #ExistingCustomers SELECT 'Handy Man', 1005
IF OBJECT_ID('tempdb..#POTENTIALCUSTOMERS') IS NOT NULL
DROP TABLE #POTENTIALCUSTOMERS;
CREATE TABLE #POTENTIALCUSTOMERS(Customer VARCHAR(255));
INSERT #POTENTIALCUSTOMERS SELECT 'Eds Barbershop'
INSERT #POTENTIALCUSTOMERS SELECT 'Grocery Town'
INSERT #POTENTIALCUSTOMERS SELECT 'Candy Place'
INSERT #POTENTIALCUSTOMERS SELECT 'Handee Man'
INSERT #POTENTIALCUSTOMERS SELECT 'The Apple Farm'
INSERT #POTENTIALCUSTOMERS SELECT 'Ride-a-Long Bikes'
SELECT A.Customer,
b.ID,
b.Customer as cust,
dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15;
This returns:
What I would like to return:
Explanation: The results are the 'lowest' values from the Levenshtein algorithm. There are two rows where the Levenshtein scores are the same The Apple Farm and Ride-a-Long Bikes, in which case any of the values is fine, just as long as it is one value.
References:
SQL Fuzzy Join - MSSQL
http://www.kodyaz.com/articles/fuzzy-string-matching-using-levenshtein-distance-sql-server.aspx
You can use CTE to get the result you want if you partition by the potential customer and use the ValueLev to order the results:
;WITH CTE AS
(
SELECT RANK() OVER (PARTITION BY a.Customer ORDER BY dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) ASC) AS RowNbr,
A.Customer,
b.ID,
b.Customer as cust,
dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) as ValueLev
FROM #POTENTIALCUSTOMERS a
LEFT JOIN #ExistingCustomers b ON dbo.ufn_levenshtein(REPLACE(A.Customer, ' ', ''), REPLACE(B.Customer, ' ', '')) < 15
)
SELECT Customer,
MIN(ID) AS ID,
MIN(cust) AS cust,
ValueLev
FROM CTE
WHERE CTE.RowNbr = 1
GROUP BY Customer, ValueLev
As you don't mind which result is returned in the case of duplicate ValueLev, use GROUP BY and MIN to scale the results down to one per potential customer.
Output:
Customer ID cust ValueLev
Candy Place 1004 Candy Place 0
Grocery Town 1003 GroceryTown 0
Eds Barbershop 1002 Ed's Barbershop 1
Handee Man 1005 Handy Man 2
The Apple Farm 1004 Candy Place 9
Ride-a-Long Bikes 1003 Candy Place 14
这篇关于SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配 - 仅返回最低值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:SQL 函数 - 使用 Levenshtein 距离算法进行模糊匹配
- 在SQL中,如何为每个组选择前2行 2021-01-01
- 更改自动增量起始编号? 2021-01-01
- 导入具有可变标题的 Excel 文件 2021-01-01
- 远程 mySQL 连接抛出“无法使用旧的不安全身份验证连接到 MySQL 4.1+"来自 XAMPP 的错误 2022-01-01
- 使用 Oracle PL/SQL developer 生成测试数据 2021-01-01
- 如何使用 pip 安装 Python MySQLdb 模块? 2021-01-01
- 如何将 Byte[] 插入 SQL Server VARBINARY 列 2021-01-01
- SQL 临时表问题 2022-01-01
- 以一个值为轴心,但将一行上的数据按另一行分组? 2022-01-01
- 如何将 SonarQube 6.7 从 MySQL 迁移到 postgresql 2022-01-01
