SQL Server: computing the MD5 hash of a UTF-8 string

sql-server tsql hash encoding sql-server-2008-r2


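I have a SQL table that stores large string values which must be unique. To enforce uniqueness, I have a unique index on a column in which I store a string representation of the MD5 hash of the large string.

The C# application that saves these records computes the hash with the following method: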

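public static string CreateMd5HashString(byte[] input)
{
    var hashBytes = MD5.Create().ComputeHash(input);
    return string.Join("", hashBytes.Select(b => b.ToString("X")));
}

To call it, I first convert the string to a byte[] using UTF-8 encoding:

// This is what I use in the application
CreateMd5HashString(Encoding.UTF8.GetBytes("abc"))
// Result: 90150983CD24FB0D6963F7D28E17F72

Now I would like to implement this hash function in SQL using HASHBYTES, but I get a different value:

print hashbytes('md5', N'abc')
-- Result: 0xCE1473CF80C6B3FDA8E3DFC006ADC315

This is because SQL computes the MD5 of the string's UTF-16 representation. I get the same result in C# if I call CreateMd5HashString(Encoding.Unicode.GetBytes("abc")).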
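The mismatch can be seen by looking at the bytes that are actually hashed. The following is a minimal sketch, not part of the original application; the class name EncodingDemo and the helper Md5Hex are illustrative assumptions, and Md5Hex simply mirrors the formatting of the method shown above.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class EncodingDemo
{
    // Same formatting as the question's method: "X" does not zero-pad each byte.
    static string Md5Hex(byte[] input)
    {
        using (var md5 = MD5.Create())
        {
            return string.Join("", md5.ComputeHash(input).Select(b => b.ToString("X")));
        }
    }

    public static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("abc");     // what the application hashes
        byte[] utf16 = Encoding.Unicode.GetBytes("abc"); // UTF-16LE, what NVARCHAR holds

        Console.WriteLine(BitConverter.ToString(utf8));  // 61-62-63
        Console.WriteLine(BitConverter.ToString(utf16)); // 61-00-62-00-63-00

        Console.WriteLine(Md5Hex(utf8));  // 90150983CD24FB0D6963F7D28E17F72
        Console.WriteLine(Md5Hex(utf16)); // hash of the UTF-16 bytes, as HASHBYTES sees N'abc'
    }
}
```

Note that the stored value for "abc" is 31 characters rather than 32: ToString("X") drops the leading zero of any byte below 0x10 (here 0x01 becomes "1"). Any SQL-side implementation would have to reproduce that formatting as well as the UTF-8 bytes for the values to match.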

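I cannot change the way the application computes the hash.

Is there a way to make SQL Server compute the MD5 hash of a string's UTF-8 bytes?

I have looked at similar questions and tried using collations, but no luck so far.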

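SQL Server 2008 R2 does not natively support UTF-8 strings. As you noticed, HASHBYTES applied to an NVARCHAR value hashes its UTF-16 bytes.

If you stick with the HASHBYTES function, you must be able to pass the same UTF-8 byte[] that the C# code produces as VARBINARY, so that the encoding is preserved. This can be done with a CLR function that accepts NVARCHAR and returns the result of Encoding.UTF8.GetBytes as VARBINARY.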
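A CLR function along those lines might look like the sketch below. The class name Utf8Functions and the method name NVarCharToUtf8Bytes are hypothetical, and the sketch assumes the assembly is compiled against the .NET Framework and then registered in the database with CREATE ASSEMBLY and CREATE FUNCTION ... EXTERNAL NAME, with the parameter declared as nvarchar(max) and the return type as varbinary(max).

```csharp
using System.Data.SqlTypes;
using System.Text;
using Microsoft.SqlServer.Server;

// Hypothetical container class for the SQLCLR function.
public static class Utf8Functions
{
    // Returns the UTF-8 bytes of the input, or NULL when the input is NULL.
    [SqlFunction(IsDeterministic = true, IsPrecise = true, DataAccess = DataAccessKind.None)]
    public static SqlBytes NVarCharToUtf8Bytes(SqlChars input)
    {
        if (input.IsNull)
        {
            return SqlBytes.Null;
        }

        // SqlChars.Value exposes the NVARCHAR data as a char[];
        // Encoding.UTF8 re-encodes it into the byte sequence the application hashes.
        return new SqlBytes(Encoding.UTF8.GetBytes(input.Value));
    }
}
```

Once registered, the function's result can be passed to HASHBYTES('md5', ...). One caveat for large strings: on SQL Server 2014 and earlier, HASHBYTES accepts at most 8000 bytes of input, so for longer values it may be simpler to compute the MD5 inside the CLR function itself.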

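That said, I strongly recommend keeping this kind of business rule in the application rather than in the database, especially since the application already performs this logic.

You need to create a UDF that converts NVARCHAR data into the bytes of its UTF-8 representation. Assuming it is called dbo.NCharToUTF8Binary, you can then do the following: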

hashbytes('md5', dbo.NCharToUTF8Binary(N'abc', 1))

Here is a UDF that does this:

create function dbo.NCharToUTF8Binary(@txt NVARCHAR(max), @modified bit)
returns varbinary(max)
as
begin
-- Note: This is not the fastest possible routine. 
-- If you want a fast routine, use SQLCLR
    set @modified = isnull(@modified, 0)
    -- First shred into a table.
    declare @chars table (
    ix int identity primary key,
    codepoint int,
    utf8 varbinary(6)
    )
    declare @ix int
    set @ix = 0
    while @ix < datalength(@txt)/2  -- datalength/2 rather than len(), so trailing spaces are counted
    begin
        set @ix = @ix + 1
        insert @chars(codepoint)
        select unicode(substring(@txt, @ix, 1))
    end

    -- Now look for surrogate pairs.
    -- If we find a pair (lead followed by trail) we will pair them
    -- High surrogate is \uD800 to \uDBFF
    -- Low surrogate  is \uDC00 to \uDFFF
    -- Look for high surrogate followed by low surrogate and update the codepoint:
    -- codepoint = (high & 0x03FF) * 0x0400 + (low & 0x03FF) + 0x10000
    update c1 set codepoint = ((c1.codepoint & 0x03ff) * 0x0400) + (c2.codepoint & 0x03ff) + 0x10000
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0xD800 and c1.codepoint <=0xDBFF
    and c2.codepoint >= 0xDC00 and c2.codepoint <=0xDFFF
    -- Get rid of the trailing half of the pair where found
    delete c2 
    from @chars c1 inner join @chars c2 on c1.ix = c2.ix -1
    where c1.codepoint >= 0x10000

    -- Now we utf-8 encode each codepoint.
    -- Lone surrogate halves will still be here
    -- so they will be encoded as if they were not surrogate pairs.
    update c 
    set utf8 = 
    case 
    -- One-byte encodings (modified UTF8 outputs zero as a two-byte encoding)
    when codepoint <= 0x7f and (@modified = 0 OR codepoint <> 0)
    then cast(substring(cast(codepoint as binary(4)), 4, 1) as varbinary(6))
    -- Two-byte encodings
    when codepoint <= 0x07ff
    then substring(cast((0x00C0 + ((codepoint/0x40) & 0x1f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Three-byte encodings
    when codepoint <= 0x0ffff
    then substring(cast((0x00E0 + ((codepoint/0x1000) & 0x0f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)
    -- Four-byte encodings 
    when codepoint <= 0x1FFFFF
    then substring(cast((0x00F0 + ((codepoint/0x00040000) & 0x07)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x1000) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + ((codepoint/0x40) & 0x3f)) as binary(4)),4,1)
    + substring(cast((0x0080 + (codepoint & 0x3f)) as binary(4)),4,1)

    end
    from @chars c

    -- Finally concatenate them all and return.
    declare @ret varbinary(max)
    set @ret = cast('' as varbinary(max))
    select @ret = @ret + utf8 from @chars c order by ix
    return  @ret

end
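To sanity-check a UDF like this, its output can be compared against the UTF-8 bytes that .NET produces for the same strings. The sketch below is only a reference generator, not part of the original answer; the class name Utf8Reference and the sample strings are assumptions chosen to cover one-, two-, three- and four-byte encodings, including a surrogate pair.

```csharp
using System;
using System.Text;

public static class Utf8Reference
{
    public static void Main()
    {
        // "abc", U+00E9, U+4E2D U+6587, and U+1F600 written as a surrogate pair.
        string[] samples = { "abc", "\u00E9", "\u4E2D\u6587", "\uD83D\uDE00" };

        foreach (string s in samples)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(s);
            // Print in the same 0x... form that SQL Server uses for varbinary values.
            Console.WriteLine("0x" + BitConverter.ToString(bytes).Replace("-", ""));
        }
        // Output:
        // 0x616263
        // 0xC3A9
        // 0xE4B8ADE69687
        // 0xF09F9880
    }
}
```

Running select dbo.NCharToUTF8Binary(N'...', 0) on the same strings should return the same values; the last sample exercises the surrogate-pair branch.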

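The following works only on SQL Server 2019 and later, where UTF-8 collations are available:

SELECT HashBytes('MD5', CAST (N'中文' COLLATE Latin1_General_100_CI_AI_SC_UTF8 AS varchar(4000)))

Comments:

I did the same thing last night. I guess you use it for storing passwords and checking logins. Why not change the logic so that C# computes the MD5, converts it to the hash string again, and then checks whether it matches the string stored in the DB?

@Veljko89 MD5 is a poor choice for passwords. I suggest you avoid it.

But to actually test it against any website, there are defenses: a timeout after 5 attempts, or something similar. No website will accept that many login attempts. And even to find someone's password, is there any chance of finding the 20-character string added as a salt?

@Veljko89 Yes, but if an attacker obtains the contents of the database, for example through a SQLi vulnerability, it is easy to recover at least some of the passwords.

@Veljko89 I would change the application only as a last resort, because that is very difficult. I am interested in whether there is a SQL solution.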