Detecting bad UTF-8 encoding: list of bad characters to sniff?
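I have a SQL Server 2010 database that is shared between two applications. One application we have control over, and the other is a third-party app that created the database in the first place. Our app is a CRM built on top of the third-party webmail app.

The database contains varchar columns and is Latin-1 encoded. The third-party app is written in PHP and doesn't care about correctly encoding the data, so it stuffs UTF-8 encoded bytes into the varchar columns, where they are interpreted as Latin-1 and look like garbage.

Our CRM app is written in .NET, which automatically detects that the database collation differs from the encoding of the string in memory, so when .NET writes to the database, it converts the bytes to match the database's encoding.

So... data written to the db from our app looks correct in the db, but data from the third-party app doesn't.

When our app writes FirstName = Céline, it is stored in the db as Céline.
When the webmail app writes FirstName = Céline, it is stored in the db as CÃ©line.

Our CRM app needs to display contacts that were created in either system. So I'm writing an EncodingSniffer class that looks for flagged characters that indicate a poorly encoded string and converts them.

Currently I have:

private static string[] _flaggedChars = new string[] { "Ã©" };

It works great to display CÃ©line as Céline, but I need to add to the list.

Does anyone know of a resource listing all the possible ways that UTF-8 special characters could be interpreted as iso-8859-1?

Thanks

Clarification: Since I am working in .NET, the string is converted to Unicode UTF-16 when it is loaded into memory from the database. So, regardless of whether it was encoded correctly in the database, it is now represented as UTF-16 bytes. I need to be able to analyze those UTF-16 bytes and determine whether they are screwed up because UTF-8 bytes were stuffed into an iso-8859-1 database... clear as mud, right?

Here is what I have so far. It has cleaned up the display of most misencoded characters, but I am still having trouble with É. For instance: Éric is stored in the db by webmail as Ã‰ric, but after detecting bad encoding and changing it back, it displays as �?ric.

Looking at a user who has 2500 contacts, hundreds of which had encoding issues, the É is the only thing that isn't displaying correctly.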
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodingSniffer
{
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Built once and reused instead of being rebuilt on every CheckUTF call.
    private static readonly Regex FlagRegex = CreateRegex();

    public static Regex CreateRegex()
    {
        string specials = "ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö";
        List<string> flags = new List<string>();
        foreach (char c in specials)
        {
            // Take the specials, treat them as utf-8, interpret them as latin-1.
            // Note: Trim() also strips U+0085 and U+00A0, so 'Å' and 'à' reduce to a bare "Ã" flag.
            string interpretedAsLatin1 = Latin1.GetString(Encoding.UTF8.GetBytes(c.ToString())).Trim();
            // utf-8 chars made up of 2 bytes, interpreted as two single-byte latin-1 chars.
            if (interpretedAsLatin1.Length > 0)
                flags.Add(Regex.Escape(interpretedAsLatin1));
        }
        // Alternation of every flagged sequence.
        return new Regex("(" + string.Join("|", flags) + ")");
    }

    public static string CheckUTF(string data)
    {
        if (FlagRegex.IsMatch(data))
            return Encoding.UTF8.GetString(Latin1.GetBytes(data)); // from iso-8859-1 (latin-1) to utf-8
        return data;
    }
}
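So: É is being converted to 195 'Ã', 8240 '‰'.

You should probably just try to decode the byte string as UTF-8, and if you get an error, assume it's ISO-8859-1 instead.

Text that is encoded as ISO-8859-1 rarely "happens" to also be valid UTF-8... unless it's ISO-8859-1 that only actually contains ASCII, but in that case you don't have a problem at all, of course. So this method is reasonably robust.

Ignoring which characters occur more frequently than others in actual language, here is a naive analysis that assumes each character occurs with the same frequency. Let's try to find out how often valid ISO-8859-1 will be mistaken for UTF-8, resulting in mojibake. I also assume that C1 control characters (U+0080 through U+009F) don't occur.

Consider any given byte in the byte string. If the byte is close to the end of the string, you are even more likely to detect malformed UTF-8, because some byte sequences will be known to be too short to be valid UTF-8. But assuming the byte is not near the end of the string:
- p(byte decodes as ASCII) = 0.57. This gives no information about whether the string is ASCII, ISO-8859-1, or UTF-8.
- If this byte is 0x80 through 0xc1 or 0xf8 through 0xff, it can't be UTF-8, so you'll detect that. p = 0.33
- If this first byte is 0xc2 through 0xdf (p = 0.11), it could be valid UTF-8, but only if it is followed by a byte with a value between 0x80 and 0xbf. The probability that the next byte is not in that range is 192/224 = 0.86. So the probability that UTF-8 fails here is 0.09
- If the first byte is 0xe0 through 0xef, it could be valid UTF-8, but only if it is followed by 2 continuation bytes. The probability that you will detect bad UTF-8 is thus (16/224)*(1-(0.14*0.14)) = 0.07
- Similarly for 0xf0 through 0xf7, the probability is (8/224)*(1-(0.14*0.14*0.14)) = 0.04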
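
The decode-then-fall-back approach described above might look like the following in C#. This is only a sketch: `Utf8Sniffer` and `DecodeUtf8OrFallback` are hypothetical names, not part of the question's code, and it assumes you have the raw bytes in hand. It relies on the `UTF8Encoding` constructor flag `throwOnInvalidBytes`, which makes decoding throw a `DecoderFallbackException` on malformed input rather than substituting U+FFFD.

```csharp
using System;
using System.Text;

public static class Utf8Sniffer
{
    // throwOnInvalidBytes: true makes decoding throw on malformed UTF-8
    // instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: returns the UTF-8 decoding when the bytes are
    // well-formed UTF-8, otherwise decodes them with the fallback encoding.
    public static string DecodeUtf8OrFallback(byte[] raw, Encoding fallback)
    {
        try
        {
            return StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return fallback.GetString(raw);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] asUtf8 = Encoding.UTF8.GetBytes("C\u00E9line");   // 43 C3 A9 6C 69 6E 65
        byte[] asLatin1 = latin1.GetBytes("C\u00E9line");         // 43 E9 6C 69 6E 65

        // The first call succeeds as UTF-8; the second fails strict decoding
        // (0xE9 is not followed by continuation bytes) and uses the fallback.
        Console.WriteLine(DecodeUtf8OrFallback(asUtf8, latin1));
        Console.WriteLine(DecodeUtf8OrFallback(asLatin1, latin1));
    }
}
```

Both calls are expected to return Céline. In the question's situation the strings have already been decoded by the driver, so the bytes would first have to be recovered by re-encoding the string with the code page the database column actually uses.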
#based on c# in question: https://stackoverflow.com/questions/10484833/detecting-bad-utf-8-encoding-list-of-bad-characters-to-sniff
function Convert-CorruptCodePageString {
    [CmdletBinding(DefaultParameterSetName = 'ByInputText')]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ParameterSetName = 'ByInputText')]
        [string]$InputText
        ,
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ParameterSetName = 'ByInputObject')]
        [PSObject]$InputObject
        ,
        [Parameter(Mandatory = $true, ParameterSetName = 'ByInputObject')]
        [string]$Property
        ,
        [Parameter()]
        [System.Text.Encoding]$SourceEncoding = [System.Text.Encoding]::GetEncoding('Windows-1252')
        ,
        [Parameter()]
        [System.Text.Encoding]$DestinationEncoding = [System.Text.Encoding]::UTF8
        ,
        [Parameter()]
        [string]$DodgyChars = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö'
    )
    begin {
        #build a regex of how each dodgy char looks when its destination-encoding bytes are read in the source encoding
        [string]$InvalidCharRegex = ($DodgyChars.ToCharArray() | %{
            [byte[]]$dodgyCharBytes = $DestinationEncoding.GetBytes($_.ToString())
            [regex]::Escape($SourceEncoding.GetString($dodgyCharBytes,0,$dodgyCharBytes.Length).Trim())
        }) -join '|'
    }
    process {
        if ($PSCmdlet.ParameterSetName -eq 'ByInputText') {
            $InputObject = $null
        } else {
            $InputText = $InputObject."$Property"
        }
        #-cmatch is case sensitive, so a correctly stored 'ã' is not mistaken for the 'Ã' flag
        [bool]$IsLikelyCorrupted = $InputText -cmatch $InvalidCharRegex
        if ($IsLikelyCorrupted) { #only bother to convert if we think it's corrupted
            [byte[]]$bytes = $SourceEncoding.GetBytes($InputText)
            [string]$outputText = $DestinationEncoding.GetString($bytes,0,$bytes.Length)
        } else {
            [string]$outputText = $InputText
        }
        [pscustomobject]@{
            InputString = $InputText
            OutputString = $outputText
            InputObject = $InputObject
            IsLikelyCorrupted = $IsLikelyCorrupted
        }
    }
}

#demo of using a simple string without the function (this doesn't check whether the string looks corrupted first, so it is more likely to damage strings that were fine).
$x = 'StrÃ¸mmen'
$bytes = [System.Text.Encoding]::GetEncoding('Windows-1252').GetBytes($x)
[System.Text.Encoding]::UTF8.GetString($bytes,0,$bytes.Length)

#demo using the function
$x | Convert-CorruptCodePageString

#demo of checking all records in a table for an issue / reporting those with issues
#amend the SQL query, $MyDatabaseInstance, and $MyDatabaseCatalog to point to your DB / query the relevant table
Invoke-SQLQuery -Query 'Select [Description], [RecId] from [DimensionFinancialTag] where [Description] is not null and [Description] > ''''' -DbInstance $MyDatabaseInstance -DbCatalog $MyDatabaseCatalog |
    Convert-CorruptCodePageString -Property 'Description' |
    ?{$_.IsLikelyCorrupted} |
    ft @{N='RecordId';E={$_.InputObject.RecId}}, InputString, OutputString

#helper used by the table demo above; it must be defined before that demo is run
function Invoke-SQLQuery {
    [CmdletBinding(DefaultParameterSetName = 'ByQuery')]
    param (
        [Parameter(Mandatory = $true)]
        [string]$DbInstance
        ,
        [Parameter(Mandatory = $true)]
        [string]$DbCatalog
        ,
        [Parameter(Mandatory = $true, ParameterSetName = 'ByQuery')]
        [string]$Query
        ,
        [Parameter(Mandatory = $true, ParameterSetName = 'ByPath')]
        [string]$Path
        ,
        [Parameter(Mandatory = $false)]
        [hashtable]$Params = @{}
        ,
        [Parameter(Mandatory = $false)]
        [int]$CommandTimeoutSeconds = 30 #this is the SQL default
        ,
        [Parameter(Mandatory = $false)]
        [System.Management.Automation.Credential()]
        [System.Management.Automation.PSCredential]$Credential=[System.Management.Automation.PSCredential]::Empty
    )
    begin {
        write-verbose "Call to 'Invoke-SQLQuery'"
        $connectionString = ("Server={0};Database={1}" -f $DbInstance,$DbCatalog)
        if ($Credential -eq [System.Management.Automation.PSCredential]::Empty) {
            $connectionString = ("{0};Integrated Security=True" -f $connectionString)
        } else {
            $connectionString = ("{0};User Id={1};Password={2}" -f $connectionString, $Credential.UserName, $Credential.GetNetworkCredential().Password)
        }
        $connection = New-Object System.Data.SqlClient.SqlConnection
        $connection.ConnectionString = $connectionString
        $connection.Open()
    }
    process {
        #create the command & assign the connection
        $cmd = new-object -TypeName 'System.Data.SqlClient.SqlCommand'
        $cmd.Connection = $connection
        $cmd.CommandTimeout = $CommandTimeoutSeconds
        #load in our query
        switch ($PSCmdlet.ParameterSetName) {
            'ByQuery' {$cmd.CommandText = $Query; break;}
            'ByPath' {$cmd.CommandText = Get-Content -Path $Path -Raw; break;}
            default {throw "ParameterSet $($PSCmdlet.ParameterSetName) not recognised by Invoke-SQLQuery"}
        }
        #assign parameters as required
        #NB: these don't need declare statements in our query; so a query of 'select @demo myDemo' would be sufficient for us to pass in a parameter with name @demo and have it used
        #we can also pass in parameters that don't exist; they're simply ignored (sometimes useful if writing generic code that has optional params)
        $Params.Keys | %{$cmd.Parameters.AddWithValue("@$_", $Params[$_]) | out-null}
        $reader = $cmd.ExecuteReader()
        #each Load consumes one result set; the reader closes itself after the last one
        while (-not ($reader.IsClosed)) {
            $table = new-object 'System.Data.DataTable'
            $table.Load($reader)
            write-verbose "TableName: $($table.TableName)" #NB: table names aren't always available
            #-Property * is needed for -ExcludeProperty to take effect on Windows PowerShell 5.1 and earlier
            $table | Select-Object -Property * -ExcludeProperty RowError, RowState, Table, ItemArray, HasErrors
        }
    }
    end {
        $connection.Close()
    }
}
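
Convert-CorruptCodePageString defaults its source encoding to Windows-1252 rather than iso-8859-1, and that choice may be what the É case in the question needs. The C# sketch below illustrates the difference under one assumption that the thread does not confirm: that the database's Latin-1 column is really served as code page 1252, where byte 0x89 is '‰' (U+2030). `CodePageRepair` and `Repair` are hypothetical names. The `RegisterProvider` call is needed on .NET Core and .NET 5+; on .NET Framework code page 1252 is built in and that line can be dropped.

```csharp
using System;
using System.Text;

public static class CodePageRepair
{
    // Hypothetical helper: re-encode the mojibake with the encoding it was
    // misread as, then decode the recovered bytes as UTF-8.
    public static string Repair(string mojibake, Encoding misreadAs)
    {
        byte[] raw = misreadAs.GetBytes(mojibake);
        return Encoding.UTF8.GetString(raw);
    }

    public static void Main()
    {
        // Makes code page 1252 available on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        Encoding cp1252 = Encoding.GetEncoding(1252);

        // "Ã‰ric": the UTF-8 bytes C3 89 of 'É' as read through code page 1252.
        string stored = "\u00C3\u2030ric";

        // iso-8859-1 cannot encode U+2030, so the second byte is lost and
        // the UTF-8 decode no longer sees a valid C3 89 sequence.
        Console.WriteLine(Repair(stored, latin1));

        // Code page 1252 maps U+2030 back to 0x89, restoring C3 89.
        Console.WriteLine(Repair(stored, cp1252));
    }
}
```

With code page 1252 the helper returns Éric. With iso-8859-1 it does not, which is consistent with the �?ric result reported in the question.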