Perl UTF-8/encoding: how to detect whether $str1 eq $str2 to avoid a MySQL round trip?

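The caller of my sub gives me a value, $new_value. I have SELECTed a value from the MySQL database into a scalar, $current_value. I can't figure out how to reliably detect whether the two are "the same". By that I mean: if I update the database record with $new_value, will that change the state of the database?

Boiling this down to its essence: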

#!/usr/bin/perl -w
use utf8;
use strict;
use Encode qw(encode);
my $str = 'æøå';
my $latin1 = encode('latin1', $str);

# This in fact doesn't die. They're eq
$str eq $latin1
    or die;
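
If I update a field in the MySQL database with $str and then select it back, I get one value: a UTF-8 encoded one. With $latin1, the database field ends up with a different value: a latin1/ISO-8859-1 encoded one.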
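
Why eq is true while the stored values differ can be seen without a database. The following is a minimal sketch, assuming (as the observations in this question suggest) that a handle without mysql_enable_utf8 passes on the scalar's internal byte buffer. bytes::length is used here only to peek at that buffer; it is not something to rely on in production code.

```perl
use strict;
use warnings;
use utf8;
use bytes ();                 # load bytes::length without enabling byte semantics
use Encode qw(encode);

my $str    = 'æøå';                     # character string, UTF8 flag on
my $latin1 = encode('latin1', $str);    # byte string E6 F8 E5, UTF8 flag off

# eq compares code points, so the two strings are equal ...
print "eq: ", ($str eq $latin1 ? "yes" : "no"), "\n";

# ... but their internal buffers have different lengths.
printf "internal bytes: str=%d latin1=%d\n",
    bytes::length($str), bytes::length($latin1);
```

The same three characters take six bytes in one scalar and three in the other, which is consistent with the two different values seen after the round trip.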

The original problem I was debugging updated a field in a table with CHARSET=latin1, but the symptom shows up just as well with a simple:

my $dbh = DBI->connect(
    "DBI:mysql:mysql",
    'user',
    'pass',
    # No, we don't have these options on our DB handles.
    # Introducing them now would cause (too) many regression issues
    # for us, because in other places values are latin1 encoded,
    # not UTF-8 encoded.
    # mysql_enable_utf8 => 1,
    # mysql_enable_utf8mb4 => 1
);
my $sth = $dbh->prepare('SELECT CONCAT(?)')
    or die;
$sth->execute($val);
my ($return_val) = $sth->fetchrow_array();
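
Since $str and $latin1 yield different values after the MySQL round trip, I want to detect that they are in fact not equal. So, assuming the current value in the database is a correctly Latin-1 encoded æøå, and I have SELECTed it into the $current_value scalar, my question boils down to writing: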

sub new_value_will_change_database {
    my ($current_value, $new_value) = @_;
    # How do I write this sub so that it returns true for $str
    # and false for $latin1 from above?
    ...
}
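
How do I do that? The only difference I can detect is that the UTF8 flag is on for $str but off for $latin1. However, I seem to remember that if I inspect the UTF-8 flag, my code is broken.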
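
The worry about the UTF8 flag can be illustrated with a short sketch. It uses only the built-in utf8::upgrade, utf8::downgrade and utf8::is_utf8 functions and shows that the flag can flip while the string, as far as eq is concerned, stays the same:

```perl
use strict;
use warnings;

my $bytes = "\xE6\xF8\xE5";          # UTF8 flag off

my $upgraded = $bytes;
utf8::upgrade($upgraded);            # same characters, flag now on

my $downgraded = $upgraded;
utf8::downgrade($downgraded);        # same characters, flag off again

printf "equal=%d flags=%d/%d/%d\n",
    ($bytes eq $upgraded        ? 1 : 0),
    (utf8::is_utf8($bytes)      ? 1 : 0),
    (utf8::is_utf8($upgraded)   ? 1 : 0),
    (utf8::is_utf8($downgraded) ? 1 : 0);
```

Any operation that upgrades a scalar along the way (for example concatenation with a flagged string) can therefore change what a flag-based check reports.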

A more complete debug script generates this output:

str          : val:��� mysql:æøå is_utf8:1 hex:E6F8E5 dumper0:"\x{e6}"
latin        : val:��� mysql:��� is_utf8:0 hex:E6F8E5 dumper0:'�'
utf8upgraded : val:��� mysql:æøå is_utf8:1 hex:E6F8E5 dumper0:"\x{e6}"

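I'm not sure you are printing what you want to print. I'm also not sure what you are trying to achieve. The two strings are the same; they are just represented differently.

If we print more information by changing the for loop as follows: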

foreach my $set (
                 [ 'str', $str ],
                 [ 'latin', $latin1 ],
                 [ 'utf8upgraded', $utf8upgraded ],
                ) {
    my ($disp, $val) = @$set;

    my $hex = $val;
    $hex =~ s/(.)/sprintf "%X", ord($1)/ge;

    my $dumper = Data::Dumper->new([substr $val, 0, 1])->Terse(1)->Dump;
    chomp $dumper;
    my $mysql = mysql_roundtrip($val);
    my $dumper_mysql = Data::Dumper->new([substr $mysql, 0, 1])->Terse(1)->Dump;
    (my $hex_mysql = $mysql) =~ s/(.)/sprintf "%X", ord($1)/ge;
    chomp $dumper_mysql;
    printf "%-13s: val  :%s is_utf8:%d hex:%s dumper0:%s\n" .
        "%-13s  mysql:%s is_utf8:%d hex:%s dumper1:%s\n",
    $disp, $val, is_utf8($val), $hex, $dumper,
    "",    $mysql, is_utf8($mysql), $hex_mysql, $dumper_mysql;
}
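
then we get output showing whether the result of the MySQL CONCAT is UTF-8, what its hex values are, and so on. To make it work properly, I then made the following additional changes (see other articles on how much fun Unicode and encodings are):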

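  • binmode STDOUT, ':utf8' so that Perl outputs the text correctly
  • utf8::decode($concat) in your mysql_roundtrip function to decode the text properly back into Perl's format

After doing this, val and mysql display the same thing, always æøå.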
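
The two changes above can be sketched in isolation. This is a minimal example, assuming the driver hands back raw UTF-8 bytes; the $from_db value is made up for illustration and no database is involved:

```perl
use strict;
use warnings;

# Encode output as UTF-8 so characters above 0x7F print correctly.
binmode STDOUT, ':encoding(UTF-8)';

my $from_db = "\xC3\xA6\xC3\xB8\xC3\xA5";   # assumed: UTF-8 bytes as fetched

my $text = $from_db;
utf8::decode($text)                          # bytes -> characters, in place
    or die "value is not valid UTF-8";

printf "bytes=%d chars=%d first=U+%04X\n",
    length($from_db), length($text), ord($text);
print "$text\n";
```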

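    When you try to place a character that is not in the Windows-1252 character set into an iso-latin-1 field, a question mark is inserted instead.

    So, assuming you send the text to the database correctly [1], the following will work: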

    sub will_change_db_virtual {
       my ($current_text, $new_text) = @_;
    
       state $re;
       if (!$re) {
          my $cp1252_charset = decode('cp1252', (join '', map chr, 0x00..0xFF), sub { "" });
          $re = qr/[^\Q$cp1252_charset\E]/;
       }
    
       $new_text =~ s/$re/?/g;
       return $new_text ne $current_text;
    }
    
    Test:

    Test output:

    current:61.62.63.64.E9.66.67 new:61.62.63.64.E9.66.67 changed? real:0 virtual:0 result:pass
    current:61.62.63.64.113.66.67 new:61.62.63.64.113.66.67 changed? real:1 virtual:1 result:pass
    current:61.62.63.64.3F.66.67 new:61.62.63.64.113.66.67 changed? real:0 virtual:0 result:pass
    current:0.1.2.3.4.5.6.7.8.9.A.B.C.D.E.F.10.11.12.13.14.15.16.17.18.19.1A.1B.1C.1D.1E.1F.20.21.22.23.24.25.26.27.28.29.2A.2B.2C.2D.2E.2F.30.31.32.33.34.35.36.37.38.39.3A.3B.3C.3D.3E.3F.40.41.42.43.44.45.46.47.48.49.4A.4B.4C.4D.4E.4F.50.51.52.53.54.55.56.57.58.59.5A.5B.5C.5D.5E.5F.60.61.62.63.64.65.66.67.68.69.6A.6B.6C.6D.6E.6F.70.71.72.73.74.75.76.77.78.79.7A.7B.7C.7D.7E.7F.20AC.FFFD.201A.192.201E.2026.2020.2021.2C6.2030.160.2039.152.FFFD.17D.FFFD.FFFD.2018.2019.201C.201D.2022.2013.2014.2DC.2122.161.203A.153.FFFD.17E.178.A0.A1.A2.A3.A4.A5.A6.A7.A8.A9.AA.AB.AC.AD.AE.AF.B0.B1.B2.B3.B4.B5.B6.B7.B8.B9.BA.BB.BC.BD.BE.BF.C0.C1.C2.C3.C4.C5.C6.C7.C8.C9.CA.CB.CC.CD.CE.CF.D0.D1.D2.D3.D4.D5.D6.D7.D8.D9.DA.DB.DC.DD.DE.DF.E0.E1.E2.E3.E4.E5.E6.E7.E8.E9.EA.EB.EC.ED.EE.EF.F0.F1.F2.F3.F4.F5.F6.F7.F8.F9.FA.FB.FC.FD.FE.FF new:0.1.2.3.4.5.6.7.8.9.A.B.C.D.E.F.10.11.12.13.14.15.16.17.18.19.1A.1B.1C.1D.1E.1F.20.21.22.23.24.25.26.27.28.29.2A.2B.2C.2D.2E.2F.30.31.32.33.34.35.36.37.38.39.3A.3B.3C.3D.3E.3F.40.41.42.43.44.45.46.47.48.49.4A.4B.4C.4D.4E.4F.50.51.52.53.54.55.56.57.58.59.5A.5B.5C.5D.5E.5F.60.61.62.63.64.65.66.67.68.69.6A.6B.6C.6D.6E.6F.70.71.72.73.74.75.76.77.78.79.7A.7B.7C.7D.7E.7F.20AC.FFFD.201A.192.201E.2026.2020.2021.2C6.2030.160.2039.152.FFFD.17D.FFFD.FFFD.2018.2019.201C.201D.2022.2013.2014.2DC.2122.161.203A.153.FFFD.17E.178.A0.A1.A2.A3.A4.A5.A6.A7.A8.A9.AA.AB.AC.AD.AE.AF.B0.B1.B2.B3.B4.B5.B6.B7.B8.B9.BA.BB.BC.BD.BE.BF.C0.C1.C2.C3.C4.C5.C6.C7.C8.C9.CA.CB.CC.CD.CE.CF.D0.D1.D2.D3.D4.D5.D6.D7.D8.D9.DA.DB.DC.DD.DE.DF.E0.E1.E2.E3.E4.E5.E6.E7.E8.E9.EA.EB.EC.ED.EE.EF.F0.F1.F2.F3.F4.F5.F6.F7.F8.F9.FA.FB.FC.FD.FE.FF changed? real:1 virtual:1 result:pass
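
Sending text "correctly" here means encoding it for the connection before handing it to execute. A minimal sketch, assuming the connection uses cp1252; encode_for_connection is a hypothetical helper name, not part of DBI or DBD::mysql:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn Perl text into bytes for a cp1252 connection.
# With Encode's default CHECK, unmappable characters become "?".
sub encode_for_connection {
    my ($text) = @_;
    return encode('cp1252', $text);
}

my $bytes = encode_for_connection("abcd\x{113}fg");   # U+0113 is not in cp1252

# Print the bytes in the same dotted hex style as the test output above.
print join('.', map { sprintf '%X', ord } split //, $bytes), "\n";
```

The result would then be passed on as $sth->execute(encode_for_connection($new_text)). Note that Encode's own substitution is only an approximation of what MySQL does with the undefined cp1252 code points.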
    

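  • [1] In your example, you aren't doing that: you don't encode the values passed to execute according to the encoding used by the connection.
  • Judging by the range of characters MySQL stores, a column configured as Latin1 is actually encoded as CP1252. UTF-8 input is converted to CP1252. Characters that are not defined in CP1252 become question marks. The code points [\x81\x8D\x8F\x90\x9D] are stored unchanged.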

    The simplest way to predict the result is to implement the same behaviour in a subroutine predict(). It can help detect the following cases:

    my $predict = predict($new_string);
    if ($new_string ne $predict) {
      print "WARN: $new_string will not store correctly in DB\n";
    }
    elsif ($existing_db_string ne $predict) {
      print "INFO: $new_string will change DB string\n";
    }
    
    Round-trip test over a large range of characters:

    #!/usr/bin/perl
    
    use utf8;
    use open ':std', ':encoding(UTF-8)';  # Terminal uses UTF-8.
    
    use strict;
    use warnings;
    use 5.010;
    
    use DBI;
    use Encode qw( decode encode encode_utf8 );
    
    sub mysql_roundtrip {
      my ($val) = @_;
    
      my $dbh = DBI->connect(
        'DBI:mysql:database=testlat;host=192.168.1.3;port=3306',
        'userid',
        'passwd',
        {
            PrintError => 1,
            AutoCommit => 1,
            RaiseError => 1,
            mysql_enable_utf8 => 1,
            mysql_enable_utf8mb4 => 1, 
        }
      ) or die $DBI::errstr;
    
      my $sql = 'UPDATE testlat SET name = ? WHERE id = 1;';
      my $dbz = $dbh->do($sql, undef, encode_utf8($val));
    
      my ($got) = $dbh->selectrow_array('SELECT name FROM testlat WHERE id=1');
    
      return $got;
    }
    
    sub predict {
      my $uni_string = shift;
    
      my @chars = split(//,$uni_string);
      my @predict;
      for my $char (@chars) {
        if ($char =~ /[\x81\x8D\x8F\x90\x9D]/) {
          push @predict, $char;
        }
        else {
          my $predict = decode('CP1252',encode('CP1252',$char));
          if ($predict ne $char) { $predict = '?'; }
          push @predict, $predict;
        }
      }
      return join('',@predict);
    }
    
    my $fails = 0;
    print "*** test via database \n";
    for my $number (0x00..0x2122) {
      my $uni_char = chr($number);
      my $predict = predict($uni_char);
    
      my $got = mysql_roundtrip($uni_char);
    
      if ($predict ne $got) {
        $fails++;
        printf("FAIL uni:%.4X predict:%.4X got:%.4X\n",
          $number,
          ord($predict),
          ord($got)
        );
      }
    }
    
    print "FAILS: $fails\n";
    
    Output:

    $ perl utf8_latin1_mysql_test2.pl
    *** test via database 
    FAILS: 0
    

    This passes the test for code points 0x00..0x2122 and is expected to work across the whole Unicode range.
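
The prediction can also be exercised without a database. The sketch below repeats predict() from the script above in slightly condensed form and wraps the WARN/INFO decision in a hypothetical classify() helper; the sample strings are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# predict() as in the answer above, condensed.
sub predict {
    my ($uni_string) = @_;
    my @predict;
    for my $char (split //, $uni_string) {
        if ($char =~ /[\x81\x8D\x8F\x90\x9D]/) {
            push @predict, $char;                 # stored unchanged
        }
        else {
            my $round = decode('CP1252', encode('CP1252', $char));
            push @predict, ($round eq $char ? $round : '?');
        }
    }
    return join '', @predict;
}

# Hypothetical wrapper around the WARN/INFO logic shown earlier.
sub classify {
    my ($existing_db_string, $new_string) = @_;
    my $predict = predict($new_string);
    return 'lossy'   if $new_string ne $predict;          # will not store correctly
    return 'changes' if $existing_db_string ne $predict;  # will change the DB string
    return 'unchanged';
}

print classify("abcd\x{E9}fg", "abcd\x{E9}fg"),  "\n";
print classify("abcd\x{E9}fg", "abcd\x{113}fg"), "\n";
print classify("abc",          "abd"),           "\n";
```

This only checks the Perl side of the prediction; whether it matches a particular MySQL setup still needs a round-trip test like the one above.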

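    Here is the solution we're going to use (at least for now):

    Thanks to @ikegami and @Helmutwolmersdorfer for your input on this question. You both suggest the following options for $dbh: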

    mysql_enable_utf8 => 1,     # Decodes string received from the DB.
    mysql_enable_utf8mb4 => 1,  # Sets the encoding used for the connection.
    
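    As I pointed out, this would cause an unpredictable number of regressions in our code base, because the handle is shared by many libraries.

    The advantage of mysql_enable_utf8 => 1 is obvious: the Perl code sends correctly encoded UTF-8 data to MySQL, which converts it to Latin1 (CP1252) and puts it in the database. We are guaranteed that the data is stored correctly, and we can work with UTF-8 in Perl without caring about the Latin-ness of the database.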

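    There are also disadvantages: any invalid UTF-8 data will be rejected by DBI or DBD::mysql (I'm not clear which), and my tests also show that MySQL refuses to store data that is invalid for a Latin1 (CP1252) table. So we would need to be more explicit about encoding data before sending it to the database, which might actually be a good thing.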
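
Being more explicit could start with checking that a byte string is valid UTF-8 before it reaches the handle. This is a rough sketch using Encode's strict 'UTF-8' decoder; is_valid_utf8 is a hypothetical helper, and whether its notion of "valid" matches what DBD::mysql or MySQL accept is an assumption that would need testing:

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $bytes decodes as strict UTF-8.
# LEAVE_SRC keeps decode() from modifying its argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

printf "utf8 bytes: %d, latin1 bytes: %d\n",
    is_valid_utf8("\xC3\xA6\xC3\xB8\xC3\xA5"),   # 'æøå' encoded as UTF-8
    is_valid_utf8("\xE6\xF8\xE5");               # 'æøå' encoded as Latin-1
```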

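    The behaviour with mysql_enable_utf8 => 0 seems very strange. It looks as if, when the UTF-8 flag is set on the Perl scalar, the data is UTF-8 encoded; otherwise the data is left in Perl's internal encoding (ISO-8859-1/Latin1). That data is then sent to MySQL and stored in the Latin1 table, regardless of whether it is valid CP1252 data. With mysql_enable_utf8 => 0 I was able to store every character in 0x00-0xFF without problems, even though some of them are not valid CP1252 characters.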

    If anyone can find a failing test to add to @tests, please let me know.

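    The task at hand in the original post is to predict whether a given scalar, if handed to MySQL for an update, will change the value in the database. sub new_value_will_change_database does exactly that, without changing the attributes of $dbh. That is why I prefer this solution as the answer to the original post.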

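    I agree that the better technical solution is to go the mysql_enable_utf8 => 1 route, but it is also a worse business decision, because of the effort involved in resolving the (potential) regressions.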

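    I'm not worried about printing debug data with the correct encoding. What I care about is setting values in the database with the correct encoding, and being able to detect in Perl whether an input value will change them.

    The solution code:
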
    sub mysql_value_latin1 {
        my ($val) = @_;
        # See text - this looks strange - but works!
        if (is_utf8($val)) {
            $val = encode('utf8', $val);
        } else {
            $val = encode('latin1', $val);
        }
        return $val;
    }
    
    sub new_value_will_change_database {
        my ($current_value, $new_value) = @_;
        my $mysql_new_value = mysql_value_latin1($new_value);
        return $current_value ne $mysql_new_value;
    }
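
To see what these two subs return for the strings from the question, here is a self-contained sketch that runs without a database. The subs are copied from above; $current_value is an assumed stand-in for the Latin-1 bytes that would be selected from the latin1 column:

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode is_utf8);

# Copied from the solution above.
sub mysql_value_latin1 {
    my ($val) = @_;
    return is_utf8($val) ? encode('utf8', $val) : encode('latin1', $val);
}

sub new_value_will_change_database {
    my ($current_value, $new_value) = @_;
    return $current_value ne mysql_value_latin1($new_value);
}

my $str    = 'æøå';
my $latin1 = encode('latin1', $str);

# Assumption: the column currently holds the Latin-1 bytes E6 F8 E5.
my $current_value = "\xE6\xF8\xE5";

printf "str changes db: %d\n",
    new_value_will_change_database($current_value, $str) ? 1 : 0;
printf "latin1 changes db: %d\n",
    new_value_will_change_database($current_value, $latin1) ? 1 : 0;
```

This matches the requirement stated in the question: true for $str, false for $latin1. It does depend on the UTF8 flag, so it inherits the fragility of any flag-based check.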
    
    The more complete debug script:

    #!/usr/bin/perl -w
    
    use utf8;
    use strict;
    use feature qw(:5.10);
    
    use Encode qw(encode is_utf8);
    use DBI;
    use Data::Dumper;
    
    my $str = 'æøå';
    
    my $latin1 = encode('latin1', $str);
    
    my $utf8upgraded = $latin1;
    utf8::upgrade($utf8upgraded);
    
    # $str, $latin1 and $utf8upgraded are all eq each other:
    $str eq $latin1
        or die;
    $str eq $utf8upgraded
        or die;
    $latin1 eq $utf8upgraded
        or die;
    
    my $dbh = DBI->connect(
        "DBI:mysql:mysql",
        'user',
        'pass',
    );
    
    $dbh->do(q(
        CREATE TEMPORARY TABLE test (
            name VARCHAR(255) DEFAULT NULL
        ) CHARSET=latin1;
    ));
    $dbh->do(q(
        INSERT INTO test (name) VALUES ('');
    ));
    
    sub mysql_roundtrip_convert {
        my ($val) = @_;
        my $sth = $dbh->prepare('SELECT CONVERT(? USING LATIN1)');
        $sth->execute($val);
        my ($concat) = $sth->fetchrow_array();
        return $concat;
    }
    
    sub mysql_roundtrip_column {
        my ($val) = @_;
        my $updateSth = $dbh->prepare('update test set name=?');
        $updateSth->execute($val);
        my $getSth = $dbh->prepare('select name from test');
        $getSth->execute();
        my ($value) = $getSth->fetchrow_array();
        return $value;
    };
    
    sub mysql_roundtrip {
        my ($val) = @_;
        # Check that these two are the identical:
        my $column = mysql_roundtrip_column($val);
        my $convert = mysql_roundtrip_convert($val);
        $column eq $convert
            or die "column ne convert";
        return $column;
    }
    
    sub mysql_value_latin1 {
        my ($val) = @_;
        # See text - this looks strange - but works!
        if (is_utf8($val)) {
            $val = encode('utf8', $val);
        } else {
            $val = encode('latin1', $val);
        }
        return $val;
    }
    
    sub new_value_will_change_database {
        my ($current_value, $new_value) = @_;
        my $mysql_new_value = mysql_value_latin1($new_value);
        return $current_value ne $mysql_new_value;
    }
    
    my @tests = (
        [ 'str', $str ],
        [ 'latin', $latin1 ],
        [ 'utf8upgraded', $utf8upgraded ],
        map { [ 'char ' . $_ , 'char' . chr($_) ] } ( 0x00 .. 0xFF ),
    );
    
    foreach (@tests) {
        my ($disp, $val) = @$_;
        my $mysql_roundtrip = mysql_roundtrip($val);
        my $mysql_value_latin1 = mysql_value_latin1($val);
        $mysql_value_latin1 eq $mysql_roundtrip
            or die "mysql_value_latin1 ne mysql_roundtrip";
    }
    print "All tests are fine\n";