Using enwik9 in PowerShell 7 (substrings of UTF-8 multibyte characters)


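I am using PowerShell 7 to process the Wikipedia enwik9 1 GB UTF-8 text file. I have no experience with Unicode/UTF-8. I have captured the offsets and values into a dict, and when I use the code below and increment with $i++, they appear to come in groups of 2, 4 and 6.

Is $line.Length valid for this string?

If $i lands on a multibyte character, is it still valid when the loop moves to the next iteration?

How do I know how many "chars" this code point contains? Is it Substring($i, 1), Substring($i, 2) or Substring($i, 6)?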
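The difference between "characters" and bytes asked about above can be checked directly. The following is a minimal sketch, not part of the original post: it assumes a small hypothetical sample string ($sample) instead of enwik9, and compares String.Length (which counts UTF-16 code units in .NET) with the number of bytes the same text occupies in UTF-8.

```powershell
# Hypothetical sample: 'a', U+00E9, U+20AC (euro sign) and U+1F600 (outside the BMP),
# built from code points so the script file's own encoding does not matter.
$sample = 'a' + [char]0x00E9 + [char]0x20AC + [char]::ConvertFromUtf32(0x1F600)

# .Length counts UTF-16 code units ([char] values), not bytes and not code points.
$charCount = $sample.Length                                        # 1 + 1 + 1 + 2 = 5

# The same text encoded as UTF-8 needs 1 + 2 + 3 + 4 bytes.
$byteCount = [System.Text.Encoding]::UTF8.GetByteCount($sample)    # 10

# A single BMP character such as the euro sign is one [char] but three UTF-8 bytes.
$euroChars = ([string][char]0x20AC).Length                                    # 1
$euroBytes = [System.Text.Encoding]::UTF8.GetByteCount([string][char]0x20AC)  # 3

"chars=$charCount bytes=$byteCount euroChars=$euroChars euroBytes=$euroBytes"
```

So once the text has been decoded into a .NET string, Substring($i, 1) always returns exactly one [char], however many bytes that character used in the file.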

I was able to answer my own question and found a working solution based on the information on this page:

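[int][char]::MaxValue gives 65535 (i.e. 0xFFFF), so (1) [byte][char]$s may fail, because [byte]::MaxValue is 255 (i.e. 0xFF), and (2) $line.Substring($i, 1) may be one half of a surrogate pair for a character above the Unicode BMP.

… the result from Substring would work automatically. I had to process the first Unicode byte by stripping its control bits, read the correct number of following Substring characters and reset their control bits as well, combine the 2 binary numbers into one string, convert that to hex, and then use $uni = '\u' + $hex to convert to Unicode: $uc = [regex]::Unescape($uni)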
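The two limits mentioned above, and the surrogate-pair case, can be illustrated with a short sketch. It is not from the original post. The string $sample and the list $codePoints are hypothetical names, and the loop is only one possible way to step through a string by code point, using the .NET methods [char]::IsHighSurrogate and [char]::ConvertToUtf32.

```powershell
$maxChar = [int][char]::MaxValue   # 65535 (0xFFFF): a [char] is one UTF-16 code unit
$maxByte = [int][byte]::MaxValue   # 255 (0xFF): so [byte][char]$s fails for anything above 0xFF

# Hypothetical sample: 'A', then U+1F600 (stored as a surrogate pair), then 'B'.
$sample = 'A' + [char]::ConvertFromUtf32(0x1F600) + 'B'

$codePoints = [System.Collections.Generic.List[int]]::new()
for ($i = 0; $i -lt $sample.Length; $i++) {
    if ([char]::IsHighSurrogate($sample[$i]) -and (($i + 1) -lt $sample.Length)) {
        # Two [char] values form one code point: read both and skip the second half.
        $codePoints.Add([char]::ConvertToUtf32($sample, $i))
        $i++
    }
    else {
        $codePoints.Add([int]($sample[$i]))
    }
}

"Length=$($sample.Length) codePoints=$($codePoints.Count)"
```

Here $sample.Length is 4 while only 3 code points are collected, which is the situation where Substring($i, 1) returns half of a character.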

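@johnj01201, in future, please ask only one question per post, not 3. ;-)

This works for 2-byte sequences, but the file I am working with has longer Unicode sequences. For example, I get 0xE282AC, which throws an out-of-range error. I found it here: it is the euro symbol. I have not found a way to display this Unicode character in PowerShell using any of the examples and functions on the Internet. I confirmed that the hex is valid by pasting it into an online hex-to-Unicode converter.

Being new to UTF-8 and Unicode, I think I was trying to do work that PowerShell already does automatically. The recurring problem was that a [char] was sometimes 3 bytes long and I could not use it. For example, Substring($line, $i, 1) could be 3 bytes long instead of 1!

In the end, the solution was to change how the file is loaded. Now Substring works correctly for every Unicode character. $text = (Get-Content 'enwik9.txt' -Raw) was changed to $text = (Get-Content 'enwik9.txt' -Raw -Encoding UTF8). Now all symbols display correctly, each with a length of 1 [char].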
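As an illustration of the point above, the bytes 0xE2 0x82 0xAC can be decoded without any manual bit handling. The sketch below is an assumption-laden example rather than the poster's code. It uses the .NET UTF8 encoding object directly, then writes a tiny temporary file (a stand-in for enwik9.txt) and reads it back with the -Encoding UTF8 switch mentioned in the comment.

```powershell
# The three UTF-8 bytes reported in the comment (0xE282AC).
[byte[]]$euroBytes = 0xE2, 0x82, 0xAC

# Decoding the bytes as UTF-8 yields a single [char], U+20AC.
$euro = [System.Text.Encoding]::UTF8.GetString($euroBytes)
$euroLength = $euro.Length                       # 1
$euroHex = '{0:X4}' -f [int]($euro[0])           # 20AC

# Hypothetical stand-in for enwik9.txt: a temp file holding 'A', the euro bytes, 'B'.
$tmp = [System.IO.Path]::GetTempFileName()
try {
    [System.IO.File]::WriteAllBytes($tmp, [byte[]](0x41, 0xE2, 0x82, 0xAC, 0x42))
    $fromFile = Get-Content -LiteralPath $tmp -Raw -Encoding UTF8
}
finally {
    Remove-Item -LiteralPath $tmp
}

"euroLength=$euroLength euroHex=$euroHex fileLength=$($fromFile.Length)"
```

The file holds 5 bytes but the decoded string has 3 characters, and $fromFile.Substring(1, 1) is the euro sign as one [char].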

$text = (Get-Content 'enwik9.txt' -Raw)
$line = $text.Substring(0, 10000000)  # $i is not set yet at this point, so start at offset 0
for ($i = 0; $i -lt $line.Length; $i++) {
    $total_cnt++
    $s = $line.Substring($i, 1)
 
    $n = [int][CHAR]$s #I wanted [byte][char] here
    if ($n -ge 128) {
    # Now $n is not what I want because it is not ASCII and > 255 a Unicode\multibyte character
    }
}

Clear-Host

Write-Host 'Loading enwik9.txt'
$text = (Get-Content 'enwik9.txt' -Raw)
Write-Host 'Load Complete - processing...'
# NOTE: this manual decoder assumes that each [char] in $line holds one raw UTF-8 byte (0-255),
# i.e. that the file was NOT decoded as UTF-8 when it was loaded.
$line = $text.Substring(0, 10000000)   # $i is not set yet here, so start at offset 0
for ($i = 0; $i -lt $line.Length; $i++)
{
    $total_cnt++

    $uni = ''
    $s = $line.Substring($i, 1)
    $n = [int][char]$s

    if ($n -ge 128)
    {
        # How many continuation bytes follow this lead byte?
        $ns = 0
        $n = $n - 128                                                   # clear the 8th (high) control bit
        $b7 = $n -band 64; if ($b7 -eq 64) { $ns = 1; $n = $n - 64 }    # strip the remaining control bits
        $b6 = $n -band 32; if ($b6 -eq 32) { $ns = 2; $n = $n - 32 }    # was "$ns-2", which never assigned
        $b5 = $n -band 16; if ($b5 -eq 16) { $ns = 3; $n = $n - 16 }
        $t = [Convert]::ToString($n, 16).PadLeft(2, '0')   # convert int to hex
        $bin = [Convert]::ToString($n, 2)                   # payload bits of the lead byte

        Write-Host 'Found a Unicode start byte $ns='$ns ' $n='$n
        for ($c = 1; $c -le $ns; $c++)
        {
            $i++; $total_cnt++            # remember to advance the main loop index into $line
            $s = $line.Substring($i, 1)   # read the next string char
            $n = [int][char]$s            # convert to int

            # Every continuation byte must look like 10xxxxxx
            if (($n -band 192) -ne 128)
            {
                Write-Host 'NOT A CONTINUATION BYTE $ns='$ns
            }

            $n = $n -band 63   # keep only the 6 payload bits (applies to every continuation byte)

            $t = [Convert]::ToString($n, 16).PadLeft(2, '0')           # convert int to hex
            $bin = $bin + [Convert]::ToString($n, 2).PadLeft(6, '0')   # each continuation byte adds exactly 6 bits
            $number = [Convert]::ToInt32($bin, 2)                      # convert to int
            $hex = [Convert]::ToString($number, 16).PadLeft(4, '0')
            Write-Host '$s='$s ' $n='$n ' $t='$t ' $bin='$bin ' $hex='$hex
        }

        $uc = ''
        if ($ns -eq 0) { Write-Host 'SINGLE BYTE'; Read-Host 'ENTER' }
        else { $uni = '\u' + $hex; $uc = [regex]::Unescape($uni) }   # \uXXXX takes exactly 4 hex digits, so BMP only

        Write-Host 'FINAL: Unicode is: '$uc
        Read-Host "press ENTER to find and process next unicode character"
    }
}
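For comparison, the bit manipulation in the script above can be written more compactly with masks and shifts on real bytes. This is a hedged sketch, not the poster's method: the function names Get-Utf8SequenceLength and ConvertTo-CodePoint are hypothetical, the input is assumed to be a [byte[]] rather than [char] values, and only minimal validation is done (overlong forms and bad continuation bytes are not rejected).

```powershell
function Get-Utf8SequenceLength {
    # Returns the total number of bytes in the sequence that starts with $LeadByte,
    # or 0 if the byte is a continuation byte (10xxxxxx) or otherwise not a valid lead byte.
    param([byte]$LeadByte)
    if     (($LeadByte -band 0x80) -eq 0x00) { return 1 }   # 0xxxxxxx
    elseif (($LeadByte -band 0xE0) -eq 0xC0) { return 2 }   # 110xxxxx
    elseif (($LeadByte -band 0xF0) -eq 0xE0) { return 3 }   # 1110xxxx
    elseif (($LeadByte -band 0xF8) -eq 0xF0) { return 4 }   # 11110xxx
    else { return 0 }
}

function ConvertTo-CodePoint {
    # Combines the payload bits of one UTF-8 sequence into an integer code point.
    param([byte[]]$Bytes)
    $len = Get-Utf8SequenceLength $Bytes[0]
    if ($len -eq 0 -or $Bytes.Length -lt $len) { throw 'Invalid or truncated UTF-8 sequence' }
    if ($len -eq 1) { return [int]$Bytes[0] }

    # Lead byte payload: the low (7 - $len) bits.
    [int]$cp = $Bytes[0] -band (0xFF -shr ($len + 1))
    for ($k = 1; $k -lt $len; $k++) {
        # Each continuation byte contributes its low 6 bits.
        $cp = ($cp -shl 6) -bor ($Bytes[$k] -band 0x3F)
    }
    return $cp
}

$euroCp  = ConvertTo-CodePoint -Bytes ([byte[]](0xE2, 0x82, 0xAC))         # 8364 (0x20AC)
$emojiCp = ConvertTo-CodePoint -Bytes ([byte[]](0xF0, 0x9F, 0x98, 0x80))   # 128512 (0x1F600)

# [char]::ConvertFromUtf32 accepts code points above 0xFFFF, unlike a '\uXXXX' escape.
$emoji = [char]::ConvertFromUtf32($emojiCp)
"euro=0x{0:X} emoji=0x{1:X} emojiChars={2}" -f $euroCp, $emojiCp, $emoji.Length
```

Using [char]::ConvertFromUtf32 instead of building a '\u' escape avoids the four-hex-digit limit, so 4-byte sequences come out as a proper surrogate pair.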