File I/O: reading double-byte files (File IO, Tcl)

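I was wondering whether there is a simple way to read double-byte files (or at least I think that is what they are called) in Tcl. My problem is that the files look fine when I open them in Notepad (I'm on Win7), but when I read them in Tcl there is a space (or rather, a null character) between every character.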

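My current workaround is to first run string map to remove all the nulls:

string map {\0 {}} $file

and then process the information normally. Is there a simpler way to do this, through fconfigure or some other means?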

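I'm not familiar with encodings, so I'm not sure what argument I should use.

fconfigure $input -encoding double

of course fails, because double is not a valid encoding. The same goes for "double byte".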

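I'm actually working with large text files (over 2 GB) and applying the workaround line by line, so I believe it is slowing the processing down.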

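Edit: As @mhawke pointed out, the file is UTF-16-LE encoded, which apparently is not a supported encoding. Is there an elegant way around this shortcoming, perhaps through a proc? Or would that be more complicated than using string map?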
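
The input file is probably UTF-16 encoded, which is common on Windows. Try reading it with the unicode encoding.

You can get the list of available encodings with: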
% encoding names
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine gb2312 jis0201 euc-cn euc-jp iso8859-10 macThai iso2022-jp jis0208 macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania gb1988 iso2022-kr macTurkish macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 koi8-r iso8859-4 macCroatian ebcdic cp1250 iso8859-5 iso8859-6 macCyrillic cp1251 iso8859-7 cp1252 koi8-u macDingbats iso8859-8 cp1253 cp1254 iso8859-9 cp1255 cp850 cp932 cp1256 cp852 cp1257 identity cp1258 macJapan utf-8 shiftjis cp936 cp855 symbol cp775 unicode cp857
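
The suggestion above can be sketched as a small streaming converter. This is only a sketch under stated assumptions, not the answerer's code: it assumes Tcl 8.5/8.6, where the unicode encoding means 16-bit characters in the platform's native byte order (little-endian on a typical Windows PC, which matches a file starting with FF FE). The proc name utf16_to_utf8 and the 65536-character chunk size are made up for illustration.

```tcl
# Hypothetical helper (not from the original answer): stream a UTF-16 file
# to UTF-8 without ever holding the whole file in memory.
proc utf16_to_utf8 {infile outfile} {
    set in  [open $infile r]
    set out [open $outfile w]

    # "unicode" = 16-bit characters in native byte order (assumed to match
    # the file). -translation lf leaves CR and LF characters untouched.
    fconfigure $in  -encoding unicode -translation lf
    fconfigure $out -encoding utf-8   -translation lf

    # Drop a leading byte order mark (U+FEFF) if there is one.
    set first [read $in 1]
    if {$first ne "\uFEFF"} {
        puts -nonewline $out $first
    }

    # Copy in chunks of 65536 characters; the channels do the transcoding.
    while {![eof $in]} {
        puts -nonewline $out [read $in 65536]
    }

    close $out
    close $in
}
```

It would be called as utf16_to_utf8 infile.txt outfile.txt. Because the channel decodes whole 16-bit code units, characters outside the 00xx range should survive the conversion. Newer Tcl releases may list explicit names such as utf-16le in the output of encoding names, so it is worth checking there first.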

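I decided to write a small proc to convert the file. I'm using a while loop because reading a 3 GB file into a single variable would completely lock up the process. The comments make it look long, but it isn't that long.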
proc itrans {infile outfile} {
  set f [open $infile r]

  # Note: files I have been getting have CRLF, so I split on CR to keep the LF and
  # used -nonewline in puts
  fconfigure $f -translation cr -eof ""

  # Simple switch just to remove the BOM, since the result will be UTF-8
  set bom 0                              
  set o [open $outfile w]
  while {[gets $f l] != -1} {
    # Convert to binary where the specific characters can be easily identified
    binary scan $l H* l

    # Ignore empty lines
    if {$l == "" || $l == "00"} {continue}

    # If it is the first line, there's the BOM
    if {!$bom} {
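      set bom 1

      # Default to little-endian byte order when no BOM is found (an assumption;
      # without a default, re would be unset at the regsub below)
      set re "(..).."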

      # Identify and remove the BOM and set what byte should be removed and kept
      if {[regexp -nocase -- {^(?:FFFE|FEFF)} $l m]} {
        regsub -- "^$m" $l "" l

        if {[string toupper $m] eq "FFFE"} {
          set re "(..).."
        } elseif {[string toupper $m] eq "FEFF"} {
          set re "..(..)"
        }
      }
      regsub -all -- $re $l {\1} new
    } else {
      # Regardless of utf-16-le or utf-16-be, that should work since we split on CR
      regsub -all -- {..(..)|00$} $l {\1} new
    }
    puts -nonewline $o [binary format H* $new]
  }
  close $o
  close $f
}

itrans infile.txt outfile.txt

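A final warning: this will silently mangle any character that actually uses all 16 bits. For example, the code unit sequence 04 30 will lose the 04 and become 30 instead of becoming D0 B0, whereas 00 4D will correctly map to 4D. So before trying the above, make sure that you don't mind this or that your files contain no such characters.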
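
To make the warning concrete, the two examples can be checked with Tcl's own encoding convertto and binary scan commands. This is just an illustration; the proc name utf8_hex is hypothetical.

```tcl
# Hypothetical helper: hex dump of the UTF-8 encoding of a string.
proc utf8_hex {s} {
    binary scan [encoding convertto utf-8 $s] H* hex
    return $hex
}

# U+004D (code unit 00 4D): one UTF-8 byte, so dropping the 00 is harmless.
puts [utf8_hex "\u004D"]   ;# prints 4d

# U+0430 (code unit 04 30): two UTF-8 bytes. Dropping the 04 would leave
# the single byte 30, which is the digit "0".
puts [utf8_hex "\u0430"]   ;# prints d0b0
```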
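Doing this also gives me a lot of ? characters that are not in the original file. For example, the file contains the datetime value 26-MAR-2014 22:03:47, which becomes 26-MAR-2??????3:47. Maybe this helps identify the encoding? I also opened the file in a hex editor and the first two bytes are FF FE, if that helps.

0xFF 0xFE is a byte order mark (BOM) indicating that the file is encoded as UTF-16 with little-endian byte order, so the file should definitely be treated as UTF-16-LE. In Tcl, however, there seems to be no way to specify that explicitly: there is only "unicode" (which depends on the native platform), and no utf-16-le or utf-16-be encoding option.

That's useful! I realize I should read up on BOMs in more depth. I have stumbled on the UTF-8 version in the past and have always just removed it, because those files are not used outside a specific program. Anyway, this is a bit sad. I wonder whether there is a way around this shortcoming now. I have been receiving a lot of these files lately (as opposed to UTF-8 files). Thanks for your input.

Maybe you could preprocess the files, or call an external tool. iconv is good at converting files between encodings.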
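
The iconv suggestion could look roughly like this from Tcl. It is a sketch, assuming an iconv executable is on the PATH (it is not part of Tcl and is usually missing from a stock Windows install) and that it accepts the encoding names UTF-16 and UTF-8. The proc name iconv_to_utf8 is hypothetical.

```tcl
# Hypothetical wrapper around the external iconv tool (assumed to be on PATH).
proc iconv_to_utf8 {infile outfile {from UTF-16}} {
    # exec's "> file" redirection sends iconv's output straight to the file,
    # so the converted text never passes through a Tcl variable.
    exec iconv -f $from -t UTF-8 $infile > $outfile
}
```

It would be called as iconv_to_utf8 infile.txt outfile.txt. Common iconv implementations use the BOM to pick the byte order when the source encoding is plain UTF-16; the optional third argument is there in case a specific name such as UTF-16LE is needed.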