Force encode from US-ASCII to UTF-8 (iconv)

utf-8 character-encoding

I'm trying to transcode a bunch of files from US-ASCII to UTF-8.

For that, I'm using iconv:

iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php

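The thing is, my original files are US-ASCII encoded, so the conversion does nothing. Apparently that is because ASCII is a subset of UTF-8.

And quoting:

There's no need for the text file to appear otherwise until non-ASCII characters are introduced.

True. If I introduce a non-ASCII character into the file and save it, say with an editor, the file encoding (charset) switches to UTF-8.

In my case, I would like to force iconv to transcode the files to UTF-8 anyway, whether or not they contain non-ASCII characters.

Note: the reason is that my PHP code (non-ASCII files...) is dealing with some non-ASCII strings, which causes the strings to be misinterpreted (French):

BarillÃ© (Procidis), 1Ã¨re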

US-ASCII is a subset of UTF-8 (see the answers below), which means that US-ASCII files are in fact already encoded as UTF-8.

My problem came from somewhere else.

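ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" are exactly the same bytes. There is no difference between them, so there is no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using and transcode them properly.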
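The claim that the bytes are identical can be checked directly. The following is a minimal sketch, assuming iconv and cmp are available; the sample file content is a hypothetical stand-in for a real PHP file.

```shell
# Throwaway 7-bit ASCII sample; the file content is a hypothetical example.
tmpdir=$(mktemp -d)
printf '<?php echo "hello"; ?>\n' > "$tmpdir/file.php"

# Same invocation as in the question.
iconv -f US-ASCII -t UTF-8 "$tmpdir/file.php" > "$tmpdir/file-utf8.php"

# cmp -s is silent and exits 0 only if both files are byte-for-byte identical.
if cmp -s "$tmpdir/file.php" "$tmpdir/file-utf8.php"; then
    same_bytes=yes
else
    same_bytes=no
fi
echo "identical after conversion: $same_bytes"

rm -rf "$tmpdir"
```

If the input really is pure ASCII, the conversion is a byte-for-byte copy, which is why no tool can report a changed encoding afterwards.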
I think your files are not actually ASCII. Try:

iconv -f ISO-8859-1 -t UTF-8 file.php > file-utf8.php

I'm just guessing that you're actually using ISO-8859-1, which is popular for most European languages.
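To see what this conversion does at the byte level, and how the kind of garbling shown in the question can arise, here is a small sketch. It assumes od and iconv are available; the helper name to_hex and the sample string are illustrative only.

```shell
# Hex-dump stdin as one contiguous string of lowercase hex digits.
to_hex() {
    od -An -tx1 | tr -d ' \n'
}

# "Barillé" stored as ISO-8859-1: é is the single byte 0xE9 (octal \351).
latin1_hex=$(printf 'Barill\351' | to_hex)

# Correct conversion: 0xE9 becomes the two-byte UTF-8 sequence c3 a9.
utf8_hex=$(printf 'Barill\351' | iconv -f ISO-8859-1 -t UTF-8 | to_hex)

# Wrong source encoding: bytes that are already UTF-8 (c3 a9) are treated as
# two ISO-8859-1 characters and encoded again, which displays as "Ã©".
garbled_hex=$(printf 'Barill\303\251' | iconv -f ISO-8859-1 -t UTF-8 | to_hex)

echo "ISO-8859-1: $latin1_hex"
echo "UTF-8:      $utf8_hex"
echo "garbled:    $garbled_hex"
```

The third case is the reason to confirm the real source encoding before converting: iconv does not check whether the input is already UTF-8.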
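There is no difference between US-ASCII and UTF-8 for such files, so there is no need to reconvert them.
But here is a hint if you have trouble with special characters while recoding:
append //TRANSLIT to the target charset (the iconv man page documents this suffix for the target encoding).
Example:

iconv -f ISO-8859-1 -t UTF-8//TRANSLIT filename.sql > utf8-filename.sql
This helps me with strange kinds of quotation marks, which always break the charset re-encoding process.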
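Transliteration is easiest to observe when the target charset cannot represent a character at all. The sketch below converts a typographic apostrophe to ASCII; the sample text is hypothetical, and both the availability of //TRANSLIT and the replacement character chosen depend on the iconv implementation, so the code falls back to a marker string if the call fails.

```shell
# Sample text with a typographic apostrophe (U+2019, UTF-8 bytes e2 80 99).
# The text itself is a hypothetical example.
sample=$(printf 'It\342\200\231s')

# //TRANSLIT on the target asks iconv to approximate characters that the
# target charset cannot represent. Support varies by implementation, so
# fall back to a marker if the call fails.
ascii_out=$(printf '%s' "$sample" | iconv -f UTF-8 -t ASCII//TRANSLIT 2>/dev/null) \
    || ascii_out="(transliteration unavailable)"

echo "$ascii_out"
```

With glibc's iconv the apostrophe is typically replaced by a plain ASCII one; other implementations may substitute a different character.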
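Short answer

file only guesses at the file encoding and may be wrong (especially when special characters only appear late in a large file).

You can use hexdump to look at the bytes of non-7-bit-ASCII text and compare them against the code tables for common encodings (ISO 8859-*, UTF-8) to decide for yourself what the encoding is.

iconv will use whatever input/output encoding you specify, regardless of the contents of the file. If you specify the wrong input encoding, the output will be garbled.

Even after running iconv, file may not report any change, because of the limited way in which file attempts to guess the encoding. For a specific example, see my long answer.

7-bit ASCII (aka US-ASCII) is identical at the byte level to UTF-8 and to the 8-bit ASCII extensions (ISO 8859-*). So if your file only has 7-bit characters, you can call it UTF-8, ISO 8859-* or US-ASCII, because at the byte level they are all identical. It only makes sense to talk about UTF-8 and other encodings (in this context) once your file has characters outside the 7-bit ASCII range.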
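Whether a file really stays inside the 7-bit range can be checked without relying on a guess. This is a minimal sketch using only tr and wc; the function name count_non_ascii_bytes and the sample files are illustrative assumptions.

```shell
# Count bytes outside the 7-bit ASCII range (0x80-0xFF) in a file.
# tr deletes every byte in 0x00-0x7F (octal \000-\177); wc -c counts the rest.
count_non_ascii_bytes() {
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}

# Hypothetical sample files for illustration.
tmpdir=$(mktemp -d)
printf 'plain text\n'  > "$tmpdir/ascii.txt"
printf 'caf\351\n'     > "$tmpdir/latin1.txt"   # é as ISO-8859-1 (1 byte)
printf 'caf\303\251\n' > "$tmpdir/utf8.txt"     # é as UTF-8 (2 bytes)

for f in ascii.txt latin1.txt utf8.txt; do
    echo "$f: $(count_non_ascii_bytes "$tmpdir/$f") non-ASCII byte(s)"
done
```

A count of zero means the file is simultaneously valid US-ASCII, ISO 8859-* and UTF-8, so there is nothing to convert. Unlike a sampled guess, this reads the whole file.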

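Long answer

I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.

ASCII

First, the term ASCII is overloaded, and that leads to confusion.
7-bit ASCII only includes 128 characters (00-7F in hexadecimal, or 0-127 in decimal). 7-bit ASCII is also sometimes referred to as US-ASCII.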

$ pcregrep -no '[^\x00-\x7F]' source-file | head -n1
102321:�

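UTF-8

UTF-8 uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range is identical at the byte level whether it is encoded as UTF-8 or as 7-bit ASCII.

ISO 8859-* and other ASCII extensions

The term extended ASCII (or high ASCII) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters plus additional characters.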

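ISO 8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO 8859-1 includes characters such as Ö, é, ñ and ß for German and Spanish.
"Extension" means that ISO 8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, it is equivalent at the byte level to ASCII and UTF-8 encoded files. However, once you start dealing with characters beyond the first 128, the file is no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ASCII" file to be UTF-8 encoded.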

Detecting encoding with file

One lesson I learned today is that we can't trust file to always give a correct interpretation of a file's character encoding.

"The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations."

file looks for magic numbers in the file that hint at the type, but these can be wrong, and there is no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically, file has a series of tests that help it guess at the file type and encoding.
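Because file only guesses, a stricter check is sometimes useful. One option, sketched below under the assumption that iconv rejects invalid input (as glibc's does), is to ask iconv to decode the whole file as UTF-8; the function name is_valid_utf8 and the sample files are illustrative.

```shell
# Exit 0 if the whole file decodes as UTF-8, non-zero otherwise.
# iconv reads the entire input and fails on the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

tmpdir=$(mktemp -d)
printf 'M\326M\n'     > "$tmpdir/latin1.txt"  # lone d6 byte: not valid UTF-8
printf 'M\303\226M\n' > "$tmpdir/utf8.txt"    # c3 96: valid UTF-8 for Ö

for f in latin1.txt utf8.txt; do
    if is_valid_utf8 "$tmpdir/$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
```

Note that a pure ASCII file also passes this check, and passing only shows that the bytes form valid UTF-8, not that UTF-8 was the intended encoding.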

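In the source file, the first non-ASCII character is stored as the single byte d6, which is Ö in ISO 8859-1:

$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057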

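My file is a large CSV file. file reports it as US-ASCII encoded, although it contains bytes outside the 7-bit range (such as the d6 byte on line 102321).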
In output-file, the same position holds the two bytes c3 96, which is Ö in UTF-8:

$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057

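Prepending a UTF-8 Ö as a new first line leaves the ISO 8859-1 byte further down untouched, so test-file mixes both encodings:

$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000  c3 96 0a                                          |...|
00000003

$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

Converting that file from ISO 8859-1 fixes the later Ö (c3 96) but double-encodes the one on the first line (c3 83 c2 96):

$ iconv -f iso-8859-1 -t utf8 test-file > test-file-converted
$ head -n1 test-file-converted | hexdump -C
00000000  c3 83 c2 96 0a                                    |.....|
00000005
$ tail -n +102322 test-file-converted | head -n1 | hexdump -C -s85 -n2
00000055  c3 96                                             |..|
00000057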

With an ISO 8859-1 Ö (d6) on the first line instead, file detects ISO 8859-1 and reports UTF-8 after the conversion:

$ vim source-file
$ head -n1 test-file-2
�
$ head -n1 test-file-2 | hexdump -C
00000000  d6 0d 0a                                          |...|
00000003
$ tail -n +102322 test-file-2 | head -n1 | hexdump -C -s85 -n2
00000055  d6 4d                                             |.M|
00000057

$ file -b --mime-encoding test-file-2
iso-8859-1
$ iconv -f iso-8859-1 -t utf8 test-file-2 > test-file-2-converted
$ file -b --mime-encoding test-file-2-converted
utf-8

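Cutting the file so that it starts at the first line with a non-ASCII character also lets file detect the encoding:

$ first_special=$(pcregrep -o1 -n '()[^\x00-\x7F]' source-file | head -n1 | cut -d":" -f1)
$ tail -n +$first_special source-file > /tmp/source-file-shorter
$ file -b --mime-encoding /tmp/source-file-shorter
iso-8859-1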
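The pcregrep call above assumes PCRE's grep is installed. Where it is not, the same line number can be found with plain grep in the C locale. This is a sketch; the function name first_non_ascii_line and the sample file are hypothetical.

```shell
# Print the number of the first line containing a byte in 0x80-0xFF.
# LC_ALL=C makes grep treat the input as raw bytes; printf builds the
# bracket expression from octal escapes (\200-\377).
first_non_ascii_line() {
    LC_ALL=C grep -n "$(printf '[\200-\377]')" "$1" | head -n 1 | cut -d: -f1
}

# Hypothetical sample: the first non-ASCII byte (d6) is on line 3.
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthr\326e\nfour\351\n' > "$tmpdir/sample.csv"

first_special=$(first_non_ascii_line "$tmpdir/sample.csv")
echo "first non-ASCII line: $first_special"
```

For a pure ASCII file the function prints nothing, so an empty result means there is no such line.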

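The file man page lists a parameter that limits how many bytes are examined:

-P, --parameter name=value
    Set various parameter limits.

    Name    Default    Explanation
    bytes   1048576    max number of bytes to read from file

Setting it to the file size makes file scan the whole file:

file_to_check="myfile"
bytes_to_scan=$(wc -c < "$file_to_check")
file -b --mime-encoding -P bytes="$bytes_to_scan" "$file_to_check"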

$ printf '\xEF\xBB\xBF' > bom.txt # put a UTF-8 BOM char in new file
$ file bom.txt
bom.txt: UTF-8 Unicode text, with no line terminators
$ file plain-ascii.txt # our pure 7-bit ascii file
plain-ascii.txt: ASCII text
$ cat bom.txt plain-ascii.txt > plain-ascii-with-utf8-bom.txt # put them together into one new file with the BOM first
$ file plain-ascii-with-utf8-bom.txt
plain-ascii-with-utf8-bom.txt: UTF-8 Unicode (with BOM) text
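Since the byte order mark is what changes file's report here, it can be useful to test for it directly. The following sketch checks the first three bytes of a file; the function name has_utf8_bom and the sample files are illustrative assumptions.

```shell
# Exit 0 if the file starts with the UTF-8 byte order mark (ef bb bf).
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

tmpdir=$(mktemp -d)
printf 'plain ascii\n' > "$tmpdir/plain-ascii.txt"
# Prepend the BOM, as the cat command above does.
{ printf '\357\273\277'; cat "$tmpdir/plain-ascii.txt"; } > "$tmpdir/with-bom.txt"

for f in plain-ascii.txt with-bom.txt; do
    if has_utf8_bom "$tmpdir/$f"; then
        echo "$f: starts with a UTF-8 BOM"
    else
        echo "$f: no BOM"
    fi
done
```

Keep in mind that a BOM is three extra bytes at the start of the file, which some consumers (PHP scripts among them) will emit as output, so adding one only to change what file reports has side effects.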

iconv -f us-ascii -t utf-16 yourfile > yourfileinutf16

iconv -f utf-16le -t utf-8 yourfileinutf16 > yourfileinutf8

iconv -f old_format -t utf-8 input_file -o output_file

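#!/usr/bin/env bash
# Convert every file matching the name pattern in $1 to UTF-8, in place.
find . -name "${1}" | while IFS= read -r line; do
    echo "***************************"
    echo "Converting ${line}"
    # file only guesses the encoding, so check the reported value.
    encoding=$(file -b --mime-encoding "${line}")
    echo "Found Encoding: ${encoding}"
    # Replace the original only if iconv succeeded.
    iconv -f "${encoding}" -t "utf-8" "${line}" -o "${line}.tmp" && mv "${line}.tmp" "${line}"
done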

mkdir backup

for f in $(file -i *.sql | grep us-ascii | cut -d ':' -f 1); do iconv -f us-ascii -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done

for f in $(file -i *.sql | grep iso-8859-1 | cut -d ':' -f 1); do iconv -f iso-8859-1 -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done

mkdir backup 2>/dev/null; for f in $(file -i *.htm | grep -i us-ascii | cut -d ':' -f 1); do iconv -f "us-ascii" -t "utf-16" "$f" > "$f.tmp"; iconv -f "utf-16le" -t "utf-8" "$f.tmp" > "$f.utf8"; cp "$f" backup/; mv "$f.utf8" "$f"; rm "$f.tmp"; done; file -i *.htm
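The batch loops above trust the encoding reported by file. A more defensive variant, sketched here as an assumption rather than as part of any of the answers, skips files that already decode as UTF-8 and only then converts from an encoding supplied by the caller. The function name to_utf8_if_needed and the sample file are hypothetical.

```shell
# Convert a file to UTF-8 in place unless it already decodes as UTF-8.
# $1 = file, $2 = assumed source encoding (a guess the caller must supply).
to_utf8_if_needed() {
    if iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        echo "skipped"      # pure ASCII or already valid UTF-8
    elif iconv -f "$2" -t UTF-8 "$1" > "$1.tmp"; then
        mv "$1.tmp" "$1" && echo "converted"
    else
        rm -f "$1.tmp"
        echo "failed"
        return 1
    fi
}

# Hypothetical sample: "café" stored as ISO-8859-1.
tmpdir=$(mktemp -d)
printf 'caf\351\n' > "$tmpdir/a.txt"

first_run=$(to_utf8_if_needed "$tmpdir/a.txt" ISO-8859-1)
second_run=$(to_utf8_if_needed "$tmpdir/a.txt" ISO-8859-1)
echo "$first_run, then $second_run"
```

Running the function twice on the same file is safe: the second call sees valid UTF-8 and leaves the file alone, which avoids the double encoding that a blind second conversion would produce.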

vim -es '+set fileencoding=utf-8' '+wq!' file