Sed: how do I replace UTF-8 characters with HTML entities?


I am running Cygwin under Windows 10.

I have a dictionary file (1-dictionary.txt) that looks like this:

labelling   labeling
flavour flavor
colour  color
organisations   organizations
végétales   v&#x00E9;g&#x00E9;tales
contrôlée   contr&#x00F4;l&#x00E9;e
"   &#x0022
The columns are separated by tabs (\t).

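The dictionary file is encoded as UTF-8.

I want to replace the words and symbols in the first column with the words and HTML entities in the second column.

My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.

The sample text looks like this: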

Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
I run the following sed one-liner in a shell script (./3-script.sh):

sed-f3-translation.txt

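Replacing the British English (en-GB) words with American English (en-US) words in 3-translation.txt succeeds.

However, replacing ASCII symbols (such as the quotation mark) and UTF-8 words produces the following result: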

vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbols (rather than whole words), I get the following result:

vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
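The ASCII quotation mark has #x0022 appended to it instead of being replaced.

Similarly, each UTF-8 symbol has its HTML entity appended to it rather than being replaced by it.

The expected output looks like this: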

v#x00E9;g#x00E9;tales
#x0022cultivated#x0022
contr#x00F4;l#x00E9;e

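How can I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with the HTML entity equivalents defined in the dictionary file?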

I tried it: simply replacing every & in 1-dictionary.txt with \& solves your problem.
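As a minimal sketch of why this works (the one-word input is only the word from the question, not the asker's actual script): in a sed replacement, a bare & stands for the whole matched text, while \& produces a literal ampersand.

```shell
# Unescaped: & expands to the matched text, so the entity is appended to "é".
unescaped=$(printf '%s\n' 'végétales' | sed 's/é/&#x00E9;/g')

# Escaped: \& is a literal ampersand, so "é" is replaced by the entity.
escaped=$(printf '%s\n' 'végétales' | sed 's/é/\&#x00E9;/g')

printf '%s\n' "$unescaped" "$escaped"
# vé#x00E9;gé#x00E9;tales
# v&#x00E9;g&#x00E9;tales
```

The first line of output has the same shape as the broken result in the question, which suggests the unescaped & in the dictionary is the cause.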

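Sed's substitute command uses a regular expression as the "from" part, so when you use it this way, watch out for regex metacharacters and put a \ in front of them to escape them.

The "to" part also has special characters, mainly \ and &; put an extra \ in front of those as well.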
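One way to apply both kinds of escaping automatically is to generate the sed script from the dictionary. The following is only a sketch under assumptions: the scratch directory, the file name rules.sed, and the four sample dictionary lines (including the made-up ... to &#x2026; entry) are hypothetical, and it assumes every dictionary line is exactly "from", a tab, then "to".

```shell
# Hypothetical scratch copies of the files; names mirror the question.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'colour\tcolor\n'  >  1-dictionary.txt
printf '...\t&#x2026;\n'  >> 1-dictionary.txt
printf 'é\t&#x00E9;\n'    >> 1-dictionary.txt
printf 'ô\t&#x00F4;\n'    >> 1-dictionary.txt
printf '%s\n' 'colour ... végétales contrôlée' > 2-source.txt

tab=$(printf '\t')

# Turn each "from<TAB>to" line into one s/from/to/g command.
while IFS="$tab" read -r from to; do
    # "from" side: escape basic-regex metacharacters and the / delimiter.
    from_esc=$(printf '%s\n' "$from" | sed 's|[][\\/.^$*]|\\&|g')
    # "to" side: escape \, / and & so they are inserted literally.
    to_esc=$(printf '%s\n' "$to" | sed 's|[\\/&]|\\&|g')
    printf 's/%s/%s/g\n' "$from_esc" "$to_esc"
done < 1-dictionary.txt > rules.sed

result=$(sed -f rules.sed 2-source.txt)
printf '%s\n' "$result"
# color &#x2026; v&#x00E9;g&#x00E9;tales contr&#x00F4;l&#x00E9;e
```

The generated rules run in dictionary order, so a later rule can still match text that an earlier rule inserted; entries such as a rule for & itself would need to come first.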


For other sed versions, you can also check man sed.

I tried it: simply replacing every & with \& in 1-dictionary.txt will solve your problem. Try it and see if it works.