Sed: How do I replace UTF-8 characters with HTML entities?
I'm running Cygwin under Windows 10.
I have a dictionary file (1-dictionary.txt) that looks like this:
labelling	labeling
flavour	flavor
colour	color
organisations	organizations
végétales	v&#x00E9;g&#x00E9;tales
contrôlée	contr&#x00F4;l&#x00E9;e
"	&#x0022
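The columns are separated by tabs (\t).
The dictionary file is encoded as UTF-8.
I want to replace the words and symbols in the first column with the words and HTML entities in the second column.
My source file (2-source.txt) contains the target UTF-8 and ASCII symbols. The source file is also encoded as UTF-8.
The sample text looks like this: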
Cultivar was coined by Bailey and it is generally regarded as a portmanteau of "cultivated" and "variety" ... The International Union for the Protection of New Varieties of Plants (UPOV - French: Union internationale pour la protection des obtentions végétales) offers legal protection of plant cultivars ...Terroir is the basis of the French wine appellation d'origine contrôlée (AOC) system
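I run the following sed one-liner in a shell script (./3-script.sh):
sed -f 3-translation.txt
Replacing British English (en-GB) words with American English (en-US) words in 3-translation.txt works.
However, replacing ASCII symbols (such as the quotation mark) and UTF-8 words produces the following result: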
vvégétales#x00E9;gvégétales#x00E9;tales)
contrcontrôlée#x00F4;lcontrôlée#x00E9;e (AOC)
If I use only the specific symbols (rather than the full words), I get results like this:
vé#x00E9;gé#x00E9;tales
"#x0022cultivated"#x0022
contrô#x00F4;lé#x00E9;e
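The ASCII quotation mark gets its entity appended to it; it is not replaced.
Similarly, each UTF-8 symbol gets its HTML entity appended instead of being replaced by it.
The expected output looks like this: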
v&#x00E9;g&#x00E9;tales
&#x0022cultivated&#x0022
contr&#x00F4;l&#x00E9;e
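How do I modify the sed script so that the target ASCII and UTF-8 symbols are replaced with the HTML entity equivalents defined in the dictionary file?

I tried it: simply replacing every & with \& in 1-dictionary.txt solves your problem.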
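One way to apply that fix is sketched below. It is only an illustration: the function name escape_ampersands and the output file name in the usage note are made up here, and the sketch assumes the dictionary contains no backslashes that would also need escaping.

```shell
#!/bin/sh
# Hypothetical helper: reads dictionary lines on stdin and writes them
# back with every '&' turned into '\&'.
# In the sed replacement, '\\' is a literal backslash and '&' is the
# matched text (here always a single '&').
escape_ampersands() {
    sed 's/&/\\&/g'
}

# Example on one tab-separated dictionary line.
printf 'é\t&#x00E9;\n' | escape_ampersands
```

Run against the real file, this might look like `escape_ampersands < 1-dictionary.txt > 1-dictionary-escaped.txt`, after which the translation file would be rebuilt from the escaped copy.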
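Sed's substitute command treats the "from" part as a regular expression, so when you use it this way, watch out for regex metacharacters and put a \ in front of them to escape them.
The "to" part has special characters as well, mainly \ and &; put an extra \ in front of those to escape them too.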
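The effect of & in the "to" part can be seen in a small experiment. This is a sketch using a single hard-coded word rather than the asker's files; the variable names are illustrative only.

```shell
#!/bin/sh
# Unescaped '&' in the replacement stands for the matched text, so the
# entity is appended after each 'é' and its own '&' disappears.
unescaped=$(printf '%s\n' 'végétales' | sed 's/é/&#x00E9;/g')

# Escaped '\&' is a literal ampersand, so each 'é' is replaced by the entity.
escaped=$(printf '%s\n' 'végétales' | sed 's/é/\&#x00E9;/g')

printf '%s\n' "$unescaped"   # vé#x00E9;gé#x00E9;tales
printf '%s\n' "$escaped"     # v&#x00E9;g&#x00E9;tales
```

The first line of output has the same shape as the unwanted result in the question, which suggests the unescaped & is what causes it.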
For other sed versions, you can also check man sed.
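Putting both escaping rules together, a tab-separated dictionary could be turned into a sed script along the following lines. This is a rough sketch, not the asker's actual one-liner (which the question does not show): the function name dict_to_sed is hypothetical, it assumes basic regular expressions with / as the delimiter, and it starts one sed process per column, so it is slow on large dictionaries.

```shell
#!/bin/sh
# Hypothetical helper: reads "from<TAB>to" lines on stdin and prints one
# 's/from/to/g' command per line, escaping both columns for sed.
dict_to_sed() {
    tab=$(printf '\t')
    while IFS="$tab" read -r from to; do
        # "from" column: escape BRE metacharacters and the '/' delimiter.
        from=$(printf '%s\n' "$from" | sed 's|[][\/.^$*]|\\&|g')
        # "to" column: escape '\', '&' and the '/' delimiter.
        to=$(printf '%s\n' "$to" | sed 's|[\/&]|\\&|g')
        printf 's/%s/%s/g\n' "$from" "$to"
    done
}

# Example: build a script from one dictionary line and apply it.
script=$(printf 'végétales\tv&#x00E9;g&#x00E9;tales\n' | dict_to_sed)
printf '%s\n' 'des obtentions végétales' | sed "$script"
```

With the files from the question, the generated commands would be written to 3-translation.txt (for example `dict_to_sed < 1-dictionary.txt > 3-translation.txt`) and then used with sed -f as before.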