XSLT transformation converts UTF-8 characters to a different encoding


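This problem occurs intermittently: I have run many XSLT transformations without it, and then it suddenly appeared in my most recent one.

I have a large number of HTML input files whose structure resembles the following a.html: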

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">əˈdres,ˈaˌdres</div>
    </div>
    <div class="a">...</div>
  </body>
</html>

I transform the HTML files with an XSLT stylesheet similar to the following a.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
 <xsl:output omit-xml-declaration="yes" indent="yes" encoding="UTF-8" />
 <xsl:strip-space elements="*" />

 <xsl:template match="@*|node()" >
  <xsl:copy>
   <xsl:apply-templates select="@*|node()" />
  </xsl:copy>
 </xsl:template>

 <xsl:template match="div[@class='a']" >
  <xsl:apply-templates select="*|node()" />
 </xsl:template>

</xsl:stylesheet>

The more complete bash script looks like this:

#!/bin/bash
xsltproc --html a.xslt a.html \
| hxnormalize -x -l 1024 \
| sed '/^$/d' \
> b.html

I get the following result b.html:

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">ÉËdres,ËaËdres</div>
    </div>
    ...
  </body>
</html>

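How can I prevent the XSLT transformation from changing the characters from one encoding to another?

Update 1

The problem went away when I removed the "--html" option from the xsltproc command. However, I still do not know why.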

#!/bin/bash
xsltproc a.xslt a.html > b.html

Update 2

It seems that the input file is being interpreted as ASCII or ISO-8859-1 rather than UTF-8. I inserted the following header into the input a.html:

  <head>
    <meta charset="UTF-8">
    <meta http-equiv="content-type" content="text/html">
  </head>

However, the output b.html is still the same.

Update 3

I have updated a.xslt to the following:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes" />
 <xsl:strip-space elements="*"/>

 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>

</xsl:stylesheet>

Note the different xsl:output line.

This produces a b.html with the same problem, but its first line now contains the following HTML declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">

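Perhaps this is a hint as to why the input file is being interpreted as ASCII or ISO-8859-1.

Solution

xsltproc takes the encoding of an HTML input file from its meta content-type header. When no such header is present, it may assume the wrong encoding and corrupt the text while reading the file.

I inserted the following header into the input a.html: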
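The garbled output in the question is consistent with UTF-8 bytes being decoded as ISO-8859-1. The following is a minimal sketch of that mechanism, assuming the iconv and od utilities are installed and the script itself is saved as UTF-8. The variable names are illustrative only, and the script reproduces the effect outside of xsltproc rather than showing what xsltproc does internally.

```shell
#!/bin/bash
# Sample text from the question, stored as UTF-8 bytes.
original='əˈdres'

# Show the raw bytes: ə is c9 99 and ˈ is cb 88 in UTF-8.
printf '%s' "$original" | od -An -tx1

# Misread those bytes as ISO-8859-1 and re-encode the result as UTF-8.
# Each UTF-8 byte becomes its own character: c9 -> É, cb -> Ë, and
# 99 / 88 become invisible C1 control characters.
garbled=$(printf '%s' "$original" | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$garbled"

# Undoing the wrong decoding step recovers the original text.
restored=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1)
printf '%s\n' "$restored"
```

The É and Ë in the garbled string match the leading characters seen in the broken b.html, which suggests the input was read as ISO-8859-1 rather than plain ASCII.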

<head>
  <meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>

The stylesheet a.xslt looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" version="4.0" encoding="UTF-8" indent="yes" />
 <xsl:strip-space elements="*"/>

 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>

</xsl:stylesheet>

The output file b.html is finally as expected:

<html>
  <body>
    <div class="wrd">
      <div class="wrd-id">5</div>
      <div class="wrd-wrd">address</div>
      <div class="wrd-ipa">əˈdres,ˈaˌdres</div>
    </div>
    <div class="a">...</div>
  </body>
</html>


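Thanks, this is a very useful answer. In fact, I have since discovered the --encoding parameter, which lets you specify the encoding of the input file when the HTML file contains no meta information.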
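As a rough sketch of how the --encoding option mentioned above might be used, the script below writes a small HTML file without any meta charset information and passes the encoding to xsltproc explicitly. The temporary directory and the shortened file contents are stand-ins for illustration and are not the original files; the script skips the run if xsltproc is not installed.

```shell
#!/bin/bash
# Skip quietly when xsltproc is not available on this machine.
command -v xsltproc >/dev/null 2>&1 || { echo "xsltproc not found" >&2; exit 0; }

# Hypothetical scratch directory for the example files.
workdir=$(mktemp -d)

# HTML input saved as UTF-8, with no <meta> charset declaration.
cat > "$workdir/a.html" <<'EOF'
<html>
  <body>
    <div class="wrd-ipa">əˈdres,ˈaˌdres</div>
  </body>
</html>
EOF

# Identity transform with HTML output, modeled on the one in the question.
cat > "$workdir/a.xslt" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" encoding="UTF-8" indent="yes" />
 <xsl:template match="@* | node()">
  <xsl:copy>
   <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
 </xsl:template>
</xsl:stylesheet>
EOF

# --encoding states how the input document should be decoded,
# so the parser does not have to guess.
xsltproc --html --encoding UTF-8 "$workdir/a.xslt" "$workdir/a.html" > "$workdir/b.html"
cat "$workdir/b.html"
```

This may be more convenient than editing a meta tag into every input file when a large batch of files is known to be UTF-8.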