Javascript: Remove unicode characters from Cheerio.js content


javascript node.js unicode


I am scraping content from a web page that has the following HTML:

  <p>
     Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday.
  <br>
  <br>
    “The outcome will be made public in due course,” John said in an SMS yesterday.
  <br>
  <br>
 </p>

After capturing the content I am interested in, I "clean" it with a regular expression, as follows:

let cleanedContent = content.split(/<br>/).join(' \n ');
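The pattern /<br>/ only matches the exact string <br>. As a minimal sketch, assuming the scraped markup may also contain variants such as <br/>, <br /> or <BR>, a single replace with a more tolerant pattern gives the same result as the split/join above. The sample content string here is made up for illustration.

```javascript
// Sample input standing in for the scraped `content` string (an assumption).
const content = 'First paragraph.<br><br/>Second paragraph.<BR />';

// Match <br>, <br/> and <br />, case-insensitively, and replace each one
// with the same ' \n ' separator used in the original split/join.
const cleanedContent = content.replace(/<br\s*\/?>/gi, ' \n ');

console.log(cleanedContent);
```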

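It looks like punctuation, and perhaps some other characters, are stored as their unicode codes. I may be wrong on this point and would welcome being corrected.

Assuming they are stored as unicode codes, is there a module I can pass the cleanedContent variable to that converts the unicode codes into human-readable punctuation and characters?

If that is not possible, is there a better implementation that avoids this situation altogether? I fully accept that I may be misusing the concepts, and I would appreciate guidance on a different approach I could try.

One approach I can think of is to write a module containing a set of unicode codes and their corresponding characters, then look for matches and replace each matched code with the corresponding human-readable character. My gut feeling is that someone has already done this or something similar, and I would rather not reinvent the wheel.

Thanks in advance.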

Cheerio uses htmlparser2 internally.

So, when loading the HTML string, you can use htmlparser2's decodeEntities option, which lets you configure how HTML entities are handled.

Example:

$ = cheerio.load('<ul id="fruits">...</ul>', {
    decodeEntities: false
});
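
If a string has already been extracted and still contains numeric character references such as &#x201C;, they can be decoded without a hand-written lookup table. The following is only a sketch under stated assumptions: decodeNumericEntities is a hypothetical helper name, not part of Cheerio or htmlparser2, and it handles numeric references only, leaving named entities such as &amp; untouched.

```javascript
// Hypothetical helper: decodes numeric character references only
// (e.g. &#x201C; or &#8220;); named entities such as &amp; are left as-is.
function decodeNumericEntities(text) {
  return text.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, (match, code) => {
    const isHex = code[0] === 'x' || code[0] === 'X';
    const codePoint = parseInt(isHex ? code.slice(1) : code, isHex ? 16 : 10);
    // Leave the reference untouched if it is outside the Unicode range.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

const raw = '&#x201C;The outcome will be made public in due course,&#x201D; John said.';
console.log(decodeNumericEntities(raw));
```

For full coverage of named entities, an established entity-decoding package is likely a better choice than extending this function.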



Thanks, this was very helpful. I have managed to solve my problem.
