How to decode Angular's custom HTML encoding with Python

python angular parsing web-scraping

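I want to scrape and parse a London Stock Exchange news article.

Almost the entire content of the site comes from JSON that is consumed by JavaScript. However, it can be extracted easily with BeautifulSoup and parsed with the json module.

But the encoding of the script is a bit odd. The `<script>` tag has an id of "ng-lseg-state", which means this is Angular's custom HTML encoding.

For example: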

&l;div class=\"news-body-content\"&g;&l;html xmlns=\"http://www.w3.org/1999/xhtml\"&g;\n&l;head&g;\n&l;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /&g;\n&l;title&g;&l;/title&g;\n&l;meta name=\"generator\"

I handle this with a chain of `.replace()` calls:

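    import json

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
    script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
    # Restore the double quotes so the script content parses as JSON.
    article = json.loads(script.string.replace("&q;", '"'))
    main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
    article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
    # Undo the escapes found so far inside the article body.
    decoded_body = (
        article_body
        .replace("&l;", "<")
        .replace("&g;", ">")
        .replace("&q;", '"')
    )
    print(BeautifulSoup(decoded_body, "lxml").find_all("p"))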

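But there are still some sequences that I'm not sure how to deal with:

```
&a;#160;
&a;amp;
&s;
```

just to name a few.

So the question is: how do I deal with the remaining characters? Or is there a parser or a reliable character mapping that I don't know of?
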
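Angular encodes the state with a special escape function:

    export function escapeHtml(text: string): string {
      const escapedText: {[k: string]: string} = {
        '&': '&a;',
        '"': '&q;',
        '\'': '&s;',
        '<': '&l;',
        '>': '&g;',
      };
      return text.replace(/[&"'<>]/g, s => escapedText[s]);
    }

    export function unescapeHtml(text: string): string {
      const unescapedText: {[k: string]: string} = {
        '&a;': '&',
        '&q;': '"',
        '&s;': '\'',
        '&l;': '<',
        '&g;': '>',
      };
      return text.replace(/&[^;]+;/g, s => unescapedText[s]);
    }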
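To see where sequences such as `&a;#160;` come from, it can help to run the escaping direction on text that already contains ordinary HTML entities. The following is only a sketch: `ESCAPED_TEXT` and `escape_html` are hypothetical names for a Python port of `escapeHtml`, and the sample string is made up rather than taken from the site.

```python
import re

# Hypothetical Python port of the escapeHtml mapping shown above.
ESCAPED_TEXT = {
    "&": "&a;",
    '"': "&q;",
    "'": "&s;",
    "<": "&l;",
    ">": "&g;",
}


def escape_html(text: str) -> str:
    """Replace each special character with its Angular-style escape."""
    return re.sub(r"[&\"'<>]", lambda m: ESCAPED_TEXT[m.group(0)], text)


# Made-up HTML that already holds standard entities (&#160; and &amp;).
sample = '<p class="x">A&#160;B &amp; C</p>'
escaped = escape_html(sample)
print(escaped)
# → &l;p class=&q;x&q;&g;A&a;#160;B &a;amp; C&l;/p&g;
```

The leading `&` of each standard entity is itself escaped to `&a;`, so the body ends up with two layers of encoding: Angular's escapes on the outside and regular HTML entities underneath.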

You can reproduce the `unescapeHtml` function in Python and add `html.unescape()` to resolve the remaining HTML entities:
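    import html
    import json

    import requests
    from bs4 import BeautifulSoup

    # Angular's escape sequences. "&a;" is listed last so that an escaped
    # literal such as "&a;q;" decodes to "&q;" rather than to a quote.
    unescaped_text = {
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">",
        "&a;": "&",
    }


    def unescape(text):
        for key, value in unescaped_text.items():
            text = text.replace(key, value)
        # Resolve the remaining standard HTML entities (e.g. &#160;, &amp;).
        return html.unescape(text)


    url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
    script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
    payload = json.loads(unescape(script.string))
    main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
    article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
    print(BeautifulSoup(article_body, "lxml").find_all("p"))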

You were missing `&s;` and `&a;`.
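A chain of `.replace()` calls is order-sensitive, and running `html.unescape()` over the whole script before `json.loads()` may break the JSON if an entity such as `&quot;` decodes to a double quote inside a string. One possible alternative, sketched below under assumptions, is to undo Angular's escapes in a single regex pass, parse the JSON, and only then treat the article body as HTML. The names `unescape_angular` and `decode_state` are hypothetical, and `raw` is a made-up stand-in for `script.string`, not real site data.

```python
import html
import json
import re

# Same mapping as unescapeHtml in the TypeScript snippet.
UNESCAPED_TEXT = {
    "&a;": "&",
    "&q;": '"',
    "&s;": "'",
    "&l;": "<",
    "&g;": ">",
}
# Match only the five known escapes; anything else is left untouched.
ANGULAR_ESCAPE = re.compile(r"&[aqslg];")


def unescape_angular(text: str) -> str:
    """Undo Angular's escaping in one left-to-right pass."""
    return ANGULAR_ESCAPE.sub(lambda m: UNESCAPED_TEXT[m.group(0)], text)


def decode_state(raw_script: str) -> dict:
    """Turn the escaped content of the state script tag into a dict."""
    return json.loads(unescape_angular(raw_script))


# Made-up, shortened stand-in for script.string.
raw = (
    "{&q;value&q;:&q;&l;p&g;A&a;#160;B &a;amp; C "
    "&a;quot;D&a;quot;&l;/p&g;&q;}"
)
state = decode_state(raw)
body_html = state["value"]  # still HTML, standard entities intact
print(body_html)
```

Because each match is replaced exactly once, an escaped literal like `&a;q;` comes back as `&q;` regardless of dictionary order. The resulting `body_html` can be handed to BeautifulSoup as in the answer, since an HTML parser resolves `&#160;` and `&amp;` itself; `html.unescape(body_html)` is only needed if the decoded markup is wanted as a plain string.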
Related: same issue, same site. Thanks @QHarr for pointing this out. It seems we could both benefit from a more generic solution than a chain of `.replace()` calls.