Python: Why does bs4 misinterpret the JSON in a website's schema?

Tags: python, json, web-scraping, beautifulsoup, html-entities

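I want to scrape a website that exposes JobPosting schema data (the URL is in the code below).

I use requests + bs4 to do this. I have done it several times before, but in this case I ran into a problem parsing the JSON structure and loading it with the json library.

My code: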

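    import requests
    from bs4 import BeautifulSoup as soup
    import json

    # Reuse one HTTP session for all requests
    session = requests.Session()

    def get_html_return_soup(url):
        """Fetch url and return a BeautifulSoup object, or None on failure."""
        try:
            client = session.get(url, timeout=15)
            html_page = client.content
        except Exception as e:
            print('Exc - {}'.format(str(e)))
            return None
        else:
            return soup(html_page, "html.parser")
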
    url = 'https://www.jobtrans.nl/vacatures/oproep-chauffeur-1806816'
    page_soup = get_html_return_soup(url)

    # get 'JobPosting' script
    json_tag = page_soup.findAll('script', type='application/ld+json')[-1].text
    #print(json_tag)

    json_response = json.loads(json_tag)

I get an error:

    Traceback (most recent call last):
      File "C:/Users/Andrzej/PycharmProjects/Praca/test_bs.py", line 71, in <module>
        json_response = json.loads(json_tag)
      File "C:\Users\Andrzej\AppData\Local\Programs\Python\Python37-32\lib\json\__init__.py", line 348, in loads
        return _default_decoder.decode(s)
      File "C:\Users\Andrzej\AppData\Local\Programs\Python\Python37-32\lib\json\decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "C:\Users\Andrzej\AppData\Local\Programs\Python\Python37-32\lib\json\decoder.py", line 353, in raw_decode
        obj, end = self.scan_once(s, idx)
    json.decoder.JSONDecodeError: Invalid control character at: line 6 column 242 (char 378)

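The JSON on the website (as viewed in Chrome):

So I think the problem lies in entities such as "/>", "lt", "br", etc. I solved this problem a long time ago, like this:

1. Read the raw HTML
2. Replace the bad entities
3. Parse the JSON tag with BS4
4. Parse the JSON with the json library

But I would like to know whether there is a better option. In this case, is the problem on the website's side, or am I doing something wrong?

I tested all the BS4 parsers from the documentation (lxml, html5lib), but the result is the same.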

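When it encounters `<br/>`, the BeautifulSoup parser treats the data as broken HTML. There are two key points to address:

Use the `html5lib` parser instead of `html.parser`; the former is more lenient than the latter when handling broken HTML (see the Beautiful Soup documentation on parsers)

and

clean the problematic text to prepare it for the `json` format; this can be done in several ways, here is one:

    json_tag = page_soup.find_all('script', type='application/ld+json')[-1].text.replace('<br/>', '').replace('\n', '')

Sample output:

    json_response = json.loads(json_tag)
    print(json.dumps(json_response, indent=2))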
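
As an alternative to stripping newlines, and not part of the original answer: the `Invalid control character` error is what `json.loads` reports when a string value contains a raw control character such as a newline or tab. Passing `strict=False` tells the decoder to accept those characters, so the line breaks inside the text are kept rather than deleted. A minimal sketch, assuming a made-up payload (`raw`) and a hypothetical helper name (`load_lenient`):

```python
import json

# Hypothetical JSON-LD payload with a raw newline inside a string value,
# similar in shape to what a JobPosting description might contain.
raw = '{"@type": "JobPosting", "description": "First line<br/>\nSecond line"}'


def load_lenient(text):
    """Parse JSON, allowing raw control characters inside strings."""
    return json.loads(text, strict=False)


# The default (strict) decoder rejects the raw newline.
try:
    json.loads(raw)
    strict_error = None
except json.JSONDecodeError as exc:
    strict_error = exc.msg

# The lenient decoder accepts it and keeps the newline in the value.
data = load_lenient(raw)
print(strict_error)
print(repr(data["description"]))
```

With the question's code this would amount to `json.loads(json_tag, strict=False)`; whether that alone is enough for the page in question has not been verified here.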

I did not want to create a new post, because I have a similar problem with another JSON.

The JSON from this website:

https://verbund.edeka/karriere/stellenbörse/stelle-verkäuferin-feinkost-m-w-d-edeka-lüders-burgwedel-selbstständiger-einzelhandel?id=60118_60115&type=j

looks fine in a JSON viewer such as

http://jsonviewer.stack.hu/

but in Python I get:

    json.decoder.JSONDecodeError: Expecting value: line 7 column 20 (char 3556).

My code (get_html_return_soup returns a bs4 object; I also tested with "html.parser" and "html5lib"):

    url = 'https://verbund.edeka/karriere/stellenbörse/stelle-verkäuferin-feinkost-m-w-d-edeka-lüders-burgwedel-selbstständiger-einzelhandel?id=60118_60115&type=j'
    page_soup = get_html_return_soup(url)

    # json_tag is extracted from page_soup the same way as in the first snippet
    # Turn escaped angle brackets back into literal characters
    json_tag_after = json_tag.replace('&lt;', '<').replace('&gt;', '>')
    json_response = json.loads(json_tag_after)

    print(json_response)
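
The `Expecting value` message only gives a character offset, which is hard to read in a payload several thousand characters long. One way to narrow it down is to print the text around that offset, using the `pos` and `msg` attributes of `json.JSONDecodeError`. A minimal sketch; `show_json_error` and the sample payload are hypothetical and do not come from the site above:

```python
import json


def show_json_error(text, width=40):
    """Return an excerpt of `text` around the position where json.loads fails.

    Returns None if the text parses cleanly.
    """
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        # Clamp the window so it stays inside the string.
        start = max(exc.pos - width, 0)
        end = min(exc.pos + width, len(text))
        return "{} (char {}): {!r}".format(exc.msg, exc.pos, text[start:end])
    return None


# Hypothetical broken payload: the value after "salary" is missing.
sample = '{"title": "Verkäufer", "salary": , "type": "FULL_TIME"}'
print(show_json_error(sample))
```

Calling `show_json_error(json_tag_after)` on the real data should show which characters sit at the failing position, which is usually enough to decide what needs cleaning.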

Thanks for your reply. I had already tried both of your options ;). html5lib does not work in this case.

Both steps are required:

    return soup(html_page, "html5lib")

and

    json_tag = page_soup.find_all('script', type='application/ld+json')[-1].text.replace('<br/>', '').replace('\n', '')
