How to extract a specific pattern from HTML using BeautifulSoup


I am trying to extract certain parts of an HTML document that contain a recurring pattern.

The pattern looks like this:

<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_01",
            "id": "00000001"
        });
    });
</script>

But how can I extract only these specific patterns? I would like to get this "dict" as the result:

({"Status":"true",
 "description":"sku_01",
 "id": "00000001"
})

Thanks.

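You can use .find() with the text= parameter (newer Beautiful Soup releases also accept it under the name string=) to locate the <script> tag, then use the re and json modules to extract and decode the data.

For example: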

import re
import json
from bs4 import BeautifulSoup

txt = '''
<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_01",
            "id": "00000001"
        });
    });
</script>'''

soup = BeautifulSoup(txt, 'html.parser')

# locate the <script>; the `t and` guard avoids a TypeError for <script> tags without inline text (t is None)
t = soup.find('script', text=lambda t: t and 'ProductsList' in t).contents[0]

# get the raw string using `re` module
json_data = re.search(r'itemJS\.ProductsList\((.*?)\);', t, flags=re.DOTALL).group(1)

# decode the data
json_data = json.loads(json_data)

# print the data to screen
print(json.dumps(json_data, indent=4))
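
One caveat with the regex step: the non-greedy `(.*?)\);` stops at the first `);` it sees, so it would cut the payload short if a string value happened to contain `);`. A possible alternative, sketched below, is to let the standard library's `json.JSONDecoder.raw_decode` parse exactly one JSON value starting right after the opening parenthesis. The helper name `decode_call_argument` and the sample description value are made up for illustration and are not part of the original answer.

```python
import json
import re


def decode_call_argument(script_text, call_name="itemJS.ProductsList"):
    """Return the JSON value passed to `call_name(...)`, or None if the call is absent.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Find the opening of the call, e.g. "itemJS.ProductsList(", and skip
    # any whitespace after the parenthesis (raw_decode does not skip it).
    match = re.search(re.escape(call_name) + r"\(\s*", script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here: ");" and the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Sample script body; the description deliberately contains ");".
script_text = '''
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku (new);",
            "id": "00000001"
        });
    });
'''

print(decode_call_argument(script_text))
```

If the payload is not valid JSON, `raw_decode` raises `json.JSONDecodeError`, the same as `json.loads` would.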

Edit: if you have multiple <script> tags, you can do the following:

import re
import json
from bs4 import BeautifulSoup

txt = '''
<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_01",
            "id": "00000001"
        });
    });
</script>

<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_02",
            "id": "00000002"
        });
    });
</script>
'''

soup = BeautifulSoup(txt, 'html.parser')

# the `t and` guard avoids a TypeError for <script> tags without inline text (t is None)
for script_tag in soup.find_all('script', text=lambda t: t and 'ProductsList' in t):
    json_data = re.search(r'itemJS\.ProductsList\((.*?)\);', script_tag.contents[0], flags=re.DOTALL).group(1)
    json_data = json.loads(json_data)
    print(json.dumps(json_data, indent=4))

Thanks Andrej, but how can your solution be adapted if there are multiple patterns in this HTML? Something like:

for i in soup.find('script', text=lambda t: 'ProductsList' in t).contents[0]

Output of the single-tag example:

{
    "Status": "true",
    "description": "sku_01",
    "id": "00000001"
}


Output of the loop over multiple <script> tags:

{
    "Status": "true",
    "description": "sku_01",
    "id": "00000001"
}
{
    "Status": "true",
    "description": "sku_02",
    "id": "00000002"
}
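
If the decoded objects are needed for further processing rather than just printed, the same loop can collect them into a list. The following is only a sketch under a few assumptions: the function name `extract_products` and the extra `<script src="app.js">` tag are hypothetical, and `re.finditer` is used so that a single `<script>` containing several `itemJS.ProductsList(...)` calls would also be handled.

```python
import json
import re
from bs4 import BeautifulSoup

# Same pattern as in the answer, compiled once for reuse.
PATTERN = re.compile(r'itemJS\.ProductsList\((.*?)\);', flags=re.DOTALL)


def extract_products(html):
    """Return a list of dicts, one per itemJS.ProductsList(...) call.

    Hypothetical helper: the name is illustrative only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    products = []
    for script_tag in soup.find_all('script'):
        # .string is None for tags without inline text (e.g. external scripts).
        text = script_tag.string or ''
        for match in PATTERN.finditer(text):
            products.append(json.loads(match.group(1)))
    return products


html = '''
<script src="app.js"></script>

<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_01",
            "id": "00000001"
        });
    });
</script>

<script type="text/javascript">
    $(document).ready(function() {
        itemJS.ProductsList({"Status":"true",
            "description":"sku_02",
            "id": "00000002"
        });
    });
</script>
'''

products = extract_products(html)
print([p["description"] for p in products])  # ['sku_01', 'sku_02']
```

Returning plain dicts keeps the scraping step separate from whatever is done with the data afterwards, such as building a lookup like `{p["id"]: p for p in products}`.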