How to extract a specific pattern from HTML using BeautifulSoup
I am trying to extract specific parts of an HTML document that contain a repeating pattern. The pattern looks like this:
<script type="text/javascript">
$(document).ready(function() {
itemJS.ProductsList({"Status":"true",
"description":"sku_01",
"id": "00000001"
});
});
</script>
But how can I extract only these specific patterns?
I would like to get this "dict" as the result:
({"Status":"true",
"description":"sku_01",
"id": "00000001"
})
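Thanks.

You can use .find() with the text= parameter (named string= in Beautiful Soup 4.4.0 and later) to locate the <script> tag, then use the re and json modules to decode the data.
For example: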
import re
import json
from bs4 import BeautifulSoup
txt = '''
<script type="text/javascript">
$(document).ready(function() {
itemJS.ProductsList({"Status":"true",
"description":"sku_01",
"id": "00000001"
});
});
</script>'''
soup = BeautifulSoup(txt, 'html.parser')
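# locate the <script> tag whose text mentions ProductsList
# (t is None for tags without a single text child, so guard against it)
t = soup.find('script', text=lambda t: t is not None and 'ProductsList' in t).contents[0]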
# get the raw string using `re` module
json_data = re.search(r'itemJS\.ProductsList\((.*?)\);', t, flags=re.DOTALL).group(1)
# decode the data
json_data = json.loads(json_data)
# print the data to screen
print(json.dumps(json_data, indent=4))
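One limitation of the non-greedy pattern above is that it stops at the first ");" it sees, so a string value that happens to contain that sequence would be cut short and json.loads would fail. A possible alternative, sketched below, is to find where the call starts and let json.JSONDecoder.raw_decode parse exactly one JSON value from that position. The helper name extract_product and the sample description are made up for illustration; only the call name itemJS.ProductsList comes from the question.

```python
import json
import re

# Matches the start of the call and any whitespace before the JSON object.
# "itemJS.ProductsList" is taken from the question; everything else is illustrative.
CALL_START = re.compile(r'itemJS\.ProductsList\(\s*')


def extract_product(script_text):
    """Return the first ProductsList argument as a dict, or None if absent."""
    match = CALL_START.search(script_text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here: ");" and the rest of the script).
    obj, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return obj


# Hypothetical script body whose description contains ");" inside a string.
script_text = '''
$(document).ready(function() {
itemJS.ProductsList({"Status":"true",
"description":"sku_01 (new);",
"id": "00000001"
});
});
'''

product = extract_product(script_text)
print(product["description"])
```

The script text passed in would be the same string the answer obtains from the located <script> tag.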
Edit: if you have multiple <script> tags, you can do the following:
import re
import json
from bs4 import BeautifulSoup
txt = '''
<script type="text/javascript">
$(document).ready(function() {
itemJS.ProductsList({"Status":"true",
"description":"sku_01",
"id": "00000001"
});
});
</script>
<script type="text/javascript">
$(document).ready(function() {
itemJS.ProductsList({"Status":"true",
"description":"sku_02",
"id": "00000002"
});
});
</script>
'''
soup = BeautifulSoup(txt, 'html.parser')
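# loop over every <script> tag whose text mentions ProductsList
for script_tag in soup.find_all('script', text=lambda t: t is not None and 'ProductsList' in t):
    # get the raw JSON string from this tag
    json_data = re.search(r'itemJS\.ProductsList\((.*?)\);', script_tag.contents[0], flags=re.DOTALL).group(1)
    # decode and print it
    json_data = json.loads(json_data)
    print(json.dumps(json_data, indent=4))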
Thanks Andrej, but if I have multiple patterns in this HTML, how can I adapt your solution? Something like:
for i in soup.find('script', text=lambda t: 'ProductsList' in t).contents[0]
Output of the single-tag example:
{
    "Status": "true",
    "description": "sku_01",
    "id": "00000001"
}
Output of the multi-tag example:
{
    "Status": "true",
    "description": "sku_01",
    "id": "00000001"
}
{
    "Status": "true",
    "description": "sku_02",
    "id": "00000002"
}
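If you would rather collect the results than print them, one option is to gather the decoded dicts into a dictionary keyed by "id". The sketch below reuses the regular expression from the answer, uses re.finditer so that several ProductsList calls inside one <script> tag are all picked up, and skips tags that have no text. The function name collect_products and the sample strings are assumptions for illustration; with BeautifulSoup the input could be something like (tag.string for tag in soup.find_all('script')).

```python
import json
import re

# Same pattern as in the answer: capture everything between "(" and ");".
PATTERN = re.compile(r'itemJS\.ProductsList\((.*?)\);', flags=re.DOTALL)


def collect_products(script_texts):
    """Map each product's "id" to its decoded dict (hypothetical helper)."""
    products = {}
    for text in script_texts:
        if not text:
            # tag.string is None for empty tags or tags with several children
            continue
        # finditer yields every call in the script, not just the first one
        for match in PATTERN.finditer(text):
            item = json.loads(match.group(1))
            products[item["id"]] = item
    return products


# Stand-ins for the text of each <script> tag; None mimics an empty tag.
scripts = [
    'itemJS.ProductsList({"Status":"true", "description":"sku_01", "id": "00000001"});',
    None,
    'itemJS.ProductsList({"Status":"true", "description":"sku_02", "id": "00000002"});',
]

products = collect_products(scripts)
print(sorted(products))
```

Keying by "id" assumes the ids are unique; if they are not, appending to a list instead would keep every entry.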