Regular expression for extracting a script tag in Python

python regex web-scraping


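I have the following Python code:

import sys, os, requests, datetime, time
from bs4 import BeautifulSoup
import urllib.request
import re
import json

def get_html(url):
    # Send a browser-like User-Agent so the site serves the normal page.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
    r = requests.get(url, headers=headers)
    return r.content

link = 'https://www.clubx.com.au/products/womanizer-pro?variant=37834367948'
soup = BeautifulSoup(get_html(link), 'html.parser')
obj = soup.find_all('script')[18]
m = re.search(r"\"variants\":\[(.*?)\]", obj.string)
if m:
    # json.loads parses a string; json.load expects a file-like object.
    data = json.loads(m.group(1))
    print(data)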

Use the regex pattern

r"\"variants\":\[(.*?)\]"

Demo:

from bs4 import BeautifulSoup
import json
import re

s = """<script>var BOLD = BOLD || {};
    BOLD.products = BOLD.products || {};
    BOLD.variant_lookup = BOLD.variant_lookup || {};BOLD.variant_lookup[31066737740] ="womanizer";BOLD.variant_lookup[31066737804] ="womanizer";BOLD.variant_lookup[31066737868] ="womanizer";BOLD.variant_lookup[31066737996] ="womanizer";BOLD.variant_lookup[1509908217881] ="womanizer";BOLD.products["womanizer"] ={"id":8993669708,"title":"Womanizer","variants":[{"id":37834367948,"title":"Black","option1":"Black","option2":null,"option3":null,"sku":"1725205212"}]}
    </script>
"""

soup = BeautifulSoup(s, "html.parser")
src = soup.find("script")
m = re.search(r"\"variants\":\[(.*?)\]", src.string)
if m:
    data = json.loads(m.group(1))
    print(data)

{u'sku': u'1725205212', u'title': u'Black', u'id': 37834367948L, u'option2': None, u'option3': None, u'option1': u'Black'}
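Note that group 1 in the demo is the text between the brackets, which is valid JSON only while the array holds a single object. A minimal sketch of one way to handle several variants, assuming the array contains no nested `]`, is to capture the brackets as well so the group is a complete JSON array. The `text` value below is made up for illustration and is not taken from the real page.

```python
import json
import re

# Hypothetical script text with two variants (not from the real page).
text = 'BOLD.products["x"] ={"id":1,"variants":[{"id":2,"title":"Black"},{"id":3,"title":"Red"}]}'

# Capture the brackets too, so the group is a complete JSON array.
m = re.search(r'"variants":(\[.*?\])', text)
variants = json.loads(m.group(1)) if m else []

for v in variants:
    print(v["id"], v["title"])
```

The non-greedy `.*?` still stops at the first `]`, so this only helps when no variant contains an array of its own.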

It works when 's' is a string. But in my case 's' is not a plain string, so an error occurred:

json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 495 (char 494)

My code has the following structure:

def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
    r = requests.get(url, headers=headers)
    return r.content

link = ''
soup = BeautifulSoup(get_html(link), 'html.parser')
obj = soup.find_all('script')[18]
m = re.search(r"\"variants\":\[(.*?)\]", obj.string)
if m:
    data = json.loads(m.group(1))
    print(data)

What does obj.string print?

It prints the content.

Add it to your question.
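The actual page content was never posted, so the cause of the error cannot be confirmed. One plausible explanation, though, is that a variant on the real page contains an array of its own: the non-greedy `(.*?)\]` stops at the first `]`, which leaves truncated JSON. A minimal sketch that avoids this, assuming the text after `"variants":` is valid JSON, uses the regex only to locate the opening bracket and lets `json.JSONDecoder.raw_decode` parse the value. The `extract_variants` helper and the `sample` string are hypothetical names and data for illustration.

```python
import json
import re

def extract_variants(script_text):
    """Return the list stored under "variants", or None if the key is absent."""
    # The lookahead keeps the match end positioned on the opening bracket.
    m = re.search(r'"variants"\s*:\s*(?=\[)', script_text)
    if m is None:
        return None
    # raw_decode parses one complete JSON value starting at the given index
    # and ignores whatever text follows it, so nested brackets are fine.
    value, _end = json.JSONDecoder().raw_decode(script_text, m.end())
    return value

# Hypothetical script content with an array nested inside a variant.
sample = ('BOLD.products["x"] ={"id":1,"variants":'
          '[{"id":2,"title":"Black","tags":["a","b"]}],"vendor":"v"};')

variants = extract_variants(sample)
print(variants)
```

In the question's code this would be called as `extract_variants(obj.string)`, after checking that `obj.string` is not `None`.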