Get field IDs from a Google Form with Python and BeautifulSoup
For example, in a Google Form: how can I create a list of these "field IDs"?
var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0
That is the relevant part of the HTML ^
So far I have this code:
import requests
from bs4 import BeautifulSoup as bs

# url and proxies are defined elsewhere
a = requests.get(url, proxies=proxies)
soup = bs(a.text, 'html.parser')
fields = soup.find_all('script', {'type': 'text/javascript'})
form_info = fields[1]
print(form_info)
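But this returns a lot of irrelevant data, and unless I add a long chain of str.replace() and str.split() calls, I can't see a simple way to do this. That would also be very messy.
I don't have to use BeautifulSoup, although it seemed like the obvious approach.
In the example above, I need a list like this: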
[1089277187, 742914399, 2011436433, 638818998, 1952962866, 916445513, 848461347]
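Beautiful Soup is for querying HTML tags, so the way to extract data from a JavaScript variable is to use a regex. You can match on the digits that follow [[. However, this will also return 831400739. You can exclude it manually after the regex by skipping the first item.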
import re
script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0'''
match = re.findall(r'(?<=\[\[)(\d+)', script)
# (?<= ) is a lookbehind: the text inside it must come right before the match, but is not included in the results.
# \[\[ matches two literal square brackets. The backslash tells regex to treat [ as a character rather than as the start of a character class.
# (\d+) matches one or more digits and returns them in the results.
results = match[1:]  # Skip the first item, which is 831400739
print(results)
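This prints:
['1089277187', '742914399', '2011436433', '638818998', '1952962866', '916445513', '848461347']
You may want to cast the results to integers. Also, to make the code more robust, you may want to remove spaces and newlines before calling the regex function, e.g.: formatted = script.replace(' ', '').replace('\n', '').replace('\r', '')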
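The two suggestions above (stripping whitespace first, then casting to integers) can be combined into one short sketch. The sample string below is a shortened version of the data from the question, with extra spaces and line breaks added between brackets on purpose; the name field_ids is only illustrative.

```python
import re

# Shortened sample of the script text. Spaces and line breaks between
# brackets are added here to show why normalising first helps.
script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[
  [1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[ [742914399,null,0]
]
]'''

# Remove spaces and line breaks so "[ [" and "[\n  [" both become "[[".
formatted = script.replace(' ', '').replace('\n', '').replace('\r', '')

# Same lookbehind pattern as in the answer, written as a raw string.
matches = re.findall(r'(?<=\[\[)(\d+)', formatted)

# Skip the first match (831400739) and cast the rest to int.
field_ids = [int(x) for x in matches[1:]]
print(field_ids)
```

Note that removing every space also changes the labels inside the string ("Product Title" becomes "ProductTitle"). That is harmless here because only the numbers are extracted, but it matters if you also want the field names.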