Getting field IDs from a Google Form with Python and BeautifulSoup


For example, given the following data in a Google Form, how can I build a list of these "field IDs"?

var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0
That is the relevant part of the HTML.

So far I have this code:

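    import requests
    from bs4 import BeautifulSoup as bs

    # url and proxies are defined elsewhere
    a = requests.get(url, proxies=proxies)
    soup = bs(a.text, 'html.parser')
    # Collect every <script type="text/javascript"> tag on the page
    fields = soup.find_all('script', {'type': 'text/javascript'})
    form_info = fields[1]
    print(form_info)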
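But this returns a lot of irrelevant data, and I can't see a simple way to get what I want without adding a large number of `str.replace()` and `str.split()` calls. That would also be very messy.

I don't have to use BeautifulSoup, although it seemed like the obvious approach.

For the example above, I need a list like this: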


[1089277187, 742914399, 2011436433, 638818998, 1952962866, 916445513, 848461347]
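Beautiful Soup is designed for querying HTML tags, so to extract data from a JavaScript variable you can use a regex instead. You can match on the digits that follow `[[`. However, this also returns `831400739`, the ID of the first question. You can exclude it manually after running the regex by skipping the first item.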

import re

script = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
,[1620039602,"Size",null,0,[[2011436433,null,0]
]
]
,[445859665,"First Name",null,0,[[638818998,null,0]
]
]
,[1417046530,"Last Name",null,0,[[1952962866,null,0]
]
]
,[903472958,"E-mail",null,0,[[916445513,null,0]
]
]
,[549969484,"Phone Number",null,0,[[848461347,null,0'''

match = re.findall(r'(?<=\[\[)(\d+)', script)
# (?<=...) is a lookbehind: the text inside must appear right before the match, but it is not included in the results.
# \[\[ matches two literal opening square brackets. The backslash tells regex to treat [ as a plain character, not as the start of a character class.
# (\d+) captures a run of one or more digits, which is what findall returns.

results = match[1:]  # Skip the first item, which is 831400739
print(results)
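Output:

['1089277187', '742914399', '2011436433', '638818998', '1952962866', '916445513', '848461347']

You may want to convert the results to integers. To make the code more robust, you may also want to remove spaces and newlines before calling the regex function, for example:

formatted = script.replace(' ', '').replace('\n', '').replace('\r', '')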
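
As an alternative to the regex, the array assigned to `FB_PUBLIC_LOAD_DATA_` can be parsed as JSON and walked directly. This is only a sketch under some assumptions: that the assigned value is valid JSON (the excerpt in the question is, apart from being cut off), that the questions sit at `data[1][1]`, and that each question keeps its inputs in its fifth element, as the excerpt suggests. The function name `extract_entry_ids` and the shortened `sample` string are made up for illustration.

```python
import json
import re


def extract_entry_ids(script_text):
    """Return the input IDs found in a FB_PUBLIC_LOAD_DATA_ assignment."""
    # Locate the start of the value assigned to the variable.
    marker = re.search(r"FB_PUBLIC_LOAD_DATA_\s*=\s*", script_text)
    if marker is None:
        return []

    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (for example the trailing semicolon).
    data, _ = json.JSONDecoder().raw_decode(script_text, marker.end())

    # Assumption taken from the excerpt: the questions live at data[1][1].
    ids = []
    for field in data[1][1]:
        # Assumption: field[4] holds a list of inputs, each starting with its ID.
        if isinstance(field, list) and len(field) > 4 and field[4]:
            for entry in field[4]:
                ids.append(entry[0])
    return ids


# Shortened, properly closed version of the data from the question.
sample = '''var FB_PUBLIC_LOAD_DATA_ = [null,[null,[[831400739,"Product Title",null,0,[[1089277187,null,0]
]
]
,[2054606931,"SKU",null,0,[[742914399,null,0]
]
]
]]];'''

print(extract_entry_ids(sample))  # [1089277187, 742914399]
```

Parsing the structure gives integers directly, is not affected by whitespace or newlines between the brackets, and avoids having to skip the first match by hand. If the real page lays the array out differently, the two index assumptions above need adjusting.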