Accessing dataLayer (a JS variable) while scraping with Python

Tags: javascript, python, python-2.7, web-scraping, beautifulsoup

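I'm scraping a web page with Beautiful Soup. I want to access the dataLayer (a JavaScript variable) that is present on the page. How can I retrieve it with Python?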

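BeautifulSoup is not a JavaScript emulator, so you cannot execute the JS and read the contents of the variable. However, the variable may be populated by an AJAX request; if you send the same request from your Python script, you can get the data.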

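On the other hand, if the data is assigned statically in the page source, you can extract it with string processing and regular expressions.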
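For the AJAX case, a minimal sketch might look like the following. It uses only the standard library (`requests` would work just as well), and the endpoint URL, the referer, and the choice of headers are all assumptions for illustration; the real values have to be read from the browser's network tab.

```python
import json
import urllib.request


def build_ajax_request(url, referer):
    """Build a request that mimics a browser XHR call (header choice is an assumption)."""
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Accept": "application/json",
        "Referer": referer,
    }
    return urllib.request.Request(url, headers=headers)


def fetch_json(request, opener=urllib.request.urlopen):
    """Send the request and decode the response body as JSON."""
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint: replace it with the URL the page actually calls.
request = build_ajax_request("http://example.com/api/data", "http://example.com/")
# data = fetch_json(request)  # uncomment to perform the real network call
```

Whether the extra headers are needed depends on the site; some endpoints answer a plain GET.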


Disclaimer: sorry for the generic answer.

You can parse it out of the page source with the help of re and json.loads, by finding the script tag that contains the JSON:

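from bs4 import BeautifulSoup
import re
from json import loads
import requests

url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Find the script that declares dataLayer and keep everything after the first "= ".
script_text = soup.find("script", text=re.compile(r"var\s+dataLayer")).text.split("= ", 1)[1]

# The JSON literal ends at the first ";".
json_data = loads(script_text[:script_text.find(";")])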
If you run it, you can see we get what you want:

In [31]: from bs4 import BeautifulSoup
In [32]: import re    
In [33]: from json import loads    
In [34]: import requests

In [35]: url = "http://www.allocine.fr/video/player_gen_cmedia=19561982&cfilm=144185.html"

In [36]: soup = BeautifulSoup(requests.get(url).content, "html.parser")

In [37]: script_text = soup.find("script", text=re.compile("var\s+dataLayer")).text.split("= ", 1)[1]

In [38]: json_data = loads(script_text[:script_text.find(";")])

In [39]: json_data
Out[39]: 
[{'actor': '403573,19358,22868,612492,418933,436500,46797,729453,66391,16893,211493,249636,18324,483703,1193,165792,231665,114167,139915,155111,258115,119842,610268,166263,597100,134791,520768,149470,734146,633703,684803,763372,673220,748361,178486,241328,517093,765381,693327,196630,758799,220756,550759,737383,263596,174710,118600,663153,463379,740361,702873,659451,779133,779134,779135,779136,779137,779138,779139,779140,779141,779142,779143,779144,779145,779146,779147,779241,779242,779243,779244',
  'director': '41198',
  'genre': '13025=action&13012=fantastique',
  'movie_distributors': 929,
  'movie_id': 144185,
  'movie_isshowtime': 1,
  'movie_label': 'suicide_squad',
  'nationality': '5002',
  'press_rating': 2,
  'releasedate': '2016-08-03',
  'site_route': 'moviepage_videos_trailer',
  'site_section': 'movie',
  'user_activity': 'videowatch',
  'user_rating': 3.4,
  'video_id': 19561982,
  'video_label': 'suicide_squad_bande_annonce_finale_vo',
  'video_type_id': 31003,
  'video_type_label': 'trailer'}]

You could also use a regular expression, but in this case using str.find to locate the end of the data is enough.
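A regex-based variant could be sketched as follows. Here the regular expression only locates the start of the assignment, and `json.JSONDecoder.raw_decode` finds where the JSON value ends, so a ";" inside a string value does not cut the data short. The helper name `extract_js_json` and the sample HTML are hypothetical, and the approach assumes the assigned value is valid JSON (a JS object literal with single quotes would raise an error).

```python
import json
import re


def extract_js_json(source, var_name="dataLayer"):
    """Return the JSON value assigned to `var <var_name> = ...`, or None if absent."""
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", source)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here the trailing ";").
    value, _end = json.JSONDecoder().raw_decode(source, match.end())
    return value


# Hypothetical page source standing in for the real HTML.
html = """
<script>
var dataLayer = [{"movie_id": 144185, "video_label": "trailer; final cut"}];
</script>
"""

data = extract_js_json(html)
print(data[0]["movie_id"])  # 144185
```

With real pages, `source` would be the text of the script tag found by BeautifulSoup, or the whole response body.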

The data the OP is trying to get is in the page source.