Scraping IMDB with BeautifulSoup in Python: search results, then follow the first link, then get the year


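I'm trying to search IMDB for a specific title, follow the first link in the search results, and then print the year the movie was released (and, later, other information), but I can't figure out which part of the HTML to pass to .find().

The first function works: it collects the original URL and joins it with the new second part of the URL (for the movie page).

Thanks for any help, I've been stuck on this for days.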

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin # For joining next page url with base url

search_terms = input("What movie do you want to know about?\n> ").split()

url = "http://www.imdb.com/find?ref_=nv_sr_fn&q=" + '+'.join(search_terms) + '&s=all'

def scrape_find_next_page(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    next_page = soup.find('td', 'result_text').find('a').get('href')

    return next_page


next_page_url = scrape_find_next_page(url)

new_page = urljoin(url, next_page_url)



def scrape_movie_data(next_page_url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    title_year = soup.find('span','titleYear').find('a').get_text()

    return title_year

print(scrape_movie_data(new_page))

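First problem: in scrape_movie_data(next_page_url) you pass url instead of next_page_url to requests.get(), so you read the wrong page (the search results again, not the movie page).

    response = requests.get(next_page_url, headers=headers)

Second problem: you have to use {'id': 'titleYear'} in find(). A plain string as the second argument, as in find('span', 'titleYear'), is matched against the class attribute, not the id.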
Final version:

def scrape_movie_data(next_page_url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(next_page_url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    title_year = soup.find('span', {'id': 'titleYear'}).find('a').get_text()

    return title_year
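
The difference between the two find() calls can be checked offline. The following is only a sketch: SAMPLE_HTML is made-up markup loosely modelled on the answer above, not IMDB's actual page, and extract_year is a hypothetical helper that returns None instead of raising when an element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not IMDB's real HTML.
SAMPLE_HTML = """
<h1>Some Movie <span id="titleYear">(<a href="/year/1999/">1999</a>)</span></h1>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A plain string as the second argument is matched against the class
# attribute, so this finds nothing in the sample above.
by_class = soup.find("span", "titleYear")

# A dict is matched against arbitrary attributes, including id.
by_id = soup.find("span", {"id": "titleYear"})


def extract_year(soup):
    """Return the year text, or None if the expected elements are missing."""
    span = soup.find("span", {"id": "titleYear"})
    link = span.find("a") if span is not None else None
    return link.get_text(strip=True) if link is not None else None


print(by_class)
print(extract_year(soup))
```

Chaining .find(...).find('a') directly, as in the final version, raises AttributeError when the first find() returns None, so a guarded helper like this may be easier to debug if the page layout changes.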

EDIT: search Google for "IMDB API"; there are some interesting results.

You can get the results as JSON, so you don't have to scrape the HTML.

Other portals:


EDIT: JSON data

import requests

url = 'http://www.imdb.com/xml/find?json=1&nr=1&tt=on&q={}'
#url = 'http://www.imdb.com/xml/find?json=1&nr=1&nm=on&q={}'

headers = {'User-Agent': 'Mozilla/5.0'}

title = input("Title: ").split()

# join all words with '+' so multi-word titles are searched in full
response = requests.get(url.format('+'.join(title)), headers=headers)

data = response.json()

for x in data['title_popular']: # data['title_approx']:
    print('title:', x['title'])
    print(' year:', x['title_description'][:4])
    print('---')
    print('  id:', x['id'])
    print('name:', x['name'])
    print('        title:', x['title'])
    print('episode_title:', x['episode_title'])
    print('title_description:', x['title_description'])
    print('      description:', x['description'])
    print('------------------------------------')
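
The parsing part of the loop above can be tried without a network call. This is a minimal sketch under assumptions: the sample payload is invented and only mirrors the keys the code above reads (title, title_description, id); build_search_url and summarize are hypothetical helper names, and whether the endpoint still returns this shape is not verified here.

```python
from urllib.parse import quote_plus

# URL template taken from the answer above.
URL_TEMPLATE = 'http://www.imdb.com/xml/find?json=1&nr=1&tt=on&q={}'


def build_search_url(title):
    """URL-encode the whole title so spaces and punctuation survive."""
    return URL_TEMPLATE.format(quote_plus(title))


def summarize(data, section='title_popular'):
    """Yield (title, year, id) tuples; missing keys become empty strings."""
    for item in data.get(section, []):
        description = item.get('title_description', '')
        # The answer above takes the first four characters as the year.
        yield item.get('title', ''), description[:4], item.get('id', '')


# Hypothetical payload for illustration; not a real API response.
sample = {
    'title_popular': [
        {'id': 'tt0000001',
         'title': 'Some Movie',
         'title_description': '1999, Some Director'},
    ]
}

print(build_search_url('some movie'))
for row in summarize(sample):
    print(row)
```

Using .get() with defaults avoids a KeyError when a section such as title_popular or a field such as episode_title is absent from a particular result.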

Use DevTools in Chrome/Firefox to find the element (this works only if the page doesn't load its data with JavaScript).
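
A CSS selector found in DevTools can usually be passed to BeautifulSoup's select_one(). The sketch below assumes a made-up HTML string and a hypothetical helper, text_for_selector; if it returns None for HTML fetched with requests while the element is visible in the browser, the data is probably added by JavaScript.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML for illustration; not IMDB's real page.
STATIC_HTML = (
    '<div class="title_wrapper"><h1>Some Movie '
    '<span id="titleYear">(<a href="/year/1999/">1999</a>)</span>'
    '</h1></div>'
)


def text_for_selector(html, selector):
    """Return the stripped text of the first match, or None if no match."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# '#titleYear a' means: an <a> inside the element whose id is titleYear.
print(text_for_selector(STATIC_HTML, "#titleYear a"))

# No match in the static HTML: the function returns None.
print(text_for_selector("<div id='root'></div>", "#titleYear a"))
```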