Python BeautifulSoup web crawling: how to get a piece of text


python web-scraping web-crawler


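I am trying to scrape boxofficemojo.com. Specifically, I am currently working on the yearly chart page requested in the code below.

For every movie listed on that page, I want to get the genre, runtime, MPAA rating, foreign gross, and budget. I am having trouble because this information has no identifying tags. Here is what I have so far: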

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    # Walk the yearly chart pages and visit every movie linked from them.
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        # Attribute values are quoted so the selector is valid CSS.
        for link in soup.select('td > b > font > a[href^="/movies/?"]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            title = link.string
            print('%s %s' % (title, href))
            get_single_item_data(href)
        page += 1  # advance to the next chart page; without this the loop never ends


def get_single_item_data(item_url):
    # Fetch one movie page and print the fields of interest.
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    # Failing attempt: find_all() treats its first argument as a tag name,
    # so this prints an empty list instead of the genre.
    print(soup.find_all("Genre: "))
    for person in soup.select('td > font > a[href^="/people/"]'):
        print(person.string)


trade_spider(1)

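I tried something along the lines of

for person in soup.select('td > font > a[href^="/people/"]'):
    print(person.string)

but the genre is not a link, it is just text, so that does not work.

How can I get this data for each movie?

Find the "Genre: " text node and get its next sibling:

soup.find(text="Genre: ").next_sibling.text

Demo: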

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: response = requests.get("http://www.boxofficemojo.com/movies/?id=ironman3.htm")

In [4]: soup = BeautifulSoup(response.content)

In [5]: soup.find(text="Genre: ").next_sibling.text
Out[5]: u'Action / Adventure'
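
The same "find the label text, then take its next sibling" idea can probably be extended to the other fields. Below is a minimal sketch for Python 3, not a tested scraper for the live site. It assumes every label is a bare text node immediately followed by an element holding the value, as the demo above suggests for "Genre: ". The `SAMPLE_HTML` markup, its values (other than the genre), the label names, and the helper `extract_fields` are all hypothetical and only serve to illustrate the technique.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, simplified markup: each label is a plain text node
# followed directly by a <b> element that holds the value.
# Values other than the genre are made up for illustration.
SAMPLE_HTML = """
<table>
  <tr><td>Genre: <b>Action / Adventure</b></td>
      <td>Runtime: <b>2 hrs. 10 min.</b></td></tr>
  <tr><td>MPAA Rating: <b>PG-13</b></td>
      <td>Production Budget: <b>$100 million</b></td></tr>
</table>
"""

# Assumed label names; adjust them to whatever the real page uses.
LABELS = ["Genre", "Runtime", "MPAA Rating", "Production Budget"]


def label_matcher(label):
    """Return a filter that accepts text nodes starting with 'label:'."""
    prefix = label + ":"

    def matches(text):
        return text is not None and text.strip().startswith(prefix)

    return matches


def extract_fields(html, labels=LABELS):
    """Map each label to the text of the element right after its text node."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for label in labels:
        node = soup.find(string=label_matcher(label))
        sibling = node.next_sibling if node is not None else None
        # The sibling can be missing or a bare string; only use it if it is a tag.
        fields[label] = sibling.get_text(strip=True) if isinstance(sibling, Tag) else None
    return fields


if __name__ == "__main__":
    for name, value in extract_fields(SAMPLE_HTML).items():
        print(name, "->", value)
```

Matching on a stripped prefix instead of the exact string "Genre: " makes the lookup a little more tolerant of whitespace differences, and returning None for a missing label avoids an AttributeError when a movie page lacks one of the fields. To use it inside get_single_item_data, one would pass the downloaded page text to extract_fields, assuming the real page follows the same layout.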