Python web crawling with BeautifulSoup: getting text and links

python web-scraping web-crawler

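The site I am trying to crawl is http://www.boxofficemojo.com, and right now I am focused on one specific movie page. There are two things I am having trouble getting from that page. First, I need the "Foreign" gross amount (under Total Lifetime Grosses). I am not sure how to do this, because when I inspect the element it does not seem to have a tag of its own, and there is a lot of CSS markup around it. How can I get this piece of data?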

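Second, I am trying to get the list of actors for each movie. I have managed to get all the actors that have a link attached (by searching for a href tags), but I cannot get the actors that have no link.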

import requests
from bs4 import BeautifulSoup

# details(), getDirectors() and the listOf* lists are defined elsewhere
# in the script (not shown here).


def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        # Attribute values containing "/" or "?" must be quoted in CSS selectors
        for link in soup.select('td > b > font > a[href^="/movies/?"]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)

            listOfDirectors.append(getDirectors(href))
            listOfActors.append(getActors(href))

            title = link.string
            listOfTitles.append(title)
        page += 1


def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    tempActors = []
    # Matches only the actors whose names are wrapped in a link
    for actor in soup.select('td > font > a[href^="/people/chart/?view=Actor"]'):
        tempActors.append(str(actor.string))
    return tempActors

This obviously does not work for actors without links. I tried:

for actor in soup.findAll('br', {'class', 'mp_box_content'}):
     tempActors.append(str(actor.string))

But this does not work; it adds nothing to the list. How can I get all the actors, whether or not they have links?

To get the foreign gross, find the element containing the text "Foreign:", then take the next td sibling of its td parent:

In [4]: soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
Out[4]: u'$440,244,916'
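The same lookup can be tried offline as a minimal sketch. The HTML fragment, the "Domestic:" placeholder value, and the helper name `lifetime_gross` are assumptions made for illustration; the real page markup may differ. Recent Beautiful Soup releases prefer the `string=` argument over the older `text=` spelling, so the sketch uses `string=`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a label cell followed by a value cell.
# The "Domestic:" amount is a made-up placeholder.
SAMPLE_HTML = (
    "<table>"
    "<tr><td>Domestic:</td><td><b>$100</b></td></tr>"
    "<tr><td><a href='/intl/'>Foreign:</a></td><td>$440,244,916</td></tr>"
    "</table>"
)


def lifetime_gross(soup, label):
    """Return the text of the td that follows the td holding `label`, or None."""
    node = soup.find(string=label)          # the text node itself
    if node is None:
        return None
    cell = node.find_parent("td")           # the cell that holds the label
    if cell is None:
        return None
    value = cell.find_next_sibling("td")    # the neighbouring value cell
    return value.get_text(strip=True) if value is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(lifetime_gross(soup, "Foreign:"))
```

Returning None when the label is missing keeps a crawl over many movie pages from crashing on a page that has no foreign gross listed.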

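For the actors, a similar technique applies: find the "Actors:" text, go up to its tr parent, and collect all the text nodes inside it (text=True), as in the In [5] session below.

Note that this has only been shown to work for this particular page. Test it on other movie pages to make sure it produces the desired results.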

In [5]: soup.find(text="Actors:").find_parent("tr").find_all(text=True)[1:]
Out[5]: 
[u'Jennifer Lawrence',
 u'Josh Hutcherson',
 u'Liam Hemsworth',
 u'Elizabeth Banks',
 u'Stanley Tucci',
 u'Woody Harrelson',
 u'Philip Seymour Hoffman',
 u'Jeffrey Wright',
 u'Jena Malone',
 u'Amanda Plummer',
 u'Sam Claflin',
 u'Donald Sutherland',
 u'Lenny Kravitz']
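As a rough, self-contained sketch of the actor lookup, the snippet below runs the same idea against a small hand-written fragment that mixes linked and unlinked names. The fragment, the href values, and the helper name `actor_names` are hypothetical. Instead of slicing with `[1:]`, this variant filters out the label and any whitespace-only nodes, which is a small change from the session above that tolerates extra whitespace in the markup.

```python
from bs4 import BeautifulSoup

# Hypothetical row: some actor names are links, others are bare text
# separated by <br> tags.
SAMPLE_HTML = """
<table>
  <tr>
    <td>Actors:</td>
    <td><font>
      <a href="/people/chart/?view=Actor&amp;id=a.htm">Jennifer Lawrence</a><br>
      Josh Hutcherson<br>
      <a href="/people/chart/?view=Actor&amp;id=b.htm">Liam Hemsworth</a>
    </font></td>
  </tr>
</table>
"""


def actor_names(soup, label="Actors:"):
    """Collect every non-empty text node in the row that holds `label`."""
    node = soup.find(string=label)
    if node is None:
        return []
    row = node.find_parent("tr")
    if row is None:
        return []
    texts = [s.strip() for s in row.find_all(string=True)]
    # Drop whitespace-only nodes and the label itself
    return [t for t in texts if t and t != label]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(actor_names(soup))
```

Because the row's text nodes are collected regardless of which tag wraps them, linked and unlinked names come back in document order.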