Python BeautifulSoup web crawling: formatting output

python web-crawler

I am trying to crawl a website, and right now I am focusing on one specific movie page.

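I have two problems with this page. The first concerns the "foreign gross" amount (listed under the total lifetime grosses). I got the amount with this function: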

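import requests
from bs4 import BeautifulSoup

def getForeign(item_url):
    response = requests.get(item_url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Prints the amount to the console; the function itself returns None.
    print(soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True))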

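The problem is that I can print this amount to the console, but I can't append the values to a list or write them to a csv file. For the other data I needed from this site, I fetched each piece of information per movie, appended everything to a list, and then exported that list to a csv file.

How can I get this "foreign gross" as a separate amount for each movie? What do I need to change?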
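One way to read the symptom: a function that only prints hands `None` back to its caller, so there is nothing useful to append. Below is a minimal sketch of a returning variant. The name `parse_foreign`, the inline markup, and the amounts are all made up for illustration, and the function takes an HTML string instead of a URL so it runs without a network connection.

```python
from bs4 import BeautifulSoup

def parse_foreign(html):
    """Return the text of the cell next to the "Foreign:" label."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string="Foreign:")
    # Return the value instead of printing it, so callers can store it.
    return label.find_parent("td").find_next_sibling("td").get_text(strip=True)

# Hypothetical markup standing in for two movie pages (amounts are invented).
pages = [
    "<table><tr><td><b>Foreign:</b></td><td> $1,000 </td></tr></table>",
    "<table><tr><td><b>Foreign:</b></td><td> $2,500 </td></tr></table>",
]

# One separate amount per movie, collected in a list.
list_of_foreign = [parse_foreign(page) for page in pages]
print(list_of_foreign)
```

In the real crawler the HTML would presumably come from `requests.get(item_url).content`; the parsing part stays the same.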

The second problem concerns the list of actors for each movie. I have this function:

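def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    tempActors = []  # never filled: the names are only printed below
    # [7:] drops the leading "Actors:" label from the row text
    print(soup.find(text="Actors:").find_parent("tr").text[7:])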

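This prints the list of actors like so: Jennifer LawrenceJosh HutchersonLiam HemsworthElizabeth BanksStanley TucciWoody HarrelsonPhilip Seymour HoffmanJeffrey WrightJena MaloneAmanda PlummerSam ClaflinDonald SutherlandLenny Kravitz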

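I have the same problem here as with the foreign gross: I want to get each actor separately, add them all to a temporary list, and then append that list to a full list covering all movies. I did this for the list of directors, but all directors are links, whereas not every actor has an HTML link, so I can't do it the same way. The other problem right now is that there is no space between the individual actors.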
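A possible approach for the actors, sketched under the assumption that the names sit in the `td` next to the "Actors:" label and are separated by tags such as `<br>`: iterating over `stripped_strings` yields every text fragment on its own, whether or not it is wrapped in a link. The function name `parse_actors` and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup

def parse_actors(html):
    """Return the names found in the cell next to "Actors:" as a list."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string="Actors:")
    if label is None:
        return []  # no "Actors:" row on this page
    cell = label.find_parent("td").find_next_sibling("td")
    if cell is None:
        return []
    # stripped_strings yields each text fragment separately,
    # regardless of whether it is inside an <a> tag.
    return list(cell.stripped_strings)

# Hypothetical row: only some of the names are links.
sample = (
    "<table><tr><td>Actors:</td><td>"
    '<a href="/people/1">Actor One</a><br>'
    "Actor Two<br>"
    '<a href="/people/3">Actor Three</a>'
    "</td></tr></table>"
)

all_actors = []                           # full list covering all movies
all_actors.append(parse_actors(sample))   # one temporary list per movie
print(all_actors)
```

If a single name were split across nested tags it would come back as several fragments, so this is only a starting point for the real markup.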

Why don't my current functions work, and how can I fix them?

More code:

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.select('td > b > font > a[href^="/movies/?"]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)

            # getForeign() only prints, so this appends None
            listOfForeign.append(getForeign(href))

            listOfDirectors.append(getDirectors(href))
            # result is discarded: str.replace returns a new string
            str(listOfDirectors).replace('[','').replace(']','')

            getActors(href)

            title = link.string
            listOfTitles.append(title)
        page += 1
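Since the goal is a csv export, here is a rough sketch of how the per-movie lists could be combined into rows once the helper functions return their values. The function `rows_to_csv`, the column names, and the sample values are assumptions for illustration. It writes to an in-memory buffer; a real script would more likely open a file with `newline=""`.

```python
import csv
import io

def rows_to_csv(titles, foreign, actors):
    """Build csv text with one row per movie from three parallel lists."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "foreign_gross", "actors"])
    for title, gross, cast in zip(titles, foreign, actors):
        # Join the per-movie actor list into one cell.
        writer.writerow([title, gross, "; ".join(cast)])
    return buffer.getvalue()

# Invented sample data standing in for the crawled lists.
csv_text = rows_to_csv(
    ["Movie A", "Movie B"],
    ["$1,000", "$2,500"],
    [["Actor One", "Actor Two"], ["Actor Three"]],
)
print(csv_text)
```

The `csv` module quotes the amounts automatically because they contain commas, so they stay in a single column.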

Thank you for taking the time to write a well-written question :) I don't understand why you can't add the gross to a list?
I got an error: listOfForeign.append(soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)) AttributeError: 'NoneType' object has no attribute 'find_parent'
It works for me. Are you testing the code I posted above?
That code should work, but it is not what I want.
Could you try replacing the print statement in the getForeign method with return (soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)) and then calling the getForeign(href) method?
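The `AttributeError` quoted in the comments means that `soup.find(...)` returned `None`, which happens when a page has no text node that is exactly "Foreign:". A defensive sketch follows; the name `foreign_or_none` and the sample markup are hypothetical, and returning `None` for a missing value is just one possible convention.

```python
from bs4 import BeautifulSoup

def foreign_or_none(html):
    """Return the foreign gross text, or None when the page lacks it."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string="Foreign:")
    if label is None:
        return None  # no "Foreign:" label on this page
    cell = label.find_parent("td")
    value = cell.find_next_sibling("td") if cell is not None else None
    return value.get_text(strip=True) if value is not None else None

# Invented markup: one page with the label, one without.
with_label = "<table><tr><td><b>Foreign:</b></td><td>$1,000</td></tr></table>"
without_label = "<table><tr><td><b>Domestic:</b></td><td>$500</td></tr></table>"

print(foreign_or_none(with_label))
print(foreign_or_none(without_label))
```

Appending the result either way keeps the list aligned with the list of titles, with `None` marking movies that have no foreign figure.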