Python: Beautiful Soup headers with table data


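I am scraping IMDB for cast members. The IMDB API does not have comprehensive cast/credits data. The final product I want is a three-column table that takes the data from all the tables on the page and arranges it like this: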

Produced by | Gary Kurtz | producer 

Produced by | George Lucas | executive producer

Music by    | John Williams |

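This uses Star Wars as the example.

The code below is almost there, but the output has a lot of unnecessary whitespace, and I am surely misusing .parent. What is the best way to find the h4 value above each table?

Here is the code: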

from bs4 import BeautifulSoup

# fname: path to the saved HTML page
with open(fname, 'r') as f:
    soup = BeautifulSoup(f.read(), 'html5lib')

for child in soup.find_all('td', {'class': 'name'}):
    # Python 2 print statement; the long .parent chain is the part in question
    print child.parent.text, child.parent.parent.parent.parent.parent.parent.text.encode('utf-8')

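I am trying to get values such as "Directed by" from these h4 headers.

Welcome to Stack Overflow. It looks like you can find the h4 elements and the tables together: they appear in pairs in the HTML, so you can zip them in a for loop. After that, you only need to get the text and format it. Change your code to: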

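# f is the open file handle from the question's code
soup = BeautifulSoup(f.read(), 'html5lib')
# h4 headings and tables come in pairs, so zip walks them together
for h4, table in zip(soup.find_all('h4'), soup.find_all('table')):
    # collapse all runs of whitespace in the heading
    header4 = " ".join(h4.text.strip().split())
    # one cleaned string per row; the "..." separator becomes "|"
    table_data = [" ".join(tr.text.strip().replace("\n", "").replace("...", "|").split()) for tr in table.find_all('tr')]
    print("%s | %s \n" % (header4, table_data))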

This will print:

Directed by | [u'George Lucas'] 

Writing Credits | [u'George Lucas | (written by)'] 

Cast (in credits order) verified as complete | ['', u'Mark Hamill | Luke Skywalker', u'Harrison Ford | Han Solo', u'Carrie Fisher | Princess Leia Organa', u'Peter Cushing | Grand Moff Tarkin',...]

Produced by | [u'Gary Kurtz | producer', u'George Lucas | executive producer', u'Rick McCallum | producer (1997 special version)'] 

Music by | [u'John Williams'] 

...
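
To get from the per-table lists printed above to the three-column layout the question asks for, each row can be split into its cells instead of flattening the whole row's text. The following is only a minimal sketch: the inline HTML sample, the `credit` class name, and the `squash` and `rows_from_html` helpers are assumptions made for illustration, not taken from IMDB's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the h4 + table pairing described above.
SAMPLE_HTML = """
<h4>Produced by</h4>
<table>
  <tr><td class="name"> Gary Kurtz </td><td>...</td><td class="credit"> producer </td></tr>
  <tr><td class="name"> George Lucas </td><td>...</td><td class="credit"> executive
      producer </td></tr>
</table>
<h4>Music by</h4>
<table>
  <tr><td class="name"> John Williams </td></tr>
</table>
"""


def squash(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def rows_from_html(html):
    """Return (header, name, credit) tuples, one per table row with a name cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # zip pairs the n-th h4 with the n-th table, as in the answer above
    for h4, table in zip(soup.find_all("h4"), soup.find_all("table")):
        header = squash(h4.get_text())
        for tr in table.find_all("tr"):
            name_cell = tr.find("td", class_="name")
            if name_cell is None:
                continue  # skip spacer rows that have no name cell
            credit_cell = tr.find("td", class_="credit")  # assumed class name
            credit = squash(credit_cell.get_text()) if credit_cell else ""
            rows.append((header, squash(name_cell.get_text()), credit))
    return rows


for row in rows_from_html(SAMPLE_HTML):
    print(" | ".join(row))
```

Note that `zip` stops at the shorter of the two lists, so this pairing only holds while every table on the page has exactly one h4 before it.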

This avoids relying on .parent altogether.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Find all section headers, e.g. "Produced by"
def get_header(url):
    bsObj = BeautifulSoup(urlopen(url), "html.parser")
    headers = bsObj.find("div", {"id": "fullcredits_content"}).findAll("h4", {"class": "dataHeaderWithBorder"})
    return headers

# Find all name cells, e.g. "Gary Kurtz"
def get_table(url):
    bsObj = BeautifulSoup(urlopen(url), "html.parser")
    table = bsObj.findAll("td", {"class": "name"})
    return table

url = "http://www.imdb.com/title/tt0076759/fullcredits"
header = get_header(url)
table = get_table(url)
# title = get_title(url)
for h in header:
    for t in table:
        print(h.get_text())
        print(t.get_text())
        print("............")
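
The nested loop above prints every header against every name. One way to link each name cell to its own heading is Beautiful Soup's `find_previous`, which returns the nearest matching element that comes earlier in the document. This is a rough sketch on an inline sample rather than the live page; the HTML snippet and the `credits_with_headers` helper are hypothetical and only mirror the class names used in the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample reusing the id and class names from the code above.
SAMPLE_HTML = """
<div id="fullcredits_content">
  <h4 class="dataHeaderWithBorder">Directed by</h4>
  <table><tr><td class="name"><a href="#"> George Lucas
  </a></td></tr></table>
  <h4 class="dataHeaderWithBorder">Music by</h4>
  <table><tr><td class="name"><a href="#"> John Williams
  </a></td></tr></table>
</div>
"""


def credits_with_headers(html):
    """Return (header, name) pairs, looking up each name's preceding h4."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for cell in soup.find_all("td", {"class": "name"}):
        # nearest h4 that appears before this cell in document order
        heading = cell.find_previous("h4")
        header_text = " ".join(heading.get_text().split()) if heading else ""
        name_text = " ".join(cell.get_text().split())
        pairs.append((header_text, name_text))
    return pairs


for header_text, name_text in credits_with_headers(SAMPLE_HTML):
    print(header_text, "|", name_text)
```

Because the lookup starts from each cell, the page only has to be parsed once, and a table without a heading simply gets an empty header instead of shifting every later pair.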

Thanks for the warm welcome :D This answer is perfect!

Thanks for the reply! I am getting a syntax error on table = bsObj.findAll("td", {"class":"name"}).