BeautifulSoup: extracting "img alt" contents when web scraping in Python

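I am working in Python 3. My goal is to extract the different values of a table and put them in separate lists.

The problem is that I cannot get the value of the img alt attribute inside a td.

This is my code: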

from bs4 import BeautifulSoup
import urllib.request

redditFile = urllib.request.urlopen("http://www.mtggoldfish.com/movers/online/all")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
all_tables = soup.find_all('table')

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')

#create a list
A=[]
B=[]
C=[]
D=[]

for row in right_table.findAll("tr"):
    cells = row.findAll('td')
    increment = row.findAll('span')
    colection = row.findAll('img')
    link = row.findAll('a')
    if len(cells) == 6:
        A.append(cells[0].find(text=True))
        B.append(increment[0].find(text=True))
        C.append(colection[0])
        D.append(link[0].find(text=True))
print(A)
print(B)
print(C)
print(D)

This code gives the following result:

['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
['+8.40', '+2.47', '+1.35', '+1.28', '+1.14', '+0.99', '+0.94', '+0.91', '+0.90', '+0.75']
[<img alt="ORI" class="sprite-set_symbols_ORI" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="PRM" class="sprite-set_symbols_PRM" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="8ED" class="sprite-set_symbols_8ED" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="EX" class="sprite-set_symbols_EX" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="TSB" class="sprite-set_symbols_TSB" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, <img alt="WL" class="sprite-set_symbols_WL" src="//assets1.mtggoldfish.com/assets/s-407aaa9c9786d606684c6967c47739c5.gif"/>, ...]
["Jace, Vryn's Prodigy", "Gaea's Cradle", 'Ensnaring Bridge', 'City of Traitors', 'Pendelhaven', 'Firestorm', 'Kor Spiritdancer', 'Scalding Tarn', 'Daybreak Coronet', 'Grove of the Burnwillows']

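But what I need in the colection variable is the IMG ALT value of each image; for example, the first IMG ALT value is ORI.

I don't know what I can do. Could you help me, please?

Thank you very much.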

Once you have the node instance, you can get the alt value like this:

alt_tag = img.attrs['alt']

Since you are getting a collection of img elements, you can iterate over it and retrieve the alt attribute of each element:

tags = []
collection = soup.findAll("img")
for img in collection:
    if 'alt' in img.attrs:
        tags.append(img.attrs['alt'])
# do whatever you need to do with your list of alt attributes
print(tags)
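
As a small illustration of why the check above matters, here is a minimal sketch using made-up markup (the HTML string is an assumption, not taken from the page). It shows two ways to skip images without an alt attribute: Tag.get, which returns None instead of raising KeyError, and Tag.has_attr.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the second image has no alt attribute.
html = '<div><img alt="ORI" src="a.gif"/><img src="b.gif"/></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag.get returns None when the attribute is absent,
# whereas img["alt"] would raise KeyError for the second image.
alts = [img.get("alt") for img in soup.find_all("img")]
print(alts)

# Keep only the images that actually carry an alt attribute.
present = [img["alt"] for img in soup.find_all("img") if img.has_attr("alt")]
print(present)
```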

If you only need the alt values from the img tags, just select the img tags from the table and extract the alt attribute:

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')

print([img["alt"] for img in right_table.select("img[alt]")])
['ORI', 'PRM', '8ED', 'EX', 'TSB', 'WL', 'ROE', 'ZEN', 'FUT', 'FUT']

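In your own loop you are using findAll when you only seem to need a single element. If you only want the first element, use find: row.find("span") and so on, and row.find("img")["alt"] will give you the alt value for each row. Looking at the page, there is only one img per tr, so you definitely don't need findAll.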
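
To make the find versus findAll point concrete, here is a minimal sketch on a hypothetical one-row table (the markup is an assumption modelled loosely on the structure described in the question, not copied from the site):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table for illustration only.
html = """
<table>
  <tr>
    <td>1</td>
    <td><span class="increase">+8.40</span></td>
    <td><img alt="ORI" src="s.gif"/></td>
    <td><a href="#">Jace, Vryn's Prodigy</a></td>
  </tr>
</table>
"""
row = BeautifulSoup(html, "html.parser").find("tr")

# find_all returns a list, so the first match has to be indexed out of it.
alt_via_find_all = row.find_all("img")[0]["alt"]

# find returns the first match directly.
alt_via_find = row.find("img")["alt"]

increment = row.find("span").get_text()
title = row.find("a").get_text()
print(alt_via_find, increment, title)
```

Both lookups give the same alt value; find just saves the list and the [0].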

If you want to recreate the table locally, I would put the data in a dict:

right_table = soup.find('table', class_='table table-bordered table-striped table-condensed movers-table')


table_dict = {}

for row in right_table.select("tr"):
    # spans with the "increase" class hold the increments
    increments = [s.text for s in row.select('span.increase')]
    # make sure we have some data in tr
    if increments:
        # rank/place is first text in td, could also use find("td",{"class":"first-right"})
        place = int(row.td.text)
        # the card title is the text of the a tag
        title = row.find("a").text
        increments.append(title)
        # get alt attribute from img tag
        increments.append(row.find("img")["alt"])
        table_dict[place] = increments

from pprint import pprint as pp

pp(table_dict)

Output:

{1: [u'+8.78', u'68.03', u'+15.00%', u"Jace, Vryn's Prodigy", 'ORI'],
 2: [u'+2.47', u'47.96', u'+5.00%', u"Gaea's Cradle", 'PRM'],
 3: [u'+1.95', u'20.37', u'+11.00%', u'Firestorm', 'WL'],
 4: [u'+1.73', u'23.91', u'+8.00%', u'Force of Will', 'VMA'],
 5: [u'+1.35', u'40.88', u'+3.00%', u'Ensnaring Bridge', '8ED'],
 6: [u'+1.28', u'44.02', u'+3.00%', u'City of Traitors', 'EX'],
 7: [u'+1.15', u'41.98', u'+3.00%', u'Time Walk', 'VMA'],
 8: [u'+1.01', u'28.68', u'+4.00%', u'Daze', 'NE'],
 9: [u'+1.01', u'19.96', u'+5.00%', u"Goryo's Vengeance", 'BOK'],
 10: [u'+1.00', u'3.99', u'+33.00%', u'Unearth', 'UL']}

What you see will exactly match the current table data. If you want all the winners, just change the url to http://www.mtggoldfish.com/movers-details/online/all/winners/dod

Or, if you want the fields split up, just pull the first increment:

# start from an empty dict so this snippet can run on its own
table_dict = {}

for row in right_table.select("tr"):
    # only rows that contain an "increase" span carry data
    increment = row.find('span',{"class":'increase'})
    if increment:
        increment = increment.text
        place = int(row.td.text)
        title = row.select("a[data-full-image]")[0].text
        alt = row.find("img")["alt"]
        table_dict[place] = {"title":title,"alt":alt, "inc":increment}


from pprint import pprint as pp

pp(table_dict)

Output:

{1: {'alt': 'ORI', 'inc': u'+8.78', 'title': u"Jace, Vryn's Prodigy"},
 2: {'alt': 'PRM', 'inc': u'+2.47', 'title': u"Gaea's Cradle"},
 3: {'alt': 'WL', 'inc': u'+1.95', 'title': u'Firestorm'},
 4: {'alt': 'VMA', 'inc': u'+1.73', 'title': u'Force of Will'},
 5: {'alt': '8ED', 'inc': u'+1.35', 'title': u'Ensnaring Bridge'},
 6: {'alt': 'EX', 'inc': u'+1.28', 'title': u'City of Traitors'},
 7: {'alt': 'VMA', 'inc': u'+1.15', 'title': u'Time Walk'},
 8: {'alt': 'NE', 'inc': u'+1.01', 'title': u'Daze'},
 9: {'alt': 'BOK', 'inc': u'+1.01', 'title': u"Goryo's Vengeance"},
 10: {'alt': 'UL', 'inc': u'+1.00', 'title': u'Unearth'}}
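
If you still want separate lists, as in the question, a dict of this shape can be unpacked afterwards. This is only a sketch: the table_dict below is a hand-typed two-row sample of the output above, and the list names are made up for illustration.

```python
# Hand-typed sample with the same shape as the output above (two rows only).
table_dict = {
    2: {"alt": "PRM", "inc": "+2.47", "title": "Gaea's Cradle"},
    1: {"alt": "ORI", "inc": "+8.78", "title": "Jace, Vryn's Prodigy"},
}

# Sort the ranks so every list follows the table order.
places = sorted(table_dict)
increments = [table_dict[p]["inc"] for p in places]
sets = [table_dict[p]["alt"] for p in places]
titles = [table_dict[p]["title"] for p in places]

print(places)
print(increments)
print(sets)
print(titles)
```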

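Thank you for your answer!!! But now, I don't know why, it shows me this error: Traceback (most recent call last): line 27, in tags.append(img.attrs['alt']) KeyError: 'alt'. Thanks again!!

A KeyError occurs when you use a Python dictionary and try to access a key that has not been defined in it. Because the alt attribute is not required, you should confirm that it is set on the element before trying to access it. I edited my answer to show this check.

Padraic, thanks for your support! Your print works fine. But my problem is that when I want to put this value in a variable, colection = row.find('img')['alt'], it shows this error: colection = row.find('img')['alt'] TypeError: 'NoneType' object is not subscriptable. Thank you very much for your explanation.

@CarlosRocaPin, if you use the above code as is, you will get all the data you need. I forgot to add the alt values, but they are in the edit now.

I have just read your last answer. Thank you very much!!!! It works perfectly! I will study the code. Thanks.

@CarlosRocaPin, no problem, you're welcome. I added some comments that will hopefully help.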
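
For anyone who hits the same TypeError: find returns None when a row has no matching tag, and None cannot be subscripted. A likely cause here (an assumption, since the failing row is not shown in the thread) is a header row without an img. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical table: the header row has no img, the data row has one.
html = """
<table>
  <tr><th>Rank</th><th>Set</th></tr>
  <tr><td>1</td><td><img alt="ORI" src="s.gif"/></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

alts = []
for row in soup.find_all("tr"):
    img = row.find("img")
    # find returns None for the header row; indexing None raises TypeError.
    if img is None:
        continue
    alts.append(img["alt"])

print(alts)
```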