Python: unable to scrape image URLs using BeautifulSoup

I am trying to scrape this site. My scraping code is:

from bs4 import BeautifulSoup
import re

root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]



import requests
ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website-link", headers=headers)
webContent = response.content


bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])

output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    print(match)
    # match = re.search(r"([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)",div)
    if match != None:
        image_url = str(match.group(0))
    else:
        image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    if image_url !=None:
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url
        if image_url[0] == '/' and image_url[1] == '/':
            image_url="https://" + image_url[2:]
    output.append(image_url)

It returns only one image URL and then raises the error AttributeError: 'NoneType' object has no attribute 'get'.

You should probably reuse the parsing library rather than parsing the pieces yourself. Consider this approach:

from bs4 import BeautifulSoup
import re

root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]

# The download step is commented out; the page was saved once to a local file named 'output'.
# import requests
# ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
# ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
# headers = {'User-Agent': ua2,
#            'Accept': 'text/html,application/xhtml+xml,application/xml;' \
#                      'q=0.9,image/webp,*/*;q=0.8'}
# session = requests.Session()
# response = session.get("https://www.reuters.com/energy-environment", headers=headers)
# webContent = response.content
# file = open('output', 'wb')
# file.write(webContent)
# file.close()

file = open('output', 'r')
webContent = file.read()

bs = BeautifulSoup(webContent, 'html.parser')
all_tab_data = bs.findAll(*root_tag)

output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    # Re-parse each article and let BeautifulSoup find the images instead of using a regex.
    article_section = BeautifulSoup(div_img, 'html.parser')
    article_images = article_section.findAll(*image_tag)
    if article_images is not None:
        # Keep only the images that actually carry an 'org-src' attribute.
        output.extend([i.get('org-src') for i in article_images if i and i.get('org-src') is not None])
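
The AttributeError in the question arises because find returns None when an article contains no matching img tag, and .get is then called on that None. Below is a minimal sketch of guarding against that case. The sample markup, the BASE_URL value and the extract_image_urls helper are hypothetical stand-ins for illustration, not taken from the original post or the real page. It also uses urljoin from the standard library in place of the manual '/' and '//' prefix handling, which is a swapped-in technique rather than the asker's method.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<article class="story">
  <h3 class="story-title">First</h3>
  <img org-src="//cdn.example.com/a.jpg">
</article>
<article class="story">
  <h3 class="story-title">Second (no image)</h3>
</article>
<article class="story">
  <h3 class="story-title">Third</h3>
  <img org-src="/images/c.png">
</article>
"""

# Assumed base URL used to resolve relative paths (not from the original post).
BASE_URL = "https://www.example.com/"


def extract_image_urls(html, base_url):
    """Return absolute image URLs for every story article that has one."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for article in soup.find_all("article", class_="story"):
        img = article.find("img")
        if img is None:
            # No <img> in this article: skip it instead of calling .get on None.
            continue
        src = img.get("org-src")
        if src:
            # urljoin resolves both '/path' and '//host/path' forms against base_url.
            urls.append(urljoin(base_url, src))
    return urls


image_urls = extract_image_urls(SAMPLE_HTML, BASE_URL)
print(image_urls)
```

Articles without an image are skipped here; appending None instead would keep the output aligned one-to-one with the articles, if that is what the rest of the scraper expects.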