
Python: unable to scrape image URLs with BeautifulSoup


I am trying to scrape this page. My scraping code is:

from bs4 import BeautifulSoup
import re

root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]



import requests
ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website-link", headers=headers)
webContent = response.content


bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])

output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    print(match)
    # match = re.search(r"([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)",div)
    if match is not None:
        image_url = str(match.group(0))
    else:
        image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    if image_url is not None:
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url  # main_url (the site's base URL) must be defined earlier
        if image_url[0] == '/' and image_url[1] == '/':
            image_url = "https://" + image_url[2:]
    output.append(image_url)
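As an aside, the two prefix checks above (root-relative `/...` and protocol-relative `//...`) can be replaced by the standard library's `urllib.parse.urljoin`, which resolves both forms, plus already-absolute URLs, against a base page in one call. A minimal sketch; the base URL here is a placeholder, not the site from the question:

```python
from urllib.parse import urljoin

# Placeholder base page the links were scraped from (for illustration only)
base = "https://example.com/news/story.html"

# urljoin resolves each kind of reference against the base page:
print(urljoin(base, "/img/a.jpg"))               # root-relative -> https://example.com/img/a.jpg
print(urljoin(base, "//cdn.example.com/b.png"))  # protocol-relative -> https://cdn.example.com/b.png
print(urljoin(base, "https://other.com/c.gif"))  # already absolute -> unchanged
```

This also handles plain relative paths (e.g. `img/a.jpg`), which the manual checks silently pass through unmodified.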

It only gives one image URL and then raises the error AttributeError: 'NoneType' object has no attribute 'get'
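Part of the reason only one URL comes back is that `re.search` stops at the first match in a string; `re.finditer` (or `re.findall`) walks every match. The error then comes from the fallback branch: when neither the regex nor `div.find(...)` matches, `find` returns `None` and calling `.get()` on it raises the `AttributeError`. A small demonstration of the first point, using the question's pattern unchanged on made-up sample HTML:

```python
import re

# The pattern from the question, unchanged (note: the character class also
# admits literal '|' and whitespace, so it can overreach on some inputs)
pattern = r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)"

html = '<img src="https://a.com/x.jpg"> <img src="https://a.com/y.png">'

# re.search stops at the first hit...
first = re.search(pattern, html)
print(first.group(0))   # https://a.com/x.jpg

# ...while re.finditer yields every hit
all_urls = [m.group(0) for m in re.finditer(pattern, html)]
print(all_urls)         # ['https://a.com/x.jpg', 'https://a.com/y.png']
```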

You should probably rely on the parsing library instead of hand-parsing parts of the markup yourself. Consider this approach:

from bs4 import BeautifulSoup
import re

root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]

# import requests
# ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
# ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
# headers = {'User-Agent': ua2,
#            'Accept': 'text/html,application/xhtml+xml,application/xml;' \
#                      'q=0.9,image/webp,*/*;q=0.8'}
# session = requests.Session()
# response = session.get("https://www.reuters.com/energy-environment", headers=headers)
# webContent = response.content
# file = open('output', 'wb')
# file.write(webContent)
# file.close()

file = open('output', 'r')
webContent = file.read()

bs = BeautifulSoup(webContent, 'html.parser')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])

output = []
for div in all_tab_data:
    article_images = div.findAll(image_tag[0], image_tag[1])
    if article_images is not None:
        output.extend([i.get('org-src') for i in article_images
                       if i and i.get('org-src') is not None])
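If installing bs4 is not an option, the same extraction can also be sketched with only the standard library's html.parser. The class name `story` and the lazy-load attribute `org-src` are taken from the question; the sample HTML below is invented for illustration:

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collects image URLs from <img> tags inside <article class="story"> blocks."""

    def __init__(self):
        super().__init__()
        self.in_story = 0   # nesting depth of matching <article> tags
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "article" and "story" in attrs.get("class", "").split():
            self.in_story += 1
        elif tag == "img" and self.in_story:
            # Prefer the lazy-load attribute, fall back to plain src
            url = attrs.get("org-src") or attrs.get("src")
            if url:
                self.urls.append(url)

    def handle_endtag(self, tag):
        if tag == "article" and self.in_story:
            self.in_story -= 1

# Invented sample HTML for illustration
html = (
    '<article class="story"><img org-src="https://a.com/1.jpg"></article>'
    '<div><img src="https://a.com/skip.jpg"></div>'
    '<article class="story"><img src="https://a.com/2.png"></article>'
)
parser = ImgSrcCollector()
parser.feed(html)
print(parser.urls)  # ['https://a.com/1.jpg', 'https://a.com/2.png']
```

A real parser like this is more robust than the regex approach because it never matches across tag boundaries, at the cost of a little bookkeeping for the enclosing `<article>` state.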