Python: unable to scrape image URLs with BeautifulSoup
I am trying to scrape this page. My scraping code is:
from bs4 import BeautifulSoup
import re
root_tag=["article",{"class":"story"}]
image_tag=["img",{"":""},"org-src"]
header=["h3",{"class":"story-title"}]
news_tag=["a",{"":""},"href"]
txt_data=["p",{"":""}]
import requests
ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
headers = {'User-Agent': ua2,
           'Accept': 'text/html,application/xhtml+xml,application/xml;' \
                     'q=0.9,image/webp,*/*;q=0.8'}
session = requests.Session()
response = session.get("website-link", headers=headers)
webContent = response.content
bs = BeautifulSoup(webContent, 'lxml')
all_tab_data = bs.findAll(root_tag[0], root_tag[1])
output=[]
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    match = re.search(r"(http(s?):)([/|.|\w|\s|-])*\.(?:jpg|gif|png|jpeg)", div_img)
    print(match)
    # match = re.search(r"([^\\s]+(\\.(?i)(jpg|png|gif|bmp))$)",div)
    if match != None:
        image_url = str(match.group(0))
    else:
        image_url = div.find(image_tag[0], image_tag[1]).get(image_tag[2])
    if image_url != None:
        if image_url[0] == '/' and image_url[1] != '/':
            image_url = main_url + image_url
        if image_url[0] == '/' and image_url[1] == '/':
            image_url = "https://" + image_url[2:]
    output.append(image_url)
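It only gives one image URL and then raises the error AttributeError: 'NoneType' object has no attribute 'get'.

You should probably try to reuse the parsing library rather than parsing parts of the markup yourself. Consider this approach: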
from bs4 import BeautifulSoup
import re

root_tag = ["article", {"class": "story"}]
image_tag = ["img", {"": ""}, "org-src"]
header = ["h3", {"class": "story-title"}]
news_tag = ["a", {"": ""}, "href"]
txt_data = ["p", {"": ""}]

# The page was downloaded once and saved to a local file named 'output':
# import requests
# ua1 = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
# ua2 = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit 537.36 (KHTML, like Gecko) Chrome'
# headers = {'User-Agent': ua2,
#            'Accept': 'text/html,application/xhtml+xml,application/xml;' \
#                      'q=0.9,image/webp,*/*;q=0.8'}
# session = requests.Session()
# response = session.get("https://www.reuters.com/energy-environment", headers=headers)
# webContent = response.content
# file = open('output', 'wb')
# file.write(webContent)
# file.close()

# Read the saved page back from disk.
file = open('output', 'r')
webContent = file.read()

bs = BeautifulSoup(webContent, 'html.parser')
all_tab_data = bs.findAll(*root_tag)

output = []
for div in all_tab_data:
    image_url = None
    div_img = str(div)
    # Re-parse the article markup instead of matching it with a regex.
    article_section = BeautifulSoup(div_img, 'html.parser')
    article_images = article_section.findAll(*image_tag)
    if article_images is not None:
        # Keep only the images that actually carry an 'org-src' attribute.
        output.extend([i.get('org-src') for i in article_images if i and i.get('org-src') is not None])
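
The AttributeError in the question arises because find() returns None when an article contains no matching img tag, so the chained .get() call fails. As a minimal sketch of the approach above, the snippet below collects every img per article with find_all() (which returns an empty list rather than None) and skips images without the attribute. The sample markup, the BASE_URL value, and the helper name extract_image_urls are assumptions made for illustration, not part of the original page or answer. It also swaps the manual '/' and '//' prefix handling from the question for urllib.parse.urljoin, which resolves both cases against a base URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<article class="story"><img org-src="//cdn.example.com/a.jpg"><h3 class="story-title">A</h3></article>
<article class="story"><h3 class="story-title">No image here</h3></article>
<article class="story"><img org-src="/images/b.png"><img src="c.gif"></article>
"""

# Assumed URL of the page the markup came from (illustrative only).
BASE_URL = "https://www.example.com/news/"


def extract_image_urls(html, base_url, attr="org-src"):
    """Return absolute image URLs found in <article class="story"> blocks."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for article in soup.find_all("article", class_="story"):
        # find_all() yields an empty list when nothing matches, so there is
        # no None to guard against at this level.
        for img in article.find_all("img"):
            raw = img.get(attr)  # None if the attribute is missing
            if raw:
                # urljoin handles '/path' and '//host/path' forms alike.
                urls.append(urljoin(base_url, raw))
    return urls


image_urls = extract_image_urls(SAMPLE_HTML, BASE_URL)
print(image_urls)
```

Articles without an image contribute nothing to the result instead of raising, and images that lack the org-src attribute are skipped.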