FileNotFoundError when scraping images with Python 3.x


I wrote this script to download images from a subreddit:

# A script to download pictures from reddit.com/r/HistoryPorn
from urllib.request import urlopen
from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import re
import os
import sys #TODO: sys.argv

print('Downloading images...')

# Create a directory for photographs
path_to_hist = '/home/tautvydas/Documents/histphoto'
os.chdir(path_to_hist)
if not os.path.exists('/home/tautvydas/Documents/histphoto'):
    os.mkdir(path_to_hist)

website = 'https://www.reddit.com/r/HistoryPorn'

# Go to the internet and connect to the subreddit, start a loop
for i in range(3):
    subreddit = urlopen(website)
    bs_subreddit = BeautifulSoup(subreddit, 'lxml')

    # Create a regex and find all the titles in the page
    remove_reddit_tag = re.compile('(\s*\(i.redd.it\)(\s*))')
    title_bs_subreddit = bs_subreddit.findAll('p', {'class': 'title'})

    # Get text off the page
    pic_name = []
    for item in title_bs_subreddit[1:]:
        item = item.get_text()
        item = remove_reddit_tag.sub('', item)
        pic_name.append(item)

    # Get picture links
    pic_bs_subreddit = bs_subreddit.findAll('div', {'data-url' : re.compile('.*')})
    pic_img = []
    for pic in pic_bs_subreddit[1:]:
        pic_img.append(pic['data-url'])

    # Zip all info into one
    name_link = zip(pic_name, pic_img)
    for i in name_link:
        urlretrieve(i[1],i[0])


    # Click next
    for link in bs_subreddit.find('span', {'class' : 'next-button'}).children:
        website = link['href']
However, I get this FileNotFoundError:

Downloading images...
Traceback (most recent call last):
  File "gethist.py", line 44, in <module>
    urlretrieve(i[1],i[0])
  File "/home/tautvydas/anaconda3/lib/python3.6/urllib/request.py", line 258, in urlretrieve
    tfp = open(filename, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: 'Preparation of rocket carrying test instruments, Kauai. June 29, 1962 [2880x1620] https://www.topic.com/a-crimson-fracture-in-the-sky'

What is the problem? The links retrieved from 'data-url' are fine, and they work when clicked. Could the problem be that the name contains a hyperlink? Or that the name is too long? All the other images before this one were downloaded without any issue.

The problem here is related to the collected names: they contain the picture's source as a URL string, and that string gets misinterpreted as a folder path.
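A minimal reproduction of that failure mode (the title string below is abbreviated from the traceback above): open(), which urlretrieve calls internally, treats everything before the last '/' in the filename as a directory path, and since the scraped title ends with the picture's source URL, that directory does not exist.

```python
# The scraped title ends with a URL, so it contains '/' characters.
# open() interprets everything before the last '/' as a directory
# path, which does not exist on disk, hence FileNotFoundError.
bad_name = ('Rocket test, Kauai. June 29, 1962 [2880x1620] '
            'https://www.topic.com/a-crimson-fracture-in-the-sky')

try:
    with open(bad_name, 'wb') as f:
        pass
except FileNotFoundError as e:
    print('FileNotFoundError:', e.filename)
```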

You may want to sanitize the text to avoid special, annoying characters, and perhaps make the names a bit shorter, but I would also recommend changing the pattern to make the results more reliable: you can parse only the <a> tag that contains the title, instead of the whole <p> that also contains the link.

Also, instead of building a zip from two separate loops, you can create a list of the main blocks by searching for the class "thing" (equivalent to findAll('div', {'data-url': re.compile('.*')})), and then run relative queries on each block to find its title and url:

[...]
# Pattern that strips the "(i.redd.it)" tag appended to each title
remove_reddit_tag = re.compile('(\s*\(i.redd.it\)(\s*))')

name_link = []
# Each post is wrapped in a div with class "thing"; query the title
# and url relative to that block instead of zipping two separate lists
for block in bs_subreddit.findAll('div', {'class': 'thing'})[1:]:
    item = block.find('a', {'class': 'title'}).get_text()
    title = remove_reddit_tag.sub('', item)[:100]  # shorten long titles

    url = block.get('data-url')
    name_link.append((title, url))
    print(url, title)

for title, url in name_link:
    urlretrieve(url, title)
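If you do want to keep a full title as the filename, sanitizing it first avoids the crash; a minimal sketch (the helper name and the replacement rules are my own choices, not part of the original answer):

```python
import re

def safe_filename(title, max_len=100):
    """Make a scraped title safe to use as a file name."""
    # Replace path separators and other characters that are
    # invalid or awkward in file names with a dash
    cleaned = re.sub(r'[/\\:*?"<>|]', '-', title)
    # Collapse whitespace runs, trim, and cap the length
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned[:max_len]

title = ('Rocket test, Kauai. June 29, 1962 [2880x1620] '
         'https://www.topic.com/a-crimson-fracture-in-the-sky')
print(safe_filename(title))
```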