How can I check a link's file type (PDF etc.) before downloading its content?
I have a problem that I find interesting. I collected a lot of links by web scraping, and I only want to download the content behind ordinary page links, so during the scraping stage I discarded every link with an extension such as .pdf, .avi, .jpeg and the like. As a result I have a list of scraped links without extensions, but when I start downloading their content, some of it still turns out to be PDFs, music files, images, or MS Word documents. How can I detect these hidden file types before downloading the content, so that I can skip such links?

Examples (the original example links were stripped when the question was archived):

PDF: here I should look for the string .pdf in the link
MS Word:
Image:
MP4: here I should look for the string mp4 in the link
CSS:

My code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# DOWNLOADER
# Grabs the text content of webpages and saves it to a file
# (eventually meant for a TinyDB database).
import time
import urllib
import requests
from bs4 import BeautifulSoup

start_time = time.time()

# Read the list of URLs.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_url_test.txt") as f:
    urls = f.readlines()

# Open the output file for the extracted content.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_vsebina_test.txt", 'wb') as v:

    # Process the URLs one by one.
    for url in urls:

        # HTTP
        if url[0:7] == "http://":
            print "URL " + url

            # Parse the HTML of the page.
            soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")

            # EXTRACT TEXT
            # Remove all script and style elements.
            for script in soup(["script", "style"]):
                script.extract()

            # Get the visible text.
            text = soup.get_text().encode('utf-8')

            # Break into lines and strip leading/trailing whitespace on each.
            lines = (line.strip() for line in text.splitlines())
            # Break multi-headlines into a line each.
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            # Drop blank lines.
            text = '\n'.join(chunk for chunk in chunks if chunk)

            # Manually repair mis-encoded Slavic characters.
            text = text.replace('ÄŤ', 'č')
            text = text.replace('ÄŚ', 'Č')
            text = text.replace('Ĺľ', 'ž')
            text = text.replace('Ĺ˝', 'Ž')
            text = text.replace('š', 'š')
            text = text.replace('Ĺ ', 'Š')
            text = text.replace('Â', '')
            text = text.replace('–', '')

            # Write the URL, a delimiter, and the extracted text.
            v.write(url)
            v.write("__delimiter_*_between_*_url_*_and_*_text__")
            v.write(text)
            # Delimiter separating the content of two pages. A crude way of
            # writing to a file, but needed because of problems with čšž characters.
            v.write("__delimiter_*_between_*_two_*_webpages__")

        # HTTPS
        elif url[0:8] == "https://":
            print "URL " + url

            r = requests.get(url, verify=True)
            html = r.text.encode('utf-8')
            #soup = BeautifulSoup(html, "lxml")
            soup = BeautifulSoup(html, "html.parser")

            # EXTRACT TEXT
            # Remove all script and style elements.
            for script in soup(["script", "style"]):
                script.extract()

            # Get the visible text.
            text = soup.get_text().encode('utf-8')

            # Break into lines and strip leading/trailing whitespace on each.
            lines = (line.strip() for line in text.splitlines())
            # Break multi-headlines into a line each.
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            # Drop blank lines.
            text = '\n'.join(chunk for chunk in chunks if chunk)

            # Manually repair mis-encoded Slavic characters.
            text = text.replace('ž', 'ž')
            text = text.replace('Ž', 'Ž')
            text = text.replace('Å¡', 'š')
            text = text.replace('Å ', 'Š')
            text = text.replace('Ä', 'č')
            #text = text.replace('•', '')

            # Write the URL, a delimiter, and the extracted text.
            v.write(url)
            v.write("__delimiter_*_between_*_url_*_and_*_text__")
            v.write(text)
            # Delimiter separating the content of two pages.
            v.write("__delimiter_*_between_*_two_*_webpages__")

        else:
            print "URL ERROR"

print "--- %s seconds ---" % round((time.time() - start_time), 2)
You can't, not with certainty. Even when the file extension is given explicitly in the link, you can never be 100% sure the file really is of that type. For example, a file named lookAtMe!.png could actually be an executable.

Comment: Yes @Jongware, but couldn't I break the process of opening the URL and reading its content into several steps, check what happens at each step, and only then download? I'm asking because I don't really understand what happens inside requests and BeautifulSoup.
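The staged approach the comment asks about is possible in practice: issue an HTTP HEAD request first and inspect the Content-Type response header, which tells you the server's declared media type before any body bytes are transferred. The sketch below is a minimal illustration in Python 3 (the script above is Python 2); the helper names `probe`, `is_html_type`, `should_skip` and the list of skipped MIME types are my own assumptions, not part of the original code, and the header is still only the server's claim, not a guarantee.

```python
import urllib.request

# Hypothetical list of MIME-type prefixes the scraper wants to skip.
SKIP_PREFIXES = ('application/pdf', 'application/msword',
                 'image/', 'audio/', 'video/', 'text/css')

def main_type(content_type):
    # Content-Type may carry parameters, e.g. "text/html; charset=utf-8";
    # keep only the media type itself.
    return content_type.split(';')[0].strip().lower()

def is_html_type(content_type):
    # True for the types BeautifulSoup is actually meant to parse.
    return main_type(content_type) in ('text/html', 'application/xhtml+xml')

def should_skip(content_type):
    # True for the binary/asset types the question wants to filter out.
    return any(main_type(content_type).startswith(p) for p in SKIP_PREFIXES)

def probe(url):
    # A HEAD request asks the server for the headers only;
    # the response body is never transferred.
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get('Content-Type', '')
```

Some servers answer HEAD requests incorrectly or not at all; a common fallback is `requests.get(url, stream=True)`, which lets you read `r.headers.get('Content-Type')` before any of the body has been consumed, and simply close the response if the type is one you want to skip.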