How can I check a link's extension before downloading the content?


I have a problem that I find interesting. I have collected a lot of links by web scraping, and I want to download content only from ordinary links, so at the scraping stage I discarded every link with an extension such as .pdf, .avi or .jpeg.
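
For reference, here is a minimal sketch of that kind of scrape-stage filter. The helper name and the extension list are my own illustration, not the asker's actual code:

from urlparse import urlparse   # Python 2; on Python 3 use urllib.parse

# Illustrative list of extensions to skip while scraping.
SKIP_EXTENSIONS = ('.pdf', '.avi', '.jpeg', '.jpg', '.png',
                   '.mp3', '.mp4', '.doc', '.docx', '.css')

def looks_like_file(url):
    # Test only the path component, so a query string such as
    # "page.pdf?id=1" cannot hide the extension.
    return urlparse(url).path.lower().endswith(SKIP_EXTENSIONS)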

So I have a list of scraped links without extensions, but when I start downloading the content, some of it turns out to be PDFs, music files, images or MS Word documents. How can I detect the hidden type of such a link and skip those files before downloading the content?

Examples:

PDF:

PDF: here I should look for the string .pdf in the link

MS Word:

Image:

MP4: here I should look for the string mp4 in the link

CSS:

My code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#
# DOWNLOADER
# To grab the text content of webpages and save it to a TinyDB database.

import time, urllib, requests
from bs4 import BeautifulSoup

start_time = time.time()

# Open file with urls.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_url_test.txt") as f:
    urls = f.readlines()

# Open file to write content to.
with open("Q:/SIIT/JV_Marko_Boro/Detector/test_podjetja_2015/podjetja_0_100_vsebina_test.txt", 'wb') as v:


    # Read the urls one by one.
    for url in urls:

        # readlines() keeps the trailing newline; strip it so the
        # prefix checks and urlopen get a clean url.
        url = url.strip()


        # HTTP
        if url.startswith("http://"):

            print "URL  " + url
            # Fetch the page and parse its HTML.
            soup = BeautifulSoup(urllib.urlopen(url).read(), "html.parser")

            # EXTRACT TEXT
            # kill all script and style elements
            for script in soup(["script", "style"]):
                script.extract()    # rip it out
            # get text
            text = soup.get_text().encode('utf-8')  
            # break into lines and remove leading and trailing space on each
            lines = (line.strip() for line in text.splitlines())
            # break multi-headlines into a line each
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            # drop blank lines
            text = '\n'.join(chunk for chunk in chunks if chunk)

            # Repair Slavic characters mangled when UTF-8 bytes are
            # mis-read as a single-byte encoding (Windows-1250).
            text = text.replace('ÄŤ', 'č')
            text = text.replace('ÄŚ', 'Č')
            text = text.replace('Ĺľ', 'ž')
            text = text.replace('Ĺ˝', 'Ž')
            text = text.replace('Ĺˇ', 'š')
            text = text.replace('Ĺ ', 'Š')

            # Strip stray artefacts left over from the decoding.
            text = text.replace('Â', '')
            text = text.replace('–', '')

            # Write url to file.
            v.write(url)
            # Write delimiter between url and text
            v.write("__delimiter_*_between_*_url_*_and_*_text__")

            v.write(text)
            # Delimiter to separate contents. A clumsy way of writing to the
            # file, but it works around problems with čšž characters.
            v.write("__delimiter_*_between_*_two_*_webpages__")



        # HTTPS
        elif url.startswith("https://"):

            print "URL  " + url

            r = requests.get(url, verify=True)
            html = r.text.encode('utf-8')
            soup = BeautifulSoup(html, "html.parser")

            # EXTRACT TEXT
            # kill all script and style elements
            for script in soup(["script", "style"]):
                script.extract()    # rip it out
            # get text
            text = soup.get_text().encode('utf-8')
            # break into lines and remove leading and trailing space on each
            lines = (line.strip() for line in text.splitlines())
            # break multi-headlines into a line each
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            # drop blank lines
            text = '\n'.join(chunk for chunk in chunks if chunk)

            # Repair Slavic characters mangled when UTF-8 bytes are
            # mis-read as Latin-1.
            text = text.replace('Å¾', 'ž')
            text = text.replace('Å½', 'Ž')
            text = text.replace('Å¡', 'š')
            text = text.replace('Å ', 'Š')
            text = text.replace('Ä', 'č')

            # Write url to file.
            v.write(url)
            # Write delimiter between url and text
            v.write("__delimiter_*_between_*_url_*_and_*_text__")

            v.write(text)
            # Delimiter to separate contents. A clumsy way of writing to the
            # file, but it works around problems with čšž characters.
            v.write("__delimiter_*_between_*_two_*_webpages__")

        else:
            print "URL ERROR  " + url

print "--- %s seconds ---" % round((time.time() - start_time),2)

You can't. Even when a file extension is explicitly given in the link, you can't be 100% sure the file really is of that type. For example, a file named lookAtMe!.png might be an executable. — Yes @Jongware, but couldn't I break the process of opening the url and looking at the content into several steps, and check what happens at each step before downloading? I'm asking because I don't really know what happens inside requests and BeautifulSoup.
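
One way to do that stepwise check is to ask the server for the response headers first and look at Content-Type, downloading the body only when it is HTML. A minimal sketch using requests; the helper name and the example urls are illustrative, not from the original post:

import requests

def content_type_of(url):
    # HEAD transfers the headers only, no body. Some servers reject
    # HEAD, so fall back to a streamed GET that is closed before the
    # body is downloaded.
    try:
        r = requests.head(url, allow_redirects=True, timeout=10)
        if r.status_code >= 400:
            r = requests.get(url, stream=True, timeout=10)
            r.close()
    except requests.RequestException:
        return None
    # "text/html; charset=utf-8" -> "text/html"
    return r.headers.get('Content-Type', '').split(';')[0].strip().lower()

for url in ["http://example.com/", "http://example.com/report.pdf"]:
    ctype = content_type_of(url)
    if ctype == 'text/html':
        print "download: " + url
    else:
        print "skip %s (Content-Type: %s)" % (url, ctype)

Note that Content-Type is only the server's claim: as the comment above says, the one fully reliable check is to inspect the first bytes of the file itself (its magic number) once downloading starts.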