Python crawler is ignoring links on the page

So I wrote a crawler for a friend of mine that goes through a big list of web pages returned as search results, pulls all the links off each page, checks whether they are already in the output file, and adds them if they are not. It took a lot of debugging, but it works great! Unfortunately, the little guy is very picky about which anchor tags it considers important enough to add.

Here's the code:

#!C:\Python27\Python.exe
from bs4 import BeautifulSoup
from urlparse import urljoin  # urljoin is a function included in urlparse
import urllib2
import requests  # not necessary, but keeping it here in case of future additions to the code

urls_filename = "myurls.txt"    # the input text file: a list of urls or objects to scan
output_filename = "output.txt"  # the output file that you will export to Excel
keyword = "skin"                # optional keyword, not used for this script. Ignore

with open(urls_filename, "r") as f:
    url_list = f.read()  # open the input text file and read its contents

with open(output_filename, "w") as f:
    for url in url_list.split("\n"):  # split the file into separate lines so it's easier to scan
        hdr = {'User-Agent': 'Mozilla/5.0'}  # (attempts to) tell the webpage that the program is a Firefox browser
        try:
            response = urllib2.urlopen(url)  # open the url from the text file
        except:
            print "Could not access", url
            continue
        page = response.read()  # the raw contents of the opened page
        soup = BeautifulSoup(page)  # feed the page to BeautifulSoup so it can parse it
        urls_all = soup('a')  # collect all the anchor tags on the page
        for link in urls_all:
            if 'href' in dict(link.attrs):
                url = urljoin(url, link['href'])  # combine a relative link, e.g. "/support/contactus.html", with the domain
            if url.find("'") != -1: continue  # skip hrefs that contain a quote
            url = url.split('#')[0]
            if url[0:4] == 'http' and url not in output_filename:  # check that the item is a webpage and not already in the list
                f.write(url + "\n")  # if it's not in the list, write it to output_filename
It works great, except for the following link:

This page has a bunch of links like "tvotech.asp?Submit=List&ID=796", and the crawler simply ignores them. The only anchor that makes it into my output file is the homepage itself. That's odd, because looking at the page source their anchors are completely standard, such as -
They have 'a' and 'href', so I see no reason for bs4 to just pass over them and only include the main link. Please help. I've tried removing the http from line 30, or changing it to https, and that only wipes out all the results; not even the homepage makes it into the output.

This happens because one of the links has a mailto in its href, which then gets assigned to the url variable and breaks the links that follow, since they no longer pass the url[0:4] == 'http' condition. It looks like this:

mailto:research@bidmc.harvard.edu?subject=Question about TVO Available Technology Abstracts
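
To see why a single mailto href derails everything after it, here is a minimal sketch (Python 2.7, as in the question; the example page URL is made up for illustration). urljoin will not resolve a relative path against a mailto base, so every later relative link comes back unchanged and then fails the url[0:4] == 'http' test:

from urlparse import urljoin

base = "http://www.example.com/tvo/tvotech.asp"  # hypothetical page URL for illustration

# A normal relative href resolves against the page URL as expected
print urljoin(base, "tvotech.asp?Submit=List&ID=796")
# http://www.example.com/tvo/tvotech.asp?Submit=List&ID=796

# A mailto href is returned as-is, and in the original loop it overwrites url
bad_base = urljoin(base, "mailto:research@bidmc.harvard.edu")
print bad_base
# mailto:research@bidmc.harvard.edu

# With the mailto string as the base, relative links are no longer resolved,
# so they never start with 'http' and the loop silently skips them
print urljoin(bad_base, "tvotech.asp?Submit=List&ID=796")
# tvotech.asp?Submit=List&ID=796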
You should either filter it out, or stop reusing the same variable url inside the loop; note the change to url1:

for link in urls_all:
    if 'href' in dict(link.attrs):
        url1 = urljoin(url, link['href'])  # combine a relative link, e.g. "/support/contactus.html", with the domain
    if url1.find("'") != -1: continue  # skip hrefs that contain a quote
    url1 = url1.split('#')[0]
    if url1[0:4] == 'http' and url1 not in output_filename:  # check that the item is a webpage and not already in the list
        f.write(url1 + "\n")  # if it's not in the list, write it to output_filename
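
If you would rather take the first suggestion and filter the bad hrefs out explicitly, one possible sketch (not from the original answer; the seen set is an assumption standing in for the original duplicate check against output_filename) is to inspect the scheme of each joined URL with urlparse and skip anything that is not http or https:

from urlparse import urljoin, urlparse

seen = set()  # assumed in-memory record of URLs already written
for link in urls_all:
    if 'href' not in dict(link.attrs):
        continue  # anchor without an href, nothing to record
    candidate = urljoin(url, link['href']).split('#')[0]
    if urlparse(candidate).scheme not in ('http', 'https'):
        continue  # drops mailto:, javascript:, ftp: and similar schemes
    if candidate not in seen:
        seen.add(candidate)
        f.write(candidate + "\n")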