Python: unable to get a list of URLs

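I am trying to use the script below. Why doesn't it retrieve the list of URLs for this site? It works on other websites.

At first I thought the problem was that robots.txt disallows crawling, but when I run the script it does not return an error.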

import mechanize
from urllib.parse import urljoin, urlparse

url = "https://www.danmurphys.com.au"

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt

urls = [url]     # queue of pages still to crawl
visited = [url]  # every URL seen so far

while len(urls) > 0:
    # Pop once, up front, so a failure cannot discard a second URL.
    current = urls.pop(0)
    try:
        br.open(current)
        for link in br.links():
            # Resolve relative links, then keep only the host and path.
            parts = urlparse(urljoin(link.base_url, link.url))
            if parts.hostname is None:
                continue  # e.g. mailto: or javascript: links
            new_url = "http://" + parts.hostname + parts.path

            if new_url not in visited and urlparse(url).hostname in new_url:
                visited.append(new_url)
                urls.append(new_url)
                print(new_url)
    except Exception:
        print("error")

You need a different tool to scrape that URL, one that can run JavaScript, because the Mechanize library cannot. If you print the response body:

r = br.open(urls[0])
html = r.read()  # raw body of the response
print(html)
you will see this output:

<noscript>Please enable JavaScript to view the page content.</noscript>
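
The check above can also be done in code before crawling, so the script reports the real cause instead of silently finding no links. The following is only a rough sketch using the standard library: PageProbe and looks_javascript_only are hypothetical names, not part of Mechanize, and the heuristic (no links at all plus a noscript message mentioning JavaScript) is an assumption that may not fit every site.

```python
from html.parser import HTMLParser


class PageProbe(HTMLParser):
    """Collects <a href> targets and the text found inside <noscript> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.noscript_text = []
        self._in_noscript = False

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._in_noscript = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "noscript":
            self._in_noscript = False

    def handle_data(self, data):
        if self._in_noscript and data.strip():
            self.noscript_text.append(data.strip())


def looks_javascript_only(html):
    """Heuristic: the page has no links and its <noscript> text mentions JavaScript."""
    probe = PageProbe()
    probe.feed(html)
    mentions_js = any("javascript" in text.lower() for text in probe.noscript_text)
    return not probe.links and mentions_js


placeholder = (
    "<html><body><noscript>Please enable JavaScript to view "
    "the page content.</noscript></body></html>"
)
if looks_javascript_only(placeholder):
    print("Page needs JavaScript; a plain HTTP client will find no links.")
```

With Mechanize, the string passed in would be the decoded response body, for example r.read().decode("utf-8", errors="replace"). If the function returns True for the start page, the crawler has nothing to follow and a JavaScript-capable tool is needed.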