Python web crawler class not working

python web web-crawler


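Recently I started building a simple web crawler. My initial code, which only iterated twice, worked fine, but when I tried to turn it into a class with exception handling, it stopped running.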

import re, urllib
class WebCrawler:
    """A Simple Web Crawler That Is Readily Extensible"""
    def __init__():
        size = 1
    def containsAny(seq, aset):
        for c in seq:
            if c in aset: return True
        return False

    def crawlUrls(url, depth):
        textfile = file('UrlMap.txt', 'wt')
        urlList = [url]
        size = 1
        for i in range(depth):
            for ee in range(size):
                if containsAny(urlList[ee], "http://"):
                    try:
                        webpage = urllib.urlopen(urlList[ee]).read()
                        break
                    except:
                        print "Following URL failed!"
                        print urlList[ee]
                    for ee in re.findall('''href=["'](.[^"']+)["']''',webpage, re.I):
                        print ee
                        urlList.append(ee)
                        size+=1
                        textfile.write(ee+'\n')

myCrawler = WebCrawler

myCrawler.crawlUrls("http://www.wordsmakeworlds.com/", 2)

Here is the error it produces:

Traceback (most recent call last):
  File "C:/Users/Noah Huber-Feely/Desktop/Python/WebCrawlerClass", line 33, in <module>
    myCrawler.crawlUrls("http://www.wordsmakeworlds.com/", 2)
TypeError: unbound method crawlUrls() must be called with WebCrawler instance as first argument (got str instance instead)


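You have two problems. The first is this line:

myCrawler = WebCrawler

You aren't creating an instance of WebCrawler; you are only binding the name myCrawler to WebCrawler (basically creating an alias for the class). You should do this instead:

myCrawler = WebCrawler()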
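To make the difference between the class and an instance concrete, here is a minimal sketch. The empty WebCrawler class below is a stand-in for the one in the question, and the snippet is written for Python 3 rather than the Python 2 used in the question.

```python
class WebCrawler:
    """Stand-in for the class in the question (trimmed down for illustration)."""


alias = WebCrawler       # no parentheses: just another name for the class
crawler = WebCrawler()   # parentheses: calls the class and builds an instance

print(alias is WebCrawler)              # True
print(isinstance(alias, WebCrawler))    # False: alias is the class itself
print(isinstance(crawler, WebCrawler))  # True
```

Only the second form gives you an object that methods can be called on in the usual way.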

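Then, on this line:

def crawlUrls(url, depth):

Python instance methods receive the receiver as the first argument of the method. It is conventionally called self, though technically you can name it anything you like. So you should change the method signature to:

def crawlUrls(self, url, depth):

(You will also need to do this for the other methods you've defined.)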
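As an illustration of how self is passed, the sketch below uses a hypothetical Greeter class (not part of the question's code). It also shows that a value assigned to self in __init__ is stored on the instance, whereas a plain local variable would be discarded when __init__ returns.

```python
class Greeter:
    """Hypothetical example class, not part of the question's code."""

    def __init__(self, name):
        self.name = name  # stored on the instance, not a throwaway local

    def greet(self, greeting):
        return greeting + ", " + self.name


g = Greeter("crawler")

# These two calls are equivalent: the first is shorthand for the second.
bound = g.greet("hello")
explicit = Greeter.greet(g, "hello")

print(bound)     # hello, crawler
print(explicit)  # hello, crawler
```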

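It should be

myCrawler = WebCrawler()

Note the parentheses.

When I did that, it returned this error: TypeError: __init__() takes no arguments (1 given)

Yes, you need to add the self parameter to __init__ (and to every other method). Also, what is the point of assigning size locally in __init__? I strongly suggest you follow a tutorial; you can't just guess and hope for the best.

When I do that, it still raises an error, which goes away when I call the containsAny() function as self.containsAny(). However, the program now just runs and exits quickly without printing anything to the screen. I'm still looking for an answer to this, so if you can spot the error, it would be much appreciated. Thanks.

@Noah Huber-Feely: you need to post a stack trace showing what happened.

The solution to this problem was right in front of everyone's eyes: I needed to remove the break statement from the try block.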
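Putting the fixes from this thread together, one possible reworking of the crawler might look like the sketch below. It is not the asker's final code. It assumes Python 3 (urllib.request instead of urllib.urlopen), and it replaces containsAny with a startswith check. It returns the list of URLs instead of writing UrlMap.txt, and it visits only newly found URLs at each depth level. The fetch parameter is a hypothetical hook added so the example can run without network access. The key point is that there is no break after a successful read: in the original, that break left the inner loop before the links were ever extracted.

```python
import re
import urllib.request

# Same pattern as in the question, compiled once.
HREF_RE = re.compile(r'''href=["'](.[^"']+)["']''', re.I)


class WebCrawler:
    """A simple web crawler, reworked from the question's code."""

    def __init__(self, fetch=None):
        # `fetch` is a hypothetical hook: any callable that takes a URL and
        # returns the page text. By default a real HTTP request is made.
        self.fetch = fetch or self._fetch_http

    @staticmethod
    def _fetch_http(url):
        with urllib.request.urlopen(url) as response:
            return response.read().decode("utf-8", errors="replace")

    def crawl_urls(self, url, depth):
        """Return every URL seen, following links `depth` levels deep."""
        url_list = [url]
        start = 0
        for _ in range(depth):
            end = len(url_list)  # only visit URLs found at the previous level
            for current in url_list[start:end]:
                if not current.startswith(("http://", "https://")):
                    continue
                try:
                    webpage = self.fetch(current)
                except Exception:
                    print("Following URL failed!", current)
                    continue  # nothing to parse for this URL
                # Reached on success, because there is no `break` above.
                for link in HREF_RE.findall(webpage):
                    url_list.append(link)
            start = end
        return url_list


# Offline usage with made-up pages standing in for a real site.
pages = {
    "http://example.com/": '<a href="http://example.com/a">a</a>',
    "http://example.com/a": '<a href="http://example.com/b">b</a>',
}
crawler = WebCrawler(fetch=lambda u: pages[u])
found = crawler.crawl_urls("http://example.com/", 2)
print(found)
# ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Calling WebCrawler() with no arguments would use real HTTP requests instead of the fake pages.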