Python: running Scrapy from a script, need help understanding it

I'm fairly new to Python, so any help or advice is much appreciated.

I'm trying to build a script that runs a Scrapy spider. So far I have the code below:

from scrapy.contrib.loader import XPathItemLoader
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.crawler import CrawlerProcess


class QuestionItem(Item):
    """Our SO Question Item"""
    title = Field()
    summary = Field()
    tags = Field()

    user = Field()
    posted = Field()

    votes = Field()
    answers = Field()
    views = Field()


class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]

    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class,  "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')

            yield loader.load_item()     

class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results

        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()

        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)

def main():
    results = Queue()
    crawler = CrawlerWorker(MySpider(BaseSpider), results)
    crawler.start()
    for item in results.get():
        pass # Do something with item

I get the error below:

ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
C:\Python27\lib\site-packages\twisted\internet\win32eventreactor.py:64: UserWarn
ing: Reliable disconnection notification requires pywin32 215 or later
  category=UserWarning)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 374, in main
    self = load(from_parent)
  File "C:\Python27\lib\pickle.py", line 1378, in load
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 858, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "C:\Python27\lib\pickle.py", line 1124, in find_class
    __import__(module)
  File "Webscrap.py", line 53, in <module>
    class CrawlerWorker(Process):
NameError: name 'Process' is not defined
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
"PicklingError: <function remove at 0x07871CB0>: Can't pickle <function remove at 0x077F6BF0>: it's not found as weakref.remove".
I realize I'm doing something logically wrong. I'm new to this and can't see what it is. Can anyone help me get this code running?


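Ultimately, I just need a script that runs, scrapes the required data, and stores it in a database, but first I just want to get the scraping working. I thought this would run it, but no luck so far.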

I think you want a standalone spider/crawler. That is actually quite simple, although I didn't use a custom process:

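# Assumes an old Scrapy release (the one the question targets), where
# scrapy.conf.settings and CrawlerProcess.install()/configure() exist.
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess

class StandAloneSpider(CyledgeSpider):
    # A regular spider. CyledgeSpider is not part of Scrapy;
    # replace it with your own spider class.
    pass

settings.overrides['LOG_ENABLED'] = True
# more settings can be changed...

crawler = CrawlerProcess(settings)
crawler.install()
crawler.configure()

spider = StandAloneSpider()

crawler.crawl(spider)
crawler.start()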
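
If you would rather keep the separate process from the question, the traceback there suggests two things to address. The NameError arises because the script never imports Process (or Queue) from multiprocessing. The PicklingError is probably a consequence of how multiprocessing works on Windows: the Process object is pickled and handed to a freshly started interpreter, so everything stored on it in __init__ (here the crawler) has to be picklable. Below is a minimal sketch of the usual pattern, using only the standard library. Note that fake_crawl is a hypothetical stand-in for the actual Scrapy crawl, not a Scrapy function.

```python
from multiprocessing import Process, Queue


def fake_crawl(start_url):
    """Hypothetical stand-in for a Scrapy crawl; returns a list of item dicts."""
    return [{"title": "example question", "url": start_url}]


class CrawlerWorker(Process):
    def __init__(self, start_url, results):
        Process.__init__(self)
        # Keep only simple, picklable attributes here: on Windows the
        # Process object is pickled and sent to the child interpreter.
        self.start_url = start_url
        self.results = results

    def run(self):
        # run() executes inside the child process, so anything heavy or
        # unpicklable (crawler, signal handlers) should be created here.
        items = fake_crawl(self.start_url)
        self.results.put(items)


def main():
    results = Queue()
    worker = CrawlerWorker("http://stackoverflow.com/", results)
    worker.start()
    items = results.get()  # read before join() so the queue can drain
    worker.join()
    return items


if __name__ == "__main__":
    # Needed on Windows: the child re-imports this module, and the guard
    # keeps that import from starting yet another worker.
    for item in main():
        print(item)
```

In a real script, the body of run() is where the crawler would be created, configured, and started, with the collected items put on the queue at the end.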