Python 将刮取的数据加载到Postgresql中
我结合了一些关于网络抓取的教程,制作了一个简单的网络爬虫,可以抓取新发布的问题。我想将它们加载到我的postgresql数据库中,但我的爬虫程序显示的一个解码错误给我带来了麻烦 错误:Python 将刮取的数据加载到Postgresql中,python,postgresql,Python,Postgresql,我结合了一些关于网络抓取的教程,制作了一个简单的网络爬虫,可以抓取新发布的问题。我想将它们加载到我的postgresql数据库中,但我的爬虫程序显示的一个解码错误给我带来了麻烦 错误: 2015-06-09 06:07:10+0200 [stack] ERROR: Error processing {'title': u'Laravel 5 Confused when implements ShoudlQueue', 'url': u'/questions/30722718/larav
2015-06-09 06:07:10+0200 [stack] ERROR: Error processing {'title': u'Laravel 5 Confused when implements ShoudlQueue',
'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue'}
Traceback (most recent call last):
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/middleware.py", line 62, in _process_chain
return process_chain(self.methods[methodname], obj, *args)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 65, in process_chain
d.callback(input)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 393, in callback
self._startRunCallbacks(result)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 501, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/petarp/Documents/PyScraping/RealPython/WebScraping/stack/stack/pipelines.py", line 27, in process_item
session.commit()
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 790, in commit
self.transaction.commit()
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 392, in commit
self._prepare_impl()
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 372, in _prepare_impl
self.session.flush()
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2004, in flush
self._flush(objects)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2122, in _flush
transaction.rollback(_capture_exception=True)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2086, in _flush
flush_context.execute()
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 373, in execute
rec.execute(self)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 532, in execute
uow
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 174, in save_obj
mapper, table, insert)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 761, in _emit_insert_statements
execute(statement, params)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
return meth(self, multiparams, params)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
context)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
exc_info
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
context)
File "/home/petarp/.virtualenvs/webscraping/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) column "url" of relation "reals" does not exist
LINE 1: INSERT INTO reals (title, url) VALUES ('Laravel 5 Confused w...
^
[SQL: 'INSERT INTO reals (title, url) VALUES (%(title)s, %(url)s) RETURNING reals.id'] [parameters: {'url': u'/questions/30722718/laravel-5-confused-when-implements-shoudlqueue', 'title': u'Laravel 5 Confused when implements ShoudlQueue'}]
Models.py:
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL
import settings
DeclarativeBase = declarative_base()
def db_connect():
""" Performs database connections using database settings from settings.py
Returns sqlalchemy engine instance
"""
return create_engine(URL(**settings.DATABASE))
def create_reals_table(engine):
""""""
DeclarativeBase.metadata.create_all(engine)
class Reals(DeclarativeBase):
"""SQLAlchemy Reals Model"""
__tablename__ = 'reals'
id = Column(Integer, primary_key=True)
title = Column('title', String)
url = Column('url', String, nullable=True)
Pipeline.py:
from sqlalchemy.orm import sessionmaker
from models import Reals, db_connect, create_reals_table
class StackPipeline(object):
""" Stack Exchange pipeline for storing scraped items in the database """
def __init__(self):
""" Initialize database connection and sessionmaker """
engine = db_connect()
create_reals_table(engine)
self.Session = sessionmaker(bind=engine)
def process_item(self, item, spider):
"""Save reals in database.
This method is called for every item pipeline componenet."""
session = self.Session()
real = Reals(**item)
try:
session.add(real)
session.commit()
except:
session.rollback()
raise
finally:
session.close()
return item
实数表的Shema:
realpython=# select * from reals limit 5;
id | title | link
----+-------+------
(0 rows)
有人能帮我理解这里发生了什么,并对其进行解码吗?错误消息实际上是不言自明的-您只需查看最后几行:
sqlalchemy.exc.ProgrammingError:(psycopg2.ProgrammingError)关系“reals”的“url”列不存在
因此,您需要更改SQL以插入到名为
link
的列中,或者需要使用重命名表中的列。我找到了解决方案
问题在于我的Items.py
中的url
,link
定义中,我这样定义了它,在我的模型中,我创建了一个带有link
的模式表,所以我只需将url
替换为link
,数据就成功加载到postgresql中
from scrapy import Item, Field
class StackItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = Field()
url = Field()
New Items.py:
from scrapy import Item, Field
class StackItem(Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = Field()
link = Field()
预期结果:
id | title | link
----+------------------------------------------------------------------------+----------------------------------------------------------------------------------------
1 | pointcut execution for specific class constructor | /questions/30723494/pointcut-execution-for-specific-class-constructor
2 | PWX-00001 Error opening repository “dtlmsg.txt”. RCs = 268/150/2 | /questions/30723493/pwx-00001-error-opening-repository-dtlmsg-txt-rcs-268-150-2
3 | Can anyone share a sample c++ program, that reads ASCII stl type file? | /questions/30723491/can-anyone-share-a-sample-c-program-that-reads-ascii-stl-type-file
4 | Where should I do the core logic code in express js? | /questions/30723487/where-should-i-do-the-core-logic-code-in-express-js
5 | configuring rails application to make ui router work | /questions/30723485/configuring-rails-application-to-make-ui-router-work
(5 rows)
你能检查一下你的
reals
表是否确实包含了所有必需的列吗?提供reals
的架构,看起来里面没有列url是的,里面没有列url,我应该删除并重新创建表吗?不,只需将url输入到链接列,是的,我需要将架构url列更改为链接。
id | title | link
----+------------------------------------------------------------------------+----------------------------------------------------------------------------------------
1 | pointcut execution for specific class constructor | /questions/30723494/pointcut-execution-for-specific-class-constructor
2 | PWX-00001 Error opening repository “dtlmsg.txt”. RCs = 268/150/2 | /questions/30723493/pwx-00001-error-opening-repository-dtlmsg-txt-rcs-268-150-2
3 | Can anyone share a sample c++ program, that reads ASCII stl type file? | /questions/30723491/can-anyone-share-a-sample-c-program-that-reads-ascii-stl-type-file
4 | Where should I do the core logic code in express js? | /questions/30723487/where-should-i-do-the-core-logic-code-in-express-js
5 | configuring rails application to make ui router work | /questions/30723485/configuring-rails-application-to-make-ui-router-work
(5 rows)