Python: Django relationship with Scrapy, how do I save items?
python, django, scrapy, scrapy-spider, scrapy-pipeline
I just need to understand how to detect, inside the spider, whether Scrapy has already saved an item. I fetch items from a site and then fetch the comments on each item, so I have to save the item first and then save its comments. But when I write code after the yield, it gives me this error:
save() prohibited to prevent data loss due to unsaved related object ''.
Here is my code:
def parseProductComments(self, response):
    name = response.css('h1.product-name::text').extract_first()
    price = response.css('span[id=offering-price] > span::text').extract_first()
    node = response.xpath("//script[contains(text(),'var utagData = ')]/text()")
    data = node.re('= (\{.+\})')[0] #data = xpath.re(" = (\{.+\})")
    data = json.loads(data)
    barcode = data['product_barcode']
    objectImages = []
    for imageThumDiv in response.css('div[id=productThumbnailsCarousel]'):
        images = imageThumDiv.xpath('img/@data-src').extract()
        for image in images:
            imageQuality = image.replace('/80/', '/500/')
            objectImages.append(imageQuality)
    company = Company.objects.get(pk=3)
    comments = []
    item = ProductItem(name=name, price=price, barcode=barcode, file_urls=objectImages, product_url=response.url,product_company=company, comments = comments)
    yield item
    print item["pk"]
    for commentUl in response.css('ul.chevron-list-container'):
        url = commentUl.css('span.link-more-results::attr(href)').extract_first()
        if url is not None:
            for commentLi in commentUl.css('li.review-item'):
                comment = commentLi.css('p::text').extract_first()
                commentItem = CommentItem(comment=comment, product=item.instance)
                yield commentItem
        else:
            yield scrapy.Request(response.urljoin(url), callback=self.parseCommentsPages, meta={'item': item.instance})
Here is my pipeline:
def comment_to_model(item):
    model_class = getattr(item, 'Comment')
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

def get_comment_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product=model.product, comment=model.comment)
    except model_class.DoesNotExist:
        created = True
        obj = model # DjangoItem created a model for us.
        obj.save()
    return (obj, created)

def get_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product_company=model.product_company, barcode=model.barcode)
    except model_class.DoesNotExist:
        created = True
        obj = model # DjangoItem created a model for us.
        obj.save()
    return (obj, created)

def update_model(destination, source, commit=True):
    pk = destination.pk
    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)
    setattr(destination, 'pk', pk)
    if commit:
        destination.save()
    return destination
class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']
            item_model = item.instance
            model, created = get_or_create(item_model)
            #update_model(model, item_model)
            if created:
                for image in item['files']:
                    imageItem = ProductImageItem(image=image['path'], product=item.instance)
                    imageItem.save()
                # for comment in item['comments']:
                #     commentItem = CommentItem(comment=comment, product= item.instance)
                #     commentItem.save()
            return item
        if isinstance(item, CommentItem):
            comment_to_model = item.instance
            model, created = get_comment_or_create(comment_to_model)
            if created:
                print model
            else:
                print created
            return item
get_or_create
A large part of your code seems to work around an apparent shortcoming of get_or_create:
# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.
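Fortunately, this apparent shortcoming is easy to overcome, thanks to the defaults parameter. From the Django documentation:
Any keyword arguments passed to get_or_create(), except an optional
one called defaults, will be used in a get() call. If an object is
found, get_or_create() returns a tuple of that object and False. If
multiple objects are found, get_or_create() raises
MultipleObjectsReturned. If an object is not found, get_or_create()
will instantiate and save a new object, returning a tuple of the new
object and True.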
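The contract quoted above can be illustrated without a database. The following is a minimal sketch, not Django itself: the function `get_or_create` here is a hypothetical in-memory stand-in that operates on a plain list of dicts, and the field names `barcode` and `price` are only example data. It shows that the lookup keywords decide whether a row matches, while `defaults` is used only when a new row is created.

```python
class MultipleObjectsReturned(Exception):
    """Raised when a lookup matches more than one stored row."""


def get_or_create(rows, defaults=None, **lookup):
    """In-memory stand-in that mimics the documented get_or_create contract."""
    # Only the lookup keywords take part in matching; defaults are ignored here.
    matches = [r for r in rows if all(r.get(k) == v for k, v in lookup.items())]
    if len(matches) > 1:
        raise MultipleObjectsReturned(lookup)
    if matches:
        return matches[0], False
    # Nothing matched: build a new row from the lookup plus the defaults.
    row = dict(lookup, **(defaults or {}))
    rows.append(row)
    return row, True


rows = []
first, created_first = get_or_create(rows, barcode="123", defaults={"price": "10"})
# Same lookup, different defaults: the existing row is returned unchanged.
second, created_second = get_or_create(rows, barcode="123", defaults={"price": "12"})
```

Because the changed price sits in `defaults` rather than in the lookup, the second call finds the existing row instead of creating a duplicate, which is exactly the concern raised in the code comment.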
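update_or_create
Still not convinced that get_or_create is the right tool for the job? Neither am I. There is something even better, update_or_create:
A convenience method for updating an object with the given kwargs,
creating a new one if necessary. The defaults is a dictionary of
(field, value) pairs used to update the object.
But I am not going to go into detail on the use of update_or_create, because the lines in your code that attempt to update the model have been commented out, and you have not clearly stated what you want to update.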
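For reference only, the difference from get_or_create can be sketched in a few lines. This is again a hypothetical in-memory stand-in rather than Django's implementation; it works on a list of dicts and, as a simplifying assumption, updates the first matching row instead of raising when several rows match.

```python
def update_or_create(rows, defaults=None, **lookup):
    """In-memory stand-in that mimics the documented update_or_create contract."""
    defaults = defaults or {}
    for row in rows:
        if all(row.get(k) == v for k, v in lookup.items()):
            # A match was found: apply the defaults as an update.
            row.update(defaults)
            return row, False
    # No match: create a new row from the lookup plus the defaults.
    row = dict(lookup, **defaults)
    rows.append(row)
    return row, True


rows = [{"barcode": "123", "price": "10"}]
updated, created = update_or_create(rows, barcode="123", defaults={"price": "12"})
fresh, fresh_created = update_or_create(rows, barcode="999", defaults={"price": "5"})
```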
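New pipeline
Using the standard API methods, the module containing your pipeline reduces to just the ProductItemPipeline class, which can be modified as follows: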
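class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']
            # get_or_create lives on the model manager, reached through the
            # DjangoItem's django_model attribute. Only the lookup fields are
            # matched; defaults are used when a new row has to be created.
            model, created = ProductItem.django_model.objects.get_or_create(
                product_company=item['product_company'],
                barcode=item['barcode'],
                # Placeholders: fill in the remaining model fields here.
                defaults={'other_field1': value1, 'other_field2': value2})
            if created:
                for image in item['files']:
                    # Attach the image to the saved product, not to the
                    # unsaved item.instance.
                    imageItem = ProductImageItem(image=image['path'], product=model)
                    imageItem.save()
            return item
        if isinstance(item, CommentItem):
            # field1/value1 is a placeholder for the comment lookup fields.
            model, created = CommentItem.django_model.objects.get_or_create(
                field1=value1,
                defaults={})  # other fields go in here
            if created:
                print(model)
            else:
                print(created)
            return item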
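The placeholder entries in defaults would normally be filled from the remaining item fields. A minimal sketch of one way to do that, assuming a hypothetical helper named `split_lookup` and using a plain dict in place of the Scrapy item; the field names mirror the question's ProductItem, but the values are made up.

```python
def split_lookup(item, lookup_fields, skip=()):
    """Split item fields into lookup kwargs and defaults for get_or_create."""
    # Fields that identify the row go into the lookup.
    lookup = {name: item[name] for name in lookup_fields}
    # Everything else becomes a default, except fields that are not model columns.
    defaults = {
        name: value
        for name, value in item.items()
        if name not in lookup and name not in skip
    }
    return lookup, defaults


item = {
    "product_company": 3,
    "barcode": "123",
    "name": "Widget",
    "price": "10",
    "files": [{"path": "a.jpg"}],
}
lookup, defaults = split_lookup(item, ("product_company", "barcode"), skip=("files",))
```

The two results could then be passed on as `get_or_create(defaults=defaults, **lookup)`, so that a change in name or price does not cause a second row to be created for the same barcode.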
Bug in the original code
I believe this is where the bug is:
obj = model_class.objects.get(product=model.product, comment=model.comment)
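Since we are no longer using it, the bug should go away. If you still have problems, please paste the full traceback.
1. Can you tell me what you want to save? 2. Sorry, but your code looks terrible: get_or_create is the same as get_comment_or_create, both duplicate Django's default get_or_create method, and comment_to_model is unreadable. I suggest formatting the code so the logic is easier to follow.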