Python 3.x: filtering yielded items in Scrapy
I am trying to scrape some items, as shown below:
def parse(self, response):
    item = GameItem()
    item['game_commentary'] = response.css('tr td:nth-child(2)[style*=vertical-align]::text').extract()
    item['game_movement'] = response.xpath("//tr/td[1][contains(@style,'vertical-align: top')]/text()").extract()
    yield item
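My problem is that I don't want to yield every value extracted by the current response.xpath or response.css selectors. Is there a way to apply a regex, or some other method, to filter out the unwanted values before they are assigned to item['game_commentary'] and item['game_movement']?
I would look into item loaders to accomplish this.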
You will have to rewrite your parse method as follows:
def parse(self, response):
    loader = GameItemLoader(item=GameItem(), response=response)
    loader.add_css('game_commentary', 'tr td:nth-child(2)[style*=vertical-align]::text')
    loader.add_xpath('game_movement', "//tr/td[1][contains(@style,'vertical-align: top')]/text()")
    item = loader.load_item()
    yield item
Your items.py will look like this:
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
# Newer Scrapy releases expose the processors from itemloaders.processors instead.
from scrapy.loader.processors import TakeFirst


class GameItemLoader(ItemLoader):
    # Default input and output processors are executed for every field loaded,
    # unless a field-specific input or output processor is defined.
    default_output_processor = TakeFirst()

    # Field-specific processors go here. The '...' values are placeholders
    # and must be replaced with callables before the loader is used.
    game_commentary_in = '...'
    game_commentary_out = '...'


class GameItem(Item):
    game_commentary = Field()
    game_movement = Field()
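As a rough illustration of what could replace the '...' placeholders above, here is a minimal sketch of a regex-based filter written as a plain callable. The pattern, the sample values and the names make_regex_filter and keep_moves are assumptions made up for this example, not part of the original answer. The sketch relies only on the fact that an input processor is called with the values extracted for a field and returns the values to keep.

```python
import re

# Hypothetical pattern: keep only strings that start with a move number, e.g. "12."
MOVE_PATTERN = re.compile(r"^\d+\.")


def make_regex_filter(pattern):
    """Return a callable that strips each value and keeps those matching `pattern`."""
    def _filter(values):
        stripped = (value.strip() for value in values)
        return [value for value in stripped if pattern.search(value)]
    return _filter


keep_moves = make_regex_filter(MOVE_PATTERN)

# Made-up sample of what the XPath extraction might return.
sample = ["1. e4", "\n  ", "advert", " 2. Nf3 "]
print(keep_moves(sample))  # ['1. e4', '2. Nf3']
```

Under these assumptions, setting game_movement_in = keep_moves on GameItemLoader should drop the non-matching strings before they reach the item. Note that with TakeFirst() as the default output processor only the first surviving value is stored, so a field that should stay a list would need its own output processor. For simple cases, add_xpath and add_css also accept a re keyword argument that applies a regular expression to the extracted values.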
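Can't you filter out the unwanted values with XPath itself? XPath 2.0 supports regular expressions if you need them.
I don't know.
Thanks, interesting solution! As a beginner I didn't know Scrapy had so many features. Thank you, Wim!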