Web scraping at a <br> tag
HTML:
<span class="number"> - Sep 15, 1991<br><strong>Some Number: </strong>123, 123, 145</span>
samples = response.css('ul li.somthing')
for sample in samples:
    loader = ItemLoader(item=CatelogItem(), selector=sample)
    loader.add_css('some', 'span.number::text')
    yield loader.load_item()
Item.py
some = Field(
    input_processor=MapCompose(str.strip),
    output_processor=Join()
)
Result:
- Sep 15, 1991
Expected:
- Sep 15, 1991 Some Number: 123, 123, 145
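Why does this happen? How can I load the full value in the ItemLoader?

You need all of the text inside the span, including the text of its nested elements, not only the span's own text: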
loader.add_css('some', 'span.number *::text')
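To make the difference concrete, here is a minimal standard-library sketch, not Scrapy's own implementation. It walks the HTML from the question, records every text node inside span.number together with its nesting depth, then strips and joins the pieces the way MapCompose(str.strip) and Join() (default separator: a single space) are used in the item definition. The class name SpanTextCollector and its VOID set are hypothetical helpers introduced only for this illustration.

```python
from html.parser import HTMLParser

HTML = ('<span class="number"> - Sep 15, 1991<br>'
        '<strong>Some Number: </strong>123, 123, 145</span>')


class SpanTextCollector(HTMLParser):
    """Hypothetical helper: collect text nodes inside <span class="number">."""

    # Void elements have no closing tag, so they must not change the depth.
    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # 0 = outside the span, 1 = directly inside it
        self.nodes = []   # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        if self.depth > 0:
            self.depth += 1
        elif tag == "span" and "number" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.nodes.append((self.depth, data))


collector = SpanTextCollector()
collector.feed(HTML)
collector.close()

# Every text node below the span, at any depth.
all_text = [text for _, text in collector.nodes]
# Text that lives only inside nested elements such as <strong>.
nested_only = [text for depth, text in collector.nodes if depth > 1]
# Strip each piece, then join with a single space.
full_value = " ".join(text.strip() for text in all_text)

print(all_text)
print(nested_only)
print(full_value)  # - Sep 15, 1991 Some Number: 123, 123, 145
```

The label "Some Number: " sits inside the nested strong element, so a selector has to reach descendant text nodes to pick it up. As far as I know, that is what the 'span.number *::text' form does in Scrapy's selectors.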
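Do you mean loader.add_css('some', 'span.number::innerHtml')? That results in: the pseudo-element ::innerHtml is unknown.
Thanks, this works like a charm. Correction: loader.add_css('some', 'span.number *::text')
I just wanted to note that down; I will upvote and accept the answer.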