How to extract text written outside the h4 tag using Scrapy (Python)

python web-scraping scrapy



Budget:
"€650,000
"
(estimated)

Try using following-sibling::text() in your XPath, like this:

response.xpath('//div[contains(@class, "txt-block")]/h4/following-sibling::text()').get()

It returns the required information.

Try using:

data = [d.strip() for d in response.css('.txt-block::text').getall() if d.strip()]

The data you need actually sits in the div tag, so I select on that tag to get it.

You seem to be looking for a real demo. Check out the following implementation:

import requests
from scrapy import Selector

url = "https://www.imdb.com/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=e31d89dd-322d-4646-8962-327b42fe94b1&pf_rd_r=702AB91P12YZ9Z98XH5T&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_tt_1"

res = requests.get(url)
# Build the selector from the response body text
sel = Selector(text=res.text)
budget = ' '.join(sel.css(".txt-block:contains('Budget')::text").extract()).strip()
gross = ' '.join(sel.css(".txt-block:contains('Gross USA')::text").extract()).strip()
cumulative = ' '.join(sel.css(".txt-block:contains('Cumulative Worldwide')::text").extract()).strip()
print(f'budget: {budget}\ngross: {gross}\ncumulative: {cumulative}')

Output at the time of writing:

budget: $25,000,000
gross: $28,341,469
cumulative: $58,500,000

You need to extract the text into an array and take the value at the position you want. Example:

import scrapy

# Sample markup reproducing the block from the question
html_text = """
<div class="txt-block">
    <h4 class="inline">Budget:</h4>650,000
    <span class="attribute">(estimated)</span>
</div>
"""
# Build a selector from the raw HTML text
selector = scrapy.Selector(text=html_text)
print(selector)
# Select every text node inside the div
d = selector.xpath('//div[@class="txt-block"]//text()')
values = d.extract()  # Gives a list of text values
print(values)
# Index 2 holds the value you need (strip the trailing whitespace)
print(values[2].strip())


Scrapy lacks the tag-removal functionality that is available in BeautifulSoup.

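There are multiple spaces and \n characters, so the output is empty for some tags. How can I get the whole data, including the spaces and \n?

Let's strip them, like this:

[i.strip() for i in response.xpath('//div[contains(@class, "txt-block")]/h4/following-sibling::text()').extract() if i.strip()]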

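This doesn't work for me; it pulls all the data from the parent node. The parent node is odd. I added a regexp to avoid the extra whitespace:

response.xpath('//div[@class="txt-block"]/h4/following-sibling::text()').re(r'([\d\,]+)')

As you can see from this expression, we take h4 as the base and get the text that follows it.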