Python: scraping the main text content of an HTML tag without the <span> inside it

python web-scraping web-crawler

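I am building a Python web scraper that goes through an eBay search results page (in this case for "gaming laptops") and grabs the title of each item for sale. I use BeautifulSoup to first get the h1 tag that holds each title, then print it as text: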

    for item_name in soup.findAll('h1', {'class': 'it-ttl'}):
        print(item_name.text)

However, inside every h1 tag with the class "it-ttl" there is also a span tag containing some text:

<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>


My current program prints the contents of the span tag along with the item title:

    Details about
    Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…

Can someone tell me how to scrape the item title while ignoring the span tag that contains the "Details about" text? Thanks!

Just remove the offending <span> from each h1 before reading its text.
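The answer's own code did not survive on this page, so the following is only a minimal sketch of what "removing the span" could look like in BeautifulSoup, not the original answerer's snippet. It assumes the span always carries the class `g-hdn` as in the sample markup, and it parses the sample HTML from the question instead of a live eBay page. The list name `titles` is introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not a live eBay response).
html = '''
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
'''

soup = BeautifulSoup(html, "html.parser")

titles = []
for item_name in soup.find_all("h1", {"class": "it-ttl"}):
    # Assumption: the unwanted text lives in <span class="g-hdn">.
    span = item_name.find("span", {"class": "g-hdn"})
    if span is not None:
        # decompose() removes the span (and its text) from the parse tree.
        span.decompose()
    # Whatever text is left in the h1 is the title; strip the whitespace.
    titles.append(item_name.get_text(strip=True))

print(titles)
```

Note that `decompose()` modifies the parsed tree in place, so the span is gone for any later code that uses the same `soup` object.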

Another solution, using the simplified_scrapy library:

from simplified_scrapy import SimplifiedDoc,req,utils
html = '''
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
'''
doc = SimplifiedDoc(html)
item_names = doc.selects('h1.it-ttl').span.nextText()

print(item_names)


['Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…']
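For comparison, the same "take the text that follows the span" idea can probably be expressed in BeautifulSoup as well, which avoids adding a second library. This is a sketch under assumptions: the title is taken to be a plain text node placed directly after `<span class="g-hdn">`, as in the sample markup, and the fallback branch is an illustrative choice for h1 tags that have no such span.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (not a live eBay response).
html = '''
<h1 class="it-ttl" itemprop="name" id="itemTitle">
 <span class="g-hdn">Details about  &nbsp;</span>
 Acer - Nitro 5 15.6" Gaming Laptop - Intel Core i5 - 8GB Memory - NVIDIA GeFo…
</h1>
'''

soup = BeautifulSoup(html, "html.parser")

item_names = []
for h1 in soup.find_all("h1", class_="it-ttl"):
    span = h1.find("span", class_="g-hdn")
    if span is not None and span.next_sibling is not None:
        # Assumption: the node right after the span is the title text.
        item_names.append(str(span.next_sibling).strip())
    else:
        # No span found: fall back to the full text of the h1.
        item_names.append(h1.get_text(strip=True))

print(item_names)
```

Unlike removing the span, this leaves the parsed tree untouched, which can matter if the same `soup` object is reused elsewhere.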