Python: Extracting text from an HTML tag with Beautiful Soup


python html web-scraping


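I have some HTML pages that I need to extract data from. I need the title, like the one here: "Caliper Ball". I get the data from the tag in which the title appears: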

item_title = base_page.find_all('h1', class_='itemTitle')

It contains the following markup structure:

> [<h1 class="itemTitle"> <div class="l1">Caliper</div>
>                                 Ball
>                             </h1>]

So I get this ugly output in my collector list:

[u"\nCaliper\r\n                                Ball\r\n                            "]

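How can I get clean output, like "Caliper Ball" here?

This regex will help you get the output (Caliper Ball). The snippet is the one starting with import re, further down the page.

You can use the replace() method to replace \n and \r with nothing or with a space, and then use the strip() method to remove the surrounding whitespace.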
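As a rough sketch of the replace-and-strip suggestion, applied to the exact string from the question: the names `raw`, `cleaned`, and `collapsed` are made up for illustration. Note that replacing the line breaks with nothing still leaves the run of spaces between the two words, so this sketch collapses it with split() and join() as an extra step that the suggestion itself does not mention.

```python
# The "ugly" string from the question (the u"" prefix is not needed in Python 3).
raw = "\nCaliper\r\n                                Ball\r\n                            "

# Step 1: drop the line-break characters, then strip the outer whitespace.
cleaned = raw.replace("\r", "").replace("\n", "").strip()

# Step 2 (an addition, not part of the original suggestion): collapse the
# remaining run of spaces between the words into a single space.
collapsed = " ".join(cleaned.split())

print(collapsed)  # Caliper Ball
```

If only the final string is needed, the split() and join() step on its own is enough, because split() with no arguments already treats \r, \n, and spaces as separators.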
Don't use regex. You are adding too much overhead for something this simple. BeautifulSoup4 already has something called stripped_strings. See my code below.
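from bs4 import BeautifulSoup as bsoup

html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
                               Ball
                           </h1>]"""
# Name the parser explicitly so bs4 does not have to guess one.
soup = bsoup(html, "html.parser")

item = soup.find("h1", class_="itemTitle")
# stripped_strings yields each text fragment with its surrounding whitespace removed.
base = list(item.stripped_strings)
print(" ".join(base))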

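Explanation: stripped_strings basically gets all the text inside the specified tag, with the surrounding whitespace, line breaks, and so on stripped from each piece. It returns a generator, which we can capture with list so that we get a list back. Once it is a list, just use " ".join on it.

Let us know if this helps.

PS: Just to correct myself: it is actually not necessary to call list on the result of stripped_strings, but it is better to show it as above so that it is explicit.

Do you want the output in a list?

@AvinashRaj It doesn't matter; a string is best. The best result I got was with this expression: " ".join(word for word in u.text.split())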
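To illustrate the PS and the last comment, here is a minimal sketch that skips the intermediate list and compares it with the split-and-join approach. It assumes the same markup as in the question and the built-in "html.parser" backend; the variable names `direct`, `via_split`, and `via_get_text` are invented for this example.

```python
from bs4 import BeautifulSoup

html = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
                               Ball
                           </h1>]"""
soup = BeautifulSoup(html, "html.parser")
item = soup.find("h1", class_="itemTitle")

# join() accepts any iterable of strings, so the generator can be passed directly.
direct = " ".join(item.stripped_strings)

# The approach from the comment: take all the text, split on whitespace, rejoin.
via_split = " ".join(item.get_text().split())

# get_text() can also strip each fragment and insert a separator itself.
via_get_text = item.get_text(" ", strip=True)

print(direct)
print(via_split)
print(via_get_text)
```

For this markup all three print the same cleaned title. They can differ when a single text fragment contains internal line breaks or runs of spaces, since only the split-and-join variant collapses whitespace inside a fragment.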
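import re

# Renamed from str so the built-in type is not shadowed.
markup = """[<h1 class="itemTitle"> <div class="l1">Caliper</div>
                                 Ball 
                             </h1>]"""
# Capture the text inside the <div>, then the first word on the following line.
regex = r'.*>([^<]*)</div>\s*\n\s*(\w*).*'
match = re.findall(regex, markup)
new_data = (' '.join(w) for w in match)
print(''.join(new_data))  # => Caliper Ball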


Caliper Ball