Python: getting the sentence length between tags with BeautifulSoup4
I am trying to collect some statistics from a website. What I am trying to do is extract a word and count the number of adjacent words found within the same tag.
Input:
<div class="col-xs-12">
<p class="w50">Operating Temperature (Min.)[°C]</p>
<p class="w50 upperC">-40</p>
</div>
This is what I ended up with, but it extracts the whole text at once:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
# [print(tag.name) for tag in soup.find_all()]
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the <br> tag but keep its children
html = soup.find_all(recursive=False)
for tag in html:
    print(tag.get_text())
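The whole text comes out at once because get_text() on a top-level tag concatenates every descendant string. A minimal sketch on the sample markup from the question (using the built-in html.parser so no lxml install is needed) shows the difference between calling it on the enclosing <div> and on each leaf <p>:

```python
from bs4 import BeautifulSoup

html = ('<div class="col-xs-12">'
        '<p class="w50">Operating Temperature (Min.)[°C]</p>'
        '<p class="w50 upperC">-40</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
# get_text() on the outer tag merges all descendant text into one string
print(div.get_text())  # Operating Temperature (Min.)[°C]-40

# iterating the leaf <p> tags keeps each sentence separate
for p in div.find_all("p"):
    print(p.get_text())
```

Iterating the leaf tags (tags whose children are only strings) is what keeps each sentence, and therefore its word count, separate.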
I tried with recursive=True, but then the results were repeated many times.

This may not be exactly the output you are after, but at least it should give you a hint. I slightly modified your code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the <br> tag but keep its children
html = soup.find_all(recursive=False)
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)
for tt in textlist:
    print(tt)
    for ts in tt.split():
        print("{}, {}".format(ts, len(tt.split()) - 1))
    print("-----------------------------")
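As a side note, the manual \r/\t/\n cleanup above can be replaced with BeautifulSoup's stripped_strings generator, which yields each text node with surrounding whitespace already removed. A minimal sketch of the same word-plus-adjacent-count output, run on the sample markup from the question rather than the live page:

```python
from bs4 import BeautifulSoup

html = ('<div class="col-xs-12">'
        '<p class="w50">Operating Temperature (Min.)[°C]</p>'
        '<p class="w50 upperC">-40</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings yields every text node, whitespace-trimmed,
# so no replace("\r", "") / split("\n") chain is needed
for line in soup.stripped_strings:
    words = line.split()
    for w in words:
        # pair each word with the count of the other words on its line
        print("{}, {}".format(w, len(words) - 1))
```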
Comments:
Can you provide the URL you want to scrape? – Yusufsn
@Yusufsn, I added it to the code snippet. – fadytaher
You said "the number of adjacent words found in the same tag". Which tags are you referring to? – Yusufsn
@Yusufsn, my target is every tag that contains text directly. – fadytaher
@fadytaher Could you mention your expected output?