Python: getting the sentence length between tags with BeautifulSoup4
I am trying to collect some statistics from a website. What I am trying to do is extract a word and count the number of adjacent words found within the same tag.
Input:
<div class="col-xs-12">
<p class="w50">Operating Temperature (Min.)[°C]</p>
<p class="w50 upperC">-40</p>
</div>
This is what I ended up with, but it extracts the whole text at once:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
# [print(tag.name) for tag in soup.find_all()]
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the <br> tag but keep its children
html = soup.find_all(recursive=False)
for tag in html:
    print(tag.get_text())
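The whole text comes out at once because get_text() on a top-level tag concatenates every descendant string. A minimal sketch on the sample markup from the question (using the built-in html.parser so no lxml install is needed) shows the difference between calling it on the enclosing <div> and on each leaf <p>:

```python
from bs4 import BeautifulSoup

html = ('<div class="col-xs-12">'
        '<p class="w50">Operating Temperature (Min.)[°C]</p>'
        '<p class="w50 upperC">-40</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")
# get_text() on the outer tag merges all descendant text into one string
print(div.get_text())  # Operating Temperature (Min.)[°C]-40

# iterating the leaf <p> tags keeps each sentence separate
for p in div.find_all("p"):
    print(p.get_text())
```

Iterating the leaf tags (tags whose children are only strings) is what keeps each sentence, and therefore its word count, separate.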
I tried with recursive=True, but then the results were repeated many times.

This may not be exactly the output you are after, but at least it should give you a hint. I slightly modified your code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()
soup = BeautifulSoup(page, features='lxml')
for script in soup(["script", "style"]):
    script.decompose()  # rip it out
invalid_tags = ['br']
for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the <br> tag but keep its children
html = soup.find_all(recursive=False)
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)
for tt in textlist:
    print(tt)
    for ts in tt.split():
        print("{}, {}".format(ts, len(tt.split()) - 1))
    print("-----------------------------")
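As a side note, the manual \r/\t/\n cleanup above can be replaced with BeautifulSoup's stripped_strings generator, which yields each text node with surrounding whitespace already removed. A minimal sketch of the same word-plus-adjacent-count output, run on the sample markup from the question rather than the live page:

```python
from bs4 import BeautifulSoup

html = ('<div class="col-xs-12">'
        '<p class="w50">Operating Temperature (Min.)[°C]</p>'
        '<p class="w50 upperC">-40</p>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

# .stripped_strings yields every text node, whitespace-trimmed,
# so no replace("\r", "") / split("\n") chain is needed
for line in soup.stripped_strings:
    words = line.split()
    for w in words:
        # pair each word with the count of the other words on its line
        print("{}, {}".format(w, len(words) - 1))
```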
Comments:
Can you provide the URL you want to scrape? – Yusufsn
@Yusufsn, I added it to the code snippet. – fadytaher
You said "the number of adjacent words found in the same tag". Which tags are you referring to? – Yusufsn
@Yusufsn, my target is every tag that contains text directly. – fadytaher
@fadytaher Could you mention your expected output?