Python: getting the sentence length between tags in BeautifulSoup4

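I'm trying to collect some statistics from a website. What I want to do is extract a word and count the number of adjacent words found in the same tag.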

Input

<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>

This is what I ended up with, but it extracts the entire text:

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

# [print(tag.name) for tag in soup.find_all()]

for script in soup(["script", "style"]):
    script.decompose()  # rip it out

invalid_tags = ['br']

for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the tag itself but keep its children

# Only the top-level tags of the document
html = soup.find_all(recursive=False)

for tag in html:
    print(tag.get_text())

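I also tried recursive=True, but then the results were repeated many times.

This may not be exactly the result you are after, but at least it gives you a hint. I modified your code slightly: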

import urllib.request

from bs4 import BeautifulSoup

url = 'https://www.rohm.com/products/wireless-communication/wireless-lan-modules/bp3580-product#'
with urllib.request.urlopen(url) as response:
    page = response.read()

soup = BeautifulSoup(page, features='lxml')

for script in soup(["script", "style"]):
    script.decompose()  # rip it out

invalid_tags = ['br']

for tag in invalid_tags:
    for match in soup.find_all(tag):
        match.unwrap()  # drop the tag itself but keep its children

html = soup.find_all(recursive=False)

# Collect every non-empty line of text from the top-level tags
textlist = []
for tag in html:
    text = tag.text.replace("\r", "").replace("\t", "").split("\n")
    for t in text:
        if t != '':
            textlist.append(t)

# For each line, print every word with the number of other words on that line
for tt in textlist:
    print(tt)
    words = tt.split()
    for ts in words:
        print("{}, {}".format(ts, len(words) - 1))
    print("-----------------------------")
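
The counting step in the code above can be looked at in isolation. The following is a minimal sketch that assumes "adjacent words" means every other whitespace-separated word on the same line of text; the helper name neighbor_counts is hypothetical, and the two sample strings are taken from the input in the question.

```python
def neighbor_counts(line):
    """Pair each whitespace-separated word with the number of other words on the line."""
    words = line.split()
    return [(word, len(words) - 1) for word in words]


# Sample lines taken from the <p> tags in the question's input
for line in ["Operating Temperature (Min.)[°C]", "-40"]:
    for word, count in neighbor_counts(line):
        print("{}, {}".format(word, count))
```

Note that str.split() only splits on whitespace, so "(Min.)[°C]" counts as a single word here. If punctuation should separate words, a different tokenizer would be needed.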

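
Can you provide the URL you want to scrape?
@Yusufsn, I added it to the code snippet.
You said "the number of adjacent words found in the same tag". Which tag do you mean?
@Yusufsn, my target is all tags that contain text directly.
@fadytaher: Can you mention your expected output?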
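
One way to target only the tags that contain text directly, as described in the comments, is to look at each tag's own string children instead of calling get_text(), which also returns the text of every descendant and is a likely reason for the repeated results with recursive=True. This is only a sketch under assumptions: it runs on the sample HTML from the question rather than the live page, it uses the built-in html.parser instead of lxml, and the helper names direct_text and words_per_tag are hypothetical.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Sample input copied from the question
HTML = """
<div class="col-xs-12">
   <p class="w50">Operating Temperature (Min.)[°C]</p>
   <p class="w50 upperC">-40</p>
</div>
"""


def direct_text(tag):
    """Return only the text that is an immediate child of tag, whitespace-normalized."""
    parts = [
        str(child)
        for child in tag.children
        if isinstance(child, NavigableString) and not isinstance(child, Comment)
    ]
    return " ".join(" ".join(parts).split())


def words_per_tag(soup):
    """List (tag name, words) for every tag that directly contains text."""
    results = []
    for tag in soup.find_all(True):  # True matches every tag
        words = direct_text(tag).split()
        if words:
            results.append((tag.name, words))
    return results


soup = BeautifulSoup(HTML, "html.parser")
for name, words in words_per_tag(soup):
    for word in words:
        # Each word with the number of other words in the same tag
        print("{}, {}".format(word, len(words) - 1))
```

The outer div is skipped because its only direct strings are whitespace, so each piece of text is reported once, by the tag that owns it.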