How do I get a parent tag's text in Python without its child tags' text?

Tags: python, beautifulsoup

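I know there are other questions about this, but I couldn't follow the explanations there. My code is below; please help.

I want to output a dictionary like this: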

dictionary
{
  '[1.1]':'this is extracted text from a parent tag',
  '[1.2]':'this is child tag text',
  '[1.3]':'this is child tag text',
  '[1.4]':'this is child tag text'
}
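The problem is that for [1.1] I get the text of the parent tag plus its child tags, not just the parent tag's own text.

I tried other solutions but couldn't get them to work. Please suggest a simple approach.

My code is here: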

from bs4 import BeautifulSoup
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
})

URL = "https://patents.google.com/patent/US20120303322A1/en"

content = requests.get(URL, headers=headers)
soup = BeautifulSoup(content.text,'html.parser')

independent_claim_tag = soup.find('div',{'class':'claim'})

claimdictionary = {}

# While loop to get all the independent claims tag works perfectly!!
while(independent_claim_tag):
    base = independent_claim_tag.find("div", {"class":"claim"})['num'].lstrip('0')
    print(independent_claim_tag.prettify())
    print('-------')
    elementTags = independent_claim_tag.find_all('div', {'class':'claim-text'})
    i = 1
    for tag in elementTags:
        key = "[ "+str(base)+"."+str(i)+" ] "
        ######################
        # some code need to be here to get only parent tag text for [1.1]
        value = tag.get_text()
        ######################      
        claimdictionary[key.strip()] = value.strip()
        print("[ "+str(base)+"."+str(i)+" ] "+tag.get_text())
        i = i + 1
    print('-------')
    ##################
    ##################
    print("Number of claim Element: "+str(len(independent_claim_tag.find_all('div',{'class':'claim-text'}))))
    print("---- Next Sibling")
    independent_claim_tag = independent_claim_tag.find_next_sibling('div',{'class':'claim'})


print(claimdictionary)

The HTML markup I need to extract from:

<div class="claim">
 <div class="claim" id="CLM-00001" num="00001">
  <div class="claim-text">
   <b>
    1
   </b>
   . A computer readable storage medium comprising a set of instructions which, if executed by a processor, cause a computer to:
   <div class="claim-text">
    receive data corresponding to a computing node;
   </div>
   <div class="claim-text">
    identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node; and
   </div>
   <div class="claim-text">
    determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
   </div>
  </div>
 </div>
</div>

Number of claim Element: 4
Update: this is my updated output:

<div class="claim">
 <div class="claim" id="CLM-00001" num="00001">
  <div class="claim-text">
   <b>
    1
   </b>
   . A computer readable storage medium comprising a set of instructions which, if executed by a processor, cause a computer to:
   <div class="claim-text">
    receive data corresponding to a computing node;
   </div>
   <div class="claim-text">
    identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node; and
   </div>
   <div class="claim-text">
    determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
   </div>
  </div>
 </div>
</div>

-------
[ 1.1 ]  1. A computer readable storage medium comprising a set of instructions which, if executed by a processor, cause a computer to:
receive data corresponding to a computing node; identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node; and determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
[ 1.2 ] receive data corresponding to a computing node;
[ 1.3 ] identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node; and
[ 1.4 ] determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
-------
Number of claim Element: 4
---- Next Sibling
<div class="claim">
 <div class="claim" id="CLM-00008" num="00008">
  <div class="claim-text">
   <b>
    8
   </b>
   . A system comprising:
   <div class="claim-text">
    a processor; and
   </div>
   <div class="claim-text">
    a computer readable storage medium including a set of instructions which, if executed by the processor, cause the system to,
    <div class="claim-text">
     receive data corresponding to a computing node,
    </div>
    <div class="claim-text">
     identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node, and
    </div>
    <div class="claim-text">
     determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
    </div>
   </div>
  </div>
 </div>
</div>

-------
[ 8.1 ]  8. A system comprising:
a processor; and a computer readable storage medium including a set of instructions which, if executed by the processor, cause the system to,
receive data corresponding to a computing node,
identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node, and
determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.

[ 8.2 ] a processor; and
[ 8.3 ] a computer readable storage medium including a set of instructions which, if executed by the processor, cause the system to,
receive data corresponding to a computing node,
identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node, and
determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.

[ 8.4 ] receive data corresponding to a computing node,
[ 8.5 ] identify a processor usage, a memory usage and an input/output usage based at least in part on the data corresponding to the computing node, and
[ 8.6 ] determine a compute usage value for the computing node based at least in part on the processor usage, the memory usage and the input/output usage.
-------
Number of claim Element: 6
---- Next Sibling
<div class="claim">
 <div class="claim" id="CLM-00015" num="00015">
  <div class="claim-text">
   <b>
    15
   </b>
   . A computer readable storage medium comprising a set of instructions which, if executed by a processor, cause a computer to:
   <div class="claim-text">
    collect data corresponding to a computing node, wherein the data is to be associated with a processor usage, a memory usage and an input/output usage; and
   </div>
   <div class="claim-text">
    send the data to a compute usage calculation node.
   </div>
  </div>
 </div>
</div>

-------
[ 15.1 ]  15. A computer readable storage medium comprising a set of instructions which, if executed by a processor, cause a computer to:
collect data corresponding to a computing node, wherein the data is to be associated with a processor usage, a memory usage and an input/output usage; and send the data to a compute usage calculation node.
[ 15.2 ] collect data corresponding to a computing node, wherein the data is to be associated with a processor usage, a memory usage and an input/output usage; and
[ 15.3 ] send the data to a compute usage calculation node.
-------
Number of claim Element: 3
---- Next Sibling

You can extract() the child elements from the parent tag before adding its text to the dict:

from bs4 import BeautifulSoup
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
})

URL = "https://patents.google.com/patent/US20120303322A1/en"

content = requests.get(URL, headers=headers)
soup = BeautifulSoup(content.text,'html.parser')

independent_claim_tag = soup.find('div',{'class':'claim'})

claimdictionary = {}

# While loop to get all the independent claims tag works perfectly!!
while(independent_claim_tag):
    base = independent_claim_tag.find("div", {"class":"claim"})['num'].lstrip('0')
    print(independent_claim_tag.prettify())
    print('-------')
    elementTags = independent_claim_tag.find_all('div', {'class':'claim-text'})
    i = 1
    for tag in elementTags:
        key = "[ "+str(base)+"."+str(i)+" ] "
        if i == 1:
            #parent
            for subtag in tag.find_all('div',{'class':'claim-text'}):
                subtag.extract()
            value = tag.get_text()
        else:
            # child
            value = tag.get_text()
        claimdictionary[key.strip()] = value.strip()
        print("[ "+str(base)+"."+str(i)+" ] "+tag.get_text())
        i = i + 1
    print('-------')
    ##################
    # some code need to be here to process parent tag text from the child tag text
    ##################
    print("Number of claim Element: "+str(len(independent_claim_tag.find_all('div',{'class':'claim-text'}))))
    print("---- Next Sibling")
    independent_claim_tag = independent_claim_tag.find_next_sibling('div',{'class':'claim'})


print(claimdictionary)
Here I check the value of i; if i is 1, I remove the child tags from the tag, then call get_text() on it.
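To see the effect in isolation, here is a minimal sketch on a small inline HTML string (a shortened, made-up stand-in for the claim markup, not the real page). One thing to keep in mind: extract() modifies the parsed tree in place, so in the loop above the final "Number of claim Element" count would no longer include the removed children. Working on a detached copy of the tag (bs4 tags support copy.copy) is one possible way around that; parent_copy and parent_only are just illustrative names.

```python
import copy
from bs4 import BeautifulSoup

# Shortened, made-up stand-in for one claim from the page.
html = """
<div class="claim-text"><b>1</b>. A storage medium to:
  <div class="claim-text">receive data;</div>
  <div class="claim-text">identify usage.</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", {"class": "claim-text"})

# Work on a detached copy so the original tree keeps its children.
parent_copy = copy.copy(parent)
for subtag in parent_copy.find_all("div", {"class": "claim-text"}):
    subtag.extract()

# Collapse the leftover whitespace and newlines into single spaces.
parent_only = " ".join(parent_copy.get_text().split())
print(parent_only)

# The original tag still contains both nested claim-text divs.
print(len(parent.find_all("div", {"class": "claim-text"})))
```

If the tree is not needed afterwards, calling extract() directly on the original tag, as in the answer, is simpler.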

Edit:

You can remove the else branch and do the following instead:

if i == 1:
    #parent
    for subtag in tag.find_all('div',{'class':'claim-text'}):                               
        subtag.extract()
value = tag.get_text()
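The i == 1 check only covers the first element of each claim. In the output posted in the question, [ 8.3 ] also has nested claim-text children, so its value would still include their text. A possible alternative, sketched here under the assumption that nested elements are always div.claim-text, collects each tag's own text without modifying the tree. own_text is a hypothetical helper name, and the HTML is a shortened stand-in for claim 8.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def own_text(tag):
    """Text of `tag`, skipping nested div.claim-text children (hypothetical helper)."""
    parts = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag) and "claim-text" not in child.get("class", []):
            # Keep inline markup such as <b>8</b>.
            parts.append(child.get_text())
    # Collapse whitespace and newlines into single spaces.
    return " ".join("".join(parts).split())


# Shortened, made-up stand-in for claim 8 from the page.
html = """
<div class="claim" num="00008">
 <div class="claim-text"><b>8</b>. A system comprising:
  <div class="claim-text">a processor; and</div>
  <div class="claim-text">a storage medium causing the system to,
   <div class="claim-text">receive data.</div>
  </div>
 </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
claim = soup.find("div", {"class": "claim"})
base = claim["num"].lstrip("0")

claims = {}
for i, tag in enumerate(claim.find_all("div", {"class": "claim-text"}), start=1):
    claims[f"[{base}.{i}]"] = own_text(tag)

print(claims)
```

Because nothing is removed from the tree, later find_all calls on the same claim still see every element.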

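Thanks @Maaz for trying. Let me check it and get back to you. The error it gives me is
---- Traceback (most recent call last): File "patentextraction.py", line 38, value = tag.get_extract() TypeError: 'NoneType' object is not callable
Maybe the child objects are not found in the first iteration, but when i==1

@RinkuYadav I did not get this error when I tried it. What is get_extract? It should be get_text(), the BeautifulSoup function that returns a tag's text.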
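The error in the comment is consistent with how BeautifulSoup resolves unknown attribute names on a tag: as far as I know, tag.some_name is treated as shorthand for tag.find('some_name'), which returns None when no such child tag exists. Calling that None then raises the TypeError. A small sketch with made-up HTML (lookup and error_message are illustrative names):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="claim-text">receive data;</div>', "html.parser")
tag = soup.div

# Not a real method: bs4 looks for a child tag named <get_extract> and finds none.
lookup = tag.get_extract
print(lookup)

try:
    tag.get_extract()
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# The intended call.
text = tag.get_text()
print(text)
```

So the traceback most likely points to a typo in the method name rather than to missing child tags.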