Python Beautiful Soup: separating a span element from a p element

I need to pull a span element out of its enclosing p element.

Below is a specific example of one of the p elements I am parsing:

<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among 
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly 
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) 
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>) 
   values varied from 40 50% egg infective doses (EID<sub>50</sub>) 
   for the influenza virus-susceptible strain DBA/2<sub>S</sub> 
   (susceptibility indicated by “S”) to more than 10<sup>6</sup> 
   EID<sub>50</sub> for the influenza virus-resistant strains 
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub> 
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1- 
   1">Fig. 1</a>).
</p>

The result is:

H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally
inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus
A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter 
for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50) 
values varied from 40 50% egg infective doses (EID50) for the influenza 
virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more 
than 106 EID50 for the influenza virus-resistant strains BALB/cR and 
BALB/cByR (resistance indicated by “R”) (Fig. 1).

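As you can see in the first and second sentences, no space is created between the text in the span and the text in the rest of the paragraph.

It ends up looking like:

"H5N1 virus pathogenic phenotypes among inbred mouse strains.We experimentally ..."

As you can see, this leaves two separate sentences with no space after the period. That is a big problem because I will be splitting the text into sentences later, most sentence splitters split on a period followed by a space, and most of my other sentences are formed correctly.

Is there any way to use bs4 to isolate the text in the span from the rest of the text, and then join them together with proper spacing?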

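I assume you are using get_text(). There is another thing you can use in bs4 called .strings. It yields all the strings in the document, which you can then join together to get properly formatted text:
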
from bs4 import BeautifulSoup

html_doc = """
<p>
    <span>Some Text.</span>
    Some text and probably other stuff.
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(" ".join(soup.strings))
print(" ".join(soup.stripped_strings))

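Also, in your example I see a lot of whitespace that is only there for formatting. You can get rid of it by using stripped_strings instead of strings.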
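To make the difference concrete, here is a minimal sketch on a shortened, made-up paragraph (not the one from the question). It also shows a side effect worth knowing about: joining with a space puts a space around inline tags such as `<sub>` as well, which may or may not be acceptable for your data.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the paragraph in the question (illustrative only).
html_doc = """
<p>
    <span>Heading sentence.
    </span>
    Body text with MLD<sub>50</sub> values.
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# .strings keeps whitespace-only strings and the indentation of the markup.
raw = " ".join(soup.strings)

# .stripped_strings skips whitespace-only strings and strips each remaining one.
clean = " ".join(soup.stripped_strings)

print(repr(raw))
print(repr(clean))
```

Only the ends of each string are stripped, so line breaks inside a single string are kept. For the paragraph in the question you would probably still need to collapse the remaining whitespace afterwards.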
Try:
import re
from bs4 import BeautifulSoup
html = '''
<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among 
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly 
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) 
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>) 
   values varied from 40 50% egg infective doses (EID<sub>50</sub>) 
   for the influenza virus-susceptible strain DBA/2<sub>S</sub> 
   (susceptibility indicated by “S”) to more than 10<sup>6</sup> 
   EID<sub>50</sub> for the influenza virus-resistant strains 
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub> 
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1- 
   1">Fig. 1</a>).
</p>
'''

soup = BeautifulSoup(html, 'lxml')

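# select() returns a list with every <p> in the document
paragraphs = soup.select('p')

for p in paragraphs:
    # join the text nodes with a space, then turn line breaks into spaces
    para = p.get_text(' ').replace('\n', ' ')
    # collapse the runs of spaces left over from the HTML indentation
    para = re.sub(' +', ' ', para)
    print(para.strip())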

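This prints the text with a space after the heading: "H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse..." and so on.

Another solution: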
import re
from bs4 import BeautifulSoup


txt = '''<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
          inbred mouse strains.
   </span>
   We experimentally inoculated 21 mouse strains with the highly
   pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213)
   and monitored the animals for 30 days thereafter for signs of
   morbidity and mortality. The 50% mouse lethal dose (MLD<sub>50</sub>)
   values varied from 40 50% egg infective doses (EID<sub>50</sub>)
   for the influenza virus-susceptible strain DBA/2<sub>S</sub>
   (susceptibility indicated by “S”) to more than 10<sup>6</sup>
   EID<sub>50</sub> for the influenza virus-resistant strains
   BALB/c<sub>R</sub> and BALB/cBy<sub>R</sub>
   (resistance indicated by “R”) (<a class="xref-fig" href="#F1" id="xref-fig-1-
   1">Fig. 1</a>).
</p>'''

soup = BeautifulSoup(txt, 'html.parser')
paragraph = soup.select_one('p')

# add space at the end of each span:
for span in paragraph.select('span'):
    span.append(BeautifulSoup('&nbsp;', 'html.parser'))

# post-process the text:
print(re.sub(r'\s{2,}', ' ', paragraph.text).strip())

Output:

H5N1 virus pathogenic phenotypes among inbred mouse strains. We experimentally inoculated 21 mouse strains with the highly pathogenic H5N1 influenza A virus A/Hong Kong/213/03 (HK213) and monitored the animals for 30 days thereafter for signs of morbidity and mortality. The 50% mouse lethal dose (MLD50) values varied from 40 50% egg infective doses (EID50) for the influenza virus-susceptible strain DBA/2S (susceptibility indicated by “S”) to more than 106 EID50 for the influenza virus-resistant strains BALB/cR and BALB/cByR (resistance indicated by “R”) (Fig. 1).
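The question literally asks about isolating the span text and joining it back with proper spacing. That can be sketched roughly as below. This is only one possible approach, not taken from the answers above. It assumes each paragraph has at most one heading span with the class `inline-l2-heading`. The `normalize` helper is a hypothetical name introduced here, and the HTML is a shortened stand-in for the original paragraph.

```python
import re
from bs4 import BeautifulSoup

# Shortened stand-in for the paragraph in the question (illustrative only).
html = """<p id="p-9">
   <span class="inline-l2-heading">H5N1 virus pathogenic phenotypes among
          inbred mouse strains.
   </span>
   The 50% mouse lethal dose (MLD<sub>50</sub>)
   values varied.
</p>"""


def normalize(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


soup = BeautifulSoup(html, "html.parser")
paragraph = soup.select_one("p")

# Detach the heading span from the tree so it can be handled on its own.
span = paragraph.find("span", class_="inline-l2-heading")
heading = normalize(span.extract().get_text()) if span else ""

# What is left in the paragraph is the body. No separator is passed to
# get_text(), so inline tags such as <sub> stay glued to their neighbours.
body = normalize(paragraph.get_text())

result = " ".join(part for part in (heading, body) if part)
print(result)
```

Note that extract() modifies the parsed tree, so work on a copy if you still need the original paragraph. Unlike joining every string with a space, this keeps "MLD50" in one piece, while the heading and the body are still separated by exactly one space.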