Python: Is there a way to find a class name and get the full text of the parent tag? (Python, BeautifulSoup, HTML parsing)


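I have many HTML files and need to extract the complete title from each one. The title is marked up differently from file to file: class="c6", class="c7".

I tried BeautifulSoup: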

for head_c6 in soup.find_all('span', attrs={'class': 'c6'}):
        print(head_c6.get_text())
for head_c7 in soup.find_all('span', attrs={'class': 'c7'}):
        print(head_c7.get_text())

But the result is:

Q3 2017 American Express Co Earnings Call - Final LENGTH:

Q2 2018 Akamai Technologies Inc Call - Final Earnings

Here is what the different files look like:

File 1

<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>




File 2

<div class="c4">
<p class="c5">
<span class="c6">
      Q2 2018 Akamai Technologies Inc
     </span>
<span class="c7">
      Earnings
     </span>
<span class="c6">
      Call - Final
     </span>
</p>
</div>




File 3

<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>




What I want is the full text of each title:

Q3 2017 American Express Co Earnings Call - Final

Q2 2018 Akamai Technologies Inc Earnings Call - Final

Q4 2018 Facebook Inc Earnings Call - Final

Use the regular expression module re. I used the HTML of the last file here; you can do the same with the remaining files.

from bs4 import BeautifulSoup
import re
data='''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

soup=BeautifulSoup(data,'html.parser')

items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c"))]
stritem=' '.join(items)
print(stritem.replace('\n',''))

You can also use the following:

items=[item.text.strip() for item in soup.find_all('span', class_=re.compile("c6|c7"))]
stritem=' '.join(items)
print(stritem.replace('\n',''))
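One caveat with the patterns above: BeautifulSoup applies a compiled pattern passed as `class_` with `search()`, so `re.compile("c")` matches any class name containing the letter c (c2 and c4 included), and `"c6|c7"` would also match a class such as c60. A minimal sketch of an anchored pattern, using a shortened copy of the File 1 markup; the variable names `loose` and `strict` are illustrative only:

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the second block of File 1.
html = '''<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>'''

soup = BeautifulSoup(html, 'html.parser')

# Unanchored: matches every class that contains "c", so c2 is included.
loose = [s.get_text(strip=True)
         for s in soup.find_all('span', class_=re.compile("c"))]

# Anchored: matches only the exact class names c6 and c7.
strict = [s.get_text(strip=True)
          for s in soup.find_all('span', class_=re.compile(r"^c[67]$"))]

print(loose)   # ['LENGTH:', '11051 words']
print(strict)  # ['LENGTH:']
```

Passing a list, `class_=['c6', 'c7']`, should give the same exact-name matching without a regular expression.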

Or, to get the text of the parent tag, try:

from bs4 import BeautifulSoup
import re
data='''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

soup=BeautifulSoup(data,'html.parser')
childtag=soup.find('span', class_=re.compile("c6|c7"))
parenttag=childtag.parent
print(parenttag.text.replace('\n',''))
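Removing only the newlines leaves the indentation spaces between the words. A possible variation, sketched here on the File 2 markup, lets `get_text()` strip each piece and join the pieces with a single space; the names `first_span` and `title` are illustrative only:

```python
from bs4 import BeautifulSoup

# File 2 from the question.
html = '''<div class="c4">
<p class="c5">
<span class="c6">
      Q2 2018 Akamai Technologies Inc
     </span>
<span class="c7">
      Earnings
     </span>
<span class="c6">
      Call - Final
     </span>
</p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the first span whose class is c6 or c7, then step up to its parent <p>.
first_span = soup.find('span', class_=['c6', 'c7'])

# strip=True trims every text fragment and skips whitespace-only ones;
# the first argument is the separator placed between fragments.
title = first_span.parent.get_text(' ', strip=True)

print(title)  # Q2 2018 Akamai Technologies Inc Earnings Call - Final
```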


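Python's built-in strip() removes all leading and trailing whitespace from a string.

join() returns a string that is the concatenation of the strings in an iterable.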

from bs4 import BeautifulSoup

html1 = ''' <div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p></div>'''

soup = BeautifulSoup(html1,'lxml')
tag =  soup.find('div',{'class':'c4'})
header = ' '.join(("".join((tag.text.strip()).split('\n'))).split())
print(header)
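The whitespace handling in the `header` line can be seen in isolation on a plain string. A small sketch, assuming `raw` stands in for the text of a tag (the string itself is made up for illustration): `split()` with no arguments splits on any run of whitespace, newlines included, so a single `split()` followed by `' '.join()` collapses everything to single spaces.

```python
# Stand-in for tag.text: words separated by newlines and indentation.
raw = "\n      Q4 2018\n     \n      Facebook\n     \n      Inc\n"

# strip() only trims the two ends; the inner newlines and spaces remain.
trimmed = raw.strip()

# split() with no arguments splits on any whitespace run and drops empty
# pieces, so joining the parts with one space normalizes the whole string.
header = ' '.join(raw.split())

print(repr(trimmed))
print(header)  # Q4 2018 Facebook Inc
```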


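Output:

Q4 2018 Facebook Inc Earnings Call - Final

It seems easier and more efficient to select with a CSS "or" list (comma-separated class selectors):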

from bs4 import BeautifulSoup as bs

html = '''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>'''

soup= bs(html,'html.parser')  
result = ' '.join([item.text.strip() for item in soup.select('.c6,.c7')])
print(result)


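Use findAll instead of find.

I used find_all, and the results come out in a different order.

Possible duplicate.

Can you share the output?

@Siddharth Das I edited the question to show the result of find_all.

I have many c6 and c7 class tags in my files. The first solution takes everything; the second works for Facebook and Akamai Technologies, but for American Express it returns "Q3 2017 American Express Co Earnings Call - Final LENGTH: LOAD-DATE: LANGUAGE:". I don't need LENGTH:, LOAD-DATE:, or LANGUAGE:.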
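One way to avoid picking up the LENGTH: label described in the last comment is to read only the paragraph that contains the first c6 span, instead of collecting every c6 and c7 span in the file. This is a sketch under the assumption that the title always starts with a c6 span and sits in a single `<p>`, which holds for the three sample files but is not confirmed for the rest; `extract_title` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

def extract_title(html):
    """Return the text of the <p> holding the first c6 span, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    first = soup.find('span', class_='c6')  # assumption: title starts with c6
    if first is None:
        return None
    # Join the stripped text fragments of the parent with single spaces.
    return first.parent.get_text(' ', strip=True)

# File 1: the LENGTH: label lives in a separate <div>/<p>.
file1 = '''<div class="c4">
<p class="c5">
<span class="c6">
      Q3 2017 American Express Co Earnings Call - Final
     </span>
</p>
</div>
<div class="c4">
<p class="c5">
<span class="c7">
      LENGTH:
     </span>
<span class="c2">
      11051 words
     </span>
</p>
</div>'''

# File 3: the title is spread over alternating c6 and c7 spans.
file3 = '''<div class="c4">
    <p class="c5">
     <span class="c6">
      Q4 2018
     </span>
     <span class="c7">
      Facebook
     </span>
     <span class="c6">
      Inc
     </span>
     <span class="c7">
      Earnings
     </span>
     <span class="c6">
      Call - Final
     </span>
    </p>
</div>'''

title1 = extract_title(file1)
title3 = extract_title(file3)
print(title1)  # Q3 2017 American Express Co Earnings Call - Final
print(title3)  # Q4 2018 Facebook Inc Earnings Call - Final
```

If a file's title started with a c7 span instead, the lookup would need `class_=['c6', 'c7']`, at the risk of landing on a label such as LENGTH: first.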
