Python: Extract text between p elements


python html python-3.x web-scraping


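I want to extract the text between p strong elements for a batch of downloaded files. Specifically, I want the text of every <p> that sits between the <p><strong>Executives</strong></p> heading and the <p><strong>Analysts</strong></p> heading. An HTML sample is included below. I know how to load the HTML files, but I don't know how to extract this data with BS4: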

import textwrap
import os
from bs4 import BeautifulSoup

directory ='C:/test/out'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

HTML sample:

</header><div id="a-cont"><div class="p p1"></div><div class="sa-art article-width" id="a-body"><p>Apple, Inc. (NASDAQ:<a href="https://seekingalpha.com/symbol/AAPL" title="Apple Inc.">AAPL</a>)</p>
<p>Q4 2016 Earnings Call</p>
<p>October 25, 2016 5:00 pm ET</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Timothy Donald Cook - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts</strong></p>
<p>Eugene Charles Munster - Piper Jaffray &amp; Co.</p>
<p>Kathryn Lynn Huberty - Morgan Stanley &amp; Co. LLC</p>
<p>Shannon S. Cross - Cross Research LLC</p>
<p>Antonio M. Sacconaghi - Sanford C. Bernstein &amp; Co. LLC</p>
<p>Simona K. Jankowski - Goldman Sachs &amp; Co.</p>
<p>Steven M. Milunovich - UBS Securities LLC</p>
<p>Wamsi Mohan - Bank of America Merrill Lynch</p>
<p>James D. Suva - Citigroup Global Markets, Inc. (Broker)</p>
<p>Rod B. Hall - JPMorgan Securities LLC</p>


IIUC, a very rough solution could be:

from bs4 import BeautifulSoup

s = '''
<div id="a-cont"><div class="p p1"></div><div class="sa-art article-width" id="a-body"><p>Apple, Inc. (NASDAQ:<a href="https://seekingalpha.com/symbol/AAPL" title="Apple Inc.">AAPL</a>)</p>
<p>Q4 2016 Earnings Call</p>
<p>October 25, 2016 5:00 pm ET</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Timothy Donald Cook - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts</strong></p>
<p>Eugene Charles Munster - Piper Jaffray &amp; Co.</p>
<p>Kathryn Lynn Huberty - Morgan Stanley &amp; Co. LLC</p>
<p>Shannon S. Cross - Cross Research LLC</p>
<p>Antonio M. Sacconaghi - Sanford C. Bernstein &amp; Co. LLC</p>
<p>Simona K. Jankowski - Goldman Sachs &amp; Co.</p>
<p>Steven M. Milunovich - UBS Securities LLC</p>
<p>Wamsi Mohan - Bank of America Merrill Lynch</p>
<p>James D. Suva - Citigroup Global Markets, Inc. (Broker)</p>
<p>Rod B. Hall - JPMorgan Securities LLC</p>
'''

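# "lxml" requires the lxml package; "html.parser" ships with Python and also works here.
bsobj = BeautifulSoup(s, "lxml")
res = []

# Start at the first <strong> (the "Executives" heading) and walk the <p> tags that follow it.
for i in bsobj.find('strong').find_all_next('p'):
    # Stop at the paragraph whose text is the "Analysts" heading.
    if i.text == 'Analysts':
        break
    res.append(i.text)

print(res)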
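The loop above relies on "Executives" being the first <strong> in the file and on the end heading being exactly 'Analysts'. As a hedged sketch only (the helper names paragraphs_between and label_of and the shortened sample are made up for illustration, not part of the original answer), the same idea can be written so that both headings are looked up by label and a trailing colon such as "Analysts:" is tolerated:

```python
from bs4 import BeautifulSoup


def paragraphs_between(html, start_label, end_label):
    """Return the text of each <p> after the <strong> heading `start_label`
    and before the <strong> heading `end_label` (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    def label_of(strong_tag):
        # Normalize " Analysts: " to "analysts" so small variations still match.
        return strong_tag.get_text(strip=True).rstrip(":").lower()

    # Find the first <strong> whose label matches the start heading.
    start = None
    for strong_tag in soup.find_all("strong"):
        if label_of(strong_tag) == start_label.lower():
            start = strong_tag
            break
    if start is None:
        return []

    result = []
    for p in start.find_all_next("p"):
        heading = p.find("strong")
        # Stop as soon as a paragraph carries the end heading.
        if heading is not None and label_of(heading) == end_label.lower():
            break
        result.append(p.get_text(strip=True))
    return result


# Shortened sample; note the colon after "Analysts".
sample = """
<p>Q4 2016 Earnings Call</p>
<p><strong>Executives</strong></p>
<p>Nancy Paxton - Apple, Inc.</p>
<p>Luca Maestri - Apple, Inc.</p>
<p><strong>Analysts:</strong></p>
<p>Rod B. Hall - JPMorgan Securities LLC</p>
"""
names = paragraphs_between(sample, "Executives", "Analysts")
print(names)
```

If the start heading is missing, the helper returns an empty list instead of raising, which is a design choice for batch runs rather than something the original answer does.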

After the OP's further explanation, the final code should look like this:

import textwrap
import os
from bs4 import BeautifulSoup

res = {}
directory ='C:/Research syntheses - Meta analysis/Transcripts/test/1/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')

            res[filename] = []
            for i in soup.find('strong').find_all_next('p'):
                if i.text == 'Analysts':
                    break
                else:
                    res[filename].append(i.text)
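Two things can trip up the directory loop in practice: soup.find('strong') returns None for a file without any <strong> tag, and opening files in text mode without an explicit encoding can fail on some downloads. A minimal sketch of one way to guard against both, assuming a hypothetical helper name executives_per_file and a stop_text parameter that are not in the original answer:

```python
from pathlib import Path

from bs4 import BeautifulSoup


def executives_per_file(directory, stop_text="Analysts"):
    """Map each .html file name to the <p> texts that follow its first
    <strong> tag, up to the paragraph equal to `stop_text` (hypothetical helper)."""
    res = {}
    for path in sorted(Path(directory).glob("*.html")):
        # Passing bytes lets BeautifulSoup detect the encoding itself.
        soup = BeautifulSoup(path.read_bytes(), "html.parser")

        first_strong = soup.find("strong")
        if first_strong is None:
            # No heading in this file: record an empty list instead of crashing.
            res[path.name] = []
            continue

        names = []
        for p in first_strong.find_all_next("p"):
            text = p.get_text(strip=True)
            if text == stop_text:
                break
            names.append(text)
        res[path.name] = names
    return res
```

It would be called as executives_per_file('C:/test/out'), or with stop_text='Analysts:' for files where the heading carries a colon.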

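Welcome to SO. Can you share sample data from the file you parse with 'html.parser'?

Thanks @Sampath! I've updated my question with the HTML code.

Does this also work for the files I have already downloaded (as described in my question) and load with the directory loop from my question, i.e. os.listdir over 'C:/test/out' followed by soup = BeautifulSoup(f.read(), 'html.parser')?

As long as the object in bsobj = BeautifulSoup(s, "lxml") has a structure similar to the object in my code, I'd say yes. Try it. If this answer solved your problem, please accept it!

I tried to merge my directory loading with your approach, but it doesn't seem to work (I edited the merge I'm talking about into the question).

In your code, remove bsobj = BeautifulSoup(f.read(), "lxml"), move my for loop inside your initial for loop, and rename bsobj to soup.

I found the problem: the text should be "Analysts:". Can I also print the result by column or by row? I'll mark this question as answered, thanks!!!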
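The last comment asks about printing the result by row or by column. The thread does not answer that, so the following is only a sketch using the standard library; the file names and the contents of res are hypothetical stand-ins for the dictionary built by the answer's loop:

```python
import csv
import io
from itertools import zip_longest

# Hypothetical result, in the same shape as `res` from the answer.
res = {
    "call_1.html": ["Nancy Paxton - Apple, Inc.", "Luca Maestri - Apple, Inc."],
    "call_2.html": ["Timothy Donald Cook - Apple, Inc."],
}

# By row: one (file, name) pair per line.
rows = [(filename, name) for filename, names in res.items() for name in names]
for filename, name in rows:
    print(f"{filename}\t{name}")

# By column: one column per file, shorter columns padded with empty strings.
columns = list(zip_longest(*res.values(), fillvalue=""))
print("\t".join(res.keys()))
for line in columns:
    print("\t".join(line))

# The row layout also maps directly onto CSV; names containing commas get quoted.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["file", "name"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The row layout is usually easier to load into a spreadsheet or pandas later, since every line has the same two fields regardless of how many names each file contains.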
