Python: extracting the text between two <p>'s
I'm trying to extract the data between the two elements "Executives" and "Analysts", but I don't know how to proceed. My HTML is:
<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>
I'm new to Python, so please bear with me.
The output I would like is shown below.
The title can be found in the following HTML:
<div class="page_header_email_alerts" id="page_header">
<h1>
<span itemprop="headline">Wabash National's (WNC) CEO Richard Giromini on Q4 2014 Results - Earnings Call Transcript</span>
</h1>
<div id="article_info">
<div class="article_info_pos">
<span itemprop="datePublished" content="2015-02-04T21:48:03Z">Feb. 4, 2015 4:48 PM ET</span>
<span id="title_article_comments"></span>
<span class="print_hide"><span class="print_hide"> | </span> <span>About:</span> <span id="about_primary_stocks"><a title="Wabash National Corporation" href="/symbol/WNC" sasource="article_primary_about_trc">Wabash National Corporation (WNC)</a></span></span>
<span class="author_name_for_print">by: SA Transcripts</span>
<span id="second_line_wrapper"></span>
</div>
'''
Wabash National's (WNC) CEO Richard Giromini on Q4 2014 Results - Earnings Call Transcript
Feb. 4, 2015 4:48 PM ET
| About:
by: SA Transcripts
'''
This is not the most efficient way, but you can try:
file = open(File_Path, 'r')  # open the file (be careful with the encoding)
text = file.readlines()      # read all lines of the file
file.close()                 # close the file

Goal = []  # will hold all the lines between Executives and Analysts
for indice, line in enumerate(text):
    if "<p><strong>Executives</strong></p>" in line:
        # Once the "<p><strong>Executives</strong></p>" line is found, append
        # every following line to Goal until a line containing
        # "<p><strong>Analysts</strong></p>" appears.
        i = 1
        while "<p><strong>Analysts</strong></p>" not in text[indice + i]:
            Goal.append(text[indice + i])
            i += 1
        break
print(Goal)
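The same scan can also be expressed with itertools, which avoids the manual index bookkeeping and cannot run past the end of the list. This is a sketch over a trimmed version of the lines above:

```python
from itertools import dropwhile, takewhile

text = [
    "<p>Wabash National Corporation</p>\n",
    "<p><strong>Executives</strong></p>\n",
    "<p>Mike Pettit - Vice President of Finance and Investor Relations</p>\n",
    "<p>Richard Giromini - President and Chief Executive Officer</p>\n",
    "<p><strong>Analysts</strong></p>\n",
]

# Drop everything up to the Executives marker ...
after = dropwhile(lambda l: "<strong>Executives</strong>" not in l, text)
next(after, None)  # ... skip the marker line itself ...
# ... then keep lines until the Analysts marker appears.
goal = list(takewhile(lambda l: "<strong>Analysts</strong>" not in l, after))
print(goal)
```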
There are a few ways to get the data between tags, for example using handle_data in HTMLParser, or you can use the findall function from the re module:
data_in_line = re.findall(r'>(.*?)<',line)
For example, if line is '<p>a test</p>', it will return ['a test'].
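The HTMLParser route mentioned above can be sketched like this (standard library only; the start/stop markers are the "Executives" and "Analysts" headers from the question):

```python
from html.parser import HTMLParser

class BetweenParser(HTMLParser):
    """Collects text appearing between the 'Executives' and 'Analysts' headers."""
    def __init__(self):
        super().__init__()
        self.collecting = False
        self.lines = []

    def handle_data(self, data):
        text = data.strip()
        if text == "Executives":
            self.collecting = True   # start collecting after this header
        elif text == "Analysts":
            self.collecting = False  # stop at the Analysts header
        elif self.collecting and text:
            self.lines.append(text)

parser = BetweenParser()
parser.feed("""<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p><strong>Analysts</strong></p>""")
print(parser.lines)
```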
Does that help you?

Incorporating your code:
import os
from simplified_scrapy.simplified_doc import SimplifiedDoc

directory = 'C:/Research syntheses - Meta analysis/SeekingAlpha'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            page = f.read()
        doc = SimplifiedDoc(page)
        headline = doc.select('div#article_info>span#about_primary_stocks>a>text()')
        div = doc.select('div#article_participants')
        if not div:
            continue
        ps = div.getElements('p', start='<strong>Executives</strong>', end='<strong>Analysts</strong>')
        Executives = [p.text.split('-')[0].strip() for p in ps]
        ps = div.getElements('p', start='<strong>Analysts</strong>')
        Analysts = [p.text.split('-')[0].strip() for p in ps]
        print(headline)
        print(Executives)
        print(Analysts)
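As a side note, the os.listdir / endswith / os.path.join combination in the loop above can be collapsed with pathlib's glob. A small sketch, using a temporary demo directory rather than the real path from the answer:

```python
from pathlib import Path
import tempfile

# Demo directory with one .html and one .txt file (temporary, for illustration).
tmp = Path(tempfile.mkdtemp())
(tmp / "a.html").write_text("<p>hello</p>")
(tmp / "b.txt").write_text("ignored")

# glob('*.html') replaces the listdir + endswith + os.path.join combination.
pages = {f.name: f.read_text() for f in tmp.glob("*.html")}
print(pages)
```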
There are more examples here. @dabinsou has a good solution, but here is a very simplified way that doesn't require any complicated libraries:
import re
from re import search
html = """<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>"""
soup = search( r"(<strong>Executives(.+))<strong>", html, re.DOTALL)
print ( soup.group(1) )
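A slightly tighter pattern for the same idea uses a non-greedy quantifier plus an explicit end marker, so the match stops exactly at the Analysts header instead of relying on greedy backtracking. A sketch on a trimmed version of the snippet:

```python
import re

html = """<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p><strong>Analysts</strong></p>"""

# Non-greedy (.+?) stops at the first occurrence of the Analysts marker.
m = re.search(r"<strong>Executives</strong></p>(.+?)<p><strong>Analysts", html, re.DOTALL)
print(m.group(1).strip())
```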
Comments:
The image you posted is very blurry. Could you post an HTML snippet in your question instead?
Thanks, but how would I incorporate this into my current code, which loops over multiple HTML files?
Thanks, but I get the error "name 'search' is not defined". What am I doing wrong?
@Jose, try: from re import search (it is re.search).
Comments are not for extended discussion; this conversation has been ended.
A complete, self-contained example with the sample HTML:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<div class="page_header_email_alerts" id="page_header">
<h1>
<span itemprop="headline">Wabash National's (WNC) CEO Richard Giromini on Q4 2014 Results - Earnings Call Transcript</span>
</h1>
<div id="article_info">
<div class="article_info_pos">
<span itemprop="datePublished" content="2015-02-04T21:48:03Z">Feb. 4, 2015 4:48 PM ET</span>
<span id="title_article_comments"></span>
<span class="print_hide"><span class="print_hide"> | </span> <span>About:</span> <span id="about_primary_stocks"><a title="Wabash National Corporation" href="/symbol/WNC" sasource="article_primary_about_trc">Wabash National Corporation (WNC)</a></span></span>
<span class="author_name_for_print">by: SA Transcripts</span>
<span id="second_line_wrapper"></span>
</div>
</div>
</div>
<div class="content_part hid" id="article_participants">
<p>Wabash National Corporation (NYSE:<a title="" href="http://seekingalpha.com/symbol/wnc">WNC</a>)</p><p>Q4 2014 <span class="transcript-search-span" style="background-color: yellow;">Earnings</span> Conference <span class="transcript-search-span" style="background-color: rgb(243, 134, 134);">Call</span></p><p>February 04, 2015 10:00 AM ET</p>
<p><strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p><strong>Analysts</strong></p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
</div>
'''
doc = SimplifiedDoc(html)
headline = doc.select('div#article_info>span#about_primary_stocks>a>text()')
div = doc.select('div#article_participants')
ps = div.getElements('p',start='<strong>Executives</strong>',end='<strong>Analysts</strong>')
Executives = [p.text.split('-')[0].strip() for p in ps]
ps = div.getElements('p',start='<strong>Analysts</strong>')
Analysts = [p.text.split('-')[0].strip() for p in ps]
print (headline)
print (Executives)
print (Analysts)
Wabash National Corporation (WNC)
[u'Mike Pettit', u'Richard Giromini', u'Jeffery Taylor']
[u'Jeffery Taylor']
The output of the search above:

<strong>Executives</strong></p>
<p>Mike Pettit - Vice President of Finance and Investor Relations</p>
<p>Richard Giromini - President and Chief Executive Officer</p>
<p>Jeffery Taylor - Senior Vice President and Chief Financial Officer</p>
<p>

To strip the tags, you can feed the match into BeautifulSoup (assuming from bs4 import BeautifulSoup as bs):

print(bs(soup.group(1), "lxml").get_text())

Executives
Mike Pettit - Vice President of Finance and Investor Relations
Richard Giromini - President and Chief Executive Officer
Jeffery Taylor - Senior Vice President and Chief Financial Officer