Python: parse part of an HTML file and convert it to CSV
I am new to Python. I am trying to get all the answers given by the executives (listed at the top) of a web page. The page is stored on my hard drive, so there is no URL. My desired end result is:
Column 1: all executives
Column 2: all the answers
The answers should be taken only from the "Question-and-Answer Session" section.

I tried the following:
from bs4 import BeautifulSoup

with open('transcript-86-855.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Tag names are matched in lowercase, so use 'div' rather than 'DIV'.
article_qanda = soup.find('div', id='article_qanda')
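Can anyone help me?

If I understand you correctly, you want to print two columns: one with the name (in this case Dror Ben Asher) and the other with his answers.

For example: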
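import textwrap
from bs4 import BeautifulSoup

with open('page.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)

# Select the first <p> after each <strong> naming the speaker, but only after
# the "Question-and-Answer Session" paragraph. Newer soupsieve releases spell
# this pseudo-class :-soup-contains(); :contains() is the older, deprecated alias.
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    txt = answer.get_text(strip=True)

    # Append the following paragraphs until the next speaker (<strong>) starts.
    s = answer.find_next_sibling()
    while s:
        if s.name == 'strong' or s.find('strong'):
            break
        if s.name == 'p':
            txt += ' ' + s.get_text(strip=True)
        s = s.find_next_sibling()

    # Wrap the text and indent continuation lines under the "Answer" column.
    txt = ('\n' + ' '*31).join(textwrap.wrap(txt))

    print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt))
    print()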
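Since the stated goal is a CSV file rather than console output, the name/answer pairs could be written with the standard `csv` module instead of being printed. The following is only a minimal sketch: the function name `write_answers_csv`, the `sample_rows` data, and the file name `answers.csv` mentioned in the comment are assumptions for illustration, not part of the original answer.

```python
import csv
import io


def write_answers_csv(rows, out_file):
    """Write (name, answer) pairs to an already-open text file as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(['Name', 'Answer'])  # header row
    writer.writerows(rows)               # one row per answer


# Hypothetical sample rows standing in for the scraped results.
sample_rows = [
    ('Dror Ben Asher - CEO', 'Thank you, Scott. Did I answer your question?'),
]

# In-memory demo. For a real file, something like
# open('answers.csv', 'w', newline='', encoding='utf-8') could be passed instead.
buffer = io.StringIO()
write_answers_csv(sample_rows, buffer)
print(buffer.getvalue())
```

The `csv` module quotes fields that contain commas, so long answers stay in a single column. In this variant the `textwrap` step would be skipped, because wrapped lines would end up as line breaks inside the CSV field.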
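Comments:

What do you mean by columns??? There are no such columns in the "article_qanda" node. The child nodes are just p tags.

I want the output to be in columns. Perhaps that was not clear in the question.

Use find_all()

Awesome, but is it possible to do this for all executives as well? And for multiple HTML files (which all have the same HTML layout)?

@nikos To scrape another executive, just replace 'Dror Ben Asher' with the other executive's name. If you want to scrape multiple files, you can iterate over the current directory and open all the .html files.

But is there no option to identify the "Executives" automatically (the executives differ from file to file, but the HTML layout is the same in all files)? And how can I retrieve the "symbolslug" in a separate column? For this page, the result should be RDHL in the third column.

I want all the executives, not just Dror Ben Asher.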
Prints:

Name                           Answer
-----------------------------------------------------------------------------------------------------
Dror Ben Asher - CEO           Thank you, Scott. Its a very good question indeed in January we
                               announced a new amendment and that amendment includes anti-TNF
                               patients some of them not all of them, those who qualify. And we are
                               talking about anti-TNF failures to be clear and only Remicade and
                               Humira. The idea here was to increase very significantly the patients
                               pooled of those potentially eligible for the study thus expediting
                               recruitment. Did I answer your question?

Dror Ben Asher - CEO           Right, this is one of most important tasks; right now the most
                               important item here is the divestment of non-core assets. All other
                               non-core assets, the non-core assets are those that are not within our
                               therapeutic focus of GI and inflammation. And those are specifically
                               RHB-103 RIZAPORT for migraine and RHB-101 which is a cardio drug.
                               RHB-101 is a legacy drug, we have recently announced last month, we
                               announced that we are in discussions for both of these product for
                               out-licensing, which we hope to complete in the first half of 2015. So
                               this is the highest priority, obviously discussion on other product,
                               but Redhill is in the fortunate position that we are able to complete
                               our Phase III studies with our existing results, resources and as time
                               goes by obviously the value of the assets keeps going up. So we are in
                               no rush to out-license everything else and so there is obviously in
                               track.
...and so on.
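
On the follow-up about covering every speaker instead of one hard-coded name, one possible approach is to walk the siblings after the "Question-and-Answer Session" paragraph and start a new row whenever a `<strong>` element appears. This is a rough sketch under loud assumptions: the real page layout is not shown in this thread, so `SAMPLE_HTML` is invented to mirror the structure implied by the selector above (speaker names in `<strong>`, answers in the following `<p>` tags), and the names `extract_answers` and `scrape_directory` are hypothetical.

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Invented sample mirroring the assumed layout; the real file may differ.
SAMPLE_HTML = """
<p>Executives</p>
<p>Dror Ben Asher - CEO</p>
<p>Question-and-Answer Session</p>
<strong>Scott Henry</strong>
<p>First question?</p>
<strong>Dror Ben Asher</strong>
<p>Thank you, Scott.</p>
<p>Did I answer your question?</p>
"""


def extract_answers(html, names=None, marker='Question-and-Answer Session'):
    """Return (speaker, text) pairs found after the Q&A marker paragraph.

    If `names` is given, only speakers in that collection are kept.
    """
    soup = BeautifulSoup(html, 'html.parser')
    start = soup.find(lambda tag: tag.name == 'p' and marker in tag.get_text())
    if start is None:
        return []

    rows, speaker, parts = [], None, []

    def flush():
        # Store the finished turn if it has text and passes the name filter.
        if speaker and parts and (names is None or speaker in names):
            rows.append((speaker, ' '.join(parts)))

    for sib in start.find_next_siblings():
        # A <strong>, either on its own or wrapped in a <p>, starts a new turn.
        strong = sib if sib.name == 'strong' else sib.find('strong')
        if strong is not None:
            flush()
            speaker, parts = strong.get_text(strip=True), []
        elif sib.name == 'p' and speaker:
            parts.append(sib.get_text(strip=True))
    flush()
    return rows


def scrape_directory(folder='.', names=None):
    """Apply extract_answers to every .html file in `folder`."""
    results = []
    for path in sorted(Path(folder).glob('*.html')):
        html = path.read_text(encoding='utf-8')
        for speaker, text in extract_answers(html, names=names):
            results.append((path.name, speaker, text))
    return results


for speaker, text in extract_answers(SAMPLE_HTML):
    print(speaker, '|', text)
```

Without a filter this returns analysts' questions as well as executives' answers. Restricting the rows to executives means passing their names via `names`; deriving that list automatically depends on how each page lists its executives at the top, which is not visible here.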