
Python: converting a parsed section of an HTML file to CSV


I am new to Python. I am trying to get all the answers given by the executives (mentioned at the top) of a webpage (). This webpage is on my hard drive (so there is no URL).

So my end result should be:

Column 1  
All executives

Column 2  
all the answers
The answers should come only from the "Question-and-Answer Session" section.

I have tried the following:

# requests is not needed here since the file is read from disk
from bs4 import BeautifulSoup

with open('transcript-86-855.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# tag names are lowercase in BeautifulSoup, and the call needs a closing parenthesis
article_qanda = soup.find('div', id='article_qanda')

Can anyone help me with this?

If I understand you correctly, you want to print two columns, one with the name (in this case Dror Ben Asher) and the other with his answers.

For example:

import textwrap
from bs4 import BeautifulSoup

with open('page.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

print('{:<30} {:<70}'.format('Name', 'Answer'))
print('-' * 101)
# select each <p> that immediately follows a <strong> containing the executive's
# name, anywhere after the paragraph containing "Question-and-Answer Session"
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    txt = answer.get_text(strip=True)

    # an answer can span several <p> tags; keep appending sibling paragraphs
    # until the next speaker's <strong> tag is reached
    s = answer.find_next_sibling()
    while s:
        if s.name == 'strong' or s.find('strong'):
            break
        if s.name == 'p':
            txt += ' ' + s.get_text(strip=True)
        s = s.find_next_sibling()

    # wrap long answers so continuation lines stay aligned under the Answer column
    txt = ('\n' + ' '*31).join(textwrap.wrap(txt))

    print('{:<30} {:<70}'.format('Dror Ben Asher - CEO', txt))
    print()

What do you mean by columns??? There are no such columns in the "article_qanda" node; its child nodes are just p tags.

I want the output as columns; maybe that wasn't clear in the question.

Use find_all().

Awesome, except is it also possible to get this for all executives, and for multiple html files (which have the same html setup)?

@nikos To scrape another executive, just substitute that executive's name for 'Dror Ben Asher'. If you want to scrape multiple files, you can iterate over the current directory and open all .html files (a sketch of this follows below).

But is there no option to automatically identify the "executives" (the executives also differ between files, while the html setup is the same in all of them)? And how can the "symbolslug" be retrieved in a separate column? For this page the result should be RDHL in a third column. I want all the executives, not just Dror Ben Asher.
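A minimal sketch of that multi-file suggestion, assuming the transcripts all sit in the current directory and share the markup of page.html; the glob pattern and the hard-coded executive name are assumptions for illustration:

import glob
from bs4 import BeautifulSoup

# assumption: every transcript in the current directory uses the same markup
# as page.html, and the executive's name is known for each file
executive = 'Dror Ben Asher'

for path in glob.glob('*.html'):
    with open(path, 'r') as f_in:
        soup = BeautifulSoup(f_in.read(), 'html.parser')

    # same selector as in the answer above, with the executive's name substituted
    selector = ('p:contains("Question-and-Answer Session") ~ '
                'strong:contains("{}") + p'.format(executive))
    for answer in soup.select(selector):
        # prints only the first paragraph of each answer; reuse the
        # sibling-walking loop from the answer for multi-paragraph answers
        print(path, executive, answer.get_text(strip=True))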
Name                           Answer                                                                
-----------------------------------------------------------------------------------------------------
Dror Ben Asher - CEO           Thank you, Scott. Its a very good question indeed in January we
                               announced a new amendment and that amendment includes anti-TNF
                               patients some of them not all of them, those who qualify. And we are
                               talking about anti-TNF failures to be clear and only Remicade and
                               Humira. The idea here was to increase very significantly the patients
                               pooled of those potentially eligible for the study thus expediting
                               recruitment. Did I answer your question?

Dror Ben Asher - CEO           Right, this is one of most important tasks; right now the most
                               important item here is the divestment of non-core assets. All other
                               non-core assets, the non-core assets are those that are not within our
                               therapeutic focus of GI and inflammation. And those are specifically
                               RHB-103 RIZAPORT for migraine and RHB-101 which is a cardio drug.
                               RHB-101 is a legacy drug, we have recently announced last month, we
                               announced that we are in discussions for both of these product for
                               out-licensing, which we hope to complete in the first half of 2015. So
                               this is the highest priority, obviously discussion on other product,
                               but Redhill is in the fortunate position that we are able to complete
                               our Phase III studies with our existing results, resources and as time
                               goes by obviously the value of the assets keeps going up. So we are in
                               no rush to out-license everything else and so there is obviously in
                               track.

...and so on.
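Since the question title asks for CSV, one possible way to write the two columns out with Python's csv module could look like the sketch below; the input and output file names and the single hard-coded executive are assumptions carried over from the answer above:

import csv
from bs4 import BeautifulSoup

with open('page.html', 'r') as f_in:
    soup = BeautifulSoup(f_in.read(), 'html.parser')

rows = []
for answer in soup.select('p:contains("Question-and-Answer Session") ~ strong:contains("Dror Ben Asher") + p'):
    # column 1: the executive, column 2: the (first paragraph of the) answer;
    # reuse the sibling-walking loop from the answer for the full answer text
    rows.append(['Dror Ben Asher', answer.get_text(strip=True)])

with open('answers.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    writer.writerow(['Executive', 'Answer'])
    writer.writerows(rows)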