How to iterate over unlisted HTML in BeautifulSoup to extract content in list format


html python-3.x


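My question is a bit specific. I have been going through all the other questions about Beautiful Soup, but I have not found an answer to mine. I have taken a PDF file and converted it to somewhat decent HTML, with the intention of transcribing it further into a CSV file.

My web page looks like this, except that I have redacted a bunch of things I am not sure I want ordinary Google users to have: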

(RUSI) US Foundation
Last Updated: 2014-12-29
At A Glance
[st # redacted] I St. N.W.
Washington, DC United States 20006
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2013-12-31)
Assets: $3,085 Total giving: $0
EIN
[redacted]
990
[redacted]
Application Information
Unsolicited requests for funds not accepted.
Application form not required.
Directors Michael Clarke Sean Murphy Timothy Voake
Financial Data
Year ended 2013-12-31
Assets: $3,085 (market value)
Expenditures: $387
Total giving: $0
Qualifying distributions: $387
Additional Location Information
County: District of Columbia
Metropolitan area: Washington-Arlington-Alexandria, DC-VA-MD-WV Congressional district: District of Columbia District At-large

04Arts Foundation
Last Updated: 2013-05-15
At A Glance
P.O. Box [redacted]
San Antonio, TX United States 78283-1253 Telephone:(210) [redacted] Contact: Penelope Speier URL: www.04arts.org
Type of Grantmaker
Independent foundation
Financial Data
(yr. ended 2012-12-31)
Assets: $40,957 Total giving: $1,698
EIN
[redacted]
990
[redacted]
Additional Contact Information
Application Address: [redacted] Dallas, New Braunfels, TX 78130
Background
Established in 1995 in TX.
Limitations
No grants to individuals.
Fields of Interest Subjects
Arts
Application Information
Application form not required.
Initial approach: Proposal Deadline(s): None
Donor(s)
Note: If a donor is deceased, the symbol (f) follows the name.
Penelope Gallagher William Gallagher Edward Everett Collins, III Edwards Aquifer Authority
Officer
Penelope Speier, Pres.
Directors Wendy W. Atwell Jon Cochran
Financial Data
Year ended 2012-12-31
Assets: $40,957 (market value)
Gifts received: $[redacted] Expenditures: $[redacted] Total giving: $[redacted] Qualifying distributions: $[redacted] Giving activities include:
$[redacted] for grants
Additional Location Information
County: Bexar
Metropolitan area: San Antonio, TX Congressional district: Texas District 35

1 in 9: The Long Island Breast Cancer Action Coalition, Inc
Last Updated: 2011-12-19
At A Glance
[redacted] E. Rockaway Rd.
Hewlett, NY United States 11557-1736 Telephone:(516) [redacted] Fax: (516) [redacted] E-mail: [redacted]
Type of Grantmaker
Public charity
Additional Descriptor
Organization that normally receives a substantial part of its support from a governmental unit or from the general public
EIN
[redacted]
990
[redacted]
Purpose and Activities
The coalition's mission is to promote awareness of the breast cancer epidemic through education, outreach, advocacy, and direct support of research which is being done to find the causes of and cures for breast cancer and other related cancers.
Fields of Interest Subjects
Breast cancer
Breast cancer research
Cancer
Cancer research
Types of Support
Research
Publications
Newsletter
Officers and Directors
Note: An asterisk (*) following an individual's name indicates an officer who is also a trustee or director.
Geri Barish *, Pres.
Louise Levrie, V.P.
Larry Slatky *, Treas.
Caroline Boss Fran Kritchek Frank P. Naudus Leon Newman
Additional Location Information
County: Nassau
Metropolitan area: New York-Northern New Jersey-Long Island, NY-NJ-PA Congressional district: New York District 04

My HTML currently looks like this (exactly like this, so be warned, it is horrible):

I get:

(RUSI) US Foundation
['Last Updated: ', '2014', '-', '12-29']
At A Glance

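Great! The next part is where I am stuck. I need to grab everything between "At A Glance" and "Type of Grantmaker". Then I need to do the same between "Type of Grantmaker" and the next heading, and so on. One advantage is that the tags are almost always the same for similar headings. For example, I can get the names of all the titles with code like this:

titles=html….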
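One way to collect the text that sits between two headings is to walk the extracted strings in order and start a new bucket whenever a known heading appears. The following is only a minimal sketch under assumptions: the text has already been pulled out of the HTML as a list of strings, and `HEADINGS` and `group_by_heading` are hypothetical names introduced here, with the heading set copied by hand from the sample page above.

```python
# Heading names copied from the sample page; the real file may contain more.
HEADINGS = {
    "At A Glance",
    "Type of Grantmaker",
    "Financial Data",
    "EIN",
    "990",
    "Application Information",
    "Additional Location Information",
}


def group_by_heading(lines, headings=HEADINGS):
    """Return (preamble, fields).

    preamble holds the lines seen before the first heading (title, date);
    fields maps each heading to the list of lines that follow it.
    """
    preamble = []
    fields = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line in headings:
            current = line
            # setdefault keeps earlier lines when a heading repeats
            # ("Financial Data" appears twice per organization).
            fields.setdefault(current, [])
        elif current is None:
            preamble.append(line)
        else:
            fields[current].append(line)
    return preamble, fields


sample = [
    "(RUSI) US Foundation",
    "Last Updated: 2014-12-29",
    "At A Glance",
    "[st # redacted] I St. N.W.",
    "Washington, DC United States 20006",
    "Type of Grantmaker",
    "Independent foundation",
]
preamble, fields = group_by_heading(sample)
print(preamble)
print(fields["At A Glance"])
print(fields["Type of Grantmaker"])
```

This only matches headings that stand alone on a line. A line such as "Directors Michael Clarke Sean Murphy Timothy Voake", where the heading and its content run together, would need extra handling.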

The output I want is a list that looks like this:

[[first organization, last_updated, at_a_glance, type_of_grantmaker, financial_data, ...], 
[second organization, ...], [third organization, ...], ...]
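A nested list of this shape maps directly onto CSV rows, which is the stated end goal. Below is a rough sketch, assuming each organization has already been parsed into a dict. `COLUMNS`, `to_row`, and the sample records are illustrative names and data, not part of the original post. Missing fields become empty cells so that every row has the same width.

```python
import csv
import io

# Hypothetical column order; extend it with the other headings as needed.
COLUMNS = ["name", "last_updated", "At A Glance", "Type of Grantmaker"]


def to_row(record, columns=COLUMNS):
    """Turn one organization's dict into a fixed-width row."""
    row = []
    for col in columns:
        value = record.get(col, "")  # missing field -> empty cell
        if isinstance(value, list):
            value = "; ".join(value)  # flatten multi-line fields
        row.append(value)
    return row


records = [
    {
        "name": "(RUSI) US Foundation",
        "last_updated": "2014-12-29",
        "At A Glance": ["[st # redacted] I St. N.W.",
                        "Washington, DC United States 20006"],
        "Type of Grantmaker": ["Independent foundation"],
    },
    {
        "name": "04Arts Foundation",
        "last_updated": "2013-05-15",
        "Type of Grantmaker": ["Independent foundation"],
    },
]

rows = [to_row(r) for r in records]

# Write to an in-memory buffer here; open('out.csv', 'w', newline='')
# would be the usual choice for a real file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(COLUMNS)
writer.writerows(rows)
print(buffer.getvalue())
```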

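Any step in the right direction is greatly appreciated! If you think my question is bad for any reason, I would appreciate a comment along with the -1 so that I can fix it. I am new here, and my last question was not received very well…

It turned out that the easiest approach for me was to split the file up before feeding it to BeautifulSoup. So what I did was split it with the following code and then (for now) write a function that handles extracting the text from each piece: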

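from bs4 import BeautifulSoup as Soup

# Read the converted HTML and split it at the markup that precedes each title.
with open('found1.html', 'r') as f:
    html = f.read()
sections = html.split('</a><span class="font6" style="font-weight:bold;">')


# Developing this bit to extract text cleanly.
def extract(fragment):
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = Soup(fragment, 'html.parser')
    # Keep the text nodes instead of discarding the find_all() result.
    texts = soup.find_all(string=True)
    print(texts)
    print(soup.text)
    return texts


# Gives me the whole html between the first title and the second
print(sections[1])
extract(sections[1])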
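To go from the split sections toward the list-of-lists output, each fragment can be parsed on its own and reduced to its non-empty strings. The sketch below uses made-up inline markup, because the real `found1.html` is not shown. `DELIMITER` and `section_texts` are hypothetical names, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup

DELIMITER = '</a><span class="font6" style="font-weight:bold;">'

# Made-up markup for illustration only; it just reuses the delimiter above.
html = (
    '<p><a name="b1">' + DELIMITER + '(RUSI) US Foundation</span></p>'
    '<p>Last Updated: 2014-12-29</p><p>At A Glance</p>'
    '<p><a name="b2">' + DELIMITER + '04Arts Foundation</span></p>'
    '<p>Last Updated: 2013-05-15</p><p>At A Glance</p>'
)


def section_texts(fragment):
    """Parse one fragment and return its non-empty strings in order."""
    soup = BeautifulSoup(fragment, "html.parser")
    # stripped_strings yields each text node with whitespace trimmed
    # and skips nodes that are empty after trimming.
    return list(soup.stripped_strings)


# The first piece is whatever precedes the first title, so skip it.
rows = [section_texts(s) for s in html.split(DELIMITER)[1:]]
for row in rows:
    print(row)
```

Splitting on a raw string leaves each fragment with unbalanced tags. The parser tolerates that here, but the approach depends on the delimiter appearing exactly once per organization.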


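Correct me if I am wrong: the HTML you provided contains data for a single organization, right?
No, it provides data for a series of organizations.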
