Python: putting soup contents into a structured CSV
I'm trying to scrape a website into a structured data format. I want to end up with a .csv with 6 columns: country, date, general_text, fiscal_text, monetary_text, fx_text. The mapping looks like this:
country <- h3
date <- h6
general_text <- p (h3) (the p tag that follows the h3 header)
fiscal_text <- p (1st h5 ul li) (the p tag that follows the **first** h5. This tag is inside ul and li blocks)
monetary_text <- p (2nd h5 ul li) (the p tag that follows the **second** h5. This tag is inside ul and li blocks)
fx_text <- p (3rd h5 ul li) (the p tag that follows the **third** h5. This tag is inside ul and li blocks)
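The mapping above can be sketched directly with BeautifulSoup's find_next, which searches forward in document order from an anchor tag. This is a minimal, hypothetical example (the HTML and country names are made up to mirror the structure, not taken from the real IMF page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the mapping above (not the real page).
html = """
<div class="rr-intro">
  <h3>Atlantis</h3>
  <p>general summary A</p>
  <h6>Last updated: May 1, 2020</h6>
  <h5>Fiscal</h5><ul><li><p>fiscal A</p></li></ul>
  <h5>Monetary and macro-financial</h5><ul><li><p>monetary A</p></li></ul>
  <h5>Exchange rate and balance of payments</h5><ul><li><p>fx A</p></li></ul>
  <h3>Erewhon</h3>
  <p>general summary B</p>
  <h6>Last updated: May 2, 2020</h6>
  <h5>Fiscal</h5><ul><li><p>fiscal B</p></li></ul>
  <h5>Monetary and macro-financial</h5><ul><li><p>monetary B</p></li></ul>
  <h5>Exchange rate and balance of payments</h5><ul><li><p>fx B</p></li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for h3 in soup.find_all('h3'):
    # find_next searches forward from this h3, so every lookup below is
    # anchored to the current country's block.
    country = h3.get_text(strip=True)
    general = h3.find_next('p').get_text(strip=True)
    date = h3.find_next('h6').get_text(strip=True)
    # The three h5 headings come in a fixed order; the p that follows
    # each one (inside the ul/li) carries that section's text.
    h5s = [h3.find_next('h5')]
    h5s.append(h5s[-1].find_next('h5'))
    h5s.append(h5s[-1].find_next('h5'))
    fiscal, monetary, fx = (h.find_next('p').get_text(strip=True) for h in h5s)
    rows.append([country, date, general, fiscal, monetary, fx])

print(rows)
```

Note this sketch assumes every country block contains exactly one p after the h3 and exactly three h5 sections; a block with a missing section would pull text from the next country.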
I have the following code for simple text extraction:
import requests
import io
import csv
from bs4 import BeautifulSoup
from urllib.request import urlopen

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='rr-intro')

with io.open('test.txt', 'w', encoding='utf8') as f:
    for header in results.find_all(['h3', 'h6', 'h5']):
        f.write(header.get_text() + u'\n')
        for elem in header.next_siblings:
            if elem.name and elem.name.startswith('h'):
                # stop at next header
                break
            if elem.name and elem.find_all('p'):
                f.write(elem.get_text() + u'\n')
From the comments, I think it makes sense to create lists and zip them somehow. I tried this:
h3 = results.find_all('h3')
h6 = results.find_all('h6')
h5 = results.find_all('h5')
h5f = results.find_all('h5', text='Fiscal')
h5m = results.find_all('h5', text='Monetary and macro-financial')
h5x = results.find_all('h5', text='Exchange rate and balance of payments')
country = [country.get_text() for country in h3] #list of countries
date = [date.get_text() for date in h6] #date string
This is where I'm stuck. I'm not sure how to get the contents of the p tags into the right place in a list so that I can zip them, or write them straight to csv. I'm new to Python, so I pieced this together from what I found on Stack Overflow. Any help would be much appreciated.
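For the zip route: once the parallel lists all have one entry per country and are in the same order, csv.writer can take the zipped tuples directly. A minimal sketch with made-up list contents (the values here are placeholders, not scraped data):

```python
import csv
import io

# Hypothetical parallel lists, one entry per country, already aligned.
country = ['Atlantis', 'Erewhon']
date = ['2020-05-01', '2020-05-02']
general = ['summary 1', 'summary 2']
fiscal = ['fiscal 1', 'fiscal 2']
monetary = ['monetary 1', 'monetary 2']
fx = ['fx 1', 'fx 2']

buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['country', 'date', 'general_text', 'fiscal_text',
                 'monetary_text', 'fx_text'])
# zip pairs up the i-th element of every list, giving one tuple per row.
writer.writerows(zip(country, date, general, fiscal, monetary, fx))

print(buffer.getvalue())
```

The catch, as the question notes, is building the fiscal/monetary/fx lists so that they stay aligned with the country list when a p tag is missing or duplicated.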
Edit: to clarify, the structure I want to handle looks like this:
<div class="rr-intro">
<h3>
Country 1
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 1
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 1
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 1
</p>
</li>
</ul>
<h3>
Country 2
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 2
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 2
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 2
</p>
</li>
</ul>
<h3>
Country 3
</h3>
and so on.

I think the simplest approach is to process the elements in the order they are read. To do that, keep track of the current section and append text to it. Once the next h3 country is found, the accumulated row of information can be written out using Python's csv.DictWriter. For example:
from collections import defaultdict
import requests
import csv
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='rr-intro')

section_lookup = {
    'Fiscal' : 'fiscal_text',
    'Moneta' : 'monetary_text',
    'Macro-' : 'monetary_text',
    'Exchan' : 'fx_text',
}

with open('data.csv', 'w', encoding='utf8', newline='') as f_output:
    fieldnames = ['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text']
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames)
    csv_output.writeheader()

    row = defaultdict(str)
    section = None

    for elem in results.find_all(['h3', 'h6', 'h5', 'p']):
        if elem.name == 'h3':
            if row:
                csv_output.writerow(row)
                row = defaultdict(str)
            row['country'] = elem.get_text(strip=True)
            section = "general_text"
        elif elem.name == 'h5':
            section = section_lookup[elem.get_text(strip=True)[:6]]
        elif elem.name == 'h6':
            row['date'] = elem.get_text(strip=True)[27:]
        elif elem.name == 'p' and section:
            row[section] = f"{row[section]} {elem.get_text(strip=True)}"

    if row:
        csv_output.writerow(row)
This gives you a data.csv file.
Does this help? Where does results come from?

@CannedScientist I understand that code, but I can't apply it to the logic needed for the site I'm scraping. I need elements h3 = country, h6 = date, h5 = type of text. After h3 the content is in a p tag. h6 has no content (just the date). h5 is complicated because its content sits inside ul and li blocks. I don't quite see how to direct each piece of content to the right place in the .csv. I think I need to make lists or something, but I'm stuck.

@xxMrPHDxx Sorry, I missed a line of code. It's there now.

This has been an education for me. I had come up with other code that created 4 different .csv files and concatenated them, but it hit an off-by-one error because of the multiple p tags. This is much better. I'm not sure what the row logic is doing, but I'll learn from it. Very clever use of a lookup dictionary. I also learned something from the section_lookup and [:6] logic. Very cool and much appreciated.

row is a dictionary that holds the values for one row. I use a defaultdict to make appending the p tags easier.

My last challenge here is handling some encoding. The page is UTF-8, and I know how to fetch it with requests and Soup; it reads and displays fine. But I think that when the script extracts the p tags it loses this (e.g. São Tomé and Príncipe), and the text ends up mangled in the csv. I tried adding .encode('utf-8') and .decode('utf-8', 'ignore') around csv_output.writerow(row) and row = defaultdict(str), but that made things worse. I'll dig into this, but I thought I'd point it out in case anyone uses this and hits the same problem. Cheers.
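On that encoding point: in Python 3, get_text() returns proper Unicode, so the extraction itself usually isn't the culprit; a common cause of "mangled" accents is opening the CSV in a program (e.g. a spreadsheet) that doesn't assume UTF-8. This is a guess at the cause, not confirmed from the page, but one standard workaround is writing with the 'utf-8-sig' encoding so a byte-order mark flags the file as UTF-8. A sketch:

```python
import csv

# A row with the accented country name cited in the comment above.
row = {'country': 'São Tomé and Príncipe', 'date': 'May 1, 2020'}

# 'utf-8-sig' writes a byte-order mark first; spreadsheet software uses
# it to detect UTF-8. No per-field .encode()/.decode() is needed.
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['country', 'date'])
    writer.writeheader()
    writer.writerow(row)

# Read it back to confirm the accents survive the round trip.
with open('data.csv', encoding='utf-8-sig') as f:
    content = f.read()
print(content)
```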