Python: putting soup contents into a structured CSV
I'm trying to scrape a website into a structured data format. I want to end up with a .csv with 6 columns: country, date, general_text, fiscal_text, monetary_text, fx_text. The mapping looks like this:
country <- h3
date <- h6
general_text <- p (h3) (the p tag that follows the h3 header)
fiscal_text <- p (1st h5 ul li) (the p tag that follows the **first** h5. This tag is inside ul and li blocks)
monetary_text <- p (2nd h5 ul li) (the p tag that follows the **second** h5. This tag is inside ul and li blocks)
fx_text <- p (3rd h5 ul li) (the p tag that follows the **third** h5. This tag is inside ul and li blocks)
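The mapping above can be sketched directly with BeautifulSoup's find_next, which searches forward in document order from an anchor tag. This is a minimal, hypothetical example (the HTML and country names are made up to mirror the structure, not taken from the real IMF page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML mirroring the mapping above (not the real page).
html = """
<div class="rr-intro">
  <h3>Atlantis</h3>
  <p>general summary A</p>
  <h6>Last updated: May 1, 2020</h6>
  <h5>Fiscal</h5><ul><li><p>fiscal A</p></li></ul>
  <h5>Monetary and macro-financial</h5><ul><li><p>monetary A</p></li></ul>
  <h5>Exchange rate and balance of payments</h5><ul><li><p>fx A</p></li></ul>
  <h3>Erewhon</h3>
  <p>general summary B</p>
  <h6>Last updated: May 2, 2020</h6>
  <h5>Fiscal</h5><ul><li><p>fiscal B</p></li></ul>
  <h5>Monetary and macro-financial</h5><ul><li><p>monetary B</p></li></ul>
  <h5>Exchange rate and balance of payments</h5><ul><li><p>fx B</p></li></ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for h3 in soup.find_all('h3'):
    # find_next searches forward from this h3, so every lookup below is
    # anchored to the current country's block.
    country = h3.get_text(strip=True)
    general = h3.find_next('p').get_text(strip=True)
    date = h3.find_next('h6').get_text(strip=True)
    # The three h5 headings come in a fixed order; the p that follows
    # each one (inside the ul/li) carries that section's text.
    h5s = [h3.find_next('h5')]
    h5s.append(h5s[-1].find_next('h5'))
    h5s.append(h5s[-1].find_next('h5'))
    fiscal, monetary, fx = (h.find_next('p').get_text(strip=True) for h in h5s)
    rows.append([country, date, general, fiscal, monetary, fx])

print(rows)
```

Note this sketch assumes every country block contains exactly one p after the h3 and exactly three h5 sections; a block with a missing section would pull text from the next country.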
I have the following code for simple text extraction:
import requests
import io
import csv
from bs4 import BeautifulSoup
from urllib.request import urlopen

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(class_='rr-intro')

with io.open('test.txt', 'w', encoding='utf8') as f:
    for header in results.find_all(['h3', 'h6', 'h5']):
        f.write(header.get_text() + u'\n')
        for elem in header.next_siblings:
            if elem.name and elem.name.startswith('h'):
                # stop at next header
                break
            if elem.name and elem.find_all('p'):
                f.write(elem.get_text() + u'\n')
From the comments, I think it makes sense to create lists and zip them somehow. I tried this:
h3 = results.find_all('h3')
h6 = results.find_all('h6')
h5 = results.find_all('h5')
h5f = results.find_all('h5', text='Fiscal')
h5m = results.find_all('h5', text='Monetary and macro-financial')
h5x = results.find_all('h5', text='Exchange rate and balance of payments')
country = [country.get_text() for country in h3] #list of countries
date = [date.get_text() for date in h6] #date string
This is where I'm stuck. I'm not sure how to get the contents of the p tags into the right place in a list so that I can zip them, or write them straight to csv. I'm new to Python, so I pieced this together from what I found on Stack Overflow. Any help would be much appreciated.
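For the zip route: once the parallel lists all have one entry per country and are in the same order, csv.writer can take the zipped tuples directly. A minimal sketch with made-up list contents (the values here are placeholders, not scraped data):

```python
import csv
import io

# Hypothetical parallel lists, one entry per country, already aligned.
country = ['Atlantis', 'Erewhon']
date = ['2020-05-01', '2020-05-02']
general = ['summary 1', 'summary 2']
fiscal = ['fiscal 1', 'fiscal 2']
monetary = ['monetary 1', 'monetary 2']
fx = ['fx 1', 'fx 2']

buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
writer = csv.writer(buffer)
writer.writerow(['country', 'date', 'general_text', 'fiscal_text',
                 'monetary_text', 'fx_text'])
# zip pairs up the i-th element of every list, giving one tuple per row.
writer.writerows(zip(country, date, general, fiscal, monetary, fx))

print(buffer.getvalue())
```

The catch, as the question notes, is building the fiscal/monetary/fx lists so that they stay aligned with the country list when a p tag is missing or duplicated.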
Edit: to clarify, the structure I want to handle looks like this:
<div class="rr-intro">
<h3>
Country 1
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 1
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 1
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 1
</p>
</li>
</ul>
<h3>
Country 2
</h3>
<p>
summary text
</p>
<h6>
date
</h6>
<h5>
Fiscal
</h5>
<ul>
<li>
<p>
text for fiscal of country 2
</p>
</li>
</ul>
<h5>
Monetary and macro-financial
</h5>
<ul>
<li>
<p>
text for monetary of country 2
</p>
</li>
</ul>
<h5>
Exchange rate and balance of payments
</h5>
<ul>
<li>
<p>
text for FX of country 2
</p>
</li>
</ul>
<h3>
Country 3
</h3>
and so on.

I think the simplest approach is to process the elements in the order they are read. To do that, keep track of the current section and append text to it. Once the next h3 country is found, the accumulated row of information can be written out using Python's csv.DictWriter. For example:
from collections import defaultdict
import requests
import csv
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', class_='rr-intro')

section_lookup = {
    'Fiscal' : 'fiscal_text',
    'Moneta' : 'monetary_text',
    'Macro-' : 'monetary_text',
    'Exchan' : 'fx_text',
}

with open('data.csv', 'w', encoding='utf8', newline='') as f_output:
    fieldnames = ['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text']
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames)
    csv_output.writeheader()

    row = defaultdict(str)
    section = None

    for elem in results.find_all(['h3', 'h6', 'h5', 'p']):
        if elem.name == 'h3':
            if row:
                csv_output.writerow(row)
                row = defaultdict(str)
            row['country'] = elem.get_text(strip=True)
            section = "general_text"
        elif elem.name == 'h5':
            section = section_lookup[elem.get_text(strip=True)[:6]]
        elif elem.name == 'h6':
            row['date'] = elem.get_text(strip=True)[27:]
        elif elem.name == 'p' and section:
            row[section] = f"{row[section]} {elem.get_text(strip=True)}"

    if row:
        csv_output.writerow(row)
This gives you a data.csv file.
Does this help? Where does results come from?

@CannedScientist I understand that code, but I can't apply it to the logic needed for the site I'm scraping. I need elements h3 = country, h6 = date, h5 = type of text. After h3 the content is in a p tag. h6 has no content (just the date). h5 is complicated because its content sits inside ul and li blocks. I don't quite see how to direct each piece of content to the right place in the .csv. I think I need to make lists or something, but I'm stuck.

@xxMrPHDxx Sorry, I missed a line of code. It's there now.

This has been an education for me. I had come up with other code that created 4 different .csv files and concatenated them, but it hit an off-by-one error because of the multiple p tags. This is much better. I'm not sure what the row logic is doing, but I'll learn from it. Very clever use of a lookup dictionary. I also learned something from the section_lookup and [:6] logic. Very cool and much appreciated.

row is a dictionary that holds the values for one row. I use a defaultdict to make appending the p tags easier.

My last challenge here is handling some encoding. The page is UTF-8, and I know how to fetch it with requests and Soup; it reads and displays fine. But I think that when the script extracts the p tags it loses this (e.g. São Tomé and Príncipe), and the text ends up mangled in the csv. I tried adding .encode('utf-8') and .decode('utf-8', 'ignore') around csv_output.writerow(row) and row = defaultdict(str), but that made things worse. I'll dig into this, but I thought I'd point it out in case anyone uses this and hits the same problem. Cheers.
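On that encoding point: in Python 3, get_text() returns proper Unicode, so the extraction itself usually isn't the culprit; a common cause of "mangled" accents is opening the CSV in a program (e.g. a spreadsheet) that doesn't assume UTF-8. This is a guess at the cause, not confirmed from the page, but one standard workaround is writing with the 'utf-8-sig' encoding so a byte-order mark flags the file as UTF-8. A sketch:

```python
import csv

# A row with the accented country name cited in the comment above.
row = {'country': 'São Tomé and Príncipe', 'date': 'May 1, 2020'}

# 'utf-8-sig' writes a byte-order mark first; spreadsheet software uses
# it to detect UTF-8. No per-field .encode()/.decode() is needed.
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['country', 'date'])
    writer.writeheader()
    writer.writerow(row)

# Read it back to confirm the accents survive the round trip.
with open('data.csv', encoding='utf-8-sig') as f:
    content = f.read()
print(content)
```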