Python: Putting soup contents into a structured csv


I'm trying to scrape a website into a structured data format. I want to end up with a .csv containing 6 columns: country, date, general_text, fiscal_text, monetary_text, fx_text.

The mapping looks like this:

country <- h3
date <- h6
general_text <- p (h3) (the p tag that follows the h3 header)
fiscal_text  <- p (1st h5 ul li) (the p tag that follows the **first** h5. This tag is inside ul and li blocks)
monetary_text <- p (2nd h5 ul li) (the p tag that follows the **second** h5. This tag is inside ul and li blocks)
fx_text <- p (3rd h5 ul li) (the p tag that follows the **third** h5. This tag is inside ul and li blocks)
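
As a hedged illustration of this mapping, pulling the first country's fields with standard BeautifulSoup navigation (find_next walks forward through the document) might look like this:

results = soup.find('div', class_='rr-intro')
first_country = results.find('h3')
general = first_country.find_next('p').get_text(strip=True)
fiscal = results.find('h5', text='Fiscal').find_next('p').get_text(strip=True)

This only reaches the first country's first paragraph per section; covering every country requires iterating, which is what the code below attempts.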
I have the following code for simple text extraction:

import io
import requests
from bs4 import BeautifulSoup
URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(class_='rr-intro')

with io.open('test.txt', 'w', encoding='utf8') as f:
    for header in results.find_all(['h3', 'h6', 'h5']):
        f.write(header.get_text() + u'\n') 
        for elem in header.next_siblings:
            if elem.name and elem.name.startswith('h'):
                # stop at next header
                break
            if elem.name and elem.find_all('p'):
                # write any sibling block that contains paragraph text
                f.write(elem.get_text() + u'\n')
From the comments, I think it makes sense to create lists and then zip them somehow. I tried this:

h3 = results.find_all('h3')
h6 = results.find_all('h6')
h5 = results.find_all('h5')
h5f = results.find_all('h5', text='Fiscal')
h5m = results.find_all('h5', text='Monetary and macro-financial')
h5x = results.find_all('h5', text='Exchange rate and balance of payments')
country = [c.get_text() for c in h3]  # list of country names
date = [d.get_text() for d in h6]     # list of date strings

This is where I'm stuck. I'm not sure how to get the contents of the p tags into the right positions in lists so they can be zipped, or zipped straight into the csv (see the sketch below).
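
For what it's worth, a minimal sketch of that zip idea, assuming the four text lists (general, fiscal, monetary, fx are hypothetical names here) could be built in the same order as country and date:

import csv

# hypothetical lists, one entry per country, all in matching order
rows = zip(country, date, general, fiscal, monetary, fx)

with open('test.csv', 'w', encoding='utf8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text'])
    writer.writerows(rows)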

I'm new to Python, so I pieced this together from what I found on Stack Overflow. Any help would be much appreciated.

Edit: To clarify, the structure I'm working with looks like this:

<div class="rr-intro">

 <h3>
  Country 1
 </h3>
 <p>
  summary text
 </p>
 <h6>
  date
 </h6>
 <h5>
  Fiscal
 </h5>
 <ul>
  <li>
   <p>
    text for fiscal of country 1
   </p>
  </li>
 </ul>
 <h5>
  Monetary and macro-financial
 </h5>
 <ul>
  <li>
   <p>
    text for monetary of country 1
   </p>
  </li>
 </ul>
 <h5>
  Exchange rate and balance of payments
 </h5>
 <ul>
  <li>
   <p>
    text for FX of country 1
   </p>
  </li>
 </ul>
 <h3>
  Country 2
 </h3>
 <p>
  summary text
 </p>
 <h6>
  date
 </h6>
 <h5>
  Fiscal
 </h5>
 <ul>
  <li>
   <p>
    text for fiscal of country 2
   </p>
  </li>
 </ul>
 <h5>
  Monetary and macro-financial
 </h5>
 <ul>
  <li>
   <p>
    text for monetary of country 2
   </p>
  </li>
 </ul>
 <h5>
  Exchange rate and balance of payments
 </h5>
 <ul>
  <li>
   <p>
    text for FX of country 2
   </p>
  </li>
 </ul>
 <h3>
  Country 3
 </h3>

...and so on.

I feel the easiest approach is to process the elements as they are read in order. To do that, you can keep track of the current section and append the text to it.

Once the next h3 country is found, a row of information can be written using Python's csv DictWriter. For example:

from collections import defaultdict
import requests
import csv 
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find('div', class_='rr-intro')

# first six characters of each h5 sub-heading, mapped to its CSV column
section_lookup = {
    'Fiscal' : 'fiscal_text',
    'Moneta' : 'monetary_text',
    'Macro-' : 'monetary_text',
    'Exchan' : 'fx_text',
}

with open('data.csv', 'w', encoding='utf8', newline='') as f_output:
    fieldnames = ['country', 'date', 'general_text', 'fiscal_text', 'monetary_text', 'fx_text']
    csv_output = csv.DictWriter(f_output, fieldnames=fieldnames)
    csv_output.writeheader()

    row = defaultdict(str)  # one country's values; str default lets text be appended
    section = None

    for elem in results.find_all(['h3', 'h6', 'h5', 'p']):
        if elem.name == 'h3':
            # a new country heading: flush the previous country's row first
            if row:
                csv_output.writerow(row)
                row = defaultdict(str)

            row['country'] = elem.get_text(strip=True)
            section = "general_text"

        elif elem.name == 'h5':
            # route the p tags that follow to this sub-heading's column
            section = section_lookup[elem.get_text(strip=True)[:6]]
        elif elem.name == 'h6':
            # drop the fixed-length prefix before the date
            row['date'] = elem.get_text(strip=True)[27:]
        elif elem.name == 'p' and section:
            row[section] = f"{row[section]} {elem.get_text(strip=True)}"

    # write the final country's row
    if row:
        csv_output.writerow(row)
This gives you a data.csv file that begins:
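The first line, written by csv_output.writeheader(), is the header row (the data rows follow from whatever the live page contains):

    country,date,general_text,fiscal_text,monetary_text,fx_text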


• Does that help?
• Where does results come from?
• @CannedScientist I understand this code, but I can't map it onto the logic the site I'm scraping needs. I need the elements h3 = country, h6 = date, h5 = type of text. The h3 is followed by its content in a p tag. The h6 has no content (it's just the date). The h5 is the complicated one, because its p tag sits inside ul and li blocks. I don't quite see how to steer each piece of content to the right place in the .csv. I think I need to build lists or something, but I'm stuck.
• @xxMrPHDxx Sorry, I missed a line of code. It's there now.
• This has been an education for me. I came up with other code that creates 4 different .csv files and joins them, but it had an off-by-one error because of the multiple p tags. This is much better. I'm not sure what the row logic is doing, but I'll learn from it. Very clever use of the lookup dictionary, and I learned from the section_lookup and [:6] logic too. Very cool and much appreciated.
• row is a dictionary holding the values for one row. I use a defaultdict to make appending the p tags easier.
• My last challenge here is the encoding. The page is UTF-8, and I know how to fetch it as text with Soup; it reads and displays fine. But I think the script loses that when it extracts the p tags (e.g. São Tomé and Príncipe), and the text ends up mangled in the csv. I tried adding .encode('utf-8') and .decode('utf-8', 'ignore') around csv_output.writerow(row) and row = defaultdict(str), but that made things worse. I'll dig into it, but I wanted to flag it in case anyone uses this and hits the same problem (see the sketch below). Cheers.
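
One hedged guess at that mojibake, not a confirmed fix: requests sometimes mis-detects the page encoding from the HTTP headers, and spreadsheet tools often open a plain UTF-8 csv as cp1252. Forcing UTF-8 before parsing, and writing the file as utf-8-sig so Excel recognises the encoding, is worth trying:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.imf.org/en/Topics/imf-and-covid19/Policy-Responses-to-COVID-19'
page = requests.get(URL)
page.encoding = 'utf-8'  # assumption: override requests' guessed encoding
soup = BeautifulSoup(page.text, 'html.parser')

# 'utf-8-sig' writes a BOM so Excel opens the csv as UTF-8
with open('data.csv', 'w', encoding='utf-8-sig', newline='') as f_output:
    ...  # same DictWriter code as above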