Python: How to ensure BeautifulSoup output does not treat commas as delimiters
I have created a scraping script that pulls information from a local newspaper website. The current code has two problems:

When it retrieves the paragraph data and saves it to CSV, it treats "," as a break and saves the following text in the adjacent cell. How do I stop this from happening?

I want the scraped information for each article on a single row, i.e. paragraph, title, weblink.

The code is below:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://neweralive.na/today/"
ne_url = "https://neweralive.na/posts/"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("article", {"class": "post-item"})
filename = "newera.csv"
headers = "paragraph,title,link\n"
f = open(filename, "w")
f.write(headers)
for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text
    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    f.write(paragraph + "," + title + "," + weblink + "\n")
f.close()
A well-formed CSV can be written with Python's csv module, which puts quotes around the strings that need them (for example, strings containing commas).
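For instance, csv.writer's default quoting mode only quotes fields that actually contain the delimiter (a minimal, self-contained sketch with a made-up row):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote only fields that need it
writer.writerow(["The mayor, Elias Nghipangelwa, said...", "Guards arrested", "https://example.com"])
print(buf.getvalue())  # the first field comes out wrapped in double quotes
```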
While I was at it, I refactored your code into reusable functions:

- get_soup_from_url() downloads a URL and returns a BeautifulSoup tree for it
- parse_today_page() is a generator function that walks that soup and yields a dict for each article
- the main code now just uses csv.DictWriter on the opened file; each parsed dict is printed to the console for easier debugging and fed to the CSV writer for output

The resulting newera.csv looks like this:
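The refactored script itself was not captured on this page; a sketch of what it might look like, reusing the selectors from the question (the function names get_soup_from_url and parse_today_page and the field names come from the answer's description above):

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

PAGE_URL = "https://neweralive.na/today/"
NE_URL = "https://neweralive.na/posts/"


def get_soup_from_url(url):
    """Download a URL and return a BeautifulSoup tree for it."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


def parse_today_page(page_soup):
    """Generator: yield a dict for each article on the page."""
    for container in page_soup.find_all("article", {"class": "post-item"}):
        title_el = container.find("h3", {"class": "post-title"})
        yield {
            "paragraph": container.find("p", {"class": "post-excerpt"}).text,
            "title": title_el.text,
            "link": NE_URL + title_el.a["href"],
        }


def main():
    page_soup = get_soup_from_url(PAGE_URL)
    with open("newera.csv", "w", newline="") as f:
        # DictWriter quotes any field that contains the delimiter
        writer = csv.DictWriter(f, fieldnames=["paragraph", "title", "link"])
        writer.writeheader()
        for article in parse_today_page(page_soup):
            print(article)  # debug output
            writer.writerow(article)


# To run the scrape and write newera.csv: main()
```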
paragraph,title,link
"The mayor of Helao Nafidi, Elias Nghipangelwa, has expressed disappointment after Covid-19 relief food was stolen and sold by two security officers entrusted to guard the warehouse where the food was stored.","Guards arrested for theft of relief food",https://neweralive.na/posts/posts/guards-arrested-for-theft-of-relief-food
"Government has decided to construct 1 200 affordable homes, starting Thursday this week.","Govt to construct 1 200 low-cost houses",https://neweralive.na/posts/posts/govt-to-construct-1-200-low-cost-houses
...
You can also use the pandas module to easily convert a table of rows into a DataFrame and write it out as CSV; DataFrame.to_csv quotes fields containing commas as well:
import pandas as pd
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://neweralive.na/today/"
ne_url = "https://neweralive.na/posts/"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("article", {"class": "post-item"})
filename = "newera.csv"
rows = []  # list of rows, converted to a DataFrame afterwards

for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text
    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    rows.append([paragraph, title, weblink])  # append each article as a row

df = pd.DataFrame(rows, columns=["paragraph", "title", "link"])  # column names become the CSV header
df.to_csv(filename, index=None)
Comments: There are heroes in this world, thank you so much! This is a crystal-clear example of CSV in Python. / This is a task that absolutely does not require pandas. / I agree, but thank you for showing the pandas alternative; having several possible approaches is part of what makes Python so powerful.