Python: How to ensure BeautifulSoup output does not treat commas as delimiters
I have created a scraping script that pulls information from a local newspaper website. The current code has two problems:

When it retrieves the paragraph data and saves it to CSV, it treats "," as a break and saves the following text in the adjacent cell. How do I stop this from happening?

I want the scraped information for each article on a single row, i.e. paragraph, title, weblink.

The code is below:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://neweralive.na/today/"
ne_url = "https://neweralive.na/posts/"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("article", {"class": "post-item"})
filename = "newera.csv"
headers = "paragraph,title,link\n"
f = open(filename, "w")
f.write(headers)
for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text
    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    f.write(paragraph + "," + title + "," + weblink + "\n")
f.close()
A well-formed CSV can be written with Python's csv module, which puts quotes around the strings that need them (for example, strings containing commas).
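For instance, csv.writer's default quoting mode only quotes fields that actually contain the delimiter (a minimal, self-contained sketch with a made-up row):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)  # default QUOTE_MINIMAL: quote only fields that need it
writer.writerow(["The mayor, Elias Nghipangelwa, said...", "Guards arrested", "https://example.com"])
print(buf.getvalue())  # the first field comes out wrapped in double quotes
```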
While I was at it, I refactored your code into reusable functions:

- get_soup_from_url() downloads a URL and returns a BeautifulSoup tree for it
- parse_today_page() is a generator function that walks that soup and yields a dict for each article
- the main code now just uses csv.DictWriter on the opened file; each parsed dict is printed to the console for easier debugging and fed to the CSV writer for output

The resulting newera.csv looks like this:
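The refactored script itself was not captured on this page; a sketch of what it might look like, reusing the selectors from the question (the function names get_soup_from_url and parse_today_page and the field names come from the answer's description above):

```python
import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

PAGE_URL = "https://neweralive.na/today/"
NE_URL = "https://neweralive.na/posts/"


def get_soup_from_url(url):
    """Download a URL and return a BeautifulSoup tree for it."""
    with urlopen(url) as response:
        return BeautifulSoup(response.read(), "html.parser")


def parse_today_page(page_soup):
    """Generator: yield a dict for each article on the page."""
    for container in page_soup.find_all("article", {"class": "post-item"}):
        title_el = container.find("h3", {"class": "post-title"})
        yield {
            "paragraph": container.find("p", {"class": "post-excerpt"}).text,
            "title": title_el.text,
            "link": NE_URL + title_el.a["href"],
        }


def main():
    page_soup = get_soup_from_url(PAGE_URL)
    with open("newera.csv", "w", newline="") as f:
        # DictWriter quotes any field that contains the delimiter
        writer = csv.DictWriter(f, fieldnames=["paragraph", "title", "link"])
        writer.writeheader()
        for article in parse_today_page(page_soup):
            print(article)  # debug output
            writer.writerow(article)


# To run the scrape and write newera.csv: main()
```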
paragraph,title,link
"The mayor of Helao Nafidi, Elias Nghipangelwa, has expressed disappointment after Covid-19 relief food was stolen and sold by two security officers entrusted to guard the warehouse where the food was stored.","Guards arrested for theft of relief food",https://neweralive.na/posts/posts/guards-arrested-for-theft-of-relief-food
"Government has decided to construct 1 200 affordable homes, starting Thursday this week.","Govt to construct 1 200 low-cost houses",https://neweralive.na/posts/posts/govt-to-construct-1-200-low-cost-houses
...
You can also use the pandas module to easily convert a table of rows into a DataFrame and write it out as CSV; DataFrame.to_csv quotes fields containing commas as well:
import pandas as pd
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
page_url = "https://neweralive.na/today/"
ne_url = "https://neweralive.na/posts/"
uClient = uReq(page_url)
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
containers = page_soup.findAll("article", {"class": "post-item"})
filename = "newera.csv"
rows = []  # list of rows, converted to a DataFrame afterwards

for container in containers:
    paragraph_container = container.findAll("p", {"class": "post-excerpt"})
    paragraph = paragraph_container[0].text
    title_container = container.findAll("h3", {"class": "post-title"})
    title = title_container[0].text
    weblink = ne_url + title_container[0].a["href"]
    rows.append([paragraph, title, weblink])  # append each article as a row

df = pd.DataFrame(rows, columns=["paragraph", "title", "link"])  # column names become the CSV header
df.to_csv(filename, index=None)
Comments: There are heroes in this world, thank you so much! This is a crystal-clear example of CSV in Python. / This is a task that absolutely does not require pandas. / I agree, but thank you for showing the pandas alternative; having several possible approaches is part of what makes Python so powerful.