Python: how to ensure BeautifulSoup output does not treat commas as column delimiters


I have written a scraping script that pulls information from a local newspaper website. The current code has two problems:

  • When it retrieves the paragraph data and saves it to CSV, every "," inside the text is treated as a field break, so the data spills into adjacent cells. How do I stop this from happening?

  • I want the scraped information for each article on a single row, i.e. paragraph, title, weblink.

  • The code is below:

    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen as uReq
    
    page_url = "https://neweralive.na/today/"
    
    ne_url = "https://neweralive.na/posts/"
    
    uClient = uReq(page_url)
    
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()
    
    
    containers = page_soup.findAll("article", {"class": "post-item"})
    
    filename = "newera.csv"
    headers = "paragraph,title,link\n"
    
    f = open(filename, "w")
    f.write(headers)
    
    for container in containers:
        paragraph_container = container.findAll("p", {"class": "post-excerpt"})
        paragraph = paragraph_container[0].text
    
        title_container = container.findAll("h3", {"class": "post-title"})
        title = title_container[0].text
        weblink = ne_url + title_container[0].a["href"]
    
        f.write(paragraph + "," + title + "," + weblink + "\n")  # commas inside the text also end up as separators here
    
    f.close()
    
    You can use the csv module to write well-formed CSV, quoting strings that need it (for example, those containing commas).

    While I was at it, I refactored your code into reusable functions:

    • get_soup_from_url() downloads a URL and builds a BeautifulSoup from it
    • parse_today_page() is a generator function that walks that soup and yields a dict for each article
    • the main code now just opens the file and uses csv.DictWriter on it; each parsed dict is printed to the console for easy debugging and fed to the CSV writer for output (see the sketch below)
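
    A minimal sketch of that refactor (the function names come from the list above; the selectors are reused from the question's code, and the rest is one possible arrangement):

    import csv
    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    PAGE_URL = "https://neweralive.na/today/"
    NE_URL = "https://neweralive.na/posts/"

    def get_soup_from_url(url):
        # Download a URL and parse the response into a BeautifulSoup.
        with urlopen(url) as response:
            return BeautifulSoup(response.read(), "html.parser")

    def parse_today_page(page_soup):
        # Generator: yield one dict per article on the page.
        for container in page_soup.find_all("article", {"class": "post-item"}):
            title_tag = container.find("h3", {"class": "post-title"})
            yield {
                "paragraph": container.find("p", {"class": "post-excerpt"}).text,
                "title": title_tag.text,
                "link": NE_URL + title_tag.a["href"],
            }

    with open("newera.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["paragraph", "title", "link"])
        writer.writeheader()
        for article in parse_today_page(get_soup_from_url(PAGE_URL)):
            print(article)            # debug output to the console
            writer.writerow(article)  # DictWriter quotes fields that contain commas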
    The resulting CSV then looks like this:

    paragraph,title,link
    "The mayor of Helao Nafidi, Elias Nghipangelwa, has expressed disappointment after Covid-19 relief food was stolen and sold by two security officers entrusted to guard the warehouse where the food was stored.","Guards arrested for theft of relief food",https://neweralive.na/posts/posts/guards-arrested-for-theft-of-relief-food
    "Government has decided to construct 1 200 affordable homes, starting Thursday this week.","Govt to construct  1 200 low-cost houses",https://neweralive.na/posts/posts/govt-to-construct-1-200-low-cost-houses
    ...
    

    You can use the pandas module to easily build a DataFrame and write it out as CSV:

    import pandas as pd
    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen as uReq
    
    page_url = "https://neweralive.na/today/"
    
    ne_url = "https://neweralive.na/posts/"
    
    uClient = uReq(page_url)
    
    page_soup = soup(uClient.read(), "html.parser")
    
    uClient.close()
    
    containers = page_soup.findAll("article", {"class": "post-item"})
    
    filename = "newera.csv"
    
    rows = []  # list of rows, converted to a DataFrame below
    
    for container in containers:
    
        paragraph_container = container.findAll("p", {"class": "post-excerpt"})
        paragraph = paragraph_container[0].text
    
        title_container = container.findAll("h3", {"class": "post-title"})
        title = title_container[0].text
        weblink = ne_url + title_container[0].a["href"]
        
        rows.append([paragraph, title, weblink])  # append one row per article
    
    df = pd.DataFrame(rows, columns=["paragraph", "title", "link"])  # column names become the CSV header
    
    df.to_csv(filename, index=False)
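
    Since to_csv uses the csv module's minimal quoting by default, any field that contains a comma (such as the paragraph text) is automatically wrapped in quotes, so no manual escaping is needed.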
    

    Comments: There are heroes in this world. Thank you so much! Wow, this is great, I love this solution, a crystal-clear CSV example in Python. Thanks a lot! This is a task that absolutely doesn't require pandas. I agree. Dear Shimo, thank you for showing the options and possibilities we have with pandas. I love the idea of having all these different ways and paths; that's what makes Python so powerful. So thank you for nailing it with your idea.