Python BeautifulSoup: merge all p elements into one string?

python web-scraping

I currently use the following Python code excerpt to get all the p elements of a web page:

from bs4 import BeautifulSoup

def scraping(url, html):
    data = {}
    soup = BeautifulSoup(html,"lxml")

    data["news"] = []

    page = soup.find("div", {"class":"container_news"}).findAll('p')
    page_text = ''

    for p in page:
        page_text += ''.join(p.findAll(text = True))
        data["news"].append(page_text)
    print(page_text)

    return data

However, the output of page_text looks like this:

"['New news on the internet. ', 'Here is some text. ', ""Here is some other."", ""And then there are other variations \n\nLooks like there are some non-text elements. \n\xa0""]" ...

Is it possible to clean up the content and merge the list into a single string? I would prefer a BeautifulSoup solution over regex variants.

Thanks, everyone!

I'm not sure how important it is to maintain data["news"], but this can be done in one line:

page_text = ' '.join(e.text for p in page for e in p.findAll(text=True))

You can use any string you like as the separator in place of the single space used here.
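To illustrate the separator choice together with the cleanup the question asks about, here is a minimal sketch. It assumes the page structure from the question (a div with class container_news); the SAMPLE_HTML markup and the merge_paragraphs helper are hypothetical names made up for this example, and "html.parser" is swapped in for "lxml" only to avoid the extra dependency. It uses get_text() on each p tag rather than iterating over individual text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div class="container_news">
  <p>New news on the internet. </p>
  <p>Here is <b>some</b> text. </p>
  <p>And then there are other variations \n\nLooks like more.\n\xa0</p>
</div>
"""


def merge_paragraphs(html, sep=" "):
    """Join the text of every <p> in div.container_news into one string."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"class": "container_news"})
    if container is None:
        return ""
    parts = []
    for p in container.find_all("p"):
        # get_text() concatenates all text nodes inside the tag;
        # split() + join() collapses newlines, repeated spaces and \xa0.
        text = " ".join(p.get_text().split())
        if text:
            parts.append(text)
    return sep.join(parts)


print(merge_paragraphs(SAMPLE_HTML))
print(merge_paragraphs(SAMPLE_HTML, sep=" | "))
```

The whitespace normalization relies on str.split() with no arguments, which in Python 3 treats the non-breaking space as whitespace, so no regular expression is needed.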

Otherwise:

page_text = []

for p in page:
    page_text.extend(e.text for e in p.findAll(text=True))
    data["news"].append(page_text)

print(' '.join(page_text))
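If data["news"] does need to be kept, one possible arrangement is to store one cleaned string per paragraph and build the merged string from that list. The sketch below is only an assumption about what the caller wants: the "text" key is hypothetical, the url parameter is kept solely to match the question's signature, and "html.parser" again stands in for "lxml".

```python
from bs4 import BeautifulSoup


def scraping(url, html):
    """Return per-paragraph strings plus one merged string for the page."""
    data = {"news": [], "text": ""}  # "text" is a hypothetical extra key
    soup = BeautifulSoup(html, "html.parser")  # the question used "lxml"

    container = soup.find("div", {"class": "container_news"})
    if container is None:
        return data

    for p in container.find_all("p"):
        # strip=True trims every text node and drops empty ones;
        # the " " separator rejoins the remaining pieces.
        paragraph = p.get_text(" ", strip=True)
        if paragraph:
            data["news"].append(paragraph)

    data["text"] = " ".join(data["news"])
    return data
```

Note that strip=True only trims the ends of each text node, so newlines in the middle of a node survive, and the separator is also inserted between adjacent inline tags that had no space between them.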
