Python web scraping with multiple URLs + merging the data

Tags: python, beautifulsoup, urllib

What I want to do:

1. Fetch multiple URLs.
2. Get the h2 text from each URL.
3. Merge the h2 texts and write them to a csv.

With the code below I managed to do this for a single URL: fetch one URL and get the h2 texts from it.

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

page_url = "https://example.com/ekonomi/20200108/"

# I am trying to do | urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']

uClient = uReq(page_url)

page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# finds each product from the store page
containers = page_soup.findAll("div", {"class": "b-plainlist__info"})

out_filename = "output.csv"

headers = "title \n"


f = open(out_filename, "w")
f.write(headers)

container = containers[0]

for container in containers:
    title = container.h2.get_text()

    f.write(title.replace(",", " ") + "\n")

f.close()  # Close the file

Assuming your iteration through the containers is correct, this should work:

You want to loop over the URLs. For each url, grab the titles and append them to a list. Then simply use that list to create a series and write it to csv with pandas:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import pandas as pd


urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/', 'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']

titles = []
for page_url in urls:
    uClient = uReq(page_url)

    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    # finds each product from the store page
    containers = page_soup.findAll("div", {"class": "b-plainlist__info"})

    for container in containers:
        titles.append(container.h2.get_text())

df = pd.DataFrame(titles, columns=['title'])
df.to_csv("output.csv", index=False)
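
The code above builds a DataFrame rather than the series mentioned in the text; both routes are equivalent. A minimal sketch of the Series variant, assuming the titles list built in the loop above:

import pandas as pd

# wrap the collected titles in a named Series and write it out with a header row
pd.Series(titles, name="title").to_csv("output.csv", index=False, header=True)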

I think Scrapy's item pipeline could handle this for you. What exactly is your problem?
@alec_djinn I can't take more than one url.
Well, then you need to (1) set up a for loop that iterates over a list of urls, for page_url in urls:, and (2) append to your file instead of overwriting it after each iteration, f = open(out_filename, "a").
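
Concretely, the second suggestion could look something like this minimal sketch: the question's code reworked to loop over the urls (using the hypothetical url list from the question) and to open the file in append mode, so each iteration adds rows instead of overwriting the file:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

urls = ['https://example.com/ekonomi/20200114/', 'https://example.com/ekonomi/20200113/',
        'https://example.com/ekonomi/20200112/', 'https://example.com/ekonomi/20200111/']

out_filename = "output.csv"

# write the header row once, then append inside the loop
with open(out_filename, "w") as f:
    f.write("title\n")

for page_url in urls:
    uClient = uReq(page_url)
    page_soup = soup(uClient.read(), "html.parser")
    uClient.close()

    # same container class as in the question
    containers = page_soup.findAll("div", {"class": "b-plainlist__info"})

    # "a" appends to the existing file instead of overwriting it
    with open(out_filename, "a") as f:
        for container in containers:
            f.write(container.h2.get_text().replace(",", " ") + "\n")

As for the Scrapy suggestion, a minimal spider sketch doing the same job (spider name and file name are made up); with Scrapy's built-in feed export, scrapy runspider titles_spider.py -o output.csv writes the yielded items to csv, so a custom item pipeline is only needed for extra processing:

import scrapy

class TitlesSpider(scrapy.Spider):
    name = "titles"  # hypothetical spider name
    start_urls = [
        'https://example.com/ekonomi/20200114/',
        'https://example.com/ekonomi/20200113/',
    ]

    def parse(self, response):
        # pull the h2 text out of the same container div used in the question
        for title in response.css("div.b-plainlist__info h2::text").getall():
            yield {"title": title.replace(",", " ")}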