Python: scrape a table from multiple pages and store it in a single DataFrame

python pandas

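Problem: a website has around 80 pages, each containing a table with an identical structure. I need to scrape each table and store the results in a single DataFrame. The table contents are updated regularly, so the exercise needs to be repeated often.

I can scrape the table from a single page, but I'm struggling to do it for multiple pages. All the examples I've found are for URLs that change iteratively, e.g. (www.example.com/page1, /page2, etc.), rather than for a specified list of URLs.

I tried the following for a subset of the URLs (ideally, I'd like to read the URLs from a csv list), but it only seems to scrape the final table into the DataFrame (i.e. ZZ).

Apologies if this seems vague; I'm fairly new to Python and have mainly been using pandas for data analysis, reading directly from csv. Any help would be greatly appreciated.

How can I read the URLs from a csv list? My current solution doesn't scrape all of the tables as I expected.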

from bs4 import BeautifulSoup
import requests
import pandas as pd

COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

urls = ['http://www.ukserials.com/results.php?serial=ZR',
        'http://www.ukserials.com/results.php?serial=ZT',
        'http://www.ukserials.com/results.php?serial=ZZ']
# scrape elements
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table") # Find the "table" tag in the page
    rows = table.find_all("tr") # Find all the "tr" tags in the table
    cy_data = [] 
    for row in rows:
        cells = row.find_all("td") #  Find all the "td" tags in each row 
        cells = cells[0:4] # Select the correct columns
        cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it

data = pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0)

Couldn't you append each DataFrame to a list and then concatenate the elements of that list at the end?

...
dataframes = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table") # Find the "table" tag in the page
    rows = table.find_all("tr") # Find all the "tr" tags in the table
    cy_data = []
    for row in rows:
        cells = row.find_all("td") #  Find all the "td" tags in each row
        cells = cells[0:4] # Select the correct columns
        cy_data.append([cell.text for cell in cells]) # For each "td" tag, get the text inside it

    dataframes.append(pd.DataFrame(cy_data, columns=COLUMNS).drop(0, axis=0))

data = pd.concat(dataframes)

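Note: you may need to specify an index offset for each DataFrame (before concatenating), because each one keeps its own row labels and those labels will repeat in the combined result.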
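One simple way to deal with the repeated row labels, instead of computing an offset by hand, is to let `pd.concat` renumber the rows with its `ignore_index=True` parameter. The sketch below is only an illustration: the two small frames and their contents are made up to stand in for scraped pages, and their index starts at 1 to mimic the effect of `.drop(0, axis=0)`.

```python
import pandas as pd

COLUMNS = ['ID', 'Serial', 'Aircraft', 'Notes']

# Made-up rows standing in for two scraped pages (not real data).
# The index starts at 1 because row 0 was dropped in the scraping loop.
frames = [
    pd.DataFrame([['1', 'ZR001', 'Example type', ''],
                  ['2', 'ZR002', 'Example type', '']],
                 columns=COLUMNS, index=[1, 2]),
    pd.DataFrame([['3', 'ZZ001', 'Example type', '']],
                 columns=COLUMNS, index=[1]),
]

# Plain concat keeps each frame's own labels, so they repeat: 1, 2, 1
stacked = pd.concat(frames)

# ignore_index=True discards the old labels and numbers the rows 0..n-1
renumbered = pd.concat(frames, ignore_index=True)
```

If the original labels matter, `stacked.reset_index(drop=True)` gives the same renumbering after the fact.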

Thanks for the quick help on this. Your solution works perfectly.
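The question also asked how to read the URLs from a csv list, which the answer above does not cover. A minimal sketch using the standard-library `csv` module follows. The function name `load_urls`, the file name `urls.csv`, and the column header `url` are all assumptions for illustration; adjust them to match the actual file.

```python
import csv


def load_urls(path, column="url"):
    """Return the non-empty values of one CSV column as a list of strings.

    Assumes the file has a header row; `column` is a hypothetical header name.
    """
    urls = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        for row in reader:
            # Missing cells come back as None, so fall back to an empty string
            value = (row.get(column) or "").strip()
            if value:
                urls.append(value)
    return urls
```

The result could then replace the hard-coded list, e.g. `urls = load_urls("urls.csv")`, followed by the same `for url in urls:` loop as above.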