Python 为每个url数据集创建单独的数据帧
我正在使用python、BeautifulSoup、pandas和Google Sheets创建一个网页抓取程序。到目前为止,我已经设法从谷歌表单中的列表中从几个网页中抓取数据表。我想要实现的是,对于每个url中的每个表,我想要创建一个数据帧。现在,来自列表中最后一个url的数据正在导入到Google Sheets,但似乎来自第一个url的数据正在导入,但随后被来自下一个url的数据覆盖-可能与索引有关 到目前为止,我的代码如下所示:Python 为每个url数据集创建单独的数据帧,python,pandas,dataframe,google-sheets,beautifulsoup,Python,Pandas,Dataframe,Google Sheets,Beautifulsoup,我正在使用python、BeautifulSoup、pandas和Google Sheets创建一个网页抓取程序。到目前为止,我已经设法从谷歌表单中的列表中从几个网页中抓取数据表。我想要实现的是,对于每个url中的每个表,我想要创建一个数据帧。现在,来自列表中最后一个url的数据正在导入到Google Sheets,但似乎来自第一个url的数据正在导入,但随后被来自下一个url的数据覆盖-可能与索引有关 到目前为止,我的代码如下所示: from google.oauth2 import serv
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from df2gspread import df2gspread as d2g
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from bs4 import BeautifulSoup
import requests
credentials = service_account.Credentials.from_service_account_file(
'credentials.json')
scoped_credentials = credentials.with_scopes(
['https://spreadsheets.google.com/feeds',
'https://www.googleapis.com/auth/drive']
)
gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
spreadsheet_key = gc.open_by_key('api_key')
worksheet = spreadsheet_key.sheet1
# List of url's from Google Sheets
link_list = worksheet.col_values(5)
def get_info(page_url) :
page = requests.get(page_url)
soup = BeautifulSoup(page.content, 'html.parser')
try :
tbl = soup.find('table')
labels = []
results = []
for tr in tbl.findAll('tr'):
headers = [th.text.strip() for th in tr.findAll('th')]
data = [td.text.strip() for td in tr.findAll('td')]
labels.append(headers)
results.append(data)
final_results = []
for final_labels, final_data in zip(labels, results):
final_results.append({'Labels': final_labels, 'Data': final_data})
df = pd.DataFrame(final_results)
set_with_dataframe(worksheet, df, include_index=False)
except Exception as e:
print(e)
for link in link_list :
get_info(link)
以及输出:
Labels Data
0 [Celebrated Name] [2 Chainz]
1 [Age] [43 Years]
2 [Nick Name] [Tity Boi, Drenchgod]
3 [Birth Name] [Tauheed Epps]
4 [Birth Date] [1977-09-12]
5 [Gender] [Male]
6 [Profession] [Rapper]
7 [Place Of Birth] [College Park, Georgia, United States]
8 [Nationality] [American]
9 [Ethnicity] [Afro-American]
10 [Horoscope] [Virgo]
11 [High School] [North Clayton High School]
12 [University] [Alabama State University and Virginia State U...
13 [Marital Status] [Married]
14 [Wife] [Kesha Ward]
15 [Children] [Heaven, Harmony, and Halo]
16 [Body Build/Type] [Athletic]
17 [Body Measurement] [43-15-34 inches]
18 [Chest Size] [43 inches]
19 [Bicep Size] [15 inches]
20 [Waist Size] [34 inches]
21 [Shoe Size] [14 (US]
22 [Height] [6 feet 5 inches]
23 [Weight] [88 kg]
24 [Net Worth] [$ 6 Million]
25 [Salary] [$ 100,000]
26 [Sexual Orientation] [Straight]
27 [Eye Color] [Dark Brown]
28 [Hair Color] [Black]
29 [Links] [Wikipedia,Instagram,Twitter,Facebook]
Labels Data
0 [Celebrated Name] [Don Lemon]
1 [Age] [54 Years]
2 [Nick Name] [Don Lemon]
3 [Birth Name] [Don Lemon]
4 [Birth Date] [1966-03-01]
5 [Gender] [Male]
6 [Profession] [Journalist]
7 [Birth Nation] [United States]
8 [Place Of Birth] [Baton Rouge, Louisiana, United States]
9 [Nationality] [American]
10 [Siblings] [Leisa Lemon, Yma Lemon]
11 [Ethnicity] [Mixed]
12 [Eye Color] [Brown]
13 [Hair Color] [Black]
14 [Religion] [Christian]
15 [Height] [5 Feet 6 Inch]
16 [Weight] [Not Known]
17 [Working For] [CNN]
18 [Best Known For] [CNN Tonight]
19 [School] [Baker High School]
20 [College / University] [Brookyn College]
21 [University] [Louisiana State University]
22 [Horoscope] [Pisces]
23 [Net Worth] [$ 3 million (As of 2018)]
24 [Famous For] [For hosting the program ‘CNN Tonight’]
25 [Body Measurement] [40-32-35]
26 [Awards] [Emmy Award]
27 [Salary] [$125000]
28 [Links] [WikipediaFacebookTwitterInstagram]
所以,来自两个URL的数据都被打印出来了,但是我如何为每个数据集创建一个数据框并将其导入到Google Sheets中呢