Python: create a separate DataFrame for each URL's dataset

Tags: python, pandas, dataframe, google-sheets, beautifulsoup

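I am building a web scraper with Python, BeautifulSoup, pandas, and Google Sheets. So far I have managed to scrape the data tables from several web pages whose URLs are listed in a Google Sheet. What I want is one DataFrame for each table from each URL. Right now only the data from the last URL in the list ends up in Google Sheets. It looks as though the data from the first URL is imported and then overwritten by the data from the next URL, possibly because of something to do with the index.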

My code so far looks like this:

import gspread  # needed for gspread.Client below
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
from df2gspread import df2gspread as d2g
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from bs4 import BeautifulSoup
import requests



credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')

scoped_credentials = credentials.with_scopes(
        ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']
        )

gc = gspread.Client(auth=scoped_credentials)
gc.session = AuthorizedSession(scoped_credentials)
spreadsheet_key = gc.open_by_key('api_key')

worksheet = spreadsheet_key.sheet1



# List of URLs read from column 5 of the Google Sheet
link_list = worksheet.col_values(5)


def get_info(page_url) :

    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')

    try :

        tbl = soup.find('table')

        labels = []
        results = []


        for tr in tbl.findAll('tr'):
            headers = [th.text.strip() for th in tr.findAll('th')]
            data = [td.text.strip() for td in tr.findAll('td')]
            labels.append(headers)
            results.append(data)
        

        final_results = []

        for final_labels, final_data in zip(labels, results):
            final_results.append({'Labels': final_labels, 'Data': final_data})


        df = pd.DataFrame(final_results)
        
        set_with_dataframe(worksheet, df, include_index=False)


    except Exception as e:
        print(e)



for link in link_list :
    get_info(link)



And the output:

                  Labels                                               Data
0      [Celebrated Name]                                         [2 Chainz]
1                  [Age]                                         [43 Years]
2            [Nick Name]                              [Tity Boi, Drenchgod]
3           [Birth Name]                                     [Tauheed Epps]
4           [Birth Date]                                       [1977-09-12]
5               [Gender]                                             [Male]
6           [Profession]                                           [Rapper]
7       [Place Of Birth]             [College Park, Georgia, United States]
8          [Nationality]                                         [American]
9            [Ethnicity]                                    [Afro-American]
10           [Horoscope]                                            [Virgo]
11         [High School]                        [North Clayton High School]
12          [University]  [Alabama State University and Virginia State U...
13      [Marital Status]                                          [Married]
14                [Wife]                                       [Kesha Ward]
15            [Children]                        [Heaven, Harmony, and Halo]
16     [Body Build/Type]                                         [Athletic]
17    [Body Measurement]                                  [43-15-34 inches]
18          [Chest Size]                                        [43 inches]
19          [Bicep Size]                                        [15 inches]
20          [Waist Size]                                        [34 inches]
21           [Shoe Size]                                           [14 (US]
22              [Height]                                  [6 feet 5 inches]
23              [Weight]                                            [88 kg]
24           [Net Worth]                                      [$ 6 Million]
25              [Salary]                                        [$ 100,000]
26  [Sexual Orientation]                                         [Straight]
27           [Eye Color]                                       [Dark Brown]
28          [Hair Color]                                            [Black]
29               [Links]             [Wikipedia,Instagram,Twitter,Facebook]
                    Labels                                     Data
0        [Celebrated Name]                              [Don Lemon]
1                    [Age]                               [54 Years]
2              [Nick Name]                              [Don Lemon]
3             [Birth Name]                              [Don Lemon]
4             [Birth Date]                             [1966-03-01]
5                 [Gender]                                   [Male]
6             [Profession]                             [Journalist]
7           [Birth Nation]                          [United States]
8         [Place Of Birth]  [Baton Rouge, Louisiana, United States]
9            [Nationality]                               [American]
10              [Siblings]                 [Leisa Lemon, Yma Lemon]
11             [Ethnicity]                                  [Mixed]
12             [Eye Color]                                  [Brown]
13            [Hair Color]                                  [Black]
14              [Religion]                              [Christian]
15                [Height]                          [5 Feet 6 Inch]
16                [Weight]                              [Not Known]
17           [Working For]                                    [CNN]
18        [Best Known For]                            [CNN Tonight]
19                [School]                      [Baker High School]
20  [College / University]                        [Brookyn College]
21            [University]             [Louisiana State University]
22             [Horoscope]                                 [Pisces]
23             [Net Worth]               [$ 3 million (As of 2018)]
24            [Famous For]  [For hosting the program ‘CNN Tonight’]
25      [Body Measurement]                               [40-32-35]
26                [Awards]                             [Emmy Award]
27                [Salary]                                [$125000]
28                 [Links]      [WikipediaFacebookTwitterInstagram]
So the data from both URLs is printed, but how do I create a DataFrame for each dataset and import each one into Google Sheets?
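
One likely cause, as far as I can tell: `set_with_dataframe` starts writing at row 1, column 1 unless told otherwise, so every call inside the loop lands on the same cells as the previous one. A minimal sketch of one way around this is to keep one DataFrame per URL in a dict and give each a different starting row. The helper names `rows_to_frame` and `plan_sheet_layout`, the `gap` parameter, and the example.com URLs are assumptions made for illustration; they are not part of the original code. The sketch also uses plain strings instead of the one-element lists shown in the output above.

```python
import pandas as pd


def rows_to_frame(labels, values):
    """Build one two-column DataFrame from parallel lists of labels and values."""
    return pd.DataFrame({"Labels": labels, "Data": values})


def plan_sheet_layout(frames_by_url, gap=2):
    """Assign each DataFrame a 1-based starting row so that the blocks do not overlap.

    Each block occupies one header row plus len(df) data rows,
    followed by `gap` blank rows before the next block.
    """
    layout = []
    next_row = 1
    for url, df in frames_by_url.items():
        layout.append((url, next_row, df))
        next_row += 1 + len(df) + gap
    return layout


# Hypothetical stand-ins for what get_info(link) could return for each URL.
frames_by_url = {
    "https://example.com/a": rows_to_frame(
        ["Celebrated Name", "Age"], ["2 Chainz", "43 Years"]
    ),
    "https://example.com/b": rows_to_frame(
        ["Celebrated Name", "Age", "Gender"], ["Don Lemon", "54 Years", "Male"]
    ),
}

layout = plan_sheet_layout(frames_by_url)
for url, start_row, df in layout:
    print(url, start_row, df.shape)
    # With a real worksheet, the write would then look like:
    # set_with_dataframe(worksheet, df, row=start_row, include_index=False)
```

In the original script this would mean having `get_info` return `df` instead of writing it, and doing the writes in the final loop. Writing each DataFrame to its own worksheet would be another option if stacking the tables in one sheet is not wanted.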