
Python: Reading a .csv from Google Cloud Storage into a dataframe is buggy when run in a Google Cloud Function

Tags: python, pandas, google-cloud-platform, google-cloud-functions, google-cloud-storage

I am writing a lightweight ETL function in Python. For ease of testing, we have been building it in Google Datalab.

Part of the workflow involves fetching a .csv from Cloud Storage and saving it as a dataframe. This works perfectly in Datalab, but in a Cloud Function, for some reason, it starts reading the .csv from the beginning again and appends roughly 300 duplicate rows to the bottom of the result.

I have tried several different ways of reading the .csv (pd.read_csv, gcsfs, gsutil, %gcs). They all work fine in Datalab, reading the correct number of rows, but when placed in a Cloud Function I get the duplicate rows. Here is an example using gcsfs:

import gcsfs
import pandas as pd

bucket = 'my_bucket'
gc_project = 'my-project'
latest_filename = 'my.csv'
gs_current_object = bucket + '/' + latest_filename

#Open the csv object in GCS and read it straight into a dataframe
fs = gcsfs.GCSFileSystem(project=gc_project)
with fs.open(gs_current_object, 'rb') as f:
    df_new = pd.read_csv(f)
print(df_new.shape)
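
The plain pd.read_csv variant looked roughly like the sketch below (when gcsfs is installed, pandas hands a gs:// URL off to it, so this is effectively equivalent to the block above):

#Equivalent read letting pandas resolve the GCS path itself via gcsfs
df_new = pd.read_csv('gs://' + gs_current_object)
print(df_new.shape)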
I would expect the shape to be (15097, 26), which is what I get in Datalab and how many rows the test .csv actually has, but instead I get (15428, 26), which is the original .csv with duplicates of its first rows appended at the bottom.
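
To confirm that the extra rows really are repeats of the start of the file, a quick check along these lines can help (a minimal sketch, assuming df_new has been read as above):

#Rows that appear more than once anywhere in the dataframe
dupes = df_new[df_new.duplicated(keep=False)]
print(dupes.shape)
#If the tail repeats the head, these should match the first rows of the file
print(dupes.head())
print(df_new.head())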

I could just use .drop_duplicates, but:

1. I would rather keep the function lightweight, especially since it runs in a Cloud Function where I have up to 2GB of memory available for it.
2. The header gets appended as well, so it starts to get messy: I need to find and remove that too, on top of simply calling .drop_duplicates (see the cleanup sketch after this list).
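
If the cleanup route were taken anyway, a minimal sketch could look like the following (assuming the stray header shows up as a data row whose LeadId column contains the literal string 'LeadId'):

#Remove the duplicated header row that gets appended along with the data
df_new = df_new[df_new['LeadId'] != 'LeadId']
#Then drop fully duplicated data rows, keeping the first occurrence
df_new = df_new.drop_duplicates(keep='first')
print(df_new.shape)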

Has anyone come across something similar before? Is there anything I can do to fix this bug when reading the .csv, so that I do not have to clean up the badly read file?

Edit: Below is the full code from my Cloud Function instance (with the actual names and personal information removed, obviously). I tried to handle the duplicate rows in this version but could not. In fact I get very strange output: after I drop duplicates, the shape of df_new shows as (15065, 26), yet when I do df_new.tail() I get 15098 rows, and the last row is also a duplicated header, which throws an error when I try to parse the dates.

def csv_update(request):
    #Moved all imports and installs to the top
    print('Importing packages and setting variables')
    from datetime import datetime 
    import ftplib
    import gcsfs
    import glob
    from googleapiclient import discovery
    import gspread
    from gspread_dataframe import get_as_dataframe, set_with_dataframe
    from oauth2client.client import GoogleCredentials
    from oauth2client.service_account import ServiceAccountCredentials
    import os
    import pandas as pd

    #Defining function variables.
    ftp_usr = "myemail@dotcom.com"
    ftp_pass = "my_unsecured_pass"
    bucket = 'my_bucket'
    gc_project = 'my-project'
    json_original = {
      "type": "service_account",
      "project_id": "my-project",
      "private_key_id": "my_id",
      "private_key": "-----BEGIN PRIVATE KEY-----\MY KEY\n-----END PRIVATE KEY-----\n",
      "client_email": "my_service_account@my_project.iam.gserviceaccount.com",
      "client_id": "my_client_id",
      "auth_uri": "https://accounts.google.com/o/oauth2/auth",
      "token_uri": "https://oauth2.googleapis.com/token",
      "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
      "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/client_email"
    }
    g_spreadsheet_id = 'my_gsheet_id'
    g_sheet_name = 'test'
    dtypes = {'LeadId': 'str'}
    root_dir = '/tmp'
    ftp_path = 'my_ftp_dir'
    date_col_name = 'LeadCreationDate'
    lead_col_name = 'LeadId'

    #Import ftplib. Connect to box (encrypted FTPES) with my credentials and download latest file from crown_reporting. 
    #Get downloaded file from local to crown_test bucket
    print('Connecting to FTP and downloading most recent file to local and then to GS bucket')
    os.chdir(root_dir)
    ftp = ftplib.FTP_TLS("ftp.box.com") 
    ftp.login(ftp_usr, ftp_pass) 
    ftp.cwd(ftp_path)
    ftp.retrlines('LIST')
    lines = ftp.nlst("-t")
    latest_filename = lines[-1]
    print(lines)
    print(latest_filename)
    ftp.retrbinary("RETR " + latest_filename ,open(latest_filename, 'wb').write)
    ftp.quit()
    credentials = GoogleCredentials.get_application_default()
    service = discovery.build('storage', 'v1', credentials=credentials)     
    body = {'name': latest_filename}
    req = service.objects().insert(bucket=bucket, body=body, media_body=latest_filename)
    resp = req.execute()
    files = glob.glob(root_dir +'/*')
    for f in files:
        os.remove(f)

    #Read the newest CSV from Google Storage (uses latest_filename from initial FTP download).
    #Had to add .drop_duplicates(keep='first', inplace=True) because some of the lead IDs have multiple rows.
    #Added a custom function to parse the dates as they have 2 different formats and needs to be parsed as datetime in order to sort after appending to df_old later.
    print('Read current csv from GS bucket as df_new')
    gs_current_object = bucket + '/' + latest_filename
    fs = gcsfs.GCSFileSystem(project=gc_project)
    col_names=['LeadId', 'Lead_Status', 'MoveType', 'Relo_Status', 'LeadCreationDate',
       'EstServiceRevenueUSD', 'EstServiceCostUSD', 'ActServiceRevenueUSD',
       'ActInsuranceRevenueUSD', 'ActServiceCostUSD', 'ActInsCostUSD',
       'ActServiceMarginUSD', 'CustomerType', 'SaleDate',
       'ControllingOfficeName', 'ControllingCountry', 'ControllingRegion',
       'OriginCity', 'OriginState', 'OriginCountry', 'DestinationCity',
       'DestinationState', 'DestinationCountry', 'UnqualifyReason',
       'LeadControllingCountry', 'URL']
    with fs.open(gs_current_object, 'rb') as f:
        df_new = pd.read_csv(f, header=None, names=col_names)
    print(df_new.shape)
    print(df_new.dtypes)
    df_new[lead_col_name] = df_new[lead_col_name].astype(str)
    df_new.drop_duplicates(subset=lead_col_name, keep='first', inplace=True)
    print(df_new.shape)
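    #The csv was read with header=None, so the original header line came in as the first data row; drop it here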
    df_new = df_new[1:]
    print(df_new.shape)                       
    dt_strings = []
    for dt_str in df_new[date_col_name]:
        dt_str = dt_str[:dt_str.find(' ')] 
        dt_strings.append(dt_str)
    print(len(dt_strings))
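    #Dates arrive in two formats (with and without a leading zero), so pad a zero and parse both with one format string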
    def try_parsing_date(text):
        if len(text) == 10:
            return datetime.strptime(text, '%m/%d/%Y')
        else:
            text = '0' + text
            return datetime.strptime(text, '%m/%d/%Y')
    print(df_new.index[(df_new[date_col_name] == date_col_name) | (df_new[date_col_name] == '0LeadCreationDat') ].values)
    print(df_new.tail())
    dt_strings_conv = [try_parsing_date(date) for date in dt_strings]
    df_new[date_col_name] = dt_strings_conv
    print(df_new[date_col_name])
    print(dt_strings_conv)
    df_new.set_index(lead_col_name, drop=True, inplace=True)

    #Authorize for G sheet with JSON. Changed this to JSON parsed dictionary so it's saved within script.  
    scope = ['https://spreadsheets.google.com/feeds','https://www.googleapis.com/auth/drive']
    creds = ServiceAccountCredentials.from_json_keyfile_dict(json_original, scope)
    gs = gspread.authorize(creds)

    #Now we can access sheet. NB I had to enable sheets api in console here for this to work. Import pandas and gspread_dataframe. 
    #Set up worksheet via gspread and get the current (old) data in a df. 
    #We also specify a dtype of leadid column as otherwise Pandas thinks it's an integer (first IDs are just numbers). 
    #Had to add .drop_duplicates(keep='first', inplace=True) because some of the lead IDs have multiple rows.
    print('Read current gsheet as df_old')
    sheet = gs.open_by_key(g_spreadsheet_id).worksheet(g_sheet_name) 
    df_old=get_as_dataframe(sheet, dtype=dtypes, parse_dates=[date_col_name])
    df_old.drop_duplicates(subset=lead_col_name, keep='first', inplace=True)
    df_old.set_index(lead_col_name, drop=True, inplace=True)
    print(df_old.dtypes)

    #Update any changed rows in df_old with df_new values. Add any new rows (using append and dropping duplicates). Added sort=True to concat because of future warning.
    print('Update df_old with df_new values')
    df_old.update(df_new)
    #print(df_old.shape)
    #df_old.tail(15)
    print('Concat df_old with df_new and drop duplicates')
    df_combined = pd.concat([df_old, df_new], sort=True).reset_index()
    df_combined.drop_duplicates(subset=lead_col_name, keep='last', inplace=True)
    df_combined.sort_values(by=[date_col_name], inplace=True)
    #df_combined.reset_index(inplace=True, drop=True)
    #print(df_combined.shape)

    #Connect to gsheet and select worksheet again (in case of timeout, these are commented out as was running fine in tests). Replace all data with newly combined df.
    print('Write updated and concat df_combined to gsheet')
    set_with_dataframe(sheet, df_combined)

A commenter replied: "I have tried deploying a function with the same functionality you describe, and the output I get looks correct to me. Possibly the function is being triggered multiple times, which would explain the incorrect output. Which trigger did you set up for your function?" My response: I got these unexpected results in test mode, with the triggering event left as the default {}. My function was not originally defined with a parameter, but because Cloud Functions requires one, I added a 'request' parameter that is never actually used anywhere in the function: def my_function(request):

So every way of reading a dataframe straight from Google Storage produces this bug for me in Cloud Functions. I would love to get to the bottom of it at some point, but for now I need my function to work.

If anyone has a similar problem: before using pd.read_csv, I first download the file to local temporary storage with the code below, and that works fine (note that google.cloud is installed via google-cloud-storage in requirements.txt):

from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    blob.download_to_filename(destination_file_name)

    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))

#Download the csv to the function's local /tmp storage first, then read it from there
download_blob(bucket, latest_filename, latest_filename)
df_new = pd.read_csv(root_dir + "/" + latest_filename, dtype=dtypes)
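
One follow-up on this workaround: in Cloud Functions, /tmp is an in-memory filesystem, so the downloaded file counts against the function's memory allocation. Removing it once the dataframe is loaded keeps the footprint down (a small sketch reusing the root_dir and latest_filename variables from above):

import os

#Free the in-memory /tmp space once the csv has been read into the dataframe
os.remove(root_dir + "/" + latest_filename)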