Problem comparing two DataFrames in Python: all duplicates should be excluded, but it does not work as expected

python sql pandas duplicates google-analytics-api

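I am developing a connector between Google Analytics and a SQL Server database, and I have a problem with duplicated values.

First, the script parses a nested dict with the GA accounts config, converts each response into a DataFrame, and stores all responses in a list. Then it fetches the current SQL table containing all GA data and loops over the list, comparing the new values (from the GA API) with the current values (in the SQL table).

But for some reason, when the two DataFrames are compared, all duplicates are preserved.

I would be really happy if someone could help me.

The nested dict with the config for making the GA API requests: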


data_test = {
    'view_id_111': {'view_id': '111',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:userType'}, {'name': 'ga:date'}]},
    'view_id_222': {'view_id': '222',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]},
    'view_id_333': {'view_id': '333',
                    'start_date': '2019-01-01',
                    'end_date': '2019-05-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]}
}

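Send the requests to the Google API, convert each response to a DataFrame, and store the results in a list:

responses = []
for k, v in data_test.items():
    sample_request = {
        'viewId': v['view_id'],
        'dateRanges': {
            'startDate': v['start_date'],
            'endDate': v['end_date']
        },
        'metrics': v['metrics'],
        'dimensions': v['dimensions']
    }
    response = analytics.reports().batchGet(
        body={
            'reportRequests': sample_request
        }).execute()
    # parse the raw API response into a DataFrame
    n_response = print_response_new_test(response)
    responses.append(n_response)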

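Get the current SQL table with the GA data:

def get_current_sql_gadata_table():
    global sql_table_current_gadata
    sql_table_current_gadata = pd.read_sql('SELECT * FROM table', con=conn)
    # make the date column a real datetime so it can be compared with the reports
    sql_table_current_gadata['date'] = pd.to_datetime(sql_table_current_gadata['date'])
    return sql_table_current_gadata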
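The `pd.to_datetime` conversion matters because `drop_duplicates` and `equals` treat a row as a duplicate only when every value matches, type included. The following is a minimal sketch, assuming the parsed GA report holds dates and metrics as strings (an assumption, since the response parser is not shown). The frames `sql_table` and `report` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins: the SQL table has typed columns,
# the report is assumed to carry the same values as strings.
sql_table = pd.DataFrame({"date": pd.to_datetime(["2019-08-01"]), "sessions": [10]})
report = pd.DataFrame({"date": ["2019-08-01"], "sessions": ["10"]})

# Without aligning dtypes, the "same" row is not recognised as a duplicate.
naive = pd.concat([sql_table, report], sort=False).drop_duplicates(keep=False)

# Align the report's dtypes with the SQL table before comparing.
aligned = report.copy()
aligned["date"] = pd.to_datetime(aligned["date"])
aligned = aligned.astype({"sessions": sql_table["sessions"].dtype})
fixed = pd.concat([sql_table, aligned], sort=False).drop_duplicates(keep=False)

print(len(naive))  # 2: both rows survive
print(len(fixed))  # 0: the rows are now identical and both are dropped
```

If the counts in the real data behave like `naive`, comparing `report.dtypes` with `sql_table_current_gadata.dtypes` may be a quick first check.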

Finally, compare the two DataFrames and, if there is any difference between them, update the SQL table:


def compare_df_gadata():
    for report in responses:
        response = pd.DataFrame.equals(sql_table_current_gadata, report)
        if response == False:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)
            # sql params in sqlalchemy
            params = urllib.parse.quote_plus(#params)
            engine = create_engine('mssql+pyodbc://?odbc_connect={}'.format(params))
            # insert new values into the sql table
            compared_dfs.to_sql('Table', con=engine, if_exists='append', index=False)

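I also tried merging the two tables, but the result is the same. Maybe it would be more reasonable to do this check in MS Studio.

This does not work properly either: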

df_outer = pd.merge(sql_table_current_gadata, report, on=None, how='left', sort=True)
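One possible reason for the behaviour described above: `pd.concat` followed by `drop_duplicates(keep=False)` yields a symmetric difference, so rows that exist only in the SQL table (for example, rows of other views) are kept and would be appended again. The following is a minimal sketch of an anti-join that keeps only the report rows missing from the SQL table, using the `indicator` parameter of `merge`. The frames and their values are hypothetical, and it assumes both frames share column names and dtypes.

```python
import pandas as pd

# Hypothetical stand-ins for the SQL table (all views) and one GA report (view 111).
sql_table = pd.DataFrame({
    "view_id": ["111", "111", "222"],
    "date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-01"]),
    "country": ["US", "US", "DE"],
    "sessions": [10, 12, 7],
})
report = pd.DataFrame({
    "view_id": ["111", "111", "111"],
    "date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-03"]),
    "country": ["US", "US", "US"],
    "sessions": [10, 12, 5],
})

# Symmetric difference: also returns the view 222 row that is only in the SQL table.
sym_diff = pd.concat([sql_table, report], sort=False).drop_duplicates(keep=False)

# Anti-join: report rows with no exact match in the SQL table.
key_cols = [c for c in report.columns if c in sql_table.columns]
merged = report.merge(
    sql_table[key_cols].drop_duplicates(),
    on=key_cols, how="left", indicator=True,
)
new_rows = merged.loc[merged["_merge"] == "left_only", report.columns]

print(len(sym_diff))  # 2
print(len(new_rows))  # 1
```

Only `new_rows` would then be passed to `to_sql(..., if_exists='append')`. Whether matching on every shared column is right depends on the data; matching on the dimension columns alone may be preferable when metric values can change between runs.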

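Update

I checked once more with the concat function, and it looks like the problem is in the "index".

Originally there were 240 rows (the 960 rows already contained duplicates, so I simply cleaned the SQL table and ran the script again).

I have 3 GA accounts, and the current SQL table consists of them: 72 rows + 13 rows + 154 rows + header = 240 rows.

When I run the script again, compare with pd.concat, and store the result in a DataFrame (compared_dfs) without sending it to the database, it contains the 154 rows from the last GA API request.

I tried to reset the index here: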

if response==False:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)
            compared_dfs.reset_index(inplace=True)

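But as a result, the index was added to compared_dfs as an additional column.

It shows two index columns, one from the SQL table and the other from pandas.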
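The extra column is what `reset_index()` does by default: it moves the old index into a new column named `index`. A minimal sketch with hypothetical frames `a` and `b`, showing that `drop=True` discards the old index instead:

```python
import pandas as pd

# Hypothetical frames; after concat, both contribute index labels 0 and 1.
a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [2, 3]})

combined = pd.concat([a, b], sort=False).drop_duplicates(keep=False)

with_extra_col = combined.reset_index()      # old index becomes a column named "index"
clean = combined.reset_index(drop=True)      # old index is discarded

print(list(with_extra_col.columns))  # ['index', 'x']
print(list(clean.columns))           # ['x']
```

Passing `ignore_index=True` to `pd.concat` is another way to get a fresh 0..n-1 index. Note that the index labels do not take part in `drop_duplicates`, which compares column values only.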

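Your question is detailed but quite clear. I would first ask: if you are sure about your indexes, could you try merging on specific columns and see whether that solves the problem? I am focusing on the pandas part first, as it seems to be the focus of your question.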

import pandas as pd
import numpy as np

merge = True
concat = False

anp = np.ones((2, 5))
anp[1, 1] = 3
anp[1, 4] = 3
bnp = np.ones((1, 5))
bnp[0, 1] = 4  # use 4 to make it different, also works with nan
bnp[0, 4] = 4  # use 4 to make it different, also works with nan
a = pd.DataFrame(anp)
b = pd.DataFrame(bnp)
if merge:
    a.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    b.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    # choose suitable and meaningful column(s) for your merge (do you have any id column etc.?)
    a = pd.merge(a, b, how='outer', copy=False, on=['a', 'c', 'd', 'e'])
    # che
    print(a)

if concat:
    # can use ignore_index or pass keys to maintain distinction
    c = pd.concat((a, b), axis=0, join='outer', keys=['a', 'b'])
    print(c)
    c.drop_duplicates(inplace=True)
    print(c)

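I checked Luca Peruzzo's solution, but it crashes if a column is empty.

Get the list of columns from the current SQL table:

list_of_col = list(sql_table_current_gadata.columns)

Iterate over the reports in the responses list (GA API responses):

It throws an error: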

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-321-4fbfe59db175> in <module>
      1 for report in responses:
----> 2     df_outer = pd.merge(test, report, how='outer', copy=False, on=list_of_col)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     45                          right_index=right_index, sort=sort, suffixes=suffixes,
     46                          copy=copy, indicator=indicator,
---> 47                          validate=validate)
     48     return op.get_result()
     49 

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    527         (self.left_join_keys,
    528          self.right_join_keys,
--> 529          self.join_names) = self._get_merge_keys()
    530 
    531         # validate the merge keys dtypes. We may need to coerce

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
    831                         if rk is not None:
    832                             right_keys.append(
--> 833                                 right._get_label_or_level_values(rk))
    834                         else:
    835                             # work-around for merge_asof(right_index=True)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
   1704             values = self.axes[axis].get_level_values(key)._values
   1705         else:
-> 1706             raise KeyError(key)
   1707 
   1708         # Check for duplicates

KeyError: 'userGender'

I also checked that 'userGender' has no values; it crashes on every empty column.
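The traceback raises the `KeyError` from `right._get_label_or_level_values(rk)`, which suggests that `'userGender'` is simply not a column of `report` (the right-hand frame), rather than the column being empty. A minimal sketch, with hypothetical data, of two ways around that: merge only on the columns both frames share, or add the missing columns to the report first.

```python
import pandas as pd

# Hypothetical frames: `test` (the SQL table) has a column that `report` lacks.
test = pd.DataFrame({
    "view_id": ["111"], "country": ["US"], "userGender": [None], "sessions": [10],
})
report = pd.DataFrame({
    "view_id": ["111", "111"], "country": ["US", "DE"], "sessions": [10, 4],
})

list_of_col = list(test.columns)

# Option 1: merge only on the columns that exist in both frames.
common_cols = [c for c in list_of_col if c in report.columns]
df_outer = pd.merge(test, report, how="outer", on=common_cols, indicator=True)

# Option 2: give the report every SQL column (missing ones are filled with NaN).
report_full = report.reindex(columns=list_of_col)

print(common_cols)
print(list(report_full.columns))
```

Option 2 also puts the report's columns in the same order as the SQL table, which may help when the frames are later concatenated or appended.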

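Hello, sorry for the wording of my question. I tried merging on the 'view_id' column, but it multiplied the rows: from the original 960 rows and 35 columns to 23716 rows × 41 columns. It doubled the columns specified in the API request, such as dimensions, metrics, start date and end date. Code:

for report in responses:
    df_outer = pd.merge(report, test, on='view_id', how='left', sort=False)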
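The row explosion and the doubled columns are what a merge on a non-unique key produces: every left row is paired with every right row that has the same `view_id`, and columns present in both frames but not in `on` get `_x`/`_y` suffixes. A small sketch with hypothetical frames `left` and `right`:

```python
import pandas as pd

# Hypothetical frames: 3 and 4 rows, all with the same view_id.
left = pd.DataFrame({"view_id": ["111"] * 3, "sessions": [1, 2, 3]})
right = pd.DataFrame({"view_id": ["111"] * 4, "sessions": [1, 2, 3, 4]})

exploded = pd.merge(left, right, on="view_id", how="left")
print(exploded.shape)          # (12, 3): 3 x 4 row pairs
print(list(exploded.columns))  # ['view_id', 'sessions_x', 'sessions_y']

# validate= makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, on="view_id", how="left", validate="one_to_one")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("key is not unique:", exc)
```

Merging on a set of columns that identifies a row uniquely (for example `view_id`, `date` and the dimension columns) avoids both effects.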

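Originally there were 240 rows (the 960 rows already contained duplicates, so I simply cleaned the SQL table and ran the script again). I have 3 GA accounts: 72 rows + 13 rows + 154 rows + header = 240 rows. When I run the script again, compare with pd.concat, and store the result in a DataFrame, it contains the 154 rows from the last request to the GA API.

Good. Once the index is set up you should be able to get what you need; alternatively, I have updated my answer with a column-based example. T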

['view_id',
 'start_date',
 'end_date',
 'userType',
 'userGender',
 'userAgeBracket',
 'sourceMedium',
 'source',
 'socialNetwork',
 'region',
 'regionId',
 'pageTitle',
 'pagePath',
 'pageDepth',
 'operatingSystemVersion',
 'operatingSystem',
 'mobileDeviceModel',
 'mobileDeviceMarketingName',
 'mobileDeviceInfo',
 'mobileDeviceBranding',
 'medium',
 'deviceCategory',
 'dataSource',
 'country',
 'continent',
 'continentId',
 'cityId',
 'city',
 'users',
 'sessions',
 'sessionDuration',
 'pageviews',
 'newUsers',
 'bounces',
 'date']