Problem comparing 2 dataframes in Python: all duplicates should be excluded, but it doesn't work as expected

I'm developing a connector between Google Analytics and a SQL Server database, but I have a problem with duplicate values. First, the script parses a nested dict with the GA accounts config, converts each response to a df, and stores all the responses in a list. It then fetches the current SQL table with all the GA data and runs a loop that compares the new values (from the GA API) with the current ones (in the SQL table). But for some reason, when comparing the two dfs, all the duplicates are kept. I'd be very glad if someone could help me.

The nested dict with the config used for the GA API requests:
data_test = {
    'view_id_111': {'view_id': '111',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:userType'}, {'name': 'ga:date'}]},
    'view_id_222': {'view_id': '222',
                    'start_date': '2019-08-01',
                    'end_date': '2019-09-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]},
    'view_id_333': {'view_id': '333',
                    'start_date': '2019-01-01',
                    'end_date': '2019-05-01',
                    'metrics': [{'expression': 'ga:sessions'}, {'expression': 'ga:users'}],
                    'dimensions': [{'name': 'ga:country'}, {'name': 'ga:date'}]}
}
responses = []
for k, v in data_test.items():
    sample_request = {
        'viewId': v['view_id'],
        'dateRanges': {
            'startDate': v['start_date'],
            'endDate': v['end_date']
        },
        'metrics': v['metrics'],
        'dimensions': v['dimensions']
    }
    response = analytics.reports().batchGet(
        body={
            'reportRequests': sample_request
        }).execute()
    n_response = print_response_new_test(response)
    responses.append(n_response)
def get_current_sql_gadata_table():
    global sql_table_current_gadata
    sql_table_current_gadata = pd.read_sql('SELECT * FROM table', con=conn)
    sql_table_current_gadata['date'] = pd.to_datetime(sql_table_current_gadata['date'])
    return sql_table_current_gadata
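As a side note (assuming the 'date' column is stored as text in the table), pd.read_sql can parse it directly via its parse_dates parameter, which replaces the separate to_datetime step. A minimal sketch, using an in-memory SQLite database as a stand-in for the SQL Server connection `conn`:

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the real SQL Server connection.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE t (date TEXT, sessions INTEGER)")
conn.execute("INSERT INTO t VALUES ('2019-08-01', 10)")

# parse_dates converts the column to datetime64 on read.
df = pd.read_sql('SELECT * FROM t', con=conn, parse_dates=['date'])
print(df['date'].dtype)  # datetime64[ns]
```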
def compare_df_gadata():
    for report in responses:
        response = pd.DataFrame.equals(sql_table_current_gadata, report)
        if response == False:
            compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
            compared_dfs.drop_duplicates(keep=False, inplace=True)
            # sql params in sqlalchemy
            params = urllib.parse.quote_plus(#params)
            engine = create_engine('mssql+pyodbc://?odbc_connect={}'.format(params))
            # insert new values into the sql table
            compared_dfs.to_sql('Table', con=engine, if_exists='append', index=False)
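One common reason drop_duplicates(keep=False) leaves everything in place is a dtype mismatch: the SQL frame and the API frame hold the same values, but e.g. the date column comes back as datetime64 from SQL and as plain strings from the API, so no row matches its twin. This is only a guess at the cause, illustrated on hypothetical minimal frames:

```python
import pandas as pd

# Hypothetical stand-ins for sql_table_current_gadata and one report.
# Logically identical rows, but 'date' has different dtypes.
sql_df = pd.DataFrame({'view_id': ['111'], 'date': pd.to_datetime(['2019-08-01'])})
api_df = pd.DataFrame({'view_id': ['111'], 'date': ['2019-08-01']})

# Timestamp vs. str never compare equal, so both copies survive.
leftover = pd.concat([sql_df, api_df], sort=False).drop_duplicates(keep=False)
print(len(leftover))  # 2

# After normalising the dtype, every row matches its twin and is dropped.
api_df['date'] = pd.to_datetime(api_df['date'])
deduped = pd.concat([sql_df, api_df], sort=False, ignore_index=True).drop_duplicates(keep=False)
print(len(deduped))  # 0
```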
I also tried merging the two tables, but the result is the same. Maybe it would make more sense to do the check in MS Studio instead?

This also doesn't work correctly:
df_outer = pd.merge(sql_table_current_gadata, report, on=None, how='left', sort=True)
UPDATE

I checked the concat approach once more, and it looks like the problem is in the index. The original table has 240 rows (960 rows already contained duplicates, so I just cleaned the SQL table and ran the script again). I have 3 GA accounts, and the current SQL table consists of them: 72 rows + 13 rows + 154 rows + header = 240 rows.

When I run the script again and compare with pd.concat, storing the result in a dataframe (compared_dfs) without sending it to the database, it contains the 154 rows from the last request to the GA API. I tried resetting the index here:
if response == False:
    compared_dfs = pd.concat([sql_table_current_gadata, report], sort=False)
    compared_dfs.drop_duplicates(keep=False, inplace=True)
    compared_dfs.reset_index(inplace=True)
But as a result, it gets added to compared_dfs as an extra column: it shows two index columns, one from the SQL table and another from pandas.

Your question is detailed, but clear. I would start by asking: if you are unsure about your indexes, you could try merging on specific columns and see if that solves the problem. I'll focus on the pandas part first, since it seems to be the focus of your question.
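On the extra index column specifically (an aside, not part of the original answer): reset_index() keeps the old index as a new column by default; passing drop=True discards it instead.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[5, 7, 9])

# Default: the old index survives as a new 'index' column.
with_col = df.reset_index()
print(list(with_col.columns))  # ['index', 'a']

# drop=True discards the old index entirely.
clean = df.reset_index(drop=True)
print(list(clean.columns))  # ['a']
```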
import pandas as pd
import numpy as np
merge = True
concat = False
anp = np.ones((2, 5))
anp[1, 1] = 3
anp[1, 4] = 3
bnp = np.ones((1, 5))
bnp[0, 1] = 4 # use 4 to make it different, also works with nan
bnp[0, 4] = 4 # use 4 to make it different, also works with nan
a = pd.DataFrame(anp)
b = pd.DataFrame(bnp)
if merge:
    a.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    b.rename(columns=dict(zip(range(5), ['a', 'b', 'c', 'd', 'e'])), inplace=True)
    # choose suitable and meaningful column(s) for your merge (do you have any id column etc.?)
    a = pd.merge(a, b, how='outer', copy=False, on=['a', 'c', 'd', 'e'])
    print(a)
if concat:
    # can use ignore_index or pass keys to maintain distinction
    c = pd.concat((a, b), axis=0, join='outer', keys=['a', 'b'])
    print(c)
    c.drop_duplicates(inplace=True)
    print(c)
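A related option (my addition, not part of the answer above): merging with indicator=True labels each row's origin, so the rows that exist only in the new report can be selected directly instead of relying on drop_duplicates.

```python
import pandas as pd

# Hypothetical stand-ins for the current SQL table and a fresh GA report.
current = pd.DataFrame({'view_id': ['111', '222'], 'sessions': [10, 20]})
report = pd.DataFrame({'view_id': ['222', '333'], 'sessions': [20, 30]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
merged = pd.merge(current, report, how='outer', indicator=True)

# Keep only rows that appear in the report but not in the SQL table.
new_rows = merged[merged['_merge'] == 'right_only'].drop(columns='_merge')
print(new_rows)  # only the ('333', 30) row
```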
I'm checking Luca Peruzzo's solution, but it crashes if a column is empty. Getting the list of columns from the current sql table:
list_of_col = list(sql_table_current_gadata.columns)
Iterating over the reports in the responses list (the GA API responses) throws an error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-321-4fbfe59db175> in <module>
1 for report in responses:
----> 2 df_outer = pd.merge(test, report, how='outer', copy=False, on=list_of_col)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
45 right_index=right_index, sort=sort, suffixes=suffixes,
46 copy=copy, indicator=indicator,
---> 47 validate=validate)
48 return op.get_result()
49
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
527 (self.left_join_keys,
528 self.right_join_keys,
--> 529 self.join_names) = self._get_merge_keys()
530
531 # validate the merge keys dtypes. We may need to coerce
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
831 if rk is not None:
832 right_keys.append(
--> 833 right._get_label_or_level_values(rk))
834 else:
835 # work-around for merge_asof(right_index=True)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1704 values = self.axes[axis].get_level_values(key)._values
1705 else:
-> 1706 raise KeyError(key)
1707
1708 # Check for duplicates
KeyError: 'userGender'
I also checked: 'userGender' doesn't have any values, and it crashes on all the empty columns.

Hello, sorry for the wording of my question. I tried merging on the 'view_id' column, but it multiplies all the rows: 960 original rows with 35 columns become 23716 rows × 41 columns. It duplicates the columns specified in the API request, such as the dimensions, metrics, start_date and end_date. The code:
for report in responses:
    df_outer = pd.merge(report, test, on='view_id', how='left', sort=False)
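The row multiplication is expected here: 'view_id' is not unique in either frame, so the merge is many-to-many and produces every pairing of matching rows. A minimal sketch on made-up frames:

```python
import pandas as pd

# Two rows share the same non-unique key on each side.
left = pd.DataFrame({'view_id': ['111', '111'], 'metric': [1, 2]})
right = pd.DataFrame({'view_id': ['111', '111'], 'metric': [1, 2]})

# Each left row pairs with every matching right row: 2 x 2 = 4 rows out.
merged = pd.merge(left, right, on='view_id', how='left', sort=False)
print(len(merged))  # 4
```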
The original table has 240 rows (960 rows already contained duplicates, so I just cleaned the SQL table and ran the script again). I have 3 GA accounts: 72 rows + 13 rows + 154 rows + header = 240 rows. When I run the script again and compare with pd.concat, storing the result in a dataframe, it contains the 154 rows from the last request to the GA API.

Great, once the index is set you should be able to get what you need. I also updated my answer with a column-based example.
For reference, list_of_col contains:
['view_id',
'start_date',
'end_date',
'userType',
'userGender',
'userAgeBracket',
'sourceMedium',
'source',
'socialNetwork',
'region',
'regionId',
'pageTitle',
'pagePath',
'pageDepth',
'operatingSystemVersion',
'operatingSystem',
'mobileDeviceModel',
'mobileDeviceMarketingName',
'mobileDeviceInfo',
'mobileDeviceBranding',
'medium',
'deviceCategory',
'dataSource',
'country',
'continent',
'continentId',
'cityId',
'city',
'users',
'sessions',
'sessionDuration',
'pageviews',
'newUsers',
'bounces',
'date']
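The list above is list_of_col, and some reports lack columns such as 'userGender', which is what raises the KeyError. One way around it (a sketch, not from the thread) is to merge only on the keys that both frames actually share:

```python
import pandas as pd

# Hypothetical stand-ins: the SQL table has a 'userGender' column, the report does not.
list_of_col = ['view_id', 'userGender', 'sessions']
test = pd.DataFrame({'view_id': ['111'], 'userGender': [None], 'sessions': [10]})
report = pd.DataFrame({'view_id': ['111'], 'sessions': [10]})

# Intersect the key list with the columns present in both frames.
common = [c for c in list_of_col if c in test.columns and c in report.columns]
print(common)  # ['view_id', 'sessions']

# Merging on the common subset no longer raises KeyError.
df_outer = pd.merge(test, report, how='outer', copy=False, on=common)
print(len(df_outer))  # 1
```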