Python: merging two large DataFrames causes a MemoryError
I am trying to merge two very large DataFrames, and it gives me a MemoryError. Below is the SQL query I am trying to convert to pandas:
SELECT a.period, a.houseid, a.custid, a.productid, b.local_time
FROM table_a
JOIN table_b
ON a.period = b.period
AND a.productid = b.productid
AND b.local_time BETWEEN a.start_time AND a.end_time
table_a and table_b each contain millions of rows. I am trying to join the tables on their keys, keeping only the rows where localtime in table_b falls between start_time and end_time in table_a.
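For reference, a literal pandas translation of that SQL (an inner merge on the equality keys followed by the BETWEEN filter) would look like the sketch below, with made-up sample frames. At millions of rows this is exactly the step that fails, because the intermediate merge is materialized in full before the filter runs:

```python
import pandas as pd

# Tiny stand-ins for table_a and table_b (illustrative data only)
table_a = pd.DataFrame({
    'period': [20181001, 20181001],
    'houseid': [1, 1],
    'custid': ['aa', 'zz'],
    'productid': [2, 9],
    'start_time': pd.to_datetime(['2018-10-01 19:00', '2018-10-01 15:00']),
    'end_time': pd.to_datetime(['2018-10-01 19:29', '2018-10-01 15:59']),
})
table_b = pd.DataFrame({
    'period': [20181001, 20181001],
    'productid': [2, 9],
    'local_time': pd.to_datetime(['2018-10-01 19:04', '2018-10-01 15:57']),
})

# Inner join on the equality keys, then apply the BETWEEN condition
merged = table_a.merge(table_b, on=['period', 'productid'])
result = merged.loc[
    merged['local_time'].between(merged['start_time'], merged['end_time']),
    ['period', 'houseid', 'custid', 'productid', 'local_time'],
]
```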
DF1, DF2, and the desired result (sample tables are shown further down in this post):
Please help me. Thanks!
OK, here is my solution; hopefully it is good enough for your case, as it is all my current skill level can offer. Another way would be to loop over one table and apply the condition check between start and end times row by row, but I went with this approach because you said the tables have millions of rows. The number of steps here depends on how the start times in DF2 are binned: in my solution I need two passes, because I first join on half-hourly start times and then repeat the process for hourly start times.
import pandas as pd
import datetime
df1 = pd.read_excel('my_sample_data.xls')
df2 = pd.read_excel('my_sample_data2.xls')
# Construct a new index column for df1, based on half-hourly START_TIME
df1['localtime'] = pd.to_timedelta(df1.loc[:, 'localtime'])
df1['START_TIME'] = df1['localtime'].dt.floor('30min') + datetime.datetime(1970, 1, 1)
# Drop unneeded columns and index df2 by (prodid, START_TIME) for the join
df2 = df2.loc[:, ['START_TIME', 'prodid', 'Product_info', 'Name']]
df2.set_index(['prodid', 'START_TIME'], inplace=True)
df = df1.join(df2, on=['prodid', 'START_TIME'])
# Good portion: rows that matched a half-hourly slot
df_done = df.loc[df['Name'].notna()]
# Bad portion: no match yet. Some ranges in DF2 come at hourly frequency,
# so repeat the same process with hourly bins
df_nan = df.loc[df['Name'].isna(), ['period', 'houseid', 'custid', 'prodid', 'localtime']]
df_nan['START_TIME'] = df_nan['localtime'].dt.floor('1h') + datetime.datetime(1970, 1, 1)
df_nan = df_nan.join(df2, on=['prodid', 'START_TIME'])
df = pd.concat([df_done, df_nan])
# Convert localtime back from timedelta to datetime
df['localtime'] = df['localtime'] + datetime.datetime(1970, 1, 1)
>>>df
period houseid custid prodid localtime START_TIME Product_info Name
0 20181001 1 aa 2 2018-01-10 19:04:00 2018-01-10 19:00:00 GHI Xab
2 20181001 1 zz 178 2018-01-10 13:01:00 2018-01-10 13:00:00 Chase T3
1 20181001 1 zz 9 2018-01-10 15:57:00 2018-01-10 15:00:00 Road S2
3 20181001 1 zz 231 2018-02-10 02:51:00 2018-02-10 02:00:00 NaN NaN
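As a side note, the half-hour/one-hour binning can be avoided entirely with pd.merge_asof, which matches each localtime to the most recent START_TIME per prodid and then filters on END_TIME. This is only a sketch on made-up sample data, assuming both frames fit in memory once sorted:

```python
import pandas as pd

df1 = pd.DataFrame({
    'prodid': [2, 9],
    'localtime': pd.to_datetime(['2018-10-01 19:04', '2018-10-01 15:57']),
})
df2 = pd.DataFrame({
    'prodid': [2, 2, 9],
    'START_TIME': pd.to_datetime(['2018-10-01 19:00', '2018-10-01 19:30',
                                  '2018-10-01 15:00']),
    'END_TIME': pd.to_datetime(['2018-10-01 19:29', '2018-10-01 19:59',
                                '2018-10-01 15:59']),
    'Name': ['Xab', 'Xab', 'S2'],
})

# merge_asof requires both frames to be sorted by the "on" key
df1 = df1.sort_values('localtime')
df2 = df2.sort_values('START_TIME')

# For each df1 row, take the latest START_TIME <= localtime within the same
# prodid, then keep only matches that also satisfy localtime <= END_TIME
out = pd.merge_asof(df1, df2, left_on='localtime', right_on='START_TIME',
                    by='prodid', direction='backward')
out = out[out['localtime'] <= out['END_TIME']]
```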
What is the data size of the slice between start and end?

Hi, thanks for your reply. After this merge the table contains 105,743,701 rows.

Of course, the simplest solution is to take smaller time slices between start and end. Slicing down to the minimum set of columns needed for the merge also helps; as long as you keep a unique key, you can append the other columns back afterwards. In any case, are your period values unique enough? If they are month labels ('Jan', 'Feb', and so on), you will end up with an exploding output, matching every 'Jan' from 2000 through 2019 in both directions; it probably makes more sense to match on 'Jan 2019' rather than on 'Jan' alone.

Thanks for your answer. Could you show me the code for merging these two DataFrames? I understand merging with on='period', but I am confused about how to use localtime against the start-to-end time range. Could you show me the code if possible? Thank you.
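The slicing advice in the comments can be sketched as a chunked merge: process DF1 in blocks, apply the BETWEEN filter per block, and keep only the filtered rows, so the full intermediate join never sits in memory at once. Column names follow the question; the function name is made up:

```python
import pandas as pd

def merge_in_chunks(df1, df2, chunk_size=100_000):
    """Merge df1 with df2 block by block, keeping only rows whose
    localtime falls inside [START_TIME, END_TIME]."""
    pieces = []
    for start in range(0, len(df1), chunk_size):
        chunk = df1.iloc[start:start + chunk_size]
        # Join one block on the equality keys only
        merged = chunk.merge(df2, on=['period', 'prodid'])
        # Apply the range condition before keeping anything
        merged = merged[merged['localtime'].between(merged['START_TIME'],
                                                    merged['END_TIME'])]
        pieces.append(merged)
    return pd.concat(pieces, ignore_index=True)
```

A smaller chunk_size trades speed for a lower peak memory footprint.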
DF2 (sample):

PERIOD prodid Name Product_info START_TIME END_TIME
20181001 2 Xab GHI 01/10/2018 19:00 01/10/2018 19:29
20181001 2 Xab QQQ 01/10/2018 19:30 01/10/2018 19:59
20181001 2 Xab asd 01/10/2018 20:00 01/10/2018 20:29
20181001 9 S2 Angele 01/10/2018 14:00 01/10/2018 14:59
20181001 9 S2 Road 01/10/2018 15:00 01/10/2018 15:59
20181001 9 S2 Flash 01/10/2018 16:00 01/10/2018 16:59
20181001 9 S2 Simpson 01/10/2018 17:00 01/10/2018 17:29
20181001 178 T3 Chase 01/10/2018 13:00 01/10/2018 13:59
20181001 178 T3 Chase 01/10/2018 14:00 01/10/2018 14:59
20181001 178 T3 Elaine 01/10/2018 15:00 01/10/2018 15:59
Desired result (DF1 with Product_info and Name joined from DF2):

period houseid custid prodid localtime Product_info Name
20181001 1 aa 2 01/10/2018 19:04 GHI Xab
20181001 1 zz 9 01/10/2018 15:57 Road S2
20181001 1 zz 178 01/10/2018 13:01 Chase T3
20181001 1 zz 231 02/10/2018 02:51 None None