Joining two tables based on a condition in Python

python python-3.x pandas join merge

I have two pandas DataFrames:

df1: contains the User_ID and IP_Address of 150K users

|---------------|---------------|  
|    User_ID    |   IP_Address  |
|---------------|---------------|  
|      U1       |   732758368.8 |
|      U2       |   350311387.9 |
|      U3       |   2621473820  |
|---------------|---------------|

df2: contains IP address ranges and the country each range belongs to, 139K records

|---------------|-----------------|------------------|  
|    Country    | Lower_Bound_IP  |  Upper_Bound_IP  |
|---------------|-----------------|------------------|  
|   Australia   |   1023787008    |    1023791103    |
|   USA         |   3638734848    |    3638738943    |
|   Australia   |   3224798976    |    3224799231    |
|   Poland      |   1539721728    |    1539721983    |
|---------------|-----------------|------------------|

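My goal is to create a Country column in df1 such that each IP_Address in df1 lies between the Lower_Bound_IP and Upper_Bound_IP of that country in df2.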

|---------------|---------------|---------------|   
|    User_ID    |   IP_Address  |    Country    |
|---------------|---------------|---------------|   
|      U1       |   732758368.8 |   Indonesia   |
|      U2       |   350311387.9 |   Australia   |
|      U3       |   2621473820  |   Albania     |
|---------------|---------------|---------------|

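My first approach was to do a cross join (Cartesian product) of the two tables and then filter down to the relevant records. However, a cross join with pandas.merge() is not feasible, because it would create about 21 billion records (150K × 139K). The code crashes every time. Can you recommend a feasible alternative?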
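A quick back-of-the-envelope check shows why the cross join cannot fit in memory. The row counts come from the question; the 24 bytes per pair is an assumption made only for illustration (three 8-byte numeric columns and nothing else).

```python
# Rough size of the full cross join described above.
n_users = 150_000    # rows in df1 (from the question)
n_ranges = 139_000   # rows in df2 (from the question)

n_pairs = n_users * n_ranges
print(f"{n_pairs:,} candidate pairs")

# Assumption for illustration: each pair carries three 8-byte numbers
# (IP_Address, Lower_Bound_IP, Upper_Bound_IP) and nothing else.
bytes_per_pair = 3 * 8
approx_gib = n_pairs * bytes_per_pair / 2**30
print(f"roughly {approx_gib:,.0f} GiB for the numeric columns alone")
```

Even under this optimistic assumption the intermediate table is hundreds of GiB, so any workable approach has to avoid materializing all pairs at once.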

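I'm not sure how to do this with pandas.where, but with numpy.where you can do

import numpy
# Broadcast every IP (rows) against every range (columns); [1] picks the matching df2 row positions
ip = df1.IP_Address.to_numpy()[:, None]
idx = numpy.where((ip >= df2.Lower_Bound_IP.to_numpy()[None, :])
    & (ip <= df2.Upper_Bound_IP.to_numpy()[None, :]))[1]
df1["Country"] = df2.Country.to_numpy()[idx]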
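To make the broadcasting step concrete, here is a minimal sketch on toy data. The df2 rows are taken from the question; the df1 IP values are invented so that each one falls inside exactly one of those ranges, which is the assumption the approach relies on.

```python
import numpy as np
import pandas as pd

# Toy data: df2 rows come from the question, df1 IPs are made up for illustration.
df1 = pd.DataFrame({
    "User_ID": ["U1", "U2", "U3"],
    "IP_Address": [3638734900.0, 1023787010.5, 1539721800.0],
})
df2 = pd.DataFrame({
    "Country": ["Australia", "USA", "Australia", "Poland"],
    "Lower_Bound_IP": [1023787008, 3638734848, 3224798976, 1539721728],
    "Upper_Bound_IP": [1023791103, 3638738943, 3224799231, 1539721983],
})

ip = df1["IP_Address"].to_numpy()[:, None]          # shape (3, 1)
lower = df2["Lower_Bound_IP"].to_numpy()[None, :]   # shape (1, 4)
upper = df2["Upper_Bound_IP"].to_numpy()[None, :]   # shape (1, 4)

# Broadcasting yields a (3, 4) boolean mask: one row per user, one column per range.
mask = (ip >= lower) & (ip <= upper)

# np.where returns the (row positions, column positions) of the True cells.
user_pos, range_pos = np.where(mask)

# Positional lookup into df2; valid only because each user matched exactly once.
df1["Country"] = df2["Country"].to_numpy()[range_pos]
print(df1)
```

The mask has one cell per (user, range) pair, so on the full data it has the same 21 billion cells as the cross join, only as one-byte booleans. That is why splitting df1 into batches matters.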

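Are the IP address ranges comprehensive? I.e., are there IP_Address values in df1 for which you would expect Country to be empty?
@cmaher I am assuming for now that the ranges are comprehensive, so no user will end up with an empty Country.
It works like a charm. Thank you very much. Computing in batches was very useful in my case; it greatly reduced the memory load.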
import numpy

# Batched variant: compare only batch_size users against all ranges at a time
ips = df1.IP_Address.to_numpy()
lower = df2.Lower_Bound_IP.to_numpy()[None, :]
upper = df2.Upper_Bound_IP.to_numpy()[None, :]

batch_size = 1000
n_batches = df1.shape[0] // batch_size
# Integer division rounds down, so if the number
# of User_IDs is not divisible by the batch_size,
# we need to add 1 to n_batches
if n_batches * batch_size < df1.shape[0]:
    n_batches += 1
indices = []
for i in range(n_batches):
    batch = ips[i*batch_size:(i+1)*batch_size, None]
    idx = numpy.where((batch >= lower) & (batch <= upper))[1]
    indices.extend(idx.tolist())

# Assumes every IP matches exactly one range, so indices has one entry per df1 row
df1["Country"] = df2.Country.to_numpy()[numpy.asarray(indices)]
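The comment above asks what happens when an IP falls outside every range. As a possible alternative (not the answerer's method), a sorted lookup with numpy.searchsorted avoids the pairwise mask entirely and leaves unmatched users empty. This is a sketch that assumes the ranges in df2 do not overlap; the helper name lookup_country is hypothetical.

```python
import numpy as np
import pandas as pd

def lookup_country(df1, df2):
    """Return one country per df1 row, or None where no range contains the IP.

    Assumes the ranges in df2 do not overlap and df2 is not empty.
    """
    ranges = df2.sort_values("Lower_Bound_IP").reset_index(drop=True)
    lower = ranges["Lower_Bound_IP"].to_numpy()
    upper = ranges["Upper_Bound_IP"].to_numpy()
    country = ranges["Country"].to_numpy(dtype=object)

    ips = df1["IP_Address"].to_numpy()
    # Position of the last range whose lower bound is <= ip (-1 if there is none).
    pos = np.searchsorted(lower, ips, side="right") - 1
    # Clip so that indexing is safe; rows with pos == -1 are rejected below.
    safe_pos = np.clip(pos, 0, None)
    found = (pos >= 0) & (ips <= upper[safe_pos])

    result = np.full(len(ips), None, dtype=object)
    result[found] = country[safe_pos[found]]
    return result

# Toy data: df2 rows from the question, df1 IPs invented for illustration.
df2 = pd.DataFrame({
    "Country": ["Australia", "USA", "Australia", "Poland"],
    "Lower_Bound_IP": [1023787008, 3638734848, 3224798976, 1539721728],
    "Upper_Bound_IP": [1023791103, 3638738943, 3224799231, 1539721983],
})
df1 = pd.DataFrame({
    "User_ID": ["U1", "U2", "U3"],
    # Inside the USA range, just past an Australian range, below every range.
    "IP_Address": [3638734900.0, 1023791104.0, 100.0],
})

df1["Country"] = lookup_country(df1, df2)
print(df1)
```

Because each IP is located by binary search in the sorted lower bounds, memory use stays proportional to the two tables rather than to their product. If the ranges can overlap, this sketch would return only the range with the largest lower bound that contains the IP.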