Python: find the most correlated item


python pandas


I have the sales details for one restaurant, as follows:

+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 01 |        100 | 1,150   |       85 |
+----------+------------+---------+----------+

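From the restaurant data in the table below, I want to find the restaurant that is most closely correlated with the one above: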

+----------+------------+---------+----------+
| Location | Units Sold | Revenue | Footfall |
+----------+------------+---------+----------+
| Loc - 02 |        100 | 1,250   |       60 |
| Loc - 03 |         90 | 990     |       90 |
| Loc - 04 |        120 | 1,200   |       98 |
| Loc - 05 |        115 | 1,035   |       87 |
| Loc - 06 |         89 | 1,157   |       74 |
| Loc - 07 |        110 | 1,265   |       80 |
+----------+------------+---------+----------+

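Please guide me on how to achieve this with Python or pandas.

Note: by "correlated" I mean the restaurant that is the closest match (most similar) in terms of units sold, revenue, and footfall.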

There may be a better way, but I think this works. It is quite verbose, so I have tried to keep the code clean and readable:

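First, let's use the custom NumPy function from the post (find_nearest, which returns the value in an array closest to a given value).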
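The helper itself is not reproduced in this answer. A minimal sketch of what such a find_nearest function could look like is shown here, assuming it simply returns the array element with the smallest absolute difference from the target. The DataFrame construction is also an assumption: it mirrors the tables in the question, with Revenue already stored as integers.

```python
import numpy as np
import pandas as pd


def find_nearest(array, value):
    """Return the element of `array` closest to `value` (first one on ties)."""
    array = np.asarray(array)
    idx = np.abs(array - value).argmin()
    return array[idx]


# Assumed setup: the reference restaurant (df) and the candidates (df2),
# copied from the question's tables with Revenue as plain integers.
df = pd.DataFrame({
    'Location': ['Loc - 01'],
    'Units Sold': [100],
    'Revenue': [1150],
    'Footfall': [85],
})
df2 = pd.DataFrame({
    'Location': ['Loc - 02', 'Loc - 03', 'Loc - 04',
                 'Loc - 05', 'Loc - 06', 'Loc - 07'],
    'Units Sold': [100, 90, 120, 115, 89, 110],
    'Revenue': [1250, 990, 1200, 1035, 1157, 1265],
    'Footfall': [60, 90, 98, 87, 74, 80],
})

nearest_units = find_nearest(df2['Units Sold'], df['Units Sold'][0])
```

Because each column is searched on its own, the three nearest values can come from three different restaurants.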

Then, using the DataFrame columns as arrays, pass in the values from the first DataFrame to find the closest match in each column:

us = find_nearest(df2['Units Sold'],df['Units Sold'][0])
ff = find_nearest(df2['Footfall'],df['Footfall'][0])
rev = find_nearest(df2['Revenue'],df['Revenue'][0])

print(us,ff,rev,sep=',')
100,87,1157

Then return the rows of the DataFrame that match any of the three conditions:

    new_df = df2.loc[
        (df2['Units Sold'] == us) |
        (df2['Footfall'] == ff) |
        (df2['Revenue'] == rev)
    ]

This gives us:

    Location    Units Sold  Revenue Footfall
0   Loc - 02    100         1250    60
3   Loc - 05    115         1035    87
4   Loc - 06    89          1157    74

If your notion of correlation is best described as the minimum Euclidean distance, the solution is:

import numpy as np

# convert the comma-formatted Revenue strings to integers
df1['Revenue'] = df1['Revenue'].str.replace(',', '').astype(int)
df2['Revenue'] = df2['Revenue'].str.replace(',', '').astype(int)

# Euclidean distance between each row of df2 and the first row of df1
dist = np.sqrt((df2['Units Sold'] - df1.loc[0, 'Units Sold'])**2 +
               (df2['Revenue'] - df1.loc[0, 'Revenue'])**2 +
               (df2['Footfall'] - df1.loc[0, 'Footfall'])**2)

print (dist)
0    103.077641
1    160.390149
2     55.398556
3    115.991379
4     17.058722
5    115.542200
dtype: float64

#get index of minimal value and select row of second df
print (df2.loc[[dist.idxmin()]])
   Location  Units Sold  Revenue  Footfall
4  Loc - 06          89     1157        74
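For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame construction is an assumption based on the tables in the question (Revenue entered as comma-formatted strings), and np.linalg.norm is swapped in for the explicit sum of squares; the names cols, diff, and closest are illustrative only.

```python
import numpy as np
import pandas as pd

cols = ['Units Sold', 'Revenue', 'Footfall']

# Assumed setup: data typed in from the question, Revenue still as text.
df1 = pd.DataFrame({
    'Location': ['Loc - 01'],
    'Units Sold': [100],
    'Revenue': ['1,150'],
    'Footfall': [85],
})
df2 = pd.DataFrame({
    'Location': ['Loc - 02', 'Loc - 03', 'Loc - 04',
                 'Loc - 05', 'Loc - 06', 'Loc - 07'],
    'Units Sold': [100, 90, 120, 115, 89, 110],
    'Revenue': ['1,250', '990', '1,200', '1,035', '1,157', '1,265'],
    'Footfall': [60, 90, 98, 87, 74, 80],
})

# Strip the thousands separator so Revenue becomes an integer column.
for frame in (df1, df2):
    frame['Revenue'] = frame['Revenue'].str.replace(',', '', regex=False).astype(int)

# Row-wise differences from the reference restaurant, then the L2 norm per row.
diff = df2[cols].to_numpy() - df1.loc[0, cols].to_numpy(dtype=float)
dist = pd.Series(np.linalg.norm(diff, axis=1), index=df2.index)

closest = df2.loc[dist.idxmin(), 'Location']
print(closest)
```

Note that Revenue is roughly ten times larger than the other two columns, so it dominates an unscaled distance. If all three measures should carry similar weight, standardizing the columns first may be worth considering.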

I fixed the data so that the columns are numeric. I have probably generalized too much. I also set the index to the "Location" column.

Manhattan distance

Euclidean distance
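The snippets for these two distances are not reproduced here. As a rough sketch of what they could look like, assuming numeric columns and Location as the index (the names base, others, and diff are hypothetical, and the data is copied from the question):

```python
import pandas as pd

# Assumed setup: the reference restaurant as a Series, candidates indexed by Location.
base = pd.Series({'Units Sold': 100, 'Revenue': 1150, 'Footfall': 85},
                 name='Loc - 01')
others = pd.DataFrame(
    {
        'Units Sold': [100, 90, 120, 115, 89, 110],
        'Revenue': [1250, 990, 1200, 1035, 1157, 1265],
        'Footfall': [60, 90, 98, 87, 74, 80],
    },
    index=pd.Index(['Loc - 02', 'Loc - 03', 'Loc - 04',
                    'Loc - 05', 'Loc - 06', 'Loc - 07'], name='Location'),
)

# Subtract the reference row from every candidate, aligned on column labels.
diff = others.sub(base, axis='columns')

# Manhattan distance: sum of absolute differences per row.
manhattan = diff.abs().sum(axis=1)

# Euclidean distance: square root of the sum of squared differences per row.
euclidean = (diff ** 2).sum(axis=1) ** 0.5

print(manhattan.idxmin(), euclidean.idxmin())
```

With Location as the index, idxmin() returns the restaurant label directly, and sort_values() on either Series would rank all candidates from most to least similar.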

What have you tried? Please show your effort.

I am new to pandas. That is why I am asking for guidance on how to approach this problem.

At the very least, read your data into a DataFrame and then compute the correlation matrix. That should be a good start.

Correlated according to which feature?

Are you familiar with NumPy? I recently dealt with a similar problem, and I can solve this for you with a numpy/pandas approach.

@DataNoveler - Thanks, I hope this is what the OP needs.

@DataNovice @piRSquared I got my answer. Thank you all. However, I have a follow-up question. Here I wanted to select all correlated restaurants based on Loc-01. What if I want to select all similarly correlated restaurants without a base restaurant such as Loc-01? Some kind of clustering based on mutual correlation?

@Tommy - I think the best approach is to create a new question.