Python: Joining one array to another using pandas


python pandas


How can I use pandas to produce a joined result of aoiFeatures and allFeaturesReadings that gives the following:

183  0.03
845  0.03
853  0.01

Given the following starting code and data:

import numpy
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

allFeaturesReadings = zip(allFeatures, allReadings)
#
# Use pandas to create Series and Join here?
#
sAllFeaturesReadings = pd.Series(dict(allFeaturesReadings))
sAOIFeatures = pd.Series(numpy.ma.filled(aoiFeatures))
sIndexedAOIFeatures = sAOIFeatures.reindex(numpy.ma.filled(aoiFeatures))
result = pd.concat([sIndexedAOIFeatures,sAllFeaturesReadings], axis=1, join='inner')

You don't need zip; you can do the following:

df = pd.DataFrame(data={"allFeatures":allFeatures, "allReadings":allReadings})
df[df["allFeatures"].isin(aoiFeatures)]
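If a Series shaped like the output requested in the question is preferred over a DataFrame, a minimal sketch (assuming the feature ids are unique) is to build the Series with the feature ids as its index and select by label. The names sReadings and sAOI are made up for this illustration.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

# Readings labelled by feature id (hypothetical name, not from the question).
sReadings = pd.Series(allReadings, index=allFeatures)

# Label-based selection keeps the order given in aoiFeatures.
sAOI = sReadings.loc[aoiFeatures]
print(sAOI)
```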

You can use isin:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
result = df.loc[df['features'].isin(aoiFeatures)]
print(result)

which yields

    features  readings
3        183      0.03
6        845      0.03
10       853      0.01
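Since the question asks for a join, the same selection can also be sketched as an explicit inner join with DataFrame.merge. This is an alternative to isin, not the answer's own method. The frame names aoi and joined are made up for the illustration. An inner merge keeps the order of the left frame's keys, so the rows come back in aoiFeatures order.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'features': allFeatures, 'readings': allReadings})
aoi = pd.DataFrame({'features': aoiFeatures})   # hypothetical helper frame

# Inner join on the shared 'features' column; rows follow the order of `aoi`.
joined = aoi.merge(df, on='features', how='inner')

# Optionally use the feature ids as row labels, as in the requested output.
joined_by_feature = joined.set_index('features')
print(joined_by_feature)
```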

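If you plan to select rows by features value often, and features can be made into a unique index, and the DataFrame is at least moderately large (say, ~10K rows), then making features the index may be better (performance-wise):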

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
result = df.loc[aoiFeatures]
print(result)

which yields

     readings
183      0.03
845      0.03
853      0.01
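One caveat with label-based selection: in recent pandas versions, passing a list to df.loc raises a KeyError if any requested label is missing from the index. The sketch below shows two ways to handle that, assuming a unique index. The label 999 and the names aoiWithMissing, found and padded are invented for the illustration.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
# 999 is a made-up label that is not in allFeatures (illustration only).
aoiWithMissing = [183, 845, 999]

df = pd.DataFrame({'readings': allReadings}, index=allFeatures)

# Option 1: keep only the labels that actually exist in the index.
present = df.index.intersection(aoiWithMissing)
found = df.loc[present]

# Option 2: keep every requested label; missing ones get NaN readings.
padded = df.reindex(aoiWithMissing)

print(found)
print(padded)
```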

Here is the setup I used for the IPython %timeit tests:

import numpy as np
import pandas as pd
N = 10000
allFeatures = np.repeat(np.arange(N), 1)
allReadings = np.random.random(N)
aoiFeatures = np.random.choice(allFeatures, N//10, replace=False)

def using_isin():
    df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
    for i in range(1000):
        result = df.loc[df['features'].isin(aoiFeatures)]
    return result


def using_index():
    df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
    for i in range(1000):
        result = df.loc[aoiFeatures]
    return result

This shows that using_index can be somewhat faster:

In [108]: %timeit using_isin()
1 loop, best of 3: 697 ms per loop

In [109]: %timeit using_index()
1 loop, best of 3: 432 ms per loop
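Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The seed, the repeat count of 100, and the names df_col and df_idx are arbitrary choices for this illustration. Absolute times depend on the machine and the pandas version, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

N = 10000
rng = np.random.default_rng(0)            # fixed seed, an arbitrary choice
allFeatures = np.arange(N)
allReadings = rng.random(N)
aoiFeatures = rng.choice(allFeatures, N // 10, replace=False)

# Build each frame once so only the selection itself is timed.
df_col = pd.DataFrame({'features': allFeatures, 'readings': allReadings})
df_idx = pd.DataFrame({'readings': allReadings}, index=allFeatures)

def using_isin():
    return df_col.loc[df_col['features'].isin(aoiFeatures)]

def using_index():
    return df_idx.loc[aoiFeatures]

# number=100 is an arbitrary repeat count.
t_isin = timeit.timeit(using_isin, number=100)
t_index = timeit.timeit(using_index, number=100)
print(f"isin:  {t_isin:.4f} s")
print(f"index: {t_index:.4f} s")
```

Note that the two functions return the same rows in different orders: isin keeps the frame's row order, while the index lookup follows the order of aoiFeatures.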

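Note, however, that making allFeatures the index is not advantageous if it contains duplicates. For example, if you change the setup above to use:

allFeatures = np.repeat(np.arange(N//2), 2)    # repeat every value twice

then using_index loses its speed advantage.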
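Duplicates also change what a label lookup returns, which can be checked before relying on an index. A small sketch, using a deliberately tiny N and made-up readings so the result is easy to inspect:

```python
import numpy as np
import pandas as pd

N = 10                                           # tiny size, for illustration only
allFeatures = np.repeat(np.arange(N // 2), 2)    # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
allReadings = np.arange(N) / 10                  # made-up readings

df = pd.DataFrame({'readings': allReadings}, index=allFeatures)

# A quick check before relying on index lookups.
print(df.index.is_unique)   # False

# With duplicate labels, one requested label returns several rows.
rows = df.loc[[3]]
print(rows)
```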

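Does this look right, or is there a simpler way?

Nice, Toby... and much faster too! Thanks.

@unutbu's answer is the same concept, but more thorough.

I want to end up doing exactly what you showed in your second example, using allFeatures as the index... Thanks, everyone!

@Jack: I posted my answer a bit too soon. I ran some timeit tests and found that making features the index only pays off under certain conditions: the index must be unique, and the DataFrame must be moderately large (on my machine, at least 10K rows) before there is a performance advantage.

So... suppose I have another set of readings, such as nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06], still indexed in the order of allFeatures. For a given allFeatures element, is there a simple way to find the largest value across the arrays? (In the example above, the readings for 845 and 853 would change to 0.13 and 0.04, respectively.) Also, in the real world, allFeatures and allReadings have 2.4 million elements, and aoiFeatures has 67,000 elements. I also have 17 other readings arrays like allReadings to compare.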

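df.loc[[845, 853]] selects the rows whose index labels are 845 or 853. To find the maximum of each row, you can use df.loc[[845, 853]].max(axis=1). But I'm not sure I understood your question correctly. If I didn't, please post a new question with all the details. (The sample data and expected output you provided in this question were very helpful.)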
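The row-wise maximum described above can be sketched with the data from the question and the nextReadings list from the comment. The column names and the variable largest are chosen for this illustration. Further readings arrays would become additional columns.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06]
aoiFeatures = [183, 845, 853]

# One column per readings array, all sharing allFeatures as the index.
df = pd.DataFrame(
    {'allReadings': allReadings, 'nextReadings': nextReadings},
    index=allFeatures,
)

# Row-wise maximum across the readings columns, for the features of interest.
largest = df.loc[aoiFeatures].max(axis=1)
print(largest)
```

For 845 and 853 this gives 0.13 and 0.04, matching the values expected in the comment, while 183 keeps 0.03.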