Python: Join one array to another using pandas


How can I use pandas to join aoiFeatures with allFeaturesReadings so that I get the following result:

183  0.03
845  0.03
853  0.01
given the following starting code and data:

import numpy
import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

allFeaturesReadings = zip(allFeatures, allReadings)
#
# Use pandas to create Series and Join here?
#
sAllFeaturesReadings = pd.Series(dict(allFeaturesReadings))
sAOIFeatures = pd.Series(numpy.ma.filled(aoiFeatures))
sIndexedAOIFeatures = sAOIFeatures.reindex(numpy.ma.filled(aoiFeatures))
result = pd.concat([sIndexedAOIFeatures,sAllFeaturesReadings], axis=1, join='inner')

You don't need zip; you can do the following:

df = pd.DataFrame(data={"allFeatures":allFeatures, "allReadings":allReadings})
df[df["allFeatures"].isin(aoiFeatures)]
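The boolean mask above returns a DataFrame that still carries the original row positions. As a minimal sketch, assuming the goal is the compact shape shown in the question (feature as the label, reading as the value), the filtered rows can be re-labelled with set_index. The names mask and result are introduced here for illustration only.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({"allFeatures": allFeatures, "allReadings": allReadings})

# Boolean mask: True where the feature is one of the features of interest.
mask = df["allFeatures"].isin(aoiFeatures)

# Keep the matching rows, then use the feature as the label of each reading.
result = df[mask].set_index("allFeatures")["allReadings"]
print(result)
```

The result is a Series indexed by feature, so a single reading can be looked up with result[845].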

You could use isin:

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
result = df.loc[df['features'].isin(aoiFeatures)]
print(result)
yields

    features  readings
3        183      0.03
6        845      0.03
10       853      0.01

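If you plan to select rows based on features values often, and if features can be made into a unique index, and if the DataFrame is at least moderately large (say, ~10K rows), then it may be better (performance-wise) to make features the index: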

import pandas as pd
allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
aoiFeatures = [183, 845, 853]

df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
result = df.loc[aoiFeatures]
print(result)
yields

     readings
183      0.03
845      0.03
853      0.01

Here is the setup I used to run the IPython %timeit tests:

import numpy as np
import pandas as pd
N = 10000
allFeatures = np.repeat(np.arange(N), 1)
allReadings = np.random.random(N)
aoiFeatures = np.random.choice(allFeatures, N//10, replace=False)

def using_isin():
    df = pd.DataFrame({'features':allFeatures, 'readings':allReadings})
    for i in range(1000):
        result = df.loc[df['features'].isin(aoiFeatures)]
    return result


def using_index():
    df = pd.DataFrame({'readings':allReadings}, index=allFeatures)
    for i in range(1000):
        result = df.loc[aoiFeatures]
    return result
This shows that using_index can be somewhat faster:

In [108]: %timeit using_isin()
1 loop, best of 3: 697 ms per loop

In [109]: %timeit using_index()
1 loop, best of 3: 432 ms per loop
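For readers without IPython, a comparable measurement can be sketched with the standard-library timeit module. This is not the exact benchmark above: the DataFrames are built once outside the timed functions, a seeded generator (rng) is used for repeatability, and the call count of 100 is an arbitrary choice. Absolute numbers depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10000
rng = np.random.default_rng(0)  # seeded only so that runs are repeatable
allFeatures = np.arange(N)
allReadings = rng.random(N)
aoiFeatures = rng.choice(allFeatures, N // 10, replace=False)

# Build each DataFrame once so that only the selection itself is timed.
df_col = pd.DataFrame({"features": allFeatures, "readings": allReadings})
df_idx = pd.DataFrame({"readings": allReadings}, index=allFeatures)


def using_isin():
    return df_col.loc[df_col["features"].isin(aoiFeatures)]


def using_index():
    return df_idx.loc[aoiFeatures]


t_isin = timeit.timeit(using_isin, number=100)
t_index = timeit.timeit(using_index, number=100)
print(f"isin:  {t_isin:.4f} s for 100 calls")
print(f"index: {t_index:.4f} s for 100 calls")
```

The two functions select the same rows, although using_index returns them in the order given by aoiFeatures while using_isin keeps the original row order.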
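Note, however, that if allFeatures contains duplicates, then making it the index is not advantageous. For example, if you change the setup above to use:

allFeatures = np.repeat(np.arange(N//2), 2)    # repeat every value twice
then using_index no longer has a performance advantage over using_isin.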
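A small sketch of what label lookup does when the index contains duplicates, using a deliberately tiny N so the output is easy to read. The attribute index.is_unique is a quick way to check whether the unique-index condition holds before choosing the index-based approach. The readings here are placeholder values, not data from the question.

```python
import numpy as np
import pandas as pd

N = 10
allFeatures = np.repeat(np.arange(N // 2), 2)   # 0, 0, 1, 1, ..., 4, 4
allReadings = np.arange(N) / 10.0               # placeholder readings

df = pd.DataFrame({"readings": allReadings}, index=allFeatures)

# The index is not unique, so each label can match several rows.
print(df.index.is_unique)

# Selecting two labels returns every row carrying either label.
rows = df.loc[[1, 3]]
print(rows)
```

Here each requested label matches two rows, so the selection returns four rows rather than two.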


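Does this look right, or is there a simpler way?

Nice, Toby... and much faster too! Thanks.

@unutbu's answer is the same concept, but more thorough. I hope to end up doing exactly what you describe in your second example, using allFeatures as the index... Thanks, all!

@Jake: I posted my answer a bit too soon. I ran some timeit tests and found that the benefit of making features the index only appears under certain conditions: the index must be unique, and the DataFrame must be moderately large (on my machine, at least 10K rows) before there is a performance advantage.

So... suppose I have another set of readings, such as nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06], still indexed in the order of allFeatures. Is there an easy way to find the largest value from each array for a given allFeatures element? (In the example above, the readings for 845 and 853 would change to 0.13 and 0.04 respectively.) Also, in the real world, allFeatures and allReadings have 2.4 million elements, and aoiFeatures has 67,000 elements. I have 17 other readings arrays like allReadings to compare.

df.loc[[845, 853]] selects the rows whose index labels are 845 and 853. To find the maximum of each row, you could use df.loc[[845, 853]].max(axis=1). But I'm not sure I understand your question correctly. If I don't, please post a new question including all the details. (The sample data and expected output you provided in this question were very helpful.)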
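The row-wise maximum described in the last comment can be sketched as follows, assuming each readings array becomes one column of a DataFrame indexed by allFeatures. The column names and the variable maxima are chosen here for illustration.

```python
import pandas as pd

allFeatures = [101, 179, 181, 183, 185, 843, 845, 847, 849, 851, 853, 855]
allReadings = [0.03, 0.01, 0.01, 0.03, 0.03, 0.01, 0.03, 0.02, 0.07, 0.06, 0.01, 0.04]
nextReadings = [0.04, 0.09, 0.21, 0.01, 0.06, 0.08, 0.13, 0.01, 0.01, 0.02, 0.04, 0.06]
aoiFeatures = [183, 845, 853]

# One column per readings array, with the features as row labels.
df = pd.DataFrame(
    {"allReadings": allReadings, "nextReadings": nextReadings},
    index=allFeatures,
)

# Select the features of interest, then take the largest reading in each row.
maxima = df.loc[aoiFeatures].max(axis=1)
print(maxima)
```

For 845 and 853 this gives 0.13 and 0.04, matching the values mentioned in the comment, while 183 keeps 0.03. Further readings arrays would simply be added as more columns.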