How to select and order columns in a DataFrame using an array in Python

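I have a fairly large dataframe, df2 (about 50,000 rows x 2,000 columns). The column headers are sample names. I also have a dataframe df1 whose index lists the samples I want to include in the analysis. I want to use the sample list from the df1 index to select only the columns of df2 for those samples and discard the rest. I also want to preserve the sample order from the df1 index.

Example data: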

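import pandas as pd

# df1
data1 = {'Sample': ['Sample_A', 'Sample_D', 'Sample_E'],
         'Location': ['Bangladesh', 'Myanmar', 'Thailand'],
         'Year': [2012, 2014, 2015]}
df1 = pd.DataFrame(data1)
# set_index returns a new DataFrame, so the result has to be assigned;
# drop=False keeps 'Sample' available as a regular column as well
df1 = df1.set_index('Sample', drop=False)

# df2
data2 = {'Num': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
         'Sample_A': [0, 1, 0, 0, 1],
         'Sample_B': [0, 0, 1, 0, 0],
         'Sample_C': [1, 0, 0, 0, 1],
         'Sample_D': [0, 0, 1, 1, 0]}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('Num')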
First, I build the list of samples I want from df1, e.g.

samples = df1['Sample'].tolist()
samples is then:

['Sample_A', 'Sample_D', 'Sample_E']
Using samples, the output dataframe I want, df3, should look like this:

index  Sample_A  Sample_D
Value_1  0  0
Value_2  1  0
Value_3  0  1
Value_4  0  1
Value_5  1  0
But if I use

df3 = df2[samples]
then I get the error message:

"['Sample_E'] not in index"
So how can I ignore the samples that are not found in df2 and avoid this error?

Update: a solution that works

# 1. Define samples to use from df1
samples = df1['Sample'].tolist()
# 2. Keep only the samples that are also found in df2, in their df1 order
#    (a set intersection would not preserve that order)
final_samples = [s for s in samples if s in df2.columns]
# 3. Make new df with columns corresponding to final_samples
df3 = df2.loc[:, final_samples]
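
The selection step can also be wrapped in a small helper that reports which samples were skipped. This is only a sketch: the function name select_present_columns is hypothetical (it is not a pandas API), and it assumes the sample names are unique.

```python
import pandas as pd


def select_present_columns(df, wanted):
    """Return (subset, missing) for the requested column names.

    Hypothetical helper for illustration; not part of pandas.
    The subset keeps the order given by `wanted`.
    """
    present = [name for name in wanted if name in df.columns]
    missing = [name for name in wanted if name not in df.columns]
    return df.loc[:, present], missing


df2 = pd.DataFrame({'Sample_A': [0, 1, 0, 0, 1],
                    'Sample_B': [0, 0, 1, 0, 0],
                    'Sample_C': [1, 0, 0, 0, 1],
                    'Sample_D': [0, 0, 1, 1, 0]})
# Deliberately not in df2's column order, to show that the order is kept
samples = ['Sample_D', 'Sample_A', 'Sample_E']

df3, missing = select_present_columns(df2, samples)
print(list(df3.columns))  # columns in the order of `samples`
print(missing)            # names that were not found in df2
```

Returning the skipped names makes it easy to log them, so that a sample dropped silently does not go unnoticed later in the analysis.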

You can do it like this. List the columns in the order you actually need:

import pandas as pd

data = {'index': ['Value_1', 'Value_2', 'Value_3', 'Value_4', 'Value_5'],
        'Sample_A': [0, 1, 0, 0, 1],
        'Sample_B': [0, 0, 1, 0, 0],
        'Sample_C': [1, 0, 0, 0, 1],
        'Sample_D': [0, 0, 1, 1, 0]}
df = pd.DataFrame(data)
# 'index' stays a regular column here so that it can be selected by name below
df1 = df[['index'] + ['Sample_A', 'Sample_D']]
Output:

     index  Sample_A  Sample_D
0  Value_1         0         0
1  Value_2         1         0
2  Value_3         0         1
3  Value_4         0         1
4  Value_5         1         0
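However, to ignore missing columns, select only the requested columns that actually exist in the dataframe:

samples = ['index', 'Sample_A', 'Sample_D', 'Extra_Sample']
# A list comprehension (rather than a set intersection) keeps the order of samples
final_samples = [c for c in samples if c in df.columns]
Now you can pass final_samples, which contains only columns that are present in df:

df3 = df[final_samples]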
Try it like this:

df = pd.read_csv("data.csv", usecols=['Sample_A','Sample_D']).fillna('')
print(df)
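
As far as I know, a list passed to usecols raises a ValueError when one of the names is missing from the file, and the resulting columns follow the file order rather than the list order. A sketch of one way around both points is to pass a callable to usecols and reorder afterwards. The CSV text below is an assumed stand-in for data.csv, and wanted is a hypothetical name for the sample list.

```python
import io

import pandas as pd

# Assumed stand-in for data.csv so that the example is self-contained
csv_text = (
    "Num,Sample_A,Sample_B,Sample_C,Sample_D\n"
    "Value_1,0,0,1,0\n"
    "Value_2,1,0,0,0\n"
)
wanted = ['Sample_D', 'Sample_A', 'Sample_E']  # Sample_E is not in the file

# A callable usecols is evaluated against each column name in the file,
# so names that are absent from the file are simply never matched
df = pd.read_csv(io.StringIO(csv_text), usecols=lambda name: name in wanted)

# Reorder the columns that were read so that they follow `wanted`
df = df[[name for name in wanted if name in df.columns]]
print(list(df.columns))
```

This avoids loading the unwanted columns at all, which may matter for a file with roughly 2,000 columns.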
To select all rows and some of the columns, use a single colon to select all rows:

>>> df.loc[:, ['Sample_A','Sample_D']]
Applied to the dataset you provided:

>>> data2 = {'Num': ['Value_1','Value_2','Value_3','Value_4','Value_5'],
...         'Sample_A': [0,1,0,0,1],
...         'Sample_B':[0,0,1,0,0],
...         'Sample_C':[1,0,0,0,1],
...         'Sample_D':[0,0,1,1,0]}
>>> df2 = pd.DataFrame(data2)
>>> df2.set_index('Num').loc[:, ['Sample_A','Sample_D']]
         Sample_A  Sample_D
Num
Value_1         0         0
Value_2         1         0
Value_3         0         1
Value_4         0         1
Value_5         1         0
=====================================

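>>> # Note: .loc with labels that are missing from the columns only returned NaN
>>> # columns in pandas < 1.0; pandas 1.0 and later raise KeyError here, so use reindex
>>> df3 = df2.loc[:, samples]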
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN


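What format does the samples list need to be in? I am extracting it from the index column of another dataframe. I tried samples = df1['samples'].astype(str) and then used your code df3 = df2[[samples]], but got an error message.

You just need to keep them, in order, as a list of strings or as the index columns of your current dataframe: samples = df1['samples'].tolist(), then pass it to df3 = df2[samples].

@WillHamilton, could you post the dataframe code you are working with so that you can get a correct answer?

@pygo I have added the code used to generate the sample list.

@jezrael, thanks :-) I just borrowed it from you.

'df3 = df2.reindex(columns=samples)' also works perfectly and is very concise. Thanks.
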
>>> df3 = df2.reindex(columns=samples)
>>> df3
   Sample_A  Sample_D  Sample_E
0         0         0       NaN
1         1         0       NaN
2         0         1       NaN
3         0         1       NaN
4         1         0       NaN
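
If the NaN placeholder column is not wanted, two variations on reindex may help. This is a sketch under assumptions: whether 0 is a sensible fill for a sample that was never measured depends on the analysis, and the variable names df3_filled and df3_present are only illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({'Sample_A': [0, 1, 0, 0, 1],
                    'Sample_B': [0, 0, 1, 0, 0],
                    'Sample_C': [1, 0, 0, 0, 1],
                    'Sample_D': [0, 0, 1, 1, 0]})
samples = ['Sample_A', 'Sample_D', 'Sample_E']

# Option 1: keep the missing sample as a column, filled with 0 instead of NaN
df3_filled = df2.reindex(columns=samples, fill_value=0)

# Option 2: reindex only on the samples that exist in df2,
# so no placeholder column is created at all
present = [s for s in samples if s in df2.columns]
df3_present = df2.reindex(columns=present)

print(list(df3_filled.columns))
print(list(df3_present.columns))
```

Option 2 gives the output the question asked for (Sample_A and Sample_D only), while option 1 keeps one column per requested sample.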