Create a new map column from existing columns using Python


python pandas


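I have a pandas DataFrame with a variable number of columns, for example C1, C2, C3, F1, F2, ... F100. I need to combine F1, F2, ... F100 into a single column of dict/map type, as shown below. How can I do this with pandas? C1, C2 and C3 are fixed-name columns, while F1, F2, ... F100 are the variable columns.

Input: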

C1  C2  C3  F1  F2  F100

"1" "2" "3" "1" "2" "100"

Output:

C1  C2  C3  Features

"1" "2" "3" {"F1":"1", "F2":"2", "F100": "100"}

With pandas, you can do this using the df.apply() function. The code is as follows:

def merge(row):
    # Collect every column whose label starts with 'F' into one dict.
    result = {}
    for idx in row.index:
        if idx.startswith('F'):
            result[idx] = row[idx]
    return result

# axis=1 passes each row to merge as a Series.
df['FEATURE'] = df.apply(merge, axis=1)

Result:

    C1  C2  C3  F1  F2  F100    FEATURE
0   1   2   3   1   2   100     {'F1': 1, 'F100': 100, 'F2': 2}
1   11  21  31  11  21  1001    {'F1': 11, 'F100': 1001, 'F2': 21}
2   12  22  32  2   22  2002    {'F1': 2, 'F100': 2002, 'F2': 22}
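
The result above keeps F1, F2 and F100 next to the new column, whereas the output requested in the question has only C1, C2, C3 and the map column. A minimal sketch of that last step follows, assuming a one-row sample frame made up for illustration and the column name Features taken from the question. The feature list is computed before the new column is added, because the label Features itself starts with "F".

```python
import pandas as pd

# Hypothetical sample data mirroring the layout in the question.
df = pd.DataFrame(
    {"C1": ["1"], "C2": ["2"], "C3": ["3"], "F1": ["1"], "F2": ["2"], "F100": ["100"]}
)

# Collect the variable columns once, before "Features" exists,
# so the same list drives both the dict building and the drop.
feature_cols = [col for col in df.columns if col.startswith("F")]

# Build one dict per row from the feature columns only.
df["Features"] = df[feature_cols].apply(lambda row: row.to_dict(), axis=1)

# Drop the originals so only C1, C2, C3 and Features remain.
result = df.drop(columns=feature_cols)
print(result)
```

Keeping the column list in one variable also makes it harder to pick up an unrelated column by accident on a second run.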

Consider the following example:

import pandas as pd

d = pd.DataFrame([list('12345678'), list('xyzwrest'), list('abcddfgh')], columns='C1, C2, C3, C4, F1, F2, F3, F4'.split(', '))

d

>>>    C1   C2  C3  C4  F1  F2  F3  F4
     0  1   2   3   4   5   6   7   8
     1  x   y   z   w   r   e   s   t
     2  a   b   c   d   d   f   g   h

Let's define the Features column as follows:

d['Features'] = d.apply(lambda row: {feat: val for feat, val in row.items() if feat.startswith('F')}, axis=1)

# Displaying d now gives:
d
>>> C1  C2  C3  C4  F1  F2  F3  F4  Features
0   1   2   3   4   5   6   7   8   {'F1': '5', 'F2': '6', 'F3': '7', 'F4': '8'}
1   x   y   z   w   r   e   s   t   {'F1': 'r', 'F2': 'e', 'F3': 's', 'F4': 't'}
2   a   b   c   d   d   f   g   h   {'F1': 'd', 'F2': 'f', 'F3': 'g', 'F4': 'h'}

I hope this helps.
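
A prefix check such as startswith('F') treats every label beginning with "F" as a feature, which may be broader than intended. The sketch below is one possible way to narrow the match, assuming the feature columns are always named "F" followed by digits; the Flag column and the sample values are hypothetical. It relies on the regex parameter of DataFrame.filter.

```python
import pandas as pd

# Hypothetical frame with a column ("Flag") that starts with "F"
# but is not meant to be a feature.
d = pd.DataFrame(
    [["1", "yes", "5", "6"], ["x", "no", "r", "e"]],
    columns=["C1", "Flag", "F1", "F2"],
)

# Keep only labels that are "F" followed by digits, e.g. F1, F2, F100.
features = d.filter(regex=r"^F\d+$")

# One dict per row, assigned in row order.
d["Features"] = features.to_dict("records")
print(d)
```

If the real column names follow a different pattern, the regular expression would need to change accordingly.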

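Alternatively, use filter to select the F columns and to_dict('records') to turn each row into a dict:

df['Features'] = df.filter(like='F').to_dict('records')

Output (df):

  C1 C2 C3 C4 F1 F2 F3 F4                                      Features
0  1  2  3  4  5  6  7  8  {'F1': '5', 'F2': '6', 'F3': '7', 'F4': '8'}
1  x  y  z  w  r  e  s  t  {'F1': 'r', 'F2': 'e', 'F3': 's', 'F4': 't'}
2  a  b  c  d  d  f  g  h  {'F1': 'd', 'F2': 'f', 'F3': 'g', 'F4': 'h'}

Comments:

So, pyspark or pandas?

It's pandas, not pyspark.

You already asked this for pyspark and Spark Scala. Now it's pandas. What are you trying to do?

Within just one week, I was asked to implement three versions of my script...

Downvoted, because your solution involves writing out columns repeatedly, and it is incorrect: only the F columns should be aggregated into the map column, and I need the original column names as keys.

@ChiCHEN I updated the answer to address your concern. Let me know if this edit helps.

@aws_apprentice I updated my answer to address your concerns.

It works for me! If I process big data, will the function have performance issues?

It depends on how "big" the data is. Run the code and check the performance first. You can use timeit to measure it.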
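
One of the comments in this thread suggests measuring performance with timeit before worrying about data size. The following is a rough sketch of such a comparison between the row-wise apply approach and the filter plus to_dict('records') approach. The frame size, column counts and function names are assumptions chosen for illustration, and the timings depend on the machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical benchmark frame: 3 fixed columns, 20 feature columns, 2,000 rows.
n_rows = 2_000
data = {f"C{i}": range(n_rows) for i in range(1, 4)}
data.update({f"F{i}": range(n_rows) for i in range(1, 21)})
df = pd.DataFrame(data)


def with_apply(frame):
    # Row-wise dict comprehension, as in the apply-based answers.
    return frame.apply(
        lambda row: {k: v for k, v in row.items() if k.startswith("F")}, axis=1
    )


def with_records(frame):
    # Column filter followed by a single to_dict call.
    return frame.filter(like="F").to_dict("records")


# Check that both approaches build the same dicts before comparing timings.
assert list(with_apply(df)) == with_records(df)

t_apply = timeit.timeit(lambda: with_apply(df), number=3)
t_records = timeit.timeit(lambda: with_records(df), number=3)
print(f"apply:   {t_apply:.3f} s for 3 runs")
print(f"records: {t_records:.3f} s for 3 runs")
```

Running this on a sample that resembles the real data in row count and column count should give a more useful picture than the toy sizes used here.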