Python: the most elegant way to get back from pandas dummies

python pandas

Starting from a data frame that contains both numeric and nominal data:

>>> import pandas as pd
>>> d = {'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
         'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
         'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}}
>>> df = pd.DataFrame.from_dict(d)
>>> df
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

get_dummies converts the categorical variables into dummy/indicator variables:

>>> df_dummies = pd.get_dummies(df)
>>> df_dummies
   Budget  m_M1  m_M2  m_M7  qj_q23  qj_q4  qj_q9
0      39     1     0     0       1      0      0
1      15     0     1     0       0      1      0
2      13     0     0     1       0      0      1
3      53     1     0     0       1      0      0
4      82     0     1     0       1      0      0
5      70     1     0     0       0      0      1
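
A note for readers on newer pandas: as far as I know, from pandas 2.0 onward get_dummies returns boolean columns by default, so the 0/1 display above may show True/False instead. The sketch below relies only on the public dtype and columns parameters; it requests integer indicators explicitly and restricts the encoding to the two nominal columns. The data is the same as in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})

# dtype=int forces 0/1 indicators regardless of the version's default.
# columns=[...] names the columns to encode; the rest pass through unchanged.
df_dummies = pd.get_dummies(df, columns=["m", "qj"], dtype=int)

print(df_dummies.columns.tolist())
print(df_dummies)
```

Every row should contain exactly one 1 within each group of dummy columns, which is the property the answers below depend on.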

What is the most elegant way to get back from the dummies, so that the following holds?

>>> (back_from_dummies(df_dummies) == df).all()
Budget    True
m         True
qj        True
dtype: bool

idxmax will do this quite easily.

from itertools import groupby

import pandas as pd

def back_from_dummies(df):
    result_series = {}

    # Find dummy columns and build pairs (category, column_name).
    # Split only on the first "_" so values containing "_" survive.
    dummy_tuples = [(col.split("_", 1)[0], col) for col in df.columns if "_" in col]

    # Find non-dummy columns, i.e. those that do not contain a "_"
    non_dummy_cols = [col for col in df.columns if "_" not in col]

    # For each category's group of columns, use idxmax to find the value.
    # groupby only merges adjacent keys, which holds for get_dummies output.
    for dummy, cols in groupby(dummy_tuples, lambda item: item[0]):

        # Select columns for each category
        dummy_df = df[[col[1] for col in cols]]

        # Find the column holding the max value (the 1) in each row
        max_columns = dummy_df.idxmax(axis=1)

        # Remove the category_ prefix
        result_series[dummy] = max_columns.apply(lambda item: item.split("_", 1)[1])

    # Copy non-dummy columns over.
    for col in non_dummy_cols:
        result_series[col] = df[col]

    # Return a dataframe built from the resulting series
    return pd.DataFrame(result_series)

(back_from_dummies(df_dummies) == df).all()
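
To see why idxmax does the job, it helps to look at one group of dummy columns in isolation. The following is a minimal sketch with hand-written indicator data (the frame named group is hypothetical, not taken from the answer): for each row, idxmax(axis=1) returns the label of the column holding the largest value, which is the column containing the 1, and splitting once on the separator leaves the category value.

```python
import pandas as pd

# Hypothetical indicator columns for a single variable "m".
group = pd.DataFrame({
    "m_M1": [1, 0, 0, 1],
    "m_M2": [0, 1, 0, 0],
    "m_M7": [0, 0, 1, 0],
})

# Label of the column with the row-wise maximum, i.e. the column holding 1.
labels = group.idxmax(axis=1)

# Split only on the first "_" so values that contain "_" stay intact.
values = labels.str.split("_", n=1).str[1]

print(values.tolist())  # ['M1', 'M2', 'M7', 'M1']
```

One caveat: a row of all zeros would still yield the first column label, because idxmax returns the first maximum it finds. The approach therefore assumes each row has exactly one 1 per group.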

First, separate the columns:

In [11]: import numpy as np
         from collections import defaultdict
         pos = defaultdict(list)
         vals = defaultdict(list)

In [12]: for i, c in enumerate(df_dummies.columns):
             if "_" in c:
                 k, v = c.split("_", 1)
                 pos[k].append(i)
                 vals[k].append(v)
             else:
                 pos["_"].append(i)

In [13]: pos
Out[13]: defaultdict(list, {'_': [0], 'm': [1, 2, 3], 'qj': [4, 5, 6]})

In [14]: vals
Out[14]: defaultdict(list, {'m': ['M1', 'M2', 'M7'], 'qj': ['q23', 'q4', 'q9']})

This lets you slice out a separate frame for each dummified column:

In [15]: df_dummies.iloc[:, pos["m"]]
Out[15]:
   m_M1  m_M2  m_M7
0     1     0     0
1     0     1     0
2     0     0     1
3     1     0     0
4     0     1     0
5     1     0     0

Now we can use numpy's argmax:

In [16]: np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1)
Out[16]: array([0, 1, 2, 0, 1, 0])

*Note: idxmax returns labels; we want positions so that we can use Categorical.*

In [17]: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos["m"]].values, axis=1), vals["m"])
Out[17]:
[M1, M2, M7, M1, M2, M1]
Categories (3, object): [M1, M2, M7]

Now we can put it all together:

In [21]: df = pd.DataFrame({k: pd.Categorical.from_codes(np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1), vals[k]) for k in vals})

In [22]: df
Out[22]:
    m   qj
0  M1  q23
1  M2   q4
2  M7   q9
3  M1  q23
4  M2  q23
5  M1   q9

And put the non-dummified columns back:

In [23]: df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]

In [24]: df
Out[24]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70

As a function:

from collections import defaultdict

import numpy as np
import pandas as pd

def reverse_dummy(df_dummies):
    pos = defaultdict(list)
    vals = defaultdict(list)

    for i, c in enumerate(df_dummies.columns):
        if "_" in c:
            k, v = c.split("_", 1)
            pos[k].append(i)
            vals[k].append(v)
        else:
            pos["_"].append(i)

    df = pd.DataFrame({k: pd.Categorical.from_codes(
                              np.argmax(df_dummies.iloc[:, pos[k]].values, axis=1),
                              vals[k])
                      for k in vals})

    df[df_dummies.columns[pos["_"]]] = df_dummies.iloc[:, pos["_"]]
    return df

In [31]: reverse_dummy(df_dummies)
Out[31]:
    m   qj  Budget
0  M1  q23      39
1  M2   q4      15
2  M7   q9      13
3  M1  q23      53
4  M2  q23      82
5  M1   q9      70
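
The reversed frame above comes back with the columns in a different order (m, qj, Budget) and with categorical rather than plain string columns, so a direct comparison with the original may not succeed. A minimal sketch of one way to align the two follows, assuming the original frame (or at least its column order and dtypes) is still available; the frames original and restored are small hand-built stand-ins, not output of reverse_dummy.

```python
import pandas as pd

original = pd.DataFrame({
    "Budget": [39, 15, 13],
    "m": ["M1", "M2", "M7"],
    "qj": ["q23", "q4", "q9"],
})

# Stand-in for a reversed frame: categorical columns first, Budget last.
restored = pd.DataFrame({
    "m": pd.Categorical(["M1", "M2", "M7"]),
    "qj": pd.Categorical(["q23", "q4", "q9"]),
    "Budget": [39, 15, 13],
})

# Reorder to the known original column order, then cast each column
# back to the dtype it had in the original frame.
aligned = restored[original.columns].astype(original.dtypes.to_dict())

print(aligned.equals(original))
```

If the original order is not known, there is no way to recover it from the dummified frame alone.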

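Similar to @David, I find that idxmax will do most of the work for you. However, I don't think there is a foolproof way to guarantee you won't run into problems when converting the columns back, because in some cases it is hard to tell which columns are dummies and which are not. I find this is greatly alleviated by using a separator that is very unlikely to occur in your data by chance. _ is often used in column names made of multiple words, so I use __ (double underscore) as the separator; I have never come across it in a column name in the wild.

Also, note that pd.get_dummies moves all dummy columns to the end. This means you cannot necessarily recover the original order of the columns.

Here is an example of my approach. You can recognize the dummy columns as those with sep in their name. We get each group of dummy columns using df.filter, which lets us match column names with a regular expression (just the part of the name before sep works; there are other ways to do this part as well).

The rename part strips off the beginning of the column names (e.g. m__) so that what remains is the value. Then idxmax extracts the name of the column that contains the 1. This gives us the data frame obtained by undoing pd.get_dummies on one of the original columns; we concatenate the data frames from reversing pd.get_dummies on each of the columns, together with those columns that were not "dummified".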
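
Comments: "Get back to df? I'm not quite sure what you mean exactly." "I just meant getting back, i.e. reversing." "Thanks. Just wanted to make sure."
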
In [1]: import pandas as pd

In [2]: df = pd.DataFrame.from_dict({'m': {0: 'M1', 1: 'M2', 2: 'M7', 3: 'M1', 4: 'M2', 5: 'M1'},
   ...:          'qj': {0: 'q23', 1: 'q4', 2: 'q9', 3: 'q23', 4: 'q23', 5: 'q9'},
   ...:          'Budget': {0: 39, 1: 15, 2: 13, 3: 53, 4: 82, 5: 70}})

In [3]: df
Out[3]: 
   Budget   m   qj
0      39  M1  q23
1      15  M2   q4
2      13  M7   q9
3      53  M1  q23
4      82  M2  q23
5      70  M1   q9

In [4]: sep = '__'

In [5]: dummies = pd.get_dummies(df, prefix_sep=sep)

In [6]: dummies
Out[6]: 
   Budget  m__M1  m__M2  m__M7  qj__q23  qj__q4  qj__q9
0      39      1      0      0        1       0       0
1      15      0      1      0        0       1       0
2      13      0      0      1        0       0       1
3      53      1      0      0        1       0       0
4      82      0      1      0        1       0       0
5      70      1      0      0        0       0       1

In [7]: dfs = []
   ...: 
   ...: dummy_cols = list(set(col.split(sep)[0] for col in dummies.columns if sep in col))
   ...: other_cols = [col for col in dummies.columns if sep not in col]
   ...: 
   ...: for col in dummy_cols:
   ...:     dfs.append(dummies.filter(regex='^' + col + sep).rename(columns=lambda name: name.split(sep, 1)[1]).idxmax(axis=1))
   ...: 
   ...: df = pd.concat(dfs + [dummies[other_cols]], axis=1)
   ...: df.columns = dummy_cols + other_cols
   ...: df
   ...: 
Out[7]: 
    qj   m  Budget
0  q23  M1      39
1   q4  M2      15
2   q9  M7      13
3  q23  M1      53
4  q23  M2      82
5   q9  M1      70
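
For comparison, newer pandas releases (1.5 and later, to my knowledge) ship pd.from_dummies, which performs the reverse step directly. The sketch below is not part of the answer above; it reuses the double-underscore separator and assumes that any column whose name contains sep is a dummy column. Because from_dummies expects only indicator columns, the untouched columns are split off first and joined back afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "m": ["M1", "M2", "M7", "M1", "M2", "M1"],
    "qj": ["q23", "q4", "q9", "q23", "q23", "q9"],
    "Budget": [39, 15, 13, 53, 82, 70],
})

sep = "__"
dummies = pd.get_dummies(df, prefix_sep=sep)

# Assumption: a column is a dummy column if and only if its name contains sep.
dummy_cols = [c for c in dummies.columns if sep in c]
other_cols = [c for c in dummies.columns if sep not in c]

# Decode only the indicator columns; the prefix before sep becomes the
# column name and the part after it becomes the value.
decoded = pd.from_dummies(dummies[dummy_cols], sep=sep)

# Put the non-dummified columns back next to the decoded ones.
restored = pd.concat([dummies[other_cols], decoded], axis=1)

print(restored)
```

Unlike the idxmax-based versions, from_dummies raises an error when a row has more than one 1 in a group, or none without a default_category, which makes malformed input easier to spot.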