Converting DataFrame strings to multiple dummy variables in Python

python dataframe

I have a DataFrame with multiple columns. One of them is "category", which holds a space-separated string. A sample of df.category looks like this:

             3 36 211 433 474 533 690 980
                                 3 36 211
                  3 16 36 211 396 398 409
                        3 35 184 590 1038
                67 179 208 1008 5000 5237

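I also have another list, dict = [3, 5, 7, 8, 16, 36, 5000]. What I would like to get is a new DataFrame with the entries of dict as columns and 0/1 as values: 1 if the row in df contains that dict entry, 0 otherwise. So the output would be: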

3  5  7  8  16  36 5000
1  0  0  0  0   1   0
1  0  0  0  0   1   0
1  0  0  0  1   1   0 
1  0  0  0  0   0   0 
0  0  0  0  0   0   1

I tried something like this:

for cat in level_0_cat:
    df[cat] = df.apply(lambda x: int(cat in map(int, x.category)), axis = 1)

But it does not work on a large dataset (10 million rows). I also tried isin but have not figured it out yet. Any ideas are appreciated.

This should do it:

>>> import pandas as pd

# Read your data
>>> s = pd.read_clipboard(sep='|', header=None)

# Convert `cats` to strings so that the `to_string` approach below works
>>> cats = list(map(str, [3,4,7,8,16,36,5000]))
>>> cats
['3', '4', '7', '8', '16', '36', '5000']

# Nested list comprehension: checks whether each `v` in `cats` exists in each row
# (originally written with `.ix`, which was deprecated in pandas 0.20 and removed in 1.0;
# `.loc` is the label-based equivalent here)
>>> encoded = [[1 if v in set(s.loc[idx].to_string().split()) else 0 for idx in s.index] for v in cats]
>>> encoded
[[1, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]


>>> import numpy as np

# Convert the whole thing to a dataframe to add columns
>>> encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)
>>> encoded
   3  4  7  8  16  36  5000
0  1  0  0  0   0   1     0
1  1  0  0  0   0   1     0
2  1  0  0  0   1   1     0
3  1  0  0  0   0   0     0
4  0  0  0  0   0   0     1

Edit: here is a way to do it without directly calling any indexer such as ix or loc:

encoded = [[1 if v in row else 0 for row in s[0].str.split().map(set)] for v in cats]

encoded
Out[18]: 
[[1, 1, 1, 1, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0],
 [0, 0, 1, 0, 0],
 [1, 1, 1, 0, 0],
 [0, 0, 0, 0, 1]]

encoded = pd.DataFrame(data=np.matrix(encoded).T, columns=cats)

encoded
Out[20]: 
   3  4  7  8  16  36  5000
0  1  0  0  0   0   1     0
1  1  0  0  0   0   1     0
2  1  0  0  0   1   1     0
3  1  0  0  0   0   0     0
4  0  0  0  0   0   0     1
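As a possible alternative to the nested comprehension, pandas ships `Series.str.get_dummies`, which splits each string on a separator and returns one indicator column per distinct token. The sketch below assumes the category strings live in a Series called `s` (with the one-column frame above that would be `s[0]`) and simply keeps the wanted columns with `reindex`. Note that `get_dummies` first builds a column for every distinct token, so memory use on 10 million rows is something to check before relying on it.

```python
import pandas as pd

# Sample rows from the question; `s` is assumed to be a Series of strings.
s = pd.Series([
    "3 36 211 433 474 533 690 980",
    "3 36 211",
    "3 16 36 211 396 398 409",
    "3 35 184 590 1038",
    "67 179 208 1008 5000 5237",
])
cats = [str(c) for c in [3, 4, 7, 8, 16, 36, 5000]]

# One 0/1 column per distinct token found in the data.
dummies = s.str.get_dummies(sep=" ")

# Keep only the wanted categories, in the wanted order;
# categories that never occur (such as '4') become all-zero columns.
encoded = dummies.reindex(columns=cats, fill_value=0)
print(encoded)
```
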

You don't need to convert every row to integers; it is simpler to convert the elements of your category list to strings.

categories = [l.strip() for l in '''\
         3 36 211 433 474 533 690 980
                             3 36 211
              3 16 36 211 396 398 409
                    3 35 184 590 1038
            67 179 208 1008 5000 5237'''.split('\n')]

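wanted = [3,5,7,8,16,5000]
d = [str(n) for n in wanted]
result = []
for category in categories:
    # Split into tokens first; a plain substring test would match '5' inside '35'
    tokens = set(category.split())
    result.append([1 if s in tokens else 0 for s in d])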

Please don't use dict (it is a built-in) as the name of an object.
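One caveat worth checking with any string-based approach: `in` applied to a string is a substring test, not a test against the space-separated values. The small illustration below uses one of the sample rows from the question; the variable names are only for the example.

```python
row = "3 35 184 590 1038"

# Substring test: '5' is found inside '35' and '590'.
substring_hit = "5" in row

# Token test: '5' is not one of the space-separated values.
token_hit = "5" in row.split()

print(substring_hit, token_hit)  # True False
```

Splitting each row into tokens (or a set of tokens) before testing membership avoids the false positive.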

In encoded = ... s.ix[idx].to_string() ..., since ix is deprecated I changed it to iloc, but it throws IndexError: single positional indexer is out-of-bounds when running on the large data. Any ideas?

See the update. Without the huge dataset I can't tell you for sure, but you can try another approach that does not directly call any indexer such as ix or loc.

I changed s[0].str.split to s.str.split and it works on the large dataset. Thanks! When I increase the size of cats (from 6 to 20), running encoded = [[1 if v in row ... s.str.split() ... Is there any way to make it more efficient?

My object isn't actually named dict; I only used that name here for illustration. Thanks for the reminder!
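On the efficiency question, one thing that may help: the comprehension in the answer evaluates s.str.split() again for every entry in cats, so the work grows with the number of categories. A minimal sketch of a single-pass variant follows. It is not from the thread; the names `rows`, `col_of` and `out` are made up for the example, and it assumes the strings are in a Series. Each row is split once and every token is looked up in a dictionary, so adding more categories does not add more passes over the data.

```python
import numpy as np
import pandas as pd

# Sample rows from the question; `rows` is a hypothetical name for the string Series.
rows = pd.Series([
    "3 36 211 433 474 533 690 980",
    "3 36 211",
    "3 16 36 211 396 398 409",
    "3 35 184 590 1038",
    "67 179 208 1008 5000 5237",
])
cats = [str(c) for c in [3, 4, 7, 8, 16, 36, 5000]]

# Map each wanted category to its output column position.
col_of = {c: j for j, c in enumerate(cats)}

# Pre-allocate a compact 0/1 matrix: one row per input row, one column per category.
out = np.zeros((len(rows), len(cats)), dtype=np.uint8)
for i, row in enumerate(rows):
    for tok in row.split():      # each row is split exactly once
        j = col_of.get(tok)      # constant-time lookup, independent of len(cats)
        if j is not None:
            out[i, j] = 1

encoded = pd.DataFrame(out, columns=cats, index=rows.index)
print(encoded)
```

The loop is still plain Python, so it is worth timing on a slice of the real data before committing to it.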