Python: Adding a new column from existing columns with pd.Series() creates NaN values

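I want to add a new column to a DataFrame based on existing columns. The new column is simply a tuple of the three values from three columns: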

df0.shape
# (5410185, 17)
new_col = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])))
new_col.shape
# (5410185,)
new_col.isnull().sum()
# 0
df0['abc'] = new_col
df0['abc'].isnull().sum()
# 14334
I tried the same approach on a sample df, and it works as expected:

test = pd.DataFrame(np.random.randint(0,1000,100000000).reshape(1000000,100))
test['new'] = pd.Series(list(zip(test[1], test[2], test[3])))
test['new'].isnull().sum()
# 0
Using assign produces the same result:

df0 = df0.assign(new_col2 = pd.Series(list(zip(df0['a'], df0['b'], df0['c']))))
df0['new_col2'].isnull().sum()
# 14334
I found two similar questions, and I suspect my problem is also index-related. Only 89 index values match position by position:

np.sum(df0.index == new_col.index)
# 89
Assigning the same Series as the index of df0 works:

df0.index = new_col
df0['abc'] = df0.index
df0['abc'].isnull().sum()
# 0
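
The symptoms above can be reproduced on a small frame. This is a minimal sketch, assuming the index of the frame has gaps (for instance after filtering rows); the frame `df` and its values are made up for illustration and are not the asker's data:

```python
import pandas as pd

# Hypothetical frame; after filtering, the index labels are 2, 3, 4.
df = pd.DataFrame({'a': [10, 20, 30, 40, 50], 'b': list('vwxyz')})
df = df[df['a'] > 20].copy()

# A Series built from a plain list gets the default labels 0, 1, 2.
new_col = pd.Series(list(zip(df['a'], df['b'])))

# Count how many index labels agree position by position.
matching = int((df.index == new_col.index).sum())

# Column assignment aligns on labels, not on position.
df['ab'] = new_col
n_missing = int(df['ab'].isnull().sum())

print(matching, n_missing)
print(df)
```

Only label 2 exists in both indexes, so the first row receives the tuple stored under label 2 of `new_col` (which belongs to a different row) and the remaining rows become NaN.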
Update: here are some benchmarks of @jezreal's solutions:

%time df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)
Wall time: 2.32 s

%time df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)
Wall time: 1min 42s

%time df0['abc'] = df0.set_index(['a','b','c']).index.values
Wall time: 8.68 s

%time df0['abc'] = pd.Series([tuple(x) for x in df0[['a','b','c']].values.tolist()], index=df0.index)
Wall time: 9.83 s
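
Before comparing speed, it may be worth confirming that the four variants build the same column. The sketch below does that on a small random frame; the frame, the `candidates` dictionary and the use of `time.perf_counter` are assumptions for illustration, and the timings it prints will differ from the figures above.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical test frame with a non-default index (labels 0, 2, 4, ...).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 1000, size=(1000, 3)), columns=['a', 'b', 'c'])
df.index = df.index * 2

cols = ['a', 'b', 'c']
candidates = {
    'zip': lambda d: pd.Series(list(zip(d['a'], d['b'], d['c'])), index=d.index),
    'apply': lambda d: d[cols].apply(tuple, axis=1),
    'set_index': lambda d: pd.Series(d.set_index(cols).index.values, index=d.index),
    'tolist': lambda d: pd.Series([tuple(x) for x in d[cols].values.tolist()],
                                  index=d.index),
}

results = {}
for name, build in candidates.items():
    start = time.perf_counter()
    results[name] = build(df)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f} s")

# Every variant should yield the same tuples in the same row order.
reference = results['zip'].tolist()
all_equal = all(r.tolist() == reference for r in results.values())
print(all_equal)
```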

I think the new Series needs the same index as df0 so that the data aligns:

df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)
Or use apply:

df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)
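
Alignment only happens when the assigned object carries its own index. As a further option not taken from the answer, a plain list can be assigned directly, in which case pandas places the values by position. A minimal sketch, assuming the tuples are in the same row order as the frame; the small frame `df` is made up for illustration:

```python
import pandas as pd

# Hypothetical frame with a duplicated, non-default index.
df = pd.DataFrame({'a': list('abc'), 'b': [4, 5, 4], 'c': [7, 8, 9]},
                  index=[1, 1, 9])

# A list has no index, so the values are assigned positionally.
df['abc'] = list(zip(df['a'], df['b'], df['c']))

n_missing = int(df['abc'].isnull().sum())
print(df)
```

This avoids building an intermediate Series, at the cost of relying on the list length matching the number of rows.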
Sample:

df0 = pd.DataFrame({'a':list('abcdef'),
                   'b':[4,5,4,5,5,4],
                   'c':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')}, index=[1,1,2,2,9,10])

print (df0)
    D  E  F  a  b  c
1   1  5  a  a  4  7
1   3  3  a  b  5  8
2   5  6  a  c  4  9
2   7  9  b  d  5  4
9   1  2  b  e  5  2
10  0  4  b  f  4  3

df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])))

print (df0)
    D  E  F  a  b  c        abc
1   1  5  a  a  4  7  (b, 5, 8)
1   3  3  a  b  5  8  (b, 5, 8)
2   5  6  a  c  4  9  (c, 4, 9)
2   7  9  b  d  5  4  (c, 4, 9)
9   1  2  b  e  5  2        NaN
10  0  4  b  f  4  3        NaN

Without index=df0.index the Series gets the default labels 0 to 5, so the values are matched to the wrong rows or are missing. With the index passed explicitly, or with apply:

df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)
df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)

print (df0)
    D  E  F  a  b  c        abc
1   1  5  a  a  4  7  (a, 4, 7)
1   3  3  a  b  5  8  (b, 5, 8)
2   5  6  a  c  4  9  (c, 4, 9)
2   7  9  b  d  5  4  (d, 5, 4)
9   1  2  b  e  5  2  (e, 5, 2)
10  0  4  b  f  4  3  (f, 4, 3)

Thank you, it works. apply also works, but I tend to avoid it because it is very slow: the other method takes 2.32 s, while apply takes 1 min 42 s (measured with %time, not %timeit).

Yes. What about
df0['abc'] = df0.set_index(['a','b','c']).index.values
or
df0['abc'] = pd.Series([tuple(x) for x in df0[['a','b','c']].values.tolist()], index=df0.index)
? I have not tested them, but zip should be the best because it is pure Python.