Python: shift non-empty cells in grouped columns to the left


I have a dataframe with several groups of similarly named columns. I want to fill the empty cells with the data from the columns to their right, so that

Address1     Address2     Address3     Address4     Phone1     Phone2     Phone3     Phone4
ABC          nan          def          nan          9091-XYz   nan        nan        XYZ-ABZ
should have its columns shifted to look like this:

Address1     Address2     Address3     Address4     Phone1     Phone2     Phone3     Phone4
ABC          def          nan          nan          9091-XYz   XYZ-ABZ    nan        nan 
There is also a solution to a similar problem:

import pandas as pd

pdf = pd.read_csv('Data.txt', sep='\t')

# get a set of column-name prefixes by stripping the trailing digits
columns = set(map(lambda x: x.rstrip('0123456789'), pdf.columns))

for col_pattern in columns:
    # get columns with similar names
    current = [col for col in pdf.columns if col_pattern in col]
    coldf = pdf[current]
    # shift columns to the left
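One possible way to fill in that missing loop body (a minimal sketch, not from the original post, assuming each group only needs its non-null values packed to the left with NaN padding on the right):

import numpy as np
import pandas as pd

pdf = pd.read_csv('Data.txt', sep='\t')

prefixes = set(col.rstrip('0123456789') for col in pdf.columns)

for prefix in prefixes:
    current = [c for c in pdf.columns if c.rstrip('0123456789') == prefix]
    coldf = pdf[current]
    # rebuild every row with its non-null values first and NaN padding at the end
    pdf[current] = coldf.apply(
        lambda row: pd.Series(list(row.dropna()) + [np.nan] * row.isna().sum(),
                              index=current),
        axis=1)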
The file Data.txt has its columns sorted by column name, so all columns with similar names are grouped together.

Any help with this would be appreciated.

I tried adding this code from the link to the code above, but it runs out of memory:

    newdf=pd.read_csv(StringIO(u''+re.sub(',+',',',df.to_csv()).decode('utf-8')))
    list_.append(newdf)
pd.concat(list_,axis=0).to_csv('test.txt')
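If memory is the limit, one option (a rough sketch; shift_left is a hypothetical helper that applies the per-group shifting to a single chunk) is to stream the file in chunks instead of materializing the whole CSV in memory:

import pandas as pd

reader = pd.read_csv('Data.txt', sep='\t', chunksize=100_000)

for i, chunk in enumerate(reader):
    shifted = shift_left(chunk)  # hypothetical helper: shifts non-null values left per group
    shifted.to_csv('test.txt', mode='w' if i == 0 else 'a',
                   header=(i == 0), index=False)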

pushna pushes all null values to the end of the series.

coltype uses a regex to extract the non-numeric prefix from every column name.

def pushna(s):
    # non-null values first, then the nulls, keeping the original labels
    notnull = s[s.notnull()]
    isnull = s[s.isnull()]
    values = notnull.append(isnull).values
    return pd.Series(values, s.index)

# non-numeric prefix of every column name, e.g. 'Address', 'Phone'
coltype = df.columns.to_series().str.extract(r'(\D*)', expand=False)

df.groupby(coltype, axis=1).apply(lambda df: df.apply(pushna, axis=1))
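Note: Series.append was removed in pandas 2.0, and groupby(..., axis=1) is deprecated in recent versions. A roughly equivalent sketch for newer pandas swaps in pd.concat and transposes before grouping:

import pandas as pd

def pushna(s):
    # non-null values first, then the nulls; pd.concat replaces the removed Series.append
    values = pd.concat([s[s.notnull()], s[s.isnull()]]).values
    return pd.Series(values, index=s.index)

coltype = df.columns.to_series().str.extract(r'(\D*)', expand=False)

# transpose so the column groups become row groups, then transpose back
result = (df.T
            .groupby(coltype, group_keys=False)
            .apply(lambda g: g.apply(pushna, axis=0))
            .T
            .reindex(columns=df.columns))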

A solution with a MultiIndex and sort_values:

Old solution:

I found another problem - if there are more rows in the dataframe, the solution above does not work correctly. So you can use a double apply. But the problem with this solution is that the order of the values within the rows is not correct:

df = pd.DataFrame({'Address1': {0: 'ABC', 1: 'ABC'}, 'Address2': {0: np.nan, 1: np.nan}, 'Address3': {0: 'def', 1: 'def'}, 'Phone4': {0: 'XYZ-ABZ', 1: 'XYZ-ABZ'}, 'Address4': {0: np.nan, 1: np.nan}, 'Phone1': {0: '9091-XYz', 1: '9091-XYz'}, 'Phone3': {0: np.nan, 1: 'aaa'}, 'Phone2': {0: np.nan, 1: np.nan}})

print (df)
  Address1  Address2 Address3  Address4    Phone1  Phone2 Phone3   Phone4
0      ABC       NaN      def       NaN  9091-XYz     NaN    NaN  XYZ-ABZ
1      ABC       NaN      def       NaN  9091-XYz     NaN    aaa  XYZ-ABZ 

cols = df.columns.str.extract(r'([A-Za-z]+)(\d+)', expand=True).values.tolist()
mux = pd.MultiIndex.from_tuples(cols)
df.columns = mux

df = df.groupby(axis=1, level=0) \
       .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))

df.columns = [''.join(col) for col in df.columns]
print (df)
  Address1 Address2  Address3  Address4    Phone1   Phone2 Phone3  Phone4
0      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    NaN     NaN
1      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    aaa     NaN
I also tried to modify the solution so that you do not need a MultiIndex:

coltype = df.columns.str.extract(r'([A-Za-z]+)', expand=False)
print (coltype)
Index(['Address', 'Address', 'Address', 'Address', 'Phone', 'Phone', 'Phone',
       'Phone'],
      dtype='object')

df = df.groupby(coltype, axis=1) \
       .apply(lambda x: x.apply(lambda y: y.sort_values().values, axis=1))
print (df)
  Address1 Address2  Address3  Address4    Phone1   Phone2 Phone3  Phone4
0      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    NaN     NaN
1      ABC      def       NaN       NaN  9091-XYz  XYZ-ABZ    aaa     NaN

I have a CSV with 2.5 million rows. I have it running - hopefully it finishes soon. Using the MultiIndex version on a sample gave the output in about three times the time.

Yes, but there may be another problem - are all the NaNs in one column, or can a column sometimes hold a NaN and sometimes a value? I think with the DataFrame above - see Phone3 in the second row.

Ok, I think that is the problem. Phone3 can have some values; it is not always NaN.

Yes, I tried to find a solution, but it is similar to piRSquared's - unfortunately you need apply inside the groupby :(
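For the 2.5-million-row file, any row-wise apply inside the groupby will be slow. A vectorized alternative (a sketch of the common NumPy "justify" trick, not part of the answers above) packs the non-null values of every column group to the left in one pass and, unlike the sort_values variants, keeps the original left-to-right order of the values:

import numpy as np
import pandas as pd

def justify_left(block):
    # move the non-null values of every row of a 2-D object array to the left
    mask = pd.notna(block)
    out = np.full(block.shape, np.nan, dtype=object)
    # the first k slots of each row receive that row's k non-null values (both sides are row-major)
    out[mask.sum(axis=1)[:, None] > np.arange(block.shape[1])] = block[mask]
    return out

coltype = df.columns.str.extract(r'([A-Za-z]+)', expand=False)
for prefix in coltype.unique():
    cols = df.columns[coltype == prefix]
    df[cols] = justify_left(df[cols].to_numpy(dtype=object))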