Python - How to split cells in a column into new rows based on delimiters [python, pandas, numpy]


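I'm relatively new to Python and am trying to split up some data from a CSV file. I want to parse this data and split it into new rows whenever a specific delimiter occurs. The delimiters are '.', ';' and '#'. There are no spaces in COL_C. The order of the delimiters does not matter: whenever one of them is found, a new row should be created.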
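As a minimal illustration of the splitting rule just described, here is a sketch that uses only the standard library. The `split_cell` helper and the `DELIMITERS` name are hypothetical, introduced only for this example; a character class matches any one of the three delimiters regardless of order:

```python
import re

# Character class matching any one of the delimiters '.', ';' or '#'.
DELIMITERS = re.compile(r"[.;#]")

def split_cell(cell):
    """Split one COL_C value on '.', ';' or '#', in whatever order they appear."""
    return DELIMITERS.split(cell)

print(split_cell("Hi.Can;You#Help"))  # ['Hi', 'Can', 'You', 'Help']
```

Each element of the returned list corresponds to one new row, with the COL_A and COL_B values repeated.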

Here is the sample data:

COL_A | COL_B | COL_C
---------------------
Hello | World | Hi.Can;You#Help

The result I am trying to get is:

COL_A | COL_B | COL_C
---------------------
Hello | World | Hi
Hello | World | Can
Hello | World | You
Hello | World | Help



Example 2:

COL_A | COL_B | COL_C
---------------------
Hello | World | Hi#123;move
New | line | Can.I#parse;this.data

The result I am trying to get is:

COL_A | COL_B | COL_C
---------------------
Hello | World | Hi
Hello | World | 123
Hello | World | move
New | Line | Can
New | Line | I
New | Line | parse
New | Line | this
New | Line | data



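If the dataset has another row that does not have Hello World in the first two columns but, say, World World, I want to show that row as well, with its third-column data parsed into new rows in the same way.

Thanks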

Example 1:

In [107]: df
Out[107]:
   COL_A  COL_B            COL_C
0  Hello  World  Hi.Can;You#Help

Solution:

import numpy as np
import pandas as pd

def split_list_in_cols_to_rows(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)

    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()

    # repeat the other columns once per list element and flatten the lists
    res = pd.DataFrame({
        col: np.repeat(df[col].values, lens)
        for col in idx_cols
    }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols})

    # add back rows whose list is empty; pd.concat replaces DataFrame.append,
    # which was removed in pandas 2.0
    return pd.concat([res, df.loc[lens == 0, idx_cols]]) \
             .fillna(fill_value) \
             .loc[:, df.columns]

In [106]: split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')),
                                     lst_cols='COL_C')
Out[106]:
   COL_A  COL_B COL_C
0  Hello  World    Hi
1  Hello  World   Can
2  Hello  World   You
3  Hello  World  Help

Example 2:

In [110]: df
Out[110]:
   COL_A  COL_B                  COL_C
0  Hello  World            Hi#123;move
1    New   line  Can.I#parse;this.data

In [111]: split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')),
     ...:                                      lst_cols='COL_C')
Out[111]:
   COL_A  COL_B  COL_C
0  Hello  World     Hi
1  Hello  World    123
2  Hello  World   move
3    New   line    Can
4    New   line      I
5    New   line  parse
6    New   line   this
7    New   line   data
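
On newer pandas versions (0.25 and later), `DataFrame.explode` offers a shorter route to the same result. The following is a sketch of that alternative, not the function shown above; the `split_to_rows` name and its `pattern` parameter are hypothetical, chosen only for this illustration:

```python
import pandas as pd

def split_to_rows(df, col, pattern=r"[.;#]"):
    """Split `col` on the regex `pattern` and emit one row per piece."""
    # str.split turns each cell into a list of pieces
    pieces = df[col].str.split(pattern)
    # explode emits one row per list element, repeating the other columns
    return df.assign(**{col: pieces}).explode(col).reset_index(drop=True)

df = pd.DataFrame({
    "COL_A": ["Hello", "New"],
    "COL_B": ["World", "line"],
    "COL_C": ["Hi#123;move", "Can.I#parse;this.data"],
})
print(split_to_rows(df, "COL_C"))
```

Because the split column is overwritten in place with `assign`, the original column order is preserved.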

Blending speed and elegance

def pir(df, c):
    # split on '.', ';' or '#' (raw string so the backslash reaches the regex engine)
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]

    # one row per piece, indexed by the label of the row it came from
    cdf = pd.DataFrame({c: np.concatenate(clst)}, df.index.repeat(lens))
    # `columns=` replaces the positional axis argument removed in pandas 2.0
    return df.drop(columns=c).join(cdf).reset_index(drop=True)

Forget elegance, give me speed

def pir2(df, c):
    # split on '.', ';' or '#' (raw string so the backslash reaches the regex engine)
    colc = df[c].str.split(r'\.|;|#')
    clst = colc.values.tolist()
    lens = [len(l) for l in clst]
    j = df.columns.get_loc(c)
    v = df.values
    n, m = v.shape
    # row positions, each repeated once per piece
    r = np.arange(n).repeat(lens)
    return pd.DataFrame(
        np.column_stack([v[r, 0:j], np.concatenate(clst), v[r, j+1:]]),
        columns=df.columns
    )


Timing

%timeit pir(df, 'COL_C')
1000 loops, best of 3: 1.42 ms per loop

%timeit pir2(df, 'COL_C')
1000 loops, best of 3: 278 µs per loop

%timeit split_list_in_cols_to_rows(df.assign(COL_C=df.COL_C.str.split(r'[.,;#]')), lst_cols='COL_C')
100 loops, best of 3: 4.16 ms per loop

%%timeit 
COL_C2 = df.COL_C.str.split(r'\.|;|#').apply(pd.Series).stack()
df.drop(columns='COL_C').join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C')).reset_index(drop=True)
100 loops, best of 3: 2.81 ms per loop
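
The `%timeit` lines are IPython magics. Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The block below restates `pir` so that it runs on its own, and the `best_ms` variable is a name introduced here. The figure it prints depends on the machine and library versions and will not match the numbers quoted in this thread:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "COL_A": ["Hello", "New"],
    "COL_B": ["World", "line"],
    "COL_C": ["Hi#123;move", "Can.I#parse;this.data"],
})

def pir(df, c):
    # same approach as pir in this answer, restated so the block is self-contained
    clst = df[c].str.split(r"\.|;|#").tolist()
    lens = [len(l) for l in clst]
    cdf = pd.DataFrame({c: np.concatenate(clst)}, index=df.index.repeat(lens))
    return df.drop(columns=c).join(cdf).reset_index(drop=True)

# 3 repeats of 100 calls each; report the best average time per call
totals = timeit.repeat(lambda: pir(df, "COL_C"), number=100, repeat=3)
best_ms = min(totals) / 100 * 1e3
print(f"pir: {best_ms:.3f} ms per loop (best of 3)")
```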

Setup

from io import StringIO
import numpy as np
import pandas as pd

txt = """COL_A | COL_B | COL_C
Hello | World | Hi#123;move
New   | line  | Can.I#parse;this.data """

# the separator is a pipe with optional surrounding whitespace
df = pd.read_csv(StringIO(txt), sep=r'\s*\|\s*', engine='python')

Setup

df = pd.DataFrame({'COL_A': {0: 'Hello ', 1: 'New   '},
 'COL_B': {0: ' World ', 1: ' line  '},
 'COL_C': {0: ' Hi#123;move', 1: ' Can.I#parse;this.data '}})
Out[480]: 
    COL_A    COL_B                    COL_C
0  Hello    World               Hi#123;move
1  New      line     Can.I#parse;this.data 

Solution

# split COL_C on the given delimiters and stack the pieces into a single Series
COL_C2 = df.COL_C.str.split(r'\.|;|#', expand=True).stack()
# join the new Series (after setting its name and index) back to the dataframe
df.join(pd.Series(index=COL_C2.index.droplevel(1), data=COL_C2.values, name='COL_C2'))

Out[475]: 
    COL_A    COL_B                    COL_C COL_C2
0  Hello    World               Hi#123;move     Hi
0  Hello    World               Hi#123;move    123
0  Hello    World               Hi#123;move   move
1  New      line     Can.I#parse;this.data     Can
1  New      line     Can.I#parse;this.data       I
1  New      line     Can.I#parse;this.data   parse
1  New      line     Can.I#parse;this.data    this
1  New      line     Can.I#parse;this.data   data 
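
To arrive at the three-column layout requested in the question, the original `COL_C` can be dropped before joining. This is a sketch of that variation rather than part of the answer itself: the `.str.strip()` calls are an addition to deal with the padded strings in the setup, and `.dropna()` is an assumption-level safeguard for pandas versions whose `stack` keeps missing values.

```python
import pandas as pd

df = pd.DataFrame({'COL_A': ['Hello ', 'New   '],
                   'COL_B': [' World ', ' line  '],
                   'COL_C': [' Hi#123;move', ' Can.I#parse;this.data ']})

# one piece per row, indexed by (original row label, piece position)
pieces = df['COL_C'].str.split(r'\.|;|#', expand=True).stack().dropna().str.strip()
# drop the position level so the pieces align with the original row labels
pieces = pieces.reset_index(level=1, drop=True).rename('COL_C')

# replace the unsplit column with the pieces, one row per piece
out = df.drop(columns='COL_C').join(pieces).reset_index(drop=True)
# trim the padding in the remaining text columns as well
out[['COL_A', 'COL_B']] = out[['COL_A', 'COL_B']].apply(lambda s: s.str.strip())
print(out)
```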

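
Thanks to @piRSquared for suggesting expand.

Very nice solution. However, my dataframe has columns=[From, To], with only a single delimiter in the To column, and processing it raises this error: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'. It points to the expression for col in idx_cols }).assign(**{col:np.concatenate(df[col].values) for col in lst_cols}).append(df.loc[lens==0, idx_cols]).fillna(fill_value).loc[:, df.columns]. This seems to assume that every entry in the column has to be split. Is that correct, and what if that is not the case? I would like it to try the split and, if it cannot, simply keep the single row.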
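
One plausible cause of that error (an assumption, since the commenter's data is not shown) is a missing value in the column being split: the list lengths then contain `NaN` and become floats, which `np.repeat` will not accept as repeat counts. Entries without any delimiter are not a problem by themselves, because splitting them yields a one-element list. The sketch below keeps such rows as single rows by using `explode` instead of the function above; the `From`/`To` data is hypothetical, and `;` merely stands in for the commenter's delimiter:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one splittable entry, one without a delimiter, one missing
df = pd.DataFrame({
    "From": ["a", "b", "c"],
    "To": ["x;y", "z", np.nan],
})

lists = df["To"].str.split(";")
# the missing entry has no length, so the lengths cannot all be integers
lens = lists.str.len()
print(lens)

# explode yields one row per list element and passes the missing value through
out = df.assign(To=lists).explode("To").reset_index(drop=True)
print(out)
```

With this input, the row for `b` stays a single row and the row for `c` is kept with a missing `To` value.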