Python: split each row that contains long text into multiple rows

python pandas

I have a DataFrame with a string column, as shown below:

id                         text                      label
1            this is long string with many words       1
2                 this is a middle string              0
3                      short string                    1
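
To make the question reproducible, the table above can be rebuilt as a DataFrame. This is a minimal sketch: the column values are copied from the table, while the name `word_counts` is only illustrative. It also assumes the condition is meant to count words rather than characters, since the answers split the text into groups of three words.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})

# Number of whitespace-separated words in each row
# (str.len() on the raw column would count characters instead).
word_counts = df['text'].str.split().str.len()

# Boolean mask marking the rows that hold more than three words.
needs_split = word_counts > 3
print(df[needs_split])
```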

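I want to convert this DataFrame into another DataFrame based on the string length (i.e. df['text'].str.len > 3), so that long text is split across several rows.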

This is my code:

pd.concat(df['text'].str.len() > 200)

But this is wrong.

IIUC

v=df.text.str.split(' ')

s=pd.DataFrame({'text':v.sum(),'label':df.label.repeat(v.str.len())})

s['New']=s.groupby(s.index).cumcount()

s.groupby([s.New//3,s.index.get_level_values(level=0)]).agg({'text':lambda x : ' '.join(x),'label':'first'}).sort_index(level=1)

Out[1785]: 
                   text  label
New                           
0   0      this is long      1
1   0  string with many      1
2   0             words      1
0   1         this is a      0
1   1     middle string      0
0   2      short string      1
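
For readers on a recent pandas (0.25 or later), the same result can probably be reached with less index bookkeeping by using `DataFrame.explode`. This is a different technique from the groupby approach above, offered only as a sketch: the helper `chunk_words` and the chunk size `n = 3` are assumptions taken from the output shown, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})

n = 3  # words per chunk (assumed from the output above)


def chunk_words(text, size):
    """Split text on whitespace and rejoin the words in groups of `size`."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]


# Replace each text with its list of chunks, then give every chunk its own row.
out = (
    df.assign(text=df['text'].apply(lambda t: chunk_words(t, n)))
      .explode('text')
      .reset_index(drop=True)
)
print(out)
```

Because `explode` repeats the other columns for every chunk, `id` and `label` stay attached to each piece without a separate `repeat` step.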

You can do this:

In [1257]: n = 3

In [1279]: df.set_index(['label', 'id'])['text'].str.split().apply(
               lambda x: pd.Series([' '.join(x[i:i+n]) for i in range(0, len(x), n)])
            ).stack().reset_index().drop(columns='level_2')
Out[1279]:
   label  id                 0
0      1   1      this is long
1      1   1  string with many
2      1   1             words
3      0   2         this is a
4      0   2     middle string
5      1   3      short string

Details

   label                                 text  id
0      1  this is long string with many words   1
1      0              this is a middle string   2
2      1                         short string   3

Here is a solution that uses two for loops to split the text into groups of 3:

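import pandas as pd

n = 3  # words per group
rows = []
for _, row in df.iterrows():
    words = row['text'].split()
    # Step through the words n at a time; short texts yield a single group.
    for jj in range(0, len(words), n):
        rows.append({'id': row['id'],
                     'text': ' '.join(words[jj:jj + n]),
                     'label': row['label']})

# Collect the pieces into a new DataFrame with the original column order.
result = pd.DataFrame(rows, columns=['id', 'text', 'label'])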

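Thanks @Zero, but your code does not include the id column, and it raises this error: ValueError: labels ['level_2'] not contained in axis. When I use the text column, it shows "KeyError: 'text'".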
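
One possible way around both errors, sketched under assumptions: the `KeyError: 'text'` is consistent with the stacked values ending up in a column named `0` rather than `text`, and the `level_2` error suggests the index had a different number of levels when the code was run. The variant below avoids relying on the auto-generated `level_2` name and names the value column explicitly. The variable `chunks` is illustrative, and the extra `dropna()` is a precaution in case the installed pandas version keeps the NaN padding after `stack()`.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'text': ['this is long string with many words',
             'this is a middle string',
             'short string'],
    'label': [1, 0, 1],
})

n = 3  # words per chunk

chunks = (
    df.set_index(['id', 'label'])['text']
      .str.split()
      # One column per chunk; shorter rows are padded with NaN.
      .apply(lambda x: pd.Series([' '.join(x[i:i + n]) for i in range(0, len(x), n)]))
      .stack()
      .dropna()                    # discard padding if stack() kept it
      .rename('text')              # name the values so the column is 'text', not 0
      .reset_index(level=[0, 1])   # bring id and label back as columns
      .reset_index(drop=True)      # discard the chunk-position level
)
print(chunks)
```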