Python: split text in cells and create additional rows for the tokens

python python-3.x pandas


Suppose I have the following in a pandas DataFrame:

id  text
1   I am the first document and I am very happy.
2   Here is the second document and it likes playing tennis.
3   This is the third document and it looks very good today.
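
For anyone who wants to reproduce the problem, the frame above could be built roughly like this. It is only a sketch: the column names and values are copied from the table, and the variable name df is the one the answers below assume.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
})
print(df)
```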

I want to split the text for each id into tokens of 3 words, so that I end up with the following:

id  text
1   I am the
1   first document and
1   I am very
1   happy
2   Here is the
2   second document and
2   it likes playing
2   tennis
3   This is the
3   third document and
3   it looks very
3   good today

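Keep in mind that my DataFrame may have other columns besides these two, and they should simply be copied to the new DataFrame in the same way as id above.

What is the most efficient way to do this?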

I think the answer to my question is very close to the answer given in a related question.

You can use the following:

def divide_chunks(l, n): 
    # looping till length l 
    for i in range(0, len(l), n):  
        yield l[i:i + n]
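
To see what the generator does on its own, here is a small sketch that applies it to the first sentence from the question. The names words and chunks are only illustrative.

```python
def divide_chunks(l, n):
    # Yield successive slices of length n; the last slice may be shorter.
    for i in range(0, len(l), n):
        yield l[i:i + n]

words = "I am the first document and I am very happy.".split()
chunks = [" ".join(chunk) for chunk in divide_chunks(words, 3)]
print(chunks)
# ['I am the', 'first document and', 'I am very', 'happy.']
```

The sentence has ten words, so the last chunk holds a single word.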

Then use (edited):

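import pandas as pd

# Chunk each row's words, then flatten to one row per chunk.
# Note: the resulting id column holds the row position (0, 1, 2), not the original id values.
m = (pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(), 3))).values.tolist())
       .unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns = df.columns  # assumes df has exactly the two columns id and text
print(m)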
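
If the real id values and any extra columns need to survive, one possible variant is DataFrame.explode (available from pandas 0.25). This is a swapped-in technique, not the answer's own method, and only a sketch: the string ids and the source column are made up for illustration.

```python
import pandas as pd

def divide_chunks(l, n):
    # Yield successive slices of length n; the last slice may be shorter.
    for i in range(0, len(l), n):
        yield l[i:i + n]

df = pd.DataFrame({
    "id": ["a", "b"],
    "text": ["I am the first document", "Here is the second document and more"],
    "source": ["x", "y"],  # hypothetical extra column
})

out = (
    # Replace each text with a list of 3-word strings...
    df.assign(text=df["text"].apply(
        lambda s: [" ".join(c) for c in divide_chunks(s.split(), 3)]))
      # ...then give every list element its own row; other columns are repeated.
      .explode("text")
      .reset_index(drop=True)
)
print(out)
```

Because each row carries its own list, rows with different numbers of chunks are not a problem here.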

A self-contained solution, which may be a bit slower:

# Split every n words
n = 3

# in case id is not the index yet
df.set_index('id', inplace=True)

new_df = df.text.str.split(' ', expand=True).stack().reset_index()

new_df = (new_df.groupby(['id', new_df.level_1//n])[0]
                .apply(lambda x: ' '.join(x))
                .reset_index(level=1, drop=True)
         )

new_df is a Series:

id
1               I am the
1     first document and
1              I am very
1                 happy.
2            Here is the
2    second document and
2       it likes playing
2                tennis.
3            This is the
3     third document and
3          it looks very
3            good today.
Name: 0, dtype: object
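
If a DataFrame with a fresh index is wanted instead of a Series, one way is Series.reset_index with its name argument. The sketch below repeats the steps above on a small made-up frame with string ids. The extra dropna() is my own precaution, in case stack() keeps the missing values that split(expand=True) pads shorter rows with.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b"],  # string ids, made up for illustration
    "text": ["I am the first document", "Here is the second"],
})
n = 3  # words per chunk

# One row per word: columns are id, level_1 (word position) and 0 (the word).
stacked = (df.set_index("id")["text"]
             .str.split(" ", expand=True)
             .stack()
             .dropna()
             .reset_index())

# Join every n consecutive words per id; the result is a Series indexed by id.
series = (stacked.groupby(["id", stacked["level_1"] // n])[0]
                 .apply(" ".join)
                 .reset_index(level=1, drop=True))

# Move id out of the index and name the value column.
result = series.reset_index(name="text")
print(result)
```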

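Is this the content of a DataFrame, or just a text file?

@deveshkumarsing see my edit at the top. We are only talking about a DataFrame in pandas.

I would also like to hear your views on this, @jezrael.

Hey, thanks; it looks interesting. To be honest, I thought there would be a slightly simpler solution, but I may turn out to be wrong. By the way, is unnesting a function that can be called as above? I cannot find it right now.

Ah, OK, you probably mean the function at the end of this linked answer. By the way, did you see the link I posted above? I think the answer there could be modified to fit my question (or not). Also, @jezrael does not seem to be around right now, so I cannot get his magic in my hands.

What is the computational complexity of your solution? I think you could also do this with nested for loops, but that would probably be very expensive computationally. I also think we could consider something like this (if it works).

OK, let's see. To be honest, I have already used the solution above and it worked, but I do not fully understand it, so I cannot easily modify it to solve my current problem.

Cool, thanks anyway :) (upvoted again after the edit)

Hey, thanks (upvoted). It certainly looks more self-contained. However, yes, it is also a question of how expensive it is computationally. By the way, what happens if the DataFrame also has other columns (which should simply be copied to the new DataFrame just like the id column), as I wrote in my post? Does it still work?

If you set the index to all the other columns and replace reset_index(level=1) with reset_index(level=-1), it still works.

OK, let me see (although I just realized that I do not really need to carry these columns over to my new DataFrame right now, so it is fine). Hey, I think it does work (!!). Good stuff. However, it returns a Series (with id as the index, I think), not a DataFrame. Could you reset the index properly and create a DataFrame? The DataFrame should have a reset index column, an id column and a text column. (I do not know whether this is relevant, but in my actual data the id column is a string, not a number.)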
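
The suggestion in the comments (put all the other columns in the index and drop the last index level) might look roughly like the sketch below. The lang column, the pos level name and the variable other_cols are made up for illustration. The sketch also assumes that the combination of the other columns identifies each original row uniquely; otherwise rows would be merged into the same group.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b"],
    "lang": ["en", "en"],  # hypothetical extra column
    "text": ["I am the first document", "Here is the second"],
})
n = 3  # words per chunk
other_cols = [c for c in df.columns if c != "text"]

# One row per word; the word position is stored in a column named pos.
stacked = (df.set_index(other_cols)["text"]
             .str.split(" ", expand=True)
             .stack()
             .dropna()  # precaution in case padding values are kept
             .rename_axis(other_cols + ["pos"])
             .reset_index())

result = (stacked.groupby(other_cols + [stacked["pos"] // n])[0]
                 .apply(" ".join)
                 .reset_index(level=-1, drop=True)  # drop the chunk number
                 .reset_index(name="text"))         # other columns become columns again
print(result)
```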

Output of print(m) from the first answer:

   id                 text
0   0             I am the
1   0   first document and
2   0            I am very
3   0               happy.
0   1          Here is the
1   1  second document and
2   1     it likes playing
3   1              tennis.
0   2          This is the
1   2   third document and
2   2        it looks very
3   2          good today.
