Python: convert a string to a DataFrame, separated by colon

python pandas

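I have a string taken from an article with a few hundred sentences. I want to convert the string to a DataFrame, with each sentence as a row. For example,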

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

I want it to become:

This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.

As a Python newbie, I tried the following:

from io import StringIO
import pandas as pd
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep=".")

With the code above, all the sentences become column names. What I actually want is for them to be rows in a single column.

Don't use read_csv. Just split on '.' and use the standard pd.DataFrame:

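import pandas as pd

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Split on '.', trim surrounding whitespace, and drop empty fragments
data_df = pd.DataFrame([sentence.strip() for sentence in data.split('.') if sentence.strip()],
                       columns=['sentences'])
print(data_df)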

#                                     sentences
#  0  This is a book, to which I found exciting
#  1                  I bought it for my cousin
#  2                                He likes it

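Keep in mind that this will break if some of the sentences contain floating-point numbers. In that case you will need to change the format of your string (for example, use '\n' instead of '.' to separate the sentences).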
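To make that caveat concrete, here is a small sketch. The sample text and variable names are made up for illustration and do not come from the question. It shows split('.') cutting a decimal number in half, and str.splitlines() avoiding the problem when the sentences are separated by newlines instead.

```python
import pandas as pd

# Hypothetical sample text (not from the question) containing a decimal number.
text = 'The book costs 3.5 dollars. I bought it for my cousin.'

# Splitting on '.' also cuts "3.5" apart, so we get three fragments
# instead of two sentences.
naive = [s.strip() for s in text.split('.') if s.strip()]

# If the sentences are separated by newlines instead, splitlines()
# leaves the number (and the final periods) intact.
text_by_line = 'The book costs 3.5 dollars.\nI bought it for my cousin.'
df = pd.DataFrame({'sentences': text_by_line.splitlines()})
```

This only helps if you control how the input string is produced; for text that is already a single paragraph, a proper sentence tokenizer is likely the safer choice.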

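This is a quick-and-dirty solution, but it solves your problem:

from io import StringIO
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T  # the trailing '.' leaves a final empty (NaN) row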

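You can do this with a list comprehension:

import pandas as pd

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

# Strip the final '.' first so the last sentence does not end up with two periods
df = pd.DataFrame({'sentence': [i + '.' for i in data.rstrip('.').split('. ')]})

print(df)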

#                                      sentence
# 0  This is a book, to which I found exciting.
# 1                  I bought it for my cousin.
# 2                                He likes it.

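What you are trying to do is tokenize the text into sentences. The simplest way is to use a text-mining library such as NLTK:

import pandas as pd
from nltk.tokenize import sent_tokenize  # needs NLTK's Punkt tokenizer data, fetched via nltk.download
pd.DataFrame(sent_tokenize(data))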

Otherwise, you could try something like this:

pd.DataFrame(data.split('. '))

However, this will fail if you run into a sentence like this:

problem = 'Tim likes to jump... but not always!'
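As a rough illustration of that failure, and of one possible workaround without NLTK, here is a sketch. The split_sentences helper and its regular expression are my own assumption, not part of any answer here: it splits only on whitespace that follows '.', '!' or '?' and is followed by a capital letter.

```python
import re
import pandas as pd

problem = 'Tim likes to jump... but not always!'

# The plain split cuts the sentence at the ellipsis.
pieces = problem.split('. ')

def split_sentences(text):
    """Hypothetical heuristic: split on whitespace that follows ., ! or ?
    and is immediately followed by an uppercase letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
df = pd.DataFrame({'sentence': split_sentences(data)})
```

This heuristic keeps the ellipsis example in one piece and leaves decimals such as 3.5 alone, but it will still split wrongly after abbreviations like "Mr. Smith", which is the kind of case a trained tokenizer is meant to handle.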

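@DeepSpace's solution is much better than this. Don't forget to add the period back :)

It works, but it will run into problems with more complex input.

@cdwoelk More complex how? Floats in the middle of a sentence? At that point it becomes a problem for nltk, not really for this approach.

I would probably lean towards using nltk for this. That said, the quick-and-dirty solution is probably fine in most cases.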