Python: convert a string to a DataFrame, splitting on periods
I have a string that comes from an article with a few hundred sentences. I want to convert the string to a DataFrame, with each sentence as a row. For example:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I want it to become:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a Python newbie, I tried the following:
from io import StringIO
import pandas as pd
data_csv = StringIO(data)
data_df = pd.read_csv(data_csv, sep=".")
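With the code above, all the sentences become column names. What I actually want is for them to be rows in a single column.

Don't use read_csv. Just split on '.' and use the standard pd.DataFrame constructor: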
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
data_df = pd.DataFrame([sentence for sentence in data.split('.') if sentence],
columns=['sentences'])
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
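One detail the printed frame hides: splitting on a bare '.' leaves a leading space on every sentence after the first, plus an empty string after the final period. The sketch below is only an illustration in plain Python (the names raw_parts and sentences are made up here) of what the split returns and how stripping tidies it up before the list is handed to pd.DataFrame.

```python
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

# A plain split keeps the space that followed each period and
# produces a trailing empty string after the last one.
raw_parts = data.split('.')
print(raw_parts)

# Strip surrounding whitespace and drop pieces that end up empty.
sentences = [part.strip() for part in raw_parts if part.strip()]
print(sentences)
```

The cleaned list can be passed to pd.DataFrame(sentences, columns=['sentences']) exactly as in the answer above.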
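Keep in mind that splitting on '.' will break if some sentences contain floating-point numbers. In that case you need to change the format of the string (for example, use '\n' instead of '.' to separate sentences).

This is a quick fix, but it solves your problem: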
from io import StringIO
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T  # read_csv needs a file-like object, not the raw string
You can do this with a list comprehension:
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
df = pd.DataFrame({'sentence': [i + '.' for i in data.rstrip('.').split('. ')]})  # strip the final period first so it is not doubled
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
What you are trying to do is sentence tokenization. The easiest way is to use a text-mining library such as NLTK:
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
Otherwise, you could try the following:
pd.DataFrame(data.split('. '))
However, this will fail if you run into a sentence like this:
problem = 'Tim likes to jump... but not always!'
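If pulling in NLTK is not an option, a regular expression can cover a few of these cases. The sketch below is not taken from any of the answers: split_sentences and SENTENCE_BOUNDARY are hypothetical names, and the rule (split only where '.', '!' or '?' is followed by whitespace and then a capital letter) is an assumption that will still mis-split abbreviations such as "Mr. Smith".

```python
import re

# Hypothetical helper, not from the answers above: a boundary is whitespace
# that follows '.', '!' or '?' and is itself followed by a capital letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def split_sentences(text):
    """Split text at the assumed sentence boundaries, keeping the punctuation."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]

problem = 'Tim likes to jump... but not always!'
price = 'It costs 3.5 dollars. I bought it for my cousin.'

# The ellipsis is followed by a lowercase letter, so nothing is split.
print(split_sentences(problem))
# The decimal point has no whitespace after it, so 3.5 stays intact.
print(split_sentences(price))
```

The resulting list can go straight into pd.DataFrame(split_sentences(data), columns=['sentences']). For real article text, a tokenizer such as NLTK's sent_tokenize is still likely to be the more robust choice.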
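@DeepSpace's solution is much better than this. Don't forget to add the period :)
It works, but it will run into problems with more complex input.
@cdwoelk More complex how? Floats in the middle of a sentence? That then becomes a problem for nltk.
Not really. I would probably lean toward using nltk for this. That said, the quick-and-dirty solution is probably fine in most cases.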