Python: How do I get the unique words from a DataFrame column of strings?

python pandas numpy dataframe

I am looking for a way to get a list of the unique words in a string column of a DataFrame.

import pandas as pd
import numpy as np

df = pd.read_csv('FinalStemmedSentimentAnalysisDataset.csv', sep=';',dtype= 
       {'tweetId':int,'tweetText':str,'tweetDate':str,'sentimentLabel':int})

tweets = {}
tweets[0] = df[df['sentimentLabel'] == 0]
tweets[1] = df[df['sentimentLabel'] == 1]

The dataset I am using comes from the following link:

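I have a column of variable-length strings, and I want a list of every unique word in the column together with its count. How can I get that? I am using Pandas in Python, and the original dataset has more than 1 million rows, so I also need an approach efficient enough to process those rows without the code running for too long.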

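An example of the column could be:

is so sad for my apl friend
omg this is terrible
what is this new song

And the list could look like this:

[is, so, sad, for, my, apl, friend, omg, this, terrible, what, new, song]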

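If the column holds strings, you have to split each sentence into a list of words and then combine all the lists into a single list. You can use sum() for that, and it should give you all the words. To get the unique words, convert the result to set(), and then back to list() if needed.

But first you have to clean the sentences to remove characters such as "." and "?". I use a regex that keeps only some characters and spaces. Finally, you have to convert all the words to lowercase or uppercase.
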
import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

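# regex=True is needed in recent pandas versions, where str.replace() treats the pattern as a literal string by default
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())

print(sorted(unique))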

Result:
['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']
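
The question also asks for a count per word, which set() alone does not provide. A minimal sketch of one way to get it, assuming the same sample DataFrame and the same cleaning steps as above, is to feed each row's words into a collections.Counter (the name `counts` is only illustrative):

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

# Same cleaning as above: keep only letters and spaces, then lowercase.
cleaned = df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower()

# split() with no argument splits on any run of whitespace,
# so no empty strings are produced.
words = cleaned.str.split()

# Update one Counter row by row instead of building one huge list first.
counts = Counter()
for row in words:
    counts.update(row)

# The keys of the Counter are the unique words.
print(sorted(counts))
# most_common(n) returns the n most frequent words with their counts.
print(counts.most_common(2))
```

The keys of `counts` are the unique words, so the same pass yields both the vocabulary and the frequencies.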


EDIT: As @HenryYik mentioned in a comment, findall('\w+') can be used instead of split(), and it also replaces the replace() step.
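
The two cleaning approaches are not exactly equivalent, because `\w` also matches digits and underscores. The sketch below, which uses a made-up sample string purely for illustration, shows the difference:

```python
import pandas as pd

# Hypothetical sample text, chosen to contain digits, an underscore and punctuation.
s = pd.Series(['omg!!! this is 2good_4u'])

# Approach 1: drop everything except ASCII letters and spaces, then split.
by_replace = s.str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split()

# Approach 2: extract runs of word characters (letters, digits, underscore).
by_findall = s.str.lower().str.findall(r'\w+')

print(by_replace.iloc[0])
print(by_findall.iloc[0])
```

With findall(), tokens made of digits are kept, so purely numeric "words" can appear in the result. If only letters are wanted, a pattern such as `[a-z]+` after lowercasing may be a better fit.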


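EDIT: I tested this with the actual dataset.

Everything runs fast except column.sum() or sum(column). I measured the time for 1,000 rows and estimated that 1.5 million rows would take about 35 minutes.

Using itertools.chain() is much faster; it takes about 8 seconds.
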
import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
# chain.from_iterable() flattens the per-row lists into one list of words
words = list(itertools.chain.from_iterable(words))
unique = set(words)

But the words can also be added directly to a set():

words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

for x in words:
    unique.update(x)

This takes about 5 seconds.
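
To check that the three approaches give the same result, and to see how their run times compare, they can be put side by side. This is only a sketch on small synthetic data (an assumption, not the real dataset); the function names `via_sum`, `via_chain` and `via_update` are illustrative, and the timings will vary by machine. Summing lists builds a new, longer list at every step, which is the likely reason it scales so badly on large inputs.

```python
import itertools
import time

import pandas as pd

# Synthetic data for illustration only: three sentences repeated 1000 times.
sentences = pd.Series([
    'is so sad for my apl friend',
    'omg this is terrible',
    'what is this new song',
] * 1000)
words = sentences.str.lower().str.findall(r'\w+')


def via_sum(col):
    # Concatenates lists one by one; each step copies the growing list.
    return set(sum(col, []))


def via_chain(col):
    # Lazily iterates over all words without building intermediate lists.
    return set(itertools.chain.from_iterable(col))


def via_update(col):
    # Adds each row's words straight into one set.
    unique = set()
    for row in col:
        unique.update(row)
    return unique


results = {}
for func in (via_sum, via_chain, via_update):
    start = time.perf_counter()
    results[func.__name__] = func(words)
    print(func.__name__, time.perf_counter() - start, 's')
```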

Full code:
import pandas as pd
import time 

print(time.strftime('%H:%M:%S'), 'start')

print('-----')
#------------------------------------------------------------------------------

start = time.time()

# `read_csv()` can read directly from a URL, including zip-compressed files
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

unique = set()
for x in words:
    unique.update(x)

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

print(sorted(unique)[:10])

Result:
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']

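Does this answer your question?

Thanks for the question. Please include code rather than images.

@Chris I tried what they said in that post, but it doesn't seem to work for me.

Or df["sentences"].str.findall("\w+") to skip the split part.

@furas I tried your solution, but it doesn't seem to work and appears to be stuck in an infinite loop; I don't know why. I edited the question with some extra information about how the data is processed: the strings have already been processed, without punctuation and all in lowercase.

What loop? I don't use any loop, and the question doesn't show any loop either, so where is the loop? Maybe it would be better to describe the problem or show more code.

I added code that runs much faster for a file with 1.5 million rows.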