Pandas: add a 'document id' column to a dataframe of word ids and word counts
I have the following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['@Zuora wants to help @Network4Good with Hurric...','@ztrip please help spread the good word on hel...']})
          DOCUMENT_ID                                            MESSAGE
0  263403828328665088  @Zuora wants to help @Network4Good with Hurric...
1  264142543883739136  @ztrip please help spread the good word on hel...
I am trying to get output in this form:
   docID  wordID  count
0      1     118      1
1      1     285      1
2      1    1229      1
3      1    1688      1
4      1    2068      1
I used the following approach:
r = []
for i in jsonDF['MESSAGE']:
    for j in sortedValues(wordsplit(i)):
        r.append(j)
IDCount_Re = pd.DataFrame(r)
IDCount_Re[:5]
which gives me the following result:
0             17
1  help        2
2  wants       1
3  hurricane   1
4  relief      1
5  text        1
6  sandy       1
7  donate      1
8              6
9  please      1
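I can get the word counts, but I don't know how to attach the DOCUMENT_ID to the dataframe above.
The following functions are used to split the words: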
import itertools
import re
from collections import Counter, OrderedDict
from nltk.corpus import stopwords

def wordsplit(wordlist):
    j = wordlist
    j = re.sub(r'\d+', '', j)
    j = re.sub('RT', '', j)
    j = re.sub('http', '', j)
    j = re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j = j.lower()
    j = j.strip()
    if not j in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''merges a list into string, splits it, removes stop words and
    then counts the occurrences returning an ordered dictionary'''
    #stopwords=set(stopwords.words('english'))
    string1 = ''.join(list(itertools.chain(filter(None, wordlist))))
    cnt = Counter()
    j = []
    for i in string1.split(" "):
        i = re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i] += 1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''creates a dictionary list of occurrences w/ values descending'''
    d = wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
Update: the solution is here:
DOCUMENT_ID is one of the two fields in each row of jsonDF. Your current code cannot access it because it operates directly on jsonDF['MESSAGE'].
Here is some non-working pseudocode, something like:
for _, row in jsonDF.iterrows():
    doc_id, msg = row
    words = [word for word in wordsplit(msg)][0].split() # hack
    wordcounts = Counter(words).most_common() # sort by decr frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, …
and take the 'wordId' and 'count' fields from wordcounts.
You can combine the i, j loops and the append into a single generator expression: wordcounts = (wc for msg in jsonDF['MESSAGE'] for wc in sortedValues(wordsplit(msg)))
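The row-by-row idea in the answer can be made concrete roughly as follows. This is only a sketch under some assumptions: `simple_tokens` is a hypothetical stand-in for the question's `wordsplit()` (it skips the NLTK stop-word filtering so the example runs without extra downloads), and the output keeps the word itself in a `word` column, because the question does not show how words map to the numeric `wordID` values.

```python
import re
from collections import Counter

import pandas as pd

jsonDF = pd.DataFrame({
    'DOCUMENT_ID': [263403828328665088, 264142543883739136],
    'MESSAGE': ['@Zuora wants to help @Network4Good with Hurric...',
                '@ztrip please help spread the good word on hel...'],
})

def simple_tokens(msg):
    # Hypothetical tokenizer (not the question's wordsplit):
    # drop @mentions, then keep runs of lowercase letters.
    msg = re.sub(r'@\w+', ' ', msg)
    return re.findall(r'[a-z]+', msg.lower())

frames = []
# Iterate over both columns together so the ID stays paired with its message.
for doc_id, msg in zip(jsonDF['DOCUMENT_ID'], jsonDF['MESSAGE']):
    wordcounts = Counter(simple_tokens(msg)).most_common()
    frame = pd.DataFrame(wordcounts, columns=['word', 'count'])
    frame.insert(0, 'DOCUMENT_ID', doc_id)  # scalar is broadcast to every row
    frames.append(frame)

# One long table: one row per (document, word) pair.
result = pd.concat(frames, ignore_index=True)
print(result.head())
```

If numeric word IDs are needed, one option would be to build a word-to-integer dictionary first and map the `word` column through it; that mapping is not part of the original post.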
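But this doesn't attach the document ID, does it?
I didn't say it did. I was suggesting you clean up your code. There are a few other things too, like merging your regexes: re.sub(r'(\d+|RT|http)', '', j)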
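To illustrate the regex suggestion in the comment, here is a small sketch comparing the three separate substitutions from `wordsplit()` with a single alternation. The sample string and its URL are made up for the example, and `strip_noise` is a hypothetical helper name.

```python
import re

# One compiled alternation in place of three separate re.sub calls.
NOISE = re.compile(r'\d+|RT|http')

def strip_noise(text):
    # Remove digit runs, the literal 'RT', and the literal 'http' in one pass.
    return NOISE.sub('', text)

sample = 'RT @user: 2 storms hit http://t.co/abc'  # made-up tweet

# The three-step version as written in the question.
three_step = re.sub('http', '', re.sub('RT', '', re.sub(r'\d+', '', sample)))
one_step = strip_noise(sample)

print(one_step == three_step)
```

The two forms agree on typical text like this, but they are not strictly identical in every case: the three-step version can match text that only appears after an earlier deletion (for example 'R1T' becomes 'RT' and is then removed), whereas the single pass does not.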
Your sortedValues() should be replaced with Counter().most_common()
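As a quick check of that last suggestion, the sketch below compares the manual sort used in `sortedValues()` with `Counter.most_common()`. The word list is invented for illustration.

```python
from collections import Counter

# Invented sample tokens, standing in for one message's words.
words = ['help', 'wants', 'help', 'relief', 'sandy', 'relief', 'help']

cnt = Counter(words)

# What sortedValues() does by hand: sort (word, count) pairs by count, descending.
manual = sorted(cnt.items(), key=lambda t: t[1], reverse=True)

# The built-in equivalent; ties keep first-seen order.
builtin = cnt.most_common()

print(manual == builtin)
print(cnt.most_common(2))  # only the two most frequent words
```

Passing an integer to `most_common()` limits the result to the top n entries, which also removes the need to slice the sorted list afterwards.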