Pandas: add a 'document id' column to a dataframe of word ids / word counts


I have the following dataset:

import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['@Zuora wants to help @Network4Good with Hurric...','@ztrip please help spread the good word on hel...']})

DOCUMENT_ID             MESSAGE
0   263403828328665088  @Zuora wants to help @Network4Good with Hurric...
1   264142543883739136  @ztrip please help spread the good word on hel...
I am trying to get it into the following format:

docID   wordID  count
0   1   118     1
1   1   285     1
2   1   1229    1
3   1   1688    1
4   1   2068    1
I used the following approach:

r=[]
for i in jsonDF['MESSAGE']:
    for j in sortedValues(wordsplit(i)):
        r.append(j)
IDCount_Re=pd.DataFrame(r)
IDCount_Re[:5]
which gives me the result below:

0               17
1   help         2
2   wants        1
3   hurricane   1
4   relief      1
5   text        1
6   sandy       1
7   donate      1
8              6
9   please    1
So I can get the word counts.

What I don't know is how to attach the DOCUMENT_ID to the dataframe above.

The following functions are used to split the words:

from nltk.corpus import stopwords
import re
import itertools
from collections import Counter, OrderedDict

def wordsplit(wordlist):
    j=wordlist
    j=re.sub(r'\d+', '', j)
    j=re.sub('RT', '',j)
    j=re.sub('http', '', j)
    j = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
    j=j.lower()
    j=j.strip()
    if not j in stopwords.words('english'):
        yield j

def wordSplitCount(wordlist):
    '''merges a list into string, splits it, removes stop words and 
    then counts the occurrences, returning an ordered dictionary'''
    #stopwords=set(stopwords.words('english'))
    string1=''.join(list(itertools.chain(filter(None, wordlist))))
    cnt=Counter()
    j = []
    for i in string1.split(" "):
        i=re.sub(r'&', ' ', i.lower())
        if i not in stopwords.words('english'):
            cnt[i]+=1
    return OrderedDict(cnt)

def sortedValues(wordlist):
    '''creates a list of (word, count) pairs sorted by count descending'''
    d=wordSplitCount(wordlist)
    return sorted(d.items(), key=lambda t: t[1], reverse=True)
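
As a quick check on what these helpers produce (note that wordsplit yields the whole cleaned string at most once, since the stopword test is applied to the entire string rather than to individual words; wordSplitCount then splits that string into words and counts them):

msg = jsonDF['MESSAGE'][0]
print(list(wordsplit(msg)))          # a one-element list holding the cleaned message string
print(sortedValues(wordsplit(msg)))  # (word, count) pairs, most frequent first
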
UPDATE: solution here:


The 'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code never has access to it, because it operates directly on jsonDF['MESSAGE'].

Here is some untested pseudocode that won't work as-is; something like:

for _, row in jsonDF.iterrows():
    doc_id, msg = row
    words = [word for word in wordsplit(msg)][0].split() # hack
    wordcounts = Counter(words).most_common() # sort by decr frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, … and pull the 'wordID' and 'count' fields from wordcounts.
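
Putting that together, a minimal sketch of the approach might look like the following. This is an illustration rather than tested code from the answer: it reuses the asker's wordsplit helper and the [0].split() hack from the pseudocode above, and it keeps the word itself in a 'word' column, since mapping each word to a numeric wordID is a separate lookup step that the question does not show.

from collections import Counter
import pandas as pd

frames = []
for _, row in jsonDF.iterrows():
    doc_id, msg = row['DOCUMENT_ID'], row['MESSAGE']
    words = list(wordsplit(msg))[0].split()       # same hack as in the pseudocode
    wordcounts = Counter(words).most_common()     # [(word, count), ...] by decreasing count
    frames.append(pd.DataFrame({'DOCUMENT_ID': doc_id,
                                'word': [w for w, c in wordcounts],
                                'count': [c for w, c in wordcounts]}))

result = pd.concat(frames, ignore_index=True)
print(result.head())

If a numeric wordID is required, one option is to build a word-to-id mapping first and then map the 'word' column through it.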

You can combine your i, j loops and the appends into a single generator expression:

wordcounts = (wordcounter for msg in jsonDF['MESSAGE'] for wordcounter in sortedValues(wordsplit(msg)))
But that doesn't append the document ID, does it?

I didn't say it did; I was suggesting you clean up your code. There are other things too, such as merging your regexes into

re.sub(r'(\d+|RT|http)', '', j)

and your sortedValues() should be replaced by Counter().most_common().
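
For what it's worth, a sketch of the cleanup suggested in this comment might look like the snippet below. It is an interpretation of the suggestion (one combined regex plus Counter.most_common()), not code from the thread, and count_words is a hypothetical name:

import re
from collections import Counter
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))   # build the stopword set once, not per word

def count_words(message):
    '''Clean one message with a single combined regex, drop stop words,
    and return (word, count) pairs sorted by decreasing frequency.'''
    cleaned = re.sub(r'\d+|RT|http|(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)',
                     ' ', message).lower()
    words = [w for w in cleaned.split() if w not in STOPWORDS]
    return Counter(words).most_common()

pairs_by_doc = [(row['DOCUMENT_ID'], count_words(row['MESSAGE']))
                for _, row in jsonDF.iterrows()]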