Regex: How to remove unigrams from a dictionary of words in Python

I am trying to create a word cloud in Python that contains only bigrams. Right now I have a dictionary like this:

Word_dict

{'delivered later requested_delivered later requested': 0.07590105638848002,
 'delayed delivery_delayed delivery': 0.043280231684707335,
 'guidelines followed_guidelines followed': 0.04056653336980544,
 'delayed pickup_delayed pickup': 0.02733236942769188,
 'delivered later requested_delayed delivery': 0.023815416411579027,
 'delayed delivery_delivered later requested': 0.02332477975624476,
 'guidelines followed_delivered later requested': 0.02131881396186928,
 'delivered later requested_guidelines followed': 0.020793441968104277,
 'delayed pickup_delayed delivery': 0.020619765275950556,
 'delayed delivery_guidelines followed': 0.01998150343228563,
 'delayed delivery_delayed pickup': 0.019464815273128308,
 'guidelines followed_delayed delivery': 0.018900366023628715,
 'delivered later requested_delayed pickup': 0.01870932166225962,
 'delayed pickup_delivered later requested': 0.0185660383912328,
 'guidelines followed_delayed pickup': 0.015148949473108336,
 'delayed pickup_guidelines followed': 0.01475383499845862,
 'super user activity fom_super user activity fom': 0.010490072206084763}
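I need to remove the unigrams, i.e. the words without an underscore, from the dictionary keys. How can I do this?

Expected output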

{' requested_delivered ': 0.07590105638848002,
 'delivery_delayed ': 0.043280231684707335,
 'followed_guidelines': 0.04056653336980544,
 'pickup_delayed ': 0.02733236942769188,
 ' requested_delayed ': 0.023815416411579027}
My code

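import re
from operator import itemgetter

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# newstopwords is the custom stopword list (defined elsewhere, not shown here).
stopword_set = set(newstopwords)  # build the set once instead of once per word

def preprocess(x):
    x = re.sub(r'[^a-z\s]', '', x.lower())               # get rid of noise
    x = [w for w in x.split() if w not in stopword_set]  # remove stopwords
    return ' '.join(x)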

data['Clean_addr'] = data['Reason Code Level 1'].apply(preprocess)

# setup and score the bigrams using the raw frequency.
finder = BigramCollocationFinder.from_words(text_content)
bigram_measures = BigramAssocMeasures()
scored = finder.score_ngrams(bigram_measures.raw_freq)

# By default finder.score_ngrams is sorted, however don't rely on this default behavior.
# Sort highest to lowest based on the score.
scoredList = sorted(scored, key=itemgetter(1), reverse=True)

# word_dict is the dictionary we'll use for the word cloud.
# Load dictionary with the FOR loop below.
# The dictionary will look like this with the bigram and the score from above.
# word_dict = {'bigram A': 0.000697411,
#             'bigram B': 0.000524882}

word_dict = {}

# Join each bigram tuple into one string to use as the dictionary key,
# and store its score as the value.
for bigram, score in scoredList:
    word_dict['_'.join(bigram)] = score



# -----

If your dataset guarantees that every key contains at most one word with an underscore:

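from operator import itemgetter

# Start from word_dict (string keys) rather than scored, which is a list of
# tuples and has no .items(); sort the key/value pairs by value, highest first.
scoredList = sorted(word_dict.items(), key=itemgetter(1), reverse=True)

new_data = {}
for key, value in scoredList:
    # Keep only the word that contains an underscore, e.g. 'requested_delivered'.
    words = [word for word in key.split(' ') if '_' in word]
    if len(words) == 1:
        new_data[words[0]] = value  # a repeated bigram overwrites the earlier value
    elif len(words) > 1:
        raise ValueError(f'more than one underscore word in {key!r}')

print(new_data)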
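
Several keys in the question's dictionary reduce to the same bigram, so the loop above keeps whichever value is written last. The following is a minimal sketch of how the duplicate policy could be made explicit. The helper names extract_bigram and collapse, the how parameter, and the sample values are all hypothetical and chosen only for illustration; which policy is right depends on the requirement.

```python
from collections import defaultdict


def extract_bigram(key):
    """Return the single underscore-joined word in key, or None if there is none."""
    words = [w for w in key.split() if '_' in w]
    if len(words) > 1:
        raise ValueError(f'more than one underscore word in {key!r}')
    return words[0] if words else None


def collapse(word_dict, how='first'):
    """Reduce phrase keys to bigrams; how selects the duplicate policy."""
    if how not in ('first', 'last', 'sum'):
        raise ValueError(f'unknown policy {how!r}')
    result = defaultdict(float) if how == 'sum' else {}
    for key, value in word_dict.items():
        bigram = extract_bigram(key)
        if bigram is None:
            continue  # key has no underscore word, so drop it
        if how == 'sum':
            result[bigram] += value
        elif how == 'last' or bigram not in result:
            result[bigram] = value
    return dict(result)


# Hypothetical scores (not the values from the question), picked to be easy to follow.
sample = {
    'delivered later requested_delivered later requested': 0.5,
    'delayed delivery_delayed delivery': 0.25,
    'delayed delivery_delayed pickup': 0.125,
    'guidelines followed': 0.0625,  # hypothetical key without an underscore
}

for policy in ('first', 'last', 'sum'):
    print(policy, collapse(sample, policy))
```

With a dictionary already sorted from highest to lowest score, 'first' keeps the highest score for each bigram, which appears to match the values shown in the expected output.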

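Edited into a fully self-contained example; what error are you seeing? Looking at your dataset more closely, there seem to be some duplicates. What is the expected behavior? (For example, delivery_delayed appears at least twice.)

It looks like scoredList is not a dict; I fixed the line that sorts by value and changed the for loop accordingly.

Duplicates in the data are expected behavior.

I can think of several ways to handle them: take only the first value? The last value? Sum them? What is the requirement?

If you plan to remove all words that contain only letters, use re.sub(r'\s*(? [truncated]

In the expected output, should the spaces before and after 'delivery_delayed' be kept?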
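
The re.sub suggestion in the comments is cut off, so the following is not a reconstruction of it. It is a separate, minimal regex sketch of the same idea, assuming the keys contain only lowercase letters, spaces and underscores (as they would after the preprocess step in the question). The names LETTERS_ONLY and strip_unigrams are hypothetical.

```python
import re

# Matches words made only of lowercase letters. '_' counts as a word character,
# so \b never falls inside 'requested_delivered' and that token is left alone.
LETTERS_ONLY = re.compile(r'\b[a-z]+\b')


def strip_unigrams(key):
    """Drop letters-only words from key, then tidy the leftover whitespace."""
    return ' '.join(LETTERS_ONLY.sub('', key).split())


print(strip_unigrams('delivered later requested_delivered later requested'))
```

This version strips the surrounding spaces, whereas the expected output in the question shows keys such as ' requested_delivered ' with spaces kept; if those spaces matter, the final split and join would need to be dropped. A key with no underscore word comes back as an empty string, so such keys would have to be filtered out before building the word cloud.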