Python: treating two consecutive words as a single token when counting word frequency

python pandas

I have a sentence like this:

Sentence
    Who the president of Kuala Lumpur is?

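I am trying to extract all of the words (tokenization).

However, I want to extract Kuala Lumpur as a bigram, so I am thinking of a filter that says: "if two consecutive words both start with a capital letter, extract them as a single word". So if I have the following sentence: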

    Who the president of Kuala Lumpur is?

I would get (using the code above):

But I want this:

Word            Freq
who               1
the               1
president         1
of                1
Kuala Lumpur      1
is                1

I think that to find two consecutive capitalized words I should apply the following pattern:

pattern = r"[A-Z]{2}-\d{3}-[A-Z]{2}"

or:

re.findall('([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', df.Sentence.tolist())

But I do not know how to include this information in the code above.
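A note on the two attempts: the first pattern matches codes of the form `AB-123-CD` rather than capitalized words, and `re.findall` takes a single string, so the second pattern has to be applied to each sentence rather than to the list returned by `tolist()`. A minimal sketch, assuming a plain list (`sentences`, a stand-in name) in place of `df.Sentence`, whose contents are not shown here:

```python
import re

# Stand-in for df.Sentence.tolist(); the DataFrame itself is not shown in the question.
sentences = ["Who the president of Kuala Lumpur is?"]

# The first pattern looks for codes such as "AB-123-CD", not capitalized words.
code_like = r"[A-Z]{2}-\d{3}-[A-Z]{2}"
# The second pattern matches runs of two or more capitalized words.
capitalized_run = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+"

# Apply each pattern sentence by sentence.
code_matches = [re.findall(code_like, s) for s in sentences]
run_matches = [re.findall(capitalized_run, s) for s in sentences]

print(code_matches)  # [[]]
print(run_matches)   # [['Kuala Lumpur']]
```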

You can do some preprocessing and separate the bi-grams from the rest of the sentence. For example:

import re

# initialize sentence text
sentence_without_bigrams = 'Who the president of Kuala Lumpur or Other Place is?'
bigrams = []

# a capitalized word followed by one or more further capitalized words
pattern = re.compile(r'[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+')

# loop until there are no remaining bi-grams
while True:
    # find the next bi-gram
    match = pattern.search(sentence_without_bigrams)
    if match is None:
        break
    # add bi-gram to list of bi-grams
    bigrams.append(match.group())
    # remove the bi-gram and the whitespace before it from the sentence
    # (rstrip avoids the negative index that start() - 1 gives for a match at position 0)
    sentence_without_bigrams = (sentence_without_bigrams[:match.start()].rstrip()
                                + sentence_without_bigrams[match.end():])

print(bigrams)
# ['Kuala Lumpur', 'Other Place']

print(sentence_without_bigrams)
# Who the president of or is?

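However, this solution does not fully achieve your end goal, because a sentence like "Hello President Obama" will not be captured correctly (as mentioned earlier): the pattern takes any run of capitalized words, so all three words come back as one match.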
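One way around that limitation is to stop relying on capitalization and match against a list of known multi-word names instead. The sketch below assumes such a list exists; `KNOWN_PHRASES` and `tokenize_known` are hypothetical names introduced for illustration, and the list contents are made up.

```python
import re

# Hypothetical list of known multi-word names; in practice this would come from your own data.
KNOWN_PHRASES = ["Kuala Lumpur", "President Obama"]

# Try longer phrases first so that a longer name wins over a shorter one that is its prefix.
alternatives = sorted(KNOWN_PHRASES, key=len, reverse=True)
phrase_part = r"\b(?:" + "|".join(re.escape(p) for p in alternatives) + r")\b"

# A token is either a known phrase or a single word (letters, digits, underscores, hyphens).
pattern = re.compile(phrase_part + r"|\w[\w-]*")

def tokenize_known(sentence):
    """Split a sentence into tokens, keeping known phrases together."""
    return pattern.findall(sentence)

print(tokenize_known("Hello President Obama"))
# ['Hello', 'President Obama']
```

The trade-off is that only phrases on the list are merged, so the list has to be maintained by hand or built from another source.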

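Does this answer your question?

Hi Bill, that is what I have already done. I need to group the two words together once they have been identified, as in the expected output. Right now I have code that tokenizes the words one at a time, plus re.findall, which picks out consecutive capitalized words. But they are not counted together as one.

Lisa, yes, that's right. But I need to group them and count words with the pair treated as a single word. I am clear on how to split while taking two consecutive words into account, but not on how to get the count / word frequency.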
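For the count itself, one possible approach is to tokenize with a single regular expression that tries the capitalized-run pattern first and falls back to single words, then feed the tokens to `collections.Counter`. This is a sketch under the question's own assumption that a run of capitalized words is one name; `tokenize` and `word_freq` are hypothetical helper names, and lowercasing single words is a choice made here to match the expected table.

```python
import re
from collections import Counter

# Assumption from the question: a run of two or more capitalized words is one name.
PHRASE = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+"
# Try the multi-word pattern first, then fall back to a single word.
TOKEN = re.compile(PHRASE + r"|\w[\w-]*")

def tokenize(sentence):
    """Return tokens, keeping capitalized runs together and lowercasing single words."""
    tokens = []
    for tok in TOKEN.findall(sentence):
        if re.search(r"\s", tok):
            # Multi-word name: keep its case, normalize inner whitespace.
            tokens.append(" ".join(tok.split()))
        else:
            tokens.append(tok.lower())
    return tokens

def word_freq(sentences):
    """Count tokens over an iterable of sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(tokenize(s))
    return counts

freq = word_freq(["Who the president of Kuala Lumpur is?"])
for word, n in freq.items():
    print(f"{word:<16}{n}")
```

With a DataFrame, something along the lines of `df.Sentence.apply(tokenize).explode().value_counts()` should give the same counts as a pandas Series, assuming `df.Sentence` holds one sentence string per row.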
