Remove punctuation and create a Python dictionary

python python-3.x function dictionary punctuation

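I am trying to create a function that removes the punctuation from a string and converts it to lowercase. It should then return a dictionary that counts how often each word occurs in the string.

This is the code I have written so far: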

def word_dic(string):
    string = string.lower()
    new_string = string.split(' ')
    result = {}

    for key in new_string:
        if key in result:
            result[key] += 1
        else:
            result[key] = 1

    for c in result:
        "".join([ c if not c.isalpha() else "" for c in result])

    return result

But this is what I get when I run it:

{'am': 3,
 'god!': 1,
 'god.': 1,
 'i': 2,
 'i?': 1,
 'thanks': 1,
 'to': 1,
 'who': 2}

I only need to remove the punctuation at the end of the words.

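The expression

    "".join([ c if not c.isalpha() else "" for c in result])

builds a new string, but nothing is done with it; it is thrown away immediately because you never store the result. (As written, it also iterates over the keys of result rather than over the characters of each word, so it would not strip the punctuation even if the result were stored.)

Really, the best approach is to normalize the keys before counting them in result. For example, you could do the following: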

for key in new_string:
    # Keep only the alphabetic parts of each key, and replace key for future use
    key = "".join([c for c in key if c.isalpha()])
    if key in result:
        result[key] += 1
    else:
        result[key] = 1

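Now result never has keys with punctuation (and the counts for "god." and "god!" are summed under the single key "god"), and no second pass is needed afterwards to strip the punctuation.

Alternatively, if you only care about leading and trailing punctuation on each word (so that "it's" is kept as is rather than converted to "its"), you can simplify further. Just import string, then change:

    key = "".join([c for c in key if c.isalpha()])

to a line that strips punctuation from the end of the key only, for example with key.rstrip(string.punctuation).

This matches what you specifically asked for in the question (remove punctuation at the end of words, but not at the beginning of a word or embedded in it).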
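Putting the pieces together, a minimal sketch of the whole function might look like the following. It assumes that str.rstrip with string.punctuation is the kind of replacement line meant above; the name count_words and the strip_both_ends flag are hypothetical additions for illustration and are not part of the original answer.

```python
import string


def count_words(text, strip_both_ends=False):
    """Count case-insensitive word frequencies, trimming punctuation at word edges."""
    result = {}
    for key in text.lower().split():
        if strip_both_ends:
            key = key.strip(string.punctuation)   # trim leading and trailing marks
        else:
            key = key.rstrip(string.punctuation)  # trim trailing marks only
        if not key:
            continue  # the token was nothing but punctuation
        result[key] = result.get(key, 0) + 1
    return result


sample = "It's me. 'Who' am I? I am!"
print(count_words(sample))
print(count_words(sample, strip_both_ends=True))
```

In both modes the apostrophe inside "it's" survives, because only the edges of each word are touched. The leading quote of 'Who' is removed only when strip_both_ends is true.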


Another option is to use Python's own facilities for counting the words and for replacing everything that is not a word character.
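The code for this suggestion is not shown here, so the following is only a guess at what it could look like. It assumes re.sub is used to replace the non-word characters and collections.Counter to do the counting; the function name count_words_regex is hypothetical.

```python
import re
from collections import Counter


def count_words_regex(text):
    """Replace every non-word character with a space, then count the pieces."""
    # \W matches anything that is not a letter, a digit or an underscore.
    cleaned = re.sub(r"\W", " ", text.lower())
    return Counter(cleaned.split())


print(count_words_regex("God! god. Who am I?"))
```

Because every non-word character becomes a space, this variant also splits words at embedded punctuation such as apostrophes.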

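You can use string.punctuation to identify the punctuation and collections.Counter to count the occurrences once the string has been split up properly.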

from collections import Counter
from string import punctuation

line = "It's a test and it's a good ol' one."

Counter(word.strip(punctuation) for word in line.casefold().split())
# Counter({"it's": 2, 'a': 2, 'test': 1, 'and': 1, 'good': 1, 'ol': 1, 'one': 1})

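Using str.strip instead of str.replace preserves words such as "it's".

The method str.casefold is simply a more general form of str.lower: it lowercases more aggressively and is intended for caseless comparison.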
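A small sketch of both points, using made-up sample words: it compares stripping punctuation at the edges with removing every punctuation character, and str.lower with str.casefold.

```python
from string import punctuation

word = "it's."

# strip only touches the two ends of the string.
stripped = word.strip(punctuation)

# Removing each punctuation character with replace also deletes the apostrophe.
replaced = word
for mark in punctuation:
    replaced = replaced.replace(mark, "")

print(stripped)  # it's
print(replaced)  # its

# casefold goes further than lower for some characters, such as the German sharp s.
print("Straße".lower())     # straße
print("Straße".casefold())  # strasse
```

For plain English text the two case methods give the same result, so the choice only matters when the input may contain such characters.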

If you want to reuse the words later, you can store each one together with its number of occurrences in a sub-dictionary, so that every word has its own entry in the dictionary. Writing our own function to remove punctuation is quite simple. See whether the code below does what you need:

def remove_punctuation(word):
    for c in word:
        if not c.isalpha():
            word = word.replace(c, '')
    return word


def word_dic(s):
    words = s.lower().split(' ')
    result = {}

    for word in words:
        word = remove_punctuation(word)

        if not result.get(word, None):
            result[word] = {
                'word': word,
                'ocurrences': 1,
            }
            continue
        result[word]['ocurrences'] += 1  

    return result


phrase = 'Who am I and who are you? Are we gods? Gods are we? We are what we are!'
print(word_dic(phrase))

You will get output like this:

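{
    'who': {
        'word': 'who',
        'ocurrences': 2},
    'am': {
        'word': 'am',
        'ocurrences': 1},
    'i': {
        'word': 'i',
        'ocurrences': 1},
    'and': {
        'word': 'and',
        'ocurrences': 1},
    'are': {
        'word': 'are',
        'ocurrences': 5},
    'you': {
        'word': 'you',
        'ocurrences': 1},
    'we': {
        'word': 'we',
        'ocurrences': 4},
    'gods': {
        'word': 'gods',
        'ocurrences': 2},
    'what': {
        'word': 'what',
        'ocurrences': 1}
}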

You can then easily access each word and its number of occurrences like this:

word_dic(phrase)['are']['word']       # output: are
word_dic(phrase)['are']['ocurrences'] # output: 5
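As one illustration of reusing the nested structure, the sketch below sorts the entries by count. The sample dictionary is a hand-written excerpt in the same shape as the output above (including the answer's 'ocurrences' spelling of the key), and most_frequent is a hypothetical helper name.

```python
# Hand-written excerpt in the same shape as the output above.
result = {
    'who': {'word': 'who', 'ocurrences': 2},
    'are': {'word': 'are', 'ocurrences': 5},
    'we': {'word': 'we', 'ocurrences': 4},
    'what': {'word': 'what', 'ocurrences': 1},
}


def most_frequent(entries, n):
    """Return the n most frequent words as (word, count) pairs."""
    ranked = sorted(entries.values(), key=lambda e: e['ocurrences'], reverse=True)
    return [(e['word'], e['ocurrences']) for e in ranked[:n]]


print(most_frequent(result, 2))  # [('are', 5), ('we', 4)]
```

When working with a real phrase, it is probably worth storing the returned dictionary in a variable once rather than calling the counting function again for every lookup.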


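Nice (+1), but I would use more lines and variables for readability.

Converting non-word characters to spaces and then splitting will cause "it's" to be treated as the two words "it" and "s". Usually you want to split on whitespace and then strip the punctuation, or strip the punctuation and then split on whitespace, rather than convert the punctuation to spaces.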
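The difference described in this comment can be seen in a short sketch. The sample sentence is made up, and the regular expression is just one assumed way of turning punctuation into spaces.

```python
import re
from string import punctuation

text = "It's a test, isn't it?"

# Punctuation converted to spaces, then split: contractions fall apart.
as_spaces = re.sub(r"[^\w\s]", " ", text.lower()).split()

# Split on whitespace first, then strip punctuation from each word's edges.
split_then_strip = [word.strip(punctuation) for word in text.lower().split()]

print(as_spaces)         # ['it', 's', 'a', 'test', 'isn', 't', 'it']
print(split_then_strip)  # ["it's", 'a', 'test', "isn't", 'it']
```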