Python: removing punctuation from a list

I am taking a sample of the Declaration of Independence and counting the frequency of word lengths in it.

Sample text from the file:

"When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires 
that they should declare the causes which impel them to the separation."
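Note: the word lengths must not include any punctuation, i.e. anything in string.punctuation.

Expected result (sample):

I am currently stuck on removing the punctuation from the file, which I have converted to a list.

Here is what I have tried so far: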

import sys
import string

def format_text(fname):
        punc = set(string.punctuation)
        words = fname.read().split()
        return ''.join(word for word in words if word not in punc)

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an arguement.')
    sys.exit()

fname = open(sys.argv[1], 'r')
formatted_text = format_text(fname)
print(formatted_text)

You can strip the punctuation from each word, and also avoid reading the whole file into memory:

punc = string.punctuation
return ' '.join(word.strip(punc) for line in fname for word in line.split())
If you also want to remove the apostrophe from Nature's, you need translate:

from string import punctuation

# use ord of characters you want to replace as keys and what you want to replace them with as values
tbl = {ord(k):"" for k in punctuation}
return ' '.join(line.translate(tbl) for line in fname)
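To make the difference between the two approaches concrete, here is a small sketch. The sample words come from the question's text; the table is built with str.maketrans, which I am assuming is acceptable as an equivalent way to construct the mapping in Python 3.

```python
import string

punc = string.punctuation

# strip() only removes punctuation at the two ends of a word.
print("earth,".strip(punc))       # earth
print("Nature's".strip(punc))     # Nature's (the inner apostrophe stays)

# translate() deletes punctuation anywhere in the string.
tbl = str.maketrans("", "", punc)
print("Nature's".translate(tbl))  # Natures
```

So with strip, Nature's counts as an 8-character word, while with translate it counts as 7.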
To get the frequencies, use a collections.Counter dict.

Or, depending on your approach:

freq = Counter(len(word.strip(punc)) for line in fname for word in line.split())
Using the lines from your question as an example:

lines =""""When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."""

from collections import Counter
freq = Counter(len(word.strip(punctuation)) for line in lines.splitlines() for word in line.split())
print(freq.most_common()) 
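This outputs a list of (key, value) tuples ordered from the most common to the least common word length; the key is the length and the second element is its frequency: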

[(3, 15), (2, 12), (4, 9), (5, 9), (6, 9), (7, 7), (8, 5), (9, 3), (1, 1), (10, 1)]
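As a quick check of that ordering, here is a tiny sketch on a made-up sentence (not the question's text), short enough to count by hand:

```python
from collections import Counter

# Five 2-letter words and one 3-letter word.
freq = Counter(len(word) for word in "to be or not to be".split())

# most_common() orders the pairs by count, highest first.
print(freq.most_common())  # [(2, 5), (3, 1)]
```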
If you want to output the frequencies in order of word length, starting from 1-letter words, without sorting:

mx = max(freq)  # longest word length seen (max of the keys, not of the counts)
for i in range(1, mx+1):
    v = freq[i]
    if v:
        print("length {} words appeared {} time/s.".format(i, v) )
Output:

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
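Unlike a normal dict, a Counter does not raise a KeyError for a missing key; it returns 0 instead, so if v is only true for word lengths that actually appear in the file.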
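A short sketch of that behaviour, using made-up counts rather than the question's data. The sorted() loop at the end is just one possible alternative to the range() loop, not something the code above requires.

```python
from collections import Counter

freq = Counter({1: 1, 3: 2})  # made-up counts: no words of length 2

print(freq[2])    # 0, no KeyError
print(2 in freq)  # False, the lookup did not add the key

plain = dict(freq)
try:
    plain[2]
except KeyError:
    print("a plain dict raises KeyError")

# Alternative: visit only the lengths that occur, in ascending order.
for length, count in sorted(freq.items()):
    print("length {} words appeared {} time/s.".format(length, count))
```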

If you also want to print the cleaned data, put the logic into functions:

import sys
import string
from collections import Counter


def clean_text(fname):
    punc = string.punctuation
    return [word.strip(punc) for line in fname for word in line.split()]


def get_freq(cleaned):
    return Counter(len(word) for word in cleaned)


def freq_output(d):
    mx = max(d)  # longest word length seen (max of the keys, not of the counts)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))

try:
    filename = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

# Open the file once; clean_text() iterates over it line by line.
fname = open(filename, 'r')
formatted_text = clean_text(fname)

print(" ".join(formatted_text))
print()
freq = get_freq(formatted_text)

freq_output(freq) 
Running it on the snippet from your question outputs:

~$ python test.py test.txt
When in the Course of human events it becomes necessary for one people  
to dissolve the political bands which have connected them with another
and to assume among the powers of the earth the separate and equal station 
 to which the Laws of Nature and of Nature's God entitle them a decent 
respect to the opinions of mankind requires that they should declare 
the causes which impel them to the separation

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
If you only care about the frequency output, do it all in one pass:

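import sys
import string
from collections import Counter


def freq_output(fname):
    from string import punctuation

    tbl = {ord(k): "" for k in punctuation}
    # Pick ONE of the two lines below: a file object can only be iterated once,
    # so a second pass over fname would see no lines at all.
    d = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())
    # d = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
    mx = max(d)  # longest word length seen (max of the keys, not of the counts)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))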


try:
    filename = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

# Open the file once; freq_output() iterates over it line by line.
fname = open(filename, 'r')

freq_output(fname)

Use either of the two approaches to build d.
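The reason only one of the two d = Counter(...) approaches should run per pass is that a file object is an iterator, so it is empty after the first loop over it. A minimal sketch, with io.StringIO standing in for the opened file and a made-up two-line input:

```python
import io
from collections import Counter
from string import punctuation

# io.StringIO stands in for an open text file here.
fname = io.StringIO("the earth, the separate\nand equal station\n")

first = Counter(len(w.strip(punctuation)) for line in fname for w in line.split())
second = Counter(len(w.strip(punctuation)) for line in fname for w in line.split())

print(first)   # counts for all seven words
print(second)  # Counter(), the file was already exhausted

fname.seek(0)  # rewind if a second pass is really needed
third = Counter(len(w.strip(punctuation)) for line in fname for w in line.split())
print(third == first)  # True
```

Calling max() on the empty second counter would raise a ValueError, which is what happens if both lines are left active.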

You can use translate to remove the punctuation (this two-argument form works in Python 2 only):

import string

words = fname.read().translate(None, string.punctuation).split()

py2.7

import string
from collections import defaultdict
from collections import Counter

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(None, string.punctuation).split()
            for length in map(len, words):
                counts[length] += 1
    return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(None, string.punctuation).split()))
    return counts

py3

I believe Counter was updated in Python 3.2 so that it is as fast as, or faster than, building the count dict by hand.

Also, translate works differently in Python 3: str.translate takes a single table, which can be built with str.maketrans:

import string
from collections import defaultdict
from collections import Counter

strip_punct = str.maketrans('','',string.punctuation)

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(strip_punct).split()
            for length in map(len, words):
                counts[length] += 1
    return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(strip_punct).split()))
    return counts
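If you want to check the speed claim on your own interpreter, something like the following sketch could be used. The input is a made-up in-memory list of lines so it does not depend on myfile.txt, the function names are mine, and the timings will vary by machine and Python version, so no numbers are shown.

```python
import string
import timeit
from collections import Counter, defaultdict

strip_punct = str.maketrans('', '', string.punctuation)
# Made-up input: the same short line repeated 1000 times.
lines = ["When in the Course of human events, it becomes necessary.\n"] * 1000

def count_defaultdict(lines):
    # Build the length counts by hand.
    counts = defaultdict(int)
    for line in lines:
        for length in map(len, line.translate(strip_punct).split()):
            counts[length] += 1
    return counts

def count_counter(lines):
    # Let Counter consume a generator of word lengths.
    return Counter(length for line in lines
                   for length in map(len, line.translate(strip_punct).split()))

# Both approaches must agree before their timings are worth comparing.
assert dict(count_defaultdict(lines)) == dict(count_counter(lines))

print("defaultdict:", timeit.timeit(lambda: count_defaultdict(lines), number=20))
print("Counter:    ", timeit.timeit(lambda: count_counter(lines), number=20))
```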


You can use a regular expression:

import re

def format_text(fname, pattern):
    # Read the whole file and delete every character the pattern matches.
    words = fname.read()
    return re.sub(pattern, '', words)

p = re.compile(r'[!&:;",.]')
with open('C:/Projects/ExplorePy/test.txt') as fh:
    text = format_text(fh, p)

Apply split() as needed; the pattern can be refined.
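One possible refinement, as a sketch: build the character class from string.punctuation instead of listing characters by hand. This assumes that "punctuation" means exactly the characters in string.punctuation; the sample text is a fragment of the question's text.

```python
import re
import string

# re.escape() makes characters such as ] and \ safe inside the class.
pattern = re.compile('[' + re.escape(string.punctuation) + ']')

text = "of the earth, the separate and equal station; Nature's God"
words = pattern.sub('', text).split()

print(words)
```

Like translate, this removes the apostrophe inside Nature's as well, so that word is counted with 7 letters.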

What exactly is the problem?

I am getting a NameError, word is not defined now...?

@Jay_R, no worries. You can iterate over the file object and split; there is no point in reading everything into memory unless you want to use it all.

So how are you going to handle the next part, counting the frequency of word lengths? A dict? A list?

That part is easy. I should add that I am a little confused about where to implement the code. My next step is another function that takes the output of the format_text function and counts the word lengths and their frequencies.

Can you add a link to where the docs say that Counter is slower than building the dict manually?

Here is a blog post about it. This question also mentions it. Sorry, it took me a while to find.

Interesting. I guess the bells and whistles on Counter account for the overhead. It is still surprising, given that counting is its exact purpose, that it is not always faster.