Parsing a text file with accented characters [Python]

python regex parsing

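I'm trying to iterate over a French text file and count how often each word occurs; the words include accented characters. The code below picks up all the words but does not handle the accented characters: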

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re

wordcount={}

f = open("verbatim2.txt", "r")
regex = re.compile(r'\b\w{4,}\b')
#regex = re.compile(r'[A-Z]\p{L}+\s*')

for line in f:
    words = regex.findall(line)
    for word in words:
        print word
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

for k,v in wordcount.items():
    print k, v

How can I correctly include accented characters in my wordcount dictionary?

Thanks

To count words of four or more characters without using a regular expression:

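import collections
import io

d = collections.Counter()

# io.open decodes the file to Unicode text (UTF-8 assumed here),
# so len() counts characters rather than bytes
with io.open('file', encoding='utf-8') as f:
    for line in f:
        # split() with no arguments splits on any run of whitespace
        words = (word for word in line.split() if len(word) >= 4)
        d.update(words)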
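
One caveat with plain whitespace splitting is that punctuation stays attached to the word, so "maison," and "maison" end up as separate keys. A minimal sketch of one way around that, assuming Python 3 (where str is already Unicode); the function name count_long_words and the min_len parameter are illustrative and not part of the original answer:

```python
import collections
import string


def count_long_words(lines, min_len=4):
    """Count whitespace-separated words of at least min_len characters."""
    counts = collections.Counter()
    for line in lines:
        for token in line.split():
            # Strip ASCII punctuation from both ends; accented letters are untouched
            word = token.strip(string.punctuation)
            if len(word) >= min_len:
                counts[word] += 1
    return counts


sample = ["Le garçon préfère la maison, la grande maison.", "Un élève français."]
counts = count_long_words(sample)
print(counts.most_common(3))
```

Note that string.punctuation only covers ASCII punctuation, so French guillemets would need to be added to the strip set by hand.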

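From the (v2.7) documentation for \w:

When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.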
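
The effect of that rule on the pattern from the question can be seen in a small experiment. This sketch assumes Python 3, where str patterns are Unicode-aware by default; the re.ASCII flag is used here only to approximate what Python 2 did when no flag was given, and the sample text is made up for illustration:

```python
import re

text = "français élève maison"
pattern = r"\b\w{4,}\b"

# Python 3 str patterns behave as if re.UNICODE were set
unicode_words = re.findall(pattern, text)

# re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters act as word breaks
ascii_words = re.findall(pattern, text, re.ASCII)

print(unicode_words)  # ['français', 'élève', 'maison']
print(ascii_words)    # ['fran', 'maison']
```

With ASCII-only matching, "français" is cut at the ç and "élève" falls apart into fragments shorter than four characters, which matches the symptom described in the question.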

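If you want to keep using regular expressions, I got it working by staying as close to your code as possible and adding flags=re.UNICODE (after fixing the syntax and usage errors). As mentioned, this question has already been answered elsewhere.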

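This appears to be a duplicate question.

It does look very close, but I can't see how to apply it to my problem.

Are you trying to count words of four or more characters?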

#!/usr/bin/env python
# -*- coding: utf-8 -*-   
import re

wordcount={}

f = open("verbatim2.txt", "r")
# re.UNICODE makes \w match accented letters, not only [a-zA-Z0-9_]
regex = re.compile(r'\b\w{4,}\b', re.UNICODE)
#regex = re.compile(r'[A-Z]\p{L}+\s*')

for line in f:
    # Decode the raw bytes to unicode before matching
    words = regex.findall(line.decode('utf8'))
    for word in words:
        print word
        if word not in wordcount:
            wordcount[word] = 1
        else:
            # Increment once per occurrence
            wordcount[word] += 1
f.close()

print wordcount
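
For readers on Python 3, roughly the same approach might look like the sketch below. It assumes the file is UTF-8 encoded, as in the answer above; the names WORD_RE, count_words and count_words_in_file are hypothetical helpers introduced for this illustration. No flag is needed because \w is already Unicode-aware for str patterns in Python 3.

```python
import re
from collections import Counter

# Words of four or more characters; \w matches accented letters for str in Python 3
WORD_RE = re.compile(r"\b\w{4,}\b")


def count_words(lines):
    """Return a Counter of regex-matched words across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(WORD_RE.findall(line))
    return counts


def count_words_in_file(path, encoding="utf-8"):
    """Open path as text with an explicit encoding and count its words."""
    with open(path, encoding=encoding) as f:
        return count_words(f)


demo = count_words(["L'élève préfère l'été.", "Un élève français préfère Noël."])
for word, n in demo.most_common():
    print(word, n)
```

Calling count_words_in_file("verbatim2.txt") would then replace the manual open/decode loop; passing the encoding explicitly avoids depending on the platform default.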