Opening and preprocessing a file in Python NLTK

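I'm new to Python and NLTK and would really appreciate your advice. I want to open my own txt file and do some preprocessing, such as replacing words using regular expressions. I tried to do it the way the NLTK 2.0 Cookbook does: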

import re

# Contraction patterns and their expansions; \g<1> refers to the first captured group
replacement_patterns = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"i'm", "i am"),
    (r"ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+t)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]


class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        # Compile each regex once so replace() can reuse it
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, line):
        s = line
        # Apply every pattern in order; earlier, more specific patterns win
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s

Thanks in advance.

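I think you should first read the entire file contents into a single string. readlines() returns a list of lines, and passing a list to replace() fails because the re functions expect a string. Also keep the return value, since replace() returns a new string instead of modifying its argument:

import nltk
from replacers import RegexpReplacer

# Read the whole file into one string (readlines() would give a list of lines)
with open("C:/nltk_data/file.txt", "r") as f:
    raw = f.read()

replacer = RegexpReplacer()
# replace() returns the processed text, so store it
cleaned = replacer.replace(raw)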
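To make the difference between handling the file as one string and as a list of lines concrete, here is a minimal, self-contained sketch. It is an illustration under assumptions, not the Cookbook's code: the pattern list is shortened, the helper names `preprocess_file` and `preprocess_lines` are hypothetical, and a temporary file stands in for your own txt file.

```python
import re
import tempfile
from pathlib import Path

# Shortened pattern list, for illustration only
PATTERNS = [
    (r"won't", "will not"),
    (r"can't", "cannot"),
    (r"(\w+)'re", r"\g<1> are"),
]


class RegexpReplacer:
    def __init__(self, patterns=PATTERNS):
        self.patterns = [(re.compile(regex), repl) for regex, repl in patterns]

    def replace(self, text):
        # text must be a str; a list of lines would raise a TypeError
        for pattern, repl in self.patterns:
            text = pattern.sub(repl, text)
        return text


def preprocess_file(path, replacer):
    """Hypothetical helper: read the whole file as one string, then replace."""
    return replacer.replace(Path(path).read_text(encoding="utf-8"))


def preprocess_lines(path, replacer):
    """Hypothetical helper: apply the replacements one line at a time."""
    with open(path, encoding="utf-8") as f:
        return [replacer.replace(line) for line in f]


# Demo on a temporary file that stands in for your own txt file
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "file.txt"
    sample.write_text("I can't go.\nThey're late and won't wait.\n", encoding="utf-8")
    replacer = RegexpReplacer()
    whole = preprocess_file(sample, replacer)
    lines = preprocess_lines(sample, replacer)

print(whole)
```

Both variants give the same text here. Processing line by line is mainly useful when the file is large or when you want to keep the line structure for later steps such as tokenizing.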