Remove word suffixes in Python


python string


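I have a text with several words, and I want to remove all derivational suffixes from them. For example, I want to remove the suffixes -ed and -ing and keep the base verb: if I have "verifying" or "verified", I want to keep "verify". I found Python's strip method, which removes a specific string from the beginning or end of a string, but that is not quite what I want. Is there a Python library that does this?

I tried running the code from the suggested post and noticed strange trimming in several words. For example, I have the following text: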

 We goin all the way
 Think ive caught on to a really good song ! Im writing
 Lookin back on the stuff i did when i was lil makes me laughh
 I sneezed on the beat and the beat got sicka
 #nashnewvideo http://t.co/10cbUQswHR
 Homee
 So much respect for this man , truly amazing guy @edsheeran
 http://t.co/DGxvXpo1OM"
 What a day ..
 RT @edsheeran: Having some food with @ShawnMendes
 #VoiceSave  christina
 Im gunna make the sign my signature pose
 You all are so beautiful .. soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention"""

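After running the code below (in which I also remove non-Latin characters and URLs), I get, for example, beauti instead of beautiful and realli instead of really. My code is as follows: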

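import csv
import re
import string

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()  # build the stemmer once, not once per word
stemmed_rows = []         # one stemmed string per CSV row

reader = csv.reader(f)    # f: the already opened CSV file
print(doc)
for row in reader:
    # Drop @mentions and URLs from the tweet text in the third column
    text = re.sub(r"(?:@|https?://)\S+", "", row[2])
    # Keep printable ASCII only; the filtered result has to be assigned back
    text = "".join(ch for ch in text if ch in string.printable)
    # Remove punctuation, then replace digits and other non-word characters with spaces
    out = text.translate(str.maketrans("", "", string.punctuation))
    out = re.sub(r"[\W\d]", " ", out.strip())
    # Lowercase and stem every word, then rejoin the row
    stemmed_rows.append(" ".join(porter.stem(word.lower()) for word in out.split()))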

Instead, you can use a lemmatizer. Here is an example with Python's NLTK:

from nltk.stem import WordNetLemmatizer

s = """
 You all are so beautiful soooo beautiful
 Thought that was a really awesome quote
 Beautiful things don't ask for attention
 """

wnl = WordNetLemmatizer()
print(" ".join(wnl.lemmatize(word) for word in s.split()))
# You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention

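In some cases it may not give what you expect, since lemmatize treats every word as a noun unless you pass a part of speech:

print(wnl.lemmatize('going'))  # going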

You can then combine the two approaches: stemming and lemmatization.
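
One way to combine the two, as a rough sketch rather than a recommended recipe: try the lemmatizer first and fall back to the stemmer only when the lemmatizer leaves the word unchanged. The `normalize` helper and the toy stand-ins below are hypothetical names made up for illustration, so the snippet runs without NLTK; they do not reproduce real NLTK behaviour.

```python
def normalize(word, lemmatize, stem):
    """Return the lemma if the lemmatizer changed the word, otherwise the stem."""
    lemma = lemmatize(word)
    if lemma != word:
        return lemma
    return stem(word)


# Toy stand-ins (assumptions for this sketch, not real NLTK output).
TOY_LEMMAS = {"things": "thing", "quotes": "quote"}


def toy_lemmatize(word):
    # Dictionary lookup: unknown words come back unchanged.
    return TOY_LEMMAS.get(word, word)


def toy_stem(word):
    # Rule-based stripping of two suffixes, keeping at least two characters.
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


words = ["things", "going", "beautiful"]
result = [normalize(w, toy_lemmatize, toy_stem) for w in words]
print(result)  # ['thing', 'go', 'beautiful']
```

With NLTK you could pass `wnl.lemmatize` and a `PorterStemmer().stem` in place of the toy functions. Note that a word which is already its own lemma also comes back unchanged from the lemmatizer, so with a real stemmer this fallback may still trim words such as beautiful.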

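Your question is rather general, but if your text is static and already defined, the best approach is to write your own stemmer. The Porter and Lancaster stemmers follow their own rules for stripping affixes, while the WordNet lemmatizer removes an affix only if the resulting word is in its dictionary.

You could write something like this: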

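import re

SUFFIXES = ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']
# Non-greedy prefix, so the longest matching suffix wins ("processes" -> "process" + "es")
SUFFIX_RE = re.compile(r'^(.*?)(%s)$' % '|'.join(SUFFIXES))


def stem(word):
    """Strip the first matching suffix from word, or return word unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word


def stemmer(phrase):
    """Print (stem, suffix) pairs for each word in phrase that ends in a known suffix."""
    pairs = []
    for word in phrase.split():  # iterate over words, not characters
        pairs.extend(SUFFIX_RE.findall(word))
    print(pairs)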

So for "processing processes" you will get:

>>> stemmer('processing processes')
[('process', 'ing'), ('process', 'es')]
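
A hand-written suffix list like this can over-strip short words (for example, "was" would lose its final "s"). A possible guard, sketched here under the assumption that a minimum stem length is acceptable for your text, is to skip a suffix when too little of the word would remain. `strip_suffix`, `stem_phrase`, and `MIN_STEM_LEN` are hypothetical names introduced for this illustration.

```python
# Same suffixes as above, ordered longest first so longer suffixes are tried first.
SUFFIXES = ("ment", "ious", "ing", "ies", "ive", "ly", "ed", "es", "s")
MIN_STEM_LEN = 3  # assumed threshold; tune it for your own text


def strip_suffix(word):
    """Remove the first matching suffix unless the remaining stem would be too short."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def stem_phrase(phrase):
    """Lowercase the phrase, split it on whitespace, and stem each word."""
    return [strip_suffix(word) for word in phrase.lower().split()]


stems = stem_phrase("Processing processes was")
print(stems)  # ['process', 'process', 'was']
```

The threshold is a blunt instrument: it protects very short words but will not stop every unwanted trim, so it is worth checking the output against a sample of your own text.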

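Yes, stemming is the word I was looking for. I tried the example from that post, but I noticed severe trimming of words. With the lemmatizer I got the following result: You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention. Would it be wise to use lemmatization first and then stemming?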
