Filtering text data in Python


I'm having a hard time understanding what I'm doing wrong. The code below is quite simple:

def compileWordList(textList, wordDict):
    '''Function to extract words from text lines exc. stops,
        and add associated line nums'''
    i = 0;
    for row in textList:
        i = i + 1
        words = re.split('\W+', row)
        for wordPart in words:
            word = repr(wordPart)
            word = word.lower()
            if not any(word in s for s in stopsList):
                if word not in wordDict:
                    x = wordLineNumsContainer()
                    x.addLineNum(i)
                    wordDict[word] = x
                elif word in wordDict:
                    lineNumValues = wordDict[word]
                    lineNumValues.addLineNum(i)
                    wordDict[word] = lineNumValues
            elif any(word in s for s in stopsList):
                print(word)

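The code takes a sentence string from a list. It then uses re.split to break the string into whole words, which returns a list of strings containing the words.

Then I force the string to lower case. Next I want to check whether the word exists in a list of stop words, because I have too many common English words. The part that checks whether the word is in stopsList never seems to work, because the stop words end up in my wordDict every time. I also added the print(word) statement at the bottom to check whether it was catching them, but nothing is ever printed.

The strings being passed in contain hundreds of stop words.

Could someone please enlighten me here? Why are the strings never filtered out for being stop words?

Many thanks,
Alex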

How about this:

from collections import defaultdict
import re

stop_words = set(['a', 'is', 'and', 'the', 'i'])
text = [ 'This is the first line in my text'
       , 'and this one is the second line in my text'
       , 'I like texts with three lines, so I added that one'
       ]
# missing keys start out as an empty list
word_line_dict = defaultdict(list)

for line_no, line in enumerate(text, 1):
    # lower-case the words of this line and drop duplicates
    words = set(map(str.lower, re.split(r'\W+', line)))
    words_ok = words.difference(stop_words)
    for wok in words_ok:
        word_line_dict[wok].append(line_no)

print(dict(word_line_dict))

Many thanks to Gnibbler for a better way to write the for loop and a more pythonic way to handle the first insert into the dict.

which prints (apart from the formatting of the dict):

{ 'added': [3]
, 'like': [3]
, 'that': [3]
, 'this': [1, 2]
, 'text': [1, 2]
, 'lines': [3]
, 'three': [3]
, 'one': [2, 3]
, 'texts': [3]
, 'second': [2]
, 'so': [3]
, 'in': [1, 2]
, 'line': [1, 2]
, 'my': [1, 2]
, 'with': [3]
, 'first': [1]
}
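
The same idea can be wrapped in a function for Python 3. This is only a sketch: the name build_word_index and its parameters are made up for illustration and do not come from the answer above. It also drops the empty strings that re.split produces when a line starts or ends with punctuation, which the snippet above does not do.

```python
import re
from collections import defaultdict


def build_word_index(lines, stop_words):
    """Map each non-stop word to the 1-based line numbers it appears on."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, 1):
        # The set comprehension lower-cases and de-duplicates the words of one
        # line; `if word` drops empty strings left by leading/trailing punctuation.
        words = {word.lower() for word in re.split(r"\W+", line) if word}
        for word in words - stop_words:
            index[word].append(line_no)
    return dict(index)


text = [
    "This is the first line in my text",
    "and this one is the second line in my text",
    "I like texts with three lines, so I added that one",
]
index = build_word_index(text, {"a", "is", "and", "the", "i"})
print(index["this"])   # [1, 2]
print("the" in index)  # False
```

Because lines are visited in order, every list of line numbers comes out sorted without an extra sort step.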

Immediate red flag: not any(word in s for s in stopsList) and any(word in s for s in stopsList). Read them again. These expressions make no sense!

What is the purpose of word = repr(wordPart)? It turns Hello into 'Hello', i.e. a string that includes the quote characters, so the words will never match the stop words unless those are also entered as "'a'", "'the'" and so on. Beyond that, as Santa pointed out, any(word in s for s in stopsList) has an obvious problem. I think it should be if word not in stopsList:

@Sergey I had if word not in stopsList before, but that didn't work either, so I changed it. Still, the problem you describe makes sense, thanks.

@AlexW I guess if word not in stopsList didn't work because of the repr. Also, the logic of any(word in s for s in stopsList) is somewhat different and probably not what you want: it checks whether word is a substring of any word in stopsList. That is, if your stopsList is ['apple', 'banana'], it will exclude the words a, p, l, e, app, nana, etc. But, as I said, because of the repr all your words are quoted, so they don't match anything.

Wow, that's really cool, thanks. I'll have a look tomorrow. Thanks again!

In Python 2.6+ you can use for line_no, line in enumerate(text, 1):. Also, has_key is deprecated (Python 3 dicts don't have it); you should use if wok not in word_line_dict. Better still, make word_line_dict a defaultdict(list).

If you are using Python 2.7, you can replace the map with a set comprehension, which is a bit more pythonic: words = {word.lower() for word in re.split('\W+', line)}
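
The two pitfalls raised in the comments can be reproduced in a few lines, followed by one possible corrected version of the function from the question. This is a sketch under assumptions: the stop list is made up, compile_word_list is a renamed variant that returns a new dict, and a defaultdict(list) stands in for the wordLineNumsContainer class, whose definition is not shown in the question.

```python
import re
from collections import defaultdict

stops_list = ["apple", "banana", "the"]  # hypothetical stop words

# Pitfall 1: repr() wraps the word in quote characters.
quoted = repr("The").lower()
print(quoted)                               # 'the'  (the quotes are part of the string)
print(quoted in stops_list)                 # False

# Pitfall 2: `word in s` is a substring test when s is a string.
print(any("app" in s for s in stops_list))  # True
print("app" in stops_list)                  # False


def compile_word_list(text_list, stops):
    """Collect 1-based line numbers for every word that is not a stop word."""
    word_dict = defaultdict(list)  # stand-in for wordLineNumsContainer (assumption)
    stop_set = set(stops)          # set membership is an exact, fast match
    for line_no, row in enumerate(text_list, 1):
        for word_part in re.split(r"\W+", row):
            word = word_part.lower()  # no repr(): keep the bare word
            if word and word not in stop_set:
                word_dict[word].append(line_no)
    return word_dict


result = compile_word_list(["The apple fell", "the app crashed"], stops_list)
print(dict(result))  # {'fell': [1], 'app': [2], 'crashed': [2]}
```

Note that app survives here even though apple is a stop word, because membership in a set compares whole strings rather than substrings.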