NLP: I'm doing sentiment analysis on movie reviews, but I've run into a text-processing problem in my code

My dataset has 42,000 rows. This is the code I use to clean the text before vectorizing it. The problem is that it contains a nested for loop, which I think makes it very slow; I can't run it on more than about 1,500 rows. Can anyone help me find a better way?

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

corpus = []
for i in range(2):  # only the first 2 rows here; the full dataset has 42,000
    # keep letters only, lowercase, and split into words
    rev = re.sub('[^a-zA-Z]', ' ', df['text'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords.words("english"):
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)

The most time-consuming part of the code is the stopword check: every pass through the inner loop calls stopwords.words("english") again, rebuilding the stopword list for each word. It is better to fetch the stopword list once and reuse the same collection on every iteration.
I rewrote the code as follows (the other differences are only for readability):
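The only functional change is that the stopword list is fetched once into stopwords_set; the loop body is the same one profiled in the second measurement below:

import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

df = pd.read_csv('IMDB_Dataset.csv')  # (50000, 2)
reviews = df['review'][:4000]
corpus = []
stopwords_set = stopwords.words('english')  # fetched once, before the loop
for i in range(len(reviews)):
    rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    rev = rev.lower()
    rev = rev.split()
    filtered = []
    for word in rev:
        if word not in stopwords_set:  # membership test against the cached list
            word = PorterStemmer().stem(word)
            filtered.append(word)
    filtered = " ".join(filtered)
    corpus.append(filtered)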

I measured the speed of the code you posted.
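The per-line timings are in line_profiler format. As a rough sketch, a measurement like this can be reproduced with the following commands (the script name profile_nltk.py is assumed, not from the original post):

pip install line_profiler
kernprof -l -v profile_nltk.py

kernprof runs the script and reports per-line timings for every function decorated with @profile.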

The measurements are as follows:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     435819.0 435819.0      0.3      df = pd.read_csv('IMDB_Dataset.csv')  # (50000, 2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        247.0    247.0      0.0      reviews = df['review'][:4000]
    13         1          0.0      0.0      0.0      corpus = []
    14      4001     216341.0     54.1      0.1      for i in range(len(reviews)):
    15      4000     221885.0     55.5      0.2          rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    16      4000       3878.0      1.0      0.0          rev = rev.lower()
    17      4000      30209.0      7.6      0.0          rev = rev.split()
    18      4000       1097.0      0.3      0.0          filtered = []
    19    950808     235589.0      0.2      0.2          for word in rev:
    20    946808  115658060.0    122.2     78.2              if word not in stopwords.words("english"):
    21    486614   30898223.0     63.5     20.9                  word = PorterStemmer().stem(word)
    22    486614     149604.0      0.3      0.1                  filtered.append(word)
    23      4000      11290.0      2.8      0.0          filtered = " ".join(filtered)
    24      4000       1429.0      0.4      0.0          corpus.append(filtered)
As @parsa abbasi pointed out, the stopword check accounts for about 80% of the total time.

The measurements for the modified script are shown below. The stopword check now takes roughly 1/100 of the time it did before.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     8                                           @profile
     9                                           def profile_nltk():
    10         1     441467.0 441467.0      1.4      df = pd.read_csv('IMDB_Dataset.csv')  # (50000, 2)
    11         1          1.0      1.0      0.0      filtered = []
    12         1        335.0    335.0      0.0      reviews = df['review'][:4000]
    13         1          1.0      1.0      0.0      corpus = []
    14         1       2696.0   2696.0      0.0      stopwords_set = stopwords.words('english')
    15      4001      59013.0     14.7      0.2      for i in range(len(reviews)):
    16      4000     186393.0     46.6      0.6          rev = re.sub('[^a-zA-Z]', ' ', df['review'][i])
    17      4000       3657.0      0.9      0.0          rev = rev.lower()
    18      4000      27357.0      6.8      0.1          rev = rev.split()
    19      4000        999.0      0.2      0.0          filtered = []
    20    950808     220673.0      0.2      0.7          for word in rev:
    21                                                       # if word not in stopwords.words("english"):
    22    946808    1201271.0      1.3      3.8              if word not in stopwords_set:
    23    486614   29479712.0     60.6     92.8                  word = PorterStemmer().stem(word)
    24    486614     141242.0      0.3      0.4                  filtered.append(word)
    25      4000      10412.0      2.6      0.0          filtered = " ".join(filtered)
    26      4000       1329.0      0.3      0.0          corpus.append(filtered)
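In the modified profile, the stemming call now dominates (92.8% of the time), and part of that cost is constructing a new PorterStemmer for every single word. As a further sketch beyond the measurements above, the stemmer can be created once and the stopword list turned into a set for O(1) membership tests (the helper name preprocess is assumed, not from the original code):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stopwords_set = set(stopwords.words('english'))  # a set gives O(1) lookups
stemmer = PorterStemmer()                        # construct the stemmer once

def preprocess(text):
    # same pipeline: keep letters only, lowercase, split, drop stopwords, stem
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    return " ".join(stemmer.stem(w) for w in words if w not in stopwords_set)

corpus = [preprocess(rev) for rev in df['review'][:4000]]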

I hope this helps.
