Python: How to create a custom NLTK stop word file


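I have collected about 1,200 text files, each roughly 5,000 words long. Each file is a transcript of a conference call and contains company names. I would like to process the files to remove the company names, because they are repeated so often that they are irrelevant to the other processing I want to do. To do this, I am trying to write a script that creates a custom set of stop words for the company names. My idea is to create a file in the same format as the NLTK stop word lists, so that it can be used in the same way.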

Below is a snippet of the company names input file ('stops_Analyst_Companies.txt'): 'Atlantic Equities LLP, Avon Capital Advisors, Barclays Capital, Bernstein, BGC Partners, BMO Capital Markets U.S. ...'. There are about 80 names in total.
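Since the input file is one long comma-separated string, a possible way to turn it into a stop list is to split on commas and write one lowercase name per line, which is the layout NLTK's own stop word files use. This is only a sketch: the `raw` sample, the `parse_names` helper and the temporary output path are stand-ins, and it assumes that no company name contains a comma itself.

```python
import os
import tempfile

# Hypothetical sample standing in for the contents of stops_Analyst_Companies.txt
raw = "Atlantic Equities LLP, Avon Capital Advisors, Barclays Capital, Bernstein"

def parse_names(text):
    """Split a comma-separated string into lowercase, whitespace-stripped names."""
    return [name.strip().lower() for name in text.split(",") if name.strip()]

names = parse_names(raw)

# Write one entry per line (a temporary directory is used here for illustration).
out_path = os.path.join(tempfile.mkdtemp(), "cln_stops_Analyst_Companies.txt")
with open(out_path, "w", encoding="utf-8") as outfile:
    outfile.write("\n".join(names) + "\n")

# Reading the file back gives one name per line.
with open(out_path, encoding="utf-8") as infile:
    reloaded = infile.read().splitlines()

print(reloaded)
```

Each entry is still a multi-word name, so it would not match single tokens produced by `word_tokenize`; that would have to be handled when the names are removed.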

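I then use another script to remove the company names from the transcript files, remove the NLTK English stop words, and pickle the result for later use. The NLTK stop words are removed successfully, but the words in the custom stop word file are not.
What I am attempting is several steps beyond my limited Python ability, so any advice and guidance would be greatly appreciated.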

Here is the script that creates the custom stop word file for the analyst companies:

import os, os.path, sys, nltk, re, pprint, pickle
with open('stops_Analyst_Companies.txt','r') as stops_Analyst_Companies:
    stops_Analyst_Companies = stops_Analyst_Companies.read()
stops_Analyst_Companies= [w.lower() for w in stops_Analyst_Companies]
stops_Analyst_Companies = str(stops_Analyst_Companies)
outfile = open ('cln_stops_Analyst_Companies.txt', 'w')
outfile.write(stops_Analyst_Companies)
Below is a snippet from the 'cln_stops_Analyst_Companies.txt' file: ["a", "t", "l", "a", "n", "t", "i", "c", "e", "q", "u", "i", "t", "i", "e", "s", "l", "l", "p", "p", "a", "v", "o", "n", "c", "a", "p", "i", "a", "d", "v", "i", "s", "o", "r", "s", "s", "s", "d", "v", "i", "o", "o", "n", "n", "c", "c", "a", "p", "i", "t", "a", "l", "l", "a", "d", "d", "v", "i", "i", "i", "o", "o", "r", "s", "s", "s", "s", "s", "s", "s", "d", "s", "s", "s", "s", "d", "d", "d", "d", ...
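The single-character entries are what one would expect from iterating over the string returned by `read()`: a Python `str` yields one character at a time, and `str()` on the resulting list writes its repr, brackets and quotes included. A minimal illustration, using a short hypothetical sample in place of the real file contents:

```python
# Hypothetical sample standing in for what .read() returns from the input file
text = "Atlantic Equities LLP, Avon Capital Advisors"

# Iterating over a str yields single characters, not words or names.
chars = [w.lower() for w in text]
print(chars[:8])

# str() on a list produces its repr, so brackets, quotes and commas
# all end up in the output file as literal text.
as_written = str(chars)
print(as_written[:24])
```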

Here is the script that removes the stop names from the transcript files:

import os, os.path, sys, nltk, re, pprint, pickle
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
with open('TestStopWords_1.txt','r') as fin:
    wordtokens=word_tokenize(fin.read())
lowcase= [w.lower() for w in wordtokens]
# Remove NLTK Stopwords
nostops = [w for w in lowcase if not w in stopset]
print ('NLTK Stopset Words Removed')
print (' ')
print (nostops)
print (' ')
with open ('cln_stops_Analyst_Companies.txt', 'r') as cln_stops_Analyst_Companies:
    customstops = cln_stops_Analyst_Companies.read()
nostops = [w for w in nostops if not w in customstops]
print ('Analyst Companies Names Removed')
print (' ')
print (nostops )
nostops = str(nostops)
with open ("nostops.pickle", 'wb') as outfile:
    pickle.dump (nostops, outfile)
print (' ')
print (' Pickle File Created')
print (' ')
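One likely reason the custom names survive: `customstops` is the raw string returned by `read()`, so `w in customstops` is a substring test against text that only contains single quoted characters. The sketch below first reproduces that behaviour, then shows one possible alternative that removes each full company name as a phrase before tokenizing. The `company_names` list, the `strip_company_names` helper and the regex tokenizer (used in place of `word_tokenize` to keep the example dependency-free) are illustrative assumptions, not part of the original scripts.

```python
import re

# Roughly what the first script writes: the repr of a list of characters.
customstops_text = str(list("atlantic equities llp"))

tokens = ["analyst", "atlantic", "equities", "llp", "a", "results"]

# `in` against a str is a substring test, so whole words are not found in
# "['a', 't', 'l', ...]"; only single characters such as "a" match.
kept = [w for w in tokens if w not in customstops_text]
print(kept)

# Hypothetical cleaned list of full company names, lowercased.
company_names = ["atlantic equities llp", "avon capital advisors"]

def strip_company_names(text, names):
    """Lowercase text and blank out each full company name as a phrase."""
    # Longest names first so a short name cannot clip part of a longer one.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(" ", text.lower())

sample = "Analyst Companies, Atlantic Equities LLP, Avon Capital Advisors raised capital"
cleaned = strip_company_names(sample, company_names)

# Simple regex tokenizer standing in for word_tokenize.
remaining = re.findall(r"[a-z0-9']+", cleaned)
print(remaining)
```

Removing whole phrases keeps ordinary uses of words such as "capital" in the transcript, which a per-word stop list built from the company names would also strip. The pattern assumes the words of a name are separated by single spaces in the text.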
Here is a snippet from the TestStopWords file:

Analyst Companies, Atlantic Equities LLP, Avon Capital Advisors

Here is a snippet from the 'nostops.pickle' file:


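'analyst', 'companies', 'atlantic', 'equities', 'llp', 'avon', 'capital', 'advisors'

Can you format your question properly? Right now it is unclear what you are asking, because the examples are all misformatted and it is hard to tell which parts are files and which parts are code. Once the question is formatted properly, we can help you better.

Hi Alvas, thanks for the guidance, I will edit the formatting to improve readability. I entered the code with a 4-space indent and then indented the code after the "with" statements by another 4 spaces. However, the second indentation apparently did not show up in the post. Cheers, BobS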
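A side note on the pickle step, offered as an observation rather than a confirmed cause of the problem: the removal script calls `str(nostops)` before `pickle.dump`, so the file stores one long string instead of a list of tokens. If the later processing expects a list, pickling the list directly would preserve it. A small in-memory comparison with hypothetical sample tokens:

```python
import pickle

# Hypothetical sample tokens
nostops = ["analyst", "companies", "atlantic", "equities"]

# Pickling str(nostops) round-trips to a single string.
as_text = pickle.loads(pickle.dumps(str(nostops)))

# Pickling the list itself round-trips to an equal list.
as_list = pickle.loads(pickle.dumps(nostops))

print(type(as_text).__name__, type(as_list).__name__)
```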