Python preprocessing script does not remove punctuation

python string nlp

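I have code that preprocesses a list of text documents. That is, given a list of text documents, it returns a list in which each document has been preprocessed. For some reason, though, the punctuation removal does not work.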

import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download("stopwords")
nltk.download('punkt')
nltk.download('wordnet')


def preprocess(docs):
  """ 
  Given a list of documents, return each document as a string of tokens, 
  stripping out punctuation 
  """
  clean_docs = [clean_text(i) for i in docs]
  tokenized_docs = [tokenize(i) for i in clean_docs]
  return tokenized_docs

def tokenize(text):
  """ 
  Tokenizes text -- returning the tokens as a string 
  """
  stop_words = stopwords.words("english")
  nltk_tokenizer = nltk.WordPunctTokenizer().tokenize
  tokens = nltk_tokenizer(text)  
  result = " ".join([i for i in tokens if i not in stop_words])
  return result


def clean_text(text): 
  """ 
  Cleans text by removing case
  and stripping out punctuation. 
  """
  new_text = make_lowercase(text)
  new_text = remove_punct(new_text)
  return new_text

def make_lowercase(text):
  new_text = text.lower()
  return new_text

def remove_punct(text):
  text = text.split()
  punct = string.punctuation
  new_text = " ".join(word for word in text if word not in string.punctuation)
  return new_text

# Get a list of titles  
s1 = "[UPDATE] I am tired"
s2 = "I am cold."

clean_docs = preprocess([s1, s2])
print(clean_docs)

This prints:

['[ update ] tired', 'cold .']

In other words, the punctuation is not stripped: "[", "]" and "." all appear in the final output.

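You are testing whether each whole word occurs in the punctuation string. Clearly, [UPDATE] is not a punctuation character, so it is never filtered out.

Instead, search the text for each punctuation character and replace it: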

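import string


def remove_punctuation(text: str) -> str:
    # Replace every punctuation character with the empty string.
    for p in string.punctuation:
        text = text.replace(p, '')
    return text


if __name__ == '__main__':
    text = "[UPDATE] I am tired"
    print(remove_punctuation(text))

# Output:
# UPDATE I am tired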
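
The same idea can also be written with str.translate, which deletes all punctuation characters in a single pass instead of looping over replace. This is only a minimal sketch: the name remove_punct_translate is hypothetical (it is not from the original post), and it assumes that only the ASCII characters in string.punctuation need to be removed.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which deletes it. Assumption: only string.punctuation must be removed.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punct_translate(text: str) -> str:
    """Delete punctuation characters, then collapse leftover whitespace."""
    stripped = text.translate(_PUNCT_TABLE)
    return " ".join(stripped.split())


if __name__ == "__main__":
    print(remove_punct_translate("[UPDATE] I am tired"))
    print(remove_punct_translate("I am cold."))
```

Used in place of remove_punct in the question's clean_text, this should remove the brackets and the full stop before the text reaches the tokenizer.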

Brackets may not be counted as punctuation. You could try

if word not in string.punctuation and word not in "[{()}]"

or something similar.
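
A quick check with the standard library suggests the brackets are in fact part of string.punctuation. The test in the question fails for a different reason: `in` between two strings is a substring test, and a whole word such as "[update]" is never a substring of the punctuation string. The sketch below illustrates both points; the helper name is_punct_token is hypothetical and is only one possible way to filter tokens after tokenization.

```python
import string

# Brackets are part of string.punctuation in the standard library.
brackets_included = all(ch in string.punctuation for ch in "[](){}")

# `in` on two strings is a substring test, so a whole word such as
# "[update]" is never found inside string.punctuation.
word_found = "[update]" in string.punctuation


def is_punct_token(token: str) -> bool:
    """Hypothetical helper: True if every character of token is punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)


if __name__ == "__main__":
    print(brackets_included)
    print(word_found)
    tokens = ["[", "update", "]", "tired", "..."]
    print([t for t in tokens if not is_punct_token(t)])
```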