Python: how to write code that gets values from a dictionary without creating the dictionary?

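I need to write code that removes specific words from a text. After some research, I found that the best approach is to replace all of those words with "", but replace() is not a good option because it also removes the characters from inside other words. I found the re.sub() function and want to write code that replaces words in a given text (the words are defined in a separate list). Most tutorials require creating a replacement dictionary. I don't have one, so I want to define something that checks the stopwords list and, whenever one of those words is found in the text, replaces it with "".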

Here is my code:

stopwords = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]

import csv
import re
from functools import partial

sentences = []
labels = []
with open(path_bbc, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)
    for row in reader: #this line of code divides my file into separate lists (it's for an ML model)
        labels.append(row[0])
        sentence = row[1]
        for word in sentence:
            #this is where I don't know how to define what I want the code to do. The line below, of course, isn't working
            replacements = for word in stopword {word: " "}
            def helper(dic, match):
                word = match.group(0)
                return dic.get(word, word)
            
            word_re = re.compile(r'\b[a-zA-Z]+\b')
            sentence_rep = word_re.sub(partial(helper, replacements), sentence)
            sentences.append(sentence_rep) #here I want to append my temporary list where the operation of replacing words was happening to my final list sentences
This is the expected output:

#Expected output
# 2225
# tv future hands viewers home theatre systems plasma high-definition tvs digital video recorders moving living room way people watch tv will radically different five years time. according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes. us leading trend programmes content will delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices. one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes like us s tivo uk s sky+ system allow people record store play pause forward wind tv programmes want. essentially technology allows much personalised tv. also built-in high-definition tv sets big business japan us slower take off europe lack high-definition programming. not can people forward wind adverts can also forget abiding network channel schedules putting together a-la-carte entertainment. us networks cable satellite companies worried means terms advertising revenues well brand identity viewer loyalty channels. although us leads technology moment also concern raised europe particularly growing uptake services like sky+. happens today will see nine months years time uk adam hume bbc broadcast s futurologist told bbc news website. likes bbc no issues lost advertising revenue yet. pressing issue moment commercial uk broadcasters brand loyalty important everyone. will talking content brands rather network brands said tim hanlon brand communications firm starcom mediavest. reality broadband connections anybody can producer content. added: challenge now hard promote programme much choice. means said stacey jolna senior vice president tv guide tv group way people find content want watch simplified tv viewers. means networks us terms channels take leaf google s book search engine future instead scheduler help people find want watch. 
kind channel model might work younger ipod generation used taking control gadgets play them. might not suit everyone panel recognised. older generations comfortable familiar schedules channel brands know getting. perhaps not want much choice put hands mr hanlon suggested. end kids just diapers pushing buttons already - everything possible available said mr hanlon. ultimately consumer will tell market want. 50 000 new gadgets technologies showcased ces many enhancing tv-watching experience. high-definition tv sets everywhere many new models lcd (liquid crystal display) tvs launched dvr capability built instead external boxes. one example launched show humax s 26-inch lcd tv 80-hour tivo dvr dvd recorder. one us s biggest satellite tv companies directtv even launched branded dvr show 100-hours recording capability instant replay search function. set can pause rewind tv 90 hours. microsoft chief bill gates announced pre-show keynote speech partnership tivo called tivotogo means people can play recorded programmes windows pcs mobile devices. reflect increasing trend freeing multimedia people can watch want want.
The file has the following format:

['category', 'text']
[tech, tv future hands viewers home theatre systems plasma high-definition tvs digital video recorders moving living room way people watch tv will radically different five years time. according expert panel gathered annual consumer electronics show las vegas discuss new technologies will impact one favourite pastimes. us leading trend programmes content will delivered viewers via home networks cable satellite telecoms companies broadband service providers front rooms portable devices. one talked-about technologies ces digital personal video recorders (dvr pvr). set-top boxes like us s tivo uk s sky+ system allow people record store play pause forward wind tv programmes want. essentially technology allows much personalised tv. also built-in high-definition tv sets big business japan us slower]

It looks like you are looking for:

# ... sentence = row[1]
sentence = re.sub(
    r'\s?\b(?:%s)\b' % '|'.join(stopwords),
    '', row[1], flags=re.I)

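You could re.compile(…) the regex before the loop and then call compiled_re.sub("", sentence) inside the loop, but Python caches compiled regexes for you, so you would be duplicating work that Python already does behind the scenes (in some versions of Python, doing it yourself is actually slower).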
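For comparison, the precompiled variant mentioned above might look like the sketch below. The names STOPWORDS, STOPWORD_RE and strip_stopwords are illustrative, the short stopword list is a stand-in for the full list in the question, and the sample row is made up.

```python
import re

# Hypothetical short list standing in for the full stopword list in the question.
STOPWORDS = ["a", "about", "and", "is", "the"]

# Compile once, outside the loop. re.escape guards against regex metacharacters
# in case a stopword ever contains one.
STOPWORD_RE = re.compile(
    r"\s?\b(?:%s)\b" % "|".join(map(re.escape, STOPWORDS)),
    flags=re.I,
)

def strip_stopwords(sentence):
    """Remove every stopword, plus one preceding whitespace character if present."""
    return STOPWORD_RE.sub("", sentence)

# Made-up sample row in the [category, text] layout described in the question.
rows = [["tech", "The future is about tv and video"]]
cleaned = [strip_stopwords(row[1]) for row in rows]
```

A stopword at the very start of a sentence leaves a leading space behind, so calling .strip() on the result may still be worthwhile.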

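A regular expression can remove all of the words in a single step with re.sub(), using a pattern that contains all of the stopwords separated by pipes.

You need to put word boundaries around the pattern so that you do not replace part of a longer word that happens to contain one of the stopwords:
\b(…)\b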
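To illustrate why the word boundaries matter, here is a small sketch comparing str.replace() with a \b-anchored pattern. The sample text is made up for the illustration.

```python
import re

text = "a cat sat on a mat"

# str.replace removes the letter wherever it appears, even inside other words.
naive = text.replace("a", "")

# With \b on both sides, only the standalone word "a" is matched.
bounded = re.sub(r"\ba\b", "", text)

print(naive.split())    # words damaged by the naive replacement
print(bounded.split())  # words left intact by the bounded pattern
```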

import re

stopwords   = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "could", "did", "do", "does", "doing", "down", "during", "each", "few", "for", "from", "further", "had", "has", "have", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "it", "it's", "its", "itself", "let's", "me", "more", "most", "my", "myself", "nor", "of", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "she", "she'd", "she'll", "she's", "should", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "we", "we'd", "we'll", "we're", "we've", "were", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "would", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"]
wordPattern = re.compile(r"\b(" + "|".join(stopwords) + r")\b",flags=re.I)

def removeWords(S):
    result  = wordPattern.sub("",S,)          # remove words
    return re.sub(r" +"," ",result.strip())   # remove extra spaces


removeWords("am I bad and being a robot again")
'bad robot'
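One caveat, offered as a sketch rather than as part of the answer above: regex alternation tries the alternatives from left to right, and in the stopword list "he" comes before "he'd" and "i" before "i'm". Because the apostrophe counts as a word boundary, the shorter word matches first and the "'d" or "'m" is left behind. Sorting the alternatives longest first appears to avoid this. The helper names build_pattern and clean and the short list are hypothetical.

```python
import re

# Hypothetical short list; in the full list "he" likewise precedes "he'd".
stopwords = ["he", "he'd", "i", "i'm"]

def build_pattern(words):
    # Longest alternatives first, so "he'd" is tried before "he".
    by_length = sorted(words, key=len, reverse=True)
    return re.compile(
        r"\b(?:" + "|".join(map(re.escape, by_length)) + r")\b", flags=re.I
    )

def clean(pattern, text):
    # Remove the matches, then collapse the runs of spaces they leave behind.
    return re.sub(r" +", " ", pattern.sub("", text).strip())

# Pattern built in list order, as in the answer above.
unordered = re.compile(r"\b(?:" + "|".join(stopwords) + r")\b", flags=re.I)
ordered = build_pattern(stopwords)

leftover = clean(unordered, "he'd say I'm late")  # contraction tails remain
fixed = clean(ordered, "he'd say I'm late")       # whole contractions removed
```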
In your code, replace the for loop with:

next(reader)
labels, sentences = zip(*((L, removeWords(S)) for L, S, *_ in reader))
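Put together, the approach can be tried end to end with an in-memory file. This is a minimal sketch: the io.StringIO object stands in for the real CSV file, the data rows are made up, the stopword list is shortened, and remove_words is a renamed copy of the function above.

```python
import csv
import io
import re

# Hypothetical short list standing in for the full stopword list.
stopwords = ["a", "and", "the", "is"]
word_pattern = re.compile(r"\b(?:" + "|".join(stopwords) + r")\b", flags=re.I)

def remove_words(text):
    result = word_pattern.sub("", text)          # remove the stopwords
    return re.sub(r" +", " ", result.strip())    # collapse leftover spaces

# In-memory stand-in for the CSV file; the data rows are invented.
csvfile = io.StringIO(
    "category,text\n"
    "tech,the tv is a screen\n"
    "sport,a win and a loss\n"
)

reader = csv.reader(csvfile, delimiter=",")
next(reader)  # skip the header row
labels, sentences = zip(
    *((label, remove_words(text)) for label, text, *_ in reader)
)
```

Note that zip() returns tuples rather than lists, and that the unpacking raises ValueError if the file has no data rows, so an empty file would need separate handling.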

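Can you provide the input file, or a sample of the input, together with the exact output in the desired format?
Rather than reinventing basic NLP logic, use an existing library such as scikit or NLTK.
@SmartyMyly I edited the post to add that information.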