Python: How to remove \n from the output when using nltk

I am using the sentence tokenizer, but how do I remove the unwanted \n from the output?

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import PyPDF2 as p2
pdf_file = open("Muhammad_CV.pdf", 'rb')
pdf_read = p2.PdfFileReader(pdf_file)
count = pdf_read.numPages

for i in range(count):
    page = pdf_read.getPage(i)
    text = page.extractText()            # extract the page text
    tokenized = sent_tokenize(text)      # split the text into sentences
    all_words = []
    for w in tokenized:
        all_words.append(w.lower())      # lowercase each sentence
    # Stop words
    stop_words = set(stopwords.words('english'))
    filtered = []
    for w in all_words:
        if w not in stop_words:
            filtered.append(w)
    print(filtered)
The output I get:

{'the specialization includes:\n \n\n \nintroduction\n \nto\n \ndata\n \nscience\n \n\n \nbig\n \ndata\n \n&\n \ncloud\n \ncomputing\n \n\n \ndata\n \nmining\n \n\n \nmachine\n \nlearn\ning'}
Desired output:

{'the specialization includes: introduction to data science big data cloud\n computing data mining machine learning'}

You just need to call the string strip() method to remove the surrounding whitespace.

Here is an example (it also uses comprehensions, since that is the Pythonic way :)

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import PyPDF2 as p2
pdf_file = open("Muhammad_CV.pdf", 'rb')
pdf_read = p2.PdfFileReader(pdf_file)
count = pdf_read.numPages

for i in range(count):
    page = pdf_read.getPage(i)
    text = page.extractText()
    tokenized = sent_tokenize(text)
    all_words = [w.strip().lower() for w in tokenized]
    stop_words = set(stopwords.words('english'))
    filtered = [w for w in all_words if w not in stop_words]
    print(filtered)

Edit: corrected trim to strip :)

@smassey trim removes leading and trailing whitespace, but according to Muhammad he wants to remove '\n', not spaces.
@Muhammad just edit the code above to: all_words = [w.replace('\n', '').lower() for w in tokenized]
Thanks. The replace function is very simple.
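
The difference between strip() and replace() that the comments point out can be seen on a small string. This is only a sketch: the sample string is an assumption, cut down from the output shown in the question, and no PDF or nltk is involved.

```python
# Shortened sample, modelled on the output shown in the question (an assumption).
sample = '\n introduction\n \nto\n \ndata\n \nscience \n'

# strip() only removes whitespace at the two ends of the string.
stripped = sample.strip()

# replace() removes every '\n', including the ones between words.
replaced = sample.replace('\n', '')

print(repr(stripped))
print(repr(replaced))
```

The stripped string still contains the newlines between the words, while the replaced string has none left but keeps the spaces that surrounded them.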
import nltk

text = '''\n Apple has quietly  hired Dr. Rajiv B. Kumar, a pediatric endocrinologist \n. He will continue working at the hospital part time \n '''

tokenized_sent_before_remove_n = nltk.sent_tokenize(text)
# Output:
# ['\n Apple has quietly  hired Dr. Rajiv B. Kumar, a pediatric endocrinologist \n.',
#  'He will continue working at the hospital part time']

tokenized_sent_after_remove_n = [x.replace('\n', '') for x in tokenized_sent_before_remove_n]
# Output:
# [' Apple has quietly  hired Dr. Rajiv B. Kumar, a pediatric endocrinologist .',
#  'He will continue working at the hospital part time']
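
Removing '\n' with replace() can still leave a leading space or double spaces behind, as the output above shows. One possible follow-up is to collapse the remaining whitespace with split() and join(). The helper name clean_sentence is hypothetical (it is not part of nltk), and the sample sentences are assumptions modelled on the outputs in this thread.

```python
def clean_sentence(sentence):
    """Remove newlines, then collapse runs of whitespace into single spaces."""
    without_newlines = sentence.replace('\n', '')
    # split() with no arguments splits on any run of whitespace and drops
    # empty pieces, so joining with one space also trims both ends.
    return ' '.join(without_newlines.split())


# Sample sentences modelled on the outputs shown in this thread (assumptions).
sentences = [
    '\n Apple has quietly  hired Dr. Rajiv B. Kumar, a pediatric endocrinologist \n.',
    'machine\n \nlearn\ning',
]
cleaned = [clean_sentence(s) for s in sentences]
print(cleaned)
```

Deleting '\n' before collapsing the spaces matters for text like 'learn\ning': replacing the newline with a space instead would split the word in two.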