
Python: Tokenizing text scraped from the web


I am trying to get my code to scrape a web page and then print the ten most common English words. However, my code finds the most common paragraphs/sentences rather than the most common words. So instead of the top ten words, I get garbage like this:

[("Well, Prince, so Genoa and Lucca are now just family estates of the\nBuonapartes. But I warn you, if you don't tell me that this means war,\nif you still try to defend the infamies and horrors perpetrated by\nthat Antichrist- I really believe he is Antichrist- I will have\nnothing more to do with you and you are no longer my friend, no longer\nmy 'faithful slave,' as you call yourself! But how do you do? I see\nI have frightened you- sit down and tell me all the news.", 1),
('If you have nothing better to do, Count [or Prince], and if the\nprospect of spending an evening with a poor invalid is not too\nterrible, I shall be very charmed to see you tonight between 7 and 10-\nAnnette Scherer.',   1),
('Heavens! what a virulent attack!', 1),
("First of all, dear friend, tell me how you are. Set your friend's\nmind at rest,",   1),
('Can one be well while suffering morally? Can one be calm in times\nlike these if one has any feeling?',   1),
('You are\nstaying the whole evening, I hope?', 1), 
("And the fete at the English ambassador's? Today is Wednesday. I\nmust put in an appearance there,",   1),
('My daughter is\ncoming for me to take me there.', 1),
("I thought today's fete had been canceled. I confess all these\nfestivities and fireworks are becoming wearisome.",   1),

My code is:

import nltk

from nltk.corpus import stopwords
from nltk import word_tokenize

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
soup = BeautifulSoup(html, 'html.parser')

nameList = [tag.text for tag in soup.findAll("span", {"class":"red"})]

filtered_words = [word for word in nameList if word not in stopwords.words('english')]  

fdist1 = nltk.FreqDist(nameList)
fdist1.most_common(10)
I tried to tokenize nameList by adding token = nltk.word_tokenize(nameList), but I ended up with TypeError: expected string or bytes-like object.

Can a tokenizer even be used on scraped web text? I also tried splitting with nameList.split(), but then I got AttributeError: 'list' object has no attribute 'split'.


How can I turn this text into individual words?
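
Both errors share one cause, for what it's worth: word_tokenize and str.split operate on a single string, while nameList is a list. A minimal sketch of sidestepping that by joining the texts first (my illustration, assuming the scraping code above has run):

import nltk
from nltk import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt tokenizer model

joined = ' '.join(nameList)     # one long string instead of a list
tokens = word_tokenize(joined)  # works now: the input is a string
print(tokens[:10])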

nameList is a list of texts. It does not itself contain individual words, so you cannot process it correctly as-is. You have the following errors:

  • You are searching over whole texts, not over the words inside the texts
  • FreqDist is counting nameList (the texts), not the filtered words
  • You should replace the last block of code with this:

    # Collect filtered words across all texts
    filtered_words = []
    # Check all texts
    for text in nameList:
        # Replace EOLs with ' ', split by ' ' and filter stopwords
        filtered_words += [word for word in text.replace('\n', ' ').split(' ') if word not in stopwords.words('english')]

    # Count word frequencies and take the ten most common
    fdist1 = nltk.FreqDist(filtered_words)
    fdist1.most_common(10)
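
    A small refinement (my suggestion, not part of the original answer): reuse
    the stop_words set built at the top of the question; membership tests on a
    set are far cheaper than calling stopwords.words('english') on every
    iteration, and lowercasing first also catches capitalized stopwords:

    filtered_words = []
    for text in nameList:
        # Set lookup is O(1); lowercase so 'The' matches the stopword 'the'
        filtered_words += [word for word in text.replace('\n', ' ').split(' ')
                           if word.lower() not in stop_words]

    fdist1 = nltk.FreqDist(filtered_words)
    fdist1.most_common(10)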
    

    Additionally, nltk has a tokenize submodule that can (and should) be used
    instead of manual splitting, especially on natural-language text:

    nltk.tokenize.casual_tokenize(nameList[2])

    returns:

    ['Heavens', '!', 'what', 'a', 'virulent', 'attack', '!']
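
    Putting the pieces together, a sketch of the whole pipeline built on
    casual_tokenize (my combination of the steps above, not verbatim from the
    answer): tokenize each text, drop stopwords and punctuation, then count.

    from nltk.tokenize import casual_tokenize

    filtered_words = []
    for text in nameList:
        # Keep alphabetic tokens that are not stopwords; isalpha() drops '!'
        filtered_words += [tok for tok in casual_tokenize(text)
                           if tok.isalpha() and tok.lower() not in stop_words]

    nltk.FreqDist(filtered_words).most_common(10)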
    


    Maybe something like this can help you:

    First use re.split() on each element (sentence) of nameList:

    import re
    nameList_splitted = [re.split(';|,|\n| ', x) for x in nameList]

    This gives you a list of single-word lists. You can then merge them into
    one final list like this:

    list_of_words = []
    for list_ in nameList_splitted:
        list_of_words += list_

    The result is:

    ['Well',
     '',
     'Prince',
     '',
     'so',
     'Genoa',
     'and',
     'Lucca',
     'are',
     'now',
    ...
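
    Note the empty strings in the result: re.split yields one wherever two
    delimiters are adjacent (a comma followed by a space, for example). A
    sketch of dropping them before counting (my addition, reusing the
    stop_words set from the question):

    # Discard empty strings and stopwords, then count frequencies
    words = [w for w in list_of_words if w and w.lower() not in stop_words]
    nltk.FreqDist(words).most_common(10)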
    


    To tokenize nameList, you have to do it item by item. In any case, each
    tag.text effectively contains more than one word, so that is your problem.

    Thanks Vurmux. May I know what the [2] in
    nltk.tokenize.casual_tokenize(nameList[2]) means?

    nameList is a list of texts (somewhat like an array in C). [2] refers to
    the third text in it (lists count from zero).
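
    A minimal sketch of that item-by-item tokenization (my illustration,
    reusing casual_tokenize from the first answer):

    from nltk.tokenize import casual_tokenize

    # Tokenize every text in nameList and flatten into one token list
    all_tokens = [tok for text in nameList for tok in casual_tokenize(text)]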