Finding the most common words from a website in Python 3

python web-crawler

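I need to use Python 3 code to find and copy the words that appear more than 5 times on a given website, but I'm not sure how to do it. I've looked through the archives here on Stack Overflow, but the other solutions rely on Python 2 code. Here's the measly code I have so far: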

   from urllib.request import urlopen
   website = urlopen("http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart")

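Does anyone have any advice on what to do? I have NLTK installed and I've looked into Beautiful Soup, but for the life of me I can't work out how to install it correctly (I'm very new to Python!). Since I'm still learning, any explanation would be much appreciated. Thanks :)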

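So, this was written by a newbie, but if you just need a quick answer, I think this might work. Note that with this method you can't simply pass a URL to the program; you have to paste the text into the code by hand. (Sorry.)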

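To explain the `text.count(i) >= 5` part: each time through the for loop, it checks whether the current word is used five or more times in the text. The `and overusedwords.count(i) == 0:` part just makes sure the same word isn't added to the list of overused words twice.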
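The code this answer refers to did not survive on this page. As a rough reconstruction based only on the explanation above, the loop might look something like the sketch below. The names `text` and `overusedwords` come from the explanation; the variable `pasted` and its sample contents are made up for illustration.

```python
# Hypothetical reconstruction of the loop described above.
# `pasted` stands in for text copied from a web page by hand.
pasted = "la la la la la de de da la de de de"

# Split the pasted text into a list of words.
text = pasted.split()

overusedwords = []
for i in text:
    # Keep a word if it occurs five or more times and is not already listed.
    if text.count(i) >= 5 and overusedwords.count(i) == 0:
        overusedwords.append(i)

print(overusedwords)  # ['la', 'de']
```

Each `text.count(i)` call rescans the whole list, so this gets slow on long pages; `collections.Counter` does the same job in a single pass.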

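Hope I could help. I think you probably wanted a way to get this information directly by typing in a URL, but this might help other beginners with a similar problem.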

This isn't perfect, but it's an idea of how to get you started.

Here's how I would do it:

Install BeautifulSoup, as described in its documentation.

You need these imports:

from bs4 import BeautifulSoup
import re
from collections import Counter

Use BeautifulSoup to scrape the visible text on the site; this is explained elsewhere on Stack Overflow.

Get a list of words from the visible text with:

lst = re.findall(r'\b\w+', visible_text_string)

Convert every word to lowercase:

lst = [x.lower() for x in lst]

Count the occurrences of each word and build a list of (word, count) tuples:

counter = Counter(lst)
occs = [(word, count) for word, count in counter.items() if count > 5]

Sort occs by number of occurrences:

occs.sort(key=lambda x:x[1])
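Put together, the word-counting steps above could be sketched as one function. This is only an illustration: the function name `frequent_words`, its `threshold` parameter and the `sample` string are assumptions, and the input is taken to be the visible text already extracted from the page.

```python
import re
from collections import Counter


def frequent_words(visible_text, threshold=5):
    """Return (word, count) pairs for words seen more than `threshold` times."""
    # Pull out runs of word characters, then normalise case.
    words = [w.lower() for w in re.findall(r"\b\w+", visible_text)]
    counter = Counter(words)
    occs = [(word, count) for word, count in counter.items() if count > threshold]
    # Sort ascending by count, as in the last step above.
    occs.sort(key=lambda x: x[1])
    return occs


# Made-up sample text: "zouk" x7, "club" x6, "night" x2.
sample = "Zouk " * 7 + "club " * 6 + "night " * 2
print(frequent_words(sample))  # [('club', 6), ('zouk', 7)]
```

Passing `reverse=True` to `sort` would list the most frequent words first instead.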

scrapy, urllib, urllib2 and BeautifulSoup are your friends when it comes to munching data off websites.
It depends on the individual site and where the site's author put the text on the page. Most of the time you can find the text inside paragraph (p) elements. For example, on this site, the text you want is:

If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state. It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests. Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.

There is other text on the page, but usually you only want the main text, not the navigation bar and boilerplate. You can get it simply with:

   >>> import urllib2
   >>> from bs4 import BeautifulSoup as bsoup
   >>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
   >>> page = urllib2.urlopen(url).read()
   >>> for i in bsoup(page).find_all('p'):
   ...     print i.text.strip()
   ...
   If you only have time for one club in Singapore, then it simply has to be Zouk.
   Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
   It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut.
   Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music.
   Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
   Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream.
   Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.
   Find us on Facebook Twitter Youtube Wikipedia Singapore Reviews
   Copyright © 2013 Singapore Tourism Board. Website Terms of Use | Privacy Statement | Photo Credits

You'll notice that you get more than you really need, so you can narrow the search further by first selecting the element whose class is 'paragraph section' and only then reading the text inside it:

   >>> for i in bsoup(page).find_all(attrs={'class':'paragraph section'}):
   ...     print i.text.strip()
   ...
   If you only have time for one club in Singapore, then it simply has to be Zouk.
   Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
   It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut.
   Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music.
   Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
   Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream.
   Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.

Voila, you have the text. But as said before, how to munch the main text out of a page depends on how the page is written. Here's the full code:

   >>> import urllib2
   >>> from collections import Counter
   >>> from nltk import word_tokenize
   >>> from bs4 import BeautifulSoup as bsoup
   >>> page = urllib2.urlopen(url).read()
   >>> text = " ".join([i.text.strip() for i in bsoup(page).find_all(attrs={'class':'paragraph section'})])
   >>> word_freq = Counter(word_tokenize(text))
   >>> word_freq['Zouk']
   4
   >>> word_freq.most_common()
   [(u',', 8), (u'and', 8), (u'to', 4), (u'of', 4), (u'Zouk', 4), (u'is', 4), (u'the', 4), (u'its', 3),
   (u'has', 3), (u'in', 3), (u'a', 3), (u'only', 2), (u'for', 2), (u'one', 2), (u'clubs', 2),
   (u'exclusive', 1), (u'all', 1), (u'Velvet', 1), (u'just', 1), (u'dance', 1), (u'global', 1),
   (u'rest', 1), (u'Chemical', 1), (u'Oakenfold', 1), (u'it\u2019s', 1), (u'young', 1), (u'passage', 1),
   (u'main', 1), (u'neighbouring', 1), (u'then', 1), (u'than', 1), (u'means', 1), (u'famous', 1),
   (u'made', 1), (u'world', 1), (u'like', 1), (u'DJs', 1), (u'bar', 1), (u'name', 1), (u'countries', 1),
   (u'night', 1), (u'showcasing', 1), (u'Paul', 1), (u'people', 1), (u'house', 1), (u'ZoukOut.', 1),
   (u'up', 1), (u'\u2013', 1), (u'Underground', 1), (u'home', 1), (u'even', 1), (u'Singapore', 1),
   (u'city-state.', 1), (u'retro', 1), (u'international', 1), (u'rite', 1), (u'be', 1),
   (u'institution', 1), (u'reason', 1), (u'techno', 1), (u'both', 1), (u'nightspot', 1),
   (u'festival', 1), (u'experimental', 1), (u'Singapore\u2019s', 1), (u'own', 1), (u'savour', 1),
   (u'suggests.', 1), (u'Zouk\u2019s', 1), (u'simply', 1), (u'another', 1), (u'Probably', 1),
   (u'Jambo', 1), (u'spawned', 1), (u'from', 1), (u'Brothers', 1), (u'remains', 1), (u'leading', 1),
   (u'.', 1), (u'Phuture', 1), (u'Carl', 1), (u'more', 1), (u'on', 1), (u'club', 1), (u'relaxed', 1),
   (u'If', 1), (u'with', 1), (u'Wednesdays', 1), (u'room', 1), (u'Primal', 1), (u'while', 1),
   (u'three', 1), (u'at', 1), (u'racier', 1), (u'it', 1), (u'an', 1), (u'Zouk.', 1), (u'as', 1),
   (u'manner', 1), (u'have', 1), (u'nights', 1), (u'Malaysia', 1), (u'holds', 1), (u'also', 1),
   (u'other', 1), (u'repute', 1), (u'you', 1), (u'several', 1), (u'Sentosa\u2019s', 1), (u'Cox', 1),
   (u'Mambo', 1), (u'why', 1), (u'It', 1), (u'reputation', 1), (u'time', 1), (u'Scream.', 1),
   (u'music.', 1), (u'wine', 1)]

The above example comes from: Liling Tan and Francis Bond. 2011. Building and annotating the linguistically diverse NTU-MC (NTU-multilingual corpus). In Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation (PACLIC 25), Singapore.

I'd love to answer, but I'm also afraid of Python 3. I suggest you use it to avoid some headaches.

What operating system are you using?

The link you posted about downloading Beautiful Soup says to put the bs4 directory into your codebase. Where is the codebase?

@user3682157, wherever you put your code. That's the "if all else fails" option, though. What went wrong when you tried easy_install, pip, and so on?

Is requests part of the existing Python code, or do I need to download another module? If so, is there an easy way to do it? Sorry for the newbie questions! Thanks for your reply.

@user3682157, the easiest way to install packages for Python is pip. To install a package you would then run pip install requests; you may need sudo if you're on a Unix operating system. The link explains how to install it.
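The interactive session in the answer above is Python 2 (`urllib2`, the `print` statement, `u'...'` strings). Since the question asks for Python 3 and installing Beautiful Soup was the sticking point, here is a minimal sketch of the same idea using only the standard library. It is not the answerer's code: `html.parser.HTMLParser` stands in for `find_all('p')`, a simple regex stands in for `nltk.word_tokenize`, and the names `ParagraphText`, `word_frequencies` and `fetch`, as well as the sample HTML, are made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphText(HTMLParser):
    """Collect text found inside <p> elements (stdlib stand-in for find_all('p'))."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # True while the parser is inside a <p> element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def word_frequencies(html):
    """Count the words that appear in the page's paragraph text."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Simple regex tokenizer; nltk.word_tokenize would split differently.
    return Counter(re.findall(r"\w+", text))


def fetch(url):
    """Download a page. Assumes UTF-8; real pages may declare another encoding."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up HTML so the example runs without network access.
html = "<div>menu</div><p>Zouk is a club.</p><p>Zouk has three clubs.</p>"
freq = word_frequencies(html)
print(freq["Zouk"])                               # 2
print([w for w, c in freq.items() if c > 1])      # ['Zouk']
```

For a real page, `word_frequencies(fetch(url))` followed by a filter such as `c > 5` would give the words that occur more than five times. Counts will not match the session above exactly, because the tokenizer and the element selection differ.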