Hashtags python html

Tags: python, html, beautifulsoup, lxml, hashtag


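I want to extract all hashtags from a given website. For example, the sentence "I love stack overflow because people are very helpful!" (with three of its words written as hashtags) should put 3 hashtags into a table. My target website has a table with a description for each #hashtag, so we can find out that the hashtag "love" is about love.

Here is my work: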

    # Python 2 (urllib2 and the print statement are not available in Python 3)
    import re
    import urllib2

    from bs4 import BeautifulSoup

    # Specify the URL
    wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
    # Query the website and return the HTML to the variable 'page'
    page = urllib2.urlopen(wiki)
    # Parse the HTML in the 'page' variable and store it in Beautiful Soup format
    soup = BeautifulSoup(page, "lxml")
    print soup.prettify()
    # Collect the page text and pull out every word that follows a '#'
    s = soup.get_text()
    print re.findall(r"#(\w+)", s)

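I have a problem with the output. It looks like this: [u'eeeee', u'333333', u'2222222', u'2222222', u'2222222', u'2222222', u'2222222', u'2222222', u'2222222', u'AASTGrandRoundsacute', ...]

The output joins each hashtag to the first word of its description. With the example I gave earlier, the output would be "lovethis".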
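The joining described above can be reproduced offline. The following is a minimal sketch, assuming the hashtag cell and the description cell sit next to each other with no whitespace between them; the HTML fragment is made up for illustration, and `html.parser` is used instead of `lxml` only so the sketch runs without extra dependencies.

```python
import re
from bs4 import BeautifulSoup

# Made-up fragment: one table row with a hashtag cell and a description cell.
html = (
    "<table><tr>"
    "<td><a title='#love'>#love</a></td>"
    "<td>this hashtag is about love</td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# The default get_text() joins adjacent text nodes with no separator,
# so "#love" and "this" end up as one word.
joined = soup.get_text()
print(re.findall(r"#(\w+)", joined))  # ['lovethis']

# Passing a separator keeps the text of the two cells apart.
spaced = soup.get_text(" ")
print(re.findall(r"#(\w+)", spaced))  # ['love']
```

Whether a separator alone is enough on the real page depends on its markup, which this sketch does not model.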

How can I extract only the single word that follows the #?

Thank you.

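I don't think you need a regex to parse the text you get from the page; you can use BeautifulSoup itself for that. I am using Python 3.6 in the code below, only to show the whole code, but the important line is `hashtags = soup.findAll('td', {'id': 'tweetchatlist_hashtag'})`. Note that every hashtag in the table sits in a `td` tag whose `id` attribute is `tweetchatlist_hashtag`, so calling `.findAll` is the way to go: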

    import requests
    from bs4 import BeautifulSoup

    wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
    # Download the page and parse it
    page = requests.get(wiki).text
    soup = BeautifulSoup(page, "lxml")

    # Every hashtag cell is a <td> with id="tweetchatlist_hashtag"
    hashtags = soup.findAll('td', {'id': 'tweetchatlist_hashtag'})

Now let's look at the first item of the list:

    >>> hashtags[0]
    <td id="tweetchatlist_hashtag" itemprop="location"><a href="https://www.symplur.com/healthcare-hashtags/aastgrandrounds/" title="#AASTGrandRounds">#AASTGrandRounds</a></td>

To go on and get a list of all the hashtags, use a list comprehension:

    >>> lst = [hashtag.a['title'] for hashtag in hashtags]

If you are not used to list comprehension syntax, the line above is equivalent to this:

    >>> lst = []
    >>> for hashtag in hashtags:
    ...     lst.append(hashtag.a['title'])
    ...

`lst` is then the desired output; see the first 20 items of the list:

    >>> lst[:20]
    ['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']
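Since the question asks for a table of hashtags with their descriptions, the same lookup can be extended to pair each hashtag with the text of the cell that follows it. This is only a sketch on a made-up HTML fragment: the `tweetchatlist_hashtag` id comes from the answer above, while the position of the description (the next `td` in the same row) and the sample rows are assumptions. It uses `find_all`, the current name of `findAll`, and `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up rows; only the hashtag cell's id is taken from the real page.
html = """
<table>
  <tr>
    <td id="tweetchatlist_hashtag"><a title="#love">#love</a></td>
    <td>this hashtag is about love</td>
  </tr>
  <tr>
    <td id="tweetchatlist_hashtag"><a title="#people">#people</a></td>
    <td>a made-up description</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for cell in soup.find_all("td", {"id": "tweetchatlist_hashtag"}):
    # Drop the leading '#' so only the word itself is kept.
    tag = cell.a["title"].lstrip("#")
    # Assumption: the description is the next <td> in the same row.
    sibling = cell.find_next_sibling("td")
    description = sibling.get_text(strip=True) if sibling else ""
    rows.append((tag, description))

print(rows)
# [('love', 'this hashtag is about love'), ('people', 'a made-up description')]
```

Because the hashtag and the description are read from separate elements, they can no longer run together the way they do in the flattened page text.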

The "u" is not actually there. Python displays it to tell you that the string next to it is Unicode.

Thanks, that was useful!

Thanks. Running your solution I got this error: ConnectionError: HTTPSConnectionPool(host='www.symplur.com', port=443): Max retries exceeded with url: /healthcare-hashtags/tweet-chats/all (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 10060] Une tentative de connexion a échoué car le parti connecté n’a pas répondu convenablement au-delà d’une certaine durée ou une connexion établie a échoué car l’hôte de connexion n’a pas répondu'))

It was a proxy problem!
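The thread does not say how the proxy problem was solved. One common approach, sketched here under that caveat, is to pass the proxy address to `requests.get` through its `proxies` argument and to set a `timeout`. The helper name `fetch_page` and the proxy address in the usage comment are hypothetical placeholders, not values from the thread.

```python
import requests


def fetch_page(url, proxy_url=None, timeout=10):
    """Return the page text, optionally routing the request through a proxy."""
    # requests expects a mapping from URL scheme to proxy address.
    proxies = {"http": proxy_url, "https": proxy_url} if proxy_url else None
    response = requests.get(url, proxies=proxies, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


# Hypothetical usage; replace the placeholder with your own proxy address:
# page = fetch_page(
#     "https://www.symplur.com/healthcare-hashtags/tweet-chats/all",
#     proxy_url="http://proxy.example.com:8080",
# )
```

By default `requests` also reads the `HTTP_PROXY` and `HTTPS_PROXY` environment variables, so setting those is an alternative to passing `proxies` explicitly.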
