Python: Is there a way to get the words in a tweet that you are not using to filter tweets?

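I'm using Tweepy to filter tweets by these tags: ["corona", "quarantine", "covid19"]

For example, given the tweet "I fell down the stairs and ate an apple so no doctor #quarantine", I want to get strings like "stairs", "apple" and "doctor" as a set of keywords.

Is there a way to do this?

I'm a beginner in Python, and I'm following a video tutorial on YouTube to start this project.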

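from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# consumer_key, consumer_secret, access_token and access_token_secret
# must be defined with your own Twitter API credentials.

class StdOutListener(StreamListener):
    def on_data(self, data):
        # data is the raw JSON string for one streamed message
        print(data)
        return True

    def on_error(self, status):
        print(status)

if __name__ == '__main__':
    lis = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, lis)
    stream.filter(track=['covid19', 'corona', 'quarantine'])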

You can use a list comprehension:

tags =  ["corona", "quarantine", "covid19"]
tweet = "I fell down the stairs and ate an apple so no doctor #quarantine"

# print each word in the tweet that is longer than two characters and
# does not contain any of the tag words
print([word for word in tweet.split() if len(word) > 2 and not any(tag in word for tag in tags)])

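This is not a perfect solution, mainly because it excludes any word that contains a tag as a substring: if one of the tags is 'wash', the word 'washington' is excluded too. But it's a start.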
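One way around the substring problem is to compare whole words instead: strip the leading '#' and any surrounding punctuation from each token, then test for exact membership in the tag list. This is a minimal sketch, assuming case-insensitive matching is acceptable; the function name `words_without_tags` and the sample tweet are illustrative and not part of the original answer.

```python
import string

def words_without_tags(tweet, tags, min_length=3):
    """Return words of at least min_length characters that are not exactly a tag."""
    tag_set = {tag.lower() for tag in tags}
    kept = []
    for token in tweet.split():
        # Drop '#' and other surrounding punctuation, then normalise case.
        word = token.strip(string.punctuation).lower()
        if len(word) >= min_length and word not in tag_set:
            kept.append(word)
    return kept

# 'washington' survives even though 'wash' is a tag.
print(words_without_tags("Flew to Washington to wash my car #quarantine",
                         ["wash", "quarantine"]))
# ['flew', 'washington', 'car']
```

The `min_length=3` default mirrors the `len(word) > 2` check in the one-liner above.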

How about this:

If you want to break the tweet into words:

s =  'fell down the stairs and ate an apple so no doctor #quarantine'
allwords = s.split(' ')
allwords

#output
['fell', 'down', 'the', 'stairs', 'and', 'ate', 'an', 'apple', 'so', 'no', 'doctor','#quarantine']

Then you can pick out the words that start with #:

hastags = [i for i in allwords if i[:1]=='#']
hastags

#output
['#quarantine']

Next, you can filter out the words with a # tag:

otherwords = [i for i in allwords if i not in hastags]
otherwords

#output
['fell', 'down', 'the', 'stairs', 'and', 'ate', 'an', 'apple', 'so', 'no', 'doctor']

For a larger dataset and a long list of specific hashtags, I'd suggest this:

tags = ["corona", "quarantine", "covid19"]
[i for i in s.split(' ') if i.strip('#') not in tags]

#output
['fell', 'down', 'the', 'stairs', 'and', 'ate', 'an', 'apple', 'so', 'no', 'doctor']
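When the tag list gets long, converting it to a set turns each membership test into a hash lookup instead of a scan over the list. The following is a sketch under that assumption; `filter_tweets` and the sample tweets are hypothetical names and data used only for illustration.

```python
def filter_tweets(tweets, tags):
    """Return, for each tweet, the words that are not tags (with or without '#')."""
    tag_set = set(tags)  # built once, reused for every tweet
    return [
        [word for word in tweet.split() if word.lstrip('#') not in tag_set]
        for tweet in tweets
    ]

tweets = [
    'fell down the stairs #quarantine',
    'ate an apple so no doctor #covid19',
]
print(filter_tweets(tweets, ["corona", "quarantine", "covid19"]))
# [['fell', 'down', 'the', 'stairs'], ['ate', 'an', 'apple', 'so', 'no', 'doctor']]
```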

If the tags used to filter tweets may appear without a leading #, and you still want to filter them out:

tags = ["corona", "quarantine", "covid19"]
print([i for i in s.split(' ') if i.strip('#') not in tags and i not in tags])

#output
['fell', 'down', 'the', 'stairs', 'and', 'ate', 'an', 'apple', 'so', 'no', 'doctor']
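The lists above still contain filler words such as 'the' and 'and', whereas the question asks for keywords like "stairs", "apple" and "doctor". One possible refinement is to drop common stop words as well. The sketch below uses a tiny hand-written stop-word set purely for illustration (an assumption, not part of the original answers); `extract_keywords` is likewise a hypothetical name. A real project would probably take a fuller stop-word list from an NLP library.

```python
# Illustrative stop-word set only; far from complete.
STOP_WORDS = {"i", "the", "and", "an", "a", "so", "no", "down", "of", "to", "in"}

def extract_keywords(tweet, tags, stop_words=STOP_WORDS):
    """Return lower-cased words that are neither tags nor stop words."""
    tag_set = {tag.lower() for tag in tags}
    keywords = []
    for token in tweet.split():
        word = token.strip('#').lower()
        if word and word not in tag_set and word not in stop_words:
            keywords.append(word)
    return keywords

s = 'fell down the stairs and ate an apple so no doctor #quarantine'
print(extract_keywords(s, ["corona", "quarantine", "covid19"]))
# ['fell', 'stairs', 'ate', 'apple', 'doctor']
```

Verbs such as 'fell' and 'ate' remain, because stop-word removal does not distinguish parts of speech; narrowing the result to nouns only would need part-of-speech tagging, which this sketch does not attempt.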

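What you can do: set up a JSON file (or some other config file) with the keywords. Then, when you get "stairs", "apple", "doctor", etc., update the file with your new keywords. Meanwhile, have the streamer poll the file at a fixed interval (every hour, every 6 hours, every 5 minutes, etc.), update its internal keyword list from the file, and restart the streamer.

@asuprem That's not what I'm looking for... I intend to stream live tweets, and for each tweet I want to extract the valuable words while doing the analysis... I don't want to add to the list of tags used to filter tweets.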
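For per-tweet extraction on a live stream, the raw string passed to on_data first has to be turned into tweet text before any word filtering can run. This is a minimal sketch, assuming the payload is the JSON form of a standard Twitter API v1.1 status, where the text is in the "text" field and, for long tweets, in "extended_tweet" under "full_text"; the helper name `text_from_raw` and the sample payload are illustrative.

```python
import json

def text_from_raw(data):
    """Extract the tweet text from one raw JSON message, or '' if there is none."""
    payload = json.loads(data)
    # Long tweets carry their full text in a nested object.
    extended = payload.get("extended_tweet") or {}
    return extended.get("full_text") or payload.get("text", "")

# Hypothetical payload standing in for what on_data would receive.
sample = json.dumps({"text": "no doctor #quarantine"})
print(text_from_raw(sample))
# no doctor #quarantine
```

Inside on_data, the returned string could then be split and filtered against the tag list; messages that carry no text come back as an empty string and can simply be skipped.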