Extracting external links from tweets with Python


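I wrote this simple program to extract the links from a given user's tweets. I can extract the links, but everything I get back appears to be a shortened link on the t.co domain.

The problem is that these links sometimes lead to other tweets. How can I get the links from a tweet and make sure they point to external sites rather than to Twitter itself?

I hope my question is clear; this is the best way I can describe it.

Thanks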

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import re

#http://www.tweepy.org/
import tweepy

#Get your Twitter API credentials and enter them here
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""

#method to get a user's last  200 tweets
def get_tweets(username):

        #http://tweepy.readthedocs.org/en/v3.1.0/getting_started.html#api
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_key, access_secret)
        api = tweepy.API(auth)

        #set count to however many tweets you want; twitter only allows 200 at once
        number_of_tweets = 200

        #get tweets
        tweets = api.user_timeline(screen_name = username,count = number_of_tweets)

        for tweet in tweets:
                urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
                for url in urls:
                        print url


#if we're running this as a script
if __name__ == '__main__':

    #get tweets for username passed at command line
    if len(sys.argv) == 2:
        get_tweets(sys.argv[1])
    else:
        print "Error: enter one username"

    #alternative method: loop through multiple users
    # users = ['user1','user2']
    # for user in users:
    #     get_tweets(user)

Here is a sample of the output: (I can't post it because it contains shortened links, which the editor doesn't allow.)

You need to get the URL that each shortened link redirects to. First, add

import urllib2

and then try the following code:

for tweet in tweets:
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweet.text)
    for url in urls:
        try:
            # follow the redirects and report the final URL
            res = urllib2.urlopen(url)
            actual_url = res.geturl()
            print actual_url
        except Exception:
            # fall back to the shortened URL if it cannot be opened
            print url

I use a try..except block because some of the tweets I tested yielded invalid URLs.

In Python 3, you can do the following:

import urllib.request

for tweet in tweets:
    urls = re.findall(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", tweet.text)
    for url in urls:
        try:
            # follow the redirects and report the final URL
            opener = urllib.request.build_opener()
            request = urllib.request.Request(url)
            response = opener.open(request)
            actual_url = response.geturl()
            print(actual_url)
        except Exception:
            # fall back to the shortened URL if it cannot be opened
            print(url)
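
Resolving the redirect gives you the final URL, but it does not yet tell you whether that URL points outside Twitter. A minimal sketch of that last step, in Python 3, is to compare the host of each resolved URL against a set of Twitter domains. The names TWITTER_HOSTS, is_external and external_links are hypothetical helpers, not part of tweepy or urllib, and the list of domains is an assumption you may need to adjust.

```python
from urllib.parse import urlsplit

# Assumption: the hosts that count as "Twitter itself"; extend as needed.
TWITTER_HOSTS = {"twitter.com", "t.co", "x.com"}


def is_external(url):
    """Return True if the URL's host is not a Twitter domain or subdomain."""
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        # Not an absolute URL with a host, so it cannot be classified.
        return False
    return not any(host == h or host.endswith("." + h) for h in TWITTER_HOSTS)


def external_links(resolved_urls):
    """Keep only the links that point outside Twitter, preserving order."""
    return [u for u in resolved_urls if is_external(u)]
```

In the loops above you could then print actual_url only when is_external(actual_url) is true. A t.co link that could not be resolved still has t.co as its host, so this sketch treats it as internal and drops it.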