Python: extracting images from tweets with tweepy and reading their text with tesseract

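I'm trying to add OCR with tesseract to my Twitter monitor. My question is: how do I get the image from a user's tweet and run OCR on it right away? I'm monitoring certain Twitter accounts for their latest tweets; if a new tweet appears and contains a URL, I open it in the browser. Now I also want to check whether the tweet contains an image and print its contents to the console. My code looks like this: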

import tweepy
import re
import webbrowser
import time
import urllib.request
from datetime import datetime
# a bunch of access keys
keys = [(example_keys)]

# which key is in use right now
key_index = 0
test = 0
url_store = ''



# Function to extract url from newest tweet 
def get_tweets(username, tweet_mode='extended'):
        # Authorization to consumer key and consumer secret 
        auth = tweepy.OAuthHandler(keys[key_index][0], keys[key_index][1]) 

        # Access to user's access key and access secret 
        auth.set_access_token(keys[key_index][2], keys[key_index][3]) 

        # Calling api 
        api = tweepy.API(auth) 

        # try to get latest tweet until rate limit is reached
        try:
            # Get newest tweet from profile
            tweets = api.user_timeline(screen_name=username, count=1)
            tweet = [tweet.text for tweet in tweets][0]
            print(tweet)



            global url_store
            # regex through tweet for url
            url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(tweet))

            # check if url was found and isn't the same as the url from the last tweet
            if (url!=[] and url[0]!=url_store):
                # store url in variable
                url_store=url[0]
                # open the url in webbrowser
                webbrowser.open(url[0])

                # save the html dom to a text file
                urllib.request.urlretrieve(url[0], "test.txt")

        # when rate limit is reached
        except tweepy.TweepError:
            # select the next key from array
            changeKeys() 

        # right now function always returns false
        return False


def changeKeys():
        global key_index
        # increment key_index by 1 unless end of key array is reached -> start from the beginning
        if key_index >= len(keys) - 1:
            key_index = 0
        else:
            key_index += 1

def getIMG():
        # not implemented yet: should fetch the tweet's image and run OCR on it
        pass



# Driver code 
if __name__ == '__main__': 
    # boolean if url was found (right now it's always false)
    found=False
    # loop until a url is found (right now it never ends)
    while not found:
        # get tweets from specific twitter handle
        found = get_tweets("Trump")
        time.sleep(0.02)

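That's a good question. Using a RegEx is the wrong way to find images.

Every Tweet contains "entities" - see

You can use these to get the images directly from the tweet.

For example:

tweet.entities.urls
will get all the URLs in the tweet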
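
To make the entities approach more concrete, here is a minimal sketch rather than a definitive implementation. It assumes the tweet comes from the v1.1 `user_timeline` call used in the question, where tweepy exposes `entities` (and, when present, `extended_entities`) as plain dicts, so the lookup is written as `tweet.entities["media"]`. The helper names `image_urls` and `ocr_image` are hypothetical, and the OCR step assumes the third-party packages Pillow and pytesseract plus the tesseract binary are installed. The `fake_tweet` object is only a stand-in so the extraction part can be tried offline.

```python
import io
import urllib.request
from types import SimpleNamespace


def image_urls(tweet):
    """Return the URLs of photos attached to a tweet (hypothetical helper).

    Assumes `tweet` looks like a tweepy Status from the v1.1 API, i.e. its
    `entities` and optional `extended_entities` attributes are plain dicts.
    """
    # extended_entities lists every attached photo; entities only the first one
    entities = getattr(tweet, "extended_entities", None) or getattr(tweet, "entities", {})
    return [
        media["media_url_https"]
        for media in entities.get("media", [])
        if media.get("type") == "photo"
    ]


def ocr_image(url):
    """Download one image and return the text tesseract finds in it.

    Assumes Pillow and pytesseract are installed (imported lazily so the
    extraction helper above works without them).
    """
    from PIL import Image
    import pytesseract

    with urllib.request.urlopen(url, timeout=10) as response:
        data = response.read()
    return pytesseract.image_to_string(Image.open(io.BytesIO(data)))


# Stand-in for a tweepy Status, with made-up URLs for illustration only
fake_tweet = SimpleNamespace(entities={
    "urls": [{"expanded_url": "https://example.com/article"}],
    "media": [{"type": "photo",
               "media_url_https": "https://pbs.twimg.com/media/example.jpg"}],
})
print(image_urls(fake_tweet))
```

In the question's `get_tweets`, this would mean keeping the Status object itself (for example `tweets[0]`) instead of only `tweet.text`, then calling `ocr_image` on each URL returned by `image_urls` and printing the result.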