Python can't scrape more than 12 posts on a public Instagram account

I want to use Python to scrape all posts from a public Instagram account for a research project I am doing at university. However, I am getting frustrated because I cannot extract more than 12 posts from Instagram. Selenium does the job of scrolling the page, and I have BeautifulSoup parsing the data I want in the proper way, though only for the first 12 posts. So far I have tried a few different approaches, but I am starting to feel stuck. I have looked through several tutorials and threads here.

Thanks for any responses.

Best regards,
Carl

The code I have tried.

Example 1:
from bs4 import BeautifulSoup
import ssl
import json
import time
from selenium import webdriver
from datetime import datetime

class Insta_Image_Links_Scraper:
    def getlinks(self, user, url):
        print('[+] Downloading:\n')
        c = webdriver.Chrome()
        c.get("https://www.instagram.com/frank_the_carden/")
        lenOfPage = c.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
        match = False
        while match == False:
            lastCount = lenOfPage
            time.sleep(2)
            lenOfPage = c.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
            if lastCount == lenOfPage:
                match = True
        soup = BeautifulSoup(c.page_source, 'lxml')
        body = soup.find('body')
        script = body.find('script')
        page_json = script.text.strip().replace('window._sharedData =', '').replace(';', '')
        data = json.loads(page_json)
        print('Scraping posts for user ' + user + "...........")
        for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
            timestamp = post['node']['taken_at_timestamp']
            likedby = post['node']['edge_liked_by']['count']
            comments = post['node']['edge_media_to_comment']['count']
            isVideo = post['node']['is_video']
            caption = post['node']['edge_media_to_caption']
            print('Post on :', datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S'))
            print('Liked by :', likedby)
            print('comments :', comments)
            print('caption :', caption)

    def main(self):
        self.ctx = ssl.create_default_context()
        self.ctx.check_hostname = False
        self.ctx.verify_mode = ssl.CERT_NONE
        with open("accounts.txt") as f:
            self.content = f.readlines()
        self.content = [x.strip() for x in self.content]
        for user in self.content:
            self.getlinks(user, 'https://www.instagram.com/' + user + '/')

if __name__ == '__main__':
    obj = Insta_Image_Links_Scraper()
    obj.main()
Example 2:
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import json
from datetime import datetime
c = webdriver.Chrome()
c.get("https://www.instagram.com/frank_the_carden/")
time.sleep(1)
elem = c.find_element_by_tag_name("body")
no_of_pagedowns = 20
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1

soup = BeautifulSoup(c.page_source, 'html.parser')
body = soup.find('body')
script = body.find('script')
page_json = script.text.strip().replace('window._sharedData =', '').replace(';', '')
data = json.loads(page_json)

for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    timestamp = post['node']['taken_at_timestamp']
    likedby = post['node']['edge_liked_by']['count']
    comments = post['node']['edge_media_to_comment']['count']
    isVideo = post['node']['is_video']
    caption = post['node']['edge_media_to_caption']
    print('Post on :', datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S'))
    print('Liked by :', likedby)
    print('comments :', comments)
    print('caption :', caption)
Example 3:
import time
import json
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import urllib3
browser = webdriver.Chrome()
media_url = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"%s","first":50,"after":"%s"}'
# first get https://instagram.com to obtain cookies
browser.get('https://www.instagram.com/frank_the_carden/')
browser_cookies = browser.get_cookies()
# set a session with cookies
session = requests.Session()
for cookie in browser_cookies:
    c = {cookie['name']: cookie['value']}
    session.cookies.update(c)
# get response as JSON
response = session.get(media_url % ('5719699176', ''), verify=False).json()
time.sleep(1)
elem = browser.find_element_by_tag_name("body")
no_of_pagedowns = 20
while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.2)
    no_of_pagedowns -= 1
soup = BeautifulSoup(browser.page_source, 'html.parser')
body = soup.find('body')
script = body.find('script')
page_json = script.text.strip().replace('window._sharedData =', '').replace(';', '')
data = json.loads(page_json)
for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    timestamp = post['node']['taken_at_timestamp']
    likedby = post['node']['edge_liked_by']['count']
    comments = post['node']['edge_media_to_comment']['count']
    isVideo = post['node']['is_video']
    caption = post['node']['edge_media_to_caption']
    print('Post on :', datetime.utcfromtimestamp(timestamp).strftime('%Y-%m-%d %H:%M:%S'))
    print('Liked by :', likedby)
    print('comments :', comments)
    print('caption :', caption)
Example 4:
from random import choice
import json
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://www.instagram.com/frank_the_carden/")
# Selenium script to scroll to the bottom
lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
match = False
while match == False:
    lastCount = lenOfPage
    time.sleep(1)
    lenOfPage = browser.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    if lastCount == lenOfPage:
        match = True

_user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
]

class InstagramScraper:
    def __init__(self, user_agents=None, proxy=None):
        self.user_agents = user_agents
        self.proxy = proxy

    def __random_agent(self):
        if self.user_agents and isinstance(self.user_agents, list):
            return choice(self.user_agents)
        return choice(_user_agents)

    def __request_url(self, url):
        try:
            response = requests.get(url, headers={'User-Agent': self.__random_agent()},
                                    proxies={'http': self.proxy, 'https': self.proxy})
            response.raise_for_status()
        except requests.HTTPError:
            raise requests.HTTPError('Received non 200 status code from Instagram')
        except requests.RequestException:
            raise requests.RequestException
        else:
            return response.text

    @staticmethod
    def extract_json_data(html):
        soup = BeautifulSoup(html, 'html.parser')
        body = soup.find('body')
        script_tag = body.find('script')
        raw_string = script_tag.text.strip().replace('window._sharedData =', '').replace(';', '')
        return json.loads(raw_string)

    def profile_page_metrics(self, profile_url):
        results = {}
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']
        except Exception as e:
            raise e
        else:
            for key, value in metrics.items():
                if key != 'edge_owner_to_timeline_media':
                    if value and isinstance(value, dict):
                        value = value['count']
                        results[key] = value
                    elif value:
                        results[key] = value
        return results

    def profile_page_recent_posts(self, profile_url):
        results = []
        try:
            response = self.__request_url(profile_url)
            json_data = self.extract_json_data(response)
            metrics = json_data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']["edges"]
        except Exception as e:
            raise e
        else:
            for node in metrics:
                node = node.get('node')
                if node and isinstance(node, dict):
                    results.append(node)
        return results

from pprint import pprint
k = InstagramScraper()
results = k.profile_page_recent_posts('https://www.instagram.com/frank_the_carden/')
pprint(results)
I would call the Instagram GraphQL API directly, as you did in "Example 3". I had working code, but they changed how the query hash is generated and I could not get it working again; you are probably facing the same problem.
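For what it's worth, assembling that GraphQL URL from the template in "Example 3" is plain string and JSON work. The sketch below is only my assumption of how the pieces fit together; the query_hash and profile ID are the values already used in the question and may well be stale on Instagram's side:

```python
import json
from urllib.parse import quote

# query_hash taken from the question's Example 3; Instagram rotates these,
# so treat it as a placeholder rather than a known-working value.
QUERY_HASH = '42323d64886122307be10013ad2dcc44'

def graphql_url(profile_id, first=50, after=''):
    # Build the variables object as compact JSON, then percent-encode it
    # into the query string instead of %-formatting a pre-encoded URL.
    variables = json.dumps({'id': profile_id, 'first': first, 'after': after},
                           separators=(',', ':'))
    return ('https://www.instagram.com/graphql/query/'
            '?query_hash=' + QUERY_HASH + '&variables=' + quote(variables))

print(graphql_url('5719699176'))
```

Encoding the variables yourself also sidesteps the `%`-formatting clashes (TypeError/ValueError) that come from mixing `%s` placeholders with literal `%22`-style escapes in one string.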
Other than that, I am currently using this tool to scrape Instagram data, but you need to provide your Instagram credentials for it to work. You can use this query template to get a JSON containing the user's posts: %22%2C%22first%22%3A%7D. Check this for more information; I think it may help.
I have been looking for the same answer as you, and I found the best way is the following. First, use the requests library with the Instagram GraphQL query URL, pasting in:

<profile_id>: your target's Instagram profile ID. You can get it by adding /?__a=1 to the end of the profile link and looking in this data directory: ['data']['user']['edge_owner_to_timeline_media']['edges'][0]['node']['owner']['id']
<num_ofpost>: how many posts each JSON query returns. The maximum is 50; if you want more, use the next step.
<end_cursor>: this hash indicates whether the posts have a next page. The directory is: ['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']

Then, when you have successfully obtained all the data you need, you can use this code to keep the JSON format:
import json
import requests

profilq = requests.get('https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={%22id%22:%22<profile_id>%22,%22first%22:<num_ofpost>,%22after%22:%22<end_cursor>%22}')
data = profilq.json()
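To get past the 50-posts-per-query limit, the end_cursor from page_info can drive a loop. This is a sketch under the assumptions above (same JSON layout, same query template); fetching is factored into a callable so the pagination logic can be exercised without hitting Instagram:

```python
def collect_posts(fetch_page, max_posts=100):
    """Follow Instagram's cursor-based pagination until max_posts are gathered.

    fetch_page(end_cursor) must return the parsed JSON of one GraphQL query;
    with requests it would wrap requests.get(...).json() on the URL template
    shown earlier in the thread.
    """
    posts, end_cursor = [], ''
    while len(posts) < max_posts:
        media = fetch_page(end_cursor)['data']['user']['edge_owner_to_timeline_media']
        posts.extend(edge['node'] for edge in media['edges'])
        page_info = media['page_info']
        if not page_info['has_next_page']:
            break  # no further pages of posts
        end_cursor = page_info['end_cursor']  # cursor pointing at the next batch
    return posts[:max_posts]
```

The key detail is that every response carries the cursor for the next request, so each call picks up where the previous batch ended.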
I tested the query link as of December 7th, 2020. You can check my GitHub if you want to see how I paste the link.

Thanks for the reply! Yes, this is actually my most recent attempt, and I think it makes sense. I could not get the query_hash to work either. I took the values from the network XHR query string parameters and tried entering them into the code in different ways, but I got errors such as: "my directory", line 27, in response = session.get(media_url % ('5719699176', ''), verify=False).json() TypeError: not all arguments converted during string formatting, and ValueError: unsupported format character 'B' (0x42) at index 97. According to the GraphQL docs, the query hash should be created by applying sha256 to the query as a string, but I am probably missing something; I never got the same hash as Instagram, so I kept getting "invalid query hash".

Not exactly a solution, but it might give you some ideas: see the 500 posts I scraped from Joe Biden's Instagram. Admittedly it is a bit hacky, but I basically use Selenium to scroll the page, collect the entire HTML on each scroll, then compare all the HTML at the end and parse out the individual post shortcodes from the URLs.
try:
    your code
except IndexError:
    caption = '*NO CAPTION PROVIDED*'
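The scroll-and-compare approach described above could be sketched as follows. Only the shortcode extraction is shown runnable; the regex and the /p/<shortcode>/ link shape are assumptions about the profile page markup, and the Selenium scrolling is left as a comment since it mirrors the examples in the question:

```python
import re

def extract_shortcodes(html_snapshots):
    # Instagram post links look like /p/<shortcode>/; collect the unique
    # shortcodes across every HTML snapshot taken while scrolling,
    # preserving first-seen order.
    seen, ordered = set(), []
    for html in html_snapshots:
        for code in re.findall(r'/p/([A-Za-z0-9_-]+)/', html):
            if code not in seen:
                seen.add(code)
                ordered.append(code)
    return ordered

# Gathering the snapshots with Selenium would look roughly like:
# browser.get('https://www.instagram.com/frank_the_carden/')
# snapshots = []
# for _ in range(20):
#     browser.execute_script('window.scrollTo(0, document.body.scrollHeight);')
#     time.sleep(1)
#     snapshots.append(browser.page_source)
# print(extract_shortcodes(snapshots))
```

Deduplicating across snapshots is what makes this work: posts that scroll out of the DOM still survive in an earlier snapshot.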