Simple Python social media scrape of public information
I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns the information for Twitter, and I know the XPath is correct for Instagram, but for some reason I am not getting any data for it. I know the XPaths could be more specific, but I can fix that later. Both of my accounts are public.
1) I thought it might not like the Python header, so I tried changing it, but I still get nothing. That line is commented out but still there.
2) I heard there is an API on GitHub, but that lengthy code is very intimidating and far beyond my level of understanding. I do not understand more than half of what I read there.
from lxml import html
import requests
import webbrowser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)
instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")
instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")
twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")
twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")
# print() with a single argument works on both Python 2 and Python 3
print('')
print('--------------------')
print('Social Media Checker')
print('--------------------')
print('')
print('Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing))
print('')
print('Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing))
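If anyone is still interested in this sort of thing, using Selenium solved my problem.
Is there a faster way?

As mentioned, Instagram's page source does not reflect its rendered source, because JavaScript functions are called to pass content from JSON data to the browser. So what Python sees in the page source does not accurately show what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using a tool or web parser that can retrieve the generated HTML content (not just the page source).

That said, if you just need IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence, but adjust to the page you need). The example below parses the JSON data: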
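import lxml.etree as et
import urllib.request as rq

# Fetch the raw page source as sent by the server (no JavaScript is executed)
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()

# Parse the HTML and select the sixth text/javascript <script> tag
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")

for i in jsondata:
    print(i)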
Output
window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day!
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
(Extracting the window._sharedData variable above)
See below where the user data (follows, followed by, etc.) starts to appear:
{
"qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
"static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
"entry_data": {
"ProfilePage": [
{
"__query_string": "?",
"__path": "\/google\/",
"__get_params": {
},
"user": {
"username": "google",
"has_blocked_viewer": false,
"follows": {
"count": 10
},
"requested_by_viewer": false,
"followed_by": {
"count": 977186
},
"country_block": null,
"has_requested_viewer": false,
"followed_by_viewer": false,
"follows_viewer": false,
"profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
"is_private": false,
"full_name": "Google",
"media": {
"count": 180,
"page_info": {
"has_previous_page": false,
"start_cursor": "1126896719808871555",
"end_cursor": "1092117490206686720",
"has_next_page": true
},
"nodes": [
{
"code": "-jipiawryD",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 105
},
"caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
"likes": {
"count": 11410
},
"date": 1448556579,
"thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
"is_video": true,
"id": "1126896719808871555",
"display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
},
{
"code": "-hwbf2wr0O",
"dimensions": {
"width": 640,
"height": 640
},
"owner": {
"id": "1067259270"
},
"comments": {
"count": 95
},
"caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
"likes": {
"count": 12621
},
...
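Once the script text is available, the counts can be read with the standard library alone. The following is a minimal sketch, assuming the script body has the form window._sharedData = {...}; and that the key names (entry_data, ProfilePage, user, followed_by, follows) still match the output shown above; Instagram may have changed this layout since. SAMPLE_SCRIPT, extract_shared_data and follower_counts are hypothetical names used only for illustration.

```python
import json
import re

# Hypothetical, heavily shortened stand-in for the script text printed above.
SAMPLE_SCRIPT = (
    'window._sharedData = {"entry_data": {"ProfilePage": [{"user": '
    '{"username": "google", "follows": {"count": 10}, '
    '"followed_by": {"count": 977186}}}]}};'
)


def extract_shared_data(script_text):
    """Return the object assigned to window._sharedData, or None if absent."""
    match = re.search(
        r'window\._sharedData\s*=\s*(\{.*\})\s*;?\s*$', script_text, re.DOTALL
    )
    if match is None:
        return None
    return json.loads(match.group(1))


def follower_counts(shared_data):
    """Return (followers, following) using the key names seen in the output above."""
    user = shared_data["entry_data"]["ProfilePage"][0]["user"]
    return user["followed_by"]["count"], user["follows"]["count"]


data = extract_shared_data(SAMPLE_SCRIPT)
followers, following = follower_counts(data)
print("Instagram: %d / %d" % (followers, following))
```

With a live page, each string in jsondata from the lxml example would be passed to extract_shared_data in place of SAMPLE_SCRIPT.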
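If you want to find information about yourself and others without the hassle of code, try this software. Besides automatic scraping, it also analyzes the information received from social networks (Facebook, Twitter, Instagram, and the Google search engine) and visualizes it in a PDF report.
Also, I am the main developer and maintainer of this project.

I just checked the page source of random Twitter and IG pages. While I could find the Twitter attribute @data-nav, I could not find IG's @data-reactid. By the way, IG's follower and following counts are output as a JSON string inside a JavaScript <script> tag. Check the public page of your account.

Using the console in Google Chrome or Firefox with the same XPath yields the result. That is how I know it works: $x("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()") and this is what I get back on Instagram: '21.2k'

Take a look at this. The web developer tools in Chrome and FF may output the fully generated HTML rather than the source HTML sent by the server, which is what Python's requests.get() likely receives. Those span classes may be generated dynamically by JavaScript functions and then rendered to the browser.

Might parameters need to be sent?

Yes, there is. Use the automation system described in my answer.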
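One way to tell the two cases apart is to look for the relevant markers in the raw text returned by the server before writing any XPath. The snippet below is only a sketch under that assumption: diagnose is a hypothetical helper, and SERVER_HTML is a made-up stand-in for page.text, not real Instagram output.

```python
def diagnose(page_source):
    """Report which markers appear in the raw HTML sent by the server."""
    return {
        "has_reactid_attribute": "data-reactid" in page_source,
        "has_shared_data_script": "window._sharedData" in page_source,
    }


# Hypothetical stand-in for page.text: an empty root element plus a data script,
# which is roughly what a JavaScript-rendered page sends before the browser runs it.
SERVER_HTML = (
    '<html><body><div id="react-root"></div>'
    '<script type="text/javascript">window._sharedData = {"entry_data": {}};</script>'
    '</body></html>'
)

report = diagnose(SERVER_HTML)
print(report)
```

If the attribute is missing from the server response but visible in the browser console, the element is created by JavaScript after loading, so the same XPath returns an empty list in Python.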