Simple Python social media scrape of public information


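I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns the info for Twitter, and I know the XPath is correct for Instagram, but for some reason I am not getting any data for it. I know the XPaths could be more specific, but I can fix that later. Both of my accounts are public.

1) I thought it might not like the Python user-agent header, so I tried changing it, and I still get nothing. That line is commented out, but it is still there.

2) I heard there is an API on GitHub, but the lengthy code is very intimidating and far beyond my level of comprehension. I do not understand more than half of what I read there.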

from lxml import html
import requests
import webbrowser

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)

instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")

instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")

twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")

twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")

print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)

If anyone else is interested in this sort of thing, using selenium solved my problem.
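
For anyone wondering what that might look like, here is a minimal sketch rather than the exact code I ended up with. The helper names `fetch_rendered_html` and `demo` are made up for illustration, and the choice of Firefox is an assumption; `driver.get`, `driver.page_source` and `driver.quit` are standard selenium WebDriver calls. The returned string can be handed to `html.fromstring` just like `page.text` in the question.

```python
def fetch_rendered_html(driver, url):
    """Load url in the given browser session and return its current page source."""
    driver.get(url)
    # page_source reflects the DOM the browser holds at this moment,
    # which includes elements that JavaScript has already inserted.
    return driver.page_source


def demo(url):
    """Hypothetical usage: open a real browser, fetch the page, close the browser."""
    # Imported here so the helper above does not require selenium by itself.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and its driver are installed
    try:
        return fetch_rendered_html(driver, url)
    finally:
        driver.quit()
```

Content that loads late may still be missing from the returned source, so an explicit wait may be needed before reading it.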


Is there a faster way?

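As mentioned, Instagram's page source does not reflect its rendered source, because JavaScript functions are called to pass content from JSON data to the browser. Hence, what Python sees in the page source does not accurately show what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using a browser automation tool or another web parser that can retrieve the generated HTML content (not just the page source).

With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence, but adjust for the page you need). The example below retrieves the JSON data: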

import lxml.etree as et
import urllib.request as rq

rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()

tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")

for i in jsondata:    
    print(i)
Output

window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day! 
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...
(extracting the window._sharedData variable above)

See below where the user data (followers, following, etc.) begins to show:

{
  "qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
  "static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
  "entry_data": {
    "ProfilePage": [
      {
        "__query_string": "?",
        "__path": "\/google\/",
        "__get_params": {

        },
        "user": {
          "username": "google",
          "has_blocked_viewer": false,
          "follows": {
            "count": 10
          },
          "requested_by_viewer": false,
          "followed_by": {
            "count": 977186
          },
          "country_block": null,
          "has_requested_viewer": false,
          "followed_by_viewer": false,
          "follows_viewer": false,
          "profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
          "is_private": false,
          "full_name": "Google",
          "media": {
            "count": 180,
            "page_info": {
              "has_previous_page": false,
              "start_cursor": "1126896719808871555",
              "end_cursor": "1092117490206686720",
              "has_next_page": true
            },
            "nodes": [
              {
                "code": "-jipiawryD",
                "dimensions": {
                  "width": 640,
                  "height": 640
                },
                "owner": {
                  "id": "1067259270"
                },
                "comments": {
                  "count": 105
                },
                "caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
                "likes": {
                  "count": 11410
                },
                "date": 1448556579,
                "thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
                "is_video": true,
                "id": "1126896719808871555",
                "display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
              },
              {
                "code": "-hwbf2wr0O",
                "dimensions": {
                  "width": 640,
                  "height": 640
                },
                "owner": {
                  "id": "1067259270"
                },
                "comments": {
                  "count": 95
                },
                "caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
                "likes": {
                  "count": 12621
                },
...
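
To get from that script text to actual numbers, something along these lines may work. It is only a sketch: `SAMPLE_SCRIPT` is a shortened stand-in for the real output, the helper names `extract_shared_data` and `follower_counts` are made up for illustration, and the key layout is simply the one visible in the output above, which Instagram may change at any time.

```python
import json
import re

# Shortened stand-in for the <script> text printed above; the real text is much longer.
SAMPLE_SCRIPT = (
    'window._sharedData = {"entry_data": {"ProfilePage": [{"user": '
    '{"username": "google", "follows": {"count": 10}, '
    '"followed_by": {"count": 977186}}}]}};'
)


def extract_shared_data(script_text):
    """Return the object assigned to window._sharedData as a dict, or None if absent."""
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.*\})\s*;?\s*$", script_text, re.DOTALL
    )
    if match is None:
        return None
    return json.loads(match.group(1))


def follower_counts(shared_data):
    """Return (followers, following) using the key layout seen in the output above."""
    user = shared_data["entry_data"]["ProfilePage"][0]["user"]
    return user["followed_by"]["count"], user["follows"]["count"]


data = extract_shared_data(SAMPLE_SCRIPT)
followers, following = follower_counts(data)
print("Instagram: {} / {}".format(followers, following))
```

With the real page, each string in `jsondata` from the example above would be passed to `extract_shared_data` in place of `SAMPLE_SCRIPT`.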

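If you want to find information about yourself and others without the hassle of code, then try this software. Besides automated scraping, it also analyzes the information received from social networks (Facebook, Twitter, Instagram, and the Google search engine) and visualizes it in a PDF report.

Also, I am the main developer and maintainer of this project.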

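Just checked the page source of random Twitter and IG pages. While I could find the Twitter attribute @data-nav, I could not find IG's @data-reactid. By the way, the IG followers and following output is a JSON string inside a JavaScript <script> tag. Check the public page of your account.

Using the console in Google Chrome or Firefox with the same XPath returns results. That is how I know it works: $x("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()") is what I call on Instagram, and it returns '21.2k'.

Take a look at this. Chrome's and FF's web developer tools may output the fully generated HTML and not the source HTML sent by the server, which is what Python's requests.get() may use. Those span classes may be dynamically generated by JavaScript functions and then rendered to the browser.

Might I need to send parameters?

Yes, there is. Use the automation system described in my answer.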