Retrieving public Facebook posts from a user profile using Selenium, Scrapy, and Python


I am trying to retrieve the wall posts of my public profile. I need to check whether a message reaches my wall and whether it is delivered within a given timestamp; essentially I am writing a monitoring check to verify delivery for our messaging system. I am getting "No connection could be made because the target machine actively refused it" and I am not sure why.
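"Actively refused" almost always means nothing is listening on the port being connected to. The spider below talks to a Selenium RC server on localhost:4444, and that server has to be started separately before the spider runs. A quick check, not part of the original code, to confirm whether anything is listening there:

import socket

# Probe the Selenium RC port the spider below connects to.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(2)
try:
    sock.connect(("localhost", 4444))
    print "Selenium server is reachable on port 4444"
except socket.error as e:
    print "Connection refused -- is the Selenium server running?", e
finally:
    sock.close()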

#!/usr/bin/env python

# Many times when crawling we run into problems where content that is rendered on the page is generated with Javascript and therefore scrapy is unable to crawl for it (eg. ajax requests, jQuery craziness). However, if you use Scrapy along with the web testing framework Selenium then we are able to crawl anything displayed in a normal web browser.
#
# Some things to note:
# You must have the Python version of Selenium RC installed for this to work, and you must have set up Selenium properly. Also this is just a template crawler. You could get much crazier and more advanced with things but I just wanted to show the basic idea. As the code stands now you will be doing two requests for any given url. One request is made by Scrapy and the other is made by Selenium. I am sure there are ways around this so that you could possibly just make Selenium do the one and only request but I did not bother to implement that and by doing two requests you get to crawl the page with Scrapy too.
#
# This is quite powerful because now you have the entire rendered DOM available for you to crawl and you can still use all the nice crawling features in Scrapy. This will make for slower crawling of course but depending on how much you need the rendered DOM it might be worth the wait.


    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item
    from scrapy.http import Request
    import time
    from selenium import selenium

    class SeleniumSpider(CrawlSpider):
        name = "SeleniumSpider"
        start_urls = ["https://www.facebook.com/chronotrackmsgcheck"]

        rules = (
            Rule(SgmlLinkExtractor(allow=(r'\.html', )), callback='parse_page', follow=True),
        )

        def __init__(self):
            CrawlSpider.__init__(self)
            self.verificationErrors = []
            self.selenium = selenium("localhost", 4444, "*chrome", "https://www.facebook.com/chronotrackmsgcheck")
            self.selenium.start()

        def __del__(self):
            self.selenium.stop()
            print self.verificationErrors

        def parse_page(self, response):
            item = Item()

            hxs = HtmlXPathSelector(response)
            #Do some XPath selection with Scrapy
            hxs.select('//div').extract()

            sel = self.selenium
            sel.open(response.url)

            #Wait for javascript to load in Selenium
            time.sleep(2.5)

            #Do some crawling of javascript created content with Selenium
            sel.get_text("//div")
            yield item

    SeleniumSpider()
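As written, parse_page yields an empty Item(). If you actually want to store what Selenium extracts, you would declare fields on a Scrapy item; a minimal sketch (the class and field names are mine, not part of the template):

from scrapy.item import Item, Field

class WallPostItem(Item):
    # hypothetical fields -- adjust to whatever parse_page extracts
    url = Field()
    text = Field()

parse_page could then fill item['url'] = response.url and item['text'] = sel.get_text("//div") before yielding it.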

This turned out to be the answer. It uses Selenium to load the user profile and then parses only what the page renders as text. If you want to mine the data further you will have to write your own parsing logic, but it worked for me:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/profileusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("facebookemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("facebookpassword")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')

parse_data = soup.get_text().encode('utf-8').split('Grant Zukel') # if you use your name exactly as it is displayed on Facebook, this splits on every post, because your name appears in every post.

latest_message = parse_data[3]
driver.close()
print latest_message
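Note that this snippet imports WebDriverWait and expected_conditions but never uses them, and it reads page_source immediately after submit(), which can race the login redirect. A sketch of how those imports could be put to work right before the page_source line (the element id is a placeholder, not a real Facebook id):

from selenium.webdriver.common.by import By

try:
    # Wait up to 10 seconds for the post-login page to render before
    # grabbing page_source. Substitute an element that actually appears
    # on your wall; "some_wall_element" is hypothetical.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "some_wall_element"))
    )
except TimeoutException:
    print "Timed out waiting for the wall to load"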
Here is how I get the user's latest post:

#!/usr/bin/python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get("https://www.facebook.com/fbusername")
inputEmail = driver.find_element_by_id("email")
inputEmail.send_keys("fbemail")
inputPass = driver.find_element_by_id("pass")
inputPass.send_keys("fbpass")
inputPass.submit()
page_text = (driver.page_source).encode('utf-8')
soup = BeautifulSoup(page_text, 'html.parser')
parse_data = soup.get_text().encode('utf-8').split('Grant Zukel')
latest_message = parse_data[4].split('·')
driver.close()
time = latest_message[0]
message = latest_message[1]
print time,message
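Both the hard-coded parse_data[4] index and the '·' split assume Facebook's rendered layout never changes, so the script fails with an IndexError the moment it does. A more defensive version of the same parsing idea (the helper function is mine, not from the answer):

def extract_latest_post(page_text, display_name):
    # Split the rendered text on the profile's display name, as above,
    # but guard the index and the '·' split instead of assuming both succeed.
    parts = page_text.split(display_name)
    if len(parts) < 5:
        return None, None  # layout changed, or login failed
    fields = parts[4].split('·')
    if len(fields) < 2:
        return None, parts[4].strip()
    return fields[0].strip(), fields[1].strip()

# e.g. post_time, post_text = extract_latest_post(soup.get_text().encode('utf-8'), 'Grant Zukel')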


You are not allowed to scrape Facebook without their written permission. I suggest reading the link.
@WizKid: "crawling" Facebook is legal. Google and other search engines crawl public profiles every day.
@GrantZukel: if you set the user agent to something resembling a normal web crawler, you don't need to run or wait for any JS. In fact, in that case you can omit Selenium entirely and just fetch the HTML and parse it. I don't know what the appropriate user agent is, but start with the one Googlebot uses.
Please post the actual error; I can't help otherwise.
@sstur: did you read the link I posted? I bet Google has written permission from Facebook. According to the robots.txt, Googlebot and a few other bots are allowed to crawl Facebook, but it ends with "User-agent: *" / "Disallow: /", which means it is disallowed for everyone else.
I will post working code shortly if it proves you wrong.
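The robots.txt argument in these comments can be settled mechanically; Python's standard library ships a parser for it. A minimal sketch (Python 2 module name; in Python 3 it lives at urllib.robotparser, and the profile URL below is a placeholder):

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.facebook.com/robots.txt")
rp.read()

# Per the comments above, Googlebot gets explicit Allow rules, while the
# file ends with "User-agent: *" / "Disallow: /" for everyone else.
print rp.can_fetch("Googlebot", "https://www.facebook.com/somepublicprofile")
print rp.can_fetch("*", "https://www.facebook.com/somepublicprofile")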