C#: Get all the images on a board from a Pinterest web address
Tags: c#, javascript, html, ajax, pinterest


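This question sounds simple, but it is not as simple as it sounds.

Brief summary of the problem

For example, take this board.

Inspecting the HTML of the board itself at the top of the page (inside the div with class GridItems) gives: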


..
...
..
But at the bottom of the page, after triggering the infinite scroll a few times, we get this HTML instead:


..
...
..
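As you can see, some of the containers for images higher up the page have disappeared, and not all image containers are loaded when the page first loads.


What I want to do

I want to write a C# script (or, for now, a script in any server-side language) that downloads the page's full HTML (that is, retrieves every image on the page) and then downloads the images from their URLs. Downloading a web page and applying the appropriate XPath is easy; the real challenge is getting the full HTML that contains every image.

Is there a way to simulate scrolling to the bottom of the page, or an easier way to retrieve every image? I imagine Pinterest uses AJAX to change the HTML; is there a way to trigger those events programmatically so that I receive all of the HTML? Thanks in advance for any suggestions and solutions, and thanks for even reading this very long question if you have none.

Pseudocode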

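using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string pinterestUrl = "http://www.pinterest.com/...";
        string xPath = ".../img";

        // HtmlWeb fetches the page over HTTP; HtmlDocument.Load only reads files and streams.
        HtmlDocument doc = new HtmlWeb().Load(pinterestUrl);

        // Currently only finds the first 25 images.
        var imageLinks = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xPath);
        if (nodes == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode link in nodes)
        {
            imageLinks.Add(link.GetAttributeValue("src", ""));
            // Use image links
        }
    }
}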

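Some people have suggested using JavaScript to simulate scrolling.

I don't think you need to simulate scrolling at all. I think you just need to work out the format of the URIs that are called via AJAX as scrolling happens; then you can fetch each "page" of results in order. It takes a bit of reverse engineering.

Using the Network tab of Chrome's inspector, I can see that once I get a certain distance down the page, this URI is called:

If we decode it, we see that it is mostly JSON: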

http://pinterest.com/resource/BoardFeedResource/get/?source_url=/dodo/web-designui-and-mobile/&data=
{
"options": {
    "board_id": "158400180582875562",
    "access": [],
    "bookmarks": [
        "LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw=="
    ]
},
"context": {
    "app_version": "fb43cdb"
},
"module": {
    "name": "GridItems",
    "options": {
        "scrollable": true,
        "show_grid_footer": true,
        "centered": true,
        "reflow_all": true,
        "virtualize": true,
        "item_options": {
            "show_rich_title": false,
            "squish_giraffe_pins": false,
            "show_board": false,
            "show_via": false,
            "show_pinner": false,
            "show_pinned_from": true
        },
        "layout": "variable_height"
    }
},
"append": true,
"error_strategy": 1
}
&_=1377091719636
Scroll down until we get a second request, and we see this:

http://pinterest.com/resource/BoardFeedResource/get/?source_url=/dodo/web-designui-and-mobile/&data=
{
    "options": {
        "board_id": "158400180582875562",
        "access": [],
        "bookmarks": [
            "LT4xNTg0MDAxMTE4NjcwNTk1ODQ6NDl8ODFlMDUwYzVlYWQxNzVmYzdkMzI0YTJiOWJkYzUwOWFhZGFkM2M1MzhiNzA0ZDliZDIzYzE3NjkzNTg1ZTEyOQ=="
        ]
    },
    "context": {
        "app_version": "fb43cdb"
    },
    "module": {
        "name": "GridItems",
        "options": {
            "scrollable": true,
            "show_grid_footer": true,
            "centered": true,
            "reflow_all": true,
            "virtualize": true,
            "item_options": {
                "show_rich_title": false,
                "squish_giraffe_pins": false,
                "show_board": false,
                "show_via": false,
                "show_pinner": false,
                "show_pinned_from": true
            },
            "layout": "variable_height"
        }
    },
    "append": true,
    "error_strategy": 2
}
&_=1377092231234
As you can see, not much has changed. The board_id is the same, error_strategy is now 2, and the &_ at the end is different.
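The decoding and comparison described above can be sketched in Python. This is a minimal sketch, not the answerer's code: the helper names (make_url, decode_request, flatten, changed_keys) are made up for illustration, and the payload is trimmed to a few of the fields captured above. It rebuilds two request URLs, decodes the data parameter back into JSON, and lists which keys differ between the two requests.

```python
import base64
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE = "http://pinterest.com/resource/BoardFeedResource/get/"
# Bookmark values copied from the two captured requests.
FIRST_BOOKMARK = "LT4xNTg0MDAxMTE4NjcxMTM2ODk6MjV8ZWJjODJjOWI4NTQ4NjU4ZDMyNzhmN2U3MGQyZGJhYTJhZjY2ODUzNTI4YTZhY2NlNmY0M2I1ODYwYjExZmQ3Yw=="
SECOND_BOOKMARK = "LT4xNTg0MDAxMTE4NjcwNTk1ODQ6NDl8ODFlMDUwYzVlYWQxNzVmYzdkMzI0YTJiOWJkYzUwOWFhZGFkM2M1MzhiNzA0ZDliZDIzYzE3NjkzNTg1ZTEyOQ=="


def make_url(bookmark, error_strategy, stamp):
    """Rebuild a trimmed-down request URL (hypothetical helper)."""
    data = {
        "options": {"board_id": "158400180582875562", "bookmarks": [bookmark]},
        "append": True,
        "error_strategy": error_strategy,
    }
    query = {"source_url": "/dodo/web-designui-and-mobile/",
             "data": json.dumps(data), "_": stamp}
    return BASE + "?" + urlencode(query)


def decode_request(url):
    """Return (data payload as a dict, value of the `_` parameter)."""
    params = parse_qs(urlsplit(url).query)
    return json.loads(params["data"][0]), params["_"][0]


def flatten(obj, prefix=""):
    """Flatten nested dicts into {"a.b": value} so payloads are easy to diff."""
    flat = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + key + "."))
        else:
            flat[prefix + key] = value
    return flat


def changed_keys(first, second):
    """List the flattened keys whose values differ between two payloads."""
    a, b = flatten(first), flatten(second)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


first, stamp1 = decode_request(make_url(FIRST_BOOKMARK, 1, "1377091719636"))
second, stamp2 = decode_request(make_url(SECOND_BOOKMARK, 2, "1377092231234"))
print(changed_keys(first, second))

# The bookmark strings are Base64; print the readable start of each one.
for bookmark in (FIRST_BOOKMARK, SECOND_BOOKMARK):
    print(base64.b64decode(bookmark)[:24].decode("ascii"))
```

Run against the two captures, the diff reports error_strategy and options.bookmarks, so the bookmarks value changes between requests as well as the &_ value. The decoded bookmarks begin with what looks like an id followed by :25| and :49|; reading that number as a running item count is only a guess.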


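The &_ parameter is the key one here. I would bet that it tells the page where to start the next set of photos. I can't find a reference to it in either of the responses or in the original page HTML, but it has to be in there somewhere, or be generated by JavaScript on the client side. Either way, the page/browser has to know what to ask for next, so this information should be something you can get at.

You can trigger the JSON endpoint by making a request with this header:
X-Requested-With: XMLHttpRequest

Try this command in a console: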

curl -H "X-Requested-With:XMLHttpRequest" "http://pinterest.com/resource/CategoryFeedResource/get/?source_url=%2Fall%2Fgeek%2F&data=%7B%22options%22%3A%7B%22feed%22%3A%22geek%22%2C%22scope%22%3Anull%2C%22bookmarks%22%3A%5B%22Pz8xMzc3NjU4MjEyLjc0Xy0xfDE1ZjczYzc4YzNlNDg3M2YyNDQ4NGU1ZTczMmM0ZTQyYzBjMWFiMWNhYjRhMDRhYjg2MTYwMGVkNWQ0ZDg1MTY%3D%22%5D%2C%22is_category_feed%22%3Atrue%7D%2C%22context%22%3A%7B%22app_version%22%3A%22addc92b%22%7D%2C%22module%22%3A%7B%22name%22%3A%22GridItems%22%2C%22options%22%3A%7B%22scrollable%22%3Atrue%2C%22show_grid_footer%22%3Atrue%2C%22centered%22%3Atrue%2C%22reflow_all%22%3Atrue%2C%22virtualize%22%3Atrue%2C%22item_options%22%3A%7B%22show_pinner%22%3Atrue%2C%22show_pinned_from%22%3Afalse%2C%22show_board%22%3Atrue%2C%22show_via%22%3Afalse%7D%2C%22layout%22%3A%22variable_height%22%7D%7D%2C%22append%22%3Atrue%2C%22error_strategy%22%3A2%7D&module_path=App()%3EHeader()%3EDropdownButton()%3EDropdown()%3ECategoriesMenu(resource%3D%5Bobject+Object%5D%2C+name%3DCategoriesMenu%2C+resource%3DCategoriesResource(browsable%3Dtrue))&_=1377658213300" | python -mjson.tool
You will see the pin data in the JSON output. You should be able to parse it and grab the next images you need.
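For readers who prefer a script to curl, the same idea can be sketched with Python's standard library. This is an illustration under assumptions: build_feed_request and fetch_json are hypothetical helper names, and the data payload is heavily trimmed from the curl command above, so the endpoint may well reject it as written. The point is only how the X-Requested-With header is attached.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_feed_request(base_url, source_url, data):
    """Build (but do not send) a GET request for a JSON feed endpoint."""
    query = urlencode({"source_url": source_url, "data": json.dumps(data)})
    return Request(base_url + "?" + query,
                   headers={"X-Requested-With": "XMLHttpRequest"})


def fetch_json(request, timeout=10):
    """Send the request and parse the body as JSON (performs network I/O)."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_feed_request(
    "http://pinterest.com/resource/CategoryFeedResource/get/",
    "/all/geek/",
    # Assumption: a trimmed payload; the real one carries more options.
    {"options": {"feed": "geek", "is_category_feed": True}},
)
print(request.full_url)
print(request.header_items())
```

Calling fetch_json(request) would perform the actual HTTP call; it is kept separate so the URL and headers can be inspected first.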


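As for this bit: &_=1377658213300. I'd surmise that this is the id of the last pin from the previous list. You should be able to replace it on each call with the last pin from the previous response.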

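OK, I think this may be (with a few alterations) what you need.

Caveats:

  • This is PHP, not C# (but you said you were interested in any server-side language).
  • This code hooks into (unofficial) Pinterest search endpoints. You will need to change $data and $search_res to reflect the appropriate endpoints (e.g. BoardFeedResource) for your task. Note: at least for search, Pinterest currently uses two endpoints, one for the initial page load and another for the infinite-scroll actions. Each has its own expected parameter structure.
  • Pinterest has no official public API; expect this to break, without any warning, whenever they change anything.
  • You may find pinterestapi.co.uk easier to implement and acceptable for what you are doing.
  • I have some demo/debug code beneath the class that shouldn't be there once you are getting the data you want, and a default page-fetch limit that you may want to change.

Points of interest:

  • The underscore (_) parameter takes a timestamp in JavaScript format, i.e. like Unix time but with milliseconds added. It is not actually used for pagination.
  • Pagination uses the bookmarks property: you make the first request to the "new" endpoint, which doesn't require it, then take the bookmarks from the result and use it in your request to get the next "page" of results, take the bookmarks from those results to fetch the page after that, and so on until you run out of results or reach your preset limit (or hit the server's maximum script execution time). I would be curious to know exactly what the bookmarks field encodes. I'd like to think there is some fun secret sauce beyond just a pin ID or some other page marker.
  • I skip the HTML and deal with the JSON instead, as it is easier (for me) than using a DOM-manipulation solution or a bunch of regexes.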
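The bookmark-driven pagination described in the points above can be sketched as follows. This is a rough Python sketch under stated assumptions, not the PHP class from this answer: build_page_url, fetch_all_pins, and fake_fetch are hypothetical names, the response layout is not assumed (the caller supplies a fetch_page function that returns the pins and the next bookmark), and the "-end-" stop marker is borrowed from the bash script further down.

```python
import json
import time
from urllib.parse import urlencode


def build_page_url(base_url, source_url, options, bookmark=None):
    """Build the URL for one page; the first request carries no bookmark."""
    opts = dict(options)
    if bookmark is not None:
        opts["bookmarks"] = [bookmark]
    query = {
        "source_url": source_url,
        "data": json.dumps({"options": opts}),
        "_": int(time.time() * 1000),  # JavaScript-style timestamp (ms)
    }
    return base_url + "?" + urlencode(query)


def fetch_all_pins(fetch_page, base_url, source_url, options, max_pages=10):
    """Follow bookmarks until results run out or max_pages is reached.

    fetch_page(url) is supplied by the caller and must return a tuple
    (pins, next_bookmark); how to pull those out of a real response is
    left to the caller.
    """
    pins, bookmark = [], None
    for _ in range(max_pages):
        url = build_page_url(base_url, source_url, options, bookmark)
        page_pins, bookmark = fetch_page(url)
        pins.extend(page_pins)
        if not page_pins or not bookmark or bookmark == "-end-":
            break
    return pins


def fake_fetch(url):
    # Stand-in for a real HTTP call: serves two pages, then signals the end.
    if "bookmarks" not in url:
        return ["pin-1", "pin-2"], "bm-1"
    return ["pin-3"], "-end-"


all_pins = fetch_all_pins(fake_fetch, "http://example.invalid/get/",
                          "/board/", {"board_id": "1"})
print(all_pins)
```

Passing the fetch function in keeps the loop testable without network access and makes the preset page limit explicit through max_pages.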
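    get_taged_pins($search_str, $limit, $bookmarks, +$page);
    if (!($more_pins == false)) $pin_data_array = array_merge($pin_data_array, $more_pins);
    return $pin_data_array;
    }
    // end of recursion
    return false;
    }
    } // end class Skrivener_Pins
    } // end if
    /**
     * Debug/demo code
     * Remove or comment out this section for production
     */
    // Output headers to control how the content displays
    // header("Content-Type: application/json");
    header("Content-Type: text/plain");
    // header("Content-Type: text/html");
    // Define the search term
    // $tag = "vader";
    $tag = "hemolytic";
    // $tag = "qjkjgjerbjkrekhjk";
    if (class_exists('Skrivener_Pins')) {
    // Instantiate the class
    $pin_handler = new Skrivener_Pins();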
    
    #!/usr/bin/env bash 
    ##
    ## File: getpins.bsh 
    ## 
    ## Copyrighted by +A.M.Danischewski  2016+ (c)
    ## This program may be reutilized without limits, provided this 
    ## notice remain intact. 
    
    ## If this breaks one day, then just fire up firefox Developer Tools and check the network traffic to 
    ## capture "copy as curl" of the calls to the search page (filter with BaseSearchResource), then the 
    ## call to feed more data (filter with SearchResource). 
    ## 
    ## Do a search on whatever you want remove the cookie header, and add -o ret2.html -D h2.txt -c c1.txt, 
    ## then search replace the search terms as SEARCHTOKEN1 and SEARCHTOKEN2. 
    ## 
    ## Description this script facilitates alternate browsers, by caching images/pins 
    ## from pinterest. This script is hardwired for two search terms. First create a directory 
    ## to where you want the images to go, then cd there. 
    ##  Usage: 
    ##    $> cd /big/drive/auto_gyros 
    ##    $> getpins.bsh "sleek autogyros"
    ## 
    ## Expect around 900 images to land wherever you select, so make sure you have space! =) 
    ##
    
    declare -r ORIG_IMGS="pin_orig_imgs.txt"
    declare -r TMP_IMGS="pin_imgs.txt"
    declare -r UA_HEADER="User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.$(($RANDOM%10))) Gecko/20100101 Firefox/19.0"
    
     ## Say Hello to the main page and get a cookie. 
    declare PINCMD1=$(cat << EOF
    curl -o ret1.html -D h1.txt -c c1.txt -H 'Host: www.pinterest.com' -H '${UA_HEADER}' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' 'https://www.pinterest.com/'
    EOF
    )
     ## Start a search for our dear search terms. 
    declare PINCMD2=$(cat << EOF
    curl -H 'X-APP-VERSION: ea7a93a' -o ret2.html -D h2.txt -c c1.txt -H 'Host: www.pinterest.com' -H '${UA_HEADER}' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'X-Pinterest-AppState: active' -H 'X-NEW-APP: 1'  -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: https://www.pinterest.com' -H 'Connection: keep-alive' 'https://www.pinterest.com/resource/BaseSearchResource/get/?source_url=%2Fsearch%2Fpins%2F%3Fq%3DSEARCHTOKEN1%2520SEARCHTOKEN2%26rs%3Dtyped%260%3DSEARCHTOKEN1%257Ctyped%261%3DSEARCHTOKEN2%257Ctyped&data=%7B%22options%22%3A%7B%22restrict%22%3Anull%2C%22scope%22%3A%22pins%22%2C%22constraint_string%22%3Anull%2C%22show_scope_selector%22%3Atrue%2C%22query%22%3A%22SEARCHTOKEN1+SEARCHTOKEN2%22%7D%2C%22context%22%3A%7B%7D%2C%22module%22%3A%7B%22name%22%3A%22SearchPage%22%2C%22options%22%3A%7B%22restrict%22%3Anull%2C%22scope%22%3A%22pins%22%2C%22constraint_string%22%3Anull%2C%22show_scope_selector%22%3Atrue%2C%22query%22%3A%22SEARCHTOKEN1+SEARCHTOKEN2%22%7D%7D%2C%22render_type%22%3A1%2C%22error_strategy%22%3A0%7D&module_path=App%3EHeader%3ESearchForm%3ETypeaheadField(support_guided_search%3Dtrue%2C+resource_name%3DAdvancedTypeaheadResource%2C+tags%3Dautocomplete%2C+class_name%3DbuttonOnRight%2C+prefetch_on_focus%3Dtrue%2C+support_advanced_typeahead%3Dnull%2C+hide_tokens_on_focus%3Dundefined%2C+search_on_focus%3Dtrue%2C+placeholder%3DSearch%2C+show_remove_all%3Dtrue%2C+enable_recent_queries%3Dtrue%2C+name%3Dq%2C+view_type%3Dguided%2C+value%3D%22%22%2C+input_log_element_type%3D227%2C+populate_on_result_highlight%3Dtrue%2C+search_delay%3D0%2C+is_multiobject_search%3Dtrue%2C+type%3Dtokenized%2C+enable_overlay%3Dtrue)&_=1454779874891' 
    EOF
    )
     ## Load further images. 
    declare PINCMD3=$(cat << EOF
    curl -H 'X-APP-VERSION: ea7a93a' -D h3.txt -c c1.txt -H 'Host: www.pinterest.com' -H '${UA_HEADER}' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'X-Pinterest-AppState: active' -H 'X-NEW-APP: 1'  -H 'X-Requested-With: XMLHttpRequest' -H 'Referer: https://www.pinterest.com' -H 'Connection: keep-alive' 'https://www.pinterest.com/resource/SearchResource/get/?source_url=%2Fsearch%2Fpins%2F%3Fq%3DSEARCHTOKEN1%2520SEARCHTOKEN2%26rs%3Dtyped%260%3DSEARCHTOKEN1%257Ctyped%261%3DSEARCHTOKEN2%257Ctyped&data=%7B%22options%22%3A%7B%22layout%22%3Anull%2C%22places%22%3Afalse%2C%22constraint_string%22%3Anull%2C%22show_scope_selector%22%3Atrue%2C%22query%22%3A%22SEARCHTOKEN1+SEARCHTOKEN2%22%2C%22scope%22%3A%22pins%22%2C%22bookmarks%22%3A%5B%22_NEW_BOOK_MARK_%22%5D%7D%2C%22context%22%3A%7B%7D%7D&module_path=App%3EHeader%3ESearchForm%3ETypeaheadField(support_guided_search%3Dtrue%2C+resource_name%3DAdvancedTypeaheadResource%2C+tags%3Dautocomplete%2C+class_name%3DbuttonOnRight%2C+prefetch_on_focus%3Dtrue%2C+support_advanced_typeahead%3Dnull%2C+hide_tokens_on_focus%3Dundefined%2C+search_on_focus%3Dtrue%2C+placeholder%3DSearch%2C+show_remove_all%3Dtrue%2C+enable_recent_queries%3Dtrue%2C+name%3Dq%2C+view_type%3Dguided%2C+value%3D%22%22%2C+input_log_element_type%3D227%2C+populate_on_result_highlight%3Dtrue%2C+search_delay%3D0%2C+is_multiobject_search%3Dtrue%2C+type%3Dtokenized%2C+enable_overlay%3Dtrue)&_=1454779874911'
    EOF
    )
     ## Exactly 2 search terms in a single string are expected, you can hack it up if 
     ## you want something else.  
    declare SEARCHTOKEN1=$(echo "${1}" | cut -d " " -f1)
    declare SEARCHTOKEN2=$(echo "${1}" | cut -d " " -f2)
    
    PINCMD3=$(sed "s/SEARCHTOKEN1/${SEARCHTOKEN1}/g" <<< "${PINCMD3}") 
    PINCMD3=$(sed "s/SEARCHTOKEN2/${SEARCHTOKEN2}/g" <<< "${PINCMD3}") 
    PINCMD2=$(sed "s/SEARCHTOKEN1/${SEARCHTOKEN1}/g" <<< "${PINCMD2}") 
    PINCMD2=$(sed "s/SEARCHTOKEN2/${SEARCHTOKEN2}/g" <<< "${PINCMD2}") 
    
    function lspinimgs() { grep -o "\"url\": \"http[s]*://[^\"]*.pinimg.com[^\"]*.jpg\"" "${1}" | cut -d " " -f2 | tr -d "\""; }
    function mkpinorig() { sed "s#\(^http.*\)\(com/\)\([^/]*\)\(/.*jpg\$\)#\1\2originals\4#g" "${1}" > "${2}"; }    
    function getpinbm() { grep -o "bookmarks\": [^ ]* "  "${1}" | sed "s/^book.*\[\"//g;s/\"\].*\$//g" | sort | uniq | grep -v -- "-end-"; }
    function changepinbm() { PINCMD3=$(sed "s/\(^.*\)\(bookmarks%22%3A%5B%22\)\(.*\)\(%22%5D.*\$\)/\1\2${1}\4/g" <<< "${PINCMD3}"); }
    function cleanup() { rm ret*html c1.txt "${TMP_IMGS}" h{1..3}.txt "${ORIG_IMGS}"; } 
    
    function main() { 
    eval "${PINCMD1}" 
    eval "${PINCMD2}"
    for ((i=3,lasti=2; i<10000; i++,lasti++)); do 
     pinbm=$(getpinbm "ret${lasti}.html")
     [[ -z "${pinbm}" ]] && break 
     changepinbm "${pinbm}"
     eval "${PINCMD3}" > "ret${i}.html"
    done 
    for a in *.html; do lspinimgs "${a}" >> "${TMP_IMGS}"; done
    mkpinorig "${TMP_IMGS}" "${ORIG_IMGS}"
    IFS=$(echo -en "\n\b") && for a in $(sort "${ORIG_IMGS}" | uniq); do 
     wget --tries=3 -E -e robots=off -nc --random-wait --content-disposition --no-check-certificate -p --restrict-file-names=windows,lowercase,ascii --header "${UA_HEADER}" -nd "$a"  
    done
    cleanup 
    } 
    
    main 
    exit 0
    
    # get all pins for the board
    board_pins = []
    pin_batch = pinterest.board_feed(board_id=target_board['id'], board_url=target_board['url'])
    
    while len(pin_batch) > 0:
        board_pins += pin_batch
        pin_batch = pinterest.board_feed(board_id=target_board['id'], board_url=target_board['url'])
    
    for pin in board_pins:
        url = pin['image']
        # process image url..