Media crawler in Node.js

node.js web-crawler

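I am working on a project called robot. Its job is to fetch media from the URLs given in an XML config file. The XML config file has a defined format, as you can see in the project's scripts directory.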

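My problem is as follows. There are two parameters:

A list representing the depth of web links. Using the selector (a CSS selector) in each list item, I can find either the media URL or the URL of a sub-page where the media can eventually be found.

An array (arr) containing the sub-page URLs.

A simplified example looks like this: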

var http = require('http');

// Linked list of crawl levels; each node holds the selector for one depth.
var node_list = { /* ... */ next: { /* ... */ next: null } };
var url_arr = [ /* urls */ ];

function fetch(url, node) {
    if (node == null)
        return;
    // do something with the HTTP request here
    var req = http.get(url, function (res) {
        var data = '';
        res.on('data', function (chunk) {
            data += chunk;
        }).on('end', function () {
            // maybe generate more new URLs here
            // get another url_list
            // url_new is a placeholder for a URL extracted from data
            fetch(url_new, node.next);
        });
    });
}

// this needs to run in sync
for (var i = 0; i < url_arr.length; i++) {
    fetch(url_arr[i], node_list);
}

The for loop at the end is how I try to iterate over all the items in url_arr.

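As you can see, with asynchronous HTTP requests the loop starts every request at once, which will eat up all system resources. I cannot control the process. Does anyone have a good idea for solving this problem?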

Or is Node.js not the right tool for this kind of task?

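If the problem is that you have too many HTTP requests running at the same time, you can change the `fetch` function to operate on a stack of URLs.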

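Basically you would do this:

- When `fetch` is called, push the URL onto the stack and check whether a request is already in progress:
  - If no request is running, take the first URL from the stack and process it; otherwise do nothing.
- When the HTTP request finishes, have it take a new URL from the stack and process that one.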

This way you can let the for loop add all the URLs as it does now, but only one URL is processed at a time, so it won't take up too many resources.
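The steps above can be sketched roughly as follows. This is only a minimal illustration under some assumptions: `createFetcher` and the `request(url, done)` callback are hypothetical names that do not appear in the question, and `request` stands in for whatever wraps the real `http.get` call. The sketch takes URLs from the front of the array, so they are processed in the order they were added.

```javascript
// Sketch: a fetch() that queues URLs and processes only one at a time.
// `request(url, done)` is a hypothetical stand-in for the real HTTP logic;
// it must call done(newUrls) once the response has been fully handled.
function createFetcher(request) {
    var pending = [];      // URLs waiting to be processed
    var running = false;   // true while a request is in flight

    function next() {
        if (pending.length === 0) {
            running = false;   // nothing left, go idle
            return;
        }
        running = true;
        var url = pending.shift();
        request(url, function (newUrls) {
            // URLs discovered in the response join the queue
            (newUrls || []).forEach(function (u) { pending.push(u); });
            next();            // start the next request only now
        });
    }

    // Same call shape as before: the for loop can keep calling fetch(url).
    return function fetch(url) {
        pending.push(url);
        if (!running) next();
    };
}
```

In the crawler, `request` would presumably issue `http.get(url, ...)` and call `done` from the response's `'end'` handler, passing along any sub-page URLs it found. Allowing a small fixed number of requests in flight instead of exactly one would be a natural variation on the same idea.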

I found a project called gracefully that solves my problem.