Nightmare.js web scraping not working on server

javascript node.js web-scraping server

For my (open source) Node.js project, I need to implement a web scraper that searches for and downloads URLs. Currently I am using Nightmare:

exports.scrap = function (cb) {
    _callback = cb
    _downloadedLinks = 0
    let nightmare = new Nightmare({ show: false })
    const url = 'https://www.bundestag.de/services/opendata'
    // we request nightmare to browse to the bundestag.de url and extract the whole inner html
    nightmare
        .goto(url)
        .wait('body')
        .evaluate(() => document.querySelector('body').innerHTML)
        .end()
        .then(response => {
            _downloadedLinks = 0
            let validLinks = extractLinks(response)
            _foundLinks = validLinks.length
            logger.info("[scraper] found " + validLinks.length + " valid links.")
            if (validLinks.length > 0) {
                validLinks.forEach(href => {
                    downloadFileFromHref(BT_LINK + href)
                });
            } else {
                logger.info("[scraper] did not download any files.")
                _callback()
            }
        }).catch(error => {
            logger.info("[scraper] did not download any files.")
            _callback()
        });
    // extract the links we need
    let extractLinks = html => {
        const data = [];
        const $ = cheerio.load(html);
        $('.bt-link-dokument').each(function () {
            data.push(this.attribs.href);
        });
        return data.filter(checkDocumentLink)
    }
}

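This works perfectly when run on my local machine. However, when I run it on my Ubuntu server (AWS), there seems to be a problem. I read that this is because no graphical interface is available on my server, so I tried running Xvfb on it.

This is my file

When running pm2 ls, I can see that Xvfb and my server are both running: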

ubuntu@ip-XXX-XX-XX-XXX:~/bundeszirkus-server/current$ pm2 ls
┌─────────────────────┬────┬─────────┬──────┬───────┬────────┬─────────┬────────┬─────┬────────────┬────────┬──────────┐
│ App name            │ id │ version │ mode │ pid   │ status │ restart │ uptime │ cpu │ mem        │ user   │ watching │
├─────────────────────┼────┼─────────┼──────┼───────┼────────┼─────────┼────────┼─────┼────────────┼────────┼──────────┤
│ Xvfb                │ 1  │ N/A     │ fork │ 26063 │ online │ 6       │ 14m    │ 0%  │ 17.5 MB    │ ubuntu │ disabled │
│ bundeszirkus-server │ 0  │ 1.0.0   │ fork │ 26057 │ online │ 6       │ 14m    │ 0%  │ 246.4 MB   │ ubuntu │ disabled │
└─────────────────────┴────┴─────────┴──────┴───────┴────────┴─────────┴────────┴─────┴────────────┴────────┴──────────┘
 Use `pm2 show <id|name>` to get more details about an app

By contrast, it does work when run on my local (Ubuntu) machine:

{"message":"Starting server!","level":"info","timestamp":"2020-01-11 12:52:47"}
{"message":"Starting initial scraping.","level":"info","timestamp":"2020-01-11 12:52:47"}
{"message":"[scraper] found 5 valid links.","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] downloading file: 19138-data.xml from href: http://www.bundestag.de/resource/blob/674998/86249f57e79b8308e820d6581e7e2a95/19138-data.xml","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] downloading file: 19136-data.xml from href: http://www.bundestag.de/resource/blob/674328/0e9d258d50d08923fe6d6ad1381bdb3f/19136-data.xml","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] downloading file: 19137-data.xml from href: http://www.bundestag.de/resource/blob/674730/2bc751b619488227c9267e3cbe12c4c3/19137-data.xml","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] downloading file: 19135-data.xml from href: http://www.bundestag.de/resource/blob/673576/147b80c74d6d681833568cfcf36f9670/19135-data.xml","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] downloading file: 19134-data.xml from href: http://www.bundestag.de/resource/blob/673116/982f9d0ec845b85bddd289ede4a589fd/19134-data.xml","level":"info","timestamp":"2020-01-11 12:52:49"}
{"message":"[scraper] finished downloading  all 5 files.","level":"info","timestamp":"2020-01-11 12:52:51"}
{"message":"Loading data.","level":"info","timestamp":"2020-01-11 12:52:51"}

I am a bit lost here and don't know how to find the missing piece. Any help is much appreciated.
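One way to narrow this down: the catch handler in the scraper above logs the same message as the "no links found" branch, so the real failure reason never reaches the log. A minimal sketch of a wrapper that records why the scrape failed is shown here. The names scrapeWithDiagnostics, startScrape, and logger are hypothetical stand-ins, not part of the project or of Nightmare.

```javascript
// Runs a scraping function and logs the failure reason instead of hiding it.
// `startScrape` is any function returning a promise (hypothetical stand-in for
// the Nightmare chain); `logger` is any object with an error(message) method.
async function scrapeWithDiagnostics(startScrape, logger) {
  try {
    const html = await startScrape();
    return { ok: true, html };
  } catch (error) {
    // Rejections are not always Error instances, so handle plain values too.
    let reason;
    if (error instanceof Error) {
      reason = error.message;
    } else if (typeof error === 'string') {
      reason = error;
    } else {
      reason = JSON.stringify(error);
    }
    logger.error('[scraper] scraping failed: ' + reason);
    return { ok: false, reason };
  }
}
```

With Nightmare, startScrape would be a function returning the goto/wait/evaluate/end chain. Running the server with the environment variable DEBUG=nightmare* should additionally print Nightmare's own debug output, which usually shows whether Electron started at all.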

After doing the following, it now works:

Add xvfb to the code, as follows:

const Xvfb = require('xvfb');

let xvfb = new Xvfb();
try {
  xvfb.startSync();
}
catch (e) {
  console.log(e);
}
// scraping (asynchronous, so let it finish before stopping Xvfb)
xvfb.stopSync();

Change this line:

.wait('body')

to

.wait(2000)

With a number as its argument, .wait pauses for that many milliseconds instead of waiting for the selector to appear.
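Because the scraping itself is asynchronous, calling stopSync() straight after kicking it off would tear the display down while Electron is still using it. A minimal sketch of one way to tie the display's lifetime to the scrape is shown here. The helper name withVirtualDisplay and the task argument are hypothetical; the display argument only needs startSync() and stopSync() methods, which an Xvfb instance from the xvfb package provides.

```javascript
// Starts a virtual display, runs an async task, and always stops the display
// afterwards, even when the task rejects.
// `display`: any object exposing startSync() and stopSync().
// `task`: hypothetical async function that performs the scraping.
async function withVirtualDisplay(display, task) {
  display.startSync();
  try {
    return await task();
  } finally {
    display.stopSync();
  }
}
```

A call might look like withVirtualDisplay(new Xvfb(), () => runScraper()), where runScraper is assumed to return a promise that settles once all downloads are done.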