Node.js: Scraping Hacker News via x-ray/nodejs

How can I scrape Hacker News via x-ray/nodejs?

I would like to get something like this out of it:

[
  {title1, comment1},
  {title2, comment2},
  ...
  {"‘Minimal’ cell raises stakes in race to harness synthetic life", 48}
  ...
  {title 30, comment 30}
]
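There is a news table, but I don't know how to scrape it... Each story on the site consists of three table rows. They do not share a parent that is unique to them. So the structure looks like this: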

<tbody>
  <tr class="spacer"> //Markup 1
  <tr class="athing"> //Headline 1 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 1 (.age+ a contains comments)
  <tr class="spacer"> //Markup 2
  <tr class="athing"> //Headline 2 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 2 (.age+ a contains comments)
  ...
  <tr class="spacer"> //Markup 30
  <tr class="athing"> //Headline 30 ('.deadmark+ a' contains title)
  <tr class>          //Meta Information 30 (.age+ a contains comments)

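The second approach returns 30 titles and 29 comment counts... I don't think there is any way to map them onto each other, because nothing indicates which of the 30 titles is the one without comments.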
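The mismatch described above can be reproduced with made-up data. This is only an illustrative sketch: the titles and counts are invented, and the point is just that pairing two flat arrays by index goes wrong as soon as one story has no comment link.

```javascript
// Made-up data: three titles, but only two comment links were found,
// because the second story has no comments yet.
var titles = ["Story A", "Story B (no comments yet)", "Story C"];
var commentCounts = ["12", "7"];

// Naive pairing by index
var paired = titles.map(function (title, i) {
    return { title: title, comments: commentCounts[i] };
});

// "7" really belongs to Story C, but it ends up on Story B,
// and Story C is left without a value.
console.log(paired);
```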


Any help is appreciated.

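The markup is not easy to scrape with the X-ray package, because there is no way to reference the current context in a CSS selector. That would have been useful for getting the next tr sibling after the tr.athing row in order to reach the comments.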

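We can still use the adjacent sibling combinator (+) to get to the next row. However, instead of targeting the optional comments link, we take the full text of the row and then extract the comment count with a regular expression. If no comment count is present, the value is set to 0.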
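The regular-expression step can be sketched on its own as follows. The helper name extractCommentCount and the sample row texts are made up for illustration; the pattern uses \s+ rather than a literal space, which is an assumption meant to also cover a non-breaking space between the number and the word.

```javascript
// Hypothetical helper: pull the comment count out of the full text of a meta row.
// Returns the count as a string, or "0" when the row has no "N comments" part.
function extractCommentCount(rowText) {
    var match = rowText.match(/(\d+)\s+comments?/);
    return match ? match[1] : "0";
}

// Invented sample row texts
console.log(extractCommentCount("85 points by alice 3 hours ago | hide | 85 comments"));
console.log(extractCommentCount("1 point by bob 1 hour ago | hide | discuss"));
console.log(extractCommentCount("2 points by carol 5 minutes ago | hide | 1 comment"));
```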

Complete working code:

var Xray = require('x-ray');
var x = Xray();

x("https://news.ycombinator.com/", {
    title: ["tr.athing .deadmark+ a"],
    comments: ["tr.athing + tr"]
})(function (err, obj) {
    // extracting comments and mapping into an array of objects
    var result = obj.comments.map(function (elm, index) {
        var match = elm.match(/(\d+) comments?/);
        return {
            title: obj.title[index],
            comments: match ? match[1]: "0"
        };
    });
    console.log(result);
});
Currently prints:

[ { title: 'Follow the money: what Apple vs. the FBI is really about',
    comments: '85' },
  { title: 'Unable to open links in Safari, Mail or Messages on iOS 9.3',
    comments: '12' },
  { title: 'Gogs – Go Git Service', comments: '13' },
  { title: 'Ubuntu Tablet now available for pre-order',
    comments: '56' },
  ...
  { title: 'American Tech Giants Face Fight in Europe Over Encrypted Data',
    comments: '7' },
  { title: 'Moving Beyond the OOP Obsession', comments: '34' } ]

It would also be good to convert the comment count to a number and to handle errors.
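One way that suggestion might look is sketched below. The helper names buildStories and handleResult are hypothetical, and the sketch assumes the callback receives the same object shape as in the answer's code (parallel title and comments arrays). It is not a tested replacement for that code.

```javascript
// Hypothetical helper: turn the two parallel arrays into story objects,
// with the comment count converted to a number.
function buildStories(obj) {
    return obj.comments.map(function (rowText, index) {
        var match = rowText.match(/(\d+)\s+comments?/);
        return {
            title: obj.title[index],
            comments: match ? parseInt(match[1], 10) : 0
        };
    });
}

// Hypothetical Node-style callback body: check the error argument and the
// shape of the result before mapping; return an empty list on failure.
function handleResult(err, obj) {
    if (err) {
        console.error("Scraping failed:", err.message);
        return [];
    }
    if (!obj || !Array.isArray(obj.title) || !Array.isArray(obj.comments)) {
        console.error("Unexpected result shape");
        return [];
    }
    return buildStories(obj);
}

// Intended use with the answer's code (not executed here):
// x(url, selectors)(function (err, obj) { console.log(handleResult(err, obj)); });
```

Returning an empty array on failure keeps the caller simple; rethrowing or passing the error on would be an equally reasonable choice.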