Capturing elements as a JavaScript array using XPath in CasperJS

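I've been using CasperJS for a web-scraping project, but I'm having some trouble getting it to work perfectly. getElementsAttribute has worked great as a way to capture href and title information from tables, but in some cases the table entries are not hyperlinked and still need to be scraped. Here is the beginning of the code: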

// Load utilities
var utils = require('utils');
var client = require('clientutils');
var fs = require('fs');
var x = require('casper').selectXPath;
var casper = require('casper').create({
    pageSettings: {
        loadImages: false,
        loadPlugins: false
    },
    clientScripts: ['C:/casperjs/lib/jquery.min.js', 'C:/casperjs/lib/jquery.csv-0.71.min.js']
});

// Choose Main URL and Target Links
var mainURL = "http://en.wikipedia.org/wiki/Identification_badges_of_the_United_States_military";
var mainAttribute = '//*[@id="mw-content-text"]/ul/li/div/div/p/a';
var mainElement = '//*[@id="mw-content-text"]/ul/li/div/div/p';

casper.start();

casper.open(mainURL).then(function () {
    // Choose Links from Main URL
    mainLinks = this.getElementsAttribute(x(mainAttribute), 'href');
    mainTitle = this.getElementsAttribute(x(mainAttribute), 'title');
    mainFetch = document.getElementsByTagName(x(mainElement));

    utils.dump(mainFetch);
});

casper.run();

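getElementsAttribute gives me the correct information, but getElementsByTagName only gives me an "undefined" or empty result, even when I try to work with the inner content. (this.getElementsByTagName doesn't seem to work either.)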

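Basically, I want to grab the text in the cases where the hyperlink is missing and, using a single XPath selector, push it into an array with the same size and order as mainLinks and mainTitle. It seems like there should be a simple way to do this, but I haven't been able to find it. Can someone point me in the right direction?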

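You could try using

mainFetch = this.getElementsInfo(x(mainElement));

As you can see in the output excerpt below, it captures all of the child elements, and the html field can be used to filter out the entries that do not start with an "a" element: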
{
    "attributes": {},
    "height": 54,
    "html": "<a href=\"/wiki/CPO_Command_Identification_Badge\" title=\"CPO Command Identification Badge\" class=\"mw-redirect\">Chief Petty Officer Command Identification Badges</a>",
    "nodeName": "p",
    "tag": "<p><a href=\"/wiki/CPO_Command_Identification_Badge\" title=\"CPO Command Identification Badge\" class=\"mw-redirect\">Chief Petty Officer Command Identification Badges</a></p>",
    "text": "Chief Petty Officer Command Identification Badges",
    "visible": true,
    "width": 147,
    "x": 185,
    "y": 9302
},
{
    "attributes": {},
    "height": 18,
    "html": "Law Enforcement Badges",
    "nodeName": "p",
    "tag": "<p>Law Enforcement Badges</p>",
    "text": "Law Enforcement Badges",
    "visible": true,
    "width": 147,
    "x": 185,
    "y": 9526
}
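
To illustrate the filtering step, here is a minimal sketch. It assumes the array returned by getElementsInfo has the shape shown in the output above. The helper names isLinked and splitByLink, and the trimmed sampleInfos data, are made up for this example and are not part of CasperJS.

```javascript
// Hypothetical helpers: `infos` is assumed to look like the array returned by
// this.getElementsInfo(x(mainElement)), i.e. objects with `html` and `text`.

// An entry counts as linked when its inner HTML begins with an <a> tag.
function isLinked(info) {
    return /^\s*<a[\s>]/i.test(info.html);
}

// Split the entries into linked and plain ones, remembering each original
// position so the results can be matched back to the page order later.
function splitByLink(infos) {
    var linked = [];
    var plain = [];
    infos.forEach(function (info, index) {
        var entry = { index: index, text: info.text };
        if (isLinked(info)) {
            linked.push(entry);
        } else {
            plain.push(entry);
        }
    });
    return { linked: linked, plain: plain };
}

// Sample data trimmed down from the output excerpt above.
var sampleInfos = [
    {
        html: '<a href="/wiki/CPO_Command_Identification_Badge" title="CPO Command Identification Badge" class="mw-redirect">Chief Petty Officer Command Identification Badges</a>',
        text: 'Chief Petty Officer Command Identification Badges'
    },
    {
        html: 'Law Enforcement Badges',
        text: 'Law Enforcement Badges'
    }
];

var result = splitByLink(sampleInfos);
```

Inside the CasperJS step you would pass mainFetch in place of sampleInfos. Keeping the original index on each entry is one way to keep the unlinked text aligned with the rest of the list.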

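Thanks, I think this might work. I'll get back to you after I've tried it out!

It's working much more smoothly now. However, what would you recommend as the best way to filter the html data to get the href attribute?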
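
One possible approach to the href question, offered as a rough sketch rather than a recommended method: pull the attribute out of the html string with a regular expression and build three arrays that stay in the same order as the getElementsInfo results. The names getAnchorAttr and toAlignedArrays are hypothetical, the sample data is trimmed from the output above, and the regex assumes double-quoted attribute values and a plain attribute name such as href or title. Regex matching on HTML is fragile, and values come back exactly as written in the markup (entities such as &amp;amp; are not decoded).

```javascript
// Hypothetical helper: read one attribute from the first <a> tag in an HTML
// string. Assumes double-quoted values; returns null when nothing matches.
function getAnchorAttr(html, name) {
    var pattern = new RegExp('<a\\b[^>]*\\s' + name + '\\s*=\\s*"([^"]*)"', 'i');
    var match = pattern.exec(html);
    return match ? match[1] : null;
}

// Build three arrays of equal length and order. Entries without a link get
// null for href and title, but still contribute their text.
function toAlignedArrays(infos) {
    var links = [];
    var titles = [];
    var texts = [];
    infos.forEach(function (info) {
        links.push(getAnchorAttr(info.html, 'href'));
        titles.push(getAnchorAttr(info.html, 'title'));
        texts.push(info.text);
    });
    return { links: links, titles: titles, texts: texts };
}

// Sample data trimmed down from the getElementsInfo output shown earlier.
var sampleEntries = [
    {
        html: '<a href="/wiki/CPO_Command_Identification_Badge" title="CPO Command Identification Badge" class="mw-redirect">Chief Petty Officer Command Identification Badges</a>',
        text: 'Chief Petty Officer Command Identification Badges'
    },
    {
        html: 'Law Enforcement Badges',
        text: 'Law Enforcement Badges'
    }
];

var aligned = toAlignedArrays(sampleEntries);
```

With the real data you would call toAlignedArrays(mainFetch) inside the then() callback. Unlinked rows show up as null in links and titles, so all three arrays keep the same size and order.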