JavaScript: extracting data from HTML


javascript algorithm ecmascript-6


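I have a scraper class that collects players from a popular odds portal. The site does not show the exact names; it uses an abbreviated form, e.g. Rafael -> R. Fortunately, the full names can be found in the link in slugified form (nadal-rafael).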

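I came up with a method, processPlayers, that tries to solve this problem. It works for the simpler cases, but it fails if the player has a hyphen in their name or has two first names.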

I wrote some tests to demonstrate the HTML structure and the problem.

const cheerio = require('cheerio')

class Scraper {
  /**
   * Converts a html string to a cheerio object
   * @param {String} html The html string
   * @return {Object} The cheerio object
   */
  htmlToDom(html) {
    return cheerio.load(html)
  }

  /**
   * Returns the number of parts the name would have if it were slugified.
   * It takes into account that the name may contain hyphens:
   * Leopold von Sacher-Masoch -> leopold-von-sacher-masoch
   * @param {Array} a_name The name split by spaces (' ')
   * @return {Integer} The length of the name
   */
  getNameLength(a_name) {
    let name = a_name.length > 1 ? a_name.join(' ') : a_name[0]
    return a_name.length + name.split('-').length - 1
  }

  capitalize(a_name) {
    let res = []
    a_name.forEach(str => {
      res.push(str.substr(0, 1).toUpperCase() + str.substr(1))
    })
    return res.join(' ')
  }

  processPlayers(players) {
    let link = players('a')
    let href = link.attr('href')
    let a_players = link.text().split('-')
    let a_href = href.split('/')
    let a_link = a_href[a_href.length - 2].split('-')
    let a_player1 = a_players[0].trim().split(' ')
    let a_player2 = a_players[1].trim().split(' ')
    let a_player1_lastName = a_player1.slice(0, -1)
    let a_player2_lastName = a_player2.slice(0, -1)
    let a_player1ShortFirstName = [a_player1[a_player1.length - 1]]
    let a_player2ShortFirstName = [a_player2[a_player2.length - 1]]
    let p1_lnLength = this.getNameLength(a_player1_lastName)
    let p1_fnLength = this.getNameLength(a_player1ShortFirstName)
    let p2_lnLength = this.getNameLength(a_player2_lastName)
    let p2_fnLength = this.getNameLength(a_player2ShortFirstName)
    let p1_length = p1_lnLength + p1_fnLength
    let p2_length = p2_lnLength + p2_fnLength
    let player1FirstName = this.capitalize(a_link.slice(p1_lnLength, p1_length))
    let player2FirstName = this.capitalize(a_link.slice(p1_length + p2_lnLength, p1_length + p2_length))
    return {
      p1: {
        firstName: player1FirstName,
        lastName: a_player1_lastName.join(' ')
      },
      p2: {
        firstName: player2FirstName,
        lastName: a_player2_lastName.join(' ')
      }
    }
  }
}
// Tests ===============================================================
test('simple case', function() {
  let playersCell = `
  `
  const scraper = new Scraper()
  const td = scraper.htmlToDom(playersCell)
  const players = scraper.processPlayers(td)
  equal(players.p1.firstName, 'Anastasia')
  equal(players.p1.lastName, 'Pavlyuchenkova')
  deepEqual(players.p2, {
    firstName: 'Sara',
    lastName: 'Sorribes Tormo',
  })
});
// =====================================================================
test('hyphen in the last name', function() {
  let playersCell = `
  `
  const scraper = new Scraper()
  const td = scraper.htmlToDom(playersCell)
  const players = scraper.processPlayers(td)
  equal(players.p2.firstName, 'Mariana')
  equal(players.p2.lastName, 'Duque-Marino')
});
// =====================================================================
test('hyphen in the first name', function() {
  let playersCell = `
  `
  const scraper = new Scraper()
  const td = scraper.htmlToDom(playersCell)
  const players = scraper.processPlayers(td)
  equal(players.p1.firstName, 'Jo-Wilfried')
  equal(players.p1.lastName, 'Tsonga')
});
test('two first names', function() {
  let playersCell = `
  `
  const scraper = new Scraper()
  const td = scraper.htmlToDom(playersCell)
  const players = scraper.processPlayers(td)
  equal(players.p2.firstName, 'Blanco Garbine')
  equal(players.p2.lastName, 'Muguruza')
});
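It looks like the problem is determining the last name and the first name correctly in the "difficult" cases.

You split the text into two full names: OK.
Then you split each full name into words: OK.
Then you make a mistake: you assume that the first name is always the last word and that everything before it is the last name.
In fact, the last name consists of all the words that are not in abbreviated form, and the rest is the first name.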
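The rule above can be sketched roughly as follows. This is only an illustration, assuming that every abbreviated word in the link text ends with a dot (as in "Sorribes Tormo S."); the helper name `splitAbbreviatedName` is hypothetical and not part of the Scraper class.

```javascript
// Hypothetical helper: separates the last name from the abbreviated first-name part.
// Assumption: abbreviated words always end with a dot, e.g. "S." or "J-W.".
function splitAbbreviatedName(fullName) {
  const words = fullName.trim().split(/\s+/)
  const isAbbreviated = word => word.endsWith('.')
  return {
    // Every word that is NOT abbreviated belongs to the last name
    lastName: words.filter(word => !isAbbreviated(word)).join(' '),
    // The abbreviated words are the initials of the first name(s)
    initials: words.filter(isAbbreviated).join(' ')
  }
}

console.log(splitAbbreviatedName('Sorribes Tormo S.'))
// { lastName: 'Sorribes Tormo', initials: 'S.' }
console.log(splitAbbreviatedName('Tsonga J-W.'))
// { lastName: 'Tsonga', initials: 'J-W.' }
```

Note that this does not depend on the position of a word, only on whether it is abbreviated, so multi-word last names and multiple initials are handled the same way.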

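I solved your problem with this approach.
Please find the updated Scraper class below
(I have removed the functions that are no longer used).
It passes all the tests.
Sorry if it is not very readable. It is hard to read and understand, but it is a very compact (and effective) solution :)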
class Scraper {
  /**
   * Converts a html string to a cheerio object
   * @param {String} html The html string
   * @return {Object} The cheerio object
   */
  htmlToDom(html) {
    return cheerio.load(html)
  }

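  processPlayers(players) {
    let href = players('a').attr('href')
    // Split the link text into the two abbreviated full names
    let N = players('a').text().trim().split(/\s+-\s+/).map(n => {
      // Build a pattern that finds this player's slug inside href:
      // '-' -> \S+, '.' -> [^-]+ (rest of an abbreviated word), whitespace -> any character
      let r = new RegExp('.+('+n.replace(/-/g,'\\S+').replace(/\./g,'[^-]+').replace(/\s/g,'.')+').+','i')
      // The last name is the full name minus every abbreviated word (those ending with '.')
      let p = {lastName: n.replace(/\s\S+\./g,'')}
      href.match(r).map(m => {
        // Strip the last name from the slug, then capitalize the remaining parts
        r = new RegExp(p.lastName.replace(/\s/g,'.') + '-', 'i')
        m = m.replace(r,'').split(/-/).map(i => {return i.substring(0,1).toUpperCase() + i.substring(1)})
        // Replace each initial with the slug part that starts with it; keep other characters (' ', '-') as they are
        p.firstName = n.split(p.lastName + ' ')[1].replace(/\./g,'').split('').map(l => {return m[0] && m[0].indexOf(l) === 0 ? m.shift() : l}).join('')
      })
      return p
    })
    return {p1: N[0], p2: N[1]}
  }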
}
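For readers who find the compact version hard to follow, the step that turns initials into full first names could be written more verbosely along these lines. This is a sketch of the same idea, not the code above: the function name `expandFirstName` and its parameters are hypothetical, and it assumes the slug of a single player has already been isolated and split on '-', with the last-name parts coming first.

```javascript
// Hypothetical, more verbose variant of the initials-matching step.
// slugParts: one player's slug split on '-', e.g. ['tsonga', 'jo', 'wilfried']
// lastName:  the last name as displayed, e.g. 'Tsonga' or 'Duque-Marino'
// initials:  the abbreviated first name(s) as displayed, e.g. 'J-W.' or 'B. G.'
function expandFirstName(slugParts, lastName, initials) {
  const capitalize = s => s.charAt(0).toUpperCase() + s.slice(1)
  // Assumption: the last name takes one slug part per word (split on spaces and hyphens)
  const lastNameWordCount = lastName.split(/[\s-]+/).length
  const candidates = slugParts.slice(lastNameWordCount).map(capitalize)
  // Drop the dots, then swap each initial for the next slug part that starts with it.
  // Separators (' ' and '-') are not matched by the pattern, so they stay in place.
  return initials.replace(/\./g, '').replace(/[^\s-]+/g, initial =>
    candidates.length && candidates[0].startsWith(initial) ? candidates.shift() : initial
  )
}

console.log(expandFirstName(['tsonga', 'jo', 'wilfried'], 'Tsonga', 'J-W.'))
// Jo-Wilfried
console.log(expandFirstName(['duque', 'marino', 'mariana'], 'Duque-Marino', 'M.'))
// Mariana
```

Keeping the separators from the displayed initials is what preserves the hyphen in names like Jo-Wilfried, since the slug itself cannot distinguish a hyphen from a space.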