Converting a Facebook .htm file to a data frame in R

r regex web-scraping

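I am trying to extract my Facebook chat messages from an .htm file into a proper data frame.

rvest has served me well so far: I extract the HTML nodes (user, meta, p) into vectors and then build a data frame from them. However, I am stuck on this part: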

<div class="thread">
    John, My Name
    <div class="message">
        <div class="message_header">
            <span class="user">My Name</span>
            <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
        </div>
    </div>
    <p>Hello, how are you today</p>


//Other <div class = "message">
//Other <div class = "thread">
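
For readers following along, the step described above (pulling the user, meta and p nodes into vectors and combining them) might look roughly like the sketch below. The sample HTML string, the second message, and the names `page` and `messages` are assumptions made for illustration; closing tags were added so the fragment parses cleanly.

```r
library(rvest)

# Hypothetical sample that mirrors the structure shown in the question.
html <- '
<div class="thread">
    John, My Name
    <div class="message">
        <div class="message_header">
            <span class="user">My Name</span>
            <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
        </div>
    </div>
    <p>Hello, how are you today</p>
    <div class="message">
        <div class="message_header">
            <span class="user">John</span>
            <span class="meta">Thursday, April 9, 2015 at 12:56am UTC+07</span>
        </div>
    </div>
    <p>Fine, thanks</p>
</div>'
page <- read_html(html)

# One vector per field; each has one element per message.
user <- page %>% html_nodes(".user") %>% html_text()
meta <- page %>% html_nodes(".meta") %>% html_text()
text <- page %>% html_nodes(".thread > p") %>% html_text()

# Combine the vectors into a data frame, one row per message.
messages <- data.frame(user, meta, text, stringsAsFactors = FALSE)
```

This only works when every message has exactly one user, one meta and one p node, so that the three vectors line up row by row.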

Here is my code after implementing @Jota's suggestion:

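library(rvest)

# `url` is assumed to hold the document parsed with read_html(). The original
# code mixed `url` and `url2`; they are treated here as the same object.

# Find the length of each thread for looping, using html_children() and length()
threads <- html_nodes(url, css = ".thread")  # renamed from `list` to avoid masking base::list
count <- lapply(threads, html_children)      # lapply always returns a list, one entry per thread
threadlength <- vapply(count, length, integer(1))
# Extract the name of each thread using XPath
threadlist <- html_nodes(url, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()

# Create the thread column
# x is the number of rows for which a thread topic is repeated.
# y is the running index into the thread column.
# z counts repetitions within a thread and closes the inner loop.
# n is the index of the current thread topic.
thread <- character(sum(threadlength))
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n + 1
  if (x == 0) next  # an empty thread would otherwise never satisfy z == x
  repeat {
    y <- y + 1
    z <- z + 1
    thread[y] <- threadlist[n]
    if (z == x) {
      break
    }
  }
}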
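
The nested loop above repeats each thread name once per child element. As a possible simplification, base R's `rep()` with a `times` vector does the same in one call. The values below are hypothetical stand-ins for the `threadlist` and `threadlength` vectors built from the real file.

```r
# Hypothetical values standing in for the vectors built from the parsed file.
threadlist <- c("John, My Name", "Alice, My Name")
threadlength <- c(2L, 3L)

# Repeat each thread name as many times as its thread has rows.
thread <- rep(threadlist, times = threadlength)
```

`rep()` also copes with a count of zero, in which case the corresponding name is simply left out.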

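Welcome to StackOverflow! Please make sure to edit your question to show the code you have tried. Couldn't you target the divs with class "thread" and get their text directly through html_nodes, e.g. xpath = '*//div[@class="thread"]/text()[1]'?
Thanks @Jota! This is exactly what I needed. Do you have any suggestions on where to read up on XPath?
You can take a look at and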
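
To make the XPath suggestion concrete: `text()[1]` selects the first text node that is a direct child of each matching div, which in this layout is the thread name sitting ahead of the message elements. A minimal sketch follows, assuming a made-up two-thread document; the name `page` and the second thread are illustrative only.

```r
library(rvest)

# Hypothetical two-thread document with the same layout as the question.
page <- read_html('
<div class="thread">
    John, My Name
    <div class="message"><span class="user">My Name</span></div>
    <p>Hello, how are you today</p>
</div>
<div class="thread">
    Alice, My Name
    <div class="message"><span class="user">Alice</span></div>
    <p>Hi</p>
</div>')

# First direct text node of every div.thread; trim = TRUE drops the
# surrounding newlines and indentation.
threadlist <- page %>%
  html_nodes(xpath = '//div[@class="thread"]/text()[1]') %>%
  html_text(trim = TRUE)
```

Text inside nested elements such as the span or p is not matched, because `text()` here only looks at direct children of the thread div.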