Converting a Facebook .htm file to a data frame in R

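I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest has served me well: I extract the HTML nodes (user, meta, p) into vectors and then combine them into a data frame. However, I am stuck on this part: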

<div class="thread">
    John, My Name"
    <div class="message">
        <div class="message_header">
            <span class="user">My Name</span>
            <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
        </div>
    </div>
    <p>Hello, how are you today</p>


//Other <div class = "message">
//Other <div class = "thread"> 

Here is my code after implementing @Jota's suggestion:

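library(rvest)

# Find the number of children in each thread with html_children() and length();
# the loop below uses these counts.
# `url` and `url2` are assumed to hold documents already parsed with read_html().
threads <- html_nodes(url, css = ".thread")
# lapply() rather than sapply(): sapply() can collapse the result into a matrix
# when every thread has the same number of children, which breaks the count.
children <- lapply(threads, html_children)
threadlength <- sapply(children, length)
# Extract the name of each thread using XPath
threadlist <- html_nodes(url2, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()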

# Create the thread column.
# x is the number of rows over which the current thread name is repeated.
# y is the running index into the thread column.
# z counts the rows written for the current thread; when it reaches x the
# inner loop ends and n moves on to the next thread name.
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n+1
  repeat{
    y <- y+1
    z <- z+1
    thread[y] <- threadlist[n]
    if (z == x){
      break
    }
  }
}
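
The loop above repeats each thread name once per counted child of its thread. A minimal vectorized sketch of the same step follows. It assumes `threadlist` and `threadlength` have the same length and line up thread by thread; the sample values are made up for illustration.

```r
# Hypothetical stand-ins for the vectors built above:
# one name per thread, and the number of rows each thread needs.
threadlist   <- c("John, My Name", "Jane, My Name")
threadlength <- c(3L, 2L)

# With a `times` vector of the same length as the input, rep() repeats
# element i of threadlist exactly threadlength[i] times.
thread <- rep(threadlist, times = threadlength)

# The result has one entry per row of the eventual data frame.
stopifnot(length(thread) == sum(threadlength))
```

rep() also copes with a count of 0 (it simply emits nothing for that thread), whereas the repeat loop never reaches its break condition in that case.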
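Welcome to StackOverflow! Please make sure to edit your question to show the code you have tried. Couldn't you target the divs with class "thread" and get their text directly with something like html_nodes(doc, xpath = '*//div[@class = "thread"]/text()[1]')?

Thanks @Jota! This is exactly what I needed. Do you have any suggestions on where to read up on xpath?

You can have a look at [link] and [link].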
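
The XPath idea from the comments can be tried on a small inline document. This is only a sketch: the HTML string is a made-up sample modelled on the structure shown in the question, and `doc` and `thread_names` are illustrative names, not taken from the original code.

```r
library(rvest)

# Made-up sample with two threads, following the layout in the question.
html <- '<html><body>
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div>
  <p>Hello, how are you today</p>
</div>
<div class="thread">Jane, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jane</span>
    <span class="meta">Friday, April 10, 2015 at 1:00pm UTC+07</span>
  </div></div>
  <p>Hi</p>
</div>
</body></html>'

doc <- read_html(html)

# text()[1] selects only the first text node that sits directly inside each
# thread div, so the text of nested message and p elements is left out.
thread_names <- doc %>%
  html_nodes(xpath = '//div[@class = "thread"]/text()[1]') %>%
  html_text() %>%
  trimws()  # drop the newline and indentation that follow the name

print(thread_names)
```

Because `text()` matches text nodes rather than elements, this picks up the thread title even though it is not wrapped in a tag of its own.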