Converting a Facebook .htm file to a data frame in R
I am trying to extract my Facebook chat messages from an .htm file into a proper data frame. rvest has served me well: I extract the HTML nodes (user, meta, p) into vectors and then combine them into a data frame. However, I am stuck on this part:
<div class="thread">
  John, My Name
  <div class="message">
    <div class="message_header">
      <span class="user">My Name</span>
      <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
    </div>
  </div>
  <p>Hello, how are you today</p>
  <!-- other <div class="message"> elements -->
</div>
<!-- other <div class="thread"> elements -->
Here is my code after implementing @Jota's suggestion:
library(rvest)

# Find the length of each thread (its number of child nodes) for looping,
# using html_children() and length()
list <- html_nodes(url, css = ".thread")
count <- sapply(list, html_children)
threadlength <- sapply(count, length)

# Extract the name of each thread using XPath
threadlist <- html_nodes(url2, xpath = '*//div[@class = "thread"]/text()[1]') %>% html_text()

# Create the thread column
# x is the number of rows over which a thread name should be repeated.
# y is the index into the thread column.
# z counts rows within the current thread and ends the inner loop,
#   moving on to the next thread name.
thread <- c()
n <- 0
y <- 0
for (x in threadlength) {
  z <- 0
  n <- n + 1
  repeat {
    y <- y + 1
    z <- z + 1
    thread[y] <- threadlist[n]
    if (z == x) {
      break
    }
  }
}
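The nested loop above repeats each thread name once per row of that thread. A minimal sketch of the same step with base R's `rep()` follows; the values of `threadlist` and `threadlength` here are made-up stand-ins for illustration, not data from the question.

```r
# Stand-in values for illustration only; in the question these vectors
# are produced by the rvest calls above.
threadlist   <- c("John, My Name", "Jane, My Name")
threadlength <- c(2L, 3L)

# Repeat each thread name as many times as its thread has rows.
thread <- rep(threadlist, times = threadlength)

print(thread)
```

One difference worth noting: `rep()` simply emits nothing for a thread whose length is 0, whereas the `repeat` loop would never reach `z == x` in that case and would not terminate.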
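Welcome to StackOverflow! Please edit your question to show the code you have tried. Couldn't you target the divs with class "thread" and get their text directly with html_nodes, e.g. xpath = '*//div[@class="thread"]/text()[1]'?
Thanks @Jota! That is exactly what I needed. Do you have any suggestions on where to read up on XPath?
You can take a look at ... and ...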
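To see what the suggested XPath selects, here is a small self-contained sketch. The HTML string is a made-up two-thread sample modeled on the snippet in the question, and the names `html`, `doc`, and `thread_names` are illustrative assumptions. It uses the `//div[...]` form of the expression rather than `*//div[...]`. `text()[1]` picks the first text node sitting directly inside each thread div, which is where the participant list appears in this layout.

```r
library(rvest)

# Made-up sample modeled on the structure shown in the question.
html <- '<html><body>
<div class="thread">John, My Name
  <div class="message"><div class="message_header">
    <span class="user">My Name</span>
    <span class="meta">Thursday, April 9, 2015 at 12:55am UTC+07</span>
  </div></div>
  <p>Hello, how are you today</p>
</div>
<div class="thread">Jane, My Name
  <div class="message"><div class="message_header">
    <span class="user">Jane</span>
    <span class="meta">Friday, April 10, 2015 at 1:00am UTC+07</span>
  </div></div>
  <p>Hi</p>
</div>
</body></html>'

doc <- read_html(html)

# First text node directly inside each thread div, with surrounding
# whitespace trimmed.
thread_names <- doc %>%
  html_nodes(xpath = '//div[@class="thread"]/text()[1]') %>%
  html_text() %>%
  trimws()

print(thread_names)
```

The `trimws()` call is there because the raw text node usually carries the line break and indentation that follow the names in the file.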