Merging HTML files for text analysis in R

r text

I have three different texts, read into the objects txt1, txt2 and txt3:

txt1 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")  
txt2 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")  
txt3 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")

Now I am trying to create a single HTML text file so that I can analyse all of them as if they were one file.

Any idea how to create a single HTML text file from several different HTML text files?

You are on the right track. From there:

library(rvest)
library(tidyverse)

txt1 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")  
txt2 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")  
txt3 <- read_html("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")

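# first link
df1 <- txt1 %>%
    html_nodes('#mw-content-text p') %>%  # select the paragraphs of the article body
    html_text() %>%                       # extract the text of each paragraph
    t() %>%                               # transpose into a 1 x n matrix
    data.frame() %>%                      # convert to a one-row data frame
    unite("text1", sep = " ")             # paste all cells into one; name the new column, join with spaces instead of the default "_"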
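
To see what each step of that chain produces, here is a minimal sketch on a small inline page, so it runs without a network connection. The HTML snippet and the names page, paras, wide and cell are made up for illustration; only the selector and the functions come from the code above.

```r
library(rvest)
library(tidyr)

# Hypothetical stand-in for a downloaded page (read_html also accepts literal HTML)
page <- read_html('<div id="mw-content-text"><p>First paragraph.</p><p>Second paragraph.</p></div>')

# Step 1: one character element per <p> inside the article body
paras <- page %>% html_nodes("#mw-content-text p") %>% html_text()
length(paras)

# Step 2: transpose and convert, giving one row with one column per paragraph
wide <- data.frame(t(paras), stringsAsFactors = FALSE)
dim(wide)

# Step 3: paste all columns of that row into a single cell called "text"
cell <- unite(wide, "text", sep = " ")
cell$text
```

The result is a data frame with one row and one column, which is why the later cbind() step can simply place the three results side by side.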

The same goes for the second and third links:

df2 <- txt2 %>%
   html_nodes('#mw-content-text p') %>%
   html_text() %>% t() %>% data.frame() %>% unite("text2", sep = " ")

df3 <- txt3 %>%
   html_nodes('#mw-content-text p') %>%
   html_text() %>% t() %>% data.frame() %>% unite("text3", sep = " ")

Finally, put everything into a single cell, for example:

df_total <- cbind(df1, df2, df3) %>% unite("text", sep = " ")
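
If you need the combined text as a plain character string rather than a data frame, you can pull the single cell out afterwards. A small sketch, assuming the three one-cell data frames have distinct column names; the placeholder strings, the column names t1, t2, t3 and the name text_only are invented here:

```r
library(tidyr)

# Hypothetical one-cell data frames standing in for df1, df2 and df3
df1 <- data.frame(t1 = "first speech", stringsAsFactors = FALSE)
df2 <- data.frame(t2 = "second speech", stringsAsFactors = FALSE)
df3 <- data.frame(t3 = "third speech", stringsAsFactors = FALSE)

# Bind side by side, then merge the three cells into one column named "text"
df_total <- unite(cbind(df1, df2, df3), "text", sep = " ")

# Extract the single cell as an ordinary character string
text_only <- df_total$text[1]
text_only
```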

Edit:

You can write a loop that parses every page in a vector of links:

txt1 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2010_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")  
txt2 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_Haziran_2011_tarihli_Diyarbak%C4%B1r_mitinginde_yapt%C4%B1%C4%9F%C4%B1_konu%C5%9Fma")  
txt3 <- ("https://tr.wikisource.org/wiki/Recep_Tayyip_Erdo%C4%9Fan%27%C4%B1n_1_%C5%9Eubat_2011_tarihli_AK_Parti_grup_toplant%C4%B1s%C4%B1_konu%C5%9Fmas%C4%B1")

url <- c(txt1, txt2, txt3)        # all the urls

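# the loop that scrapes each page and stores the result in a list
dfList <- lapply(url, function(i)
{
  swimwith <- read_html(i)
  swdf <- swimwith %>%
    html_nodes('#mw-content-text p') %>%
    html_text() %>%
    t() %>%
    data.frame() %>%
    unite("text", sep = " ")   # one cell per page, in a column named "text"
})

# from list to df: stack the pages as rows, then paste the rows into one cell
finaldf1 <- bind_rows(dfList) %>%
  summarise(text = paste(text, collapse = " "))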
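
The same loop also works for 50 pages or more, since only the vector of links changes. If you only need the text itself, a plain character string can be simpler than a data frame. The following is a sketch of that variant, not the method above: it swaps the t()/data.frame()/unite() chain for paste(collapse = " "). The helper name page_text and the inline HTML are made up so the example runs offline.

```r
library(rvest)

# Hypothetical helper: all paragraph text of one page as a single string
page_text <- function(src) {
  read_html(src) %>%
    html_nodes("#mw-content-text p") %>%
    html_text() %>%
    paste(collapse = " ")
}

# Inline HTML stands in for the vector of links; with real pages,
# pass the character vector of URLs instead
pages <- c(
  '<div id="mw-content-text"><p>Speech one.</p></div>',
  '<div id="mw-content-text"><p>Speech two,</p><p>continued.</p></div>'
)

# One string per page, then one string for everything
texts <- vapply(pages, page_text, character(1), USE.NAMES = FALSE)
text_only <- paste(texts, collapse = " ")
text_only
```

Keeping the per-page strings in texts before collapsing them means you can still analyse each speech separately later.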

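Welcome to SO. Unfortunately this is as unclear as your previous question, and will likely end the same way. What you want is very unclear, especially since there is no description or example of the expected output.

What I want is to create one long text made up of these three texts, something like: textonly = txt1 + txt2 + txt3. So how can I create it? I hope it is clear now.

Thank you very much. The logic is clear to me. May I ask one more question? If I have more than 3 articles, say 50, is there another simple way to combine them?

You're welcome! You can put it in a loop, as shown in the edit.

Thank you very much @苏特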