Improving my R code - advice needed on a better way of coding?

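My code works. It is a web-scraping script that first fetches a list of URLs from a web page and then runs through all of them in a for loop. Inside the loop it extracts some information and saves it to a data frame, which I create empty before the loop. The process uses rbind and works well.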

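However, I feel this code is not optimal and that there may be a package for this; I think the solution would be lapply... or maybe not. I am hoping someone can give me a pointer to a better way of coding this (if one exists) and how to implement it.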

library(rvest)

URL <- "http://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1"

WS <- read_html(URL)

URLs <- WS %>% html_nodes(".hide-for-pad .vereinprofil_tooltip") %>% html_attr("href") %>% as.character()
URLs <- paste0("http://www.transfermarkt.com",URLs)

Catcher1 <- data.frame(Player=character(),P_URL=character())

for (i in URLs) {
  
  WS1 <- read_html(i)
  Player <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_text() %>% as.character()
  P_URL <- WS1 %>% html_nodes("#yw1 .spielprofil_tooltip") %>% html_attr("href") %>% as.character()
  temp <- data.frame(Player,P_URL)
  Catcher1 <- rbind(Catcher1,temp)
  cat("*")
}

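Your main problem is that you are growing an object; in this case, a data frame. To fix this, create a large data frame before the loop and fill it in. Whether this is actually the bottleneck is hard to say. If length(URLs) is small, it will not make much difference.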
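The point about growing an object can be sketched without any network access. In the sketch below, `fake_scrape()` is a hypothetical stand-in for the per-URL scraping step and the URLs are made up. Because the number of rows per page is not known in advance, the sketch collects the pieces in a list with `lapply` and binds them once at the end, which is a variation on filling a preallocated data frame rather than exactly that approach.

```r
# Hypothetical stand-in for the per-URL scraping step: returns a small
# data frame whose row count varies from page to page.
fake_scrape <- function(url) {
  n <- nchar(url) %% 3 + 1
  data.frame(Player = paste0("player_", seq_len(n)),
             P_URL  = paste0(url, "/", seq_len(n)),
             stringsAsFactors = FALSE)
}

# Made-up URLs, for illustration only.
urls <- c("http://example.com/a",
          "http://example.com/bb",
          "http://example.com/ccc")

# Collect one data frame per URL in a list instead of calling rbind()
# inside the loop, then bind everything in a single step.
pieces <- lapply(urls, fake_scrape)
result <- do.call(rbind, pieces)
```

With only a handful of URLs the difference from the original loop is likely negligible, as the answer notes; the pattern mainly pays off when the list of pages is long.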
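Another possible speed-up is to run the loop in parallel, for example with parallel::parSapply. To convert the loop to a parallel version, move the body of the loop into a function (get_resource here); your code will then look something like:
parallel::parSapply(cl, URLs, get_resource)  # cl: a cluster created with parallel::makeCluster()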
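A slightly fuller sketch of the parallel version might look like the following. It assumes a hypothetical `get_resource()` that returns one data frame per URL, uses made-up URLs, and swaps in `parLapply` for `parSapply` so that the per-URL data frames come back as a plain list that can be bound afterwards.

```r
library(parallel)

# Hypothetical worker: stands in for "the loop body moved into a function".
get_resource <- function(url) {
  data.frame(Player = paste("player from", url),
             P_URL  = url,
             stringsAsFactors = FALSE)
}

# Made-up URLs, for illustration only.
urls <- c("http://example.com/a",
          "http://example.com/b",
          "http://example.com/c")

cl <- makeCluster(2)                          # two workers; adjust to the machine
pieces <- parLapply(cl, urls, get_resource)   # one data frame per URL
stopCluster(cl)                               # always release the workers

result <- do.call(rbind, pieces)
```

A real worker that calls rvest would need the package available on each worker, for example by using the `rvest::` prefix inside the function or by loading it with `clusterEvalQ()`.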

Alternatively, you could try the package.
You could try purrr instead of a loop, as follows:
require(rvest)
require(purrr)
require(tibble)

URLs %>% 
  map(read_html) %>% 
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>% 
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))

Timing:
   user  system elapsed 
  2.939   2.746   5.699 

The step that takes the most time is the crawling via map(read_html).

To parallelize it, you can use plyr's parallel backend, as follows:
require(httr)
doMC::registerDoMC(cores=3) # cores depending on your system
plyr::llply(URLs, GET, .parallel = TRUE) %>% 
  map(read_html) %>% 
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>% 
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))

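For some reason my RStudio crashed when I used plyr::llply(URLs, read_html, .parallel = TRUE), which is why I use the underlying httr::GET and parse the results in the next step via map(read_html). The scraping is therefore done in parallel, but the responses are parsed sequentially.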
Timing:
   user  system elapsed 
  2.505   0.337   2.940 

In both cases, the result looks like this:
# A tibble: 1,036 × 2
          Player                                P_URL
           <chr>                                <chr>
1   David de Gea   /david-de-gea/profil/spieler/59377
2      D. de Gea   /david-de-gea/profil/spieler/59377
3  Sergio Romero  /sergio-romero/profil/spieler/30690
4      S. Romero  /sergio-romero/profil/spieler/30690
5  Sam Johnstone /sam-johnstone/profil/spieler/110864
6   S. Johnstone /sam-johnstone/profil/spieler/110864
7    Daley Blind    /daley-blind/profil/spieler/12282
8       D. Blind    /daley-blind/profil/spieler/12282
9    Eric Bailly   /eric-bailly/profil/spieler/286384
10     E. Bailly   /eric-bailly/profil/spieler/286384
# ... with 1,026 more rows

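Thanks, some good ideas. If the CSS identifier "#yw1 .spielprofil_tooltip" were different for each piece of scraped data, would that change your choice?
In your example, all the information is based on a single CSS identifier. That is why it is so easy. If more identifiers were involved, I would most likely use a separate function that takes a document and returns the data frame I want; in your code, for example, one that takes WS1 as input and returns temp.
I'm voting to close this question as off-topic because it should be moved to the Code Review Stack Exchange.
Thanks, I've been looking into foreach and making some progress.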
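The suggestion in the comments, a separate function that takes a parsed document and returns a data frame, could be sketched roughly as below. The helper name `extract_players()`, the inline HTML, and the second selector `#yw1 .pos` are all made up for illustration; only `#yw1 .spielprofil_tooltip` comes from the question. The sketch also assumes both selectors match the same number of nodes, which a real page may not guarantee.

```r
library(rvest)

# Hypothetical helper: takes a parsed page and returns one data frame.
# "#yw1 .pos" is an invented second selector, used only to show how
# several selectors can live inside one function.
extract_players <- function(doc) {
  links <- html_nodes(doc, "#yw1 .spielprofil_tooltip")
  data.frame(Player   = html_text(links),
             P_URL    = html_attr(links, "href"),
             Position = html_text(html_nodes(doc, "#yw1 .pos")),
             stringsAsFactors = FALSE)
}

# Made-up page standing in for WS1, so the sketch runs offline.
page <- read_html(paste0(
  '<table id="yw1">',
  '<tr><td><a class="spielprofil_tooltip" href="/p/1">Player One</a></td>',
  '<td class="pos">Goalkeeper</td></tr>',
  '<tr><td><a class="spielprofil_tooltip" href="/p/2">Player Two</a></td>',
  '<td class="pos">Defender</td></tr>',
  '</table>'))

players <- extract_players(page)
```

A function like this can then be passed to `lapply()` or `purrr::map_df()` over the parsed pages, keeping the selector logic in one place.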