Alternative to a for loop needed to optimize the speed of a working script

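I already have this working, but I would like to optimize it. Extracting the article data takes a long time because my approach uses a for loop. I have to go row by row, and each row takes a little over a second. In my actual dataset, however, I have about 10,000 rows, so this takes a very long time. Is there a way to extract the full text other than a for loop? I apply the same method to every row, so I am wondering whether R has a function that works like multiplying a column by a number, which is very fast.

Create a dummy dataset: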

date<- as.Date(c('2020-06-25', '2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25','2020-06-25'))

text <- c('Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays', 
      'GMRC now a law; to be integrated in school curriculum',
      'QC to impose stringent measures to screen applicants for PWD ID',
      '‘Baka kalaban ka:’ Cops intimidate dzBB reporter',
      'Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so',
      'PNP records highest single-day COVID-19 tally as cases rise to 579',
      'IBP tells new lawyers: ‘Excel without sacrificing honor’',
      'Senators express concern over DepEd’s preparedness for upcoming school year',
      'Angara calls for probe into reported spread of ‘fake’ PWD IDs',
      'Grab PH eyes new scheme to protect food couriers vs no-show customers')
link<- c('https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays',  
     'https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum',                           
     'https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id',                 
     'https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter',                                  
     'https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so',
     'https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579',             
     'https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor',                         
     'https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year',                      
     'https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids',                   
     'https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers')

df<-data.frame(date, text, link)
df
         date                                                                         text                                                 link
1  2020-06-25 Isko cites importance of wearing face mask, gives 10K pieces to 30 barangays   https://newsinfo.inquirer.net/1297621/isko-cites-importance-of-wearing-face-mask-gives-10k-pieces-to-30-barangays
2  2020-06-25                        GMRC now a law; to be integrated in school curriculum   https://newsinfo.inquirer.net/1297618/gmrc-now-a-law-to-be-integrated-in-school-curriculum
3  2020-06-25              QC to impose stringent measures to screen applicants for PWD ID   https://newsinfo.inquirer.net/1297614/qc-to-impose-stringent-measures-to-screen-applicants-for-pwd-id
4  2020-06-25                             ‘Baka kalaban ka:’ Cops intimidate dzBB reporter   https://newsinfo.inquirer.net/1297606/baka-kalaban-ka-cops-intimidate-dzbb-reporter
5  2020-06-25      Is gov’t playing with traditional jeepney drivers? A lawmaker thinks so   https://newsinfo.inquirer.net/1297582/is-govt-playing-with-traditional-jeepney-drivers-a-party-list-lawmaker-thinks-so
6  2020-06-25           PNP records highest single-day COVID-19 tally as cases rise to 579   https://newsinfo.inquirer.net/1297577/pnp-records-highest-single-day-covid-19-tally-as-cases-rose-to-579
7  2020-06-25                     IBP tells new lawyers: ‘Excel without sacrificing honor’   https://newsinfo.inquirer.net/1297562/ibp-tells-new-lawyers-excel-without-sacrificing-honor
8  2020-06-25  Senators express concern over DepEd’s preparedness for upcoming school year   https://newsinfo.inquirer.net/1297559/senators-on-depeds-preparedness-for-upcoming-school-year
9  2020-06-25                Angara calls for probe into reported spread of ‘fake’ PWD IDs   https://newsinfo.inquirer.net/1297566/angara-calls-for-probe-into-reported-spread-of-fake-pwd-ids
10 2020-06-25        Grab PH eyes new scheme to protect food couriers vs no-show customers   https://newsinfo.inquirer.net/1297553/grab-ph-eyes-new-scheme-to-protect-food-couriers-vs-no-show-customers
Code to get the article data for each link:

library(rvest)  # read_html(), html_nodes(), html_text()
library(dplyr)  # %>% and tibble()

now<-Sys.time()
for(i in 1:nrow(df)) {
  test_article<- read_html(df[i, 3]) %>% 
    html_nodes(".article_align div p") %>% 
    html_text() %>%
    toString() 

  text_df <- tibble(test_article)
  df[i,4]<-test_article
  print(paste(i,"/",nrow(df), sep = ""))
}
finish<-Sys.time()
finish-now

You can parallelize the loop:

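library(foreach)
library(doParallel)  # also attaches 'parallel' (detectCores, makeCluster)

# set up a parallel backend that uses several cores
cores <- detectCores()
cl <- makeCluster(cores[1] - 1)  # leave one core free so the machine is not overloaded
registerDoParallel(cl)
now <- Sys.time()
# .packages must be a character vector; the workers load these packages themselves
result <- foreach(i = 1:nrow(df), .combine = rbind, .packages = c('dplyr', 'rvest')) %dopar% {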
  test_article <- read_html(df[i, 3]) %>% 
    html_nodes(".article_align div p") %>% 
    html_text() %>%
    toString() 
  
  data.frame( test_article = test_article, ID = paste(i,"-",nrow(df), sep = ""))
  }

finish<-Sys.time()
finish-now
#stop cluster
stopCluster(cl)
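
With the default `.inorder = TRUE`, `foreach` returns its results in the same order as the input rows, so `result` can be attached back onto `df` as a new column. The following is only a minimal sketch with small stand-in data frames; the placeholder values and the name `new_df` are assumptions for illustration, not output from the real site:

```r
# Stand-in for the original data frame (placeholder links, no network access)
df <- data.frame(link = c("link-1", "link-2", "link-3"),
                 stringsAsFactors = FALSE)

# Stand-in for what the foreach() call returns: one row per link, in input order
result <- data.frame(test_article = c("text 1", "text 2", "text 3"),
                     ID = paste(1:3, "-", nrow(df), sep = ""),
                     stringsAsFactors = FALSE)

# Guard against a partial result before attaching it
stopifnot(nrow(result) == nrow(df))

# Attach the scraped text as a new column (new_df is a hypothetical name)
new_df <- df
new_df$article <- result$test_article
print(new_df)
```

The row-count check matters because a mismatch would otherwise pair article text with the wrong link.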
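
I am not sure the for loop is what causes the delay: replacing it would save milliseconds, not seconds. You could try parallel processing to send several requests at the same time.

Even with 10,000 rows, a different approach would only save a few milliseconds?

I suspect it is the HTML request that takes the time, not the loop itself. To save time, you first need to find a solution for the part of your code that takes the longest.

Is there a way to apply the HTML request to all rows at once, or does it have to be a for loop that goes one row at a time? That is basically what I am asking.

That is what I was trying to say in my first comment: you need to do it that way.

This is great, thanks! Then I would put the results together.

Thanks for your help, it looks like it runs in half the time.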
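
The point made in the comments, that the waiting time per request dominates and the loop construct itself costs almost nothing, can be checked with a small timing sketch. Here `fake_fetch` is a hypothetical stand-in for one `read_html()` call, the 0.05 second delay is an assumed value, and the URLs are placeholders:

```r
# Hypothetical stand-in for one HTTP request plus parsing (assumed 0.05 s delay)
fake_fetch <- function(url) {
  Sys.sleep(0.05)
  paste("article text for", url)
}

urls <- paste0("https://example.com/article/", 1:10)  # placeholder URLs

# Time an explicit for loop
loop_time <- system.time({
  out_loop <- character(length(urls))
  for (i in seq_along(urls)) out_loop[i] <- fake_fetch(urls[i])
})[["elapsed"]]

# Time the same work through vapply
apply_time <- system.time({
  out_apply <- vapply(urls, fake_fetch, character(1), USE.NAMES = FALSE)
})[["elapsed"]]

# Both should take roughly length(urls) * 0.05 seconds
print(c(loop = loop_time, vapply = apply_time))
```

Because both versions wait for each simulated request in turn, switching from `for` to an apply-style function should not change the elapsed time much; only issuing requests concurrently, as in the parallel answer, reduces it.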