How to access Wikipedia from R?


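Is there any R package that allows querying Wikipedia (most probably using the Mediawiki API) to get a list of available articles relevant to the query, as well as importing selected articles for text mining?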

Use the RCurl package to retrieve the information, and the XML or RJSONIO packages to parse the response.

If you are behind a proxy, set your options:

opts <- list(
  proxy = "136.233.91.120", 
  proxyusername = "mydomain\\myusername", 
  proxypassword = 'whatever', 
  proxyport = 8080
)
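The request that produces search_example is not shown in this answer. A minimal sketch of what it might look like, assuming the MediaWiki opensearch action on the English Wikipedia endpoint. The helper name search_wikipedia, the search term, and the user-agent string are illustrative assumptions, not part of the original answer:

```r
library(RCurl)

# Hypothetical helper (not from the original answer): ask the MediaWiki
# opensearch action for article titles matching `term`.
# `curl_opts` is a list of curl options, e.g. the proxy settings above.
search_wikipedia <- function(term, curl_opts = list()) {
  # Wikimedia asks API clients to identify themselves; placeholder value.
  curl_opts$useragent <- "r-wikipedia-example/0.1"
  getForm(
    "https://en.wikipedia.org/w/api.php",
    action = "opensearch",
    search = term,
    format = "json",
    .opts  = curl_opts,
    binary = TRUE  # return a raw vector, so rawToChar() can be applied to it
  )
}

# Not run here, because it needs network access:
# search_example <- search_wikipedia("Text mining", opts)
```

If no proxy is involved, the second argument can simply be left out.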
Parse the result:

fromJSON(rawToChar(search_example))
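For the second half of the question, importing a selected article for text mining, one possible sketch is to request a plain-text extract through the query action with prop=extracts and explaintext=1. The helper name extract_article_text and the sample response are assumptions made for illustration. The response layout shown (query, then pages keyed by page id, then extract) is what this sketch expects, so check it against a real response before relying on it:

```r
library(RJSONIO)

# Hypothetical helper: pull the plain-text extract out of a JSON response to
# action=query&prop=extracts&explaintext=1&titles=...&format=json
extract_article_text <- function(json_text) {
  parsed <- fromJSON(json_text, asText = TRUE)
  pages <- parsed[["query"]][["pages"]]
  # `pages` is keyed by page id; take the first (here: only) entry
  pages[[1]][["extract"]]
}

# Canned response with the assumed layout, so the example runs offline.
sample_response <- '{"query":{"pages":{"123":{"pageid":123,"ns":0,"title":"Example","extract":"Example text."}}}}'
article_text <- extract_article_text(sample_response)

# With RCurl, a live response could be fetched along these lines (not run):
# raw_json <- getForm("https://en.wikipedia.org/w/api.php", action = "query",
#                     prop = "extracts", explaintext = "1",
#                     titles = "Text mining", format = "json", binary = TRUE)
# article_text <- extract_article_text(rawToChar(raw_json))
```

The resulting character string can then be passed to whatever text-mining tooling you prefer.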
There is also WikipediR, a "MediaWiki API wrapper in R":

library(devtools)
install_github("Ironholds/WikipediR")
library(WikipediR)
It includes these functions:

ls("package:WikipediR")
 [1] "wiki_catpages"      "wiki_con"           "wiki_diff"          "wiki_page"         
 [5] "wiki_pagecats"      "wiki_recentchanges" "wiki_revision"      "wiki_timestamp"    
 [9] "wiki_usercontribs"  "wiki_userinfo"  
Here it is in use, getting contribution details and user details for a set of users:

library(RCurl)
library(XML)

# scrape page to get usernames of users with highest numbers of edits
top_editors_page <- "http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits"
top_editors_table <- readHTMLTable(top_editors_page)
very_top_editors <- as.character(top_editors_table[[3]][1:5,]$User)

# setup connection to wikimedia project 
con <- wiki_con("en", project = c("wikipedia"))

# connect to API and get last 50 edits per user
user_data <- lapply(very_top_editors,  function(i) wiki_usercontribs(con, i) )
# and get information about the users (registration date, gender, editcount, etc)
user_info <- lapply(very_top_editors,  function(i) wiki_userinfo(con, i) )

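I have had problems with some search terms when using the RCurl approach, but I suspect that this is an issue with the network I am on. I need volunteers to check the example code with different strings in the search argument.

Another great new possibility is the wikifacts package (on CRAN):
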
library(wikifacts)
wiki_define('R (programming language)')
## R (programming language) 
## "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of April 2021, R ranks 16th in the TIOBE index, a measure of popularity of programming languages.The official R software environment is a GNU package.\nIt is written primarily in C, Fortran, and R itself (thus, it is partially self-hosting) and is freely available under the GNU General Public License. Pre-compiled executables are provided for various operating systems."