R reads a dataset from a website as character instead of numeric
Using the code below, I read data from a website. The problem is that the data are read as character rather than numeric, in particular columns such as "Enlem(N)" and "Boylam(E)". How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)
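If you know which specific columns should be numeric, you can convert just those columns. If you do not, you can write a function that inspects the data and converts a column to numeric when a large enough share of its non-missing values can be parsed as numbers. I use the function below for this purpose: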
NumericColumns <- function(x, AllowedPercentNumeric = .95, PreserveDate = TRUE, PreserveColumns) {
  # find the counts of NA values in input data frame's columns
  param_originalNA <- apply(x, 2, function(z) {sum(is.na(z))})
  # blindly coerce data.frame to numeric
  df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
  # Percent Non-NA values in each column
  PercentNumeric <- (apply(df_JustNumbers, 2, function(x) sum(!is.na(x)))) / (nrow(x) - param_originalNA)
  rm(param_originalNA)
  # identify columns which have a greater than or equal percentage of numeric as specified
  param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
  # Remove columns from list to convert to numeric that are specified as to preserve
  if (!missing(PreserveColumns)) {param_numeric <- setdiff(param_numeric, PreserveColumns)}
  # Identify columns that are dates initially
  IsDateColumns <- lapply(x, function(y) (is(y, "Date") | is(y, "POSIXct")))
  param_dates <- names(IsDateColumns)[IsDateColumns == TRUE]
  # Remove dates from list if specified to preserve dates
  if (PreserveDate) {param_numeric <- setdiff(param_numeric, param_dates)}
  # returns column position of numeric columns in target data frame
  param_numeric <- match(param_numeric, colnames(df_JustNumbers))
  # removes NA's from column list
  param_numeric <- param_numeric[complete.cases(param_numeric)]
  # coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
  if (length(param_numeric) == 1) {
    suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
  }
  if (length(param_numeric) > 1) {
    suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric], 2, function(x) as.numeric(x)))
  }
  return(x)
}
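# Use function to convert to numeric
dat <- NumericColumns(dat)
Can't you just call as.numeric on the columns (e.g. dat$`Enlem(N)`)? Everything appears to have been read in as character because, when the table is first read, the first 7 rows contain column descriptions, which are only removed afterwards.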
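When the numeric columns are known in advance, a direct conversion may be enough. The sketch below reproduces the situation on a small, made-up fixed-width text rather than the live page. The sample values, the three-column layout, and the names `txt`, `raw`, `fixed` and `auto` are all assumptions for illustration. It shows that header rows mixed into the data force every column to character, then converts the two coordinate columns with `as.numeric`. It also tries base R's `type.convert()` as an alternative, assuming an R version in which `type.convert()` accepts a data frame.

```r
# Hypothetical fixed-width text standing in for the scraped <pre> block.
txt <- c(
  "Tarih      Enlem(N)  Boylam(E) ",
  "---------- --------  --------- ",
  "2024.01.01 38.1234   27.4321   ",
  "2024.01.02 39.5678   28.8765   "
)
widths <- c(11, 10, 10)

con <- textConnection(txt)
raw <- read.fwf(con, widths = widths, stringsAsFactors = FALSE)
close(con)

# The header and separator rows are still part of the data here,
# so every column is read as character.
dat <- raw[-(1:2), ]
names(dat) <- trimws(unlist(raw[1, ], use.names = FALSE))
rownames(dat) <- NULL

# Fixed-width fields keep their padding; strip it from every column.
dat[] <- lapply(dat, trimws)

# Option 1: convert only the columns known to be numeric.
num_cols <- c("Enlem(N)", "Boylam(E)")
fixed <- dat
fixed[num_cols] <- lapply(fixed[num_cols], as.numeric)

# Option 2: let base R guess a type for each column; as.is = TRUE keeps
# non-numeric columns as character instead of turning them into factors.
auto <- type.convert(dat, as.is = TRUE)

str(fixed)
str(auto)
```

On the real table the column names taken from the header row may also carry padding, so trimming them before indexing by name (as done here with `trimws`) is likely to be needed.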