Filtering my R data frame causes the data frame to sort incorrectly


Consider the following two code snippets:

Can someone explain why the numbers change when I apply as.numeric?

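The real reason you get a different result in the second case is that the data set has some footer notes, and read.csv reads those as well. Because the footer contains character elements, most columns end up with class "factor". You can avoid this in either of two ways:

Use the skip and nrows arguments of read.csv so the last few lines are never read (skip drops leading lines, nrows caps how many rows are read). Or use stringsAsFactors=FALSE in the read.csv call, together with skipping those lines. In your case the column was sorted according to the levels of the factor.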
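A minimal sketch of both options, assuming a made-up three-line CSV (passed through the text argument) in place of the real GDP file. The country codes, the footer wording, and the variable names are placeholders for illustration, not taken from the actual data.

```r
# Hypothetical miniature of the GDP file: two data rows, then a footer note
csv_text <- paste(
  "USA,1",
  "CHN,2",
  ",Note: rankings include only confirmed estimates",
  sep = "\n"
)

# Option 1: stop reading before the footer with nrows
by_nrows <- read.csv(text = csv_text, header = FALSE, nrows = 2)

# Option 2: read text columns as character, then keep only rows whose
# rank is made of digits and convert that column to numeric
as_text <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
ranked <- as_text[grepl("^[0-9]+$", as_text$V2), ]
ranked$V2 <- as.numeric(ranked$V2)
```

With option 1 the rank column is numeric from the start; with option 2 it arrives as character, so the conversion is safe once the footer row is dropped.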

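If you have already read the file without skipping those lines, convert the columns to the appropriate class. For a column that should be numeric, convert it with as.numeric(as.character(df$column)) or as.numeric(levels(df$column))[df$column].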
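The difference between the conversions can be seen on a small factor. This is only a sketch: x is a hypothetical vector standing in for a column such as mergedData$V2.

```r
# Hypothetical factor holding ranks that were read as text
x <- factor(c("161", "105", "60", "2"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(x)

# Both of these recover the numbers that the labels spell out
via_character <- as.numeric(as.character(x))
via_levels <- as.numeric(levels(x))[x]

print(codes)
print(via_character)
```

The level codes are positions in the sorted set of labels, which is why the values in the question change after as.numeric is applied directly to the factor.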

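It would be better to show a small sample of the data and apply your code to it so the problem is visible. I doubt people will want to download the data. Please check whether you have factor columns. Since you subset in the first case and did not drop the levels, the sorting may be based on the levels of the factor variable (not tested).

@RichardScriven, I have added the csv files as paste links. I hope that addresses the security concern, or do you think there are other reasons people would not download the files?

@akrun, gdp$V1 was indeed a factor variable in case B before I converted the column to numeric. However, once I convert the column to numeric, shouldn't it behave like any other number?

@EricBaldwin I looked at both cases. In the second one, if you look at str(gdp), every column except V3 is a factor, whereas in the first case the factor columns are V1, V4, and V6. You skipped some header lines; in the same way, there are some footer lines that need to be skipped. Just check tail(gdp[2]).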
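The checks suggested in the comments can be sketched on a toy data frame. gdp_demo below is a hypothetical stand-in for gdp read without nrows; its values and the footer text are invented for illustration.

```r
# Hypothetical stand-in for gdp read without nrows: the last row is a footer
gdp_demo <- data.frame(
  V1 = c("USA", "CHN", ""),
  V2 = c("1", "2", "Note: rankings include only confirmed estimates"),
  stringsAsFactors = TRUE
)

str(gdp_demo)                                   # reports the class of each column
factor_cols <- names(gdp_demo)[sapply(gdp_demo, is.factor)]
print(tail(gdp_demo[2], 1))                     # the footer shows up at the bottom

# Subsetting keeps the unused footer level; droplevels() removes it
ranked <- gdp_demo[1:2, ]
levels_before <- nlevels(ranked$V2)
levels_after <- nlevels(droplevels(ranked)$V2)
```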

# Snippet A: nrows = 190 stops reading before the footer notes
library(dplyr)  # assumed source of arrange(); the original does not show which package was loaded

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl")
gdp <- read.csv("./data/gdp.csv", header = FALSE, skip = 5, nrows = 190) # Specify nrows, get correct answer

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl")
education <- read.csv("./data/education.csv")

mergedData <- merge(gdp, education, by.x = "V1", by.y = "CountryCode")
# No need to remove unranked countries because we specified nrows
# No need to convert V2 from factor to numeric
sortedMergedData <- arrange(mergedData, -V2)
sortedMergedData[13, 1] # Get KNA, correct answer

# Snippet B: without nrows the footer notes are read too, so V2 becomes a factor
library(dplyr)  # assumed source of arrange(); the original does not show which package was loaded

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FGDP.csv", destfile = "./data/gdp.csv", method = "curl")
gdp <- read.csv("./data/gdp.csv", header = FALSE, skip = 5) # Don't specify nrows, get incorrect answer

download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FEDSTATS_Country.csv", destfile = "./data/education.csv", method = "curl")
education <- read.csv("./data/education.csv")

mergedData <- merge(gdp, education, by.x = "V1", by.y = "CountryCode")
mergedData <- mergedData[which(mergedData$V2 != ""), ] # Remove unranked countries
mergedData$V2 <- as.numeric(mergedData$V2) # Intended to make V2 numeric; on a factor this returns level codes
sortedMergedData <- arrange(mergedData, -V2)
sortedMergedData[13, 1] # Get SRB, incorrect answer

> mergedData$V2
  [1] 161 105 60  125 32  26  133 172 12  27  68  162 25  140 128 59  76  93 
 [19] 138 111 69  169 149 96  7   153 113 167 117 165 11  20  36  2   99  98 
 [37] 121 30  182 166 81  67  102 51  4   183 33  72  48  64  38  159 13  103
 [55] 85  43  155 5   185 109 6   114 86  148 175 176 110 42  178 77  160 37 
 [73] 108 71  139 58  16  10  46  22  47  122 40  9   116 92  3   50  87  145
 [91] 120 189 178 15  146 56  136 83  168 171 70  163 84  74  94  82  62  147
[109] 141 132 164 14  188 135 129 137 151 130 118 154 127 152 34  123 144 39 
[127] 126 18  23  107 55  66  44  89  49  41  187 115 24  61  45  97  54  52 
[145] 8   142 19  73  119 35  174 157 100 88  186 150 63  80  21  158 173 65 
[163] 124 156 31  143 91  170 184 101 79  17  190 95  106 53  78  1   75  180
[181] 29  57  177 181 90  28  112 104 134
194 Levels:  .. Not available.   1 10 100 101 102 103 104 105 106 107 ... Note: Rankings include only those economies with confirmed GDP estimates. Figures in italics are for 2011 or 2010.
> mergedData$V2 = as.numeric(mergedData$V2)
> mergedData$V2
  [1]  72  10 149  32 118 111  41  84  26 112 157  73 110  49  35 147 166 185
 [19]  46  17 158  80  58 188 159  63  19  78  23  76  15 105 122 104 191 190
 [37]  28 116  94  77 172 156   7 139 126  95 119 162 135 153 124  69  37   8
 [55] 176 130  65 137  97  14 148  20 177  57  87  88  16 129  90 167  71 123
 [73]  13 161  47 146  70   4 133 107 134  29 127 181  22 184 115 138 178  54
 [91]  27 101  90  59  55 144  44 174  79  83 160  74 175 164 186 173 151  56
[109]  50  40  75  48 100  43  36  45  61  38  24  64  34  62 120  30  53 125
[127]  33  91 108  12 143 155 131 180 136 128  99  21 109 150 132 189 142 140
[145] 170  51 102 163  25 121  86  67   5 179  98  60 152 171 106  68  85 154
[163]  31  66 117  52 183  82  96   6 169  81 103 187  11 141 168   3 165  92
[181] 114 145  89  93 182 113  18   9  42