Splitting a variable name into separate columns in R


I want to parse some perfmon (Windows performance log) data.

Typically, a set of column names looks like this:

> colnames(p)
[1] "Time"                                                         
[2] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length"      
[3] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length" 
[4] "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length"
[5] "\\\\testdb1\\Processor(_Total)\\% Processor Time"             
[6] "\\\\testdb1\\System\\Processes"                               
[7] "\\\\testdb1\\System\\Processor Queue Length"

This is how I read the data into R:

p <- read.csv("r-perfmon.csv",stringsAsFactors = FALSE, check.names = FALSE)
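
A note on check.names = FALSE in the call above: with the default (check.names = TRUE), read.csv adjusts the header with make.names, so the backslashes and parentheses that the names are later split on would be lost. A minimal sketch of that effect, using one counter name from the listing above (raw and mangled are illustrative variable names, not part of the original code):

```r
# One counter name as it appears in the perfmon CSV header
raw <- "\\\\testdb1\\System\\Processes"

# What the default check.names = TRUE would do to it:
# read.csv runs the header through make.names()
mangled <- make.names(raw)

print(raw)
print(mangled)

# TRUE for the raw name, FALSE for the mangled one:
# the backslash separators do not survive make.names()
has_backslash <- c(raw = grepl("\\", raw, fixed = TRUE),
                   mangled = grepl("\\", mangled, fixed = TRUE))
print(has_backslash)
```

Keeping the raw names is what makes the later split on backslashes and parentheses possible.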

I would like to be able to parse the column names and then melt the data.

So, taking one column of data as an example:

> example <- p[2]
> head(example)
  \\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length
1                                             0.040037563
2                                             0.009740260
3                                             0.011009828
4                                             0.006016244
5                                             0.015125328
6                                             0.002814141

Edit: as requested, here is the dput of the head of my data:

structure(list(`(PDH-CSV 4.0) (GMT Daylight Time)(-60)` = c("04/15/2013 00:00:19.279", 
"04/15/2013 00:00:34.279", "04/15/2013 00:00:49.275", "04/15/2013 00:01:04.284", 
"04/15/2013 00:01:19.279", "04/15/2013 00:01:34.275"), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length` = c(0.040037563, 
0.00974026, 0.011009828, 0.006016244, 0.015125328, 0.002814141
), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length` = c(0.001421333, 
0, 0.000206726, 0, 0.001894, 0), `\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length` = c(0.03861623, 
0.00974026, 0.010803102, 0.006016244, 0.013231327, 0.002814141
), `\\\\testdb1\\Processor(_Total)\\% Processor Time` = c(29.56933862, 
10.85699395, 7.733924001, 1.910202013, 6.164864178, 1.351882837
), `\\\\testdb1\\System\\Processes` = c(86L, 86L, 81L, 81L, 81L, 
81L), `\\\\testdb1\\System\\Processor Queue Length` = c(0L, 0L, 0L, 
0L, 0L, 0L)), .Names = c("(PDH-CSV 4.0) (GMT Daylight Time)(-60)", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length", "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length", 
"\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length", 
"\\\\testdb1\\Processor(_Total)\\% Processor Time", "\\\\testdb1\\System\\Processes", 
"\\\\testdb1\\System\\Processor Queue Length"), row.names = c(NA, 
6L), class = "data.frame")

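It is a little hard to know what the final data should look like: if each column name is split on backslashes or parentheses, the number of resulting columns will vary from one input column to the next.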
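
To make the varying column count concrete, here is a small sketch that counts the pieces each counter name from the question yields when split on a backslash or either parenthesis. The variable names nms, parts and n_parts are only illustrative.

```r
# Column names from the question (the leading time column is left out)
nms <- c(
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length",
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Read Queue Length",
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Write Queue Length",
  "\\\\testdb1\\Processor(_Total)\\% Processor Time",
  "\\\\testdb1\\System\\Processes",
  "\\\\testdb1\\System\\Processor Queue Length"
)

# Split on a backslash or either parenthesis, then drop the empty strings
parts <- lapply(strsplit(nms, "\\\\|\\)|\\("), function(s) s[nzchar(s)])

# Number of non-empty pieces per column name
n_parts <- lengths(parts)
print(n_parts)
```

The PhysicalDisk and Processor counters carry an instance in parentheses and so give one more piece than the System counters, which is why a single fixed set of columns does not fall out automatically.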

So I split each column into a separate list element. If the data.frame from the dput is called d:

# Look at second column - then all you need to do is tweak the names
s <- strsplit(colnames(d)[2], "\\\\|\\)|\\(")[[1]]
data.frame(time = d[[1]], t(s[nzchar(s)]), value=d[[2]])

                     time      X1           X2   X3                     X4       value
1 04/15/2013 00:00:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.040037563
2 04/15/2013 00:00:34.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.009740260
3 04/15/2013 00:00:49.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.011009828
4 04/15/2013 00:01:04.284 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.006016244
5 04/15/2013 00:01:19.279 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.015125328
6 04/15/2013 00:01:34.275 testdb1 PhysicalDisk 0 C: Avg. Disk Queue Length 0.002814141

Again, you would need to rename these columns.

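First use reshape in R to reshape the data into long format, and then use strsplit on the resulting column names. If you want others to be able to reproduce your data, you also need to dput your data.

I have the long format

With p in wide format, you can change one column at a time... but I am not sure what the final dataset would look like. But for your example...

As requested, I have included the dput of the head of my data...

Thanks for the edit, but are you able to copy the dput from the question into a new R session? It gives me an error, because there is only a single backslash.

Thank you, this is a good starting point. The result of this code seems to be a data frame with many columns: time, x1, x2, time.1, x1.1, x2.1, time.2, x1.2 and so on. I would like the data to have only the columns time, x1, x2. Does that make sense?

Hi Gauss, I am not quite sure what the expected output should look like; when I run it on the second column, it matches the expected result in your question. Since the column names split into different numbers of parts, I am not sure how, or whether, you want to combine them. Can you indicate how you would like to combine the output from multiple columns? Perhaps create three data frames, one each for PhysicalDisk, Processor and System, and bind the common columns together?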
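
The suggestion above to reshape to long format first and then split the names could be sketched roughly as follows. This is only one possible reading of it: the stacking is done by hand in base R rather than with reshape() or a melt function, the data is a two-row, three-counter subset of the dput in the question, and the column labels host, object, instance and counter are assumptions chosen for illustration.

```r
# Small subset of the question's data: first two rows, three counters
d <- data.frame(
  time = c("04/15/2013 00:00:19.279", "04/15/2013 00:00:34.279"),
  v1 = c(0.040037563, 0.00974026),
  v2 = c(29.56933862, 10.85699395),
  v3 = c(86, 86),
  stringsAsFactors = FALSE
)
names(d)[-1] <- c(
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length",
  "\\\\testdb1\\Processor(_Total)\\% Processor Time",
  "\\\\testdb1\\System\\Processes"
)

# Stack the counter columns into long format: one row per time and counter
counters <- names(d)[-1]
long <- data.frame(
  time  = rep(d$time, times = length(counters)),
  name  = rep(counters, each = nrow(d)),
  value = unlist(d[counters], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Split each name on a backslash or parenthesis and drop empty pieces
pieces <- lapply(strsplit(long$name, "\\\\|\\)|\\("), function(s) s[nzchar(s)])

# Assumed layout: host, object, optional instance, counter (always last)
long$host     <- vapply(pieces, `[`, character(1), 1)
long$object   <- vapply(pieces, `[`, character(1), 2)
long$instance <- vapply(pieces, function(s) {
  if (length(s) == 4) s[3] else NA_character_
}, character(1))
long$counter  <- vapply(pieces, function(s) s[length(s)], character(1))

print(long[, c("time", "host", "object", "instance", "counter", "value")])
```

Names without a parenthesised instance, such as the System counters, get NA in the instance column, so every row ends up with the same set of columns.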

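# Apply the same split to every column except the first (the time column);
# the result is a list with one data.frame per counter column
lapply(seq_along(colnames(d))[-1], function(i) {
  s <- strsplit(colnames(d)[[i]], "\\\\|\\)|\\(")[[1]]
  data.frame(time = d[[1]], t(s[nzchar(s)]), value = d[[i]])
})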
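
If a single data frame with a fixed set of columns is wanted, as asked in the comments, one possible approach is to pad the shorter names before binding the per-column results together. The sketch below is an assumption about the desired output, not something confirmed in the thread: split_name is a hypothetical helper, the labels host, object, instance and counter are made up for illustration, and the data is a two-row, two-counter subset of the dput.

```r
# Two rows and two counters taken from the question's dput
d <- data.frame(
  time = c("04/15/2013 00:00:19.279", "04/15/2013 00:00:34.279"),
  v1 = c(0.040037563, 0.00974026),
  v2 = c(86, 86),
  stringsAsFactors = FALSE
)
names(d)[-1] <- c(
  "\\\\testdb1\\PhysicalDisk(0 C:)\\Avg. Disk Queue Length",
  "\\\\testdb1\\System\\Processes"
)

# Hypothetical helper: split one name into exactly four fields,
# inserting NA where a three-piece name has no instance
split_name <- function(nm) {
  s <- strsplit(nm, "\\\\|\\)|\\(")[[1]]
  s <- s[nzchar(s)]
  if (length(s) == 3) s <- c(s[1:2], NA_character_, s[3])
  stats::setNames(s, c("host", "object", "instance", "counter"))
}

# One data.frame per counter column, all with identical column names
frames <- lapply(seq_along(d)[-1], function(i) {
  data.frame(time = d[[1]], t(split_name(names(d)[i])), value = d[[i]],
             stringsAsFactors = FALSE)
})

# Stack them into a single long data.frame
combined <- do.call(rbind, frames)
print(combined)
```

Because every piece has the same column names, rbind can stack them directly instead of producing time.1, X1.1 and similar suffixed columns.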