How do I extract elements from a web log to form a data.frame? (R)

I have a web log of about a million lines, and I want to extract the date, time, and status from each line to form a new data.frame.

       V1
       2013-08-27 16:00:01 117.79.149.2 GET 200 0 0
       2013-08-27 16:00:02 117.79.149.2 GET 404 0 0
       2013-08-27 16:00:03 117.79.149.2 GET 200 0 0
       2013-08-27 16:00:04 117.79.149.2 GET 404 0 0

to become:

       Date_Time              Status
       2013-08-27 16:00:01    200
       2013-08-27 16:00:02    404
       2013-08-27 16:00:03    200
       2013-08-27 16:00:04    404

I know how to extract the required elements from a single line x with the following code:

       temp<-unlist(strsplit(x," "))
       Date_Time<-paste(temp[1],temp[2])
       Status<-temp[5]

You can use sapply:
example <- c("asdf asdwer dsf cswe asd","asfdw ewr cswe sdf wers")  
split.example <- strsplit(example," ")
example.2 <- sapply(split.example,"[[",2)

Using the dat provided by @Sven, the question can be answered in full:
temp <- strsplit(as.character(dat$V1)," ")
new.df <- data.frame(Date_Time = paste(sapply(temp,"[[",1),
                                       sapply(temp,"[[",2)),
                     Status = sapply(temp,"[[",5))

> new.df
            Date_Time Status
1 2013-08-27 16:00:01    200
2 2013-08-27 16:00:02    404
3 2013-08-27 16:00:03    200
4 2013-08-27 16:00:04    404
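Since the log has about a million lines, it may be worth splitting once with fixed = TRUE and extracting each field with vapply. The following is only a sketch under assumptions: parse_log is a hypothetical helper name (not from the answers above), and each line is assumed to be space-separated with the status in the fifth field. Lines with fewer fields yield NA instead of an error, because `[` is used rather than `[[`.

```r
# Hypothetical helper: parse_log() is an illustration, not part of the original answers.
parse_log <- function(lines) {
  # fixed = TRUE treats " " as a literal separator instead of a regular expression
  parts <- strsplit(as.character(lines), " ", fixed = TRUE)
  # vapply enforces that every extraction returns exactly one string;
  # `[` returns NA when a line has fewer than i fields
  field <- function(i) vapply(parts, `[`, character(1), i)
  data.frame(Date_Time = paste(field(1), field(2)),
             Status    = as.integer(field(5)),
             stringsAsFactors = FALSE)
}

# Sample lines copied from the question
lines <- c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
           "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0")
res <- parse_log(lines)
print(res)
```

Storing Status as an integer is a design choice made here for easier filtering (for example res[res$Status == 404, ]); keep it as character if the codes are only used as labels.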

A regex-based solution:
with(dat, data.frame(Date_Time = gsub("(.*\\:[0-9]+) .*", "\\1", V1),
                     Status = gsub(".*T ([0-9]+) .*", "\\1", V1)))

#             Date_Time Status
# 1 2013-08-27 16:00:01    200
# 2 2013-08-27 16:00:02    404
# 3 2013-08-27 16:00:03    200
# 4 2013-08-27 16:00:04    404


where dat is your data frame:
dat <- data.frame(V1 = readLines(
  textConnection("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0
2013-08-27 16:00:02 117.79.149.2 GET 404 0 0
2013-08-27 16:00:03 117.79.149.2 GET 200 0 0
2013-08-27 16:00:04 117.79.149.2 GET 404 0 0")))
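For records with more fields, one pattern with a capture group per wanted field may be easier to maintain than several gsub calls. Below is a minimal sketch using utils::strcapture (available in R 3.4.0 and later). The pattern is an assumption about the log layout shown in the question (date, time, IP, upper-case method, status), and the names x, pattern, and proto are chosen only for illustration.

```r
# Sample lines copied from the question
x <- c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
       "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0")

# Assumed layout: date time ip METHOD status ...
# Group 1 captures "date time", group 2 captures the status code.
pattern <- "^([0-9-]+ [0-9:]+) [^ ]+ [A-Z]+ ([0-9]+) "

# proto gives the column names and types of the result
proto <- data.frame(Date_Time = character(),
                    Status    = integer(),
                    stringsAsFactors = FALSE)

# Lines that do not match the pattern produce a row of NA values
res <- strcapture(pattern, x, proto)
print(res)
```

Unlike the ".*T " pattern above, this version does not rely on the request method ending in the letter T.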

If you made your code more readable, your answer might be more helpful to others.

@SESman, thank you very much! I think this method works for simple structures like the one in my example. In reality, each log record is more complex and there are more files, so substring does not fit this case. Is there a more flexible and efficient approach than substring?

You're welcome. I'm not familiar with your data. substring and strsplit usually suit my needs. I would say regular expressions are the most flexible tool, but beyond the basics the syntax makes my head spin. In the example you posted there is only one column separator (" "), so read.table on the log file would be enough. Otherwise, you could look at the scanf and fscanf utilities, which let you specify how each line should be interpreted, character by character. Unfortunately, I don't know of such a function in R (I use bash or MATLAB).

Could you help me solve this? Thanks in advance!
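The read.table suggestion from the comments can be sketched as follows. This assumes every record uses a single space as the separator and has the same number of fields, as in the posted sample; the text argument is used here only so the example runs without a file, and with a real log you would pass the file path instead.

```r
# Sample lines copied from the question; with a real log, pass the file path
# to read.table() instead of using text =
lines <- c("2013-08-27 16:00:01 117.79.149.2 GET 200 0 0",
           "2013-08-27 16:00:02 117.79.149.2 GET 404 0 0")

# read.table splits on whitespace by default and names the columns V1, V2, ...
raw <- read.table(text = lines, header = FALSE, stringsAsFactors = FALSE)

# V1 is the date, V2 the time, V5 the status code
new.df <- data.frame(Date_Time = paste(raw$V1, raw$V2),
                     Status    = raw$V5,
                     stringsAsFactors = FALSE)
print(new.df)
```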