Parse error: "trailing garbage" while trying to parse a JSON column in a data frame
I have a log file like the one below. It is a text document that looks like:
Id,Date,Level,Message
35054,2016-06-17 19:29:43 +0000,INFO,"{
""id"": -2,
""ipAddress"": ""100.100.100.100"",
""howYouHearAboutUs"": null,
""isInterestedInOffer"": true,
""incomeRange"": 60000,
""isEmailConfirmed"": false
}"
35055,2016-06-17 19:36:38 +0000,INFO,"{
""id"": -1,
""firstName"": ""John"",
""lastName"": ""Smith"",
""email"": ""john.smith@gmail.com"",
""city"": ""Smalltown"",
""incomeRange"": 1,
""birthDate"": ""1999-12-10T05:00:00Z"",
""password"": ""*********"",
""agreeToTermsOfUse"": true,
""howYouHearAboutUs"": ""Radio"",
""isInterestedInOffer"": false
}"
35059,2016-07-19 19:52:08 +0000,INFO,"{
""id"": -3,
""visitUrl"": ""https://www.website.com/?purpose=X"",
""ipAddress"": ""100.200.300.400"",
""howYouHearAboutUs"": null,
""isInterestedInOffer"": true,
""incomeRange"": 100000,
""isEmailConfirmed"": true,
""isIdentityConfirmed"": false,
""agreeToTermsOfUse"": true,
""validationResults"": null
}"
I am trying to parse the JSON in the Message column with:
library(readr)
library(jsonlite)
df <- read_csv("log_file_from_above.csv")
fromJSON(as.character(df$Message))
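How can I get rid of the "trailing garbage"?
fromJSON() isn't "apply"ing over the character vector; it tries to convert the whole thing into a data frame at once. You can try
purrr::map(df$Message, jsonlite::fromJSON)
what @Abdou provided, or
jsonlite::stream_in(textConnection(gsub("\\n", "", df$Message)))
The latter two will create data frames. The first will create a list, which you can add as a column.
You can combine the last method with dplyr::bind_cols to create a new data frame containing all the data: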
dplyr::bind_cols(df[,1:3],
jsonlite::stream_in(textConnection(gsub("\\n", "", df$Message))))
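To make the difference concrete, here is a minimal sketch on a tiny made-up data frame (the two rows and their JSON strings are hypothetical stand-ins for the real log, not taken from it). As far as I understand jsonlite's behaviour, a character vector with more than one element is joined and parsed as a single document, which is why everything after the first object is rejected; the error message should mention trailing garbage. Parsing element by element avoids that and gives a list that can be stored as a list-column next to the timestamps.

```r
library(jsonlite)

# Hypothetical stand-in for df: two rows with one JSON string per row
df <- data.frame(
  Date = c("2016-06-17 19:29:43 +0000", "2016-06-17 19:36:38 +0000"),
  Message = c('{"id": -2, "incomeRange": 60000}',
              '{"id": -1, "firstName": "John"}'),
  stringsAsFactors = FALSE
)

# Whole vector at once: parsing fails, so we capture the error message
# as a string instead of stopping the script.
whole <- tryCatch(fromJSON(df$Message),
                  error = function(e) conditionMessage(e))

# One element at a time: a list with one parsed record per row,
# stored as a list-column so the other columns stay aligned with it.
df$parsed <- lapply(df$Message, fromJSON)
```

After this, df$parsed[[2]]$firstName gives the value for the second row, and df$Date is untouched.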
@Abdou also suggested an almost pure base R solution:
cbind(df, do.call(plyr::rbind.fill, lapply(paste0("[",df$Message,"]"), function(x) jsonlite::fromJSON(x))))
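A variation on the same bracket idea, which is not from the answer above but my own sketch: instead of wrapping each message separately and row-binding with plyr, you can join all messages with commas and wrap the result in a single pair of brackets, so that fromJSON() sees one JSON array of objects. The msgs vector below is made-up example data standing in for df$Message.

```r
library(jsonlite)

# Hypothetical stand-in for df$Message: objects with different keys
msgs <- c('{"id": -2, "incomeRange": 60000}',
          '{"id": -1, "firstName": "John"}')

# Build one JSON array: [ {...}, {...} ]
json_array <- paste0("[", paste(msgs, collapse = ","), "]")

# An array of objects is simplified to a data frame; keys that are
# missing from a record are filled with NA.
res <- fromJSON(json_array)
```

The result has one row per message, so it can be combined with the other columns via cbind() in the same way as above.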
Full working workflow:
library(dplyr)
library(jsonlite)
df <- read.table("http://pastebin.com/raw/MMPMwNZv",
quote='"', sep=",", stringsAsFactors=FALSE, header=TRUE)
bind_cols(df[,1:3], stream_in(textConnection(gsub("\\n", "", df$Message)))) %>%
glimpse()
##
Found 3 records...
Imported 3 records. Simplifying into dataframe...
## Observations: 3
## Variables: 19
## $ Id <int> 35054, 35055, 35059
## $ Date <chr> "2016-06-17 19:29:43 +0000", "2016-06-17 1...
## $ Level <chr> "INFO", "INFO", "INFO"
## $ id <int> -2, -1, -3
## $ ipAddress <chr> "100.100.100.100", NA, "100.200.300.400"
## $ howYouHearAboutUs <chr> NA, "Radio", NA
## $ isInterestedInOffer <lgl> TRUE, FALSE, TRUE
## $ incomeRange <int> 60000, 1, 100000
## $ isEmailConfirmed <lgl> FALSE, NA, TRUE
## $ firstName <chr> NA, "John", NA
## $ lastName <chr> NA, "Smith", NA
## $ email <chr> NA, "john.smith@gmail.com", NA
## $ city <chr> NA, "Smalltown", NA
## $ birthDate <chr> NA, "1999-12-10T05:00:00Z", NA
## $ password <chr> NA, "*********", NA
## $ agreeToTermsOfUse <lgl> NA, TRUE, TRUE
## $ visitUrl <chr> NA, NA, "https://www.website.com/?purpose=X"
## $ isIdentityConfirmed <lgl> NA, NA, FALSE
## $ validationResults <lgl> NA, NA, NA
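If the pastebin link is not reachable, the same steps can be tried offline. The sketch below rebuilds a cut-down, hypothetical version of the data frame in memory (two rows, with embedded newlines in Message as in the log) and uses base cbind() in place of bind_cols(); verbose = FALSE only silences the "Found ... records" messages.

```r
library(jsonlite)

# Hypothetical stand-in for the log file: multi-line JSON in Message
df <- data.frame(
  Id = c(35054L, 35055L),
  Level = c("INFO", "INFO"),
  Message = c('{\n  "id": -2,\n  "isEmailConfirmed": false\n}',
              '{\n  "id": -1,\n  "firstName": "John"\n}'),
  stringsAsFactors = FALSE
)

# stream_in() expects one JSON record per line, so strip the embedded
# newlines to make each message occupy exactly one line.
one_line <- gsub("\\n", "", df$Message)

con <- textConnection(one_line)
parsed <- stream_in(con, verbose = FALSE)
close(con)  # the connection was opened by textConnection(), so close it here

# Keep the non-JSON columns next to the parsed fields
result <- cbind(df[, c("Id", "Level")], parsed)
```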
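Does lapply(paste0("[", df$Message, "]"), function(x) jsonlite::fromJSON(x)) yield any results?
I copied a chunk of JSON data from an HTML document into a new text file and ran into this error as well. Following the comment above, I fixed it by manually adding an opening bracket ([) at the top of my JSON text file and a closing bracket (]) at the end.
How can I use purrr::map(df$Message, jsonlite::fromJSON) within the data frame, so that I don't lose the timestamps?
@hrbrmstr, can you add cbind(df[,1:3], do.call(plyr::rbind.fill, lapply(paste0("[", df$Message, "]"), function(x) jsonlite::fromJSON(x))))? It uses the plyr package.
I still get a parse error with dplyr::bind_cols(df[,1:3], jsonlite::stream_in(textConnection(gsub("\\n", "", df$Message)))).
Are you using the pastebin file with the odd indentation left intact?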