R: Reading a TSV with a special encoding (two initial bytes, then UTF-8) and a NUL after every character


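I have an obscure TSV file that I am trying to read. It apparently starts with an identifier and contains embedded NUL values (there seems to be one NUL after every real character). These are the first 100 bytes of the file (shortened with a hex editor): (I had to rename it to .txt in order to upload it, but it is a TSV file.)

Unfortunately, I cannot read it with base functions, nor with readr or data.table.

Here is a reprex: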

file <- 'test_file.txt'

# read.table is not able to read the file since there are embedded NULs
tmp <- read.table(file, header = T, nrows = 2)
#> Warning in read.table(file, header = T, nrows = 2): line 1 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 2 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 3 appears to
#> contain embedded nulls
#> Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
#> dec, : embedded nul(s) found in input

# unfortunately the skipNul argument also doesn't work
tmp <- read.table(file, header = T, nrows = 2, skipNul = T)
#> Error in read.table(file, header = T, nrows = 2, skipNul = T): more columns than column names

# read_tsv from readr is also not able to read the file (probably since it stops each line after a NUL)
tmp <- readr::read_tsv(file, n_max = 2)
#> Warning: Duplicated column names deduplicated: '' => '_1' [3], '' =>
#> '_2' [4], '' => '_3' [5], '' => '_4' [6], '' => '_5' [7], '' => '_6' [8],
#> '' => '_7' [9], '' => '_8' [10], '' => '_9' [11], '' => '_10' [12], '' =>
#> '_11' [13]
#> Parsed with column specification:
#> cols(
#>   y = col_character(),
#>   col_character(),
#>   `_1` = col_character(),
#>   `_2` = col_character(),
#>   `_3` = col_character(),
#>   `_4` = col_character(),
#>   `_5` = col_character(),
#>   `_6` = col_character(),
#>   `_7` = col_character(),
#>   `_8` = col_character(),
#>   `_9` = col_character(),
#>   `_10` = col_character(),
#>   `_11` = col_character()
#> )
#> Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_, : Column 2 must be named

# fread from data.table is also not able to read the file (although it is the first function that more clearly shows the problem)
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'ÿþy\0e\0a\0r\0'

# read lines reads the first actual character 'y' and the file identifier characters that seem to parse as 'ÿþ' in UTF-8
readLines(file, n = 1)
#> Warning in readLines(file, n = 1): line 1 appears to contain an embedded
#> nul
#> [1] "ÿþy"

# the problem is in the hidden NUL characters as the following command shows
readLines(file, n = 1, skipNul = T)
#> [1] "ÿþyear\tmonth\tday\tDateTime\tAreaTypeCode\tAreaName\tMapCode\tPowerSystemResourceName\tProductionTypeName\tActualGenerationOutput\tActualConsumption\tInstalledGenCapacity\tSubmissionTS"
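
The `ÿþ` prefix and the `y\0e\0a\0r\0` pattern in the `fread` error are what a UTF-16LE file with a byte order mark (bytes `FF FE`) looks like when it is read as single-byte text. A minimal sketch of how this could be checked on the raw bytes follows. The real file is not available here, so the sketch first writes a small stand-in file; the helper name `looks_like_utf16le` is hypothetical, and the check only holds for text in the ASCII range.

```r
# Stand-in for the real file: BOM (FF FE) + ASCII text with a NUL after each byte.
# This interleaving only mimics UTF-16LE for pure ASCII content.
demo_file <- tempfile(fileext = ".tsv")
demo_text <- "year\tmonth\n2019\t1\n"
demo_bytes <- c(as.raw(c(0xff, 0xfe)),
                as.vector(rbind(charToRaw(demo_text), as.raw(0))))
writeBin(demo_bytes, demo_file)

# Hypothetical helper: TRUE if the file starts with FF FE and every second
# byte among the following bytes (up to n in total) is NUL.
looks_like_utf16le <- function(path, n = 100L) {
  head_bytes <- readBin(path, what = "raw", n = n)
  if (length(head_bytes) < 4L) return(FALSE)
  has_bom <- identical(head_bytes[1:2], as.raw(c(0xff, 0xfe)))
  body <- head_bytes[-(1:2)]
  high_bytes <- body[seq(2L, length(body), by = 2L)]
  has_bom && all(high_bytes == as.raw(0))
}

is_utf16le <- looks_like_utf16le(demo_file)
print(is_utf16le)
```

If the check holds for the real file, the "identifier" is most likely just the byte order mark, and the NULs are the high bytes of two-byte characters rather than stray padding.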

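Is there a workaround that would let me read this file? Preferably not with base functions, since they are very slow and I have to read multiple (>20) files of more than 300 MB each.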

A current workaround is described in the answer below.

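This answer relies heavily on another answer. I added some comments, adapted the example to handle the header bytes, and added the use of fread (data.table) and read_tsv (readr) to make the final step to a data frame.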

file <- 'test_file.txt'

# read the file as raw and skip the first two header bytes
data_raw <- readBin(file, raw(), file.info(file)$size)[3:file.info(file)$size]

# replace the NUL values by an uncommon UTF-8 character so that we can
# later filter this one out. Please check out this list for more uncommon
# characters: http://www.fileformat.info/info/charset/UTF-8/list.htm
data_raw[data_raw == as.raw(0)] <- as.raw(1)

# convert to a string and remove the replaced characters (raw(1) in our case)
data_string <- gsub("\001", "", rawToChar(data_raw), fixed = TRUE)

# convert the resulting string to a data frame by a function to your liking
data_tmp1 <- data.table::fread(text = data_string, header = TRUE) # quickest
data_tmp2 <- readr::read_tsv(I(data_string)) # slower and is not working well with the UTF-8 characters
data_tmp3 <- read.table(text = data_string, header = TRUE, sep = "\t") # base R; the original call without text = crashed R for my files (probably due to size)
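
Dropping the NUL bytes works as long as every character in the file is in the ASCII range; any other character occupies both bytes in UTF-16LE and would be mangled. As an alternative, assuming the file really is UTF-16LE with a two-byte BOM (an inference from the `ÿþ` prefix, not something confirmed in the thread), the raw bytes could be converted with base R's `iconv()` before parsing. The sketch below builds a small stand-in file so that it runs on its own; `read_utf16le_tsv` is a hypothetical helper name, and base `read.delim()` is used only to avoid package dependencies.

```r
# Stand-in file: BOM (FF FE) followed by UTF-16LE text, including a non-ASCII "ö".
demo_file <- tempfile(fileext = ".tsv")
demo_text <- enc2utf8("year\tAreaName\n2019\tK\u00f6ln\n")
demo_utf16 <- iconv(demo_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(as.raw(c(0xff, 0xfe)), demo_utf16), demo_file)

# Hypothetical helper: read a UTF-16LE TSV via raw bytes and iconv()
read_utf16le_tsv <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Drop the byte order mark if it is present
  if (length(bytes) >= 2L && identical(bytes[1:2], as.raw(c(0xff, 0xfe)))) {
    bytes <- bytes[-(1:2)]
  }
  # iconv() accepts a list of raw vectors and returns a character vector
  txt <- iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
  read.delim(text = txt, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
}

df <- read_utf16le_tsv(demo_file)
str(df)
```

For the large files mentioned in the question, the decoded string `txt` could presumably be handed to `data.table::fread(text = txt)` instead of `read.delim()`; that variant has not been timed here.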

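Can you upload your file?

You mean in addition to the txt link and the code included in the question?

Yes, I am not sure that uploading to GH leaves the raw contents of the file unchanged. I am more confident that copy-pasting into a Gist would not have that effect (in particular, the error I get is different from yours).

@MichaelChirico, how should I copy it? As a hex dump? As soon as I copy part of the file from a text editor, the encoding seems to change in a way that makes the problem disappear.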
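
One way to share the leading bytes without a text editor re-encoding them, as asked in the last comment, would be a hex dump produced from R itself. This is only a sketch: `bytes_to_hex` and `hex_to_bytes` are hypothetical helper names, and a small stand-in file is used instead of the real one.

```r
# Stand-in file: FF FE followed by "year" with a NUL after each character
demo_file <- tempfile(fileext = ".tsv")
writeBin(as.raw(c(0xff, 0xfe, 0x79, 0x00, 0x65, 0x00, 0x61, 0x00, 0x72, 0x00)),
         demo_file)

# Hypothetical helper: first n bytes of a file as a space-separated hex string
bytes_to_hex <- function(path, n = 100L) {
  paste(as.character(readBin(path, what = "raw", n = n)), collapse = " ")
}

# Hypothetical helper: turn such a hex string back into a raw vector
hex_to_bytes <- function(hex) {
  as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))
}

hex <- bytes_to_hex(demo_file)
restored <- hex_to_bytes(hex)

# The restored bytes can be written back with writeBin() to rebuild the file head
round_trip_ok <- identical(restored, readBin(demo_file, what = "raw", n = 100L))
print(hex)
print(round_trip_ok)
```

The hex string is plain ASCII, so it can be pasted into a comment or Gist, and anyone can rebuild the exact bytes from it with `writeBin(hex_to_bytes(hex), "some_file.tsv")`.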