R: reading a TSV with a specific encoding (two initial bytes, then UTF-8) and a NUL after every character


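I have an obscure TSV file that I am trying to read. It apparently starts with an identifier and has NUL values embedded in it (there seems to be one NUL after every real character). These are the first 100 bytes of the file (shortened, taken from a hex editor): (I had to rename the file to .txt to be able to upload it, but it is a TSV file.)

Unfortunately, I cannot read it with the base functions, nor with readr or data.table.

Here is a reprex: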

file <- 'test_file.txt'

# read.table is not able to read the file since there are embedded NULs
tmp <- read.table(file, header = T, nrows = 2)
#> Warning in read.table(file, header = T, nrows = 2): line 1 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 2 appears to
#> contain embedded nulls
#> Warning in read.table(file, header = T, nrows = 2): line 3 appears to
#> contain embedded nulls
#> Warning in scan(file = file, what = what, sep = sep, quote = quote, dec =
#> dec, : embedded nul(s) found in input

# unfortunately the skipNul argument also doesn't work
tmp <- read.table(file, header = T, nrows = 2, skipNul = T)
#> Error in read.table(file, header = T, nrows = 2, skipNul = T): more columns than column names

# read_tsv from readr is also not able to read the file (probably since it stops each line after a NUL)
tmp <- readr::read_tsv(file, n_max = 2)
#> Warning: Duplicated column names deduplicated: '' => '_1' [3], '' =>
#> '_2' [4], '' => '_3' [5], '' => '_4' [6], '' => '_5' [7], '' => '_6' [8],
#> '' => '_7' [9], '' => '_8' [10], '' => '_9' [11], '' => '_10' [12], '' =>
#> '_11' [13]
#> Parsed with column specification:
#> cols(
#>   y = col_character(),
#>   col_character(),
#>   `_1` = col_character(),
#>   `_2` = col_character(),
#>   `_3` = col_character(),
#>   `_4` = col_character(),
#>   `_5` = col_character(),
#>   `_6` = col_character(),
#>   `_7` = col_character(),
#>   `_8` = col_character(),
#>   `_9` = col_character(),
#>   `_10` = col_character(),
#>   `_11` = col_character()
#> )
#> Error in read_tokens_(data, tokenizer, col_specs, col_names, locale_, : Column 2 must be named

# fread from data.table is also not able to read the file (although it is the first function that more clearly shows the problem)
tmp <- data.table::fread(file, nrows = 2)
#> Error in data.table::fread(file, nrows = 2): embedded nul in string: 'Ã¿Ã¾y\0e\0a\0r\0'

# read lines reads the first actual character 'y' and the file identifier characters that seem to parse as 'Ã¿Ã¾' in UTF-8
readLines(file, n = 1)
#> Warning in readLines(file, n = 1): line 1 appears to contain an embedded
#> nul
#> [1] "Ã¿Ã¾y"

# the problem is in the hidden NUL characters as the following command shows
readLines(file, n = 1, skipNul = T)
#> [1] "Ã¿Ã¾year\tmonth\tday\tDateTime\tAreaTypeCode\tAreaName\tMapCode\tPowerSystemResourceName\tProductionTypeName\tActualGenerationOutput\tActualConsumption\tInstalledGenCapacity\tSubmissionTS"
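One way to check what the leading identifier actually is would be to look at the raw bytes instead of decoded text. The sketch below is only an illustration: `demo_file` is a hypothetical stand-in built on the fly, not the asker's file, and it assumes the two leading bytes are FF FE (which is what `Ã¿Ã¾` in the output above would correspond to, and which is also the UTF-16LE byte order mark). Whether the real file is UTF-16LE was not confirmed in the thread.

```r
# Build a tiny stand-in file (hypothetical content, NOT the asker's data):
# the bytes FF FE followed by "year\tmonth\n" encoded as UTF-16LE.
demo_file <- tempfile(fileext = ".txt")
bom  <- as.raw(c(0xff, 0xfe))
body <- iconv("year\tmonth\n", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
writeBin(c(bom, body), demo_file)

# Read the first ten bytes without decoding them as text
head_bytes <- readBin(demo_file, what = "raw", n = 10)
print(head_bytes)

# Does the file start with FF FE?
has_utf16le_bom <- identical(head_bytes[1:2], as.raw(c(0xff, 0xfe)))

# Is every second byte after the two-byte prefix a NUL?
nul_after_each <- all(head_bytes[seq(4, 10, by = 2)] == as.raw(0))
```

If both checks come out `TRUE` on a real file, the "NUL after every character" pattern is at least consistent with UTF-16LE text in which all characters happen to be ASCII.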


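Is there a workaround that allows me to read this file? Preferably not with the base functions, since they are very slow and I have to read multiple (> 20) files of more than 300 MB each.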

The current workaround is described in the answer below.

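This answer relies heavily on another answer. I added some comments, modified the example to handle the header bytes, and added the use of fread (data.table) and read_tsv (readr) to make the final step to a data frame.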

file <- 'test_file.txt'

# read the whole file as raw bytes and drop the first two header bytes
data_raw <- readBin(file, what = "raw", n = file.size(file))[-(1:2)]

# replace the NUL values by an uncommon character (the byte 0x01 here) so that
# we can filter it out later. Please check out this list for more uncommon
# characters: http://www.fileformat.info/info/charset/UTF-8/list.htm
data_raw[data_raw == as.raw(0)] <- as.raw(1)

# convert to a string and remove the replaced characters (raw(1) in our case)
data_string <- gsub("\001", "", rawToChar(data_raw), fixed = TRUE)

# convert the resulting string to a data frame by a function to your liking
data_tmp1 <- data.table::fread(data_string, header = TRUE) # quickest
data_tmp2 <- readr::read_tsv(data_string) # slower and is not working well with the UTF-8 characters
# read.table needs text = for literal data; a bare string is taken as a file name
data_tmp3 <- read.table(text = data_string, header = TRUE, sep = "\t") # crashed R for my files (probably due to size)
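A possible alternative to replacing and stripping the NUL bytes, assuming the file really is UTF-16LE with an FF FE byte order mark (an assumption, not something confirmed in the thread), is to let `iconv()` decode the raw bytes. This also keeps non-ASCII characters intact, which the byte-stripping approach cannot do. The helper name `read_utf16le_text` and the demo data are hypothetical.

```r
# Hypothetical helper: decode a UTF-16LE file (with or without the FF FE
# prefix) into a single UTF-8 string.
read_utf16le_text <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  # Drop the two-byte prefix FF FE if it is present
  if (length(bytes) >= 2 && identical(bytes[1:2], as.raw(c(0xff, 0xfe)))) {
    bytes <- bytes[-(1:2)]
  }
  # iconv() accepts a list of raw vectors, so the NUL bytes are converted
  # before they ever reach an R character string
  iconv(list(bytes), from = "UTF-16LE", to = "UTF-8")
}

# Stand-in file with one non-ASCII value (made-up data, not the asker's file)
demo_file <- tempfile(fileext = ".txt")
demo_text <- "year\tAreaName\n2016\tK\u00f6ln\n"
writeBin(
  c(as.raw(c(0xff, 0xfe)),
    iconv(demo_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]),
  demo_file
)

decoded <- read_utf16le_text(demo_file)

# Parse the decoded text; base R is used here only to keep the sketch
# free of package dependencies
demo_df <- utils::read.delim(text = decoded, stringsAsFactors = FALSE)
```

The decoded string could presumably be handed to `data.table::fread()` in the same way as `data_string` above if speed matters; that combination was not tested here.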

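Can you upload your file?

You mean in addition to the txt link and the code included in the question?

Yes. I am not sure that uploading to GH leaves the raw content of the file unchanged. I am more confident that copy-pasting into a Gist would not have that effect (in particular, I get a different error than you do).

MichaelChirico, how should I copy it? As a hex dump? As soon as I copy part of the file from a text editor, the encoding seems to change, which makes the problem go away.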