Loading SEER data into R (tags: r, loading, read.table, fixed-width)


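I am trying to load SEER data from the ASCII files. The only loader provided is a .sas file, and I am trying to convert it into an equivalent R load command.

The .sas load file looks like this: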

filename seer9 './yr1973_2015.seer9/*.TXT';                                           

data in;                                                                              
infile seer9 lrecl=362;                                                             
input                                                                               
@ 1   PUBCSNUM             $char8.  /* Patient ID */                              
@ 9   REG                  $char10. /* SEER registry */                           
@ 19  MAR_STAT             $char1.  /* Marital status at diagnosis */             
@ 20  RACE1V               $char2.  /* Race/ethnicity */                          
@ 23  NHIADE               $char1.  /* NHIA Derived Hisp Origin */                
@ 24  SEX                  $char1.  /* Sex */

I have the following code to try to replicate a similar loading process:

data <- read.table("OTHER.TXT", 
col.names = c("pubcsnum", "reg", "mar_stat", "race1v", "nhaide", "sex"),
sep = c(1, 9, 19, 20, 23, 24))

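With the sep argument, read.table fails with the first error below. If I do not use the sep argument, I get the second error: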

Error in read.table("OTHER.TXT", col.names = c("pubcsnum", "reg", "mar_stat",
:invalid 'sep' argument

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
: 
  line 1 did not have 133 elements

Does anyone have experience loading SEER data? Does anyone have a suggestion as to why this is not working?

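*It is worth noting that when I use the fill=TRUE argument, the second error (line 1 did not have 133 elements) no longer appears, but when I inspect the first few observations the data are incorrect. I confirmed this further by summarizing a known variable, sex: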

> summary(data$sex)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.000e+00 2.000e+00 3.020e+03 7.852e+18 9.884e+13 2.055e+20

The values of sex should be 1 or 2, so this summary is nonsensical.

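Fixed-width files (which is what that .sas file describes) are read with the read.fwf function, which is part of the utils package that ships with R. I'm afraid the nicely formatted web page hosted by Princeton is completely wrong about how to use read.table for this purpose. There are no separators, only positions. In this case you can call read.fwf on the files in that directory, with the field widths taken from the .sas file (this assumes you have a directory named "yr1973_2015.seer9" in your working directory).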
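A minimal sketch of such a call, using only the six fields shown in the question. The two records are made up for illustration and are written to a temporary file rather than read from the real SEER directory. The widths are derived from the @ positions in the .sas excerpt, and the -1 is an assumption that position 22, which no field in the excerpt covers, should be skipped.

```r
# Two made-up 24-character records laid out like the first six SEER fields.
# Position 22 is unused in the excerpt (RACE1V ends at 21, NHIADE starts at 23).
rec1 <- paste0("00000001", "0000001501", "2", "01", " ", "0", "1")
rec2 <- paste0("00000002", "0000001502", "1", "02", " ", "0", "2")
tmp <- tempfile(fileext = ".TXT")
writeLines(c(rec1, rec2), tmp)

seer <- read.fwf(
  tmp,
  widths     = c(8, 10, 1, 2, -1, 1, 1),  # a negative width skips that many characters
  col.names  = c("pubcsnum", "reg", "mar_stat", "race1v", "nhiade", "sex"),
  colClasses = "character"                # keep codes such as "01" as text
)
unlink(tmp)
print(seer)
```

Reading every column as character avoids the kind of numeric summary shown in the question; individual columns can be converted afterwards if needed.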

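I wasn't sure whether the trailing information in those long lines would simply be ignored, so I checked with a slight modification of the first example on the ?read.fwf page: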

> ff <- tempfile()
> cat(file = ff, "12345689", "98765489", sep = "\n")
> read.fwf(ff, widths = c(1,2,3))
  V1 V2  V3
1  1 23 456
2  9 87 654
> unlink(ff)

I checked my recollection that using Anthony's name as a search term might help, and found that his website has been updated. Check out:

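The other comments and answers point out most of this, but here is a more complete answer to your exact question. I have heard of many people struggling with these ASCII files (and with the many related but not very simple packages), so I wanted to answer this for anyone else who comes searching.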

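Fixed width files

These SEER "ASCII" files are actually fixed-width text files (ASCII is an encoding standard, not a file format). This means there is no delimiter (such as "," or "\t") separating the fields, as there would be in a .csv or .tsv file.

Instead, each field is defined by a start and end position in the line (or sometimes by a start position and a field width/length). This is what we see in the .sas file you excerpted: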

input                                                                               
@ 1   PUBCSNUM             $char8.  /* Patient ID */                              
@ 9   REG                  $char10. /* SEER registry */  
...

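What does this mean?

The first field, the patient ID, starts at position 1 and has length 8 (from $char8., similar to precision in SQL schemas), which means it ends at position 8.
The second field, the SEER registry ID, starts at position 9 (1 + 8 from the previous field) and has length 10 (again from $char10.), which means it ends at position 18.
And so on.

The @ numbers keep increasing, so the fields do not overlap.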
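The arithmetic above can be checked for all six fields in the question's excerpt. This is a small base R sketch; the layout data frame and its column names are made up for illustration, and the numbers are copied by hand from the .sas lines.

```r
# Start positions and widths copied from the six input lines in the question.
layout <- data.frame(
  name  = c("PUBCSNUM", "REG", "MAR_STAT", "RACE1V", "NHIADE", "SEX"),
  start = c(1, 9, 19, 20, 23, 24),
  width = c(8, 10, 1, 2, 1, 1)
)

# A field occupies `width` characters beginning at `start`.
layout$end <- layout$start + layout$width - 1

# Unused characters between the end of the previous field and this field's start.
layout$gap_before <- layout$start - c(0, head(layout$end, -1)) - 1

print(layout)
```

In this excerpt NHIADE has a gap of one character before it (position 22 is not covered by any field), so fields do not overlap but are not always contiguous either. That matters when a reader is given widths instead of start and end positions.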

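Reading fixed width files

I find the readr::read_fwf() function nice and simple, mainly because it comes with a couple of helper functions, such as fwf_positions(), that tell it how to define each field by start and end position (or by width, using fwf_widths()).

So, to read just these two fields from the file, we can do the following: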

read_fwf(<file>, fwf_positions(start=c(1, 9), end=c(8, 18), col_names=c("patient_id", "registry_id")))


where col_names is only there to rename the columns.
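As a self-contained illustration of the two helpers, here is a sketch that reads the same two fields from a temporary file. The records are made up, and the object names (by_position, by_width, df_pos, df_wid) are only for this example. It assumes the readr package is installed.

```r
library(readr)

# Two made-up records; real SEER lines are much longer (lrecl=362 in the .sas file).
tmp <- tempfile(fileext = ".txt")
writeLines(c("000000010000001501", "000000020000001502"), tmp)

# Define the two fields by start and end position ...
by_position <- fwf_positions(
  start = c(1, 9), end = c(8, 18),
  col_names = c("patient_id", "registry_id")
)
# ... or by consecutive widths, which describes the same layout here.
by_width <- fwf_widths(c(8, 10), col_names = c("patient_id", "registry_id"))

# Read everything as character so leading zeros in the codes survive.
all_chr <- cols(.default = col_character())
df_pos <- read_fwf(tmp, col_positions = by_position, col_types = all_chr)
df_wid <- read_fwf(tmp, col_positions = by_width, col_types = all_chr)
unlink(tmp)
```

fwf_widths() assumes the fields sit back to back, so fwf_positions() is the safer choice when the layout has unused positions between fields.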

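Helper script

I have struggled with these files before, so I wrote a script that reads the .sas file and extracts the start position, width, column name, and description of each field.

The whole thing is below; just replace the file name: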

## Script to read the SEER file dictionary and use it to read SEER ASCII data files.

library(tidyverse)
library(stringr)

#### Reading the file dictionary ----
## https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas

sas.raw <- read_lines("https://seer.cancer.gov/manuals/read.seer.research.nov2017.sas")
sas.df <- tibble(raw = sas.raw) %>% 
  ## remove first few rows by insisting an @ that defines the start index of that field
  filter(str_detect(raw, "@")) %>% 
  ## extract out the start, width and column name+description fields
  mutate(start = str_replace(str_extract(raw, "@ [[:digit:]]{1,3}"), "@ ", ""),
         width = str_replace(str_extract(raw, "\\$char[[:digit:]]{1,2}"), "\\$char", ""),
         col_name = str_extract(raw, "[[:upper:]]+[[:upper:][:digit:][:punct:]]+"),
         col_desc = str_trim(str_replace(str_replace(str_extract(raw, "\\/\\*.+\\*\\/"), "\\/\\*", ""), "\\*\\/", "" )) ) %>% 
  ## coerce to integers
  mutate_at(vars(start, width), as.integer) %>% 
  ## calculate the end position
  mutate(end = start + width - 1)

column_mapping <- sas.df %>% 
  select(col_name, col_desc)

#### read the file with the start+end positions----

## CHANGE THIS LINE
file_path = "data/test_COLRECT.txt"

## read the file with the fixed width positions
data.df <- read_fwf(file_path, 
                    fwf_positions(sas.df$start, sas.df$end, sas.df$col_name))
## result is a tibble
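To see what the dictionary step extracts without downloading anything, the same idea can be tried on the six input lines from the question. This is a base R sketch, not the script above: it uses a single regular expression of my own instead of the stringr calls, and the names sas_lines, pattern, and dict are only for this example.

```r
# The six input lines from the question, as they appear in the .sas file.
sas_lines <- c(
  "@ 1   PUBCSNUM             $char8.  /* Patient ID */",
  "@ 9   REG                  $char10. /* SEER registry */",
  "@ 19  MAR_STAT             $char1.  /* Marital status at diagnosis */",
  "@ 20  RACE1V               $char2.  /* Race/ethnicity */",
  "@ 23  NHIADE               $char1.  /* NHIA Derived Hisp Origin */",
  "@ 24  SEX                  $char1.  /* Sex */"
)

# One pattern with four capture groups: start, name, width, description.
pattern <- "^@\\s*([0-9]+)\\s+(\\S+)\\s+\\$char([0-9]+)\\.\\s*/\\*\\s*(.*?)\\s*\\*/\\s*$"
parts <- regmatches(sas_lines, regexec(pattern, sas_lines, perl = TRUE))

# Element 1 of each match is the whole line; elements 2 to 5 are the groups.
dict <- data.frame(
  start    = as.integer(sapply(parts, `[`, 2)),
  col_name = sapply(parts, `[`, 3),
  width    = as.integer(sapply(parts, `[`, 4)),
  col_desc = sapply(parts, `[`, 5),
  stringsAsFactors = FALSE
)
dict$end <- dict$start + dict$width - 1L
print(dict)
```

The start and end columns of dict are what fwf_positions() expects, so a quick look at this table is a cheap sanity check before reading a large file.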


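Can you add fill=TRUE to the read.table command? read.table does not know how to handle the missing values.

fill=TRUE eliminates the error, but see my edit above for more explanation. Unfortunately, the data are not reliable.

In SAS you have to specify where the columns start (e.g. positions 1, 9, etc.), but you do not do that with the sep argument. sep is looking for a delimiter, not the starting position of a variable. For more details, see here:

I saw an approach that uses sep to define data positions here: . Admittedly, it is not described in the documentation and does not seem to work. However, just loading the data seems to create values for the variables that make no sense (as listed above). Is there a better way?

If you have SAS, I think it is best to create the file there and then read it into your R session with something like haven::read_sas(). As for your sex variable, it looks like an error in how the file is being read in.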
