Python / R: loop through .dat files and extract only specific data as columns
Tags: python, r, loops, extraction
I have 900+ folders on my local drive, and each folder contains one file with a .dat extension. I want to loop through each folder, open the file inside it, pull out only specific data, and write that data to a new file. Each .dat file looks like this -
Authors:
# Pallavi Subhraveti
# Quang Ong
# Tim Holland
# Anamika Kothari
# Ingrid Keseler
# Ron Caspi
# Peter D Karp
# Please see the license agreement regarding the use of and distribution of this file.
# The format of this file is defined at http://bioinformatics.ai.sri.com
# Version: 21.5
# File Name: compounds.dat
# Date and time generated: October 24, 2017, 14:52:45
# Attributes:
# UNIQUE-ID
# TYPES
# COMMON-NAME
# ABBREV-NAME
# ACCESSION-1
# ANTICODON
# ATOM-CHARGES
# ATOM-ISOTOPES
# CATALYZES
# CFG-ICON-COLOR
# CHEMICAL-FORMULA
# CITATIONS
# CODONS
# COFACTORS-OF
# MOLECULAR-WEIGHT
# MONOISOTOPIC-MW
[Data Chunk 1]
UNIQUE-ID - CPD0-1108
TYPES - D-Ribofuranose
COMMON-NAME - β-D-ribofuranose
ATOM-CHARGES - (9 -1)
ATOM-CHARGES - (6 1)
CHEMICAL-FORMULA - (C 5)
CHEMICAL-FORMULA - (H 14)
CHEMICAL-FORMULA - (N 1)
CHEMICAL-FORMULA - (O 6)
CHEMICAL-FORMULA - (P 1)
CREDITS - SRI
CREDITS - kaipa
DBLINKS - (CHEBI "10647" NIL |kothari| 3594051403 NIL NIL)
DBLINKS - (BIGG "37147" NIL |kothari| 3584718837 NIL NIL)
DBLINKS - (PUBCHEM "25200464" NIL |taltman| 3466375284 NIL NIL)
DBLINKS - (LIGAND-CPD "C01233" NIL |keseler| 3342798255 NIL NIL)
INCHI - InChI=1S/C5H14NO6P/c6-1-2-11-13(9,10)12-4-5(8)3-7/h5,7-8H,1-4,6H2,(H,9,10)
MOLECULAR-WEIGHT - 215.142
MONOISOTOPIC-MW - 216.0636987293
NON-STANDARD-INCHI - InChI=1S/C5H14NO6P/c6-1-2-11-13(9,10)12-4-5(8)3-7/h5,7-8H,1-4,6H2,(H,9,10)
SMILES - C(OP([O-])(OCC(CO)O)=O)C[N+]
SYNONYMS - sn-Glycero-3-phosphoethanolamine
SYNONYMS - 1-glycerophosphorylethanolamine\
[Data Chunk 2]
//
UNIQUE-ID - URIDINE
TYPES - Pyrimidine
....
....
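Each file has roughly 18,000 lines (as viewed in Notepad++). I now want to create a new file that contains only specific columns from the data. The new file should look like this -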
UNIQUE-ID  TYPES           COMMON-NAME       CHEMICAL-FORMULA  BIGG ID  CHEMSPIDER ID  CAS ID  CHEBI ID  PUBCHEM ID  MOLECULAR-WEIGHT  MONOISOTOPIC-MW
CPD0-1108  D-Ribofuranose  β-D-ribofuranose  C5H14N1O6P1       37147    NA             NA      10647     25200464    215.142           216.0636987293
URIDINE    Pyrimidine      ...
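Not every data chunk in every file has information for all the columns I need, which is why I put NA in the output table for those columns. Blank values in those columns would be perfectly fine too, since I can deal with the blanks separately later.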
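As a rough illustration of how one record could be flattened into the row shown above, here is a minimal Python sketch. It assumes the record has already been parsed into a dict that maps each attribute to the list of its values. The names `COLUMNS`, `DB_COLUMNS` and `record_to_row` are hypothetical. The `CAS` and `CHEMSPIDER` tags in DBLINKS are an assumption, since the sample only shows CHEBI, BIGG, PUBCHEM and LIGAND-CPD. Taking the first value of single-valued attributes is also an assumption.

```python
import re

# Output columns, in the order of the desired table.
COLUMNS = ["UNIQUE-ID", "TYPES", "COMMON-NAME", "CHEMICAL-FORMULA",
           "BIGG ID", "CHEMSPIDER ID", "CAS ID", "CHEBI ID", "PUBCHEM ID",
           "MOLECULAR-WEIGHT", "MONOISOTOPIC-MW"]

# Assumed mapping from the database tag in a DBLINKS value to an output column.
DB_COLUMNS = {"BIGG": "BIGG ID", "CHEMSPIDER": "CHEMSPIDER ID", "CAS": "CAS ID",
              "CHEBI": "CHEBI ID", "PUBCHEM": "PUBCHEM ID"}


def record_to_row(record):
    """Flatten {attribute: [values]} into one row, using "NA" for missing columns."""
    row = {col: "NA" for col in COLUMNS}
    # Attributes copied as-is; only the first value is kept (assumption).
    for key in ("UNIQUE-ID", "TYPES", "COMMON-NAME",
                "MOLECULAR-WEIGHT", "MONOISOTOPIC-MW"):
        if record.get(key):
            row[key] = record[key][0]
    # "(C 5)", "(H 14)", ... are concatenated into "C5H14...".
    formula = ""
    for value in record.get("CHEMICAL-FORMULA", []):
        m = re.fullmatch(r"\(\s*(\w+)\s+(\d+)\s*\)", value.strip())
        if m:
            formula += m.group(1) + m.group(2)
    if formula:
        row["CHEMICAL-FORMULA"] = formula
    # '(CHEBI "10647" NIL ...)' contributes 10647 to the "CHEBI ID" column.
    for link in record.get("DBLINKS", []):
        m = re.match(r'\((\S+)\s+"([^"]*)"', link.strip())
        if m and m.group(1) in DB_COLUMNS:
            row[DB_COLUMNS[m.group(1)]] = m.group(2)
    return row


record = {
    "UNIQUE-ID": ["CPD0-1108"],
    "TYPES": ["D-Ribofuranose"],
    "CHEMICAL-FORMULA": ["(C 5)", "(H 14)", "(N 1)", "(O 6)", "(P 1)"],
    "DBLINKS": ['(CHEBI "10647" NIL |kothari| 3594051403 NIL NIL)',
                '(BIGG "37147" NIL |kothari| 3584718837 NIL NIL)',
                '(PUBCHEM "25200464" NIL |taltman| 3466375284 NIL NIL)'],
    "MOLECULAR-WEIGHT": ["215.142"],
}
row = record_to_row(record)
```

Columns with no matching attribute or database link keep the "NA" placeholder, which matches the CHEMSPIDER ID and CAS ID cells in the table above.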
These are the directories that contain the data -
File 1] -> C:\Users\robbie\Desktop\Organism_Data\aact1035194-hmpcyc\compounds.dat
File 2] -> C:\Users\robbie\Desktop\Organism_Data\aaph679198-hmpcyc\compounds.dat
File 3] -> C:\Users\robbie\Desktop\Organism_Data\yreg1002368-hmpcyc\compounds.dat
File 4] -> C:\Users\robbie\Desktop\Organism_Data\tden699187-hmpcyc\compounds.dat
...
...
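I am leaning towards using the dir function in R, as in a post I was referred to, but while writing the code I could not figure out what to put in the function's pattern argument, because the organism names (the folder names) are very odd and inconsistent.
Any help in getting the desired output is much appreciated. I have been thinking about ways to do this in R, but if I get good suggestions and an approach in Python, I am willing to try it in Python as well. Thanks a lot in advance for your help.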
Edit:
Link to the data -
One file
Break it into a few logical operations:
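# Split the lines of one file into chunks, starting a new chunk at every
# "[Data Chunk N]" marker; anything before the first marker is dropped.
text2chunks <- function(txt) {
  chunks <- split(txt, cumsum(grepl("^\\[Data Chunk.*\\]$", txt)))
  Filter(function(a) grepl("^\\[Data Chunk.*\\]$", a[1]), chunks)
}
# Turn one chunk of "KEY - value" lines into a one-row data.frame,
# optionally keeping only the attributes named in hdrs.
chunk2dataframe <- function(vec, hdrs = NULL, sep = " - ") {
  s <- stringi::stri_split(vec, fixed=sep, n=2L)
  s <- Filter(function(a) length(a) == 2L, s)  # drop lines without the separator
  df <- as.data.frame(setNames(lapply(s, `[[`, 2), sapply(s, `[[`, 1)),
                      stringsAsFactors=FALSE)
  # as.data.frame() turns names such as UNIQUE-ID into UNIQUE.ID, hence make.names()
  if (! is.null(hdrs)) df <- df[ names(df) %in% make.names(hdrs) ]
  df
}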
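With the following data I get lines, a character vector read from a single file. (hdrs is not defined in this answer; judging by the output below, it holds the wanted attribute names, such as UNIQUE-ID, TYPES and COMMON-NAME.)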
head(lines)
# [1] "Authors:"
# [2] "# Pallavi Subhraveti"
# [3] "# Quang Ong"
# [4] "# Please see the license agreement regarding the use of and distribution of this file."
# [5] "# The format of this file is defined at http://bioinformatics.ai.sri.com"
# [6] "# Version: 21.5"
str(text2chunks(lines))
# List of 2
# $ 1: chr [1:5] "[Data Chunk 1]" "UNIQUE-ID - CPD0-1108" "TYPES - D-Ribofuranose" "COMMON-NAME - β-D-ribofuranose" ...
# $ 2: chr [1:6] "[Data Chunk 2]" "// something out of place here?" "UNIQUE-ID - URIDINE" "TYPES - Pyrimidine" ...
str(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
# List of 2
# $ 1:'data.frame': 1 obs. of 3 variables:
# ..$ UNIQUE.ID : chr "CPD0-1108"
# ..$ TYPES : chr "D-Ribofuranose"
# ..$ COMMON.NAME: chr "β-D-ribofuranose"
# $ 2:'data.frame': 1 obs. of 3 variables:
# ..$ UNIQUE.ID : chr "URIDINE"
# ..$ TYPES : chr "Pyrimidine"
# ..$ COMMON.NAME: chr "β-D-ribofuranose or something"
The final product:
dplyr::bind_rows(lapply(text2chunks(lines), chunk2dataframe, hdrs=hdrs))
# UNIQUE.ID TYPES COMMON.NAME
# 1 CPD0-1108 D-Ribofuranose β-D-ribofuranose
# 2 URIDINE Pyrimidine β-D-ribofuranose or something
Since you want to repeat this over many files, it makes sense to wrap it in a convenience function:
# Note: hdrs is looked up in the calling environment here
text2dataframe <- function(txt) {
  dplyr::bind_rows(lapply(text2chunks(txt), chunk2dataframe, hdrs=hdrs))
}
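Since the asker is also open to Python, the same chunk-then-parse idea can be sketched there. This is a minimal sketch, not a tested solution for the real files. It assumes records are separated by a line containing only `//`, as the asker describes in the comments, rather than by `[Data Chunk N]` markers. It also assumes every data line has the form `KEY - value`. The name `parse_records` is hypothetical. Repeated attributes such as CHEMICAL-FORMULA are kept as lists so that nothing is lost.

```python
from collections import defaultdict


def parse_records(lines, sep=" - "):
    """Return one dict per record, mapping each attribute to a list of its values."""
    records = []
    current = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # header comment lines
        if line.strip() == "//":
            # A record separator: store what has been collected so far.
            if current:
                records.append(dict(current))
            current = defaultdict(list)
            continue
        key, found, value = line.partition(sep)
        if found:  # lines without the separator are ignored
            current[key.strip()].append(value.strip())
    if current:  # last record when the file does not end with "//"
        records.append(dict(current))
    return records


sample = [
    "# File Name: compounds.dat",
    "UNIQUE-ID - CPD0-1108",
    "TYPES - D-Ribofuranose",
    "CHEMICAL-FORMULA - (C 5)",
    "CHEMICAL-FORMULA - (H 14)",
    "//",
    "UNIQUE-ID - URIDINE",
    "TYPES - Pyrimidine",
    "//",
]
records = parse_records(sample)
```

For a real file, the lines could come from `open(path, encoding=...)`. The file encoding is not stated in the question, so it would need to be checked first.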
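Another approach. Here it only reads the one file you provided, but it can read multiple files. I added some intermediate results to show what the code is actually doing.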
library(tidyverse)
library(data.table)
library(zoo)
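# character vector with the full paths of all .dat files below the working directory
# (pattern is a regular expression, so the dot has to be escaped)
filenames <- list.files( path = getwd(), pattern = "\\.dat$", recursive = TRUE, full.names = TRUE )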
# > filenames
#[1] "C:/Users/********/Documents/Git/udls2/test.dat"
# read in the files with data.table's fread, keeping only the lines that start with
# UNIQUE-ID or TYPES; extend the regex pattern with the attributes you need
pattern <- "^UNIQUE-ID|^TYPES"
content.list <- lapply( filenames, function(x) fread( x, sep = "\n", header = FALSE )[grepl( pattern, V1 )] )
# > content.list
# [[1]]
# V1
# 1: UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3: UNIQUE-ID - URIDINE
# 4: TYPES - Pyrimidine
#add all content to a single data.table
dt <- rbindlist( content.list )
# > dt
# V1
# 1: UNIQUE-ID - CPD0-1108
# 2: TYPES - D-Ribofuranose
# 3: UNIQUE-ID - URIDINE
# 4: TYPES - Pyrimidine
#split the text into a variable name and its content
#(extra = "merge" keeps values that themselves contain " - " in one piece)
dt <- dt %>% separate( V1, into = c("var", "content"), sep = " - ", extra = "merge")
# > dt
# var content
# 1: UNIQUE-ID CPD0-1108
# 2: TYPES D-Ribofuranose
# 3: UNIQUE-ID URIDINE
# 4: TYPES Pyrimidine
#add an increasing id for every UNIQUE-ID
dt[var == "UNIQUE-ID", id := seq_len(.N)]
# > dt
# var content id
# 1: UNIQUE-ID CPD0-1108 1
# 2: TYPES D-Ribofuranose NA
# 3: UNIQUE-ID URIDINE 2
# 4: TYPES Pyrimidine NA
#fill down the id for all variables found
dt[, id := na.locf( id, na.rm = FALSE )]
# > dt
# var content id
# 1: UNIQUE-ID CPD0-1108 1
# 2: TYPES D-Ribofuranose 1
# 3: UNIQUE-ID URIDINE 2
# 4: TYPES Pyrimidine 2
#cast
dcast(dt, id ~ var, value.var = "content")
# id TYPES UNIQUE-ID
# 1: 1 D-Ribofuranose CPD0-1108
# 2: 2 Pyrimidine URIDINE
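The fill-down-then-cast idea above can also be sketched in plain Python: start a new row each time UNIQUE-ID appears and attach every following attribute to that row. This is a minimal sketch under stated assumptions. The name `long_to_wide` is hypothetical, and joining repeated attributes with `;` is an assumption made here. It matters because several attributes, such as CHEMICAL-FORMULA and DBLINKS, occur more than once per record, and a plain long-to-wide cast does not handle that well.

```python
def long_to_wide(pairs, id_key="UNIQUE-ID", joiner=";"):
    """Group (variable, content) pairs into one dict per record.

    A new record starts at every id_key; repeated variables inside a record
    are joined with `joiner` instead of being dropped.
    """
    rows = []
    for var, content in pairs:
        if var == id_key:
            rows.append({})  # same role as the increasing id in the answer
        if not rows:
            continue  # content that appears before the first id is skipped
        row = rows[-1]
        row[var] = content if var not in row else row[var] + joiner + content
    return rows


pairs = [
    ("UNIQUE-ID", "CPD0-1108"),
    ("TYPES", "D-Ribofuranose"),
    ("CHEMICAL-FORMULA", "(C 5)"),
    ("CHEMICAL-FORMULA", "(H 14)"),
    ("UNIQUE-ID", "URIDINE"),
    ("TYPES", "Pyrimidine"),
]
wide = long_to_wide(pairs)
```

Each dict in `wide` corresponds to one row of the cast table. Attributes missing from a record are simply absent, so they can be filled with NA when the table is written out.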
You don't need to define a pattern; just set the recursive argument to TRUE, e.g. dir("C:/Users/robbie/Desktop/Organism_Data/", recursive=TRUE, full.names=TRUE)
Regarding the use of require:
@r2evans Learning a bit more every day.. I'll adjust my answer.
Thanks for your answer. I'm still trying to get the desired output. Sorry for the confusion, but my data does not contain [Data Chunk 1], [Data Chunk 2], and so on. I edited that part in so that community members could understand the data clearly. The pattern in the data is that the line before each UNIQUE-ID contains the two characters //. Could you help me modify the code for that pattern? Thanks a lot!
I'm not going to guess what your data actually looks like. If you provide accurate and representative sample data, perhaps I can help. (Otherwise, GIGO.)
I have updated my question with a link to the data. Please help!
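For the directory-walking part discussed in the comments, a Python counterpart of `dir(..., recursive=TRUE)` could look like the sketch below. It assumes every organism folder holds a file literally named `compounds.dat`, as in the paths listed in the question. The helper names `find_dat_files` and `write_table` are hypothetical, and tab-separated output is an assumption. The demo runs in a temporary directory instead of the asker's Desktop path.

```python
import csv
import tempfile
from pathlib import Path


def find_dat_files(root, name="compounds.dat"):
    """Return every file called `name` below root, whatever the folder names are."""
    return sorted(Path(root).rglob(name))


def write_table(rows, columns, out_path):
    """Write dict rows as a tab-separated table; missing columns become "NA"."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, delimiter="\t",
                                restval="NA", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


# Small demonstration with two made-up organism folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("aact1035194-hmpcyc", "aaph679198-hmpcyc"):
        (root / folder).mkdir()
        (root / folder / "compounds.dat").write_text(
            "UNIQUE-ID - CPD0-1108\n//\n", encoding="utf-8")
    found = [p.parent.name for p in find_dat_files(root)]
    out = root / "combined.tsv"
    write_table([{"UNIQUE-ID": "CPD0-1108", "TYPES": "D-Ribofuranose"}],
                ["UNIQUE-ID", "TYPES", "CAS ID"], out)
    written = out.read_text(encoding="utf-8").splitlines()
```

Because the search matches on the file name only, the irregular folder names never need to appear in a pattern.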