Python 3.x: How do I read microarray data?
Thank you for your help. I want to use the following Python code to read and process data from an Affymetrix microarray dataset. My goal is to elucidate differential gene expression in mononuclear cells under Crohn's disease and ulcerative colitis conditions. The code runs without errors, but when I inspect the contents of X, I get an empty array (i.e. array([], dtype=float64)), which is of course useless. Here is the link to the original dataset: I have spent a long time trying to work out why my output is empty and unusable, without success. The code is as follows:
import gzip
import numpy as np
"""
Read in a SOFT format data file. The following values can be exported:
GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X : A dxn array of gene expression values
"""
# Path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:

    SIF = {}
    for line in fid:
        if line.startswith(line, len("!dataset_table_begin")):
            break
        elif line.startswith(line, len("!subject_description")):
            subset_description = line.split("=")[1].strip()
        elif line.startswith(line, len("!subset_sample_id")):
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
In the console, I get this output:
X
Out[94]: array([], dtype=float64)
X.shape
Out[95]: (0,)
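A small sketch of two things that are likely relevant here, using a made-up in-memory file rather than the real dataset. First, gzip.open opens in binary mode by default, so each line is a bytes object. Second, a call of the form line.startswith(line, n), as used in the code in the question, looks for the whole line starting at offset n, which cannot match for a non-empty line and n > 0. The variable names below are illustrative only.

```python
import gzip
import io

# Tiny in-memory stand-in for a SOFT file (made-up content, not the real dataset)
raw = gzip.compress(b"!dataset_table_begin\nID_REF\tGSM1\n")

# gzip.open defaults to binary mode ("rb"), so iteration yields bytes
with gzip.open(io.BytesIO(raw)) as fid:
    line = next(fid)

line_is_bytes = isinstance(line, bytes)

# Form used in the question: the prefix is the line itself, searched from an offset
check_from_question = line.startswith(line, len("!dataset_table_begin"))

# Comparing against a bytes prefix matches as intended
check_with_bytes_prefix = line.startswith(b"!dataset_table_begin")

print(line_is_bytes, check_from_question, check_with_bytes_prefix)
```

With checks that never match, the header-parsing loop never reaches its break, so the parsing that follows does not see the data table.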
Thanks again for your suggestions. This works perfectly:
import gzip
import numpy as np
"""
Read in a SOFT format data file. The following values can be exported:
GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample descriptions of length n
X : A dxn array of gene expression values
"""
# Path to the data file
fname = "../data/GDS1615_full.soft.gz"

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:

    SIF = {}
    for line in fid:
        if line.startswith(b"!dataset_table_begin"):
            break
        elif line.startswith(b"!subset_description"):
            subset_description = line.decode('utf8').split("=")[1].strip()
        elif line.startswith(b"!subset_sample_id"):
            subset_ids = line.decode('utf8').split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

    ## Next line is the column headers (sample id's)
    SID = next(fid).split(b"\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith(b"GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

    ## Get a list of sample labels
    STP = [SIF[k.decode('utf8')] for k in SID]

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith(b"!dataset_table_end"):
            break

        V = line.split(b"\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0].decode() + ";" + V[1].decode())

X = np.array(X)

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]
Result:
X.shape
Out[4]: (22283, 127)
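As a possible alternative to decoding each line by hand, gzip.open also accepts a text mode ("rt"), in which lines arrive as str and plain string prefixes can be used. The sketch below applies the same parsing steps to a made-up miniature SOFT-like file held in memory. The sample IDs, gene row, and values are invented for illustration and are not taken from GDS1615, and stripping the trailing newline before splitting is a choice made for this example.

```python
import gzip
import io

import numpy as np

# Made-up miniature SOFT-like content, for illustration only
text = (
    "!subset_description = ulcerative colitis\n"
    "!subset_sample_id = GSM1,GSM2\n"
    "!dataset_table_begin\n"
    "ID_REF\tIDENTIFIER\tGSM1\tGSM2\n"
    "1007_s_at\tDDR1\t1.5\t2.5\n"
    "!dataset_table_end\n"
)
raw = gzip.compress(text.encode("utf8"))

# "rt" opens the gzip stream in text mode, so each line is a str
with gzip.open(io.BytesIO(raw), "rt", encoding="utf8") as fid:

    # Map each sample id to the description of its subset
    SIF = {}
    for line in fid:
        if line.startswith("!dataset_table_begin"):
            break
        elif line.startswith("!subset_description"):
            subset_description = line.split("=")[1].strip()
        elif line.startswith("!subset_sample_id"):
            for k in line.split("=")[1].split(","):
                SIF[k.strip()] = subset_description

    # Column headers; keep only the sample (GSM) columns
    SID = next(fid).rstrip("\n").split("\t")
    I = [i for i, x in enumerate(SID) if x.startswith("GSM")]
    SID = [SID[i] for i in I]
    STP = [SIF[k] for k in SID]

    # Expression values and gene identifiers
    GID, X = [], []
    for line in fid:
        if line.startswith("!dataset_table_end"):
            break
        V = line.rstrip("\n").split("\t")
        X.append([float(V[i]) for i in I])
        GID.append(V[0] + ";" + V[1])

X = np.array(X)
print(X.shape, STP, GID)
```

For a real file on disk, the same idea would be to pass the path and "rt" to gzip.open in place of the in-memory buffer.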