Efficiently reading a matrix of digits without delimiters in Python


python performance matrix


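I have a file containing a matrix of digits [0-9] of shape (N, M), with no delimiters between the digits. N is about 50k and M is about 50k. For example, a small version of the matrix file is: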

mat.txt

0012230012000
0012230002300
0012230004200

I am currently using the code below, but I am not very happy with its speed.

def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path) as f:
        mat = np.array(
            [
                np.array([int(c) for c in line.strip()])
                for line in f.readlines()
            ],
            dtype=np.int8,
        )
    return mat

Edit: here is a mini benchmark.

import numpy as np


def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path) as f:
        mat = np.array(
            [
                np.array([int(c) for c in line.strip()])
                for line in f.readlines()
            ],
            dtype=np.int8,
        )
    return mat


%timeit read_int_mat("mat.txt")
%timeit np.genfromtxt("mat.txt", delimiter=1, dtype="int8")
print(read_int_mat("mat.txt"))
print(np.genfromtxt("mat.txt", delimiter=1, dtype="int8"))

The output is:

61.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
327 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]

Is there anything I can try to speed this up further? Would Cython help? Many thanks.

You can use np.genfromtxt, for example:

File (13 columns):

0012230012000
0012230002300
0012230004200

Then:
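A minimal sketch of reading this file with `np.genfromtxt`, assuming the same arguments as the `np.genfromtxt` line in the question's mini benchmark. The temporary path and the lines that recreate the sample file are only there so the snippet runs on its own:

```python
import os
import tempfile

import numpy as np

# Recreate the 13-column sample file in a temporary directory (illustrative path).
path = os.path.join(tempfile.mkdtemp(), "mat.txt")
with open(path, "w") as f:
    f.write("0012230012000\n0012230002300\n0012230004200\n")

# An integer delimiter makes genfromtxt split each line into
# fixed-width fields, here one character per field.
mat = np.genfromtxt(path, delimiter=1, dtype="int8")
print(mat)
```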

Prints:

[[0 0 1 2 2 3 0 0 1 2 0 0 0]
 [0 0 1 2 2 3 0 0 0 2 3 0 0]
 [0 0 1 2 2 3 0 0 0 4 2 0 0]]

Edit: a version using np.fromiter, with the file opened in binary mode:

def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )
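A possible variation on the same idea, not part of the original answer: because the file is opened in binary mode, each stripped line is already a buffer of ASCII codes, so `np.frombuffer` can view it directly and a single subtraction converts the codes to digits. The name `read_npfrombuffer` is made up for this sketch, and it assumes the file contains only digit characters and no blank lines.

```python
import numpy as np


def read_npfrombuffer(path):
    """Hypothetical variant of read_npfromiter that avoids per-character conversion."""
    with open(path, "rb") as f:
        rows = [
            # Each byte is the ASCII code of a digit; subtracting ord("0") maps it to 0..9.
            np.frombuffer(line.strip(), dtype=np.uint8) - ord("0")
            for line in f
        ]
    return np.array(rows, dtype=np.int8)
```

This was not timed against the other versions, so any speed-up is an expectation rather than a measurement.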

Benchmarking on a file with shape (168, 9360), using the timing script listed at the end:

Results (seconds, in the order np.genfromtxt, read_int_mat, read_npfromiter):

1.0680423599551432
0.28135157003998756
0.19099885696778074

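What should the output of print(mat) be in this example?

@mkrieger1 Thanks for the question, I updated the post with an edit. It will be an (N, M) numpy matrix.

Can you show it? Which (N, M) numpy matrix?

Thanks for your reply! I remember trying genfromtxt, and its performance was actually not very satisfying. I updated the description with a mini benchmark.

@david I added a version using np.fromiter, with the file opened in binary mode.

Thank you, the new function is indeed faster. Do you think there is any way to improve it further? Is Cython worth trying?

@david Yes, implementing the function in a lower-level language such as C (via Cython) would be faster: it skips creating temporary Python lists, the unicode conversion, and so on. Preallocating the array in memory up front would also help.

May I ask what the best practice is for preallocating the array? I am not sure whether the number of rows can be inferred quickly. It seems possible, since the file is highly structured.

Do you know of a function that can handle both plain text and compressed files (with a .gz suffix)?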
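The last two comments can be addressed together in one rough sketch, which is not from the original thread. It assumes that every line has the same width, that lines end in "\n", that a ".gz" suffix means gzip compression, and that the whole file fits in memory. The function name `read_digit_matrix` is invented for illustration. The row count is inferred from the total byte count and the width of the first line, so no per-line Python lists are built.

```python
import gzip

import numpy as np


def read_digit_matrix(path):
    """Read an undelimited digit matrix from a plain or gzip-compressed file."""
    # Assumed convention: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        data = f.read()
    if not data.endswith(b"\n"):
        data += b"\n"  # make sure every row ends with a newline
    # The row width (newline included) comes from the first line,
    # so the row count follows from the total number of bytes.
    width = data.index(b"\n") + 1
    n_rows, remainder = divmod(len(data), width)
    if remainder:
        raise ValueError("lines do not all have the same length")
    raw = np.frombuffer(data, dtype=np.uint8).reshape(n_rows, width)
    # Drop the newline column and map ASCII '0'..'9' to 0..9.
    return (raw[:, :-1] - ord("0")).astype(np.int8)
```

For a 50k by 50k matrix this holds roughly 2.5 GB of raw bytes plus the converted array at the same time, so a chunked reader may be preferable when memory is tight.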


The benchmark script used for the timings above:

import numpy as np
from timeit import timeit


def read_int_mat(path):
    """
    Read a matrix of integer with [0-9], and with no delimiter.
    """
    with open(path, "r") as f:
        mat = np.array(
            [
                np.array([int(c) for c in line.strip()])
                for line in f.readlines()
            ],
            dtype=np.int8,
        )
    return mat


def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )


def f1(f):
    return np.genfromtxt(
        f, delimiter=1, dtype="int8", autostrip=False, encoding="ascii"
    )


def f2(f):
    return read_int_mat(f)


def f3(f):
    return read_npfromiter(f)


t1 = timeit(lambda: f1("file.txt"), number=1)
t2 = timeit(lambda: f2("file.txt"), number=1)
t3 = timeit(lambda: f3("file.txt"), number=1)

print(t1)
print(t2)
print(t3)