How to read a large file (> 1GB) in Lua?


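I am new to Lua (I am using it for the Torch7 framework). I have an input feature file (a text file) of about 1.4GB. A simple io.open call throws a "not enough memory" error when I try to open this file. While browsing user groups and the documentation, I found that this may be a limitation of Lua. Is there a workaround, or am I doing something wrong when reading the file?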

local function parse_file(path)
    -- read file
    local file = assert(io.open(path,"r"))
    local content = file:read("*all")
    file:close()

    -- split on start/end tags.
    local sections = string.split(content, start_tag)
    for j=1,#sections do
        sections[j] = string.split(sections[j],'\n')
        -- remove the end_tag
        table.remove(sections[j], #sections[j])
    end 
    return sections
end

local train_data = parse_file(file_loc .. '/' .. train_file)

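Edit: The input file I am trying to read contains the image features I want to train my model on. The file is laid out sequentially ({start tag}…contents…{end tag}{start tag}…and so on…), so it would be fine if I could load one section at a time (from start tag to end tag). However, I want all of these sections to end up loaded in memory.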

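I have never had to read a file this large, but if you are running out of memory you will probably need to read it line by line. After some quick research, I found the following on the Lua website: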

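buff is a new string with 50,020 bytes, and the old string is now garbage. After two loop cycles, buff is a string with 50,040 bytes, and there are two old strings making up more than 100 KB of garbage. So Lua decides, quite correctly, that it is a good time to run its garbage collector, and it frees those 100 KB. The problem is that this happens every two cycles, so Lua will run its garbage collector two thousand times before finishing the loop. Even with all this work, its memory usage will be approximately three times the file size. To make things worse, each concatenation must copy the whole string content (50 KB and growing) into the new string.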

So loading a large file seems to use a lot of memory even when you read it line by line, if you concatenate on every iteration like this:
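The pattern being warned about can be sketched roughly as follows. This is a minimal illustration rather than the snippet from the quoted page, and the function name `read_by_concat` is made up for this example.

```lua
-- Anti-pattern: builds the file contents by repeated concatenation.
-- Every `..` allocates a new string and copies everything read so far,
-- so the total amount of copying grows quadratically with the file size.
local function read_by_concat(path)
  local buff = ""
  for line in io.lines(path) do
    buff = buff .. line .. "\n"   -- the old value of buff becomes garbage
  end
  return buff
end
```

It works for small files, but on large inputs most of the time goes into copying and garbage collection.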

They then propose a more memory-efficient procedure:
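One common way to get a similar effect in current Lua is to collect the pieces in a table and join them once with `table.concat`. Note that this is a swapped-in technique and not necessarily the procedure from the quoted page; `read_with_buffer` is a hypothetical name.

```lua
-- Collect the lines in a table and join them once at the end.
-- table.concat builds the result in a single pass instead of
-- re-copying the growing string on every iteration.
local function read_with_buffer(path)
  local parts = {}
  for line in io.lines(path) do
    parts[#parts + 1] = line
  end
  -- io.lines strips the newline from each line; the empty last element
  -- makes table.concat put a "\n" after the final line as well.
  parts[#parts + 1] = ""
  return table.concat(parts, "\n")
end
```

The final string still has to fit in memory as a whole, so this reduces copying and garbage but does not by itself lift an interpreter-level memory limit.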

It uses far less memory than the previous approach. All of this information comes from:

Hope this helps.

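It turns out that the simplest way to solve the large-file loading problem is to upgrade Torch to Lua 5.2 or later, as suggested by the Torch developers on the torch7 Google group: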

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh

From version 5.2 onward the memory limit no longer applies. I have tested it and it works well.
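After reinstalling, it can be worth confirming which runtime a script is actually running on. A minimal sketch, assuming only the standard `_VERSION` global and the `jit` table that LuaJIT exposes; `runtime_name` is a name made up for this example.

```lua
-- `jit` is a global table that exists only under LuaJIT;
-- under plain Lua it is nil.
local function runtime_name()
  if type(jit) == "table" then
    return jit.version   -- version string reported by LuaJIT
  end
  return _VERSION        -- version string reported by plain Lua
end

print(runtime_name())
```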

Reference:

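Another possible solution (more elegant, and similar to the approach @Adam proposed in his answer) is to read the file line by line and store the data in tensors, since these use memory outside of LuaJIT. Thanks to Vislab, a code example is shown below: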

local ffi = require 'ffi'
-- this function loads a file line by line to avoid having memory issues
local function load_file_to_tensor(path)
  -- initialize tensor for the file
  local file_tensor = torch.CharTensor()
  
  -- Now we must determine the maximum size of the tensor in order to allocate it into memory.
  -- This is necessary to allocate the tensor in one sweep, where columns correspond to letters and rows correspond to lines in the text file.
  
  --[[ get  number of rows/columns ]]
  local file = assert(io.open(path, 'r')) -- open file; raise an error if it cannot be opened
  local max_line_size = 0
  local number_of_lines = 0
  for line in file:lines() do
    -- get maximum line size
    max_line_size = math.max(max_line_size, #line +1) -- the +1 is important to correctly fetch data
    
    -- increment the number of lines counter
    number_of_lines = number_of_lines +1
  end
  file:close() --close file
  
  -- Now that we have the maximum size of the tensor, we just have to allocate memory for it (as long as there is enough memory in RAM)
  file_tensor = file_tensor:resize(number_of_lines, max_line_size):fill(0)
  local f_data = file_tensor:data()
  
  -- The only thing left to do is to fetch data into the tensor. 
  -- Let's open the file again and fill the tensor using ffi
  file = assert(io.open(path, 'r')) -- reopen the file for the second pass
  for line in file:lines() do
    -- copy data into the tensor line by line
    ffi.copy(f_data, line)
    f_data = f_data + max_line_size
  end
  file:close() --close file

  return file_tensor
end

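Reading data from this tensor is simple and fast. For example, to read the 10th line of the file (which sits at row 10 of the tensor), you can simply do the following: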

local line_string = ffi.string(file_tensor[10]:data()) -- this will convert into a string var

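A word of warning: this takes up more memory, and it may not be a good fit when a few lines are much longer than the rest. But if memory is not a problem for you, this can be disregarded, because loading the tensor from the file into memory is very fast and may save you some gray hairs in the process.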
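To judge how much the padding costs for a particular file before allocating the tensor, one could compare the padded size with the bytes the lines really need. The following is a sketch in plain Lua; `padded_layout_cost` is a hypothetical helper, and it mirrors the `+1` per line used in the loader above.

```lua
-- Estimate how many bytes the padded (lines x longest line) layout needs,
-- compared with the bytes actually occupied by the lines themselves.
local function padded_layout_cost(path)
  local number_of_lines, max_line_size, used_bytes = 0, 0, 0
  for line in io.lines(path) do
    local size = #line + 1   -- +1 for the terminating zero byte
    number_of_lines = number_of_lines + 1
    max_line_size = math.max(max_line_size, size)
    used_bytes = used_bytes + size
  end
  -- first value: bytes the tensor would allocate; second: bytes really used
  return number_of_lines * max_line_size, used_bytes
end
```

If the first value is many times larger than the second, a few long lines dominate the layout and the padded tensor is probably a poor fit for that file.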

References:

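Are you sure the "not enough memory" error pops up right after io.open? That does not seem right. However, can you read the file in chunks? Do you really need the whole file in memory?

Note that the "*all" in file:read("*all") is obsolete in Lua 5.3 (I don't know which version Torch uses).

The LuaJIT used by Torch has memory limits. See e.g. …

@pschulz: The out-of-memory error pops up when local content = file:read("*all") is executed.

Okay, that seems reasonable. Please make this clear in the question. I was fairly sure what you meant when you wrote io.open, but you never know. But again, do you really have to read the whole file at once?

@pschulz: My apologies for the unclear question. I am still getting to grips with Lua while working on this task. The input file I am trying to read contains the image features I want to train my model on. The file is laid out sequentially (…contents…and so on…), so I can load one section at a time. However, I want all of these sections to be loaded in memory. Does this make things clearer? I am editing the question accordingly. Thanks :)

Careful: old code. There is no tinsert or tremove etc. anymore, but the idea is still valid. The most efficient way to read a complete file into memory is still file:read("*a"). Reading smaller chunks only makes sense if you can let the GC collect the previous chunks.

@siffiejoe: So reading the file with file:read("*a") is the best option? I.e. one simply cannot read large files in Lua :/

@Adam: Thank you for the detailed answer, but I don't think it helps in my case. The page you took this information from says: "To read a whole file, you can use the "*all" option, which reads it at once. But sometimes you do not have such a simple solution. Then, the only solution is a more efficient algorithm for your problem." I am not very clear on this yet, so please correct me if I am wrong.

@NightFury13: file:read("*a") is the best option if you need the whole file contents as a single string in memory. If your file:read("*a") fails, it means that you cannot store the whole file at once. I suggest providing an iterator interface to your file contents, so that you can bring in the elements one by one.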
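The iterator idea from the last comment could look roughly like this. It is a sketch under assumptions: each tag is taken to sit on its own line, the tag strings are passed in by the caller, and the name `sections` is invented for this example.

```lua
-- Returns an iterator that yields one section at a time as a table of lines,
-- so only the current section has to be held as Lua strings.
-- Assumption: the start tag and the end tag each occupy a whole line.
local function sections(path, start_tag, end_tag)
  local file = assert(io.open(path, "r"))
  return function()
    local current = nil
    while true do
      local line = file:read("*l")
      if line == nil then break end          -- end of file
      if line == start_tag then
        current = {}                         -- begin a new section
      elseif line == end_tag then
        if current then return current end   -- section complete
      elseif current then
        current[#current + 1] = line         -- line inside a section
      end
    end
    file:close()
    return nil                               -- stops a generic for loop
  end
end

-- Usage (file name and tags are placeholders):
-- for section in sections("features.txt", "<start>", "<end>") do
--   -- process the section, or copy it into a tensor, before the next one
-- end
```

Keeping every yielded table around would bring the memory problem back, so each section is best processed or copied into storage outside the Lua heap before moving on to the next.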
