Matlab: How to read in numbers with a comma as decimal separator?


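I have a whole lot (hundreds of thousands) of rather large (>0.5 MB) files, where the data are numerical but use a comma as the decimal separator. It is impractical for me to use an external tool like sed "s/,/./g". When the separator is a dot, I just use textscan(fid, '%f%f%f'), but I see no option to change the decimal separator. How can I read such files efficiently?

Sample line from a file: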

5,040000    18,040000   -0,030000

Note: there is a similar question elsewhere, but I use Matlab.

You can use txt2mat. It should handle the data automatically, but you can also specify the replacement explicitly:

A = txt2mat('data.txt','ReplaceChar',',.');

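It may not be very efficient, but if you only need it for your specific data format, you can copy the relevant part from its source file.

You can try to speed up txt2mat by passing the number of header lines and the number of columns as inputs, which bypasses its file analysis. There should then not be a factor of 25 compared with a textscan import of dot-separated decimals. (You can also contact me via my author page on the MathWorks site.)

Please let us know if you find a more efficient way to handle comma-separated decimals in Matlab.

With a test script I found a factor of less than 1.5. My code looks like this: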

tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

A = txt2mat(filename, tmco{:});

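Note the different 'ReplaceChar' value and the 'ReadMode' setting 'block'.

I got the following results for a ~5 MB file on my (not too new) machine:

txt2mat  test comma avg. time: 0.63231
txt2mat  test dot   avg. time: 0.45715
textscan test dot   avg. time: 0.4787

The full code of my test script: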

%% generate sample files

fdot = 'C:\temp\cDot.txt';
fcom = 'C:\temp\cCom.txt';

c = 5;       % # columns
r = 100000;  % # rows
test = round(1e8*rand(r,c))/1e6;
tdot = sprintf([repmat('%f ', 1,c), '\r\n'], test.'); % '
tdot = ['a header line', char([13,10]), tdot];

tcom = strrep(tdot,'.',',');

% write dot file
fid = fopen(fdot,'w');
fprintf(fid, '%s', tdot);
fclose(fid);
% write comma file
fid = fopen(fcom,'w');
fprintf(fid, '%s', tcom);
fclose(fid);

disp('-----')

%% read back sample files with txt2mat and textscan

% txt2mat-options with comma decimal sep.
tmco = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block', ...
        'ReplaceChar'   , {',.'} } ;

% txt2mat-options with dot decimal sep.
tmdo = {'NumHeaderLines', 1      , ...
        'NumColumns'    , 5      , ...
        'ConvString'    , '%f'   , ...
        'InfoLevel'     , 0      , ...
        'ReadMode'      , 'block'} ;

% textscan-options
tsco = {'HeaderLines'   , 1      , ...
        'CollectOutput' , true   } ;


A = txt2mat(fcom, tmco{:});
B = txt2mat(fdot, tmdo{:});

fid = fopen(fdot);
C = textscan(fid, repmat('%f',1,c) , tsco{:} );
fclose(fid);
C = C{1};

disp(['txt2mat  test comma (1=Ok): ' num2str(isequal(A,test)) ])
disp(['txt2mat  test dot   (1=Ok): ' num2str(isequal(B,test)) ])
disp(['textscan test dot   (1=Ok): ' num2str(isequal(C,test)) ])
disp('-----')

%% speed test

numTest = 20;

% A) txt2mat with comma
tic
for k = 1:numTest
    A = txt2mat(fcom, tmco{:});
    clear A
end
ttmc = toc;
disp(['txt2mat  test comma avg. time: ' num2str(ttmc/numTest) ])

% B) txt2mat with dot
tic
for k = 1:numTest
    B = txt2mat(fdot, tmdo{:});
    clear B
end
ttmd = toc;
disp(['txt2mat  test dot   avg. time: ' num2str(ttmd/numTest) ])

% C) textscan with dot
tic
for k = 1:numTest
    fid = fopen(fdot);
    C = textscan(fid, repmat('%f',1,c) , tsco{:} );
    fclose(fid);
    C = C{1};
    clear C
end
ttsc = toc;
disp(['textscan test dot   avg. time: ' num2str(ttsc/numTest) ])
disp('-----')
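
As a side note to the script above, the hand-written tic/toc loops could also be replaced by timeit, which takes care of warm-up and repeated runs. The following is only a sketch under that assumption: to stay runnable without txt2mat it times an in-memory strrep plus sscanf parse instead of the file reads measured above, and the handle names parseDot and parseCom are made up for illustration.

```matlab
% Sketch only: time in-memory parsing of comma-decimal text with timeit.
% parseDot and parseCom are illustrative names, not part of the script above.
c = 5;                                            % number of columns
r = 100000;                                       % number of rows
test = round(1e8*rand(r,c))/1e6;                  % same kind of sample data as above
tdot = sprintf([repmat('%f ', 1, c), '\n'], test.');
tcom = strrep(tdot, '.', ',');                    % comma-decimal version of the text

% Zero-argument handles, as required by timeit.
parseDot = @() sscanf(tdot, '%f', [c, Inf]).';
parseCom = @() sscanf(strrep(tcom, ',', '.'), '%f', [c, Inf]).';

A = parseCom();                                   % r-by-c matrix parsed from comma text
disp(['parse comma max abs error: ' num2str(max(abs(A(:) - test(:))))])
disp(['parse dot   time: ' num2str(timeit(parseDot))])
disp(['parse comma time: ' num2str(timeit(parseCom))])
```

The difference between the two timings gives a rough idea of what the comma replacement itself costs, separate from any file I/O.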

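My solution (assuming commas are used only as decimal separators and whitespace delimits columns) reads the whole file as text, replaces the commas with periods, and then parses the numbers.

If you happen to need to remove a single header line (as I did), then this should work: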

fid = fopen('FILENAME');                  % Open file
indat = fread(fid, '*char').';            % Read the entire file as a row vector of characters
fclose(fid);                              % Close file
indat = strrep(indat, ',', '.');          % Replace commas with periods
endheader = find(indat == 10, 1);         % Find the first line feed (end of the header line)
indat = indat(endheader+1:end);           % Keep all characters after the header line
data = textscan(indat, '%f %f');          % Convert the text to numerical data
colA = data{1};  colB = data{2};          % Unpack the two columns
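
As a possible variation on the snippet above (a sketch, not part of the original answer): if every row has the same number of columns, sscanf can parse the converted text directly into a matrix. The function name readCommaDecimal and its arguments are made up for illustration. It assumes that commas appear only as decimal separators and that header lines end in a line feed.

```matlab
function A = readCommaDecimal(filename, numCols, numHeaderLines)
% readCommaDecimal  Hypothetical helper (not from the original answers).
% Reads a whitespace-delimited numeric text file that uses ',' as the
% decimal separator and returns an N-by-numCols double matrix.
% Assumes commas occur only as decimal separators.

    fid = fopen(filename, 'r');
    if fid < 0
        error('readCommaDecimal:open', 'Cannot open file: %s', filename);
    end
    txt = fread(fid, '*char').';      % whole file as a row vector of chars
    fclose(fid);

    % Skip header lines: drop everything up to the n-th line feed.
    if numHeaderLines > 0
        lf = find(txt == 10, numHeaderLines);
        if numel(lf) < numHeaderLines
            txt = '';                 % file has fewer lines than headers
        else
            txt = txt(lf(end)+1:end);
        end
    end

    txt(txt == ',') = '.';            % replace commas with dots in place

    % sscanf fills its output column by column, so read a
    % numCols-by-N matrix and transpose it.
    A = sscanf(txt, '%f', [numCols, Inf]).';
end
```

For the sample line in the question, a call might look like A = readCommaDecimal('data.txt', 3, 0). Note that sscanf pads the last column with zeros if the number of values is not a multiple of numCols, so this only suits files with a fixed column count.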

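Well, efficiency really is an issue here, and txt2mat is some 25 times slower than textscan.

I see. Would it help to use the regexp conversion discussed here?

Hm, it is still a lot slower. Compared with a textscan import of dot-separated numbers, I get a factor of 20. I used this call:

txt2mat(filename, 'InfoLevel', 0, 'ReplaceChar', {',.'}, 'NumHeaderLines', 1, 'ConvString', repmat('%f',1,5), 'NumColumns', 5);

For smaller files the factor increases because txt2mat has a larger overhead. But even for 0.5 MB files I get a value of less than 2. FYI.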

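The same approach without the header-line removal:

fid = fopen('FILENAME');                  % Open file
indat = fread(fid, '*char').';            % Read the entire file as a row vector of characters
fclose(fid);                              % Close file
indat = strrep(indat, ',', '.');          % Replace commas with periods
data = textscan(indat, '%f %f');          % Convert the text to numerical data
colA = data{1};  colB = data{2};          % Unpack the two columns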
