Matlab: importing a mixed CSV with quotes around the text


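I am importing a comma-delimited CSV file into MATLAB. Each column contains text that I want to keep, followed by a comma. I use the read_mixed_csv function from the answer to a related question to read the data in as a cell array.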

However, some of my columns look like this:

"FAIRHOPE, Alabama"
"FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA"
"Daphne-Fairhope-Foley, AL"
MATLAB puts everything after a comma into a new column. So

"Daphne-Fairhope-Foley, AL"
becomes two columns

"Daphne-Fairhope-Foley
AL"

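How can I get MATLAB to read a mixed CSV file so that it treats commas as delimiters but also respects the quotes? Is there a more automated way than textscan? If textscan is an option, what would that look like?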

Here is a sample of the data I am trying to read in, including the header:

"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Date Local","Units of Measure","Event Type","Observation Count","Observation Percent","Arithmetic Mean","1st Max Value","1st Max Hour","AQI","Method Name","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-01","Micrograms/cubic meter (LC)","None",1,100.0,7.3,7.3,0,30,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-04","Micrograms/cubic meter (LC)","None",1,100.0,7.6,7.6,0,32,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-07","Micrograms/cubic meter (LC)","None",1,100.0,8.6,8.6,0,36,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"
"01","003","0010","88101",1,30.498001,-87.881412,"NAD83","PM2.5 - Local Conditions","24 HOUR","PM25 24-hour 2006","2013-01-10","Micrograms/cubic meter (LC)","None",1,100.0,7,7,0,29,"R & P Model 2025 PM2.5 Sequential w/WINS - GRAVIMETRIC","FAIRHOPE, Alabama","FAIRHOPE HIGH SCHOOL, FAIRHOPE,  ALABAMA","Alabama","Baldwin","Fairhope","Daphne-Fairhope-Foley, AL","2014-02-11"

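On the textscan option raised above: a minimal sketch, assuming the `%q` conversion specifier (which reads a double-quoted field as one string, embedded delimiters included) behaves this way in your MATLAB release. The three-column text is a shortened, made-up stand-in for the 28-column file, and the variable names are only illustrative.

```matlab
% Shortened stand-in for two rows of the file (3 of the 28 columns)
txt = sprintf('%s\n%s', ...
    '"01",30.498001,"Daphne-Fairhope-Foley, AL"', ...
    '"01",30.498001,"FAIRHOPE, Alabama"');

% %q reads a quoted field as one string; %f reads an unquoted number
C = textscan(txt, '%q %f %q', 'Delimiter', ',');

codes = C{1};   % cell array of strings, quotes stripped
lat   = C{2};   % numeric column vector
cbsa  = C{3};   % cell array of strings, embedded commas kept
```

For a real file the first argument would be a file identifier from fopen, with 'HeaderLines', 1 to skip the header row and one specifier per column in the format string.
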
*Note: converting the CSV file to a tab-delimited file makes it much easier for MATLAB to handle and avoids this problem.

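Use the File Exchange code replaceinfile to replace the commas inside quoted strings with periods.
Use read_mixed_csv to read the file.
Remove the extra quotes that remain around the strings.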

replaceinfile(', ', '. ', fname); % Replace the comma-space pairs inside quoted fields with periods so they do not show up as new columns
thisdata = read_mixed_csv(fname, ','); % Read the CSV file into a cell array (use '\t' for a tab-delimited file)
thisdata = regexprep(thisdata, '^"|"$',''); % Remove the leading and trailing quote from each cell
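To make the quote-stripping step concrete, here is a small sketch on a made-up row. It assumes read_mixed_csv returns every cell as a string, as the call above suggests; the `row` contents are hypothetical.

```matlab
% Hypothetical row after the comma-to-period replacement:
% every cell is a string and some are still wrapped in quotes
row = {'"01"', '30.498001', '"FAIRHOPE. Alabama"'};

% '^"' matches a quote at the start, '"$' one at the end; both are removed
clean = regexprep(row, '^"|"$', '');

% Numeric columns are still strings at this point and need converting
lat = str2double(clean{2});
```

Note that this route changes the data (commas inside fields become periods) and relies on the real delimiters never being followed by a space.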
For the replaceinfile.m function: to run the code on Linux, change the first line of the Perl section to

perlCmd = sprintf('"%s"', '/usr/bin/perl');
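Working with a text qualifier (such as ") is a bit tricky, but if you make sure that every row in the table has the same number of columns (possibly with no empty columns), the following may work.

Anything outside the text qualifiers must be convertible to a number.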

function C = csvmixed(eachLine,delim,textQualifier)
% Outputs cell containing mixed string and numeric data given a delimiter (',') 
% and a text qualifier ('"').  Each line of the delimited file must be loaded into 
% the cell array eachLine, and each line must have the same number of columns.
% 
% Example:
%   fid = fopen('testcsv.txt','r');
%   eachLine = textscan(fid,'%s','Delimiter','\n'); fclose(fid);
%   C = csvmixed(eachLine{1},',','"')

assert(ischar(delim) && numel(delim)==1);
assert(ischar(textQualifier) && numel(textQualifier)==1);

% find strings, as specified by the input qualifier
patternStr = sprintf('"([^"]*)"%c?',delim);
patternStr = strrep(patternStr,'"',textQualifier);
Cstr = regexp(eachLine,patternStr,'tokens');

% find numeric data
patternNum = sprintf('(?<=(,|^))[^%c,a-zA-Z]*(?=(,|$))',textQualifier);
patternNum = strrep(patternNum,',',delim);
Cnum = regexp(eachLine,patternNum,'match','emptymatch');

numCols = cellfun(@numel,Cstr) + cellfun(@numel,Cnum);
assert(nnz(diff(numCols))==0,'Number of columns not consistent.')

% get string extents (begin, end) indexes for each string
strExtents = regexp(eachLine,patternStr,'tokenExtents');

% deal out parsed data for each line
C = cell(numel(eachLine),numCols(1));
for ii = 1:numel(eachLine)
    strBounds = vertcat(strExtents{ii}{:});
    delimLocs = getDelimLocs(eachLine{ii},strBounds,delim);
    strCellMap = getCellMap(strBounds,delimLocs);

    C(ii,strCellMap) = [Cstr{ii}{:}]; % TODO: preallocate
    C(ii,~strCellMap) = num2cell(str2double(Cnum{ii})); % all else must be numeric
end

end

function delimLocs = getDelimLocs(lineText,solidBounds,delim)
    delimCharLocs = strfind(lineText,delim);
    delimLocs = delimCharLocs(~any(bsxfun(@ge,delimCharLocs,solidBounds(:,1)) & ...
        bsxfun(@le,delimCharLocs,solidBounds(:,2)),1));
end

function cellMap = getCellMap(typeBounds,delimLocs)
    cellMap = any(bsxfun(@gt,typeBounds(:,1),[0 delimLocs]) & ...
        bsxfun(@lt,typeBounds(:,1),[delimLocs Inf]), 1);
end
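The two regular expressions in csvmixed can be tried on a single line to see what each one picks up. This is only an illustrative sketch: the row is made up, and the patterns are written out as they would be built for delim ',' and textQualifier '"'.

```matlab
rowText = '"01",1,30.498001,"FAIRHOPE, Alabama","Baldwin"';  % made-up short row

% String pattern: a quoted run with an optional trailing delimiter
patternStr = '"([^"]*)",?';
tok  = regexp(rowText, patternStr, 'tokens');
strs = [tok{:}];            % flatten the nested token cells

% Numeric pattern: a run without quotes, delimiters or letters,
% bounded by delimiters or by the start/end of the line
patternNum = '(?<=(,|^))[^",a-zA-Z]*(?=(,|$))';
nums = regexp(rowText, patternNum, 'match', 'emptymatch');
vals = str2double(nums);
```

The comma inside "FAIRHOPE, Alabama" is not picked up as a column boundary because the text after it contains letters and is followed by a quote, so the numeric pattern cannot match there.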
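This worked for several of my small files. But then it started giving me an error: "Error in csvmixed (line 34) C(l,~strCellMap) = num2cell(str2double(Cnum{l})); % all else must be numeric." The files are formatted essentially the same way. Do you know what the error refers to?

@shizishan Your file had not finished uploading according to the link, but I can tell you it has to do with this: "Anything outside the text qualifiers must be convertible to a number." Look for non-numeric data that is not in quotes. Separate logic could be added to handle non-numeric data, but it would get more complicated and slower.

I have been trying to see whether that is the problem, but no part of the file looks any different. I realized that for the year 2000 it stops after all the 01 097 2005 ones (the first three columns tell me which site the data came from). I will keep trying, but if you get a chance, I have put the 2000 file up. It should work this time.

@shizishan There was a small bug. The code did not handle strings that end with a comma (for example "501 W. VALLEY BLVD., BIG BEAR CITY,"). This is fixed. FYI, the output cell for that file is about 580MB.

@shizishan Not sure, I only have the 2014 version installed.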
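One way to follow the advice about looking for non-numeric data outside quotes is to flag the offending lines before calling csvmixed. This is a rough sketch, not part of the original answer: the sample lines, `isSuspect` and `badLines` are all made up for illustration, and it assumes ',' as delimiter and '"' as qualifier.

```matlab
% Made-up lines: the second has an unquoted, non-numeric field (NA)
eachLine = {'"01",1,7.3,"FAIRHOPE, Alabama"'; ...
            '"01",NA,7.3,"FAIRHOPE, Alabama"'};

% Drop the quoted fields, then split what is left on the delimiter
unquoted = regexprep(eachLine, '"[^"]*"', '');
fields   = regexp(unquoted, ',', 'split');

% A field is suspect if it is non-empty but does not convert to a number
isSuspect = @(f) any(isnan(str2double(f)) & ~cellfun(@isempty, f));
badLines  = find(cellfun(isSuspect, fields));
```

Here `badLines` holds the indices of lines worth inspecting by hand; str2double is more permissive than the letter-excluding pattern in csvmixed, so this check may miss some cases.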