How can I extract words from text in MATLAB using a condition?

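I have multiple text files and want to extract only one kind of value from them, subject to a condition.

The files look like this: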

157.76941498460488, u'id': 1056080, u'image_id': 354282, u'bbox':      [188.68243243243242, 229.17468354430378, 16.21621621621621, 9.729113924050637], u'legibility': u'illegible', u'class': u'machine printed'}, {u'language': u'na', u'area': 157.76941498460522, u'id': 1056081, u'image_id': 354282, u'bbox': [176.79054054054052, 241.06582278481014, 16.216216216216246, 9.729113924050637], u'legibility': u'illegible', u'class': u'machine printed'}, {u'language': u'na', u'area': 130.89018132056108, u'id': 1056082, u'image_id': 354282, u'bbox': [60.03378378378378, 224.8506329113924, 15.13513513513514, 8.648101265822783], u'legibility': u'illegible', u'class': u'machine printed'}, {u'language': u'english', u'area': 229.08553456429397, u'class': u'machine printed', u'utf8_string': u'7206', u'image_id': 354282, u'bbox': [447.84940154212785, 338.8799273943157, 15.489338584815993, 14.78988488177692], u'legibility': u'legible', u'id': 1232932}, {u'language': u'english', u'area': 125.41629858832702, u'class': u'machine printed', u'utf8_string': u'HSS', u'image_id': 354282, u'bbox': [465.63345695432395, 333.1362827800334, 10.039386119788142, 12.492427036063997], u'legibility': u'legible', u'id': 1232933}]  
I want to extract every bbox whose entry has a 'utf8_string' and is legible, and store the output like this:

bbox = [188.68243243243242, 229.17468354430378, 16.21621621621621, 9.729113924050637]
bbox1 = [60.03378378378378, 224.8506329113924, 15.13513513513514, 8.648101265822783]
...and so on for bbox3 and bbox4: all the bboxes whose entry has a 'utf8_string' and is legible.

My code:

i=imread('image.JPEG');

fid = fopen('text1.txt','r');
C = textscan(fid, '%s','Delimiter','');
fclose(fid);
C = C{:};

box = ~cellfun(@isempty, strfind(C,'bbox'));

output = [C{find(box)}]

I get the whole line, not just the bbox.

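Here is a code example that uses a regex to do this extraction. It may not be the fastest on earth, nor the most robust, but if your text files are small it will do the job. It should give you an idea of how to proceed in this kind of case; modify the code as needed so that it works fully with your data.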

I assume the paths to your files are stored in a cell array, filepathList:

Cdata = cell(numel(filepathList),1);
for i=1:numel(filepathList)
    fid = fopen(filepathList{i});
    tline = fgetl(fid);

    % find the bbox blocks
    C = regexp(tline,'u''bbox'': +\[([0-9\. ,]+)\]','tokens');
    C = cellfun(@(x) x{1},C,'UniformOutput',false)';

    % in each block, find the numbers
    C2 = cellfun(@(x) textscan(x,'%f','Whitespace',', '),C,'UniformOutput',false);
    Cdata{i} = cell2mat(cellfun(@(x) x', cat(1,C2{:}),'UniformOutput',false));
    fclose(fid);
end
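
The loop above collects every bbox on the line, whereas the question asks only for entries that have a 'utf8_string' and are legible. A minimal sketch of one way to add that condition is shown below. It assumes each record sits between a pair of braces and that no text value itself contains a brace; the variable names (recA, recB, records, keep, bboxes) are illustrative and not part of the original answer. The two records are copied from the sample line in the question.

```matlab
% Two records copied from the sample line in the question.
recA = ['{u''language'': u''na'', u''area'': 130.89018132056108, ' ...
        'u''id'': 1056082, u''image_id'': 354282, ' ...
        'u''bbox'': [60.03378378378378, 224.8506329113924, ' ...
        '15.13513513513514, 8.648101265822783], ' ...
        'u''legibility'': u''illegible'', u''class'': u''machine printed''}'];
recB = ['{u''language'': u''english'', u''area'': 125.41629858832702, ' ...
        'u''class'': u''machine printed'', u''utf8_string'': u''HSS'', ' ...
        'u''image_id'': 354282, u''bbox'': [465.63345695432395, ' ...
        '333.1362827800334, 10.039386119788142, 12.492427036063997], ' ...
        'u''legibility'': u''legible'', u''id'': 1232933}'];
tline = ['[' recA ', ' recB ']'];

% Split the line into one char vector per {...} record.
% Assumption: braces never appear inside a text value.
records = regexp(tline, '\{[^{}]*\}', 'match');

% Keep records that have a utf8_string key and are marked legible.
% The pattern u'legible' does not match u'illegible'.
hasText   = ~cellfun(@isempty, regexp(records, 'u''utf8_string''', 'once'));
isLegible = ~cellfun(@isempty, ...
    regexp(records, 'u''legibility'': +u''legible''', 'once'));
keep = records(hasText & isLegible);

% Pull the text between the brackets of each remaining bbox.
% Assumption: every kept record has a bbox with exactly four numbers.
tok = regexp(keep, 'u''bbox'': +\[([^\]]+)\]', 'tokens', 'once');

% Convert each bbox into one row of a numeric matrix.
bboxes = zeros(numel(tok), 4);
for k = 1:numel(tok)
    bboxes(k, :) = sscanf(tok{k}{1}, '%f,', [1 4]);
end
```

With this sample only recB passes both tests, so bboxes ends up with a single row holding its four values. A fragment at the start of a line that has no opening brace, as in the question's sample, is simply not matched and is skipped.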

Hope this helps.

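Why do you specify an empty delimiter?

Thanks for the quick reply and the easy-to-understand code; it works perfectly with the example above. But when I started running it on different text, it gave me an error:
Error using cat: Dimensions of matrices being concatenated are not consistent.

I modified the code so that it works with your second example.

It works very well too. My last question is how to generalize it: I have more than 10k files like this and am trying to run it in a loop. Thanks in advance.

You should perhaps rewrite the question so that it fits your problem better. You suggested that you have only one big file. If this answers the specific question, you can accept the answer.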
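
On the last question about running this over many files, one possible way to build filepathList is sketched below. It assumes all the text files live in one folder and end in .txt; the folder name 'annotations' is hypothetical.

```matlab
% Hypothetical folder that holds the text files; adjust as needed.
folder = 'annotations';

% List every .txt file in that folder.
listing = dir(fullfile(folder, '*.txt'));

% Build a cell array of full paths, one per file.
filepathList = cellfun(@(n) fullfile(folder, n), {listing.name}, ...
    'UniformOutput', false);
```

The resulting filepathList can then be passed to the loop from the answer, which stores one result per file in Cdata.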