How do I dynamically add a prefix to lines in a text file in MATLAB?


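I have a very large text file with the following format:

gene1 gene2

gene3

gene4

gene5

gene6

gene7 gene8

gene9

I want the file to be formatted like this:

gene1 gene2

gene1 gene3

gene1 gene4

gene1 gene5

gene1 gene6

gene7 gene8

gene7 gene9

gene1, gene2, etc. are combinations of letters without whitespace and can have different lengths. A sample file was shared with the original post.

Can anyone point me in the right direction?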

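% getting the text and the first word
text_in_file = fileread('oldfile.txt');
first_word = regexp(text_in_file, '\S*', 'match', 'once');

% generating the new string: every run of line breaks is replaced by a
% blank line followed by the first word and a space
str = regexprep(text_in_file, '[\n\r]+', ['\n\n' first_word ' ']);

% writing to the file ('%s' keeps any % or \ in the text from being
% interpreted as part of a format string)
fid = fopen('newfile.txt', 'wt');
fprintf(fid, '%s', str);
fclose(fid);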
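
To see what this one-liner does, here is a minimal sketch that runs it on a small in-memory string instead of a file. The sample text is made up for illustration, and a single '\n' is used in the replacement only to keep the output compact.

```matlab
% Made-up sample standing in for the file contents (no trailing newline).
text_in_file = sprintf('gene1\tgene2\ngene3\ngene7\tgene8\ngene9');

% First run of non-whitespace characters in the text.
first_word = regexp(text_in_file, '\S*', 'match', 'once');

% Replace each run of line breaks with a line break plus the first word.
str = regexprep(text_in_file, '[\n\r]+', ['\n' first_word ' ']);

disp(str)
```

In this sample every following line receives gene1, including the gene7 row and gene9, so this approach only fits a file in which a single gene owns all the rows.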

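Here is a modified version that handles the case of multiple rows with 2 genes. It resets the count and starts inserting the new gene name in front of the single-gene rows. Is this what you want?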

% getting the text
text_in_file = fileread('oldfile.txt');
% splitting into rows (accepting both \n and \r\n line endings)
rows = regexp(text_in_file, '\r?\n', 'split');
% number of tab characters in each row (used as a proxy for the gene count)
A = cellfun(@(x) numel(regexp(x, '\t')), rows);
% row indices with two genes (rows containing two tabs)
two_word_rows = find(A==2);
% first genes
first_words = cellfun(@(x) regexp(x, '\S+', 'match', 'once'), rows(two_word_rows), 'UniformOutput', false);

% modifying the rows
for i = setdiff(1:numel(rows), two_word_rows) % exclude the two gene rows
    last_idx = find(two_word_rows<i, 1, 'last'); % which word to add?
    rows{i} = sprintf('%s\t%s', char(first_words(last_idx)), rows{i});
end

% writing to the file, one row per line
fid = fopen('newfile.txt', 'wt');
fprintf(fid, '%s\n', rows{:});
fclose(fid);

This code imports all 32491 gene names and then writes them to a new file:
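oldfile = 'file.txt';
newfile = 'file2.txt';
fclose all;

% read every gene name in the file into one cell array
fid = fopen(oldfile, 'r');
genes = {};
l = fgetl(fid);
while ischar(l)                       % fgetl returns -1 at end of file
    l = regexp(l, '\W', 'split');     % split the line on non-word characters
    l = l(~cellfun(@isempty, l));     % drop empty pieces
    if ~isempty(l)
        genes(end+1:end+numel(l)) = l;
    end
    l = fgetl(fid);
end
fclose(fid);

% write every remaining gene prefixed with the first gene
fid = fopen(newfile, 'wt');
for ct = 2:numel(genes)
    fprintf(fid, '%s %s\n', genes{1}, genes{ct});
end
fclose(fid);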

Output:
TGM1 HIST1H4C
TGM1 HIST1H4B
TGM1 HIST1H4A
TGM1 TGM3
TGM1 HIST1H4G
TGM1 HIST1H4F
TGM1 HIST1H4E
TGM1 HIST1H4D
TGM1 HIST1H4K
TGM1 HIST1H4J
(etc.)
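
If the prefix should change whenever a row with two genes appears, the same fgetl loop can write each line as soon as it is read, which also avoids holding the whole file in memory. The following is only a sketch under assumptions: genes are separated by whitespace, a row holds at most two genes, and the temporary file names and the arrangement of the sample rows are made up for illustration.

```matlab
% Made-up sample written to a temporary file in place of the real data.
oldfile = [tempname '.txt'];
newfile = [tempname '.txt'];
fid = fopen(oldfile, 'wt');
fprintf(fid, 'TGM1\tHIST1H4C\nHIST1H4B\nTGM3\tHIST1H4G\nHIST1H4F\n');
fclose(fid);

fin  = fopen(oldfile, 'r');
fout = fopen(newfile, 'wt');
prefix = '';                               % first gene of the latest two-gene row
l = fgetl(fin);
while ischar(l)                            % fgetl returns -1 at end of file
    genes = regexp(l, '\S+', 'match');     % whitespace-separated names on this row
    if numel(genes) >= 2
        prefix = genes{1};                 % a new gene starts here
        fprintf(fout, '%s %s\n', genes{1}, genes{2});
    elseif numel(genes) == 1
        fprintf(fout, '%s %s\n', prefix, genes{1});
    end                                    % blank rows are skipped
    l = fgetl(fin);
end
fclose(fin);
fclose(fout);

result = fileread(newfile);
disp(result)
delete(oldfile);
delete(newfile);
```

Because blank rows are skipped inside the loop, a trailing empty line in the input does not produce a dangling prefix in the output.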

I tried your code! But as I said, the text is very large, and there are many cases where a row has two columns and the following rows have only one column. I shared the file, please check it.

Bro, my file has a gene and its neighbors. One gene has many neighbors, and because there are too many genes for one row, a row with two genes means that gene1's first neighbor is gene2, and the other neighbors are listed on the following rows.

The second code gives an error: Undefined function 'strsplit' for input arguments of type 'char'. Error in neighbornetwork (line 13) rows = strsplit(text_in_file, '\n');

What is your MATLAB version? If I download your file from Google Drive and run the code, it works fine.

I have 2015a.

I have avoided strsplit in the edited version, you can try it now.

Vahe Tshitoyan's approach is more elegant. Understand what he did in regexprep and improve it so that it works on your file. \s+ matches whitespace, so if you change [\n\r]+ to [\n\r\s]+ it might already work.

I think the OP needs to clarify what he wants to achieve when there are multiple rows with 2 genes…

@Vahe Tshitoyan Multiple rows with 2 genes mean that a gene has a first neighbor and then other ones. gene6 followed by gene7 gene8 means that gene8 is a neighbor of gene7, and a list of gene7's other neighbors follows. Every gene shows its first neighbor in the same row and the list of its next neighbors in the following rows, until another row starts that again shows a new gene with its first neighbor in the same row, and so on.

@F.caren I think this is because there is an empty line at the end of the file. You can delete the last row before writing the file using rows(A==0)=[];

Please be clearer about what you want to achieve when you have multiple rows with 2 genes. Do you keep the gene of the first row
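
The rows(A==0)=[] suggestion relies on A counting tab characters, so it would also drop any non-empty row that happens to contain no tab. A layout-independent sketch, assuming rows is the cell array produced by the split (the sample text here is made up), removes only the rows that contain nothing but whitespace:

```matlab
% Made-up sample ending in a newline, so the split yields a final empty row.
text_in_file = sprintf('gene1\tgene2\ngene3\n');
rows = regexp(text_in_file, '\n', 'split');

% Keep only rows that still contain something after trimming whitespace.
is_blank = cellfun(@(x) isempty(strtrim(x)), rows);
rows(is_blank) = [];

disp(numel(rows))
```

Since strtrim also strips carriage returns, a row that holds only a leftover '\r' from Windows line endings is removed as well.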