How can I split FASTA strings into multiple rows in R while keeping the number of columns unchanged?

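I am trying to read a FASTA file and display each sequence as individual amino acids in a data frame, with one sequence per column (1 seq = 1 column).

This is what I have so far.

FASTA_test.txt contains: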

>sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2
MATANSIIVLDDDDEDEAAAQPGPSHPLPNAASPGAEAPSSSEPHGARGSSSSGGKKCYK
LENEKLFEEFLELCKMQTADHPEVVPFLYNRQQRAHSLFLASAEFCNILSRVLSRARSRP
AKLYVYINELCTVLKAHSAKKKLNLAPAATTSNEPSGNNPPTHLSLDPTNAENTASQSPR
TRGSRRQIQRLEQLLALYVAEIRRLQEKELDLSELDDPDSAYLQEARLKRKLIRLFGRLC
ELKDCSSLTGRVIEQRIPYRGTRYPEVNRRIERLINKPGPDTFPDYGDVLRAVEKAAARH
SLGLPRQQLQLMAQDAFRDVGIRLQERRHLDLIYNFGCHLTDDYRPGVDPALSDPVLARR
LRENRSLAMSRLDEVISKYAMLQDKSEEGERKKRRARLQGTSSHSADTPEASLDSGEGPS
GMASQGCPSASRAETDDEDDEESDEEEEEEEEEEEEEATDSEEEEDLEQMQEGQEDDEEE
DEEEEAAAGKDGDKSPMSSLQISNEKNLEPGKQISRSSGEQQNKGRIVSPSLLSEEPLAP
SSIDAESNGEQPEELTLEEESPVSQLFELEIEALPLDTPSSVETDISSSRKQSEEPFTTV
LENGAGMVSSTSFNGGVSPHNWGDSGPPCKKSRKEKKQTGSGPLGNSYVERQRSVHEKNG
KKICTLPSPPSPLASLAPVADSSTRVDSPSHGLVTSSLCIPSPARLSQTPHSQPPRPGTC
KTSVATQCDPEEIIVLSDSD
>sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3
MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPTERAPASEEEFQFLRCQQC
QAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGADTPALDNVFFESLQRRLSVYR
QIVDAQAVCTRCKESADFWCFECEQLLCAKCFEAHQWFLKHEARPLAELRNQSVREFLDG
TRKTNNIFCSNPNHRTPTLTSIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEEL
DAMTQALQEQDSAFGAVHAQMHAAVGQLGRARAETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHGFLRQALCRLR
QEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKDAAVSKKASPEAASTPRDPIDVDLPE
EAERVKAQVQALGLAEAQPMAVVQSVPGAHPVPVYAFSIKGPSYGEDVSNTTTAQKRKCS
QTQCPRKVIKMESEEGKEARLARSSPEQPRPSTSKAVSPPHLDGPPSPRSPVIGSEVFLP
NSNHVASGAGEAEERVVVISSSEDSDAENSSSRELDDSSSESSDLQLEGPSTLRVLDENL
ADPQAEDRPLVFFDLKIDNETQKISQLAAVNRESKFRVVIQPEAFFSIYSKAVSLEVGLQ
HFLSFLSSMRRPILACYKLWGPGLPNFFRALEDINRLWEFQEAISGFLAALPLIRERVPG
ASSFKLKNLAQTYLARNMSERSAMAAVLAMRDLCRLLEVSPGPQLAQHVYPFSSLQCFAS
LQPLVQAAVLPRAEARLLALHNVSFMELLSAHRRDRQGGLKKYSRYLSLQTTTLPPAQPA
FNLQALGTYFEGLLEGPALARAEGVSTPLAGRGLAERASQQS
My code is:

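library("Biostrings")
# Read the protein sequences as an AAStringSet
fastaFile <- readAAStringSet("~/Desktop/FASTA_test.txt")
# FASTA header lines become the sequence names
seq_name = names(fastaFile)
# Convert each sequence to a plain character string
sequence = paste(fastaFile)
df <- data.frame(seq_name, sequence)
View(df)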

#separate the aa into separate columns
df_splited_1 <- as.data.frame(do.call(cbind, apply(df, 1, function(x) {
  do.call(expand.grid, strsplit(df$sequence, ""))
})))

View(df_splited_1)
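Are you trying to split the sequences into one character per column? For sequence 1 you would have 740 columns:
nchar(df$sequence[1]) # [1] 740
If so, the second sequence is longer:
nchar(df$sequence[2]) # [1] 882
How do you want to row-bind them? Should the shorter one be padded with NAs?

I want to split the sequences into one character per row, but keep sequence 1 in column 1 and sequence 2 in column 2 rather than everything in a single column.

OK, same problem: the lengths differ, so how do you want to column-bind them? Otherwise, is it as simple as
strsplit(df$sequence, "")
?

@RonakShah dput does not work in this case; it only gives a pointer to data on the OP's machine. It is better to read the data from the text file.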
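
The length mismatch raised in the comments can be handled by padding the shorter sequence with NA before binding the columns. The following is only a minimal sketch in base R, not a confirmed answer from the thread. The two short strings and the names seq1, seq2, residues, padded and df_split are hypothetical stand-ins for the real values in df$sequence.

```r
# Hypothetical short stand-ins for the two sequences held in df$sequence
sequence <- c(seq1 = "MATANS", seq2 = "MEPAPARSP")

# Split every string into single characters (one list element per sequence)
residues <- strsplit(sequence, "")

# Pad each character vector with NA up to the length of the longest one
max_len <- max(lengths(residues))
padded <- lapply(residues, function(x) {
  length(x) <- max_len  # extending a vector fills the new slots with NA
  x
})

# One column per sequence, one residue per row
df_split <- as.data.frame(padded, stringsAsFactors = FALSE)
print(df_split)
```

With the real data, sequence could be replaced by setNames(df$sequence, df$seq_name). The shorter sequence would then end in NA rows, which is the trade-off the comments point out.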
dput(fastaFile)
new("AAStringSet", pool = new("SharedRaw_Pool", xp_list = list(
    <pointer: 0x0>), .link_to_cached_object_list = list(<environment>)), 
    ranges = new("GroupedIRanges", group = c(1L, 1L), start = c(1L, 
    741L), width = c(740L, 882L), NAMES = c("sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2", 
    "sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3"
    ), elementType = "ANY", elementMetadata = NULL, metadata = list()), 
    elementType = "AAString", elementMetadata = NULL, metadata = list())