How to split a FASTA string into multiple rows while keeping the number of columns unchanged in R?
I am trying to read a FASTA file and display the sequences as individual amino acids in a data frame, with one sequence per column. This is what I have so far. FASTA_test.txt contains:
>sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2
MATANSIIVLDDDDEDEAAAQPGPSHPLPNAASPGAEAPSSSEPHGARGSSSSGGKKCYK
LENEKLFEEFLELCKMQTADHPEVVPFLYNRQQRAHSLFLASAEFCNILSRVLSRARSRP
AKLYVYINELCTVLKAHSAKKKLNLAPAATTSNEPSGNNPPTHLSLDPTNAENTASQSPR
TRGSRRQIQRLEQLLALYVAEIRRLQEKELDLSELDDPDSAYLQEARLKRKLIRLFGRLC
ELKDCSSLTGRVIEQRIPYRGTRYPEVNRRIERLINKPGPDTFPDYGDVLRAVEKAAARH
SLGLPRQQLQLMAQDAFRDVGIRLQERRHLDLIYNFGCHLTDDYRPGVDPALSDPVLARR
LRENRSLAMSRLDEVISKYAMLQDKSEEGERKKRRARLQGTSSHSADTPEASLDSGEGPS
GMASQGCPSASRAETDDEDDEESDEEEEEEEEEEEEEATDSEEEEDLEQMQEGQEDDEEE
DEEEEAAAGKDGDKSPMSSLQISNEKNLEPGKQISRSSGEQQNKGRIVSPSLLSEEPLAP
SSIDAESNGEQPEELTLEEESPVSQLFELEIEALPLDTPSSVETDISSSRKQSEEPFTTV
LENGAGMVSSTSFNGGVSPHNWGDSGPPCKKSRKEKKQTGSGPLGNSYVERQRSVHEKNG
KKICTLPSPPSPLASLAPVADSSTRVDSPSHGLVTSSLCIPSPARLSQTPHSQPPRPGTC
KTSVATQCDPEEIIVLSDSD
>sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3
MEPAPARSPRPQQDPARPQEPTMPPPETPSEGRQPSPSPSPTERAPASEEEFQFLRCQQC
QAEAKCPKLLPCLHTLCSGCLEASGMQCPICQAPWPLGADTPALDNVFFESLQRRLSVYR
QIVDAQAVCTRCKESADFWCFECEQLLCAKCFEAHQWFLKHEARPLAELRNQSVREFLDG
TRKTNNIFCSNPNHRTPTLTSIYCRGCSKPLCCSCALLDSSHSELKCDISAEIQQRQEEL
DAMTQALQEQDSAFGAVHAQMHAAVGQLGRARAETEELIRERVRQVVAHVRAQERELLEA
VDARYQRDYEEMASRLGRLDAVLQRIRTGSALVQRMKCYASDQEVLDMHGFLRQALCRLR
QEEPQSLQAAVRTDGFDEFKVRLQDLSSCITQGKDAAVSKKASPEAASTPRDPIDVDLPE
EAERVKAQVQALGLAEAQPMAVVQSVPGAHPVPVYAFSIKGPSYGEDVSNTTTAQKRKCS
QTQCPRKVIKMESEEGKEARLARSSPEQPRPSTSKAVSPPHLDGPPSPRSPVIGSEVFLP
NSNHVASGAGEAEERVVVISSSEDSDAENSSSRELDDSSSESSDLQLEGPSTLRVLDENL
ADPQAEDRPLVFFDLKIDNETQKISQLAAVNRESKFRVVIQPEAFFSIYSKAVSLEVGLQ
HFLSFLSSMRRPILACYKLWGPGLPNFFRALEDINRLWEFQEAISGFLAALPLIRERVPG
ASSFKLKNLAQTYLARNMSERSAMAAVLAMRDLCRLLEVSPGPQLAQHVYPFSSLQCFAS
LQPLVQAAVLPRAEARLLALHNVSFMELLSAHRRDRQGGLKKYSRYLSLQTTTLPPAQPA
FNLQALGTYFEGLLEGPALARAEGVSTPLAGRGLAERASQQS
My code is:
library("Biostrings")
fastaFile <- readAAStringSet("~/Desktop/FASTA_test.txt")
seq_name = names(fastaFile)
sequence = paste(fastaFile)
df <- data.frame(seq_name, sequence)
View(df)
#separate the aa into separate columns
df_splited_1 <- as.data.frame(do.call(cbind, apply(df, 1, function(x) {
do.call(expand.grid, strsplit(df$sequence, ""))
})))
View(df_splited_1)
Are you trying to split the sequences into one character per column? For sequence 1 you would have 740 columns: nchar(df$sequence[1]) #[1] 740. If so, the second sequence is longer: nchar(df$sequence[2]) #[1] 882. How do you want to row-bind them? Should the shorter one get NAs?
I want to split the sequences into one character per row. But I want to keep sequence 1 in column 1 and sequence 2 in column 2, rather than everything in one column.
OK, same problem: the lengths differ, so how do you want to column-bind them? Otherwise it is as simple as strsplit(df$sequence, "").
@RonakShah dput does not work in this case; it is just a pointer to data on the OP's machine. It is better to read fastaFile directly from the text file.
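The comments above raise the issue that the two sequences differ in length (740 vs. 882), so they can only sit side by side if the shorter one is padded. A minimal base-R sketch of that idea follows, assuming NA padding is acceptable; the short toy sequences and the names `sequences`, `residues`, `padded`, and `df_split` are illustrative and not from the original post.

```r
# Toy stand-ins for the two protein sequences (assumed data, much shorter
# than the real 740- and 882-residue sequences).
sequences <- c(DAXX_HUMAN = "MATANS", PML_HUMAN = "MEPAPARSP")

# Split each sequence into single characters: one character vector per sequence.
residues <- strsplit(sequences, "")

# Pad every vector with NA up to the length of the longest sequence
# so that all columns have the same number of rows.
max_len <- max(lengths(residues))
padded <- lapply(residues, function(x) {
  c(x, rep(NA_character_, max_len - length(x)))
})
names(padded) <- names(sequences)

# One column per sequence, one row per residue position.
df_split <- as.data.frame(padded, stringsAsFactors = FALSE)
print(df_split)
```

With the objects from the question, `sequences` could presumably be built as `setNames(df$sequence, df$seq_name)`; note that `as.data.frame` would then convert the long FASTA headers into syntactic column names.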
dput(fastaFile)
new("AAStringSet", pool = new("SharedRaw_Pool", xp_list = list(
<pointer: 0x0>), .link_to_cached_object_list = list(<environment>)),
ranges = new("GroupedIRanges", group = c(1L, 1L), start = c(1L,
741L), width = c(740L, 882L), NAMES = c("sp|Q9UER7|DAXX_HUMAN Death domain-associated protein 6 OS=Homo sapiens OX=9606 GN=DAXX PE=1 SV=2",
"sp|P29590|PML_HUMAN Protein PML OS=Homo sapiens OX=9606 GN=PML PE=1 SV=3"
), elementType = "ANY", elementMetadata = NULL, metadata = list()),
elementType = "AAString", elementMetadata = NULL, metadata = list())
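As the dput output above shows, the AAStringSet object only holds a pointer, so it cannot be shared this way. One possible workaround, sketched here under the assumption that a plain named character vector is enough, is to parse the FASTA text with base R. The helper `read_fasta_simple` is hypothetical (it is not part of Biostrings), and the records below are shortened example data.

```r
# Hypothetical helper (not part of Biostrings): parse FASTA-formatted lines
# into a named character vector with one element per record.
read_fasta_simple <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]              # drop blank lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)             # record index for every line
  headers <- sub("^>", "", lines[is_header])
  # Group the sequence lines by record and join each group into one string.
  groups <- split(
    lines[!is_header],
    factor(record_id[!is_header], levels = seq_along(headers))
  )
  seqs <- vapply(groups, paste, character(1), collapse = "")
  setNames(unname(seqs), headers)
}

# Shortened example records (assumed data) supplied inline.
fasta_text <- c(
  ">seq1 example protein",
  "MATANS",
  "IIVLDD",
  ">seq2 another protein",
  "MEPAPARSP"
)
seqs <- read_fasta_simple(fasta_text)
print(nchar(seqs))
```

For a file on disk, the same helper could be called as `read_fasta_simple(readLines("~/Desktop/FASTA_test.txt"))`. Because the result is an ordinary character vector, `dput(seqs)` produces output that others can paste and run.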