Error in R when extracting multiple sequences from a .fasta file using a GenomicRanges object as reference
I have a fasta file corresponding to my reference genome and a vcf file corresponding to the SNP calls for my data. I want to get the sequence at each SNP from my fasta. To do this, I load the vcf file in R and extract a GenomicRanges object from it with the following commands:
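library(VariantAnnotation)  # provides readVcf()
library(Rsamtools)          # provides indexFa() and FaFile()

vcf.fn <- "SNPsAcrossAlltheIndividuals.vcf"
vcf <- readVcf(vcf.fn, verbose=FALSE)
SNPrange <- rowRanges(vcf)  # accessor equivalent of vcf@rowRanges

file_path <- "F1.fasta"
indexFa(file_path)          # writes the index file F1.fasta.fai
fa <- FaFile(file_path)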
Finally, I run getSeq to extract the sequences for SNPrange from my fasta file. However, I get the following error:
seq_ <-getSeq(fa, SNPrange)
Error in value[[3L]](cond) :
record 12177 (chr7:88167221-88167221) failed
file: F1.fasta
Here is my session info:
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Swedish_Sweden.1252 LC_CTYPE=Swedish_Sweden.1252 LC_MONETARY=Swedish_Sweden.1252
[4] LC_NUMERIC=C LC_TIME=Swedish_Sweden.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_1.3.1 limma_3.44.3
[3] ggplot2_3.3.2 stringr_1.4.0
[5] vcfR_1.12.0 adegenet_2.1.3
[7] ape_5.4-1 ade4_1.7-15
[9] TxDb.Mmusculus.UCSC.mm10.knownGene_3.10.0 GenomicFeatures_1.40.1
[11] AnnotationDbi_1.50.3 VariantAnnotation_1.34.0
[13] Rsamtools_2.4.0 Biostrings_2.56.0
[15] XVector_0.28.0 SummarizedExperiment_1.18.2
[17] DelayedArray_0.14.1 matrixStats_0.56.0
[19] Biobase_2.48.0 GenomicRanges_1.40.0
[21] GenomeInfoDb_1.24.2 IRanges_2.22.2
[23] S4Vectors_0.26.0 BiocGenerics_0.34.0
loaded via a namespace (and not attached):
[1] colorspace_1.4-1 seqinr_3.6-1 deldir_0.1-29 ellipsis_0.3.1
[5] class_7.3-17 rstudioapi_0.11 farver_2.0.3 bit64_4.0.5
[9] fansi_0.4.1 xml2_1.3.2 codetools_0.2-16 splines_4.0.2
[13] memuse_4.1-0 cluster_2.1.0 dbplyr_2.0.0 shiny_1.5.0
[17] compiler_4.0.2 httr_1.4.2 assertthat_0.2.1 Matrix_1.2-18
[21] fastmap_1.0.1 cli_2.1.0 later_1.1.0.1 htmltools_0.5.0
[25] prettyunits_1.1.1 tools_4.0.2 igraph_1.2.5 coda_0.19-4
[29] gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.3 reshape2_1.4.4
[33] dplyr_1.0.2 rappdirs_0.3.1 gmodels_2.18.1 Rcpp_1.0.5
[37] raster_3.3-13 vctrs_0.3.4 spdep_1.1-5 gdata_2.18.0
[41] nlme_3.1-148 rtracklayer_1.48.0 pinfsc50_1.2.0 mime_0.9
[45] lifecycle_0.2.0 gtools_3.8.2 XML_3.99-0.3 LearnBayes_2.15.1
[49] zlibbioc_1.34.0 MASS_7.3-51.6 scales_1.1.1 BSgenome_1.56.0
[53] hms_0.5.3 promises_1.1.1 expm_0.999-5 curl_4.3
[57] memoise_1.1.0 biomaRt_2.44.4 stringi_1.5.3 RSQLite_2.2.1
[61] e1071_1.7-3 permute_0.9-5 boot_1.3-25 BiocParallel_1.22.0
[65] spData_0.3.8 rlang_0.4.7 pkgconfig_2.0.3 bitops_1.0-6
[69] lattice_0.20-41 purrr_0.3.4 sf_0.9-6 labeling_0.4.2
[73] GenomicAlignments_1.24.0 bit_4.0.4 tidyselect_1.1.0 plyr_1.8.6
[77] magrittr_1.5 R6_2.5.0 generics_0.1.0 DBI_1.1.0
[81] withr_2.3.0 mgcv_1.8-31 pillar_1.4.6 units_0.6-7
[85] RCurl_1.98-1.2 sp_1.4-2 tibble_3.0.3 crayon_1.3.4
[89] KernSmooth_2.23-17 BiocFileCache_1.12.1 progress_1.2.2 grid_4.0.2
[93] blob_1.2.1 vegan_2.5-6 digest_0.6.25 classInt_0.4-3
[97] xtable_1.8-4 httpuv_1.5.4 openssl_1.4.3 munsell_0.5.0
[101] viridisLite_0.3.0 askpass_1.1
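findOverlaps will not tell you that a SNP is out of range. You need to check whether the chromosome length is smaller than the corresponding SNP position. You could try findOverlaps with type="within", but it is enough to write something that checks the chromosome length.

@StupidWolf, I am not sure I understand how your statement helps solve this problem. The command I used outputs all ranges of the second file that overlap the first object. In other words, if all the SNPs (1 position each) of object 2 overlap the SNPs of object 1, and are therefore contained within the ranges, the number of output ranges equals the number of ranges in object 2. But as I explain later, that is not the problem.

@StupidWolf The problem is that some markers cannot be processed by the getSeq function. I don't know why this happens, and I don't know what else I could check about them, because when I perform exactly the same task in bash with another tool (see further down in the question), it returns the required sequence for that marker.

The problem is not ambiguous base pairs. That is easy to prove: write a fasta file with 1 contig of 10 bp containing one ambiguous base, and you will see that it is read in correctly. That means the problem lies in the genomic ranges. Another thing you can try is to read the fasta file with Biostrings and subset it; you will see that it throws an error as well.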
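The chromosome-length check suggested in the comments can be sketched in base R. This is only a sketch under stated assumptions: `find_out_of_bounds` is a hypothetical helper, and the index contents and SNP positions below are toy values, not data from the question. It assumes the usual five-column `.fai` layout (name, length, offset, linebases, linewidth) of the index file that `indexFa` writes next to the fasta file.

```r
# Hypothetical helper: flag records whose contig is missing from the index
# or whose end position lies beyond the contig length.
find_out_of_bounds <- function(chrom, end_pos, contig_len) {
  len <- contig_len[chrom]          # NA when the contig name is not indexed
  unname(is.na(len) | end_pos > len)
}

# Toy .fai content (columns: name, length, offset, linebases, linewidth).
fai_file <- tempfile(fileext = ".fai")
writeLines(c("chr1\t1000\t6\t60\t61", "chr7\t500\t1029\t60\t61"), fai_file)
fai <- read.table(fai_file, sep = "\t", stringsAsFactors = FALSE,
                  col.names = c("name", "length", "offset",
                                "linebases", "linewidth"))
contig_len <- setNames(fai$length, fai$name)

# Toy SNP table: one in-range SNP, one on the last base of chr7,
# one just past the end of chr7, and one on a contig absent from the index.
snps <- data.frame(chrom = c("chr1", "chr7", "chr7", "chrX"),
                   pos   = c(10, 500, 501, 5),
                   stringsAsFactors = FALSE)
snps$bad <- find_out_of_bounds(snps$chrom, snps$pos, contig_len)
print(snps)
```

With the real data, `contig_len` would be read from `F1.fasta.fai`, and the chromosome names and positions would come from `as.character(seqnames(SNPrange))` and `end(SNPrange)`. If no record is flagged, out-of-range positions are unlikely to be the cause of the error.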
chr1<- gr[seqnames(gr) == "chr1" ]
chr2<- gr[seqnames(gr) == "chr2" ]
chr3<- gr[seqnames(gr) == "chr3" ]
...
seq1 <-getSeq(fa, chr1)
seq2 <-getSeq(fa, chr2)
seq3 <-getSeq(fa, chr3)
...
seq7 <-getSeq(fa, chr7)
Error in value[[3L]](cond) : record 993 (chr7:88167220-88167222) failed
file: F1.fasta
samtools faidx F1.fasta chr7:88167220-88167222
>chr7:88167220-88167222
CRA
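Splitting by chromosome narrows the failure down to chr7; the same idea can be applied per record to list every range that fails. The following is a minimal base R sketch, not the method used in the thread: `find_failing` and `toy_fetch` are hypothetical names, and `toy_fetch` only imitates a lookup that fails on one record.

```r
# Hypothetical helper: call fetch(i) for each record index and collect the
# error message of every call that fails.
find_failing <- function(n, fetch) {
  msgs <- vapply(seq_len(n), function(i) {
    tryCatch({
      fetch(i)
      NA_character_                 # success: no message to record
    }, error = function(e) conditionMessage(e))
  }, character(1))
  failed <- which(!is.na(msgs))
  data.frame(index = failed, message = msgs[failed],
             stringsAsFactors = FALSE)
}

# Stand-in for the real lookup: fails on record 3 only (toy behaviour).
toy_fetch <- function(i) {
  if (i == 3) stop("record ", i, " failed")
  "ACGT"
}

failing <- find_failing(5, toy_fetch)
print(failing)
```

With the objects from the question, `fetch` could be `function(i) getSeq(fa, chr7[i])` and `n` would be `length(chr7)`. Calling getSeq once per range is slow, so it is probably best run on the failing chromosome only.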