R: Simulating random positions from a list of intervals


I am trying to write a function in R that outputs random positions drawn from a given list of intervals.

sim_dat <- bpSim(N=10)
head(sim_dat)

My interval file (14,600 rows) is a tab-delimited bed file (chromosome, start, end, name) that looks like this:

1      4953    16204   1
1      16284   16612   1
1      16805   17086   1
1      18561   18757   1
1      18758   19040   1
1      19120   19445   1
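As a minimal sketch of loading such a file, the snippet below writes the six rows shown above to a temporary file and reads them back with base R. The column names and the temporary file are assumptions made for illustration; in practice you would point `read.delim` at your own bed file.

```r
# The six example rows from the question, written to a temporary tab-delimited file
bed_lines <- c("1\t4953\t16204\t1",
               "1\t16284\t16612\t1",
               "1\t16805\t17086\t1",
               "1\t18561\t18757\t1",
               "1\t18758\t19040\t1",
               "1\t19120\t19445\t1")
tmp <- tempfile(fileext = ".bed")
writeLines(bed_lines, tmp)

# Read the file; the column names are illustrative, since the file has no header row
bed <- read.delim(tmp, header = FALSE,
                  col.names = c("chromosome", "start", "end", "name"),
                  colClasses = c("character", "integer", "integer", "character"))
print(bed)
```

Reading the chromosome column as character keeps names such as "X" or "chr1" intact if they occur elsewhere in the file.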

My function currently generates random positions within these intervals:

sim_dat <- bpSim(N=10)
head(sim_dat)

Ultimately, I am trying to simulate random positions across the genome, so I need to repeat the simulation hundreds of times.

I would appreciate any advice on how I could:

reduce the run time
remove the dependency on GenomicRanges

Also, if anyone knows of a package that already does this, I would rather use an existing package than reinvent the wheel.

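Because the ranges differ in length, I assume you want the randomly selected positions to be proportional to the length of each segment. In other words, the selection should be uniform over the actual base pairs covered by the ranges. Otherwise, you will over-represent small ranges (higher marker density) and under-represent large ranges (lower marker density).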
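To make the weighting argument concrete, here is a small sketch that compares the per-base sampling density under length-weighted and unweighted row selection. The interval values are toy numbers, and start and end are assumed to be inclusive coordinates.

```r
# Toy intervals (inclusive coordinates assumed)
start <- c(100, 1050, 3600, 4000, 9050)
end   <- c(1000, 3000, 3700, 8000, 20000)
size  <- end - start + 1

# Probability of picking each interval when rows are weighted by length
prob_weighted <- size / sum(size)

# Chance that one specific base is drawn under each scheme
density_weighted   <- prob_weighted / size        # the same for every interval
density_unweighted <- (1 / length(size)) / size   # larger for short intervals

print(round(prob_weighted, 3))
# Ratio > 1 means bases in that interval are over-sampled without weighting
print(round(density_unweighted / density_weighted, 2))
```

With weighting, every base has the same chance of being drawn; without it, the shortest interval receives far more draws per base than the longest.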

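Here is a data.table solution. On my machine it runs almost instantly for 1,000 sites and takes about 10 seconds for 1 million random sites. It draws the number of sites you want by first sampling rows (weighted by the size of each row's range) and then sampling uniformly within the chosen range.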

library(data.table)

nSites <- 1e4

bed <- data.table(chromosome=1, start=c(100,1050,3600,4000,9050), end=c(1000,3000,3700,8000,20000))

# calculate size of range
bed[, size := 1 + end-start]

# Randomly sample bed file rows, proportional to the length of each range
simulated.sites <- bed[sample(.N, size=nSites, replace=TRUE, prob=bed$size)]

# Randomly sample uniformly within each chosen range
simulated.sites[, position := sample(start:end, size=1), by=1:dim(simulated.sites)[1]]

# Remove extra columns and format as needed
simulated.sites[, start  := position]
simulated.sites[, end := position]
simulated.sites[, c("size", "position") := NULL]

This gives output like the following:

       chromosome start   end
    1:          1 10309 10309
    2:          1  4578  4578
    3:          1  1984  1984
    4:          1 14703 14703
    5:          1 10090 10090
   ---
 9996:          1  1601  1601
 9997:          1  5317  5317
 9998:          1 18918 18918
 9999:          1  1154  1154
10000:          1  7343  7343
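A fully vectorized variant of the same two-step idea can be sketched in base R, which avoids both the per-row grouping and any package dependency. The helper name `sim_positions` and its arguments are hypothetical, and start and end are assumed to be inclusive. Drawing an offset with `runif` also sidesteps a quirk of `sample()`: when an interval has `start == end`, `sample(start:end, 1)` treats the single number as `1:start` rather than as a one-element set.

```r
# Hypothetical helper (not part of the answer above): draw n single-base positions
sim_positions <- function(bed, n) {
  size <- bed$end - bed$start + 1                  # inclusive interval lengths
  # Step 1: pick interval rows with probability proportional to their length
  idx <- sample.int(nrow(bed), size = n, replace = TRUE, prob = size)
  # Step 2: uniform integer offset in 0..(size - 1) within each chosen interval
  offset <- floor(runif(n) * size[idx])
  data.frame(chromosome = bed$chromosome[idx],
             position   = bed$start[idx] + offset)
}

bed <- data.frame(chromosome = 1,
                  start = c(100, 1050, 3600, 4000, 9050),
                  end   = c(1000, 3000, 3700, 8000, 20000))

set.seed(1)
sites <- sim_positions(bed, 1e4)
print(head(sites))

# Repeating the simulation, for example 100 replicates of 10 positions each
reps <- replicate(100, sim_positions(bed, 10), simplify = FALSE)
```

Since each call is a handful of vector operations, repeating it hundreds of times should stay cheap, though actual timings will depend on the number of intervals and sites.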

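bedtools random or shuffle is a lifesaver. Just write a simple R wrapper around bedtools and you are done.

@PoGibas - this is something I considered. However, the reason I generate positions from an interval file is that I want to rule out generating positions in unmappable regions of the genome. As far as I can see, bedtools shuffle only lets you generate random positions along the length of each chromosome (rather than within specific regions of it).

bedtools shuffle has the option -excl, where you can specify assembly gaps, or you can use -incl to specify the intervals.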

For reference, the example bed table after the size column has been added:

 chromosome start   end  size
          1   100  1000   901
          1  1050  3000  1951
          1  3600  3700   101
          1  4000  8000  4001
          1  9050 20000 10951
