Python 3.x: Simulating 10,000 coin flips in Python is very slow
I am writing a simulation that creates 10,000 periods of 25 sets, where each set consists of 48 coin flips. Something in this code makes it run very slowly. It has been running for at least 20 minutes and is still going. A similar simulation in R runs in under 10 seconds. Here is the Python code I am using:
import pandas as pd
from random import choices
threshold=17
all_periods = pd.DataFrame()
for i in range(10000):
    simulated_period = pd.DataFrame()
    for j in range(25):
        #Data frame with 48 weeks as rows. Each run through loop adds one more year as column until there are 25
        simulated_period = pd.concat([simulated_period, pd.DataFrame(choices([1, -1], k=48))],
                                     ignore_index=True, axis=1)
        positives = simulated_period[simulated_period==1].count(axis=1)
        negatives = simulated_period[simulated_period==-1].count(axis=1)
        #Combine positives and negatives that are more than the threshold into single dataframe
        sig = pd.DataFrame([[sum(positives>=threshold), sum(negatives>=threshold)]], columns=['positive', 'negative'])
        sig['total'] = sig['positive'] + sig['negative']
    #Add summary of individual simulation to the others
    all_periods = pd.concat([all_periods, sig])
In case it helps, here is the R script that runs quickly:
flip <- function(threshold=17){
    #threshold is min number of persistent results we want to see. For example, 17/25 positive or 17/25 negative
    outcomes <- c(1, -1)
    trial <- do.call(cbind, lapply(1:25, function (i) sample(outcomes, 48, replace=T)))
    trial <- as.data.frame(t(trial)) #48 weeks in columns, 25 years in rows.
    summary <- sapply(trial, function(x) c(pos=length(x[x==1]), neg=length(x[x==-1])))
    summary <- as.data.frame(t(summary)) #use data frame so $pos/$neg can be used instead of [1,]/[2,]
    sig.pos <- length(summary$pos[summary$pos>=threshold])
    sig.neg <- length(summary$neg[summary$neg>=threshold])
    significant <- c(pos=sig.pos, neg=sig.neg, total=sig.pos+sig.neg)
    return(significant)
}
results <- do.call(rbind, lapply(1:10000, function(i) flip(threshold)))
results <- as.data.frame(results)
Why not generate the whole large set in one go:
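import numpy as np
import pandas as pd
# One row per (period, set) pair, one column per flip (week)
idx = pd.MultiIndex.from_product((range(10000), range(25)),
                                 names=('period', 'set'))
df = pd.DataFrame(data=np.random.choice([1,-1], (10000*25, 48)), index=idx)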
This takes about 120 ms on my machine. Then the remaining operations:
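# Per period, count the +1 (or -1) flips in each week across the 25 sets,
# then count how many weeks reach the threshold of 17 (>=, as in the question).
positives = df.eq(1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='positives')
negatives = df.eq(-1).groupby(level=0).sum().ge(17).sum(axis=1).to_frame(name='negatives')
all_periods = pd.concat((positives, negatives), axis=1)
all_periods['total'] = all_periods.sum(axis=1)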
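This takes roughly another 600 ms.

Should the lines between positives = and sig['total'] = really be inside the for j in range(25) loop?

The main slowdown is almost certainly the concats in a loop: simulated_period = pd.concat([simulated_period.. makes unnecessary copies and is O(N^2). Typically you would append to a list inside the loop and concat once at the end.

Thank you, Quang, this is much better than my solution. I believe there is an error in your code; the last line should be all_periods['total'] = all_periods.sum(1), with new_df replaced by all_periods. Thanks again.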
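
The advice in the comments, collecting results in a list inside the loop and building the table once at the end, might look roughly like the sketch below. It is not the asker's or the answerer's code: the function name simulate and its parameters n_periods, n_sets, n_flips and seed are made up for illustration, and the per-week counting is done in plain Python so the example runs without any third-party package.

```python
import random

def simulate(n_periods=10_000, n_sets=25, n_flips=48, threshold=17, seed=None):
    """Return one summary dict per period (hypothetical helper, not from the thread)."""
    rng = random.Random(seed)
    rows = []  # cheap list appends instead of repeated pd.concat calls
    for _ in range(n_periods):
        # n_sets lists, each holding n_flips coin flips (+1 or -1)
        flips = [rng.choices([1, -1], k=n_flips) for _ in range(n_sets)]
        # zip(*flips) regroups the flips by position (week) across the sets
        pos_counts = [sum(v == 1 for v in week) for week in zip(*flips)]
        positive = sum(c >= threshold for c in pos_counts)
        # every flip is +1 or -1, so the -1 count is n_sets minus the +1 count
        negative = sum(n_sets - c >= threshold for c in pos_counts)
        rows.append({"positive": positive,
                     "negative": negative,
                     "total": positive + negative})
    return rows

rows = simulate(n_periods=200, seed=0)
print(len(rows), rows[0])
```

Passing the finished list to pd.DataFrame(rows) once, after the loop, gives the same three columns as the question's all_periods without copying the growing frame on every iteration.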