Python: Converting pandas code to cuDF for GPU use


python python-3.x pandas numpy


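I am building pairs of images by combining positive and negative samples. The process is computationally very expensive and needs a lot of RAM and CPU time. To speed it up, I would like to use the GPU and convert my pandas code to cuDF. The cuDF documentation is currently quite limited, and I would like to convert the code below to cuDF: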

import itertools
from concurrent.futures import ProcessPoolExecutor
from multiprocessing import cpu_count
from pathlib import Path

import more_itertools
import pandas as pd
from tqdm import tqdm

# `identities` is assumed to be defined earlier: a dict that maps each
# identity to the list of image files belonging to it.

######################====================Functions=============##############

def compute_cross_samples(x):
    # x is a pair of file lists; return every (file_x, file_y) cross pair
    return pd.DataFrame(itertools.product(*x), columns=["file_x", "file_y"])

####################################
if __name__ == "__main__":
    if Path("positives_negatives.csv").exists():
        df = pd.read_csv("positives_negatives.csv")
    else:
        # positives: every pair of files inside the same identity.
        # DataFrame.append was removed in pandas 2.0, so collect and concat once.
        positives = pd.concat(
            [pd.DataFrame(itertools.combinations(value, 2), columns=["file_x", "file_y"])
             for value in tqdm(identities.values(), desc="Positives")],
            ignore_index=True)
        positives["decision"] = "Yes"
        print(positives)

        negative_frames = []
        with ProcessPoolExecutor() as pool:
            # take cpu_count combinations from identities.values
            for combos in tqdm(more_itertools.ichunked(itertools.combinations(identities.values(), 2), cpu_count())):
                # compute the cross product of each pair of identities in the workers
                negative_frames.extend(pool.map(compute_cross_samples, combos))
        negatives = pd.concat(negative_frames, ignore_index=True)
        negatives["decision"] = "No"

        # keep as many negatives as positives, then save the combined table
        negatives = negatives.sample(positives.shape[0])
        df = pd.concat([positives, negatives]).reset_index(drop=True)
        df.to_csv("positives_negatives.csv", index=False)

For your code, there are two things you need to consider:

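Because the APIs are so similar, the first step is to import cudf. Then, wherever you use the pd import alias, replace it with cudf. This is only a start, but it will help you learn the basics of the transition. When it comes to coding, start from [link missing], especially [link missing].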
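As a minimal sketch of that first step, the snippet below builds the positives table through a single alias, `xdf`, that points at cudf when it is installed and at pandas otherwise. The alias name, the CPU fallback, and the toy `identities` dict are assumptions made for illustration; they are not part of the question. The pairs are materialized into plain lists and passed as a dict of columns, which both libraries accept.

```python
import itertools

try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU and RAPIDS)
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch still runs without a GPU

# Hypothetical toy data standing in for the question's `identities` dict.
identities = {
    "alice": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "bob": ["b1.jpg", "b2.jpg"],
}

frames = []
for files in identities.values():
    # All unordered pairs of files that belong to the same identity.
    pairs = list(itertools.combinations(files, 2))
    frames.append(
        xdf.DataFrame(
            {
                "file_x": [x for x, _ in pairs],
                "file_y": [y for _, y in pairs],
            }
        )
    )

# One concat at the end instead of appending inside the loop.
positives = xdf.concat(frames, ignore_index=True)
positives["decision"] = "Yes"
print(positives)
```

This still runs a Python loop on the CPU, so it only illustrates the import swap. An existing pandas DataFrame can also be moved to the GPU with `cudf.from_pandas(df)` and brought back with `.to_pandas()`.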

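As mentioned before, on top of removing the CPU multiprocessing code, you will want to refactor your functions so that they do not need loops. cuDF and the other RAPIDS libraries do a lot of work under the hood to parallelize your code on the GPU. Adding for loops serializes the process and slows it down.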
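One possible way to remove the loops, sketched under assumptions: flatten `identities` into a single table with one row per file, then let joins generate the pairs. The `xdf` alias, the `row` and `key` helper columns, and the toy data are hypothetical names introduced here. This swaps the question's `itertools` logic for a self-join (positives) and a constant-key cross join (negatives); it is not necessarily what the answer's author had in mind.

```python
try:
    import cudf as xdf  # GPU DataFrame library (needs an NVIDIA GPU and RAPIDS)
except ImportError:
    import pandas as xdf  # CPU fallback so the sketch still runs without a GPU

# Hypothetical toy data standing in for the question's `identities` dict.
identities = {
    "alice": ["a1.jpg", "a2.jpg", "a3.jpg"],
    "bob": ["b1.jpg", "b2.jpg"],
}

# Flatten the dict into one long table: one row per (identity, file).
names = [name for name, paths in identities.items() for _ in paths]
paths = [path for file_list in identities.values() for path in file_list]
files = xdf.DataFrame(
    {
        "identity": names,
        "file": paths,
        "row": list(range(len(paths))),  # position, used to keep each pair once
        "key": [0] * len(paths),  # constant key, used for the cross join
    }
)

# Positives: join the table with itself on identity, keep each pair once.
same = files.merge(files, on="identity", suffixes=("_x", "_y"))
positives = same[same["row_x"] < same["row_y"]][["file_x", "file_y"]]
positives = positives.reset_index(drop=True)
positives["decision"] = "Yes"

# Negatives: join every row with every row, keep pairs from different identities.
crossed = files.merge(files, on="key", suffixes=("_x", "_y"))
different = (crossed["identity_x"] != crossed["identity_y"]) & (
    crossed["row_x"] < crossed["row_y"]
)
negatives = crossed[different][["file_x", "file_y"]].reset_index(drop=True)
negatives["decision"] = "No"

print(len(positives), len(negatives))
```

The cross join materializes a table that grows with the square of the number of files, so on a real dataset it may not fit in GPU memory; processing the identities in batches or sampling the negatives before joining are options worth considering. Row order after a merge is not guaranteed to match between pandas and cuDF.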

Finally, please read our official documentation here, which will help with your CPU -> GPU refactoring: [link missing]

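Multiprocessing pools do not work with CUDA. cuDF has a method for converting from pandas.

No problem, you can remove the multiprocessing code. I just want to run the code on the GPU. With multiprocessing it takes 9 days and then fails with an error. I have been stuck on this problem for the past two months and need help.

What exactly is your problem?

The problem is that I have to build a very large list, and creating it takes a very long time. I need to reduce that time by using the GPU.

If you create a minimal, complete, reproducible example, the community will probably be able to help you better.

I removed the multiprocessing and changed everything, but I still get an error.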