Python: Loading batches of images in Keras from a pandas dataframe


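I have a pandas dataframe with two columns: one holds the paths to the images and the other holds the class labels as strings.

I have also written the following function, which loads the images listed in the dataframe, renormalizes them, and converts the class labels to one-hot vectors.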

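import numpy as np
from keras.utils import to_categorical

# imread and label2int (string label -> integer index) are defined
# elsewhere in the asker's code and are not shown in the question.

def prepare_data(df):
    data_X, data_y = df.values[:,0], df.values[:,1]

    # Load images
    data_X = np.array([np.array(imread(fname)) for fname in data_X])

    # Normalize input to the range [-0.5, 0.5]
    data_X = data_X / 255 - 0.5

    # Prepare labels: string -> integer index -> one-hot vector
    data_y = np.array([label2int[label] for label in data_y])
    data_y = to_categorical(data_y)

    return data_X, data_y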
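
The function above relies on a label2int mapping that the question does not show. As a minimal sketch, assuming the labels are plain strings, the mapping and the one-hot step could look like the following. The names build_label2int and one_hot are hypothetical, and the NumPy encoding is only meant to mirror what to_categorical produces, not to replace it.

```python
import numpy as np


def build_label2int(labels):
    """Map each distinct string label to an integer index (sorted for stability)."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(labels, label2int):
    """Return a (n_samples, n_classes) float array with a single 1 per row."""
    ints = np.array([label2int[label] for label in labels])
    encoded = np.zeros((len(ints), len(label2int)))
    encoded[np.arange(len(ints)), ints] = 1.0
    return encoded


# Hypothetical labels, standing in for the second dataframe column.
labels = ["dog", "cat", "dog", "bird"]
label2int = build_label2int(labels)
encoded = one_hot(labels, label2int)
```

Building the mapping once from the full label column, rather than per batch, keeps the class indices consistent across batches.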

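I want to feed this dataframe to a Keras CNN, but the whole dataset is too big to be loaded into memory at once.

Other answers on this site tell me that I should use a Keras ImageDataGenerator for this purpose, but honestly I cannot work out how to do that from the documentation.

What is the simplest way to feed the data to the model in lazily loaded batches?

If it is an ImageDataGenerator, how do I create one that is initialized with the dataframe and passes each batch through my function to build the appropriate numpy arrays? And how do I fit the model using the ImageDataGenerator?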

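ImageDataGenerator is a high-level class that can yield data from multiple sources (numpy arrays, directories, ...) and includes utility functions for image augmentation and similar tasks.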
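Update

As of keras-preprocessing 1.0.4, ImageDataGenerator comes with a flow_from_dataframe method that addresses your case. It requires dataframe and directory arguments, defined as follows: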
dataframe: Pandas dataframe containing the filenames of the
           images in a column and classes in another or column/s
           that can be fed as raw target data.
directory: string, path to the target directory that contains all
           the images mapped in the dataframe.

So there is no longer any need to implement it yourself.

Original answer below
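In your case, with the dataframe as you described it, you could also write your own custom generator that reuses the logic in your prepare_data function, which is a simpler solution. It is good practice to use Keras's Sequence object for this, because it allows multiprocessing (which helps avoid bottlenecking your GPU, if you are using one).

You can check out the docs on the Sequence object; they contain an example implementation. Eventually, your code will be something along these lines (this is boilerplate code, and you will have to add specifics such as the label2int function or the image preprocessing logic):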
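
A minimal sketch of what such a Sequence subclass might look like is given here. It is an illustration under stated assumptions, not a definitive implementation: the column layout follows the question (paths in column 0, string labels in column 1), while load_image and label2int are hypothetical constructor arguments standing in for the asker's imread call and label mapping. All images are assumed to have the same shape.

```python
import math

import numpy as np

try:
    from tensorflow.keras.utils import Sequence
except ImportError:
    # Fallback so the sketch can be exercised without Keras installed;
    # actual training needs the real Sequence base class.
    Sequence = object


class DataSequence(Sequence):
    """Yields (images, one-hot labels) batches from a two-column dataframe."""

    def __init__(self, df, batch_size, load_image, label2int):
        super().__init__()
        # Column 0: image paths, column 1: string labels (as in the question).
        self.paths = df.values[:, 0]
        self.labels = df.values[:, 1]
        self.batch_size = batch_size
        self.load_image = load_image  # hypothetical callable: path -> array
        self.label2int = label2int    # hypothetical dict: label -> class index

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # Images are read only here, one batch at a time.
        batch_x = np.array([self.load_image(p) for p in self.paths[batch]])
        batch_x = batch_x / 255 - 0.5  # same normalization as prepare_data
        ints = [self.label2int[label] for label in self.labels[batch]]
        batch_y = np.eye(len(self.label2int))[ints]  # one-hot rows
        return batch_x, batch_y
```

Because nothing is read from disk until __getitem__ is called, only one batch of images is held in memory at a time. Note that this sketch takes two more constructor arguments than the two-argument call shown in this answer, so the loader and the label mapping would have to be passed as well.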

This object can be passed to train your model, just like a custom generator:

sequence = DataSequence(dataframe, batch_size)
model.fit_generator(sequence, epochs=1, use_multiprocessing=True)

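As mentioned in the comments below, there is no need to implement shuffling logic. Setting the shuffle argument to True in the fit_generator() call is enough. From the docs:

shuffle: Boolean. Whether to shuffle the order of the batches at the beginning of each epoch. Only used with instances of Sequence (keras.utils.Sequence). Has no effect when steps_per_epoch is not None.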

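I am new to Keras, so take my advice with a grain of salt. I think you should use the Keras ImageDataGenerator, and in particular the flow_from_dataframe option, since you said you have a pandas dataframe.

flow_from_dataframe reads the columns of the dataframe to get the filenames and labels.

Below is an example snippet. Look online for tutorials.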

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(horizontal_flip=True,
                                   vertical_flip=False,
                                   rescale=1/255.0)

train_generator = train_datagen.flow_from_dataframe(
    dataframe=trainDataframe,
    directory=imageDir,
    x_col="file", # name of col in data frame that contains file names
    y_col=y_col_list, # name(s) of col(s) with labels
    has_ext=True,
    batch_size=batch_size,
    shuffle=True,
    save_to_dir=saveDir,
    target_size=(img_width,img_height),
    color_mode='grayscale',
    class_mode='categorical', # for classification task
    interpolation='bilinear')

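Do you still need to specify steps_per_epoch in fit_generator, or does it pick that up from __len__? And is there any particular reason for using epochs=1 in your example?

Indeed, the generator infers the number of steps per epoch from the batch size and len (under the hood, this is implemented in the Sequence base class). I used a single epoch purely for illustration. It is also worth noting that you do not need to implement your own shuffling in on_epoch_end. You can just set shuffle=True in fit_generator, and it will shuffle instances of Sequence.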
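
To make the arithmetic concrete, here is a small sketch (not Keras's own code) of the value a Sequence's __len__ would typically return, which is what gets used as the number of steps per epoch. The function name and the sample counts are hypothetical.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover every sample once.

    Rounds up so that a final, smaller batch is still counted.
    """
    return math.ceil(num_samples / batch_size)


# Hypothetical dataset of 1000 images with a batch size of 32.
steps = steps_per_epoch(1000, 32)
```

With these numbers the last batch holds only 8 images, which is why the division is rounded up rather than down.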
@sdcbr - What should be done if the pixels are already in the dataframe? What would you prefer?