Python scikit-learn: loading images from a folder to create a labeled dataset for KNN classification


python file scikit-learn directory


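I want to do handwritten digit recognition using K-nearest-neighbors classification with scikit-learn. I have a folder that contains 5001 images of handwritten digits (500 images for each digit from 0 to 9).

I am trying to find a way to create a dataset from these images so that I can then build a training set and a test set. I have read many online tutorials on K-nearest-neighbors classification with scikit-learn, but most of them load an existing dataset, such as the MNIST dataset of handwritten digits.

Is there a way to create my own dataset by reading the images from a folder and assigning a label to each image? I am not sure which methods I could use to do this. Any insights are appreciated.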

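You can use the Pillow or OpenCV library to read your images.

With Pillow, np.asarray(PIL.Image.open(path)) returns the image as a NumPy array; with OpenCV, cv2.imread(path) does. To convert all of your images you can use, for example, the os library: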

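import os

import numpy as np
import PIL.Image

Create a list of image names:

loc = os.listdir('your_images_folder')

To store grayscale images, which have a single color channel, you can preallocate an array with one row per image:

# Width * height of one image in pixels; 28 * 28 is only a placeholder value
image_size = 28 * 28
data = np.ones((len(loc), image_size))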


for i, l in enumerate(loc):

    # Full image path
    path = os.path.join("your_images_folder", l)

    # Read the image into a NumPy array
    img = np.asarray(PIL.Image.open(path))

    # Make a row vector of shape (1, img.size) from the image
    img = img.reshape(-1, img.size)

    # Store this vector as row i
    data[i, :] = img

This gives you the NumPy array "data" for your classification project. The "y" vector can also be built in the same loop from the name of each image.
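As a rough sketch of how the label vector could be filled in the same loop: the version below assumes that every file in the folder is an image of the same size and that each file name starts with its digit, for example 3_0042.png. That naming scheme and the function name load_folder are assumptions made for illustration; the question does not say how the files are named.

```python
import os

import numpy as np
from PIL import Image


def load_folder(folder):
    """Return (data, y): one flattened grayscale image per row, plus labels.

    Assumes every file in `folder` is an image of the same size whose name
    starts with its digit, e.g. "3_0042.png" (a hypothetical naming scheme).
    """
    rows, labels = [], []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # convert("L") forces a single grayscale channel
        img = np.asarray(Image.open(path).convert("L"))
        rows.append(img.reshape(-1))            # 1-D vector of length w * h
        labels.append(int(name.split("_")[0]))  # digit before the underscore
    return np.array(rows), np.array(labels)
```

Calling load_folder('your_images_folder') would then return a 2-D array with one row per image and a 1-D array of integer labels in the same order.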

To follow the progress of the loop with a progress bar, the tqdm library can be a suitable solution.
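A minimal sketch of wrapping a loop in tqdm is shown here. The function process_all and its placeholder body are hypothetical stand-ins for the image-loading loop, and the fallback is only there so the sketch still runs when tqdm is not installed.

```python
try:
    from tqdm import tqdm  # third-party package: pip install tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm: same loop, no bar
    def tqdm(iterable, **kwargs):
        return iterable


def process_all(names):
    """Hypothetical stand-in for the image-loading loop."""
    results = []
    # Wrapping the iterable is enough; tqdm prints the bar as the loop advances
    for name in tqdm(names, desc="Loading images"):
        results.append(name.upper())  # placeholder for the real per-image work
    return results
```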

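The same approach works for storing RGB images. For an RGB image, reshape(-1,) simply returns a longer vector, because it contains all three color channels.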
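To make the difference in length concrete, here is a small sketch with blank arrays standing in for real images. The 28 x 28 size is an assumption chosen only for the example.

```python
import numpy as np

h, w = 28, 28  # assumed image size, for illustration only

gray = np.zeros((h, w), dtype=np.uint8)    # one channel
rgb = np.zeros((h, w, 3), dtype=np.uint8)  # three channels

gray_vec = gray.reshape(-1)
rgb_vec = rgb.reshape(-1)

print(gray_vec.shape)  # (784,)
print(rgb_vec.shape)   # (2352,)
```

The preallocated array therefore needs width * height * 3 columns when the images are stored in color.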

To read the data, you could do something like this:

from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt

mypath = '.' # edit with the path to your data
files = [f for f in listdir(mypath) if isfile(join(mypath, f))]

x = []
y = []

for file in files:
    label = file.split('_')[0] # assuming your img is named like this "eight_1.png" you want to get the label "eight"
    y.append(label)
    img = plt.imread(join(mypath, file)) # join with mypath so this also works when the data is not in the current directory
    x.append(img)

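After that, you will need to manipulate x and y a little before handing them to scikit-learn (for example, flattening each image into a single row), but you should be fine.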
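One possible way to do that manipulation is sketched below, assuming all images have the same shape. The helper name to_arrays is hypothetical. scikit-learn classifiers also accept string labels directly, so the integer encoding is optional.

```python
import numpy as np


def to_arrays(x, y):
    """Turn a list of equally sized images and a list of string labels
    into a 2-D feature matrix and an integer label vector."""
    # Stack the images, then flatten each one into a single row
    X = np.stack(x).reshape(len(x), -1)  # shape (n_samples, n_features)
    # classes holds the sorted unique labels; y_int indexes into it
    classes, y_int = np.unique(y, return_inverse=True)
    return X, y_int, classes


# Tiny stand-in data: three 2x3 "images" with labels taken from file names
x = [np.zeros((2, 3)), np.ones((2, 3)), np.zeros((2, 3))]
y = ["one", "eight", "one"]
X, y_int, classes = to_arrays(x, y)
print(X.shape)  # (3, 6)
```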

Does this help?

import os
import imageio


# Map the word at the start of a file name to its integer label
WORD_TO_LABEL = {
    'zero': 0, 'one': 1, 'two': 2, 'three': 3, 'four': 4,
    'five': 5, 'six': 6, 'seven': 7, 'eight': 8, 'nine': 9,
}


def convert_word_to_label(word):
    # Returns None for an unknown word
    return WORD_TO_LABEL.get(word)



def create_dataset(path):
    X = []
    y = []

    for r, d, f in os.walk(path):
        for image in f:
            if image.lower().endswith('.jpg'):
                image_path = os.path.join(r, image)
                img = imageio.imread(image_path)
                X.append(img)
                word = image.split('_')[0]
                y.append(convert_word_to_label(word))
    return X, y

if __name__ == '__main__':
    X, y = create_dataset('path/to/image_folder/')
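Once X and y exist, the remaining steps from the question (a train/test split and a K-nearest-neighbors classifier) might look like the sketch below. The function name train_knn, the 80/20 split, and n_neighbors=3 are arbitrary choices for illustration, and all images are assumed to have the same shape.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_knn(X, y, n_neighbors=3, test_size=0.2, seed=0):
    """Flatten the images, split them, fit KNN, and return (model, accuracy)."""
    X = np.asarray(X).reshape(len(X), -1)  # one flattened image per row
    y = np.asarray(y)
    # stratify=y keeps the class proportions the same in both subsets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y)
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    return knn, knn.score(X_test, y_test)
```

For example, knn, accuracy = train_knn(X, y) would give a fitted classifier and its accuracy on the held-out images.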

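What does the -1 do in img.reshape(-1, img.size)?

I made a mistake: img.reshape(-1,) produces a 1-D vector from the image array, whereas img.reshape(-1, img.size) and img.reshape(1, img.size) behave the same as each other. The -1 means that NumPy will work out the correct size for that dimension automatically, so reshape(-1,) gives the correct shape for a 1-D vector. Once you have built the X dataset, you can save it as a pandas DataFrame or with np.save('data.npy', data).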
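The three reshape calls, plus saving and reloading the result, can be checked on a small stand-in array. The 3 x 4 array and the temporary file location are assumptions made only for this example.

```python
import os
import tempfile

import numpy as np

img = np.arange(12).reshape(3, 4)  # stand-in for a 3x4 grayscale image

print(img.reshape(-1).shape)            # (12,)
print(img.reshape(-1, img.size).shape)  # (1, 12)
print(img.reshape(1, img.size).shape)   # (1, 12)

# Save the flattened vector to a temporary .npy file and load it back
path = os.path.join(tempfile.mkdtemp(), "data.npy")
np.save(path, img.reshape(-1))
restored = np.load(path)
```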
