Python: "Wrong number of items passed" error when adding numpy array contents to a dataframe column

python arrays pandas numpy dataframe

I have the following sioma_df dataframe. These are its shape and column index; it has 13807 rows and 37 columns:

sioma_df.shape
(13807, 37)
sioma_df.columns
Index(['Luz (lux)', 'Precipitación (ml)', 'Temperatura (°C)',
       'Velocidad del Viento (km/h)', 'E', 'N', 'NE', 'NO', 'O', 'S', 'SE',
       'SO', 'PORVL2N1', 'PORVL2N2', 'PORVL4N1', 'PORVL5N1', 'PORVL6N1',
       'PORVL7N1', 'PORVL8N1', 'PORVL9N1', 'PORVL10N1', 'PORVL13N1',
       'PORVL14N1', 'PORVL15N1', 'PORVL16N1', 'PORVL16N2', 'PORVL18N1',
       'PORVL18N2', 'PORVL18N3', 'PORVL18N4', 'PORVL21N1', 'PORVL21N2',
       'PORVL21N3', 'PORVL21N4', 'PORVL21N5', 'PORVL24N1', 'PORVL24N2'],
      dtype='object')

I want to apply the k-means algorithm, and I have decided that in the random initialization stage I will use k=9 centroids:

import numpy as np

# Turn the dataframe into a numpy array
# (to_numpy() replaces get_values(), which was removed in pandas 1.0)
sioma_numpy = sioma_df.to_numpy()

k=9

# Create a dictionary with the centroids coordinates 
centroids = {
    i + 1: [np.random.randint(0, np.max(sioma_numpy)), np.random.randint(0, np.max(sioma_numpy))]
    for i in range(k)
}

I plot the data before applying the clustering:

# I get each column individually into an array 

c1 = sioma_df['Luz (lux)'].values
c2 = sioma_df['Precipitación (ml)'].values
c3 = sioma_df['Temperatura (°C)'].values
c4 = sioma_df['Velocidad del Viento (km/h)'].values
c5 = sioma_df['PORVL2N1'].values
c6 = sioma_df['PORVL2N2'].values
c7 = sioma_df['PORVL4N1'].values
c8 = sioma_df['PORVL5N1'].values
c9 = sioma_df['PORVL6N1'].values
c10 = sioma_df['PORVL7N1'].values
c11 = sioma_df['PORVL8N1'].values
c12 = sioma_df['PORVL9N1'].values
c13 = sioma_df['PORVL10N1'].values
c14 = sioma_df['PORVL13N1'].values
c15 = sioma_df['PORVL14N1'].values
c16 = sioma_df['PORVL15N1'].values
c17 = sioma_df['PORVL16N1'].values
c18 = sioma_df['PORVL16N2'].values
c19 = sioma_df['PORVL18N1'].values
c20 = sioma_df['PORVL18N2'].values
c21 = sioma_df['PORVL18N3'].values
c22 = sioma_df['PORVL18N4'].values
c23 = sioma_df['PORVL21N1'].values
c24 = sioma_df['PORVL21N2'].values
c25 = sioma_df['PORVL21N3'].values
c26 = sioma_df['PORVL21N4'].values
c27 = sioma_df['PORVL21N5'].values
c28 = sioma_df['PORVL24N1'].values
c29 = sioma_df['PORVL24N2'].values
c30 = sioma_df['E'].values
c31 = sioma_df['N'].values
c32 = sioma_df['NE'].values
c33 = sioma_df['NO'].values
c34 = sioma_df['O'].values
c35 = sioma_df['S'].values
c36 = sioma_df['SE'].values
c37 = sioma_df['SO'].values

""" I generate the X and Y coordinate points from the c1 to c36
variables above. With zip I associate the ci arrays with each other and
store them in a list that represents array X and array Y
"""
X = np.array(list(zip(c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18)))
print( " ARRAY X" +'\n', X, '\n' )
Y = np.array(list(zip(c19,c20,c21,c22,c23,c24,c25,c26,c27,c28,c29,c30,c31,c32,c33,c34,c35,c36,)))
print( " ARRAY Y" +'\n', Y, '\n' )

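Then I generated a pair of x, y coordinates for each centroid.

I want to start with the assignment stage, in which I assign each data point to its closest centroid. I have the following: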

def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # I want to create a new column named distance_from_i
        # in the sioma_df dataframe
        sioma_df['distance_from_{}'.format(i)] = (
            # Distance between each data point and each of the
            # 9 centroids: the distance_from_i column should hold the
            # distance of every data point to centroid i
            np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
            )
        )
    # Collect the distance columns so that, for each data point,
    # the distances to all centroids can be compared
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]
    # Create the closest column in the sioma_df dataframe by taking,
    # along axis=1, the column with the minimum distance
    sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
    sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
    return df

# Run the assignment function, which computes the closest centroid for each data point
df = assignment(sioma_df, centroids)
print(df.head)

However, when I run the code, I get the following error:

KeyError: 'distance_from_1'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-160-b96e0351c13d> in <module>()
     24 
     25 # 
---> 26 df = assignment(sioma_df, centroids)
     27 print(df.head)

<ipython-input-160-b96e0351c13d> in assignment(df, centroids)
     11             np.sqrt(
     12                 (X - centroids[i][0]) ** 2
---> 13                 + (Y - centroids[i][1]) ** 2
     14             )
     15         )


ValueError: Wrong number of items passed 18, placement implies 1

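I really don't understand how to fix this so that I get a correct assignment, which makes it hard for me to troubleshoot.

Any support that points me in the right direction would be highly appreciated.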

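My problem was that the np.sqrt(...) statement does not return a one-dimensional array. The column placement expects 1 value per row, but because of the length of the X and Y numpy arrays, it receives an array 18 elements long.

Operations on numpy arrays are element-wise, so they do not reduce the shape of the array being operated on. So when I create the new distance_from_i column like this: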

sioma_df['distance_from_{}'.format(i)] = (
            np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
            )
        )

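I am assigning to this distance_from_i column something that is not the one-dimensional array it is able to accept: each row of the distance_from_i column receives an array 18 elements long, and that is the reason for the error:

ValueError: Wrong number of items passed 18, placement implies 1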
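The shape mismatch can be checked with numpy alone. The sketch below uses small made-up arrays (5 rows instead of 13807) and arbitrary centroid coordinates `cx` and `cy`; only the 18-column layout of X and Y is taken from the question.

```python
import numpy as np

# Hypothetical stand-ins for the X and Y arrays built with zip in the question:
# 5 rows (instead of 13807) and 18 columns each.
rng = np.random.default_rng(0)
X = rng.random((5, 18))
Y = rng.random((5, 18))

# Hypothetical centroid coordinates: two scalars, as in the question's dictionary.
cx, cy = 3, 7

# Element-wise arithmetic with scalars keeps the (5, 18) shape.
n = np.sqrt((X - cx) ** 2 + (Y - cy) ** 2)
print(n.shape)        # (5, 18): 18 values per row, not 1

# A single dataframe column needs one value per row, i.e. shape (5,).
print(n[:, 0].shape)  # (5,)
```

Anything assigned to one column has to be reduced to the second shape first.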

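I then initialized the new distance_from_i column with NaN values and afterwards assigned the result of the np.sqrt(...) statement to it, and it worked. My assignment function now runs, and it ended up like this: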

def assignment(df, centroids):
    # Iterate over the keys of the k=9 centroids
    for i in centroids.keys():
        # sqrt((x1 - x2)^2 + (y1 - y2)^2)
        # Distance between each data point and each of the
        # 9 centroids: the distance_from_i column will hold the
        # distance of every data point to centroid i
        n = np.sqrt(
                (X - centroids[i][0]) ** 2
                + (Y - centroids[i][1]) ** 2
        )
        # Create the new distance_from_i column in the sioma_df
        # dataframe with NaN values, then assign the distances to it
        sioma_df['distance_from_{}'.format(i)] = np.nan
        sioma_df['distance_from_{}'.format(i)] = n

    # Collect the distance columns so that, for each data point,
    # the distances to all centroids can be compared
    centroid_distance_cols = ['distance_from_{}'.format(i) for i in centroids.keys()]

    # Create the closest column in the sioma_df dataframe by taking,
    # along axis=1, the column with the minimum distance
    sioma_df['closest'] = sioma_df.loc[:, centroid_distance_cols].idxmin(axis=1)
    sioma_df['closest'] = sioma_df['closest'].map(lambda x: int(x.lstrip('distance_from_')))
    sioma_df['color'] = sioma_df['closest'].map(lambda x: colmap[x])
    return df

# Run the assignment function, which computes the closest centroid for each data point
df = assignment(sioma_df, centroids)
print(df.head())
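As an alternative sketch (this is not the code from the answer above), the distance can be reduced to one value per row by treating every centroid as a point with one coordinate per feature and summing the squared differences across the columns. The names `data`, `centroid_matrix` and `nearest_centroid`, the toy sizes and the random values are all assumptions made for illustration.

```python
import numpy as np

def nearest_centroid(data, centroid_matrix):
    """Hypothetical helper: data is (n, d), centroid_matrix is (k, d)."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d)
    diff = data[:, np.newaxis, :] - centroid_matrix[np.newaxis, :, :]
    # Summing over the feature axis leaves one distance per (point, centroid)
    distances = np.sqrt((diff ** 2).sum(axis=2))   # shape (n, k)
    # 1-based labels, matching the keys 1..k of the centroids dictionary
    labels = distances.argmin(axis=1) + 1          # shape (n,)
    return distances, labels

# Toy data: 6 points with 4 features, 3 centroids (made-up values)
rng = np.random.default_rng(0)
data = rng.random((6, 4))
centroid_matrix = rng.random((3, 4))

distances, labels = nearest_centroid(data, centroid_matrix)
print(distances.shape)  # (6, 3)
print(labels.shape)     # (6,)
```

Each slice `distances[:, j]` is one-dimensional, so it has the shape that a single column such as distance_from_1 expects.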

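You really should use something like a dictionary to store all these arrays. Much cleaner.

@user3483203 That is true, but I use these ci arrays to pass them to the zip function and generate the X and Y coordinates for later use in the scatter function. How would I use the elements of a dictionary in the scatter function? Or are you telling me to use a dictionary to compute the distances? Where exactly am I getting the error?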
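One possible reading of the dictionary suggestion, sketched with made-up values: keep the per-column arrays in a dict keyed by column name, then stack the ones you need into a two-dimensional array. The column names come from the question; the values, the `columns` dict and the `x_names` selection are hypothetical.

```python
import numpy as np

# Hypothetical values; in the question each array would be sioma_df[name].values.
columns = {
    'Luz (lux)': np.array([10.0, 20.0, 30.0]),
    'Precipitación (ml)': np.array([0.0, 1.5, 0.2]),
    'Temperatura (°C)': np.array([21.0, 22.5, 19.8]),
}

# Choose which columns go into X (an arbitrary selection for this sketch).
x_names = ['Luz (lux)', 'Temperatura (°C)']

# column_stack gives the same (rows, columns) layout as np.array(list(zip(...))).
X = np.column_stack([columns[name] for name in x_names])
print(X.shape)  # (3, 2)

# A single column can still be looked up by name, for example for plotting.
luz = columns['Luz (lux)']
```

This replaces the numbered c1 to c37 variables with lookups by name, so a duplicated or missing column is easier to spot.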
