Creating an inner Python loop based on indexes and groups for all combinations


I have a script that looks at the row and column headers belonging to a group (REG_ID) and sums the values. The code runs on a matrix (a small subset of the data).

My code calculates the sum over all IDs based on the rows and columns that belong to each internal group (REG_ID). For example, it sums all row and column IDs that belong to REG_ID 1, which gives the total flow between region 1 and region 1 (the internal flow), and so on. I would like to extend this code to calculate (sum) the flows between regions, for example from region 1 to regions 2, 3, 4, 5 and so on. I think I need another loop inside the existing while loop, but I would really appreciate some help working out where it should go and how to structure it. The code I currently run for the internal flow sums (1-1, 2-2, 3-3, etc.) is as follows:

global index
index = 1
x = index
while index < len(idgroups):
    ward_list = idgroups[index] #select list of ward ids for each region from list of lists
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
    ward_listint = map(int, ward_list)
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
    df7 = df6.loc[:, ward_liststr]
    print df7
    regflowsum = df7.values.sum() #sum all values in dataframe
    intflow = [regflowsum]
    print intflow
    dfintflow = pd.DataFrame(intflow)
    dfintflow.reset_index(level=0, inplace=True)
    dfintflow.columns = ["RegID", "regflowsum"]
    dfflows.set_value(index, 'RegID', index)
    dfflows.set_value(index, 'RegID2', index)
    dfflows.set_value(index, 'regflow', regflowsum)
    mergedcsv.set_value(ward_list, 'TotRegFlows', regflowsum)
    index += 1 #increment index number
print dfflows
new_df = pd.merge(pairlist, dfflows,  how='left', left_on=['origID','destID'], right_on = ['RegID', 'RegID2'])
print new_df #useful for checking dataframe merges
regionflows = r"C:\Temp\AllNI\regionflows.csv"
header = ["WardID","LABEL","REG_ID","Total","TotRegFlows"]
mergedcsv.to_csv(regionflows, columns = header, index=False)
regregflows = r"C:\Temp\AllNI\reg_regflows.csv"
headerreg = ["REG_ID_ORIG", "REG_ID_DEST", "FLOW"]

pairlistCSV = r"C:\Temp\AllNI\pairlist_regions.csv"
new_df.to_csv(pairlistCSV)
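The updated version of the script (the one that calls a flows function) results in KeyError: 'None of [[189, 197, 198, 201]] are in the [columns]'

I have also tried using ward_lista = map(str, group_a) and map(int, group_a), but the listed objects are still not found by dataframe.loc. The columns have mixed data types, but all the columns containing the labels that should be sliced are of type int64.
I have tried a lot of solutions around the data types, but to no avail. Any suggestions?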

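I can't speak to the calculation you're doing, but it seems like you just want to arrange combinations of groups. The question is whether they are directed or undirected, that is, do you need to compute flows(A,B) and flows(B,A), or just one?

If just one, you could do this: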

for i,ward_list in enumerate(idgroups):
    for j,ward_list2 in enumerate(idgroups[i:],start=i):
This will iterate over i,j pairs like:

0,0 0,1 0,2 ... 0,n
1,1 1,2 ... 1,n
2,2 ... 2,n
This will work for the undirected case.
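
As a quick check of that pairing pattern, here is a minimal Python 3 sketch. The three-element idgroups list is a made-up stand-in for the real grouped ward IDs. It also shows that itertools.combinations_with_replacement over the positions yields the same pairs, which is the call left commented out in the question's script.

```python
from itertools import combinations_with_replacement

# Hypothetical stand-in for the real idgroups (one list of ward IDs per region).
idgroups = [[101, 102], [103], [104, 105]]

pairs = []
for i, ward_list in enumerate(idgroups):
    # Slicing from i onwards skips pairs already visited in the other order.
    for j, ward_list2 in enumerate(idgroups[i:], start=i):
        pairs.append((i, j))

print(pairs)

# The same undirected pairs, including (i, i), from the standard library.
same_pairs = list(combinations_with_replacement(range(len(idgroups)), 2))
print(pairs == same_pairs)
```

Note that start=i only renumbers the counter; it is the slice idgroups[i:] that actually drops the earlier groups.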

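If you need to compute both flows(A,B) and flows(B,A), then simply push your code into a function called flows, and call it with the arguments reversed, as shown. ;-)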
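
A minimal Python 3 sketch of the directed case. The flows body here is only a placeholder that looks up invented numbers in FAKE_FLOWS, so the calling pattern is the only meaningful part; the region names are hypothetical too.

```python
# Placeholder flow table, invented for illustration: (origin, destination) -> flow.
FAKE_FLOWS = {("A", "A"): 5, ("A", "B"): 2, ("B", "A"): 7, ("B", "B"): 1}


def flows(origin, dest):
    """Stand-in for the real computation: look up a made-up directed flow."""
    return FAKE_FLOWS[(origin, dest)]


regions = ["A", "B"]
results = {}
for i, a in enumerate(regions):
    for b in regions[i:]:
        results[(a, b)] = flows(a, b)
        if a != b:
            # Same function, arguments reversed, gives the opposite direction.
            results[(b, a)] = flows(b, a)

print(results)
```

Every ordered pair ends up in results exactly once, and the internal pairs (A,A) and (B,B) are not computed twice.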

Update

Let's define a function called flows:

def flows():
    pass
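Now, what are the parameters?

Well, looking at your code, it takes its data from a dataframe. And you want two different wards, so let's start with those. The result seems to be the sum of the resulting grid.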

def flows(df, ward_a, ward_b):
    """Return the sum of all the cells in the row/column intersections
    of ward_a and ward_b."""

    return 0
Now I'm going to copy lines from your code:

    ward_list = idgroups[index]
    print ward_list
    df6 = mergedcsv.loc[ward_list] #select rows with values in the list
    dfcols = mergedcsv.loc[ward_list, :] #select columns with values in list
    ward_liststr = map(str, ward_list) #convert ward_list to strings so that they can be used to select columns, won't work as integers.
    ward_listint = map(int, ward_list)
    #dfrowscols = mergedcsv.loc[ward_list, ward_listint]
    df7 = df6.loc[:, ward_liststr]
    print df7
    regflowsum = df7.values.sum() #sum all values in dataframe
    intflow = [regflowsum]
    print intflow
I think this is most of the flows function right here. Let's look.

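  • ward_list is obviously going to be either the ward_a or the ward_b parameter.

  • I'm not sure what df6 is for, since you more or less recompute it in df7. So that needs to be clarified.

  • regflowsum is our desired output, I think.

Rewriting this into the function: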

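    # Updated script from the question (Python 2), written after the flows() suggestion.
    # The answer's own flows(df, ward_a, ward_b) rewrite follows at the end of this listing.
    w = pysal.rook_from_shapefile("C:/Temp/AllNI/NIW01_sort.shp", idVariable='LABEL')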
    Simil = pysal.open("C:/Temp/AllNI/simNI.csv")
    Similarity = np.array(Simil)
    db = pysal.open('C:\Temp\SQLite\MatrixCSV2.csv', 'r')
    dbf = pysal.open(r'C:\Temp\AllNI\NIW01_sortC.dbf', 'r')
    ids = np.array((dbf.by_col['LABEL']))
    commuters = np.array((dbf.by_col['Total'],dbf.by_col['IDNO']))
    commutersint = commuters.astype(int)
    comm = commutersint[0]
    floor = int(MIN_COM_CT + 100)
    solution = pysal.region.Maxp(w=w,z=Similarity,floor=floor,floor_variable=comm)
    regions = solution.regions
    #print regions
    writecsv = r"C:\Temp\AllNI\reg_output.csv"
    csv = open(writecsv,'w')
    csv.write('"LABEL","REG_ID"\n')
    for i in range(len(regions)):
            for lines in regions[i]:
                csv.write('"' + lines + '","' + str(i+1) + '"\n')
    csv.close()
    flows = r"C:\Temp\SQLite\MatrixCSV2.csv"
    regs = r"C:\Temp\AllNI\reg_output.csv"
    wardflows = pd.read_csv(flows)
    regoutput = pd.read_csv(regs)
    merged = pd.merge(wardflows, regoutput)
    #duplicate REG_ID column as the index to be used later
    merged['REG_ID2'] = merged['REG_ID']
    merged.to_csv("C:\Temp\AllNI\merged.csv", index=False)
    mergedcsv = pd.read_csv("C:\Temp\AllNI\merged.csv",index_col='WardID_1') #index this dataframe using the WardID_1 column
    flabelList = pd.read_csv("C:\Temp\AllNI\merged.csv", usecols = ["WardID", "REG_ID"]) #create list of all FLabel values
    
    reg_id = "REG_ID"
    ward_flows = "RegIntFlows"
    flds = [reg_id, ward_flows] #create list of fields to be use in search
    
    dict_ref = {} # create a dictionary with for each REG_ID a list of corresponding FLABEL fields
    
    
    #group the dataframe by the REG_ID column
    idgroups = flabelList.groupby('REG_ID')['WardID'].apply(lambda x: x.tolist())
    print idgroups
    
    idgrp_df = pd.DataFrame(idgroups)
    
    csvcols = mergedcsv.columns
    
    #create a list of column names to pass as an index to select columns
    columnlist = list(mergedcsv.columns.values)
    
    mergedcsvgroup = mergedcsv.groupby('REG_ID').sum()
    mergedcsvgroup.describe()
    idList = idgroups[2]
    df4 = pd.DataFrame()
    df5 = pd.DataFrame()
    col_ids = idList #ward id no
    
    regiddf = idgroups.index.get_values()
    print regiddf
    #total number of region ids
    #print regiddf
    #create pairlist combinations from region ids
    #combinations with replacement allows for repeated items
    #pairs = list(itertools.combinations_with_replacement(regiddf, 2))
    pairs = list(itertools.product(regiddf, repeat=2))
    #print len(pairs)
    
    #create a new dataframe with pairlists and summed data
    pairlist = pd.DataFrame(pairs,columns=['origID','destID'])
    print pairlist.tail()
    header_pairlist = ["origID","destID","flow"]
    header_intflow = ["RegID", "RegID2", "regflow"]
    dfflows = pd.DataFrame(columns=header_intflow)
    
    print mergedcsv.index
    print mergedcsv.dtypes
    #mergedcsv = mergedcsv.select_dtypes(include=['int64'])
    #print mergedcsv.columns
    #mergedcsv.rename(columns = lambda x: int(x), inplace=True)
    
    def flows():
        pass
    
    #def flows(mergedcsv, region_a, region_b):
    def flows(mergedcsv, ward_lista, ward_listb):
        """Return the sum of all the cells in the row/column intersections
        of ward_lista and ward_listb."""
    
        mergedcsv = mergedcsv.loc[:, mergedcsv.dtypes == 'int64']
        regionflows = mergedcsv.loc[ward_lista, ward_listb]
        regionflowsum = regionflows.values.sum()
    
    
        #grid = [ax, bx, regflowsuma, regflowsumb]
        gridoutput = [ax, bx, regionflowsum]
        print gridoutput
    
        return regflowsuma
        return regflowsumb
    
    #print mergedcsv.index
    
    #mergedcsv.columns = mergedcsv.columns.str.strip()
    
    for ax, group_a in enumerate(idgroups):
        ward_lista = map(int, group_a)
        print ward_lista
    
    
        for bx, group_b in enumerate(idgroups[ax:], start=ax):
            ward_listb = map(int, group_b)
            #print ward_listb
    
            flow_ab = flows(mergedcsv, ward_lista, ward_listb)
                #flow_ab = flows(mergedcsv, group_a, group_b)
    
    def flows(df, ward_a, ward_b):
        """Return the sum of all the cells in the row/column intersections
        of ward_a and ward_b."""
    
        print "Computing flows from:"
        print "    ", ward_a
        print ""
        print "flows into:"
        print "    ", ward_b
    
        # Filter rows by ward_a, cols by ward_b:
        grid = df.loc[ward_a, ward_b]
    
        print "Grid:"
        print grid
    
        flowsum = grid.values.sum()
    
        print "Flows:", flowsum
    
        return flowsum
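
To make the row/column intersection concrete, here is a small Python 3 sketch on a made-up 4x4 matrix. It assumes, as the original while loop did with ward_list and ward_liststr, that rows are selected by integer ward ID and columns by the string form of the same ID; the numbers and region lists are invented. In Python 3, map returns an iterator, so list comprehensions are used to build the label lists.

```python
import pandas as pd

# Made-up ward-to-ward matrix: integer ward IDs as the row index,
# the same IDs as *string* column names (an assumption about the CSV).
wards = [189, 197, 198, 201]
df = pd.DataFrame(
    [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]],
    index=wards,
    columns=[str(w) for w in wards],
)


def flows(df, ward_a, ward_b):
    """Sum the cells where the rows in ward_a meet the columns in ward_b."""
    return df.loc[ward_a, ward_b].values.sum()


region_1 = [189, 197]
region_2 = [198, 201]

# Rows are selected by integer label, columns by string label.
flow_12 = flows(df, region_1, [str(w) for w in region_2])
flow_21 = flows(df, region_2, [str(w) for w in region_1])
print(flow_12, flow_21)
```

With these numbers flow_12 is 3 + 4 + 7 + 8 = 22 and flow_21 is 9 + 10 + 13 + 14 = 46, so the two directions differ unless the matrix is symmetric.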
    
Now, I'm assuming that the ward_a and ward_b values are already in the correct format. So we'll have to str-ify them, or whatever, outside the function. Let's do that:

    for ax, group_a in enumerate(idgroups):
        ward_a = map(str, group_a)
    
        for bx, group_b in enumerate(idgroups[ax:], start=ax):
            ward_b = map(str, group_b)
    
            flow_ab = flows(mergedcsv, ward_a, ward_b)
    
            if ax != bx:
                flow_ba = flows(mergedcsv, ward_b, ward_a)
            else:
                flow_ba = flow_ab
    
            # Now what?
    

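At this point, you have two numbers. They'll be equal when the wards are the same (the internal flow?). At this point your original code stops being helpful, because it only deals with internal flows, not A->B flows, so I don't know what to do next. But the values are in the variables, so...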
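
One possible next step, offered as a sketch rather than as what the original script intended: collect each directed flow as a row of (origID, destID, flow), reusing the column names the question gives to pairlist. The toy matrix, the idgroups dictionary and the region_flow helper are all invented for illustration, and the code is Python 3.

```python
import pandas as pd

# Toy matrix: integer ward IDs on the rows, string ward IDs on the columns.
wards = [1, 2, 3, 4]
matrix = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=wards,
    columns=[str(w) for w in wards],
)

# Hypothetical grouping: REG_ID -> list of ward IDs in that region.
idgroups = {1: [1, 2], 2: [3, 4]}


def region_flow(df, rows, cols):
    """Sum the block of df at the given row wards and column wards."""
    return df.loc[rows, [str(c) for c in cols]].values.sum()


records = []
reg_ids = list(idgroups)
for ax, reg_a in enumerate(reg_ids):
    for reg_b in reg_ids[ax:]:
        flow_ab = region_flow(matrix, idgroups[reg_a], idgroups[reg_b])
        records.append((reg_a, reg_b, flow_ab))
        if reg_a != reg_b:
            # Opposite direction: swap the row and column groups.
            flow_ba = region_flow(matrix, idgroups[reg_b], idgroups[reg_a])
            records.append((reg_b, reg_a, flow_ba))

flow_table = pd.DataFrame(records, columns=["origID", "destID", "flow"])
print(flow_table)
```

A table in this shape could then be merged with pairlist on origID and destID, much as the question already merges dfflows.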

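Austin, thanks. This is a big help. I'm looking to calculate the flows from A to B and from B to A. I haven't come across the flows function, so I will look into it.

There is no flows function.

Austin, your suggestion was really helpful. Adding a flows function is very logical. I have updated the code based on your suggestions. I seem to have either a data type error or an indexing error. I have tried a number of ways to resolve this, including mapping to int and selecting all columns of type int64, but I still get errors. Can you spot any obvious mistakes? Thanks very much for your advice.

Have you confirmed that the values are actually in the columns? Is it possible they are only in the rows? Have you tried printing the list returned by the attribute?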
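
The check suggested in the last comment can be sketched as follows, in Python 3 on a toy frame. One possible cause of the KeyError, which cannot be confirmed from the post, is that the column labels are strings while the lookup uses integers. The dtype of a column's values (int64 here) is separate from the type of its label, so filtering on dtypes does not change the labels.

```python
import pandas as pd

# Toy frame mimicking a CSV read: integer index, string column names.
df = pd.DataFrame([[1, 2], [3, 4]], index=[189, 197], columns=["189", "197"])
wanted = [189, 197]

# The values are int64, but that says nothing about the column labels.
print(df.dtypes)

# Integer labels are not found among string column names.
try:
    df.loc[:, wanted]
    found = True
except KeyError as err:
    found = False
    print("KeyError:", err)

# Inspect the labels themselves; repr() also exposes stray whitespace.
print([repr(c) for c in df.columns])

# Check membership both ways before slicing.
as_int = [w in df.columns for w in wanted]
as_str = [str(w) in df.columns for w in wanted]
print(as_int, as_str)

# Matching the label type makes the selection work.
subset = df.loc[:, [str(w) for w in wanted]]
print(subset.values.sum())
```

If the membership test fails for both the integer and the string form, the IDs may be present only in the index, which is the other possibility raised in the comment.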