Python: iterating over a DataFrame column to compare results

python pandas numpy dataframe

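I have a pandas DataFrame with a column of local neighborhoods. I want to go through this column and compare each neighborhood with the others, so that near-duplicate spellings are collapsed into a single value. When I try this on a small subset of the data in the Python shell, it works fine: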

import pandas as pd
from fuzzywuzzy import fuzz as fw

n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
#            neighborhood
#0           Dupont Circle
#1            Adams Morgan
#2              alexandria
#3  west end/dupont circle
#4            logan circle
#5          alexandria, va
#6              washington
#7    adam morgan/kalorama
#8           Washington DC
#9                Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(),n['neighborhood'][j].lower())
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j]+' - '+str(ratio))
        if ratio>90:
            n['neighborhood'][j] = n['neighborhood'][i]
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j])
print(n)
#results
#   neighborhood
#0  Dupont Circle
#1   Adams Morgan
#2     alexandria
#3  Dupont Circle
#4   logan circle
#5     alexandria
#6     washington
#7   Adams Morgan
#8     washington
#9       Kalorama

This is what I expected. However, when I scale up and run it against the data I scraped from craigslist, I get a KeyError.

#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]

When I run this code, print(i) and print(j) return 0 and 1 as expected, but then I get a KeyError:

  line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\cards\AppData\Local\Programs\Python38-32\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

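My understanding is that this has to do with how the column and the key are looked up. But why does it work on the smaller dataset and not on the larger one?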

Full scraping code:

from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw

print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)

neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0

for page in pages:
    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')
    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
            count+=1
print(count)

#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})

#clean up results
post_results.drop_duplicates(subset='link')
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)

neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]

neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)

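Let pandas do this work for you. It provides very simple functions for iterating over rows and columns:

DataFrame.iterrows() (iterate over DataFrame rows as (index, Series) pairs)

DataFrame.items() (iterate over the DataFrame as (column name, Series) pairs)

DataFrame.itertuples() (iterate over DataFrame rows as namedtuples)

It is easy to lose track of what the index is doing in your code; by using the iterators, you know you are visiting every item that actually exists.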
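To make the point about the index concrete, here is a minimal sketch on made-up data (the frame `df` and its values are hypothetical, not taken from the question). It assumes the index has gaps, which is what a frame typically looks like after rows have been removed with `dropna()`. In that situation `series[0]` is a label lookup and can raise `KeyError: 0`, while the iterators and `.iloc` only visit rows that exist.

```python
import pandas as pd

# Hypothetical sample data. The labels 0 and 2 are missing on purpose,
# as they could be after rows were dropped with dropna().
df = pd.DataFrame(
    {"neighborhood": ["dupont circle", "adams morgan", "kalorama"]},
    index=[1, 3, 4],
)

# iterrows(): (index label, row as Series) pairs
labels = [label for label, row in df.iterrows()]

# items(): (column name, column as Series) pairs
columns = [name for name, column in df.items()]

# itertuples(): one namedtuple per row; index=False leaves the label out
names = [row.neighborhood for row in df.itertuples(index=False)]

# Label-based lookup fails because no row carries the label 0 ...
try:
    df["neighborhood"][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# ... while positional lookup with .iloc does not depend on the labels.
first = df["neighborhood"].iloc[0]
```

A small frame built directly from a list gets the default labels 0, 1, 2, ..., so label and position happen to coincide there. This is one plausible reason the same loop works on the small test data.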
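The pairwise comparison from the question can also be written so that it never touches index labels. The sketch below is one possible approach, not the asker's or the answerer's code. `similarity` and `merge_similar` are hypothetical helper names, and `difflib.SequenceMatcher` from the standard library is used as a stand-in scorer so the example runs without fuzzywuzzy. Its scores differ from `fw.partial_ratio`, so the 90 threshold would need re-tuning.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Stand-in scorer on a 0-100 scale (not fuzzywuzzy's partial_ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def merge_similar(names, threshold=90):
    """Replace each later name with the first earlier name it closely matches."""
    values = list(names)  # plain list: positions only, no index labels
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if similarity(values[i].lower(), values[j].lower()) > threshold:
                values[j] = values[i]
    return values


# Hypothetical frame whose index has gaps, as it could after dropna()
df = pd.DataFrame(
    {"neighborhood": ["alexandria", "alexandria.", "kalorama", "Alexandria"]},
    index=[2, 5, 7, 8],
)

# Assigning a list of the same length writes the values back by position.
df["neighborhood"] = merge_similar(df["neighborhood"])
```

Working on a plain list also avoids the chained assignment `frame['col'][j] = ...`, which pandas warns about. If the original loop is to be kept as written, calling `reset_index(drop=True)` on the frame after `dropna()` restores the labels 0, 1, 2, ... that the loop relies on.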