Python: iterating over a DataFrame column to compare results
I have a pandas DataFrame with a column of local neighborhoods. What I want to do is go through this column and compare each neighborhood with every other one, in the hope of normalizing the data. When I try this on a small subset of the data in the Python shell, it works fine:
import pandas as pd
from fuzzywuzzy import fuzz as fw

n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 west end/dupont circle
#4 logan circle
#5 alexandria, va
#6 washington
#7 adam morgan/kalorama
#8 Washington DC
#9 Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(),n['neighborhood'][j].lower())
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j]+' - '+str(ratio))
        if ratio>90:
            n['neighborhood'][j] = n['neighborhood'][i]
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j])
print(n)
#results
# neighborhood
#0 Dupont Circle
#1 Adams Morgan
#2 alexandria
#3 Dupont Circle
#4 logan circle
#5 alexandria
#6 washington
#7 Adams Morgan
#8 washington
#9 Kalorama
This is what I expected. However, when I scale up and run it against the data I scraped from craigslist, I get a KeyError.
#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
When I run this code, print(i) and print(j) return 0 and 1 as expected, but then I get the KeyError:
line 871, in __getitem__
    result = self.index.get_value(self, key)
  File "C:\Users\cards\AppData\Local\Programs\Python38-32\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
  File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
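My understanding is that this has to do with how the column and its keys are looked up. But why does it work on the smaller dataset and not on the larger one?
Full scraping code: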
from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw
print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)
neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0
for page in pages:
    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'&query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')
    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
            count+=1
print(count)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results.drop_duplicates(subset='link')
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)
for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
        print(i)
        print(j)
        ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
        if ratio>90:
            neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)
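Let pandas do the work for you. It provides very simple functions for iterating over rows and columns:
- iterate over DataFrame rows as (index, Series) pairs
- iterate over the DataFrame as (column name, Series) pairs
- iterate over DataFrame rows as namedtuples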
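The three descriptions in the list read like the pandas summaries for `DataFrame.iterrows`, `DataFrame.items` and `DataFrame.itertuples`. That mapping is an assumption, since the function names themselves are not given here. A minimal sketch of each, on a made-up frame whose index deliberately skips values:

```python
import pandas as pd

# Made-up frame with a non-contiguous index (labels 2, 5, 7),
# similar to what is left after some rows have been dropped.
df = pd.DataFrame(
    {"neighborhood": ["dupont circle", "adams morgan", "kalorama"]},
    index=[2, 5, 7],
)

# iterrows(): yields (index label, row as a Series)
labels = [label for label, row in df.iterrows()]

# items(): yields (column name, column as a Series)
columns = [name for name, column in df.items()]

# itertuples(): yields one namedtuple per row; row.Index holds the label
names = [row.neighborhood for row in df.itertuples()]

print(labels)   # [2, 5, 7]
print(columns)  # ['neighborhood']
print(names)    # ['dupont circle', 'adams morgan', 'kalorama']
```

None of these loops needs to know which index labels exist, so gaps in the index cannot cause a failed lookup.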
It is easy to lose track of the role the index plays in your code; by using an iterator, you know you are visiting every item that actually exists.
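As for why the small example works and the scraped data does not, one likely explanation (an assumption, since the scraped rows are not shown) is the `dropna` call in the cleaning step: it removes rows but keeps the original index labels, and `series[i]` on an integer index looks up the label `i`, not the i-th position. If the row labelled 0 was dropped, `[0]` raises `KeyError: 0`. The sketch below reproduces this on invented data and shows one possible fix, `reset_index(drop=True)`. The `similarity` helper is a hypothetical stand-in for `fw.partial_ratio`, built on the standard library's `difflib` so the example runs without fuzzywuzzy; its scores will differ from fuzzywuzzy's.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def similarity(a, b):
    """Hypothetical stand-in for fw.partial_ratio: a 0-100 score from difflib."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# Invented data; the first row has neither footage nor bedroom.
post_results = pd.DataFrame({
    "neighborhood": ["dupont circle", "adams morgan", "adam morgan", "kalorama"],
    "footage": [np.nan, 700, 650, np.nan],
    "bedroom": [np.nan, 1, 2, 2],
})

# Same cleaning step as in the question: row 0 is dropped,
# so the remaining index labels are 1, 2, 3.
post_results = post_results.dropna(subset=["footage", "bedroom"], how="all")

try:
    post_results["neighborhood"][0]  # label lookup, and label 0 is gone
    raised = False
except KeyError:
    raised = True

# Possible fix: renumber the rows 0..n-1 so labels and positions agree again.
neighborhood_results = post_results[["neighborhood"]].reset_index(drop=True)

n_rows = len(neighborhood_results)
for i in range(n_rows):
    for j in range(i + 1, n_rows):
        a = neighborhood_results.loc[i, "neighborhood"]
        b = neighborhood_results.loc[j, "neighborhood"]
        if similarity(a, b) > 90:
            # Assigning through .loc avoids chained indexing.
            neighborhood_results.loc[j, "neighborhood"] = a

print(raised)                                        # True
print(neighborhood_results["neighborhood"].tolist())
```

The small shell example never drops rows, so its index stays 0 to 9 and label lookup happens to coincide with position. Using `.iloc` for positional access would be another way to sidestep the mismatch.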