Python: Iterating over a DataFrame column to compare results

Tags: python, pandas, numpy, dataframe

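I have a pandas DataFrame with a column of local neighborhoods. What I want to do is go through this column and compare each neighborhood with the others, with a view to serializing (normalizing) the data. When I use a small subset of the data in the Python shell, it works fine: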

n = pd.DataFrame({'neighborhood':['Dupont Circle', 'Adams Morgan', 'alexandria', 'west end/dupont circle', 'logan circle', 'alexandria, va', 'washington', 'adam morgan/kalorama', 'Washington DC', 'Kalorama']})
print(n)
#results
#            neighborhood
#0           Dupont Circle
#1            Adams Morgan
#2              alexandria
#3  west end/dupont circle
#4            logan circle
#5          alexandria, va
#6              washington
#7    adam morgan/kalorama
#8           Washington DC
#9                Kalorama
for i in range(len(n['neighborhood'])):
    for j in range(i + 1, len(n['neighborhood'])):
        ratio = fw.partial_ratio(n['neighborhood'][i].lower(),n['neighborhood'][j].lower())
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j]+' - '+str(ratio))
        if ratio>90:
            n['neighborhood'][j] = n['neighborhood'][i]
        print(n['neighborhood'][i]+' : '+n['neighborhood'][j])
print(n)
#results
#   neighborhood
#0  Dupont Circle
#1   Adams Morgan
#2     alexandria
#3  Dupont Circle
#4   logan circle
#5     alexandria
#6     washington
#7   Adams Morgan
#8     washington
#9       Kalorama
This is what I expected. However, when I scale up and run it against the data I scraped from craigslist, I get a KeyError.

#this is from my main data source
neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
            print(i)
            print(j)
            ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
            if ratio>90:
                neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]
When I run this code, print(i) and print(j) return 0 and 1 as expected, but then I get a KeyError:

line 871, in __getitem__
    result = self.index.get_value(self, key)
File "C:\Users\cards\AppData\Local\Programs\Python38-32\lib\site-packages\pandas\core\indexes\base.py", line 4405, in get_value
    return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
File "pandas\_libs\index.pyx", line 80, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 90, in pandas._libs.index.IndexEngine.get_value
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 998, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.Int64HashTable.get_item

KeyError: 0

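My understanding is that this has to do with how columns and keys are looked up. But why does it work on the smaller dataset and not on the larger one?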
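The symptom can be reproduced on a tiny frame. The sketch below is an illustration only: the three rows are made up, and it assumes that the dropna step in the scraping code removed the row labelled 0, which would explain why a lookup by the label 0 fails while the small hand-built DataFrame (labels 0 to 9) works.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped data: the first row has neither
# footage nor bedroom, so dropna(how='all') removes it.
post_results = pd.DataFrame({
    'neighborhood': ['dupont circle', 'adams morgan', 'alexandria'],
    'footage': [np.nan, '700', '850'],
    'bedroom': [np.nan, '1', np.nan],
})
post_results = post_results.dropna(subset=['footage', 'bedroom'], how='all')

hoods = post_results['neighborhood']
print(list(hoods.index))  # the label 0 is no longer in the index

try:
    hoods[0]              # label-based lookup on an integer index
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Position-based lookup does not depend on the labels.
first_by_position = hoods.iloc[0]
print(lookup_failed, first_by_position)
```

Whether this is what happened in the real run depends on the scraped rows; printing neighborhood_results.index before the loop would confirm it.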

Full scraping code:

from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
from fuzzywuzzy import fuzz as fw

print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)

neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
count = 0
for page in pages:
    
    response = get('https://washingtondc.craigslist.org/search/hhh?s='+str(page)+'&query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')

    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        if post.find('span',class_='result-hood') is not None:
            post_url = post.find('a',class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span',class_='result-hood').text
            post_price = int(post.find('span',class_='result-price').text.strip().replace('$',''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span',class_='housing') is not None:
                if 'ft2' in post.find('span',class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span',class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())>2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = post.find('span',class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span',class_='housing').text.split())==2:
                    post_bedroom = post.find('span',class_='housing').text.replace("br","").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
        count+=1
       
print(count)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results = post_results.drop_duplicates(subset='link')  # drop_duplicates returns a new frame
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str.strip('(|)')
post_results['neighborhood'] = post_results['neighborhood'].str.lower()
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)

neighborhood_results = post_results[['neighborhood']].copy()
neighborhood_results.to_csv('neighborhood_clean.csv',index=False)

for i in range(len(neighborhood_results['neighborhood'])):
    for j in range(i + 1, len(neighborhood_results['neighborhood'])):
            print(i)
            print(j)
            ratio = fw.partial_ratio(neighborhood_results['neighborhood'][i],neighborhood_results['neighborhood'][j])
            if ratio>90:
                neighborhood_results['neighborhood'][j] = neighborhood_results['neighborhood'][i]

neighborhood_results.to_csv('neighborhood_clean_a.csv',index=False)

Let pandas do the work for you. It provides very simple functions for iterating over rows and columns:

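  • DataFrame.iterrows(): iterate over DataFrame rows as (index, Series) pairs
  • DataFrame.items() (iteritems() in older pandas releases): iterate over the DataFrame as (column name, Series) pairs
  • DataFrame.itertuples(): iterate over DataFrame rows as namedtuples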
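A short sketch of the three iterators, using a made-up frame whose index labels are deliberately not 0, 1, 2. The variable names are illustrative only.

```python
import pandas as pd

n = pd.DataFrame(
    {'neighborhood': ['Dupont Circle', 'Adams Morgan', 'alexandria']},
    index=[10, 20, 30],  # labels chosen to differ from positions
)

# iterrows(): yields (index label, row as a Series)
labels = [label for label, row in n.iterrows()]

# items(): yields (column name, column as a Series)
columns = [name for name, column in n.items()]

# itertuples(): yields one namedtuple per row; .Index holds the label
names = [row.neighborhood for row in n.itertuples()]

print(labels, columns, names)
```

None of these loops needs to know what the index labels are, so gaps in the index cannot cause a failed lookup.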

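It is easy to lose track of what the index is doing in your code; by using an iterator, you know you are visiting every item that actually exists.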
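One way the pairwise comparison from the question could be written so that it never looks rows up by label is sketched below. This is a sketch under assumptions, not the asker's code: similarity() is a hypothetical stand-in built on difflib so the example runs without fuzzywuzzy (swap in fw.partial_ratio for the real thing), merge_similar() is an illustrative helper name, and the sample data and index labels are made up.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Hypothetical stand-in for fw.partial_ratio: a 0-100 score."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar(series, threshold=90):
    """Replace each value with the first earlier value it closely matches."""
    values = series.tolist()  # plain list, so positions 0..n-1 always exist
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if similarity(values[i], values[j]) > threshold:
                values[j] = values[i]
    # Rebuild the Series with the original (possibly gapped) index.
    return pd.Series(values, index=series.index, name=series.name)


hoods = pd.Series(
    ['Adams Morgan', 'alexandria', 'adams morgan', 'Alexandria'],
    index=[3, 7, 8, 12],  # gaps, as left behind by dropna
    name='neighborhood',
)
cleaned = merge_similar(hoods)
print(cleaned)
```

Working on a plain list also avoids the chained assignment of the form df['col'][j] = value. Alternatively, calling reset_index(drop=True) after the filtering step renumbers the rows from 0, after which lookups by label 0..n-1 succeed again.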
