Python Pandas - KeyError when dropping rows by index in a nested loop
I have a pandas data frame (pd). I am trying to iterate over every tuple (row) of the data frame with a nested for loop, and on each iteration compare that tuple against all the other tuples in the frame. In the comparison step I use Python's difflib.SequenceMatcher().ratio() and drop tuples with high similarity (ratio > 0.8).

Problem: unfortunately, after the first outer-loop iteration, I get a KeyError.

I suspect that by dropping tuples I am invalidating the outer loop's indexer; or that I am trying to access an element that no longer exists (already dropped), invalidating the inner loop's indexer.

The code is as follows:
import json
import pandas as pd
import pyreadline
import pprint
from difflib import SequenceMatcher

def similar(a, b):
    # Helper implied by the rest of the question: the standard
    # SequenceMatcher similarity score in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

# Note, this file, 'tweetsR.json', was originally csv, but has been translated to json.
with open("twitter data/tweetsR.json", "r") as read_file:
    data = json.load(read_file)  # Load the source data set, esport tweets.

df = pd.DataFrame(data)  # Load data into a pandas (pd) data frame for pandas utilities.
df = df.drop_duplicates(['text'], keep='first')  # Drop tweets with identical text content.
                                                 # Note, these tweets are likely reposts/retweets, etc.
df = df.reset_index(drop=True)  # Adjust the index to reflect dropping of duplicates.

def duplicates(df):
    for ind in df.index:
        a = df['text'][ind]
        for indd in df.index:
            if indd != 26747:  # Trying to prevent an overstep KeyError here
                b = df['text'][indd + 1]
                if similar(a, b) >= 0.80:
                    df.drop((indd + 1), inplace=True)
        print(str(ind) + " Completed")  # Debugging statement, tells us which iterations have completed

duplicates(df)
# Create a set of the dropped tuples and run this code on bizon overnight.
import itertools

def duplicates(df):
    # Find out how to improve the speed of this
    excludes = set()
    combos = itertools.combinations(df.index, 2)
    for combo in combos:
        if str(combo) not in excludes:
            if similar(df['text'][combo[0]], df['text'][combo[1]]) > 0.8:
                excludes.add(f'{combo[0]}, {combo[1]}')
                excludes.add(f'{combo[1]}, {combo[0]}')
                print("Dropped: " + str(combo))
    print(len(excludes))

duplicates(df)
Error output:
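The traceback itself was not captured in this copy, but the failure mode can be reproduced in a few lines with toy data (a sketch, not the tweet set): once drop(..., inplace=True) removes a label, any later lookup of that label raises KeyError, which is exactly what the nested loops do after the first deletion.

```python
import pandas as pd

df = pd.DataFrame({'text': ['gg wp', 'gg wp!', 'unrelated']})
df.drop(1, inplace=True)   # label 1 is gone from the index
print(list(df.index))      # → [0, 2]
try:
    df['text'][1]          # a later loop iteration still asks for label 1
except KeyError:
    print("KeyError")      # → KeyError
```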
Can someone help me understand and/or fix this? One solution, which @KazuyaHatta mentioned, is itertools.combinations(). Although the way I am using it (there may be another way) is O(n^2), so in this case, with ~27,000 tuples, that is nearly 357,714,378 combinations to iterate over (too long).
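As an aside, the quoted figure is exactly C(26748, 2) = 26748 × 26747 / 2 = 357,714,378, i.e. the pair count for about 26,748 remaining tuples. One standard way to cheapen those pair checks (a sketch of a technique from the difflib documentation, not code from the question): real_quick_ratio() and quick_ratio() are documented upper bounds on ratio(), so either one can rule a pair out before paying for the expensive full match.

```python
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.8):
    sm = SequenceMatcher(None, a, b)
    # Both methods are upper bounds on ratio() and much cheaper to compute,
    # so a pair that fails either check can be skipped outright.
    if sm.real_quick_ratio() < threshold or sm.quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold

print(is_similar("esports tweet", "esports tweet!"))  # → True
print(is_similar("esports tweet", "zzzz"))            # → False
```

This does not change the O(n^2) pair count, but it makes most pairs cost only a length and character-count comparison instead of a full sequence match.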
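For reference, a sketch of one way to restructure the loop so the KeyError cannot occur (my own rewrite under the question's assumptions, not the original code): collect the labels of near-duplicates while iterating over combinations, and call drop() once after the loop, so no label is ever looked up after it has been removed.

```python
import itertools
from difflib import SequenceMatcher
import pandas as pd

def drop_near_duplicates(df, threshold=0.8):
    to_drop = set()
    for i, j in itertools.combinations(df.index, 2):
        if i in to_drop or j in to_drop:
            continue  # one of the pair is already scheduled for removal
        if SequenceMatcher(None, df['text'][i], df['text'][j]).ratio() > threshold:
            to_drop.add(j)  # keep the earlier tuple, drop the later near-duplicate
    return df.drop(list(to_drop))

frame = pd.DataFrame({'text': ['hello world', 'hello world!', 'goodbye']})
print(drop_near_duplicates(frame)['text'].tolist())  # → ['hello world', 'goodbye']
```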
@KazuyaHatta has described my next step, which is to try dropping rows via a mask.
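A sketch of that mask idea (my interpretation of @KazuyaHatta's suggestion, with toy data): mark rows to keep in a boolean Series and filter once at the end, so the frame's index is never mutated mid-loop and no lookup can go stale.

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({'text': ['good game', 'good game!', 'bad game']})
keep = pd.Series(True, index=df.index)        # start by keeping every row
for i in df.index:
    if not keep[i]:
        continue                              # row already flagged as a duplicate
    for j in df.index[df.index > i]:          # only compare against later rows
        if keep[j] and SequenceMatcher(None, df['text'][i], df['text'][j]).ratio() > 0.8:
            keep[j] = False                   # flag the near-duplicate; do not drop yet
df = df[keep]                                 # one filter at the end, no stale labels
print(df['text'].tolist())                    # → ['good game', 'bad game']
```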
Note: unfortunately, I am not able to post a sample of the data set.
Comments:
Just a tip: you are more likely to get help with this if you include some data to make the example reproducible.
@MaxPower You're right. Give me a moment and I will post a sample of the data set. By the way, this algorithm fails because when the outer loop increments its counter, it collides with the inner loop's index.
You will need itertools.combinations, I think. Create a data frame like that and drop the rows with a conditional mask.
@KazuyaHatta OK, I'll give it a try. It looks like itertools.combinations() may do the trick.