Python: efficient algorithm for tracking consecutive elements

python pandas

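I have split a PDF file into elements and loaded them into pandas. Each element has left and top coordinates as well as a width and a height.

Some PDF files are odd: a single word is split across several elements. The next fragment sits on the same line, with the same top coordinate and a left coordinate equal to the previous element's left + width.

The problem is that the coordinates are sometimes not exactly equal, so I need some tolerance. I wrote the method below, but it is very inefficient and I don't know how to do this better with pandas.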

    @classmethod
    def _merge_following_elements(cls, original_df: DataFrame) -> DataFrame:
        df = original_df.copy()
        i = 0

        while i < df.index[-1]:
            try:
                row = df.loc[i]
            except KeyError:
                # row can be deleted so continue to next
                i += 1
                continue

            left = row['left'] + row['width'] - 1
            right = row['left'] + row['width'] + 1
            top = row['top'] - (row['height'] * 0.3)
            bottom = row['top'] + (row['height'] * 0.3)

            match = df[
                df['page'].eq(row['page'])
                & df['left'].between(left, right)
                & df['top'].between(top, bottom)
                ]

            # continue to next element when following not found
            # ignore when multiple elements found - we have nothing to do
            if match.empty or len(match) > 1:
                i += 1
                continue

            following = match.iloc[0]
            following_index = match.index[0]

            df.loc[i, 'width'] += following['width']
            df.loc[i, 'text'] += following['text']
            df.drop(index=following_index, inplace=True)

        for i, row in df.iterrows():
            df.loc[i, 'text'] = cls._replace_typos(row['text'])

        df.reset_index(inplace=True)

        return df

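Now I don't know how to improve it. I tried sorting by top and then using shift to compare each row with the previous one, but since parts of the same word can have slightly different top coordinates, I can't simply sort by top.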

Thanks for your help. Jakub
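One possible direction, given only as a rough sketch and not as a drop-in replacement: instead of filtering the whole DataFrame once per row, compare all elements of a page against each other in one NumPy broadcast, link each element to its unique follower, and then aggregate every chain with `groupby`. The function name `merge_following_elements` and the parameters `x_tol` and `y_ratio` are made up for this illustration; the column names and tolerances (1 unit horizontally, 30% of the height vertically) are taken from the method above. Two deliberate simplifications: followers are linked pairwise rather than re-matched against the growing width, and the `_replace_typos` step is left out.

```python
import numpy as np
import pandas as pd


def merge_following_elements(df: pd.DataFrame, x_tol: float = 1.0,
                             y_ratio: float = 0.3) -> pd.DataFrame:
    """Hypothetical helper: merge fragments that continue each other on a line."""
    df = df.reset_index(drop=True)
    n = len(df)
    successor = np.full(n, -1, dtype=int)  # -1 means "no unique follower"

    for _, page in df.groupby("page"):
        idx = page.index.to_numpy()
        left = page["left"].to_numpy(dtype=float)
        top = page["top"].to_numpy(dtype=float)
        right_edge = left + page["width"].to_numpy(dtype=float)
        y_tol = page["height"].to_numpy(dtype=float) * y_ratio

        # close[i, j] is True when element j starts where element i ends
        # and both sit on (roughly) the same line.
        close = (
            (np.abs(left[None, :] - right_edge[:, None]) <= x_tol)
            & (np.abs(top[None, :] - top[:, None]) <= y_tol[:, None])
        )
        np.fill_diagonal(close, False)

        # As in the original method, ambiguous matches are ignored.
        unique = close.sum(axis=1) == 1
        successor[idx[unique]] = idx[close[unique].argmax(axis=1)]

    # Walk each chain once, starting from elements that follow nobody.
    has_predecessor = np.zeros(n, dtype=bool)
    has_predecessor[successor[successor >= 0]] = True
    group = np.full(n, -1, dtype=int)
    order = np.zeros(n, dtype=int)
    starts = np.concatenate(
        [np.flatnonzero(~has_predecessor), np.flatnonzero(has_predecessor)]
    )
    for start in starts:
        pos, cur = 0, start
        while cur >= 0 and group[cur] < 0:  # stop at chain end or visited row
            group[cur], order[cur] = start, pos
            pos += 1
            cur = successor[cur]

    return (
        df.assign(group=group, order=order)
        .sort_values(["group", "order"])
        .groupby("group", sort=False)
        .agg(
            page=("page", "first"),
            left=("left", "first"),
            top=("top", "first"),
            height=("height", "first"),
            width=("width", "sum"),
            text=("text", lambda parts: "".join(parts)),
        )
        .reset_index(drop=True)
    )


# Made-up sample data: "Hel" + "lo" should merge, everything else stays separate.
elements = pd.DataFrame({
    "page": [1, 1, 1, 2, 2],
    "left": [10.0, 25.4, 50.0, 10.0, 25.0],
    "top": [100.0, 100.5, 100.0, 100.0, 120.0],
    "width": [15, 10, 25, 15, 15],
    "height": [10, 10, 10, 10, 10],
    "text": ["Hel", "lo", "world", "foo", "bar"],
})
result = merge_following_elements(elements)
print(result[["page", "text", "width"]])
```

The pairwise comparison needs memory quadratic in the number of elements per page, which is usually acceptable for a single PDF page but would need a different strategy (for example sorting by left and using `numpy.searchsorted`) for very dense pages.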