Python: efficient algorithm for following elements

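I have split a PDF file into elements and loaded them into pandas. Each element has left and top coordinates as well as a width and a height.

Some PDF files are odd: a single word is split into multiple elements. Picture a line of pieces that share the same top coordinate, where each piece's left coordinate equals the previous piece's left + width.

The problem is that the coordinates are sometimes not exactly equal, so I need some tolerance. I wrote the method below, but it is very inefficient and I don't know how to make it work better with pandas.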

    @classmethod
    def _merge_following_elements(cls, original_df: DataFrame) -> DataFrame:
        df = original_df.copy()
        i = 0

        while i < df.index[-1]:
            try:
                row = df.loc[i]
            except KeyError:
                # row can be deleted so continue to next
                i += 1
                continue

            left = row['left'] + row['width'] - 1
            right = row['left'] + row['width'] + 1
            top = row['top'] - (row['height'] * 0.3)
            bottom = row['top'] + (row['height'] * 0.3)

            match = df[
                df['page'].eq(row['page'])
                & df['left'].between(left, right)
                & df['top'].between(top, bottom)
                ]

            # continue to next element when following not found
            # ignore when multiple elements found - we have nothing to do
            if match.empty or len(match) > 1:
                i += 1
                continue

            following = match.iloc[0]
            following_index = match.index[0]

            df.loc[i, 'width'] += following['width']
            df.loc[i, 'text'] += following['text']
            df.drop(index=following_index, inplace=True)

        for i, row in df.iterrows():
            df.loc[i, 'text'] = cls._replace_typos(row['text'])

        df.reset_index(inplace=True)

        return df
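
Now I don't know how to make it better. I tried sorting by top and then shifting values to compare the current row with the previous one, but since parts of the same word can have slightly different top coordinates, I can't simply sort by top.

Thanks for your help. Jakub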
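
One possible direction, offered as a rough sketch rather than a drop-in replacement: instead of sorting by top, pair every element with every other element on the same page through a self-merge, then keep only the pairs that satisfy the same tolerances as the loop in the question (left within ±1 of the previous left + width, top within 30% of the height). The function names `find_following` and `merge_following_elements` and the parameters `x_tol` and `y_ratio` are made up for this illustration. The sketch assumes the columns `page`, `left`, `top`, `width`, `height` and `text`, positive widths, and unique index labels. It drops any other columns and leaves out the `_replace_typos` step. Unlike the loop, it computes all links from the original widths and tops, so results can differ slightly on long chains. The self-merge is quadratic per page, which should be acceptable for pages with a few hundred elements.

```python
import pandas as pd


def find_following(df: pd.DataFrame, x_tol: float = 1.0, y_ratio: float = 0.3) -> dict:
    """Map each row label to the label of its single following element."""
    cand = df[["page", "left", "top", "width", "height"]].copy()
    cand["idx"] = df.index.to_numpy()
    # Pair every element with every element on the same page.
    pairs = cand.merge(cand, on="page", suffixes=("", "_next"))
    right_edge = pairs["left"] + pairs["width"]
    mask = (
        (pairs["idx"] != pairs["idx_next"])  # assumption: never pair a row with itself
        & pairs["left_next"].between(right_edge - x_tol, right_edge + x_tol)
        & (pairs["top_next"] - pairs["top"]).abs().le(pairs["height"] * y_ratio)
    )
    pairs = pairs[mask]
    # Like the loop in the question, skip elements with several candidates.
    pairs = pairs[~pairs["idx"].duplicated(keep=False)]
    return dict(zip(pairs["idx"].tolist(), pairs["idx_next"].tolist()))


def merge_following_elements(df: pd.DataFrame, x_tol: float = 1.0,
                             y_ratio: float = 0.3) -> pd.DataFrame:
    """Merge chains of following elements into their first element."""
    nxt = find_following(df, x_tol, y_ratio)
    followers = set(nxt.values())
    head_of = {}
    for label in df.index:
        if label in followers:
            continue  # not the start of a chain
        node = label
        # Walk the chain and tag every member with the label of its head.
        while node is not None and node not in head_of:
            head_of[node] = label
            node = nxt.get(node)
    # Rows that were never reached (for example a cycle) stay unmerged.
    groups = [head_of.get(label, label) for label in df.index]
    # Sorting by left puts the pieces of each chain in reading order.
    ordered = df.assign(_group=groups).sort_values("left", kind="stable")
    merged = ordered.groupby("_group", sort=True).agg(
        page=("page", "first"),
        left=("left", "first"),
        top=("top", "first"),
        width=("width", "sum"),
        height=("height", "first"),
        text=("text", "".join),
    )
    return merged.reset_index(drop=True)
```

Calling `merge_following_elements(df)` on the element frame returns one row per merged chain, in the order of the chain heads. The only Python-level loop left is the walk over the link dictionary, which visits each element at most once.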