Python: How to convert OCR data into a DataFrame
I have the following cell bounding boxes:
[[23, 19, 1346, 63],
[23, 67, 137, 110],
[141, 67, 344, 110],
[348, 67, 635, 110],
[639, 67, 1346, 110],
[23, 114, 137, 287],
[141, 114, 344, 287],
[348, 114, 635, 287],
[639, 114, 1346, 287],
[23, 291, 137, 507],
[141, 291, 344, 507],
[348, 291, 635, 507],
[639, 291, 1346, 507]]
I also have the OCR output:
[([604, 28, 764, 58], '4th Quarter'),
([42, 78, 118, 103], 'Sr No'),
([217, 78, 266, 103], 'PID'),
([439, 78, 543, 104], 'PName'),
([849, 76, 1133, 107], 'Product Description'),
([69, 126, 90, 151], '1'),
([152, 124, 331, 151], 'IDXY100234'),
([386, 123, 595, 151], 'SQRT-XUIP-34'),
([655, 122, 1332, 155], 'si Jandarmeriei in scopul prevenirii delincventei'),
([655, 165, 1289, 197], 'realizarii unei orientari vocationale adecvate'),
([653, 209, 1189, 241], 'contactele cu diverse institutii pentru'),
([68, 302, 90, 328], '2'),
([155, 300, 335, 329], 'IDXY100346'),
([364, 301, 615, 328], 'MAPK-QKGAP-09'),
([651, 299, 1279, 330], 'introducerea elevilor in mediul comunitar si'),
([650, 343, 1267, 375], 'semestrial-comisia de prevenire a violentei'),
([654, 387, 1276, 418], 'Reprezentativ al Parintilor, suplimentate de'),
([653, 429, 1127, 462], 'consultatii individuale cu parintii;')]
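I want to convert this into a proper DataFrame, like the one shown in the image below.
The output DataFrame should match the table. After converting each cell into a column, I don't know how to proceed.
My code: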
import pandas as pd

# extracted_ocr_data: the list of (bbox, text) tuples shown above
# extracted_cells_bb: the list of cell bounding boxes shown above
df = pd.DataFrame(extracted_ocr_data)
# Finding the centroid of each OCR box; boxes are [x1, y1, x2, y2]
df[2] = df[0].apply(lambda x: ((x[0] + x[2]) / 2, (x[1] + x[3]) / 2))
col_df = pd.DataFrame([])
# Converting each cell into columns after comparing the centroid to the cell co-ordinates
for bbox in extracted_cells_bb:
    df["Cols Bool"] = df[2].apply(lambda c: bbox[0] <= c[0] <= bbox[2] and bbox[1] <= c[1] <= bbox[3])
    col_df = pd.concat([col_df, pd.DataFrame(df[df["Cols Bool"]][1]).reset_index()], axis=1, ignore_index=True)
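One possible way to continue from here is to place each cell in a grid instead of building one column per cell. The sketch below is not a definitive solution and rests on several assumptions: all boxes are [x1, y1, x2, y2]; cells in the same row share exactly the same top edge and cells in the same column share exactly the same left edge (real detections may need a tolerance); rows with merged cells, such as the "4th Quarter" title, are skipped; and the first complete row is the header. The names `build_table`, `cells` and `ocr` are hypothetical and used only for illustration.

```python
import pandas as pd

# Data copied from the question; boxes are assumed to be [x1, y1, x2, y2].
cells = [
    [23, 19, 1346, 63],
    [23, 67, 137, 110], [141, 67, 344, 110], [348, 67, 635, 110], [639, 67, 1346, 110],
    [23, 114, 137, 287], [141, 114, 344, 287], [348, 114, 635, 287], [639, 114, 1346, 287],
    [23, 291, 137, 507], [141, 291, 344, 507], [348, 291, 635, 507], [639, 291, 1346, 507],
]
ocr = [
    ([604, 28, 764, 58], '4th Quarter'),
    ([42, 78, 118, 103], 'Sr No'),
    ([217, 78, 266, 103], 'PID'),
    ([439, 78, 543, 104], 'PName'),
    ([849, 76, 1133, 107], 'Product Description'),
    ([69, 126, 90, 151], '1'),
    ([152, 124, 331, 151], 'IDXY100234'),
    ([386, 123, 595, 151], 'SQRT-XUIP-34'),
    ([655, 122, 1332, 155], 'si Jandarmeriei in scopul prevenirii delincventei'),
    ([655, 165, 1289, 197], 'realizarii unei orientari vocationale adecvate'),
    ([653, 209, 1189, 241], 'contactele cu diverse institutii pentru'),
    ([68, 302, 90, 328], '2'),
    ([155, 300, 335, 329], 'IDXY100346'),
    ([364, 301, 615, 328], 'MAPK-QKGAP-09'),
    ([651, 299, 1279, 330], 'introducerea elevilor in mediul comunitar si'),
    ([650, 343, 1267, 375], 'semestrial-comisia de prevenire a violentei'),
    ([654, 387, 1276, 418], 'Reprezentativ al Parintilor, suplimentate de'),
    ([653, 429, 1127, 462], 'consultatii individuale cu parintii;'),
]


def build_table(cells, ocr):
    """Hypothetical helper: map OCR fragments to grid cells and return a DataFrame."""
    col_edges = sorted({c[0] for c in cells})  # distinct left edges give the column order
    row_edges = sorted({c[1] for c in cells})  # distinct top edges give the row order
    grid = {}
    for x1, y1, x2, y2 in cells:
        # OCR fragments whose centroid lies inside this cell, ordered top to bottom
        inside = sorted(
            (box[1], text)
            for box, text in ocr
            if x1 <= (box[0] + box[2]) / 2 <= x2 and y1 <= (box[1] + box[3]) / 2 <= y2
        )
        grid[(row_edges.index(y1), col_edges.index(x1))] = " ".join(t for _, t in inside)

    rows = []
    for r in range(len(row_edges)):
        row = [grid.get((r, c)) for c in range(len(col_edges))]
        if None not in row:  # assumption: skip rows with merged cells (the spanning title)
            rows.append(row)
    # Assumption: the first complete row holds the column names
    return pd.DataFrame(rows[1:], columns=rows[0])


table = build_table(cells, ocr)
print(table)
```

Multi-line cells such as the product descriptions are joined with a single space here; a newline could be used instead if the line breaks matter.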