Python: importing text extracted from a scanned PDF into a CSV

python regex csv dataframe

I extracted text from a scanned PDF using Tesseract. The output string looks like this:

Haemoglobin 13.5 14-16 g/dl
Random Blood Sugar 186 60 - 160 mg/dl
Random Urine Sugar Nil
¢ Blood Urea 43 14-40 mg/dl
4 — Serum Creatinine 2.13 0.4-1.5 mg/dl
Serum Uric Acid 4.9 3.4-7.0 mg/dl
Serum Sodium 142 135 - 150 meq/L
/ Serum Potassium 2.6 3.5-5.0 meq/L
Total Cholesterol] 146 110 - 160 mg/dl
Triglycerides 162 60 - 180 mg/d]

Now I need to load this into a DataFrame or CSV, with one column holding all of the text and the other columns holding the values, i.e.

**Haemoglobin**            13.5   14-16     g/dl
**Random Blood Sugar**     186    60 - 160  mg/dl

The best I have managed so far is this:

text = text.split('\n')
text = [x.split(' ') for x in text]
df = pd.DataFrame(text, columns=['Header','Detail','a','e','b','c','d','f'])
df

    Header    Detail   a      e     b      c      d  f
0 Haemoglobin 13.5    14-16   g/dl  None   None  None  None
1 Random      Blood   Sugar   186   60      -     160  mg/dl
2 Random      Urine   Sugar   Nil   None   None  None  None

Please help.

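Work backwards from the end of each line, because everything after the description appears to follow a fixed format. Reading from right to left, the fields are:

unit string (no spaces) : number : dash : number : number : the text you want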

Haemoglobin 13.5 14-16 g/dl
Field 5 (all characters backwards from end until space reached) = g/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 16
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 14
Field 1 (jump over space, all characters backwards until space reached) = 13.5
Field 0 (jump over space and take the rest) = Haemoglobin

Total Cholesterol] 146 110 - 160 mg/dl
Field 5 (all characters backwards from end until space reached) = mg/dl
Field 4 (jump over space, all characters backwards until space or dash reached) = 160
Field 3 (jump over space if present, pick up dash) = -
Field 2 (jump over space if present, all characters backwards until space reached) = 110
Field 1 (jump over space, all characters backwards until space reached) = 146
Field 0 (jump over space and take the rest) = Total Cholesterol]
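
The field-by-field walk above can be sketched in Python as follows. This is only an illustration of the idea, not code from the original answer: the helper names `take_back` and `parse_backwards` are made up here, and the sketch assumes every parsable line ends with `low - high unit`. Lines that do not fit that shape (such as `Random Urine Sugar Nil`) are returned as `None` rather than guessed at.

```python
def take_back(s, stops):
    """Split s into (remainder, token), where token is the trailing run of
    characters up to (not including) the nearest character found in stops."""
    s = s.rstrip()
    i = len(s)
    while i > 0 and s[i - 1] not in stops:
        i -= 1
    return s[:i], s[i:]


def parse_backwards(line):
    """Return [description, result, low, '-', high, unit], or None if the
    line does not end with the assumed 'low - high unit' layout."""
    rest, unit = take_back(line, ' ')      # Field 5: back to the last space
    rest, high = take_back(rest, ' -')     # Field 4: back to a space or dash
    rest = rest.rstrip()
    if not rest.endswith('-'):             # Field 3: the dash must be here
        return None
    rest = rest[:-1]
    rest, low = take_back(rest, ' ')       # Field 2
    rest, result = take_back(rest, ' ')    # Field 1
    description = rest.strip()             # Field 0: whatever is left
    return [description, result, low, '-', high, unit]


if __name__ == '__main__':
    print(parse_backwards('Haemoglobin 13.5 14-16 g/dl'))
    print(parse_backwards('Total Cholesterol] 146 110 - 160 mg/dl'))
    print(parse_backwards('Random Urine Sugar Nil'))
```

Note that leading OCR noise (for example the `/ ` in front of `Serum Potassium`) ends up inside the description with this approach, so it would still need a separate clean-up step.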

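I should point out that this takes a fair amount of work and, honestly, you have not shown any real attempt yet. But to give you a good start, here is a piece of code that deals with some of the obvious problems in the input: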

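import re

def isnum(x):
    # True if the token parses as a float, e.g. '13.5' or '186'
    try:
        float(x)
        return True
    except ValueError:
        return False

def clean_line(lnin):
    # strip leading OCR garbage (everything before the first letter), then tokenize
    ln = re.sub('^[^A-Za-z]+', '', lnin).split()
    # the first numeric token marks the end of the header
    for i in range(len(ln)):
        if isnum(ln[i]):
            ind = i
            break
    else:
        # assumes every line has a numeric result; 'Random Urine Sugar Nil' does not
        raise ValueError('no numeric value in line: ' + lnin)
    Header = ' '.join(ln[:ind])
    ln = [Header] + ln[ind:]
    # rejoin a range that was split into three tokens, e.g. '60', '-', '160'
    if '-' in ln:
        ind = ln.index('-')
        ln[ind-1] = ln[ind-1] + '-' + ln[ind+1]
        del ln[ind:ind+2]
    return ln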

Clean each line with the clean_line function. You can then feed the result into a DataFrame.
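
As a rough illustration of that last step, the sketch below cleans each line and writes the rows out as CSV text using only the standard library. It is not the answerer's code: `is_result`, `clean_line_with_nil` and `lines_to_csv` are hypothetical names, the column headers are invented, and treating the literal `Nil` as a valid result is an assumption based on the sample data.

```python
import csv
import io
import re


def is_result(token):
    """True for a numeric token or the literal 'Nil' (an assumption drawn
    from the sample data, not a general rule)."""
    if token == 'Nil':
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def clean_line_with_nil(line):
    """Variant of clean_line that always returns four columns."""
    tokens = re.sub('^[^A-Za-z]+', '', line).split()
    for i, tok in enumerate(tokens):
        if is_result(tok):
            break
    else:
        return [line.strip(), '', '', '']     # no result found: keep raw text
    row = [' '.join(tokens[:i])] + tokens[i:]
    if '-' in row:                            # rejoin a range split as "60 - 160"
        j = row.index('-')
        row[j - 1] = row[j - 1] + '-' + row[j + 1]
        del row[j:j + 2]
    return row + [''] * (4 - len(row))        # pad short rows


def lines_to_csv(lines):
    """Return CSV text with one row per non-empty input line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['Header', 'Result', 'Range', 'Unit'])
    for line in lines:
        if line.strip():
            writer.writerow(clean_line_with_nil(line))
    return out.getvalue()


if __name__ == '__main__':
    sample = ['Haemoglobin 13.5 14-16 g/dl',
              'Random Blood Sugar 186 60 - 160 mg/dl',
              'Random Urine Sugar Nil']
    print(lines_to_csv(sample))
```

The same list of rows could be passed to `pd.DataFrame(rows, columns=[...])` instead of `csv.writer` if pandas is available.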

Using regular expressions, the sample code below parses the text into a CSV string with the fields: description, result, normal value, unit.

Note: the list test_results would normally be read from a file, using:

with open('name_test_file') as test_file:
    test_results = test_file.read().splitlines()

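Can you tell us what you have tried so far?

Is it always a numeric value after the header? Or can it also be a range like "14-16"?

I split each line of text on spaces and created a data frame: text = [x.split(' ') for x in text]; df = pd.DataFrame(text, columns=['Header','Detail','a','e','b','c','d','f']); df

This is not a solution to her problem; it is more of a comment than an answer.

Reworded it a little, but I don't see why this isn't an answer. If I needed to load this data, I would do it the way I suggested. Or are you saying I need to provide complete code that implements the suggestion? (If so, since I am not a Python developer, I will have to delete my answer.)

A sample snippet to work with would be helpful.

Edited accordingly.

ln=Header+ln[ind:] gives me the error "must be str, not list" on this line.

Sorry, please change it to

ln=[Header]+ln[ind:]

@AishwaryaVenkat Using regular expressions is not easy when you have many different patterns. For example, in the line

Random Urine Sugar Nil

what is the rule for the header, and what is the rule for the rest? I assumed there is always a float, and if you remove that line from the data the code works fine. For a general solution, though, you need an exception handler.

This works great! Thank you so much :) Now I can sleep in peace.

No problem. Don't forget to accept it as the answer.

I spent some time producing the code above to help you, so I would appreciate it if you could remove your downvote! Thanks...

I don't think the asker has upvoted or downvoted yet; it must have been someone else.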

import re
tests = 'Haemoglobin 13.5 14-16 g/dl\nRandom Blood Sugar 186 60 - 160 mg/dl\n'\
    'Random Urine Sugar Nil\n¢ Blood Urea 43 14-40 mg/dl\n'\
    '4 — Serum Creatinine 2.13 0.4-1.5 mg/dl\n'\
    'Serum Uric Acid 4.9 3.4-7.0 mg/dl\nSerum Sodium 142 135 - 150 meq/L\n'\
    '/ Serum Potassium 2.6 3.5-5.0 meq/L\n'\
    'Total Cholesterol] 146 110 - 160 mg/dl\n'\
    'Triglycerides 162 60 - 180 mg/d]\n'

test_results = tests.splitlines()

for test_result in test_results:
    print('input :', test_result)

    # Full pattern: skip leading noise up to the first two consecutive letters,
    # then capture description, result, normal range (may contain spaces) and unit.
    m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ][0-9])'
                  r'[ ](?P<result>[0-9.]*?)(?=[ ][0-9])'
                  r'[ ](?P<normal_value>[ 0-9.\-]*?)(?=[ ][a-zA-Z])'
                  r'[ ](?P<unit>.[ a-zA-Z/]*)',
                  test_result)

    if m is not None:
        normal_value = m.group('normal_value')
        unit = m.group('unit')

    else:
        # Fallback for lines whose result is the literal 'Nil' (no range or unit).
        m = re.search(r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ]Nil)'
                      r'[ ](?P<result>Nil).*',
                      test_result)
        normal_value = ''
        unit = ''

    if m is not None:
        description = m.group('description')
        result = m.group('result')

    else:
        # Neither pattern matched: keep the raw line as the description.
        description = test_result
        result = ''

    write_string = description + ',' + result + ',' + normal_value + ',' + unit
    print(write_string)
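
One possible refinement, offered only as a sketch: joining the fields with ',' by hand produces a broken row if a field ever contains a comma, whereas the standard `csv` module quotes such fields automatically. The version below reuses the two patterns from the code above unchanged. The names `FULL`, `NIL`, `parse` and `to_csv_text`, and the header row, are made up for this illustration.

```python
import csv
import io
import re

# The two patterns from the answer above, compiled once.
FULL = re.compile(
    r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ][0-9])'
    r'[ ](?P<result>[0-9.]*?)(?=[ ][0-9])'
    r'[ ](?P<normal_value>[ 0-9.\-]*?)(?=[ ][a-zA-Z])'
    r'[ ](?P<unit>.[ a-zA-Z/]*)')
NIL = re.compile(
    r'.*?(?=[a-zA-Z][a-zA-Z])(?P<description>.*?)(?=[ ]Nil)'
    r'[ ](?P<result>Nil).*')


def parse(line):
    """Return (description, result, normal_value, unit) for one line."""
    m = FULL.search(line)
    if m is not None:
        return (m.group('description'), m.group('result'),
                m.group('normal_value'), m.group('unit'))
    m = NIL.search(line)
    if m is not None:
        return (m.group('description'), m.group('result'), '', '')
    return (line, '', '', '')              # unparsed: keep the raw line


def to_csv_text(lines):
    """Build CSV text; csv.writer takes care of quoting."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['description', 'result', 'normal_value', 'unit'])
    for line in lines:
        writer.writerow(parse(line))
    return out.getvalue()


if __name__ == '__main__':
    print(to_csv_text(['Haemoglobin 13.5 14-16 g/dl',
                       'Random Urine Sugar Nil']))
```

To write a file instead of building a string, the same `csv.writer` can be given a file opened with `open(path, 'w', newline='')`.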