Python Google Vision: extracting the confidence of each word after extracting the full text with full_text_annotation.text

python python-3.x google-cloud-platform

I am using the following function

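import io

from google.cloud import vision


def detect_document(path):
    client = vision.ImageAnnotatorClient()

    # Read the image file as raw bytes.
    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    # google-cloud-vision < 2.0; on 2.0 and later use vision.Image(content=content).
    image = vision.types.Image(content=content)

    response = client.document_text_detection(image=image)

    # Normalize the full text: casefold and strip a few punctuation marks.
    text = response.full_text_annotation.text
    text = text.casefold()
    text = text.replace('(', '')
    text = text.replace(')', '')
    text = text.replace(':', '')
    text = text.replace('.', '')

    return text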

to extract the following text from a handwritten application form:

a bank challan
bank branch abc mute deposit id 005saetm-0055 deposit date 14 ml 19
b personal information use capital letters and leave spaces between words
name muhammad hanif tiid
father's name muhammad yaqoob tiittitttt
computerized nic no 44 303-5214 345-3
d d m m y y y y
gender male age in years 22 date of birth  4-08-1999
domicile district mirpuskhas contact no 0333-7072258
please do not mention converted no
postal address anmol book depo naukot taluka jhuddo disstti mps
sindh
are you government servant yes
if yes, please attach noc
no
✓
religion muslim
✓
non-muslim o
c academic information
intermediate/hssc eng mirpuskhas bise match b 2016
matric/ssc seience bisemirpurkhang match a 2014
d any other certifications/diploma/professional degrees shorthand, dit, cit etc
name
le

I then apply regular expression patterns to the extracted text.
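The post does not show the patterns themselves, so the following is only a minimal sketch of what per-field regex extraction over the text above could look like. The pattern names and the `extract_fields` helper are hypothetical, chosen for illustration.

```python
import re

# Hypothetical patterns; the original post does not show its actual regexes.
FIELD_PATTERNS = {
    "name": re.compile(r"^name (.+)$", re.MULTILINE),
    "cnic": re.compile(r"^computerized nic no (.+)$", re.MULTILINE),
}


def extract_fields(text):
    """Return {field: raw matched string} for every pattern that matches."""
    fields = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[field] = match.group(1).strip()
    return fields


# Three lines copied from the OCR output shown earlier.
sample = (
    "name muhammad hanif tiid\n"
    "father's name muhammad yaqoob tiittitttt\n"
    "computerized nic no 44 303-5214 345-3\n"
)
fields = extract_fields(sample)
```

Anchoring each pattern at the start of a line keeps the `name` pattern from matching the "father's name" line.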

Now I want to create a log of all the processing done for each field, for example:

<name>
<origin>muhammad hanif tiid</origin>
<originscore>78.2</originscore>
<final>muhammad hanif</final>
<corrections>4</corrections>
</name>
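One possible way to produce such a log entry is with the standard library's `xml.etree.ElementTree`. This is a sketch only: the `build_field_log` helper and its parameters are hypothetical, and the values are the ones from the example above.

```python
import xml.etree.ElementTree as ET


def build_field_log(field, origin, origin_score, final, corrections):
    """Serialize one field's processing log as an XML string (hypothetical helper)."""
    root = ET.Element(field)
    ET.SubElement(root, "origin").text = origin
    ET.SubElement(root, "originscore").text = str(origin_score)
    ET.SubElement(root, "final").text = final
    ET.SubElement(root, "corrections").text = str(corrections)
    return ET.tostring(root, encoding="unicode")


entry = build_field_log("name", "muhammad hanif tiid", 78.2, "muhammad hanif", 4)
print(entry)
```

The output is a single line without indentation; the tag names simply mirror the log format requested in the question.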

This alone does not solve the problem.

What can I try next?

Replace this code snippet:

text = response.full_text_annotation.text
text = text.casefold()
text = text.replace('(', '')
text = text.replace(')', '')
text = text.replace(':', '')
text = text.replace('.', '')

return text

with:

# Walk the annotation hierarchy: page -> block -> paragraph -> word -> symbol.
for page in response.full_text_annotation.pages:
    for block in page.blocks:
        for paragraph in block.paragraphs:
            for word in paragraph.words:
                # A word's text is the concatenation of its symbols.
                word_text = ''.join([
                    symbol.text for symbol in word.symbols
                ])
                print('{}: {}'.format(
                    word_text, word.confidence))

I have also used the same logic as shown above to get the confidence of each extracted word, but this does not solve the problem. I need a confidence score for each phrase rather than for individual words, for example a CNIC, which may consist of "44601"-"6622831"-"3". I don't think it is currently possible to get a confidence score per phrase. The hierarchy of the OCR-extracted text structure is: TextAnnotation -> Page -> Block -> Paragraph -> Word -> Symbol. As far as I know, this is correct. Is there any other way to solve this?
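Since the API reports confidence per word (and per symbol, paragraph and block) but not per arbitrary phrase, one workaround is to aggregate the word confidences that make up a phrase yourself. The sketch below is an assumption about how that could be done, not a Vision API feature: `iter_words` and `phrase_confidence` are hypothetical helpers, the choice of mean and minimum as aggregates is arbitrary, and the confidence values in the example are made up.

```python
def iter_words(response):
    """Yield (text, confidence) for every word in a Vision response."""
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    yield text, word.confidence


def phrase_confidence(words, phrase):
    """Return (mean, minimum) confidence of the first run of consecutive
    words whose concatenation equals the phrase, ignoring whitespace and
    case, or None if no such run exists."""
    target = "".join(phrase.split()).casefold()
    words = list(words)
    for start in range(len(words)):
        joined = ""
        for end in range(start, len(words)):
            joined += words[end][0].casefold()
            if not target.startswith(joined):
                break  # this run can no longer match the phrase
            if joined == target:
                scores = [conf for _, conf in words[start:end + 1]]
                return sum(scores) / len(scores), min(scores)
    return None


# Made-up (word, confidence) pairs standing in for iter_words(response).
sample_words = [
    ("name", 1.0), ("muhammad", 0.75), ("hanif", 1.0), ("tiid", 0.5),
    ("44601", 0.9), ("-", 0.6), ("6622831", 0.8), ("-", 0.7), ("3", 1.0),
]
name_score = phrase_confidence(sample_words, "muhammad hanif tiid")
cnic_score = phrase_confidence(sample_words, "44601-6622831-3")
```

With a real response, `phrase_confidence(iter_words(response), matched_text)` would give a score for each regex match. The minimum is the more conservative aggregate, because a single badly recognized word lowers the score of the whole phrase.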