How to parse a freeform street/postal address out of text and into components


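We do business largely in the United States and are trying to improve the user experience by combining all the address fields into a single text area. But there are a few problems:

  • The address the user types may be incorrect or not in a standard format
  • The address must be separated into parts (street, city, state, etc.) to process credit card payments
  • Users may enter more than just their address (such as their name or company along with it)
  • Google can do this, but the Terms of Service and query limits are prohibitive, especially on a tight budget

Apparently, this is a common question:

Is there a way to isolate an address from the text around it and break it into pieces? Is there a regular expression to parse addresses?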

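I saw this question a lot when I worked for an address verification company. I'm posting the answer here so that programmers searching with the same question can find it more easily. The company I worked for processed billions of addresses, and we learned a lot in the process.

First, we need to understand a few things about addresses.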

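Addresses are not regular

This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in one very specific format, to this: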

/…(([a-zA-Z|\s+]{1,30}){1,2})([\s|,|.]+)?\b(AK|…|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WV|WY)…/

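…to a 900+ line class file that generates a supermassive regular expression on the fly to match even more. I don't recommend these. There isn't an easy magic formula to get this to work. In theory, it is not possible to match addresses with a regular expression.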
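To make the point concrete, here is a minimal sketch of what a format-specific regex does and does not catch. The pattern and the function name `naive_parse` are my own assumptions for illustration; this is not the regex quoted above, and the sample inputs are made up.

```python
import re

# Hypothetical, deliberately narrow pattern: house number, street name,
# suffix, city, two-letter state, 5-digit ZIP, in exactly that order.
NAIVE_ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+(?P<street>[A-Za-z ]+?)\s+(?P<suffix>St|Ave|Rd|Blvd)\.?,?\s+"
    r"(?P<city>[A-Za-z ]+?),?\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)


def naive_parse(text):
    """Return the named groups as a dict, or None when the pattern does not match."""
    match = NAIVE_ADDRESS.match(text)
    return match.groupdict() if match else None


if __name__ == "__main__":
    samples = (
        "102 Main St, Anytown, NY 12345",  # the one layout the pattern expects
        "400n 600e #2, 52173",             # grid-style street, no city or state
        "p.o. #104 60203",                 # PO box with only a ZIP code
    )
    for sample in samples:
        print(repr(sample), "->", naive_parse(sample))
```

Only the first sample matches. Every new layout means another alternation in the pattern, which is roughly how these expressions grow out of control.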

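The many possible address formats are documented, along with all their keywords and variations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street"), and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)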
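The "St" ambiguity is easy to underestimate, so here is a small sketch. The positional rule in `expand_st` is a heuristic I made up for illustration (it is not taken from any addressing standard): a trailing "St" is read as "Street", any other "St" as "Saint".

```python
def expand_st(address):
    """Expand "St" using a naive positional rule.

    Assumption for illustration only: the last token is a street suffix,
    any earlier "St" is "Saint".
    """
    tokens = address.split()
    expanded = []
    for index, token in enumerate(tokens):
        if token.rstrip(".").lower() == "st":
            is_last = index == len(tokens) - 1
            expanded.append("Street" if is_last else "Saint")
        else:
            expanded.append(token)
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_st("12 St Charles St"))   # the rule happens to work here
    print(expand_st("100 Main St Apt 4"))  # the rule misfires: "Main Saint"
```

The second call turns "Main St" into "Main Saint" because a unit designator follows the suffix. Patching that case just moves the failure somewhere else, which is why position alone cannot resolve the ambiguity.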

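You'd need code that really understands addresses, and if such code exists, it's a trade secret. But you could probably roll your own if you're really into that.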

Addresses come in unexpected shapes and sizes

Here are some contrived (but complete) addresses:

1)  102 main street
    Anytown, state

2)  400n 600e #2, 52173

3)  p.o. #104 60203

Even these could be valid:

4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345

5)  205 1105 14 90210

Obviously, these are not standardized. Punctuation and line breaks are not guaranteed. Here's what's going on:

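  • Number 1 is complete because it contains a street address plus a city and state. That is enough information to identify the address, and it can be considered "deliverable" (with some standardization).

  • Number 2 is complete because it contains a street address (with a secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

  • Number 3 is a complete post office box format, because it contains a ZIP code.

  • Number 4 is also complete, because a ZIP code like this one means a private entity or corporation has purchased that address space. A unique ZIP code is used for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.

  • Number 5 is also complete, believe it or not. With just those numbers, the full address can be discovered when it is parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial once you treat each number as a component. Here is what it looks like, fully expanded and standardized: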

    205 N 1105 W Apt 14
    Beverly Hills CA 90210-5221
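To show what "treat each number as a component" can mean in practice, here is a minimal sketch. The one-record list `KNOWN_ADDRESSES`, its field names, and the function `complete_from_numbers` are all hypothetical; a real system would look the components up in licensed postal data rather than a hard-coded list.

```python
# Hypothetical stand-in for a database of all deliverable addresses.
KNOWN_ADDRESSES = [
    {
        "primary_number": "205", "predirection": "N", "street_name": "1105",
        "postdirection": "W", "unit_type": "Apt", "unit_number": "14",
        "city": "Beverly Hills", "state": "CA", "zip": "90210", "plus4": "5221",
    },
]


def complete_from_numbers(text):
    """Read four bare numbers as primary number, street, unit, and ZIP code,
    then fill in the rest from the first matching record."""
    parts = text.split()
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        return None
    wanted = tuple(parts)
    for record in KNOWN_ADDRESSES:
        key = (record["primary_number"], record["street_name"],
               record["unit_number"], record["zip"])
        if key == wanted:
            line1 = ("{primary_number} {predirection} {street_name} "
                     "{postdirection} {unit_type} {unit_number}").format(**record)
            line2 = "{city} {state} {zip}-{plus4}".format(**record)
            return line1 + "\n" + line2
    return None


if __name__ == "__main__":
    print(complete_from_numbers("205 1105 14 90210"))
```

The lookup itself is trivial. The hard part is having the complete, current data to look things up in.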

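Address data is not your own

In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true of Canada Post, Royal Mail, and others, though each country enforces or defines ownership a bit differently. Knowing this is important, because it usually forbids reverse-engineering the address database. You have to be careful about how you acquire, store, and use the data.

Google Maps is a common go-to for quick address fixes, but its terms are rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, they are for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). That makes sense. Google's data is among the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be (try it on your own street, using a house number that you know doesn't exist). This is sometimes useful, but be aware of it.

Nominatim similarly has limitations, especially for high-volume and commercial use, and its data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects). However, it may still suit your needs. It is supported by a great community.

The USPS itself has an API, but it comes with no guarantees or support. It can also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires you to use their API only for confirming addresses to ship through them.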

Expect addresses to be hard
    python3 -m pip install usaddress
    
    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    # address_parser.py
    import sys
    from usaddress import tag
    from json import dumps
    
    if __name__ == '__main__':
        tag_mapping = {
            'Recipient': 'recipient',
            'AddressNumber': 'addressStreet',
            'AddressNumberPrefix': 'addressStreet',
            'AddressNumberSuffix': 'addressStreet',
            'StreetName': 'addressStreet',
            'StreetNamePreDirectional': 'addressStreet',
            'StreetNamePreModifier': 'addressStreet',
            'StreetNamePreType': 'addressStreet',
            'StreetNamePostDirectional': 'addressStreet',
            'StreetNamePostModifier': 'addressStreet',
            'StreetNamePostType': 'addressStreet',
            'CornerOf': 'addressStreet',
            'IntersectionSeparator': 'addressStreet',
            'LandmarkName': 'addressStreet',
            'USPSBoxGroupID': 'addressStreet',
            'USPSBoxGroupType': 'addressStreet',
            'USPSBoxID': 'addressStreet',
            'USPSBoxType': 'addressStreet',
            'BuildingName': 'addressStreet',
            'OccupancyType': 'addressStreet',
            'OccupancyIdentifier': 'addressStreet',
            'SubaddressIdentifier': 'addressStreet',
            'SubaddressType': 'addressStreet',
            'PlaceName': 'addressCity',
            'StateName': 'addressState',
            'ZipCode': 'addressPostalCode',
        }
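        raw_address = ' '.join(sys.argv[1:])
        try:
            address, _ = tag(raw_address, tag_mapping=tag_mapping)
        except Exception:
            # tag() raises on input it cannot label cleanly (for example
            # usaddress.RepeatedLabelError); log the full input for later
            # review and emit an empty JSON object.
            with open('failed_address.txt', 'a') as fp:
                fp.write(raw_address + '\n')
            print(dumps({}))
        else:
            print(dumps(dict(address)))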
    
     python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
     {"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}
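To see why so many labels in `tag_mapping` point at `addressStreet`, here is a simplified, standard-library-only sketch of the merging idea. It is not usaddress's actual implementation: `merge_labels` is a hypothetical helper, and the token/label pairs are written by hand to mirror the example above rather than produced by the library.

```python
# Subset of the mapping used by the script above.
TAG_MAPPING = {
    "AddressNumber": "addressStreet",
    "StreetNamePreDirectional": "addressStreet",
    "StreetName": "addressStreet",
    "StreetNamePostType": "addressStreet",
    "PlaceName": "addressCity",
    "StateName": "addressState",
    "ZipCode": "addressPostalCode",
}

# Hand-written (token, label) pairs; an assumption for illustration,
# not captured library output.
LABELED_TOKENS = [
    ("9757", "AddressNumber"),
    ("East", "StreetNamePreDirectional"),
    ("Arcadia", "StreetName"),
    ("Ave.", "StreetNamePostType"),
    ("Saugus", "PlaceName"),
    ("MA", "StateName"),
    ("01906", "ZipCode"),
]


def merge_labels(pairs, mapping):
    """Join tokens whose labels map to the same output field, in input order."""
    merged = {}
    for token, label in pairs:
        field = mapping.get(label, label)  # unmapped labels keep their own name
        merged[field] = f"{merged[field]} {token}" if field in merged else token
    return merged


if __name__ == "__main__":
    print(merge_labels(LABELED_TOKENS, TAG_MAPPING))
```

Collapsing the fine-grained labels this way gives the four fields a payment form usually needs, at the cost of throwing away the distinction between, say, a unit number and a street name.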
    
    Option Explicit
    
    Private Const TopRow As Integer = 0
    
    Public Sub ParseAddress()
    Dim strArr() As String
    Dim sigRow() As String
    Dim i As Integer
    Dim j As Integer
    Dim k As Integer
    Dim Stat As String
    Dim SpaceInName As Integer
    Dim Temp As String
    Dim PhExt As String
    
    On Error Resume Next
    
    Temp = ActiveSheet.Range("Address")
    
    'Split info into array
    strArr = Split(Temp, vbLf)
    
    'Trim the array
    For i = 0 To UBound(strArr)
    strArr(i) = VBA.Trim(strArr(i))
    Next i
    
    'Remove empty items/rows    
    ReDim sigRow(LBound(strArr) To UBound(strArr))
    For i = LBound(strArr) To UBound(strArr)
        If Trim(strArr(i)) <> "" Then
            sigRow(j) = strArr(i)
            j = j + 1
        End If
    Next i
    ReDim Preserve sigRow(LBound(strArr) To j)
    
    'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
    i = TopRow
    If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then
    
    SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1
    
    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
    Else
     If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
    End If
    
    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
    Else
      If MsgBox("Surname: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
    End If
    sigRow(i) = ""
    End If
    
    'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 8
        If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then
    
        'Find the position of the street in order to get the suburb
        SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1
    
        'If its a po box then add 5 chars
        If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5
    
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
        Else
          If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
        End If
        'Drop the street and number, leaving the suburb if it exists on the same line
        sigRow(i) = VBA.Trim(VBA.Mid(sigRow(i), SpaceInName + 1))
    
        GoTo PastAddress:
        End If
        Next j
    End If
    Next i
    PastAddress:
    
    'Mobile
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 3
        Temp = Mb(j)
            If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
            Else
              If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
            End If
        sigRow(i) = ""
        GoTo PastMobile:
        End If
        Next j
    End If
    Next i
    PastMobile:
    
    'Phone
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 1
        Temp = Ph(j)
            If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
    
                'TODO: Detect the intl or national extension here.. or if we can from the postcode.
                If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
                ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
                Else
                  If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
                End If
    
            sigRow(i) = ""
            GoTo PastPhone:
            End If
        Next j
    End If
    Next i
    PastPhone:
    
    
    'Email
    For i = 1 To UBound(sigRow)
        If Len(sigRow(i)) > 0 Then
            'replace with regEx search
            'Compare each InStr result with 0; a bare And is bitwise on the two positions
            If InStr(1, sigRow(i), "@", vbTextCompare) > 0 And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) > 0 Then
            Dim email As String
            email = sigRow(i)
            email = Replace(VBA.UCase(email), "EMAIL:", "")
            email = Replace(VBA.UCase(email), "E-MAIL:", "")
            email = Replace(VBA.UCase(email), "E:", "")
            email = Replace(VBA.UCase(Trim(email)), "E ", "")
            email = VBA.LCase(email)
    
                If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
                ActiveSheet.Range("Email") = email
                Else
                  If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
                End If
            sigRow(i) = ""
            Exit For
            End If
        End If
    Next i
    
    'Now the only remaining items will be the postcode, suburb, country
    'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code
    
    'Join the string and filter out the Post Code
    Temp = Join(sigRow, vbCrLf)
    Temp = Trim(Temp)
    
    For i = 1 To Len(Temp)
    
    Dim postCode As String
    postCode = VBA.Mid(Temp, i, 4)
    
    'In Australia PostCodes are 4 digits
    If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then
    
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("PostCode") = postCode
        Else
          If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
        End If
    
        'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
        Dim mySuburbArray As Range
        Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")
    
        Dim suburbs As String
        For j = 1 To mySuburbArray.Columns(1).Cells.Count
        If mySuburbArray.Cells(j, 1) = postCode Then
            'Check if the suburb is listed in the address
            If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then
    
            'Set the Suburb and State
            ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
            Stat = mySuburbArray.Cells(j, 3)
            ActiveSheet.Range("State") = Stat
    
            'Knowing the State - for Australia we can get the telephone Ext
            PhExt = PhExtension(VBA.UCase(Stat))
            ActiveSheet.Range("PhExt") = PhExt
    
            'remove the phone extension from the number
            Dim prePhone As String
            prePhone = ActiveSheet.Range("Phone")
            prePhone = Replace(prePhone, PhExt & " ", "")
            prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
            prePhone = Replace(prePhone, "(" & PhExt & ")", "")
            ActiveSheet.Range("Phone") = prePhone
            Exit For
            End If
        End If
        Next j
    Exit For
    End If
    Next i
    
    End Sub
    
    
    Private Function PhExtension(ByVal State As String) As String
    Select Case State
    Case Is = "NSW"
    PhExtension = "02"
    Case Is = "QLD"
    PhExtension = "07"
    Case Is = "VIC"
    PhExtension = "03"
    Case Is = "NT"
    PhExtension = "08"
    Case Is = "WA"
    PhExtension = "08"
    Case Is = "SA"
    PhExtension = "08"
    Case Is = "TAS"
    PhExtension = "03"
    End Select
    End Function
    
    Private Function Ph(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Ph = "PH"
    Case Is = 1
    Ph = "PHONE"
    'Case Is = 2
    'Ph = "P"
    End Select
    End Function
    
    Private Function Mb(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Mb = "MB"
    Case Is = 1
    Mb = "MOB"
    Case Is = 2
    Mb = "CELL"
    Case Is = 3
    Mb = "MOBILE"
    'Case Is = 4
    'Mb = "M"
    End Select
    End Function
    
    Private Function Fax(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Fax = "FAX"
    Case Is = 1
    Fax = "FACSIMILE"
    'Case Is = 2
    'Fax = "F"
    End Select
    End Function
    
    Private Function State(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    State = "NSW"
    Case Is = 1
    State = "QLD"
    Case Is = 2
    State = "VIC"
    Case Is = 3
    State = "NT"
    Case Is = 4
    State = "WA"
    Case Is = 5
    State = "SA"
    Case Is = 6
    State = "TAS"
    End Select
    End Function
    
    Private Function Street(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Street = " ST"
    Case Is = 1
    Street = " RD"
    Case Is = 2
    Street = " AVE"
    Case Is = 3
    Street = " AV"
    Case Is = 4
    Street = " CRES"
    Case Is = 5
    Street = " LOOP"
    Case Is = 6
    Street = "PO BOX"
    Case Is = 7
    Street = " STREET"
    Case Is = 8
    Street = " ROAD"
    Case Is = 9
    Street = " AVENUE"
    Case Is = 10
    Street = " CRESCENT"
    Case Is = 11
    Street = " PARADE"
    Case Is = 12
    Street = " PDE"
    Case Is = 13
    Street = " LANE"
    Case Is = 14
    Street = " COURT"
    Case Is = 15
    Street = " BLVD"
    Case Is = 16
    Street = "P.O. BOX"
    Case Is = 17
    Street = "P.O BOX"
    Case Is = 18
    Street = "PO BOX"
    Case Is = 19
    Street = "POBOX"
    End Select
    End Function
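
For readers who don't use Excel, the line-by-line keyword approach of the macro above can be sketched in Python roughly as follows. This is a loose port under my own assumptions, not a faithful translation: the names `parse_signature` and `strip_prefix` are hypothetical, the confirmation dialogs are dropped, and the suburb/state lookup against a postcode sheet is omitted. It shares the macro's basic weakness, since it only works when each item sits on its own line with a recognizable keyword.

```python
import re

MOBILE_PREFIXES = ("MOBILE", "MOB", "MB", "CELL")  # longer keywords first
PHONE_PREFIXES = ("PHONE", "PH")
EMAIL_RE = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")
POSTCODE_RE = re.compile(r"\b\d{4}\b")  # Australian postcodes are 4 digits
STREET_RE = re.compile(
    r"^(.*?\b(?:STREET|ST|ROAD|RD|AVENUE|AVE|AV|CRES|LOOP)\b\.?)(.*)$",
    re.IGNORECASE,
)


def strip_prefix(line, prefixes):
    """Return the text after a leading keyword such as "Ph:" or "Mob", else None."""
    upper = line.upper()
    for prefix in prefixes:
        if upper.startswith(prefix):
            return line[len(prefix):].lstrip(" :.-")
    return None


def parse_signature(text):
    """Split a pasted contact block into fields; the name must be on the first line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    result = {}
    if lines:
        first_name, _, surname = lines[0].partition(" ")
        result["first_name"], result["surname"] = first_name, surname
    leftovers = []
    for line in lines[1:]:
        email = EMAIL_RE.search(line)
        street = STREET_RE.match(line)
        mobile = strip_prefix(line, MOBILE_PREFIXES)
        phone = strip_prefix(line, PHONE_PREFIXES)
        if email and "email" not in result:
            result["email"] = email.group(0).lower()
        elif street and "street" not in result:
            result["street"] = street.group(1).strip()
            leftovers.append(street.group(2).strip(" ,"))  # suburb, state, postcode
        elif mobile is not None and "mobile" not in result:
            result["mobile"] = mobile
        elif phone is not None and "phone" not in result:
            result["phone"] = phone
        else:
            leftovers.append(line)
    # With phone numbers and the street removed, a 4-digit run should be the postcode.
    postcode = POSTCODE_RE.search(" ".join(leftovers))
    if postcode:
        result["postcode"] = postcode.group(0)
    return result


if __name__ == "__main__":
    sample = """John Citizen
12 Smith St, Sydney NSW 2000
Ph: 02 9999 1234
Mob: 0412 345 678
Email: john@example.com"""
    print(parse_signature(sample))
```

The street check runs before the phone check on purpose, so that a line such as "Phillip St" is not mistaken for a phone number just because it starts with "PH".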