Python regex to remove social security numbers from a string generated by speech-to-text

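For GDPR compliance reasons, I am trying to remove social security numbers (SSNs) from messy data generated by speech-to-text. Below is a sample string (translated to English, which explains why "and" appears when the SSN is listed):

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"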

My goal is to remove the part "thirteen … forty" while keeping any other numbers that may appear in the string, resulting in:

sample1_wo_ssn = "hello my name is sofie my social security number is and I live on mountain street number twelve"

The length of the social security number can vary depending on how the data was generated (3-10 separate numbers).

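My approach:

1. Use a dict to replace the written-out numbers with digits.
2. Use a regex to find places where 3 or more numbers occur with only whitespace or "and" separating them, and remove them together with any numbers that follow those 3 numbers.

This is my code: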

import re

number_dict = {
    'zero': '0',
    'one': '1',
    'two': '2',
    'three': '3',
    'four': '4',
    'five': '5',
    'six': '6',
    'seven': '7',
    'eight': '8',
    'nine': '9',
    'ten': '10',
    'eleven': '11',
    'twelve': '12',
    'thirteen': '13',
    'fourteen': '14',
    'fifteen': '15',
    'sixteen': '16',
    'seventeen': '17',
    'eighteen': '18',
    'nineteen': '19',
    'twenty': '20',
    'thirty': '30',
    'forty': '40',
    'fifty': '50',
    'sixty': '60',
    'seventy': '70',
    'eighty': '80',
    'ninety': '90'
}


sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
sample1_temp = [number_dict.get(item,item)  for item in sample1.split()]
sample1_numb = ' '.join(sample1_temp)
re_results = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?)', sample1_numb) 

print(re_results)

Output:

[('13 0 4 5 and 70 18 7 and 40 and ', '', '', '', '5', 'and ', '70', '', '18', '', '7', 'and ', '40', 'and ', '', '', '', '', '')]

This is where I'm stuck.

In this example I could do something like `sample1_wh_ssn = re.sub(re_results[0][0], '', sample1_numb)` to get the desired result, but this does not generalize.

Any help is much appreciated.
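A note on the tuple output shown in the question: `re.findall` returns one entry per capturing group whenever the pattern contains groups, which is why the result is hard to reuse. The following is a minimal sketch on a shortened, made-up input (the `text` string and both shortened patterns are assumptions for illustration, not the asker's full regex) comparing capturing groups with non-capturing `(?:...)` groups.

```python
import re

# Hypothetical, shortened input that is already converted to digits
text = "number is 13 0 4 5 and 70 and I live at 12"

# Capturing groups: findall returns a tuple of all groups for each match,
# with '' for optional groups that did not participate.
capturing = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+)', text)

# Non-capturing groups: findall returns only the matched substrings.
non_capturing = re.findall(r'\d+ (?:and\s)?\d+ (?:and\s)?\d+', text)

print(capturing)      # [('13 0 4', '', '')]
print(non_capturing)  # ['13 0 4']
```

With non-capturing groups the pattern itself can be handed to `re.sub`, so there is no need to feed a previous match back in as a new pattern.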

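Here is an implementation of your current logic, namely:

1. Convert number words from `1` to `99` to digits
2. Remove all instances of 3 or more numbers separated by whitespace (optionally with "and" in between)
3. Convert the remaining one- or two-digit numbers back to words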
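To make the removal step concrete, here is a small sketch that assumes the words have already been converted to digits. It shows that a run of three or more numbers is dropped while a shorter run is left alone. The helper name `strip_number_runs` and the sample sentences are hypothetical.

```python
import re

# One number followed by at least two more numbers,
# each optionally preceded by the word "and"
RUN_RX = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')

def strip_number_runs(text):
    """Remove runs of three or more whitespace-separated numbers."""
    return RUN_RX.sub('', text)

long_run = strip_number_runs("my number is 13 0 4 5 and 70 and I live at 12")
short_run = strip_number_runs("we are 2 and 3 years old")

print(long_run)   # my number is and I live at 12
print(short_run)  # we are 2 and 3 years old
```

The `{2,}` quantifier is what encodes "3 or more": one leading number plus at least two repetitions of the separator-and-number group.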

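Credits:

- Converting words to numbers: via stackoverflow.com/a/493788/3832970
- Converting numbers to words: via stackoverflow.com/a/8982279/3832970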

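See the Python demo (the full code is listed at the end of this post).

Output:

hello my name is sofie my social security number is and I live on mountain street number twelve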

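It looks like you only want to "support" a limited range of numbers, right? Also, is the result `hello my name is sofie my social security number is and I live on mountain street number 12` sufficient, or do you want to convert the digits back to number words?

@WiktorStribiżew Yes, that range of numbers is sufficient. It is not perfect, since enterprising people could use larger numbers to list their SSN. The best option for me would be to convert the digits back to number words.

Please check the answer and, if it works as expected, consider accepting it.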

import re

number_words = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
number_words_tens = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
# Matches an optional tens word directly followed by a 0-19 word, or a tens word on its own
number_words_rx = re.compile(r'\b(?:(?:{0})?(?:{1})|(?:{0}))\b'.format("|".join(number_words_tens), "|".join(number_words)))
# Matches 3 or more numbers separated by whitespace, optionally with "and" in between
main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')
# Lookup table for converting back: numbers_1_99[n] is the word for n
numbers_1_99 = list(number_words)  # copy, so number_words itself is not extended
numbers_1_99.extend(tens if ones == "zero" else (tens + "-" + ones) # stackoverflow.com/a/8982279/3832970
    for tens in "twenty thirty forty fifty sixty seventy eighty ninety".split()
    for ones in numbers_1_99[0:10])

def text2int(textnum, numwords={}): # stackoverflow.com/a/493788/3832970
    # numwords maps each known word to a (scale, increment) pair
    units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
    ]
    tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
    numwords["and"] = (1, 0)
    for idx, word in enumerate(units):
        numwords[word] = (1, idx)
    for idx, word in enumerate(tens):
        numwords[word] = (1, idx * 10)
    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current + increment

    return result + current

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
# Step 1: number words -> digits
sample1 = number_words_rx.sub(lambda x: str(text2int(x.group())), sample1)
# Step 2: drop every run of 3 or more numbers
re_results = main_rx.sub('', sample1)
# Step 3: remaining standalone 1-2 digit numbers -> words
# (\b keeps longer digit strings from being split into two-digit pieces)
print(re.sub(r'\b\d{1,2}\b', lambda x: numbers_1_99[int(x.group())], re_results))
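
Since the question mentions that the SSN can be anywhere from 3 to 10 numbers long, it may help to wrap the same three steps in one function and try it on inputs of different lengths. This is a simplified sketch rather than the code above: `redact_ssn` and the lookup tables are hypothetical names, and it uses plain dictionary lookups instead of `text2int`, on the assumption that every number word (0-19 and the tens) arrives as its own token.

```python
import re

UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Word <-> number tables covering 0..19 and 20, 30, ..., 90 only
WORD_TO_NUM = {word: i for i, word in enumerate(UNITS)}
WORD_TO_NUM.update({word: (i + 2) * 10 for i, word in enumerate(TENS)})
NUM_TO_WORD = {str(num): word for word, num in WORD_TO_NUM.items()}

WORD_RX = re.compile(r'\b(?:{})\b'.format("|".join(WORD_TO_NUM)))
RUN_RX = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')
DIGIT_RX = re.compile(r'\b\d+\b')

def redact_ssn(text):
    """Remove runs of 3+ spoken numbers, keeping other number words intact."""
    # Step 1: number words -> digits
    digits = WORD_RX.sub(lambda m: str(WORD_TO_NUM[m.group()]), text)
    # Step 2: drop runs of three or more numbers
    cleaned = RUN_RX.sub('', digits)
    # Step 3: known digits -> words; anything else is left unchanged
    return DIGIT_RX.sub(lambda m: NUM_TO_WORD.get(m.group(), m.group()), cleaned)

sample = ("hello my name is sofie my social security number is thirteen zero "
          "four five and seventy eighteen seven and forty and I live on "
          "mountain street number twelve")
print(redact_ssn(sample))
print(redact_ssn("call me at five five five"))
print(redact_ssn("I have two cats and three dogs"))
```

The last call shows that isolated number words survive the round trip. As noted in the comments, anything outside this vocabulary (for example "hundred") is not recognized and would interrupt a run.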