Python regex to remove social security numbers from speech-to-text generated strings

Tags: python, regex, gdprconsentform

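For GDPR-compliance reasons, I am trying to remove social security numbers (SSNs) from messy data generated by speech-to-text. Here is a sample string (translated into English, which explains why "and" appears when the SSN is listed):

sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"

My goal is to remove the part "thirteen ... forty" while keeping any other numbers that may appear in the string, resulting in: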

sample1_wo_ssn = "hello my name is sofie my social security number is and I live on mountain street number twelve"
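The length of the social security number can vary (3-10 separate numbers) depending on how the data was generated.

My approach:

  • Replace the written numbers with digits using a dict
  • Use a regex to find places where 3 or more numbers occur with only whitespace or "and" separating them, and remove them together with any numbers that follow those 3

Here is my code: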

    import re
    
    number_dict = {
        'zero': '0',
        'one': '1',
        'two': '2',
        'three': '3',
        'four': '4',
        'five': '5',
        'six': '6',
        'seven': '7',
        'eight': '8',
        'nine': '9',
        'ten': '10',
        'eleven': '11',
        'twelve': '12',
        'thirteen': '13',
        'fourteen': '14',
        'fifteen': '15',
        'sixteen': '16',
        'seventeen': '17',
        'eighteen': '18',
        'nineteen': '19',
        'twenty': '20',
        'thirty': '30',
        'forty': '40',
        'fifty': '50',
        'sixty': '60',
        'seventy': '70',
        'eighty': '80',
        'ninety': '90'
    }
    
    
    sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
    sample1_temp = [number_dict.get(item,item)  for item in sample1.split()]
    sample1_numb = ' '.join(sample1_temp)
    re_results = re.findall(r'(\d+ (and\s)?\d+ (and\s)?\d+\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?\s?(and\s)?(\d+)?)', sample1_numb) 
    
    print(re_results)
    
    
Output:

    [('13 0 4 5 and 70 18 7 and 40 and ', '', '', '', '5', 'and ', '70', '', '18', '', '7', 'and ', '40', 'and ', '', '', '', '', '')]
    
This is where I am stuck.

In this example I could do something like

    sample1_wo_ssn = re.sub(re_results[0][0], '', sample1_numb)

to get the desired result, but this will not generalize.

Any help is much appreciated.

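Here is an implementation of your current logic, i.e.:

  • Convert the number words from 1 to 99 to digits
  • Remove all instances of 3 or more numbers separated by whitespace
  • Convert the remaining one- or two-digit numbers back to words

Credits (links as cited in the code comments):

  • Converting words to numbers: stackoverflow.com/a/493788/3832970
  • Converting numbers to words: stackoverflow.com/a/8982279/3832970

Output:

    hello my name is sofie my social security number is and I live on mountain street number twelve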
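The second step can be tried in isolation. The sketch below reuses the pattern that the answer's code compiles as `main_rx` and runs it on a few made-up strings (the example sentences are assumptions for illustration, not taken from the question) to show that only runs of three or more numbers are removed, while single numbers and pairs are kept.

```python
import re

# Same pattern as main_rx in the answer: one number, then at least two more,
# each preceded by whitespace and an optional "and".
main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')

# Hypothetical inputs, already converted from number words to digits.
examples = [
    "my number is 13 0 4 5 and 70 18 7 and 40 and I live at number 12",
    "I have 2 cats and 3 dogs",
    "rooms 4 and 5 are free",
]

results = [main_rx.sub('', s) for s in examples]
for r in results:
    print(r)
```

Only the first string changes; the other two contain at most two numbers in a row, so the `{2,}` quantifier never succeeds on them.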

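Comments:

It seems you only want to "support" numbers from 1 to 99, right? Also, is the result `hello my name is sofie my social security number is and I live on mountain street number 12` sufficient, or do you want to convert the digits back to number words?

@WiktorStribiżew Yes, numbers from 1 to 99 are enough. It is not perfect, since an enterprising person could use larger numbers to list their SSN. The best approach for me would be to convert the digits back to number words.

Please check the answer and, if it works as expected, consider accepting it.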
    import re

    # Number words for 0-19; the list index is the numeric value.
    number_words = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    number_words_tens = ["twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

    # Matches a units word (optionally prefixed directly by a tens word) or a tens word alone.
    number_words_rx = re.compile(r'\b(?:(?:{0})?(?:{1})|(?:{0}))\b'.format("|".join(number_words_tens), "|".join(number_words)))
    # Matches 3 or more numbers separated by whitespace, with an optional "and" in between.
    main_rx = re.compile(r'\s*\d+(?:\s+(?:and\s+)?\d+){2,}')

    # Lookup list for converting digits back to words: numbers_1_99[n] is the word for n.
    numbers_1_99 = list(number_words)  # copy, so that number_words itself is not extended
    numbers_1_99.extend(tens if ones == "zero" else (tens + "-" + ones)  # stackoverflow.com/a/8982279/3832970
        for tens in number_words_tens
        for ones in number_words[0:10])

    def text2int(textnum):  # adapted from stackoverflow.com/a/493788/3832970
        """Return the sum of the values of the number words in textnum."""
        numwords = {"and": 0}
        for idx, word in enumerate(number_words):
            numwords[word] = idx
        for idx, word in enumerate(number_words_tens, start=2):
            numwords[word] = idx * 10
        current = 0
        for word in textnum.split():
            if word not in numwords:
                raise ValueError("Illegal word: " + word)
            current += numwords[word]
        return current

    sample1 = "hello my name is sofie my social security number is thirteen zero four five and seventy eighteen seven and forty and I live on mountain street number twelve"
    # Step 1: replace number words with digits.
    sample1 = number_words_rx.sub(lambda x: str(text2int(x.group())), sample1)
    # Step 2: remove runs of 3 or more numbers.
    re_results = main_rx.sub('', sample1)
    # Step 3: convert the remaining one- or two-digit numbers back to words.
    print(re.sub(r'\d{1,2}', lambda x: numbers_1_99[int(x.group())], re_results))
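As a possible variation, the round trip through digits can be avoided by matching runs of number words directly. This is only a sketch under stated assumptions, not the answer's method: the helper name `remove_spoken_ssn` and its `min_len` parameter are hypothetical, and it assumes the transcript writes every number as one of the words listed in the question's dict. Numbers that are already digits are left untouched.

```python
import re

# Number words as listed in the question's dict.
UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Longest words first, so that "seventeen" is tried before "seven".
WORD = "(?:%s)" % "|".join(sorted(UNITS + TENS, key=len, reverse=True))


def remove_spoken_ssn(text, min_len=3):
    """Remove runs of at least min_len number words (hypothetical helper).

    Words in a run are separated by whitespace, with an optional "and".
    """
    run = re.compile(
        r"\s*\b{w}(?:\s+(?:and\s+)?{w}){{{n},}}\b".format(w=WORD, n=min_len - 1)
    )
    return run.sub("", text)


sample1 = ("hello my name is sofie my social security number is thirteen zero "
           "four five and seventy eighteen seven and forty and I live on "
           "mountain street number twelve")
print(remove_spoken_ssn(sample1))
```

Like the digit-based version, this removes any sufficiently long run of numbers, so a spoken year such as "nineteen eighty four" would also be deleted. Whether that is acceptable depends on the data.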