Python Shmython: is it really that hard?

python regex nlp

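I have written a program to implement the following rule. Basically, if a word starts with a consonant (or a cluster of consonants), you strip it off and add "shm"; if it starts with a vowel, you just add "shm". You also put the whole thing after the existing word.

The problem is the letter Y, because it is sometimes a consonant and sometimes a vowel. I want you to become you-shmou, but I want Python to become Python-Shmython. What should I do?

Here is my code so far: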

import re

def word_shmord(word):
    orig = word
    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        word = word.lower()
        prefix = "Shm"
    else:
        prefix = "shm"
    position = re.search("[aeiou]", word, re.IGNORECASE).start()
    new = prefix + word[position:]
    return "{}-{}".format(orig, new)


text = """
All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""
text_shmext = re.sub(r"\w+", lambda m: word_shmord(m.group(0)), text)
print(text_shmext)
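To make the failure concrete, here is a small check of the code above on the two words in question. The function body repeats the logic above so that the snippet runs on its own; only the two calls at the end are new. Because the search looks only for a, e, i, o, u, every y is skipped together with the consonants around it.

```python
import re


def word_shmord(word):
    # Same logic as the code above, repeated so this snippet is self-contained.
    orig = word
    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        word = word.lower()
        prefix = "Shm"
    else:
        prefix = "shm"
    # Only a, e, i, o, u count as vowels here, so "y" is never a stopping point.
    position = re.search("[aeiou]", word, re.IGNORECASE).start()
    return "{}-{}".format(orig, prefix + word[position:])


print(word_shmord("you"))     # you-shmou (the desired result)
print(word_shmord("Python"))  # Python-Shmon (the desired result is Python-Shmython)
```

Note that a word with none of a, e, i, o, u (for example "by") makes re.search return None, so the .start() call raises AttributeError.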

I found this problem interesting, so I wrote up a few linguistic rules for this problem (or should I say shmoblem).

Input: The quick brown fox jumps over the lazy dog

Output: The shmuick shmown shmox shmumps shmover the shmazy shmog

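Not quite. I want Skye Shmye. Maybe only the consonants at the start of the word matter. But do we care about the status of the y in backyard? It simply becomes shmackyard.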

import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize.sonority_sequencing import SyllableTokenizer

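# Needs the NLTK data packages: nltk.download('stopwords') for the stop word
# list, and 'punkt' (or 'punkt_tab' on newer NLTK releases) for word_tokenize.
stop = stopwords.words('english')
tk = SyllableTokenizer()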


def word_shmord(word):
    if (len(word) < 4 and word.lower() in stop) or not word.isalnum() or word.lower().startswith('shm'):
        return word
    if 'y' in word:
        y = word.find('y')
        # Y is considered to be a vowel if the word has no other vowel
        if len(re.findall("[aeiou]", word, re.IGNORECASE)) == 0 and word.count('y') == 1:
            word = word[:y] + '#' + word[y + 1:]
        # or if the letter is at the end of a word
        if word[-1] == 'y':
            word = word[:-1] + '#'
        # or middle/end of syllable
        if word.find('y') != -1:
            syll = tk.tokenize(word)
            for i, s in enumerate(syll):
                snew = s[:-1] + '#' if s[-1] == 'y' else s
                y = snew.find('y')
                if len(snew) // 2 == y:
                    snew = snew[:y] + '#' + snew[y + 1:]
                syll[i] = snew
            word = ''.join(syll)

    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        word = word.lower()
        prefix = "Shm"
    else:
        prefix = "shm"
    vowels = re.search("[aeiou#]", word, re.IGNORECASE)
    if not vowels:
        return word
    position = vowels.start()
    new = prefix + word[position:].replace('#', 'y')
    return new


text = "The quick brown fox jumps over the lazy dog"
text_shmext = [word_shmord(x) for x in word_tokenize(text)]
# join the tokens back together, with no space before punctuation
text_shmext = "".join([" " + i if i not in string.punctuation else i for i in text_shmext]).strip()
print(text_shmext)
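Following up on the comment about Skye and backyard, here is a minimal sketch of the simpler idea that only the start of the word matters. It is not the method used in the answer above. It rests on one assumed spelling heuristic: y counts as a consonant only when it is the first letter and a vowel follows (you, yellow), and counts as a vowel everywhere else (Python, Skye). The names ONSET and shm are made up for this illustration.

```python
import re

# Assumed heuristic (not from the answer above): an optional word-initial "y"
# that is followed by a vowel, then any run of consonants other than "y".
# The character class lists every consonant letter except y.
ONSET = re.compile(r"^(?:y(?=[aeiou]))?[b-df-hj-np-tv-xz]*", re.IGNORECASE)


def shm(word):
    # The pattern can match the empty string, so match() always succeeds.
    rest = word[ONSET.match(word).end():]
    if word.isupper():
        prefix = "SHM"
    elif word.istitle():
        prefix, rest = "Shm", rest.lower()
    else:
        prefix = "shm"
    return "{}-{}".format(word, prefix + rest)


for w in ["you", "Python", "Skye", "backyard"]:
    print(shm(w))
# you-shmou
# Python-Shmython
# Skye-Shmye
# backyard-shmackyard
```

Since this works on spelling rather than pronunciation, it will still get some words wrong, and it needs no syllable tokenizer because a y after the first letter is never stripped.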