
Normalizing British and American English for Elasticsearch


Is there a best practice for normalizing British and American English in Elasticsearch?

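A list-based approach (such as a synonym filter) requires a very long configuration file. There are several thousand words that are spelled differently in British and American English, and it is almost impossible to find a truly comprehensive word list. There is a good example list, but it is far from complete.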


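Ideally, I would like to create an ES analyzer/filter for English. Maybe that is the better approach, but I don't know where to start: which type of filter do I need? It does not have to cover everything; it should just normalize most search terms. For example "grey" - "gray", "colour" - "color", "centre" - "center", etc.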

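This is the approach I settled on after fiddling around for a while. It is a combination of basic rules, "fixes", and synonyms. First, apply a char_filter that enforces a set of basic spelling rules. It is not 100% correct, but it does a pretty good job: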

"char_filter": {
    "en_char_filter": { "type": "mapping", "mappings": [
        # fixes
        "aerie=>axerie", "aeroplane=>airplane", "aloe=>aloxe", "canoe=>canoxe", "coerce=>coxerce", "poem=>poxem", "prise=>prixse",
        # whole words
        "armour=>armor", "behaviour=>behavior", "centre=>center", "colour=>color", "clamour=>clamor", "draught=>draft", "endeavour=>endeavor", "favour=>favor", "flavour=>flavor", "harbour=>harbor", "honour=>honor",
        "humour=>humor", "labour=>labor", "litre=>liter", "metre=>meter", "mould=>mold", "neighbour=>neighbor", "plough=>plow", "saviour=>savior", "savour=>savor",
        # generic transformations
        "ae=>e", "ction=>xion", "disc=>disk", "gramme=>gram", "isable=>izable", "isation=>ization", "ise=>ize", "ising=>izing", "ll=>l", "oe=>e", "ogue=>og", "sation=>zation", "yse=>yze", "ysing=>yzing"
    ] }
}
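The "fixes" entries prevent the other rules from being applied where they should not be. For example, "prise=>prixse" prevents "prise" from being changed to "prize", which has a different meaning. You may need to adjust these to your own needs.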
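To see why a "fix" entry shields a word from the generic rules, here is a minimal Python sketch. It is a simplified stand-in, not Elasticsearch's implementation: it assumes the mapping char filter scans left to right and applies the longest matching key at each position, and the helper name `apply_mappings` is made up for illustration.

```python
def apply_mappings(text, rules):
    """Toy model of a mapping char filter: at each position, apply the
    longest key that matches; otherwise copy the character unchanged."""
    table = dict(rule.split("=>", 1) for rule in rules)
    keys = sorted(table, key=len, reverse=True)  # try longest keys first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)  # skip past the matched input
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


RULES = ["prise=>prixse", "ise=>ize", "colour=>color", "ll=>l"]

print(apply_mappings("organise", RULES))    # generic rule applies
print(apply_mappings("prise", RULES))       # fix entry wins over "ise=>ize"
print(apply_mappings("prise", ["ise=>ize"]))  # without the fix entry
```

Because the rules work on substrings, the fix also fires inside longer words (for example "enterprise"). That is harmless as long as the same filter runs at both index and search time.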

Next, include a synonym filter to catch the most common exceptions:

"en_synonym_filter": { "type": "synonym", "synonyms": EN_SYNONYMS }
Below is the synonym list, which includes the keywords that matter most for my use case. You may want to adjust this list to your needs:

EN_SYNONYMS = (
    "accolade, prize => award",
    "accoutrement => accouterment",
    "aching, pain => hurt",
    "acw, anticlockwise, counterclockwise, counter-clockwise => ccw",
    "adaptor => adapter",
    "advocate, attorney, barrister, procurator, solicitor => lawyer",
    "ageing => aging",
    "agendas, agendum => agenda",
    "almanack => almanac",
    "aluminium => aluminum",
    "america, united states, usa",
    "amphitheatre => amphitheater",
    "anti-aliased, anti-aliasing => antialiased",
    "arbour => arbor",
    "ardour => ardor",
    "arse => ass",
    "artefact => artifact",
    "aubergine => eggplant",
    "automobile, motorcar => car",
    "axe => ax",
    "bannister => banister",
    "barbecue => bbq",
    "battleaxe => battleax",
    "baulk => balk",
    "beetroot => beet",
    "biassed => biased",
    "biassing => biasing",
    "biscuit => cookie",
    "black american, african american, afro-american, negro",
    "bobsleigh => bobsled",
    "bonnet => hood",
    "bulb, electric bulb, light bulb, lightbulb",
    "burned => burnt",
    "bussines, bussiness => business",
    "business man, business people, businessman",
    "business woman, business people, businesswoman",
    "bussing => busing",
    "cactus, cactuses => cacti",
    "calibre => caliber",
    "candour => candor",
    "candy floss, candyfloss, cotton candy",
    "car park, parking area, parking ground, parking lot, parking-lot, parking place, parking",
    "carburettor => carburetor",
    "castor => caster",
    "cataloguing => cataloging",
    "catboat, sailboat, sailing boat",
    "champion, gainer, victor, win, winner => victory",
    "chat => talk",
    "chequebook => checkbook",
    "chequer => checker",
    "chequerboard => checkerboard",
    "chequered => checkered",
    "christmas tree ball, christmas tree ball ornament, christmas ball ornament, christmas bauble",
    "christmas, x-mas => xmas",
    "cinema => movies",
    "clangour => clangor",
    "clarinettist => clarinetist",
    "conditioning => conditioner",
    "conference => meeting",
    "coriander => cilantro",
    "corporate => company",
    "cosmos, universe => outer space",
    "cosy, cosiness => cozy",
    "criminal => crime",
    "curriculums => curricula",
    "cypher => cipher",
    "daddy, father, pa, papa => dad",
    "defence => defense",
    "defenceless => defenseless",
    "demeanour => demeanor",
    "departure platform, station platform, train platform, train station",
    "dishrag => dish cloth",
    "dishtowel, dishcloth => dish towel",
    "doughnut => donut",
    "downspout => drainpipe",
    "drugstore => pharmacy",
    "e-mail => email",
    "enamoured => enamored",
    "england => britain",
    "english => british",
    "epaulette => epaulet",
    "exercise, excercise, training, workout => fitness",
    "expressway, motorway, highway => freeway",
    "facebook => facebook, social media",
    "fanny => buttocks",
    "fanny pack => bum bag",
    "farmyard => barnyard",
    "faucet => tap",
    "fervour => fervor",
    "fibre => fiber",
    "fibreglass => fiberglass",
    "flashlight => torch",
    "flautist => flutist",
    "flier => flyer",
    "flower fly, hoverfly, syrphid fly, syrphus fly",
    "foot-walk, sidewalk, sideway => pavement",
    "football, soccer",
    "forums => fora",
    "fourth => 4",
    "freshman => fresher",
    "chips, fries, french fries",
    "gaol => jail",
    "gaolbird => jailbird",
    "gaolbreak => jailbreak",
    "gaoler => jailer",
    "garbage, rubbish => trash",
    "gasoline => petrol",
    "gases, gasses",
    "gauge => gage",
    "gauged => gaged",
    "gauging => gaging",
    "gipsy, gipsies, gypsies => gypsy",
    "glamour => glamor",
    "glueing => gluing",
    "gravesite, sepulchre, sepulture => sepulcher",
    "grey => gray",
    "greyish => grayish",
    "greyness => grayness",
    "groyne => groin",
    "gryphon, griffon => griffin",
    "hand shake, shake hands, shaking hands, handshake",
    "haulier => hauler",
    "hobo, homeless, tramp => bum",
    "new year, new year's eve, hogmanay, silvester, sylvester",
    "holiday => vacation",
    "holidaymaker, holiday-maker, vacationer, vacationist => tourist",
    "homosexual, fag => gay",
    "inbox, letterbox, outbox, postbox => mailbox",
    "independence day, 4th of july, fourth of july, july 4th, july 4, 4th july, july fourth, forth of july, 4 july, fourth july, 4th july",
    "infant, suckling, toddler => baby",
    "infeasible => unfeasible",
    "inquire, inquiry => enquire",
    "insure => ensure",
    "internet, website => www",
    "jelly => jam",
    "jewelery, jewellery => jewelry",
    "jogging => running",
    "journey => travel",
    "judgement => judgment",
    "kerb => curb",
    "kiwifruit => kiwi",
    "laborer => worker",
    "lacklustre => lackluster",
    "ladybeetle, ladybird, ladybug => ladybird beetle",
    "larrikin, scalawag, rascal, scallywag => naughty boy",
    "leaf => leaves",
    "licence, licenced, licencing => license",
    "liquorice => licorice",
    "lorry => truck",
    "loupe, magnifier, magnifying, magnifying glass, magnifying lens, zoom",
    "louvred => louvered",
    "louvres => louver",
    "lustre => luster",
    "mail => post",
    "mailman => postman",
    "marriage, married, marry, marrying, wedding => wed",
    "mayonaise => mayo",
    "meagre => meager",
    "misdemeanour => misdemeanor",
    "mitre => miter",
    "mom, momma, mummy, mother => mum",
    "moonlight => moon light",
    "moult => molt",
    "moustache, moustached => mustache",
    "nappy => diaper",
    "nightlife => night life",
    "normalcy => normality",
    "octopus => kraken",
    "odour => odor",
    "odourless => odorless",
    "offence => offense",
    "omelette => omelet",
    "# fix torres del paine",
    "paine => painee",
    "pajamas => pyjamas",
    "pantyhose => tights",
    "parenthesis, parentheses => bracket",
    "parliament => congress",
    "parlour => parlor",
    "persnickety => pernickety",
    "philtre => filter",
    "phoney => phony",
    "popsicle => iced-lolly",
    "porch => veranda",
    "pretence => pretense",
    "pullover, jumper => sweater",
    "pyjama => pajama",
    "railway => railroad",
    "rancour => rancor",
    "rappel => abseil",
    "row house, serial house, terrace house, terraced house, terraced housing, town house",
    "rigour => rigor",
    "rumour => rumor",
    "sabre => saber",
    "saltpetre => saltpeter",
    "sanitarium => sanatorium",
    "santa, santa claus, st nicholas, st nicholas day",
    "sceptic, sceptical, scepticism, sceptics => skeptic",
    "sceptre => scepter",
    "shaikh, sheikh => sheik",
    "shivaree => charivari",
    "silverware, flatware => cutlery",
    "simultaneous => simultanous",
    "sleigh => sled",
    "smoulder, smouldering => smolder",
    "sombre => somber",
    "speciality => specialty",
    "spectre => specter",
    "splendour => splendor",
    "spoilt => spoiled",
    "street => road",
    "streetcar, tramway, tram => trolley-car",
    "succour => succor",
    "sulphate, sulphide, sulphur, sulphurous, sulfurous => sulfur",
    "super hero, superhero => hero",
    "surname => last name",
    "sweets => candy",
    "syphon => siphon",
    "syphoning => siphoning",
    "tack, thumb-tack, thumbtack => drawing pin",
    "tailpipe => exhaust pipe",
    "taleban => taliban",
    "teenager => teen",
    "television => tv",
    "thank you, thanks",
    "theatre => theater",
    "tickbox => checkbox",
    "ticked => checked",
    "timetable => schedule",
    "tinned => canned",
    "titbit => tidbit",
    "toffee => taffy",
    "tonne => ton",
    "transportation => transport",
    "trapezium => trapezoid",
    "trousers => pants",
    "tumour => tumor",
    "twitter => twitter, social media",
    "tyre => tire",
    "tyres => tires",
    "undershirt => singlet",
    "university => college",
    "upmarket => upscale",
    "valour => valor",
    "vapour => vapor",
    "vigour => vigor",
    "waggon => wagon",
    "windscreen, windshield => front shield",
    "world championship, world cup, worldcup",
    "worshipper, worshipping => worshiping",
    "yoghourt, yoghurt => yogurt",
    "zip, zip code, postal code, postcode",
    "zucchini => courgette"
)
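The answer shows the char filter and the synonym filter separately but not how they are wired into an analyzer. The following is a sketch of one possible index settings body, built as a Python dict. The analyzer name `en_normalized`, the helper `build_index_settings`, and the choice of the `standard` tokenizer plus `lowercase` filter are assumptions, not part of the original answer. Only a handful of sample rules are used here.

```python
import json


def build_index_settings(mappings, synonyms):
    """Assemble an index settings body from char-filter mappings and synonym rules."""
    return {
        "settings": {
            "analysis": {
                "char_filter": {
                    "en_char_filter": {"type": "mapping", "mappings": list(mappings)},
                },
                "filter": {
                    "en_synonym_filter": {"type": "synonym", "synonyms": list(synonyms)},
                },
                "analyzer": {
                    # "en_normalized" is a hypothetical name chosen for this sketch.
                    "en_normalized": {
                        "type": "custom",
                        # Char filters run on the raw text, before the tokenizer.
                        "char_filter": ["en_char_filter"],
                        "tokenizer": "standard",
                        "filter": ["lowercase", "en_synonym_filter"],
                    },
                },
            }
        }
    }


body = build_index_settings(
    mappings=["prise=>prixse", "colour=>color", "ise=>ize"],
    synonyms=["grey => gray", "aubergine => eggplant", "football, soccer"],
)
print(json.dumps(body, indent=2))
```

One thing to keep in mind with this layout: because the char filter sees the text before it is lowercased, capitalized input such as "Colour" would not match the lowercase mapping keys unless the text is lowercased beforehand.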

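I realize this answer strays a bit from the OP's original question, but if you just want to normalize American vs. British spelling variants, you can look up a manageable list (roughly 1,700 replacements) here: [link not preserved]. I am sure there are other lists out there that could be used to create a consolidated master list.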
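As a rough sketch of the "consolidated master list" idea, the snippet below merges several lists of (British, American) pairs into explicit synonym rules. The input format and the helper name `merge_variant_lists` are assumptions made for illustration; real lists would need to be parsed into pairs first.

```python
def merge_variant_lists(*lists):
    """Merge lists of (british, american) pairs into 'british => american' rules.

    Entries are lowercased, identical pairs are dropped, and when two lists
    disagree about a word, the list passed first wins.
    """
    merged = {}
    for pairs in lists:
        for british, american in pairs:
            british, american = british.strip().lower(), american.strip().lower()
            if british and american and british != american:
                merged.setdefault(british, american)
    return [f"{b} => {a}" for b, a in sorted(merged.items())]


list_a = [("Colour", "color"), ("grey", "gray")]
list_b = [("colour", "colour"), ("grey", "grey2"), ("tyre", "tire")]

for rule in merge_variant_lists(list_a, list_b):
    print(rule)
```

The resulting strings use the same `a => b` form as the entries in the synonym list above, so they could be appended to it.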

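Apart from spelling variations, you have to be very careful not to replace individual words with their (presumed!) American English counterparts out of context. I would advise against anything but the most reliable lexical replacements. For example, I can't see anything bad happening with

"acw, anticlockwise, counterclockwise, counter-clockwise => ccw"

but this

"hobo, homeless, tramp => bum"

would index "a homeless person" => *"a bum person", which is nonsense. (Not to mention that hobos, homeless people, and tramps are quite distinct.)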
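The failure mode described above is easy to reproduce with a toy token-by-token substitution. This is not Elasticsearch's synonym filter, just a minimal sketch that handles single-word terms only; the function name `replace_tokens` is made up for illustration.

```python
def replace_tokens(text, rule):
    """Apply one 'a, b, c => target' rule to each whitespace-separated token."""
    sources, target = rule.split("=>", 1)
    lookup = {s.strip(): target.strip() for s in sources.split(",")}
    return " ".join(lookup.get(tok, tok) for tok in text.lower().split())


SAFE_RULE = "acw, anticlockwise, counterclockwise, counter-clockwise => ccw"
RISKY_RULE = "hobo, homeless, tramp => bum"

print(replace_tokens("turn anticlockwise", SAFE_RULE))   # meaning is preserved
print(replace_tokens("a homeless person", RISKY_RULE))   # adjective becomes a noun
```

The first rule maps words that mean the same thing in every context. The second replaces an adjective with a noun, and the substitution has no way to notice.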

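In short, apart from spelling variations, the differences between the American and British dialects are complex and cannot be reduced to simple list lookups.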

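Also, if you really want to do this properly (i.e., taking grammatical context into account), you would probably need a context-sensitive paraphrase model that "translates" British English into American English (or the reverse, depending on your needs) before the text ever reaches the ES index. This could be done with an off-the-shelf statistical translation model (given enough parallel data), or even with custom in-house software that uses natural language analysis, part-of-speech tagging, chunking, etc.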