Cleaning up text strings in Swift


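I want to use some messy text in my app. I have no control over the text, so it is what it is.

I'm looking for a lightweight way to clean up everything shown in these examples: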

original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is u&#x00f1;icode</p>                                          desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode
Output:

decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode

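Footnote:
1. There are probably many larger packages or libraries that do this as a small part of their overall functionality; those are of less interest here.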

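I can't make sense of the odd backslashes, but to remove HTML tags and replace HTML entities and escape sequences, you can use regular expressions to perform the following replacements:

Note that you need a dictionary of HTML entities, otherwise this won't work. The number of escape sequences is small, and creating a complete dictionary for them is not complicated either.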

import Foundation

let strings = [
    "<p>Occasionally we&nbsp;deal&nbsp;with this.</p> ",
    "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>",
    "<p>This is junk, but it's what I have<\\/p>\\r\\n",
    "<p>This is test1</p>",
    "<p>This is u\\u00f1icode</p>",
]

// the pattern needs exactly one capture group
func replaceEntities(in text: String, pattern: String, replace: (String) -> String?) -> String {
    let buffer = (text as NSString).mutableCopy() as! NSMutableString
    let regularExpression = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

    let matches = regularExpression.matches(in: text, options: [], range: NSRange(location: 0, length: buffer.length))

    // need to replace from the end or the ranges will break after first replacement
    for match in matches.reversed() {
        let captureGroupRange = match.range(at: 1)
        let matchedEntity = buffer.substring(with: captureGroupRange)
        guard let replacement = replace(matchedEntity) else {
            continue
        }
        buffer.replaceCharacters(in: match.range, with: replacement)
    }

    return buffer as String
}

let htmlEntities = [
    "nbsp": "\u{00A0}"
]

func replaceHtmlEntities(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "&([^;]+);") {
        return htmlEntities[$0]
    }
}

let escapeSequences = [
    "n": "\n",
    "r": "\r"
]

func replaceEscapes(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "\\\\([a-z])") {
        return escapeSequences[$0]
    }
}

func removeTags(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

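func replaceUnicodeSequences(_ text: String) -> String {
    // match exactly four hex digits, so non-hex input such as "\uzzzz" is never converted
    return replaceEntities(in: text, pattern: "\\\\u([0-9a-f]{4})") {
        guard let value = UInt32($0, radix: 16) else { return nil }
        // surrogate code points (D800 to DFFF) are not valid scalars and are left untouched
        return Unicode.Scalar(value).map { String($0) }
    }
}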

let purifiedStrings = strings
    .map(removeTags)
    .map(replaceHtmlEntities)
    .map(replaceEscapes)
    .map(replaceUnicodeSequences)

print(purifiedStrings.joined(separator: "\n"))
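
A name dictionary does not cover numeric character references such as `&#241;` or `&#x00f1;`. One way to handle them is a separate pass like the sketch below. The function name `replaceNumericEntities` is hypothetical and not part of the code above, and it assumes the references are well formed (terminated by a semicolon).

```swift
import Foundation

// Hypothetical helper: decodes numeric character references,
// both decimal ("&#241;") and hexadecimal ("&#x00f1;").
func replaceNumericEntities(_ text: String) -> String {
    // Group 1 captures an optional "x" that marks hex; group 2 captures the digits.
    let regex = try! NSRegularExpression(pattern: "&#(x?)([0-9a-f]+);", options: .caseInsensitive)
    let matches = regex.matches(in: text, options: [], range: NSRange(text.startIndex..., in: text))

    var result = text
    // Replace from the end so the ranges of earlier matches stay valid.
    for match in matches.reversed() {
        guard let wholeRange = Range(match.range, in: result),
              let digitsRange = Range(match.range(at: 2), in: result) else {
            continue
        }
        let isHex = match.range(at: 1).length > 0
        // Skip references that do not parse or are not valid Unicode scalar values.
        guard let value = UInt32(result[digitsRange], radix: isHex ? 16 : 10),
              let scalar = Unicode.Scalar(value) else {
            continue
        }
        result.replaceSubrange(wholeRange, with: String(scalar))
    }
    return result
}

print(replaceNumericEntities("This is u&#x00f1;icode"))
// prints "This is uñicode"
```

This pass could be added to the chain as one more `.map(replaceNumericEntities)` step. References that fail to parse are left in the text unchanged rather than dropped.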
You could also trim leading/trailing whitespace and replace runs of multiple spaces with a single space, but that is trivial.
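
For completeness, that whitespace cleanup might look like the following sketch. The name `normalizeWhitespace` is made up for illustration, and turning non-breaking spaces into ordinary spaces is an assumption about what the cleaned text should look like.

```swift
import Foundation

// Hypothetical helper: trims both ends and collapses whitespace runs.
func normalizeWhitespace(_ text: String) -> String {
    return text
        // Assumption: non-breaking spaces (U+00A0) should become ordinary spaces.
        .replacingOccurrences(of: "\u{00A0}", with: " ")
        // Collapse each run of whitespace (spaces, tabs, line breaks) into one space.
        .replacingOccurrences(of: "\\s+", with: " ", options: .regularExpression)
        // Drop whatever whitespace is left at the start and the end.
        .trimmingCharacters(in: .whitespacesAndNewlines)
}

print(normalizeWhitespace("  Occasionally we\u{00A0}deal  with this.\r\n"))
// prints "Occasionally we deal with this."
```

Running it as the last step makes sense, because the earlier replacements can themselves introduce line breaks and non-breaking spaces.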


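You can combine this with the linked solution, and that should do it:

The notions "lightweight", "simple", and "without much overhead" seem open-ended/opinion-based. Are you saying "I don't want to do any actual programming"?

As you said yourself, that is the problem. What I mean is that if someone has run into this before, I hardly need to reinvent the solution.

So you're just asking us to search for you. Martin R did that (and gave a good answer). I suggest you upvote his answer and delete this one, which is effectively a duplicate.

@MartinR, that answer helps with one part of the problem: HTML special characters. Thanks. In another programming domain I use JSoup to do all the cleanup. But I'm new to this one, so I started by searching here.

Of course, I would do the same. But I wonder how that is "lightweight and simple". That is my take on the OP's question. He didn't ask us how to do it; he ordered us not to give certain kinds of answers (and it isn't entirely clear which kinds he is ruling out).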