
Cleaning up text strings in Swift
Tags: swift, string, data-cleaning

I want to use some messy text in my app. I have no control over the text, so it is what it is.

I'm looking for a lightweight way to clean up everything shown in these examples:



original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode





import Foundation

let strings = [
    "<p>Occasionally we&nbsp;deal&nbsp;with this.</p> ",
    "<p>Sometimes they \\emphasize\\ like this, I could live with it</p>",
    "<p>This is junk, but it's what I have<\\/p>\\r\\n",
    "<p>This is test1</p>",
    "<p>This is u\\u00f1icode</p>",
]

// the pattern needs exactly one capture group
func replaceEntities(in text: String, pattern: String, replace: (String) -> String?) -> String {
    let buffer = (text as NSString).mutableCopy() as! NSMutableString
    let regularExpression = try! NSRegularExpression(pattern: pattern, options: .caseInsensitive)

    let matches = regularExpression.matches(in: text, options: [], range: NSRange(location: 0, length: buffer.length))

    // need to replace from the end or the ranges will break after first replacement
    for match in matches.reversed() {
        let captureGroupRange = match.range(at: 1)
        let matchedEntity = buffer.substring(with: captureGroupRange)
        // no replacement known: leave this match untouched
        guard let replacement = replace(matchedEntity) else {
            continue
        }
        buffer.replaceCharacters(in: match.range, with: replacement)
    }

    return buffer as String
}

// named HTML entities to decode; extend as needed
let htmlEntities = [
    "nbsp": "\u{00A0}"
]

func replaceHtmlEntities(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "&([^;]+);") {
        return htmlEntities[$0]
    }
}

// backslash escapes to decode; unknown ones (e.g. \e) are left as-is
let escapeSequences = [
    "n": "\n",
    "r": "\r"
]

func replaceEscapes(_ text: String) -> String {
    return replaceEntities(in: text, pattern: "\\\\([a-z])") {
        return escapeSequences[$0]
    }
}

func removeTags(_ text: String) -> String {
    return text
        .replacingOccurrences(of: "<[^>]+>", with: "", options: .regularExpression)
}

func replaceUnicodeSequences(_ text: String) -> String {
    // match hex digits only, so the Int conversion below cannot fail
    return replaceEntities(in: text, pattern: "\\\\u([0-9a-f]{4})") {
        let code = Unicode.Scalar(Int($0, radix: 16)!)
        return code.map { String($0) }
    }
}

let purifiedStrings = strings
    .map(removeTags)
    .map(replaceHtmlEntities)
    .map(replaceEscapes)
    .map(replaceUnicodeSequences)

print(purifiedStrings.joined(separator: "\n"))
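
The `htmlEntities` table above only knows `nbsp`, so a numeric character reference such as `&#x00f1;` would pass through unchanged. Below is a minimal sketch of one way to handle those. `replaceNumericEntities` is a hypothetical helper, not part of the code above, and it assumes references always end with a semicolon. It could be run as one more pass over each string.

```swift
import Foundation

// Hypothetical helper (an assumption, not from the answer above): decodes
// numeric character references such as &#x00f1; (hex) and &#241; (decimal).
func replaceNumericEntities(_ text: String) -> String {
    let buffer = NSMutableString(string: text)
    // Group 1 captures an optional "x" that marks hex; group 2 captures the digits.
    guard let regex = try? NSRegularExpression(pattern: "&#(x?)([0-9a-f]+);",
                                               options: .caseInsensitive) else {
        return text
    }
    let fullRange = NSRange(location: 0, length: buffer.length)
    let matches = regex.matches(in: text, options: [], range: fullRange)

    // Replace from the end so the earlier match ranges stay valid.
    for match in matches.reversed() {
        let isHex = match.range(at: 1).length > 0
        let digits = buffer.substring(with: match.range(at: 2))
        guard let value = UInt32(digits, radix: isHex ? 16 : 10),
              let scalar = Unicode.Scalar(value) else {
            continue // leave malformed or out-of-range references untouched
        }
        buffer.replaceCharacters(in: match.range, with: String(Character(scalar)))
    }
    return buffer as String
}
```

Malformed references are skipped rather than force-unwrapped, so unexpected input is left as-is instead of crashing.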



original: <p>Occasionally we&nbsp;deal&nbsp;with this.</p>                       desired: Occasionally we deal with this.
original: <p>Sometimes they \emphasize\ like this, I could live with it</p>      desired: Sometimes they emphasize like this, I could live with it
original: <p>This is u&#x00f1;icode</p>                                          desired: This is uñicode
original: <p>This is junk, but it's what I have<\/p>\r\n                         desired: This is junk, but it's what I have
original: <p>This is test1</p>                                                   desired: This is test1
original: <p>This is u\u00f1icode</p>                                            desired: This is uñicode

decoded: Occasionally we deal with this.
decoded: Sometimes they \emphasize\ like this, I could live with it
decoded: This is uñicode
decoded: This is junk, but it's what I have
decoded: This is test1
decoded: This is uñicode
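
The decoded output still differs from the desired text in small ways: the backslashes around `\emphasize\` survive, and decoding can leave non-breaking spaces or trailing line breaks behind. A rough sketch of a final tidy-up pass follows. `tidy` is a hypothetical name, and the code assumes that any backslash still present after decoding is junk, which may not hold for every data source.

```swift
import Foundation

// Hypothetical final pass (an assumption, not from the original answers).
// Meant to run after tags, entities and escapes have been decoded.
func tidy(_ text: String) -> String {
    return text
        // Assumption: leftover backslashes carry no meaning and can be dropped.
        .replacingOccurrences(of: "\\", with: "")
        // Turn non-breaking spaces (U+00A0) into ordinary spaces.
        .replacingOccurrences(of: "\u{00A0}", with: " ")
        // Drop leading and trailing spaces, tabs and line breaks.
        .trimmingCharacters(in: .whitespacesAndNewlines)
}
```

For example, `tidy("Sometimes they \\emphasize\\ like this\r\n")` gives `Sometimes they emphasize like this`.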