Python regex works in sample code but not in the required code

python regex

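I am working on a project that needs to parse some web pages, and I am using BeautifulSoup for that. I can get the information, but the strings contain a lot of Unicode characters, newlines, tabs and extra spaces. I tried to remove these with regular expressions. The script I wrote works on a string that I declared in a simple test script, but it does not work on the real data.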

My code:

str = '\xa9 Copyright 2009-10 \n\t\t\t\t All Rights Reserved. (Best viewed in 1024x768 \n\t\t\t\tresolution & IE 6.0)                    break\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \nChief Engineer'
reSpace = re.compile(' +')
reUni = re.compile( '(\\xa9|\\n|\\t|\\xa0)')
str = reSpace.sub(' ', str)
str = reUni.sub('', str)
print str

Thanks for the replies. My real code is:

import re
from bs4 import BeautifulSoup
import os
tagslist = [] # keeps track of the tags that have been encountered
filehandle = {} # stores the file handle for every tag
reUni = re.compile( '((\\xa9)|(\\n)|(\\t)|(\\xa0))')
reSpace = re.compile(' +')
page = "filename.html"  # html file which needs to be parsed
fread = open(page, 'r')
soup = BeautifulSoup(fread.read())
fread.close()
if re.match( r'.*\.htm$', page):    # strip the ".htm" or ".html" extension so that a folder named "filename" can be created
    page = site+"_parsed/"+page[:-4]+"_data"
else:
    page = site+"_parsed/"+page[:-5]+"_data"
if not os.path.exists(page):    #creates the folder named "filename"
    os.makedirs(page)
for tag in soup.find_all():
    if tag.string:  #if the tag encountered has a child string or not
        #if the tag is encountered for the first time, create the file to hold its strings and declare the file handle for it
        if tag.name not in tagslist:
            tagStrFile = page+ "/" + tag.name +"_str.txt"
            filehandle[tag.name] = "handle_" + tag.name
            vars()[filehandle[tag.name]] = open(tagStrFile, 'w+') #declare the file handle
            tagslist.append(tag.name)
            filehandle[tag.name] = vars()[filehandle[tag.name]]
        str = (repr(tag.string))
        str = str[2:-1]
        str = reUni.sub('', str)
        str = reSpace.sub(' ', str)
        if str == '':
                continue
        filehandle[tag.name].write(str)
        filehandle[tag.name].write("\n")
    for tag in tagslist:    #close all the files
        filehandle[tag].close()

A small sample of the data it creates:

INTRODUCTION
SETUP
\xa0STRUCTURE \n                OF THE ORGANISATION
 The Category wise position as on 31-03-2012 of the Sanctioned Strength \n        and the Vacant Posts.
Sr.No.
Name \n          of the Post/Designation
Sanctioned \n          Strength

Thanks.

To collapse runs of whitespace (including non-breaking spaces) into a single space, you only need one regular expression:

re.sub(ur'[\s\xa0]+', u' ', samplestr)
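The call above uses Python 2's `ur''` literal prefix, which Python 3 rejects as a syntax error. A minimal Python 3 sketch of the same idea follows; the helper name `collapse_whitespace` and the sample string are made up for illustration. In Python 3, `\s` already matches `\xa0` in `str` patterns, so listing it explicitly is redundant but harmless.

```python
import re

# One pattern for any run of whitespace; \xa0 is listed explicitly to mirror
# the Python 2 expression, although Python 3's \s already covers it.
_WHITESPACE_RUN = re.compile(r'[\s\xa0]+')


def collapse_whitespace(text):
    """Replace each run of whitespace (including U+00A0) with one space."""
    return _WHITESPACE_RUN.sub(' ', text)


sample = 'break\xa0\xa0\xa0 \nChief   Engineer'
print(collapse_whitespace(sample))  # break Chief Engineer
```

Compiling the pattern once is only a convenience when the helper is called for every tag in a page; a plain `re.sub` call behaves the same.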

Demo:

>>> import re
>>> samplestr = u'\xa9 Copyright 2009-10 \n\t\t\t\t All Rights Reserved. (Best viewed in 1024x768 \n\t\t\t\tresolution & IE 6.0)                    break\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0 \nChief Engineer'
>>> re.sub(ur'[\s\xa0]+', u' ', samplestr)
u'\xa9 Copyright 2009-10 All Rights Reserved. (Best viewed in 1024x768 resolution & IE 6.0) break Chief Engineer'

Comments:

Are you using .strings to get the value? Use .stripped_strings instead.

You need to show us how your code fails on the real data; use repr() to give us a representation of the actual data you have, and show how it does not produce the expected output.

I tried using .stripped_strings, but instead of the actual data it gave me "generator object stripped_strings at 0x000000000265FEA0".

That is because stripped_strings is a generator. Use ''.join() on it. But .strings is a generator as well!

\xa0 is the non-breaking space, character 160 (&#160; or &nbsp; in HTML).
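Judging from the sample output in the question, one likely reason the real script leaves `\n` and `\xa0` behind is the `repr()` call: it turns each control character into a backslash escape made of ordinary characters, so a pattern that matches a real newline or non-breaking space finds nothing to remove. The sketch below illustrates this under Python 3 (an assumption; the question targets Python 2, where the slice is `[2:-1]` because of the `u` prefix). The names `unwanted`, `raw` and `escaped` are made up for the example.

```python
import re

# Matches the actual characters: copyright sign, newline, tab, non-breaking space.
unwanted = re.compile(r'[\xa9\n\t\xa0]')

raw = 'Name \n\t of the Post\xa0'

# repr() text without the surrounding quotes; the newline is now the two
# ordinary characters backslash and "n", and likewise for \t and \xa0.
escaped = repr(raw)[1:-1]

cleaned_raw = unwanted.sub('', raw)          # control characters are removed
cleaned_escaped = unwanted.sub('', escaped)  # nothing matches, text is unchanged

print(cleaned_raw)
print(cleaned_escaped)
```

Applying the substitution to the string itself, rather than to its `repr()`, avoids the problem.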