Removing UTF data of the form "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N" in Python

I have a list of URLs from which I need to scrape data using Python.

import urllib.request

from bs4 import BeautifulSoup

def extract_url_data1(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so only visible text remains
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    # Strip each line, split on double spaces, and drop empty pieces
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = " ".join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))
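
I am storing the returned data in a text file. The problem I am facing is that some URLs return data of the form xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N. I only want proper English words to be stored in the text file. Please tell me how I can achieve this; I have already tried a regular expression such as the one below: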

re.sub(r'[^\x00-\x7f]',r' ',text)

If you want to remove the non-English characters, you can do the following:

In [1]: import re

In [2]: s = "xbd5\xef\xbf\xbdFDK\xef\xbf\xbdCP\xef\xbf\xbdHP\xef\xbf\xbd\xef\xbf\xbd6N"

In [3]: ' '.join(re.findall(r'\w+', s, flags=re.ASCII))
Out[3]: 'xbd5 FDK CP HP 6N'
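
The same idea can be wrapped in a small helper for Python 3. This is only a sketch under a few assumptions: `keep_ascii_words` is a hypothetical name, and the input is assumed to still be a `str` rather than the result of `str(text.encode('utf-8'))`, which produces the literal `b'...'` representation with backslash escapes. The byte sequence `\xef\xbf\xbd` is the UTF-8 encoding of U+FFFD, the Unicode replacement character, so the sample below uses `\ufffd` directly. The `re.ASCII` flag limits `\w` to `[a-zA-Z0-9_]`.

```python
import re

def keep_ascii_words(text):
    """Return the ASCII word runs of `text`, joined by single spaces."""
    # re.ASCII makes \w match only [a-zA-Z0-9_], so U+FFFD and other
    # non-ASCII characters act as separators and are dropped.
    return " ".join(re.findall(r"\w+", text, flags=re.ASCII))

# U+FFFD is the replacement character; its UTF-8 encoding is EF BF BD.
sample = "xbd5\ufffdFDK\ufffdCP\ufffdHP\ufffd\ufffd6N"
print(keep_ascii_words(sample))
```

Returning `text` itself from the scraping function, and passing an explicit `encoding="utf-8"` to `open()` when writing the file, would avoid the escaped `b'...'` form in the first place.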
However, if you only want to keep valid English words, you will need to validate them. Hopefully this helps.
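
One possible way to validate words is to check each token against a set of known words. The sketch below is illustrative only: `KNOWN_WORDS` and `keep_known_words` are hypothetical names, and the tiny vocabulary stands in for a full English word list that would have to be loaded from a file or a library of your choice.

```python
import re

# Hypothetical, tiny vocabulary for illustration only; a real run would
# load a full English word list instead.
KNOWN_WORDS = {"the", "price", "of", "this", "phone", "is", "low"}

def keep_known_words(text, vocabulary=KNOWN_WORDS):
    """Keep only alphabetic ASCII tokens found in `vocabulary` (case-insensitive)."""
    # Only runs of ASCII letters count as candidate words; digits and
    # non-ASCII characters split tokens and are discarded.
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(t for t in tokens if t.lower() in vocabulary)

print(keep_known_words("The price\ufffd of FDK\ufffdCP this phone is low 6N"))
```

The quality of the result depends entirely on the word list: fragments such as `FDK` or `CP` are dropped only because they are absent from the vocabulary, and legitimate words missing from the list would be dropped as well.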