pandas.read_sas() for reading SAS .xpt files is incompatible with files stored in Google Cloud Storage (GCS)

pandas sas google-cloud-storage

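I am trying to read a .XPT file into a DataFrame. This works if the file is local, but not if the file is stored in GCS.

I uploaded the sample data to GCS with:

!curl -L https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT | gsutil cp - gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT

I also downloaded the file locally with:

mkdir data
!curl https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT -o data/DEMO_J.XPT

I have tried the following approaches for GCS, and neither works: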

import pandas as pd
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project')
with fs.open('my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT') as f:
    df = pd.read_sas(f, format='xport')

import pandas as pd
file_path = 'gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT'
df = pd.read_sas(file_path, format='xport', encoding='utf-8')
df.head(10)

Both return the following error:

/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py in __init__(self, filepath_or_buffer, index, encoding, chunksize)
    278             contents = filepath_or_buffer.read()
    279             try:
--> 280                 contents = contents.encode(self._encoding)
    281             except UnicodeEncodeError:
    282                 pass

AttributeError: 'bytes' object has no attribute 'encode'
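
The `AttributeError` can be reproduced without pandas or GCS. The following is a minimal sketch, assuming only that a file opened in binary mode returns `bytes` from `read()`; `io.BytesIO` and the sample payload are stand-ins chosen for illustration, not the real gcsfs file object or file contents.

```python
import io

# io.BytesIO stands in for a remote file opened in binary mode
# (an assumption for illustration; the payload is made up).
binary_file = io.BytesIO(b"HEADER RECORD*******LIBRARY HEADER RECORD")
contents = binary_file.read()

# read() on a binary file returns bytes; bytes has decode() but no encode().
contents_is_bytes = isinstance(contents, bytes)
bytes_has_encode = hasattr(contents, "encode")

try:
    contents.encode("ISO-8859-1")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(contents_is_bytes, bytes_has_encode)
print(error_message)
```

This matches the failing line in the traceback: `contents.encode(...)` only makes sense when `read()` returned a `str`.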

I have now also tried TensorFlow, but it does not work either:

from tensorflow.python.lib.io import file_io
import pandas as pd

filepath = 'gs://my-bucket/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT'

with file_io.FileIO(filepath, 'r') as f:

    # ISO-8859-1
    # utf-8
    # utf-16
    # latin-1
    df = pd.read_sas(f, format='xport', encoding='utf-8')

df.head(5)

This returns the error:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-60-fb02f0706587> in <module>
     10     # utf-16
     11     # latin-1
---> 12     df = pd.read_sas(f, format='xport', encoding='utf-8')
     13 
     14 df.head(5)

/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sasreader.py in read_sas(filepath_or_buffer, format, index, encoding, chunksize, iterator)
     68 
     69         reader = XportReader(
---> 70             filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
     71         )
     72     elif format.lower() == "sas7bdat":

/opt/conda/anaconda/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py in __init__(self, filepath_or_buffer, index, encoding, chunksize)
    276         else:
    277             # Copy to BytesIO, and ensure no encoding
--> 278             contents = filepath_or_buffer.read()
    279             try:
    280                 contents = contents.encode(self._encoding)

/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py in read(self, n)
    126       length = n
    127     return self._prepare_value(
--> 128         pywrap_tensorflow.ReadFromStream(self._read_buf, length))
    129 
    130   @deprecation.deprecated_args(

/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py in _prepare_value(self, val)
     96       return compat.as_bytes(val)
     97     else:
---> 98       return compat.as_str_any(val)
     99 
    100   def size(self):

/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/util/compat.py in as_str_any(value)
    137   """
    138   if isinstance(value, bytes):
--> 139     return as_str(value)
    140   else:
    141     return str(value)

/opt/conda/anaconda/lib/python3.7/site-packages/tensorflow_core/python/util/compat.py in as_text(bytes_or_text, encoding)
    107     return bytes_or_text
    108   elif isinstance(bytes_or_text, bytes):
--> 109     return bytes_or_text.decode(encoding)
    110   else:
    111     raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 2967: invalid start byte
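
The traceback shows the read going through `as_str_any`, i.e. the file opened with mode `'r'` is decoded as text before pandas ever sees it. An XPT file is binary, so that decode step can hit bytes that are not valid UTF-8. A small sketch of the same failure, using made-up bytes rather than real file contents:

```python
# Made-up bytes for illustration: ASCII text followed by a 0x80 byte,
# similar to what binary numeric fields may contain.
raw = b"DEMO_J" + b"\x80\x00\x00\x00"

try:
    raw.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# latin-1 maps every byte value to a code point, so decoding never fails
# and encoding the result again restores the original bytes.
as_text = raw.decode("latin-1")
round_trip = as_text.encode("latin-1")

print(decode_error)
print(round_trip == raw)
```

Changing the `encoding` argument of `pd.read_sas` does not help here, because in the traceback the decode fails inside the TensorFlow file object, before the pandas encoding is applied.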

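It looks like this code was never really updated for Python 3. You can try fixing the library by removing the .encode('utf-8') call, since it is not needed in Python 3.

Alternatively, you can use TensorFlow instead of gcsfuse: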

from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://my-bucket/.../DEMO_J.XPT', 'r') as f:
  df = pd.read_sas(f, format='xport')
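
Since reading from a local path is reported to work, another possible route is to copy the remote bytes to a temporary local file and pass that path to `pd.read_sas`. The sketch below is an assumption-laden illustration: `spool_to_local_file` is a hypothetical helper (not part of pandas, gcsfs, or TensorFlow), and an in-memory stream with a made-up payload stands in for the GCS file.

```python
import io
import os
import shutil
import tempfile


def spool_to_local_file(binary_stream, suffix=".xpt"):
    """Copy a binary stream to a temporary file and return the file's path.

    Hypothetical helper for illustration. The caller is responsible
    for deleting the file afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(binary_stream, tmp)
        return tmp.name


# Demo: an in-memory stream stands in for a GCS file opened in binary mode.
payload = b"HEADER RECORD*******LIBRARY HEADER RECORD" + b"\x80\x00"
local_path = spool_to_local_file(io.BytesIO(payload))
with open(local_path, "rb") as fh:
    copied = fh.read()
os.remove(local_path)

print(copied == payload)
```

With gcsfs, the stream could come from `fs.open(path, 'rb')`, and the returned path could then be passed to `pd.read_sas(local_path, format='xport')`. This costs an extra copy on disk but avoids patching the installed library.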

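Good catch. The SAS IO connector incorrectly assumes that all file buffers are opened in text mode.

I patched my local site-packages/pandas/io/sas/sas_xport.py with the following change and was able to read the DataFrame: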

class XportReader(BaseIterator):
    __doc__ = _xport_reader_doc

    def __init__(
        self, filepath_or_buffer, index=None, encoding="ISO-8859-1", chunksize=None
    ):

        self._encoding = encoding
        self._lines_read = 0
        self._index = index
        self._chunksize = chunksize

        if isinstance(filepath_or_buffer, str):
            (
                filepath_or_buffer,
                encoding,
                compression,
                should_close,
            ) = get_filepath_or_buffer(filepath_or_buffer, encoding=encoding)

        if isinstance(filepath_or_buffer, (str, bytes)):
            self.filepath_or_buffer = open(filepath_or_buffer, "rb")
        else:
            # Copy to BytesIO, and ensure no encoding
            contents = filepath_or_buffer.read()
            try:
                # NEW LINE HERE: Don't convert to binary if it's already bytes.
                if hasattr(contents, "encode"):
                    contents = contents.encode(self._encoding)
            except UnicodeEncodeError:
                pass
            self.filepath_or_buffer = BytesIO(contents)

        self._read_header()
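
The essence of the change is the `hasattr(contents, "encode")` guard. As a rough standalone sketch of that idea (the function name `to_bytes_buffer` is hypothetical, and unlike the patch above it does not swallow `UnicodeEncodeError`):

```python
from io import BytesIO


def to_bytes_buffer(contents, encoding="ISO-8859-1"):
    """Wrap str or bytes contents in a BytesIO (hypothetical helper)."""
    # Only text needs encoding; bytes from a binary-mode read pass through.
    if hasattr(contents, "encode"):
        contents = contents.encode(encoding)
    return BytesIO(contents)


# A text-mode read yields str, a binary-mode read yields bytes.
from_text = to_bytes_buffer("abc").read()
from_bytes = to_bytes_buffer(b"abc\x80").read()

print(from_text, from_bytes)
```

Either kind of input ends up as a `BytesIO`, which is what the rest of the reader expects.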

A pull request to pandas fixes this issue. Once it is merged and pandas 1.1.0 is released, the manual patch will no longer be needed.

I tried various encodings but still got an error. I have reported this, and a pandas PR is pending.