Python: Cannot load an ARFF dataset with scipy.io.arff.loadarff


I am trying to download ARFF datasets (for example, the one used in the snippet below) and load them in Python with scipy.io.arff.loadarff.

However, scipy seems to expect a CSV-like body after the header, and it fails to parse the vast majority of these datasets.

For example, to reproduce the problem:

from scipy.io.arff import loadarff
import urllib.request
urllib.request.urlretrieve('https://cometa.ujaen.es/public/full/yahoo_arts.arff', 'yahoo_arts.arff')
ds = loadarff('yahoo_arts.arff')

In this case, I get:

ValueError: could not convert string to float: '{8 1'

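Is this expected (that is, the scipy implementation is not fully compliant with the ARFF format)? Do you know of a workaround or a hand-written parsing function?

Thanks for any help or advice on this topic.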

"Is this expected (that is, the scipy implementation is not fully compliant with the ARFF format)?"

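Yes, unfortunately. As the documentation of loadarff says, "It cannot read files with sparse data ({} in the file)." The file yahoo_arts.arff uses the sparse format in its @data section.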
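To make the difference concrete, here is a minimal sketch of how one sparse row maps onto a dense row, and of why a reader that expects comma-separated numbers fails on it. The helper name `sparse_to_dense` and the sample row are made up for illustration, and the comma-split step only approximates what a dense-only reader does; it is not scipy's actual code.

```python
def sparse_to_dense(row, n_attributes):
    """Expand one sparse ARFF row such as '{1 5,3 2}' into a dense list.

    Each entry inside the braces is '<attribute index> <value>';
    attributes that are not listed are 0.
    """
    dense = [0.0] * n_attributes
    body = row.strip().lstrip('{').rstrip('}')
    for pair in body.split(','):
        if not pair.strip():
            continue  # an empty row '{}' lists no attributes
        index, value = pair.split()
        dense[int(index)] = float(value)
    return dense


# Hypothetical sparse row with 5 attributes
sparse_row = "{1 5,3 2}"
dense_row = sparse_to_dense(sparse_row, 5)

# A reader that expects a dense, CSV-like row would split on commas and
# try to convert each token to a number, which fails on the first token.
first_token = sparse_row.split(',')[0]
try:
    float(first_token)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The message captured in `error_message` has the same shape as the one reported in the question, with the opening brace still attached to the first token.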

You could try to find an alternative ARFF reader. I have not used any of them, so I have no specific recommendation.

You can use the following workaround:

import numpy as np
import pandas as pd


with open('yahoo_arts.arff', 'r') as fp:
    file_content = fp.readlines()


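def parse_row(line, len_row):
    # Strip the braces of a sparse row: "{8 1,12 1}" -> "8 1,12 1"
    line = line.strip().replace('{', '').replace('}', '')

    row = np.zeros(len_row)
    for data in line.split(','):
        if not data.strip():
            continue  # an empty row "{}" leaves every attribute at 0
        index, value = data.split()
        row[int(index)] = float(value)

    return row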


columns = []
len_attr = len('@attribute')

# get the columns
for line in file_content:
    if line.startswith('@attribute '):
        col_name = line[len_attr:].split()[0]
        columns.append(col_name)

rows = []
len_row = len(columns)
# get the rows
for line in file_content:
    if line.startswith('{'):
        rows.append(parse_row(line, len_row))

df = pd.DataFrame(data=rows, columns=columns)

df.head()


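As Warren Weckesser's answer shows, scipy cannot read sparse ARFF files. I have implemented a quick workaround that parses sparse ARFF files, and I share it below in case it helps others. If I find time to write a clean version, I will try to contribute it to scipy.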

Edit: sorry rusu_ro1, I had not seen your version, but I guess it works as well.

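from scipy.sparse import coo_matrix
import pandas as pd


def loadarff(filename):
    features = list()
    data = list()
    row_idx = 0
    with open(filename, "rb") as f:
        for line in f:
            line = line.decode("utf8")
            if line.startswith("@data"):
                continue
            elif line.startswith("@relation"):
                continue
            elif line.startswith("@attribute"):
                try:
                    # "@attribute <name> <type>": keep the name
                    features.append(line.split()[1])
                except Exception as e:
                    print(f"Cannot parse {line}")
                    raise e
            elif line.startswith("{"):
                try:
                    # "{8 1,12 1}" -> [[row_idx, 8, 1], [row_idx, 12, 1]]
                    line = line.replace("{", "").replace("}", "")
                    line = [[row_idx] + [int(x) for x in v.split()]
                            for v in line.split(",") if v.strip()]
                    data.append(line)
                    row_idx += 1
                except Exception as e:
                    print(f"Cannot parse {line}")
                    raise e
            elif line.strip():
                # Report anything else that is not a blank line
                print(f"Cannot parse {line}")
    # Flatten the per-row lists into one list of [row, column, value] triplets
    data = [item for sublist in data for item in sublist]
    sparse_matrix = coo_matrix(
        ([x[2] for x in data], ([x[0] for x in data], [x[1] for x in data])),
        shape=(row_idx, len(features)),
    )
    df = pd.DataFrame(sparse_matrix.todense(), columns=features)
    return df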
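Either parser above is easier to trust after checking it on a file small enough to verify by hand. The following is a standard-library-only sketch of the same idea: it reads a sparse ARFF file into attribute names plus (row, column, value) triplets, the coordinate layout used by the answer above. The sample file contents and the function name `read_sparse_arff` are made up for illustration, and the sketch assumes numeric values and attribute names without spaces or quotes.

```python
import os
import tempfile

# Tiny made-up sparse ARFF file, for illustration only
SAMPLE = """\
% comment line
@relation sample
@attribute a numeric
@attribute b numeric
@attribute c numeric
@data
{0 1,2 3}
{}
{1 7}
"""


def read_sparse_arff(path):
    """Return (column_names, triplets, n_rows) for a sparse numeric ARFF file."""
    names, triplets, n_rows = [], [], 0
    with open(path, encoding="utf8") as fp:
        for raw in fp:
            line = raw.strip()
            if not line or line.startswith('%'):
                continue  # blank line or ARFF comment
            if line.lower().startswith('@attribute'):
                # "@attribute <name> <type>": keep the name
                names.append(line.split()[1])
            elif line.startswith('{'):
                for pair in line.strip('{}').split(','):
                    if pair.strip():
                        col, value = pair.split()
                        triplets.append((n_rows, int(col), float(value)))
                n_rows += 1  # an empty row "{}" still counts as a row
    return names, triplets, n_rows


# Write the sample to a temporary file and parse it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.arff")
    with open(path, "w", encoding="utf8") as fp:
        fp.write(SAMPLE)
    names, triplets, n_rows = read_sparse_arff(path)
```

The triplets can then be unzipped into value, row, and column sequences and passed to `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(n_rows, len(names)))`, which avoids building a dense array for large datasets.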

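Thank you very much for your answer! As a workaround, I have written a simple parser that handles sparse data. I will post it in another answer in case it helps others.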