How can I view summary statistics (e.g. number of samples; data types) of a HuggingFace dataset?


I am looking for suitable datasets to test some new machine learning ideas. Is there a way to view summary statistics (e.g. number of samples; data types) of HuggingFace datasets?


They provide descriptions here, but sifting through them is somewhat difficult.

Not sure whether I am missing something obvious, but I think you have to code it yourself. With the following, you only get general information about each dataset:

from datasets import list_datasets
list_datasets(with_details=True)[1].__dict__
Output:

{'id': 'ag_news',
 'key': 'datasets/datasets/ag_news/ag_news.py',
 'lastModified': '2020-09-15T08:26:31.000Z',
 'description': "AG is a collection of more than 1 million news articles. News articles have been\ngathered from more than 2000 news sources by ComeToMyHead in more than 1 year of\nactivity. ComeToMyHead is an academic news search engine which has been running\nsince July, 2004. The dataset is provided by the academic comunity for research\npurposes in data mining (clustering, classification, etc), information retrieval\n(ranking, search, etc), xml, data compression, data streaming, and any other\nnon-commercial activity. For more information, please refer to the link\nhttp://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html .\n\nThe AG's news topic classification dataset is constructed by Xiang Zhang\n(xiang.zhang@nyu.edu) from the dataset above. It is used as a text\nclassification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann\nLeCun. Character-level Convolutional Networks for Text Classification. Advances\nin Neural Information Processing Systems 28 (NIPS 2015).",
 'citation': '@inproceedings{Zhang2015CharacterlevelCN,\n  title={Character-level Convolutional Networks for Text Classification},\n  author={Xiang Zhang and Junbo Jake Zhao and Yann LeCun},\n  booktitle={NIPS},\n  year={2015}\n}',
 'size': 3991,
 'etag': '"560ac59ac8cb6f76ac4180562a7f9342"',
 'siblings': [datasets.S3Object('ag_news.py'),
  datasets.S3Object('dataset_infos.json'),
  datasets.S3Object('dummy/0.0.0/dummy_data.zip')],
 'author': None,
 'numModels': 1}
What you actually want is the information provided by:

from datasets import load_dataset
squad = load_dataset('squad')
squad
Output:

 DatasetDict({'train': Dataset(features: {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}, num_rows: 87599), 'validation': Dataset(features: {'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'context': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None)}, num_rows: 10570)})
Here you get the number of samples per split (num_rows) as well as the data type of each feature. However, this loads the entire dataset, which may be undesirable and should be avoided for performance reasons.

Another option, assuming I have not overlooked a parameter that allows this directly, is to load only each dataset's
dataset_infos.json

import datasets
import requests
from datasets import list_datasets
from datasets.utils.file_utils import REPO_DATASETS_URL

sets = list_datasets()
version = datasets.__version__
name = 'dataset_infos.json'
summary = []

for d in sets:
    print('loading {}'.format(d))
    try:
        r = requests.get(REPO_DATASETS_URL.format(version=version, path=d, name=name))
        summary.append(r.json())
    except:
        print('could not load {}'.format(d))

# the features and splits values might be of interest for you
print(summary[0]['default']['features'])
print(summary[0]['default']['splits'])
Output:

{'email_body': {'dtype': 'string', 'id': None, '_type': 'Value'}, 'subject_line': {'dtype': 'string', 'id': None, '_type': 'Value'}}
{'test': {'name': 'test', 'num_bytes': 1384177, 'num_examples': 1906, 'dataset_name': 'aeslc'}, 'train': {'name': 'train', 'num_bytes': 11902668, 'num_examples': 14436, 'dataset_name': 'aeslc'}, 'validation': {'name': 'validation', 'num_bytes': 1660730, 'num_examples': 1960, 'dataset_name': 'aeslc'}}
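Once the summary list is assembled, the split metadata can be filtered without downloading any data. A minimal sketch (total_examples is a hypothetical helper, not part of the datasets API, and the infos dict is hardcoded from the aeslc output above):

```python
# One dataset_infos.json payload, hardcoded to mirror the 'splits'
# dict printed above for aeslc.
infos = {
    'default': {
        'splits': {
            'test': {'num_examples': 1906},
            'train': {'num_examples': 14436},
            'validation': {'num_examples': 1960},
        }
    }
}

def total_examples(info):
    """Total number of examples across all splits of the default config."""
    return sum(split['num_examples'] for split in info['default']['splits'].values())

print(total_examples(infos))  # 18302
```

The same loop over the full summary list would let you rank all datasets by size before deciding which one to load.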

Note: I have not checked the
dataset_infos.json
files of the datasets that failed to load. They may have a more complex structure or contain errors.

Thanks, this looks like what I was after. I ended up writing something very similar to what you did in the second part of your answer :-)