Javascript: Is there a way to convert CSV columns into a hierarchical relationship?
I have a CSV of 7 million biodiversity records in which the taxonomic levels are columns. For example:
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
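I want to create a visualization in D3, but the data format must be a network in which each distinct value of a column is a child of some value in the previous column. I need to go from the CSV to something like this: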
{
name: 'Animalia',
children: [{
name: 'Chordata',
children: [{
name: 'Mammalia',
children: [{
name: 'Primates',
children: 'Hominidae'
}, {
name: 'Carnivora',
children: 'Canidae'
}]
}]
}]
}
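I haven't figured out how to do this without a thousand for loops. Does anyone have a suggestion for how to create this network in Python or JavaScript?
To create the exact nested object you want, we'll use a mix of plain JavaScript and a D3 method named d3.stratify. Keep in mind, however, that 7 million rows (see the post scriptum below) is a lot to compute.
It is very important that, for this proposed solution, you separate the kingdoms into different data arrays (for example, using Array.prototype.filter). This restriction arises because we need a single root node, and in Linnaean taxonomy there is no relationship between kingdoms (unless you create "Domain" as a top rank, which would be the root for all eukaryotes, but then you would have the same problem with Archaea and Bacteria).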
So, suppose your CSV (I added a few more rows) has just one kingdom:
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus
Based on that CSV, we'll create an array named tableOfRelationships which, as the name implies, holds the relationships between the ranks:
const data = d3.csvParse(csv);
const taxonomicRanks = data.columns.filter(d => d !== "RecordID");
const tableOfRelationships = [];
data.forEach(row => {
taxonomicRanks.forEach((d, i) => {
if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
name: row[d],
parent: row[taxonomicRanks[i - 1]] || null
})
})
});
For the data above, this is the tableOfRelationships:
+---------+----------------------+---------------+
| (Index) | name | parent |
+---------+----------------------+---------------+
| 0 | "Animalia" | null |
| 1 | "Chordata" | "Animalia" |
| 2 | "Mammalia" | "Chordata" |
| 3 | "Primates" | "Mammalia" |
| 4 | "Hominidae" | "Primates" |
| 5 | "Homo" | "Hominidae" |
| 6 | "Homo sapiens" | "Homo" |
| 7 | "Carnivora" | "Mammalia" |
| 8 | "Canidae" | "Carnivora" |
| 9 | "Canis" | "Canidae" |
| 10 | "Canis latrans" | "Canis" |
| 11 | "Cetacea" | "Mammalia" |
| 12 | "Delphinidae" | "Cetacea" |
| 13 | "Tursiops" | "Delphinidae" |
| 14 | "Tursiops truncatus" | "Tursiops" |
| 15 | "Pan" | "Hominidae" |
| 16 | "Pan paniscus" | "Pan" |
+---------+----------------------+---------------+
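Look at null as the parent of Animalia: that is why I told you that you need to separate your dataset by kingdom. There can be only one null value in the whole table.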
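Since the question also asks about Python, and since calling find on the array rescans the table for every cell (which adds up at 7 million rows), here is a minimal sketch of the same relationship table built with a dict, so that each membership check is a constant-time lookup. The function name build_relationships and the inline sample are illustrative and not part of the original answer; like the JavaScript version, it keeps the first parent seen for each name.

```python
import csv
from io import StringIO

# Sample data copied from the single-kingdom CSV in this answer.
SAMPLE = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus
"""


def build_relationships(csv_file):
    """Return {name: parent}, keeping the first parent seen for each name."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    parents = {}  # dicts preserve insertion order in Python 3.7+
    for row in reader:
        ranks = row[1:]  # drop RecordID
        for i, name in enumerate(ranks):
            if name not in parents:  # dict lookup instead of a linear scan
                parents[name] = ranks[i - 1] if i > 0 else None
    return parents


table = build_relationships(StringIO(SAMPLE))
for name, parent in table.items():
    print(name, "<-", parent)
```

For a real file, the StringIO object would be replaced by an open file handle; the rows are streamed, so only the distinct names are held in memory.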
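Finally, based on that table, we create the hierarchy using d3.stratify().
Here is the demo. Open your browser's console (the snippet's console is not good for this task) and inspect the several levels (children) of the object: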
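const csv = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis latrans
3,Animalia,Chordata,Mammalia,Cetacea,Delphinidae,Tursiops,Tursiops truncatus
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Pan,Pan paniscus`;
const data = d3.csvParse(csv);
const taxonomicRanks = data.columns.filter(d => d !== "RecordID");
const tableOfRelationships = [];
data.forEach(row => {
  taxonomicRanks.forEach((d, i) => {
    if (!tableOfRelationships.find(e => e.name === row[d])) tableOfRelationships.push({
      name: row[d],
      parent: row[taxonomicRanks[i - 1]] || null
    })
  })
});
const stratify = d3.stratify()
  .id(function(d) {
    return d.name;
  })
  .parentId(function(d) {
    return d.parent;
  });
const hierarchicalData = stratify(tableOfRelationships);
console.log(hierarchicalData);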
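The step from the parent table to a nested object can also be sketched in plain Python: link every record to its parent by name and require exactly one root. This is not D3's implementation, and it returns plain dicts rather than D3 node objects; the function name stratify is borrowed only for illustration, and the rows are a shortened copy of the table above.

```python
import json


def stratify(rows):
    """Turn [{'name': ..., 'parent': ...}, ...] into one nested name/children dict."""
    # One node per record, looked up by name.
    nodes = {r["name"]: {"name": r["name"], "children": []} for r in rows}
    root = None
    for r in rows:
        node = nodes[r["name"]]
        if r["parent"] is None:
            if root is not None:
                raise ValueError("more than one root: split the data by kingdom first")
            root = node
        else:
            nodes[r["parent"]]["children"].append(node)
    if root is None:
        raise ValueError("no root found")
    return root


rows = [
    {"name": "Animalia", "parent": None},
    {"name": "Chordata", "parent": "Animalia"},
    {"name": "Mammalia", "parent": "Chordata"},
    {"name": "Primates", "parent": "Mammalia"},
    {"name": "Carnivora", "parent": "Mammalia"},
]
tree = stratify(rows)
print(json.dumps(tree, indent=2))
```

The second ValueError mirrors the restriction described above: with two kingdoms in the table there would be two records whose parent is None.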
In Python, one way to encode a tree is with a dict, where each key represents a node and the associated value is that node's parent:
{'Homo sapiens': 'Homo',
'Canis': 'Canidae',
'Arabidopsis thaliana': 'Arabidopsis',
'Phaseolus vulgaris': 'Phaseoulus',
'Homo': 'Hominidae',
'Arabidopsis': 'Brassicaceae',
'Phaseoulus': 'Fabaceae',
'Hominidae': 'Primates',
'Canidae': 'Carnivora',
'Brassicaceae': 'Brassicales',
'Fabaceae': 'Fabales',
'Primates': 'Mammalia',
'Carnivora': 'Mammalia',
'Brassicales': 'Magnoliopsida',
'Fabales': 'Magnoliopsida',
'Mammalia': 'Chordata',
'Magnoliopsida': 'nan',
'Chordata': 'Animalia',
'nan': 'Plantae',
'Animalia': None,
'Plantae': None}
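One advantage of this is that nodes are guaranteed to be unique, since dicts cannot have duplicate keys.
If you want to encode a more general directed graph (i.e., one in which nodes can have more than one parent), you can use lists as the values and have them represent the children (or the parents, I suppose).
You can do something similar with an Object in JS, substituting Arrays for lists where necessary.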
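As a rough illustration of the list-valued variant, the sketch below builds both directions (parent to children and child to parents) from a list of edges using collections.defaultdict. The edge list is made-up sample data, and the variable names are assumptions chosen for the example.

```python
from collections import defaultdict

# Illustrative (parent, child) edges; "ManBearPig" has three parents.
edges = [
    ("Homo", "Homo sapiens"),
    ("Homo", "ManBearPig"),
    ("Ursus", "Ursus arctos"),
    ("Ursus", "ManBearPig"),
    ("Sus", "ManBearPig"),
]

children = defaultdict(list)  # parent -> list of children
parents = defaultdict(list)   # child  -> list of parents
for parent, child in edges:
    children[parent].append(child)
    parents[child].append(parent)

print(dict(children))
print(parents["ManBearPig"])
```

Keeping both maps costs extra memory but makes it cheap to walk the graph in either direction.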
Here is the Python code I used to create the first dict above:
import csv

ROWS = []
# Load file: tbl.csv
with open('tbl.csv', 'r') as in_file:
    csvreader = csv.reader(in_file)
    # Ignore leading row numbers
    ROWS = [row[1:] for row in csvreader]
# Drop header row
del ROWS[0]
# Build dict
mytree = {row[i]: row[i-1] for row in ROWS for i in range(len(row)-1, 0, -1)}
# Add top-level nodes
mytree = {**mytree, **{row[0]: None for row in ROWS}}
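One caveat with keying the dict on the bare name: a name that occurs at two ranks collapses into a single key. In the question's data, record 2 has genus Canis and species Canis, so the species entry is overwritten by the genus entry, and every nan placeholder would likewise share one key. A minimal sketch that keys each node on its full path of ancestors instead, assuming the same column layout; build_parent_map and the tuple keys are illustrative choices, not part of the answer above.

```python
import csv
from io import StringIO

# Two records from the question, including the genus/species name clash.
SAMPLE = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
"""


def build_parent_map(csv_file):
    """Map each node's full path (a tuple of names) to its parent's path."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    parent_of = {}
    for row in reader:
        ranks = tuple(row[1:])  # drop RecordID
        for depth in range(1, len(ranks) + 1):
            path = ranks[:depth]
            # A kingdom has an empty parent path, stored as None.
            parent_of[path] = path[:-1] or None
    return parent_of


parent_of = build_parent_map(StringIO(SAMPLE))
genus = ("Animalia", "Chordata", "Mammalia", "Carnivora", "Canidae", "Canis")
species = genus + ("Canis",)
print(parent_of[species] == genus)
```

The display name of a node is simply the last element of its path, so nothing is lost by using the longer key.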
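var log = console.log;
var data = `
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris`;
// Make an array of rows, each an array of values (trim drops the leading blank line)
data = data.trim().split("\n").map(v => v.split(","));
// Init tree
var tree = {};
data.forEach(row => {
  // Set current = root of the tree for every row
  var cur = tree;
  var id = false;
  row.forEach((value, i) => {
    if (i == 0) {
      // Set id and skip the value
      id = value;
      return;
    }
    // If the branch does not exist, create it.
    // If this is the last value, write the id
    if (!cur[value]) cur[value] = (i == row.length - 1) ? id : {};
    // Move the link down the hierarchy
    cur = cur[value];
  });
});
log("Tree:");
log(JSON.stringify(tree, null, "  "));
// Now you have the hierarchy in tree and can do anything with it.
var toStruct = function(obj) {
  let ret = [];
  for (let key in obj) {
    let child = obj[key];
    let rec = {};
    rec.name = key;
    if (typeof child == "object") rec.children = toStruct(child);
    ret.push(rec);
  }
  return ret;
}
var struct = toStruct(tree);
log("Struct:");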
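console.log(struct);
Using Python and the python-benedict library (it is open source; note: I am the author):
Installation: pip install python-benedict
from benedict import benedict as bdict
# data source can be a file path or a URL
data_source = """
RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""
data_input = bdict.from_csv(data_source)
data_output = bdict()
ancestors_hierarchy = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
for value in data_input['values']:
    data_output['.'.join([value[ancestor] for ancestor in ancestors_hierarchy])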
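For the more general directed graph (lists as values, each key mapping to its children), the dict could look like this:
{'Homo': ['Homo sapiens', 'ManBearPig'],
'Ursus': ['Ursus arctos', 'ManBearPig'],
'Sus': ['ManBearPig']}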
const nester = d3.nest(); // Create a nest operator (d3.nest is available in D3 v5 and earlier)
const [, ...taxonomicRanks] = data.columns; // Get rid of the RecordID property
taxonomicRanks.forEach(r => nester.key(d => d[r])); // Register key functions
const nest = nester.entries(data); // Calculate hierarchy
d3.hierarchy({ values: nest }, d => d.values) // Wrap the array in a root object; second argument is the children accessor
import csv

def read_data(filename):
    tree = {}
    with open(filename) as f:
        f.readline()  # skip the column headers line of the file
        for animal_cols in csv.reader(f):
            spot = tree
            for name in animal_cols[1:]:  # each name, skipping the record number
                if name in spot:  # The parent is already in the tree
                    spot = spot[name]
                else:
                    spot[name] = {}  # creates a new entry in the tree
                    spot = spot[name]
    return tree

from pprint import pprint
pprint(read_data('data.txt'))
{'Animalia': {'Chordata': {'Mammalia': {'Carnivora': {'Canidae': {'Canis': {'Canis': {}}}},
'Primates': {'Hominidae': {'Homo': {'Homo sapiens': {}}}}}}},
'Plantae': {'nan': {'Magnoliopsida': {'Brassicales': {'Brassicaceae': {'Arabidopsis': {'Arabidopsis thaliana': {}}}},
'Fabales': {'Fabaceae': {'Phaseoulus': {'Phaseolus vulgaris': {}}}}}}}}
def walk_children(tree, parent=''):
    for child in tree.keys():
        full_name = parent + ':' + child
        yield (parent, full_name)
        yield from walk_children(tree[child], full_name)

tree = read_data('data.txt')
for (parent, child) in walk_children(tree):
    print(f'parent="{parent}" child="{child}"')
parent="" child=":Animalia"
parent=":Animalia" child=":Animalia:Chordata"
parent=":Animalia:Chordata" child=":Animalia:Chordata:Mammalia"
parent=":Animalia:Chordata:Mammalia" child=":Animalia:Chordata:Mammalia:Primates"
parent=":Animalia:Chordata:Mammalia:Primates" child=":Animalia:Chordata:Mammalia:Primates:Hominidae"
parent=":Animalia:Chordata:Mammalia:Primates:Hominidae" child=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo"
parent=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo" child=":Animalia:Chordata:Mammalia:Primates:Hominidae:Homo:Homo sapiens"
parent=":Animalia:Chordata:Mammalia" child=":Animalia:Chordata:Mammalia:Carnivora"
parent=":Animalia:Chordata:Mammalia:Carnivora" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae"
parent=":Animalia:Chordata:Mammalia:Carnivora:Canidae" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis"
parent=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis" child=":Animalia:Chordata:Mammalia:Carnivora:Canidae:Canis:Canis"
parent="" child=":Plantae"
parent=":Plantae" child=":Plantae:nan"
parent=":Plantae:nan" child=":Plantae:nan:Magnoliopsida"
parent=":Plantae:nan:Magnoliopsida" child=":Plantae:nan:Magnoliopsida:Brassicales"
parent=":Plantae:nan:Magnoliopsida:Brassicales" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae"
parent=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis"
parent=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis" child=":Plantae:nan:Magnoliopsida:Brassicales:Brassicaceae:Arabidopsis:Arabidopsis thaliana"
parent=":Plantae:nan:Magnoliopsida" child=":Plantae:nan:Magnoliopsida:Fabales"
parent=":Plantae:nan:Magnoliopsida:Fabales" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae"
parent=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus"
parent=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus" child=":Plantae:nan:Magnoliopsida:Fabales:Fabaceae:Phaseoulus:Phaseolus vulgaris"
import { set } from 'lodash'
const csvString = `RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris`
// First create a quick lookup map
const result = csvString
.split('\n') // Split for Rows
.slice(1) // Remove headers
.reduce((acc, row) => {
const path = row
.split(',') // Split for columns
.filter(item => item !== 'nan') // OPTIONAL: Filter 'nan'
.slice(1) // Remove record id
const species = path.pop() // Pull out species (last entry)
set(acc, path, species)
return acc
}, {})
console.log(JSON.stringify(result, null, 2))
// Then convert to the name-children structure by recursively calling this function
const convert = (obj) => {
// If we're at the end of our chain, end the chain (children is empty)
if (typeof obj === 'string') {
return [{
name: obj,
children: [],
}]
}
// Else loop through each entry and add them as children
return Object.entries(obj)
.reduce((acc, [key, value]) => acc.concat({
name: key,
children: convert(value), // Recursive call
}), [])
}
const result2 = convert(result)
console.log(JSON.stringify(result2, null, 2))
[
{
"name": "Animalia",
"children": [
{
"name": "Chordata",
"children": [
{
"name": "Mammalia",
"children": [
{
"name": "Primates",
"children": [
{
"name": "Hominidae",
"children": [
{
"name": "Homo",
"children": [
{
"name": "Homo sapiens",
"children": []
}
]
}
]
}
]
},
{
"name": "Carnivora",
"children": [
{
"name": "Canidae",
"children": [
{
"name": "Canis",
"children": [
{
"name": "Canis",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
},
{
"name": "Plantae",
"children": [
{
"name": "Magnoliopsida",
"children": [
{
"name": "Brassicales",
"children": [
{
"name": "Brassicaceae",
"children": [
{
"name": "Arabidopsis",
"children": [
{
"name": "Arabidopsis thaliana",
"children": []
}
]
}
]
}
]
},
{
"name": "Fabales",
"children": [
{
"name": "Fabaceae",
"children": [
{
"name": "Phaseoulus",
"children": [
{
"name": "Phaseolus vulgaris",
"children": []
}
]
}
]
}
]
}
]
}
]
}
]
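For readers working in Python, the same two steps (build a nested lookup while skipping nan, then convert it recursively) can be sketched with dict.setdefault in place of lodash's set. This is a loose port under stated assumptions, not the answerer's code: unlike the JavaScript above, it stores the species as an empty dict rather than a string, and the names nest_rows and to_children are made up for the example.

```python
import csv
from io import StringIO

CSV_TEXT = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""


def nest_rows(csv_file, skip=("nan",)):
    """Build nested dicts, one level per rank, leaving out placeholder values."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    root = {}
    for row in reader:
        node = root
        for name in row[1:]:  # drop RecordID
            if name in skip:  # optional: jump over missing ranks
                continue
            node = node.setdefault(name, {})  # create the branch if needed
    return root


def to_children(nested):
    """Convert nested dicts to a list of {'name', 'children'} records."""
    return [{"name": key, "children": to_children(value)}
            for key, value in nested.items()]


result = to_children(nest_rows(StringIO(CSV_TEXT)))
print([kingdom["name"] for kingdom in result])
```

Skipping nan attaches Magnoliopsida directly to Plantae, which matches the JSON output shown above; whether that is the right treatment for missing ranks depends on the visualization.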
from io import StringIO
import csv

CSV_CONTENTS = """RecordID,kingdom,phylum,class,order,family,genus,species
1,Animalia,Chordata,Mammalia,Primates,Hominidae,Homo,Homo sapiens
2,Animalia,Chordata,Mammalia,Carnivora,Canidae,Canis,Canis
3,Plantae,nan,Magnoliopsida,Brassicales,Brassicaceae,Arabidopsis,Arabidopsis thaliana
4,Plantae,nan,Magnoliopsida,Fabales,Fabaceae,Phaseoulus,Phaseolus vulgaris
"""

def recursive(dict_data):
    lst = []
    for key, val in dict_data.items():
        children = recursive(val)
        lst.append(dict(name=key, children=children))
    return lst

def main():
    with StringIO() as io_f:
        io_f.write(CSV_CONTENTS)
        io_f.seek(0)
        io_f.readline()  # skip the column headers line of the file
        result_tree = {}
        for row_data in csv.reader(io_f):
            cur_dict = result_tree  # cursor, back to root
            for item in row_data[1:]:  # each item, skip the record number
                if item not in cur_dict:
                    cur_dict[item] = {}  # create new dict
                    cur_dict = cur_dict[item]
                else:
                    cur_dict = cur_dict[item]

    # change answer format
    result_list = []
    for cur_kingdom_name in result_tree:
        result_list.append(dict(name=cur_kingdom_name, children=recursive(result_tree[cur_kingdom_name])))

    # Optional
    import json
    from os import startfile  # note: os.startfile is available on Windows only
    output_file = 'result.json'
    with open(output_file, 'w') as f:
        json.dump(result_list, f)
    startfile(output_file)

if __name__ == '__main__':
    main()