Data Release Policy

Our goal is to make sequence data rapidly and broadly available to the scientific community as a community resource. It is our intention to publish the work of this project in a timely fashion, and we welcome collaborative interaction on the project and analyses. However, considerable investment was made in generating these data and we ask that you respect rights of first publication and acknowledgment as outlined in the Toronto agreement (Toronto International Data Release Workshop Authors. Prepublication data sharing. Nature. 2009 Sep 10;461(7261):168-70). By accessing these data, you agree not to publish any articles containing analyses of genes, cell types or transcriptomic data on a whole atlas or tissue scale prior to initial publication by the Tabula Microcebus Consortium and its collaborating scientists. If you wish to make use of restricted data for publication or are interested in collaborating on the analyses of these data, please use the contact form. Redistribution of these data should include the full text of the data use policy.

Donor Characteristics Summary

Processed data available from figshare

Tabula Microcebus on figshare

On Figshare, we provide the cell-by-gene count data for the Tabula Microcebus mouse lemur scRNAseq cell atlas in Python’s h5ad and Matlab’s mat formats, as well as scripts to export the files to R’s Seurat format. The data are organized as described below. To explore and reannotate the data interactively in the browser, view the Organs Tab.

Access the data

1. h5ad file (Python) “LCA_complete_wRaw_toPublish.h5ad”:

View file on figshare

The h5ad file contains the following groups:

  • X: cell by gene count matrix with counts library-size normalized and natural log transformed. Smartseq2, ln(reads/N *1e4 +1); 10x, ln(UMI/N *1e4 +1), where N denotes the total number of reads or UMI of the cell.
  • layers: 
    • raw_counts: cell by gene count matrix with raw gene counts, not normalized by total number of reads/UMIs.
  • var: metadata for the genes
    • name: NCBI gene symbol
    • highly_variable: whether the gene is highly variable (calculated for the entire dataset after FIRM integration)
  • obsm: metadata for the top PC coefficients and UMAP (calculated for all cells according to the overall transcriptional similarity)
    • X_pca: top principal component coefficients after FIRM integration.
    • X_umap: 2D UMAP coordinates of the cells calculated with the FIRM integrated data.
  • obs: metadata for the cells
    • nCount_RNA: number of total reads (smartseq2) or UMI (10x) per cell.
    • nFeature_RNA: number of genes per cell.
    • cell_name: unique name assigned to each cell.
    • cell_barcode_10x: unique 10x barcode ID for each cell (10x data only).
    • sequencing_run_10x: unique Illumina NovaSeq 6000 system sequencing run ID for each cDNA library sequenced (10x data only). Each library contained more than one channel/sample.
    • channel_10x: unique tissue channel/sample name (10x data only). For several tissues, more than one channel/sample was sequenced (each designated by a different subtissue name).
    • possibly_contaminated_barcode_10x: contamination filtering was done to resolve cross-sample contamination in an Illumina sequencing run caused by cell barcode hopping among multiplexed 10x samples (see methods section in Tabula Microcebus manuscript for full explanation). 
    • method: smartseq2 (full-length) or 10x (3prime).
    • individual: lemur individuals available in the dataset.
    • age: age of the individual (years).
    • sex: sex of the individual.
    • tissue: tissue sampled.
    • tissue_system: tissue/organ system for each tissue sampled.
    • tissue_order: numerical ordering of each of the 27 tissues by tissue system (according to Fig. 1C in Tabula Microcebus manuscript).
    • subtissue: specification of the anatomical site sampled within the tissue. For tissues sampled multiple times at the same anatomical site, each 10x channel has a distinct subtissue number.
    • compartment_v1: functional compartment for each cell type (i.e., epithelial, endothelial, stromal, immune (hematopoietic, lymphoid, myeloid, megakaryocyte-erythroid), neural, germ).
    • cell_ontology_class_v1: cell type designation using the Cell Ontology.
    • free_annotation_v1: detailed cell type designation using free text and molecular markers. PF, proliferating; LQ, low quality.
    • tissue__cell_ontology_class_v1: concatenation of the tissue and cell ontology designation.
    • tissue__free_annotation_v1: concatenation of the tissue and free annotation designation.
    • mix_hybrid: clusters with a small number of cells that contain more than one cell type but could not be partitioned into separate clusters by subclustering with the Louvain algorithm or manually with cellxgene were labeled as a ‘mix’ cell type. Clusters with cells that expressed markers for more than one cell type and it was biologically plausible they were not a technical artifact were labeled as a ‘hybrid’ cell type.
    • low_quality: clusters that separated from a main cluster but did not express any distinguishing markers and differed only in parameters of technical quality (i.e. fewer genes and counts detected per cell) were considered low quality.
    • dendrogram_annotation_number: number assigned to each of the 256 cell type designations across the Tabula Microcebus, arranged by compartment and then ordered by organ system or biological relatedness (according to Fig. 2A in Tabula Microcebus manuscript). In addition, separate numbering is assigned to each of the hybrid and mix cell types (labeled with the prefix letters ‘H’ and ‘M’, respectively).
    • dendrogram_annotation_order: numerical ordering of the 256 cell type designations with the addition of the hybrid and mix cell types (according to Fig. 2B in Tabula Microcebus manuscript).
    • order__compartment_freeannotation_tissue, order__tissue_compartment_freeannotation: numerical ordering of the 768 molecularly distinct cell types where each cell type designation is separated by its tissue of origin (with mix cell types excluded). order__compartment_freeannotation_tissue: cell types are ordered by compartment (compartment_v1), then by free annotation (free_annotation_v1), and then by tissue (tissue_order); order__tissue_compartment_freeannotation: cell types are ordered by tissue (tissue_order), then by compartment (compartment_v1), and then by free annotation (free_annotation_v1).
    • MHC: counts for the major histocompatibility complex (MHC) genes based on reannotation of the locus using expression data from the Tabula Microcebus (original locus annotation from NCBI’s Annotation Release 101). Note that counts are available only for cells sequenced by the 10x method and are NaN for cells sequenced by the smartseq2 method. Both raw counts and normalized counts (labeled with the prefix letter ‘n’) are provided.
      • MHC_C_I, MHC_NC_I, MHC_all_II: sum of counts from classical Class I, non-classical Class I, and all Class II genes, respectively.
      • nMHC_C_I, nMHC_NC_I, nMHC_all_II: sum of normalized counts from classical Class I, non-classical Class I, and all Class II genes, respectively.
      • counts and normalized counts from individual classical Class I genes (Mimu_168, Mimu_W03, Mimu_W04, Mimu_249, nMimu_168, nMimu_W03, nMimu_W04, nMimu_249), non-classical Class I genes (Mimu_180ps, Mimu_191, Mimu_202, Mimu_208, Mimu_218, Mimu_229ps, Mimu_239ps, nMimu_180ps, nMimu_191, nMimu_202, nMimu_208, nMimu_218, nMimu_229ps, nMimu_239ps), and Class II genes (Mimu_DMA, Mimu_DMB, Mimu_DPA, Mimu_DPB, Mimu_DQA, Mimu_DQB, Mimu_DRA, Mimu_DRB, nMimu_DMA, nMimu_DMB, nMimu_DPA, nMimu_DPB, nMimu_DQA, nMimu_DQB, nMimu_DRA, nMimu_DRB).
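
Because the MHC columns are NaN for smartseq2 cells, summary statistics should skip those cells explicitly rather than treat the missing values as zeros. The following is a minimal sketch using only the Python standard library. The records are made up to mimic a few obs columns, and the helper name mean_by_tissue is hypothetical; real values come from the h5ad file.

```python
import math
from collections import defaultdict

# Made-up rows mimicking a few obs columns (values are illustrative only).
cells = [
    {"tissue": "Lung", "method": "10x", "nMHC_C_I": 2.0},
    {"tissue": "Lung", "method": "10x", "nMHC_C_I": 4.0},
    {"tissue": "Lung", "method": "smartseq2", "nMHC_C_I": float("nan")},
    {"tissue": "Kidney", "method": "10x", "nMHC_C_I": 1.0},
    {"tissue": "Kidney", "method": "smartseq2", "nMHC_C_I": float("nan")},
]


def mean_by_tissue(rows, column):
    """Mean of `column` per tissue, ignoring cells where the value is NaN."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        value = row[column]
        if math.isnan(value):  # smartseq2 cells carry no MHC counts
            continue
        sums[row["tissue"]] += value
        counts[row["tissue"]] += 1
    return {tissue: sums[tissue] / counts[tissue] for tissue in sums}


means = mean_by_tissue(cells, "nMHC_C_I")
print(means)  # {'Lung': 3.0, 'Kidney': 1.0}
```

Filtering on the method column (keeping only 10x cells) before aggregating would give the same result here and makes the restriction explicit.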

Instructions to obtain counts normalized by total reads/UMIs, without the natural log transform in Python:

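  • import scanpy as sc
  • MLCA_h5ad = sc.read_h5ad("LCA_complete_wRaw_toPublish.h5ad")  # load the atlas
  • MLCA_h5ad.layers["ln(UP10K+1)_counts"] = MLCA_h5ad.X.copy()  # keep the log-transformed matrix
  • MLCA_h5ad.X = MLCA_h5ad.layers["raw_counts"].copy()  # start again from the raw counts
  • sc.pp.normalize_total(MLCA_h5ad, target_sum=1e4)  # scale each cell to 10,000 total counts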
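
To make the relationship between raw counts, normalized counts, and the values in X concrete, here is a minimal sketch on a single made-up cell, using only the Python standard library. The function names and the toy counts are illustrative assumptions, not part of the dataset or of scanpy; the formula is the one given above for X.

```python
import math

SCALE = 1e4  # counts are scaled to 10,000 per cell before the log transform


def normalize(raw_counts, scale=SCALE):
    """Library-size normalize one cell: count / N * scale, N = total counts."""
    total = sum(raw_counts)
    return [c / total * scale for c in raw_counts]


def log_normalize(raw_counts, scale=SCALE):
    """Value stored in X: ln(count / N * scale + 1)."""
    return [math.log1p(v) for v in normalize(raw_counts, scale)]


def undo_log(x_values):
    """Invert the natural-log transform: exp(x) - 1."""
    return [math.expm1(x) for x in x_values]


# Toy cell with four genes (hypothetical counts).
raw = [0, 3, 7, 10]
x = log_normalize(raw)
recovered = undo_log(x)

print(normalize(raw))  # normalized, not log transformed
print(recovered)       # same values, up to floating-point error
```

In other words, applying exp(x) - 1 to X should agree with the result of the steps above, assuming N was computed over the same set of genes that is present in the file.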

Instructions to read h5ad file in Matlab:

View file on figshare

A mat file of the complete lemur cell atlas dataset, converted from the h5ad file, is provided in the Figshare files. We also provide a Matlab script to convert an h5ad file to a mat file: download the h5ad file of interest, the Matlab script “LCA_h5ad2Mat.m”, and the Matlab function “read_csmatrix.m” to the same folder, and run “LCA_h5ad2Mat.m”.

Instructions to read h5ad file in R:

View file on figshare

We provide Python and R scripts to convert a Python h5ad file into an R Seurat object. Download the h5ad file of interest and the Python script “LCA_h5ad2csv.py” to the same folder, then run from the command line << python LCA_h5ad2csv.py -i input.h5ad -o output_folder -c layer_id >>, where input.h5ad is the h5ad file of interest, output_folder is the folder to which the csv files will be exported, and layer_id selects the gene count matrix to export: layer_id = raw_counts exports the raw data (adata.layers['raw_counts']), and layer_id = log exports the log-transformed data (adata.X). Then run “LCA_csv2seurat.R” in R to create a Seurat object from the csv files.
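
For orientation, the three documented options could be parsed as in the sketch below. This is not the contents of “LCA_h5ad2csv.py”. It only illustrates the -i, -o, and -c options and how layer_id maps to a matrix; the function names and the long option names are assumptions.

```python
import argparse


def build_parser():
    """Parser mirroring the documented -i / -o / -c options."""
    parser = argparse.ArgumentParser(description="Export an h5ad file to csv (sketch)")
    parser.add_argument("-i", "--input", required=True, help="h5ad file of interest")
    parser.add_argument("-o", "--output", required=True, help="folder for the csv files")
    parser.add_argument("-c", "--layer", required=True, choices=["raw_counts", "log"],
                        help="gene count matrix to export")
    return parser


def matrix_location(layer_id):
    """Return where the requested matrix lives inside the AnnData object."""
    if layer_id == "raw_counts":
        return "adata.layers['raw_counts']"
    return "adata.X"  # layer_id == "log"


# Parse the example command line from the text with layer_id = raw_counts.
args = build_parser().parse_args(
    ["-i", "input.h5ad", "-o", "output_folder", "-c", "raw_counts"]
)
print(args.input, args.output, matrix_location(args.layer))
```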

2. mat file (Matlab) “LCA_complete_wRaw_toPublish.mat”

View file on figshare

The mat file contains a single variable named “rawData”, a Matlab structure variable with the following fields:

  • cells: a table of the sequenced cells with per-cell metadata (the table columns include the “/obs” and “/obsm” entries listed above for the h5ad file, e.g., cell_name, tissue, free_annotation_v1, and X_umap, but not the MHC counts, which are stored in tabMHC; see below).
  • genes: gene table
    • name: NCBI gene symbol.
    • highly_variable: whether the gene is highly variable (calculated for the entire dataset).
  • mat_raw: a sparse matrix of the cell by gene transcript count (raw count).
  • mat_X: a sparse matrix of the cell by gene transcript level after library size normalization and natural log transformation (i.e., smartseq2, ln(reads/N *1e4 +1); 10x, ln(UMI/N *1e4 +1), where N denotes the total number of reads or UMI of the cell).
  • tabMHC: a table of the calculated counts for the major histocompatibility complex (MHC) genes (see the Tabula Microcebus manuscript for detail). Note that counts are available only for cells sequenced by the 10x method and are NaN for cells sequenced by the smartseq2 method. Both raw counts and normalized counts (labeled with the prefix letter ‘n’) are provided.
    • MHC_C_I, MHC_NC_I, MHC_all_II: sum of counts from classical Class I, non-classical Class I, and all Class II genes, respectively.
    • nMHC_C_I, nMHC_NC_I, nMHC_all_II: sum of normalized counts from classical Class I, non-classical Class I, and all Class II genes, respectively.
    • counts and normalized counts from individual classical Class I genes (Mimu_168, Mimu_W03, Mimu_W04, Mimu_249, nMimu_168, nMimu_W03, nMimu_W04, nMimu_249), non-classical Class I genes (Mimu_180ps, Mimu_191, Mimu_202, Mimu_208, Mimu_218, Mimu_229ps, Mimu_239ps, nMimu_180ps, nMimu_191, nMimu_202, nMimu_208, nMimu_218, nMimu_229ps, nMimu_239ps), and Class II genes (Mimu_DMA, Mimu_DMB, Mimu_DPA, Mimu_DPB, Mimu_DQA, Mimu_DQB, Mimu_DRA, Mimu_DRB, nMimu_DMA, nMimu_DMB, nMimu_DPA, nMimu_DPB, nMimu_DQA, nMimu_DQB, nMimu_DRA, nMimu_DRB). 
  • version: version of the data (name of the h5ad file from which the mat file was converted).