飘逸的小摩托 · Pandas 使用教程 CSV - CSV ...· 6 月前 · |
火爆的凳子 · 在Gorm查询中给主表起别名· 7 月前 · |
知识渊博的橡皮擦 · python logging模块 ...· 1 年前 · |
豪气的马克杯 · jQuery怎么取到每个tr里的第二个td_ ...· 1 年前 · |
大方的炒饭 · vue proxy ...· 1 年前 · |
卖萌的水桶
1 年前 |
where k = 1, …, K represents different networks. a k is the regression coefficient capturing the expression-TCR correlation for each network, e k is a random error. Key parameters, b and a k , are to be updated in the second step. In each iteration, the Dirichlet Process re-assigns each TCR into either an existing or a newly-built network, based on similarity of this TCR to the other TCRs, in order to reduce the regression error above. Therefore, after the first step, the network labels of the TCRs are updated.
In the parameter updating step, according to the newly-assigned networks we update within-network distances and for each network k. The center of each network is re-considered by drawing one from the TCRs of the network, following the probabilities inversely correlated to their d t s to the averaged embedding of all the TCRs in the network. The regression coefficient a k and the embedding weights b are updated according to their posteriors. We iteratively perform the two steps above, and through this process tessa essentially searches for the parameters that can maximize the correlation between and .
It is important to note that, during the estimation process, the same weight, b , is applied within networks and across networks. We hypothesize that some of the features of our 30-dim embedding could always be more important or less important for all TCRs. For example, the middle of CDR3s tend to bulge out and come into closer contact with the epitopes/MHCs, and therefore could be more important. This likely holds true for most, if not all, CDR3s. Therefore, a uniform weighting could likely find these features and scale up or down their influences for all TCRs. On the other hand, we also allowed some flexibility when correlating the transcriptomic features of the cells and the embedded TCRs, within each network. This is reflected by a k in the formula above. We adopted the so-called random effect model where we assumed the correlation, between expression and TCR, of each network, to closely follow the same population correlation, with a certain degree of network-specific deviance allowed. This ensures that a general rule is found to correlate expression and TCR, but the characteristics of different TCR networks are also taken into consideration.
A detailed description of the tessa model can be found in Supplementary Note 1 along with simulation and diagnostics.
Correlation between TCRs in embedding space and expression of T cells
For Fig. 4c , TCR embeddings adjusted by the weights estimated by tessa were used to calculate the pair-wise TCR distances. The TCR distances and expression distances were binned by every 5,000 TCRs with the closest TCR distances. For Extended Data Fig. 2 , the distance calculation method was the same as in Fig. 4c , but without the weight scaling by tessa. The TCR distances and expression distances were binned and ranked before plotting.
The single cell immune profiling datasets released by the 10x Genomics consist of single cell 5’ gene expression libraries, TCR sequencing libraries, and antigen binding affinity measurements of CD8 + T cells from 4 healthy donors. The antigen binding affinity between the TCR of one T cell and pMHCs is determined by measuring the number of short sequences (‘UMIs’) specifically counted for each one of the 44 pMHC dCODE™ Dextramers® under investigation. In the application note released by the company ( https://www.10xgenomics.com/resources/application-notes/a-new-way-of-exploring-immunity-linking-highly-multiplexed-antigen-recognition-to-immune-repertoire-and-phenotype/ ), their scientists validated the antigen specificities inferred from the UMIs by comparing the inferred pMHC-specific TCRs with those that have been confirmed with experiments in VDJdb ( https://vdjdb.cdr3.net/ ), and they found exactly matching and closely similar sequence pairs (their Fig. 5 and Table 1). In another report by the 10X Genomics on the same technology ( https://www.immudex.com/media/118671/tf119302-sitc-2018-immudex-poster-in-collaboration-with-10x-genomics-dcode-dextramer-technology.pdf ), their research team found that flow cytometry and their feature barcoding technology identify similar dCODE Dextramer®-binding cell populations (their Fig. 4 ). Strikingly, that figure shows that distributions of the flow cytometry intensities (top panel) and the UMI counts (bottom panel) closely resemble each other. Therefore, that figure proves that the UMI counts are rather quantitative (at least as quantitative as conventional flow cytometry), rather than qualitative.
Our methodology to assign antigen specificity basically follows that of the original 10x report. For a T cell to be called antigen specific for one pMHC, that pMHC’s UMI count has to be >=10 and it has to be the largest across all 44 pMHCs. To give a context for the cutoff of 10, the four 10X datasets also include pMHC UMI measurements of six irrelevant peptides as negative controls to assist the detection of specific binding events. Across all cells in the four datasets, 92~97% negative control UMIs are zeros, and the average negative control counts range from 0.05 to 0.16. Importantly, for Fig. 2e , the UMI counts numbers we showed are log-scaled “clone level” UMI counts. For clones that have multiple cells sharing the same TCR, we calculated median UMI counts as clone level UMIs. It is very common for a clone to have, say, 20 cells, but only 10 cells have a specific pMHC that can be assigned according to the rule above. One clone of T cells has the same TCR, so it’s unlikely for these T cells to have different antigen specificity. In such cases, the antigen specificity for this T cell clone will be assigned according to these 10 cells’ antigen specificity (a >90% concordance has to be reached for these 10 cells). Importantly, the “clone”-level UMI counts reported in Fig. 2e will be the median of all T cells’ UMI counts in each clone. We cannot take max of the cell-level UMI counts for each clone, as bigger clones will have higher UMI counts just due to sampling size, which will bring bias into the analysis.
We analyzed the two antigen-specificity datasets (Dash and Glanville) 10 , 11 , which provided 276 and 207 TCR sequences with known antigen specificity. In these studies, single T cells from healthy donor PBMCs with known HLA types and infections of common viruses were incubated with engineered pMHCs and sorted with FACS before obtaining TCR sequences from a series of nested PCRs. Unlike scRNA-seq, these T cells do not have matched expression data. Therefore, for these two datasets, we performed hierarchical clustering based on the scaled TCR embeddings with weights learned from the single cell sequencing datasets (the b used for scaling is an average of the b s from all the single T cell sequencing datasets in Extended Data Fig. 3 ). The clustering also resulted in TCR networks that are similar to the TCR networks detected by tessa. Different tree height cutoffs were employed to test the stability of the results. We randomized the cluster labels and performed the same calculation 10,000 times to examine whether the clustering purity was achieved by chance. P-values were calculated as the number of trials that achieved a higher purity than the true hierarchical clustering results, divided by 10,000.
We identified the ‘neighbours’ for each of the TCR clones in the post-1 and the post-2 subgroups in Fig. 3c , ,d. d . For each clone, we calculated the tessa-weighted TCR distances between that clone and all the other clones with different TCRs from the same patient, and we selected the clone with the smallest TCR distance as the ‘neighbour’ of the previous clone. We counted the number of the neighbours that belong to each subgroup (pre-1, pre-2, post-1, and post-2), and divided those numbers by the total number of clones in that subgroup to obtain percentages.
In Fig. 3e - -g g and Extended Data Fig. 6ab , we first selected 11 previously established individual marker genes representing 5 key T cell function pathways, including naive T cell markers ( IL7R ), memory T cell markers ( CXCR3, GZMK ), activated T cell markers ( IFNG, TNF, FOS, JUN ), and exhausted T cell markers ( ITGAE, ENTPD1, GZMB, LAG3 ) defined by Yost et al 31 . We also examined the differentially expressed genes between post-1 cells and post-2 cells from responders. We identified TGFB1 as the top highly expressed gene in the post-1 cells that is related to immune pathways 35 . To increase the stability of our analyses, we expanded these individual genes to pathways by including the 13 genes that show the highest levels of positive correlations for each individual gene marker.
In Fig. 4e , the IL-2 signaling pathway #1 included 13 genes from Conley et al 38 and Cho et al 39 . The other four pathways, including the IL-2 pathway #2 (GSE39110_UNTREATED_VS_IL2_TREATED_CD8_TCELL_DAY3_POST_IMMUNIZATION_DN), the IFN-α/β pathway (GSE15930_STIM_VS_STIM_AND_IFNAB_24H_CD8_T_CELL_DN), the IL-12 pathway #1 (GSE22443_NAIVE_VS_ACT_AND_IL12_TREATED_CD8_TCELL_DN) and #2 (GSE13173_UNTREATED_VS_IL12_TREATED_ACT_CD8_TCELL_DN), were selected from version 7.0 of the molecular signature database (MSigDB) ( http://www.broadinstitute.org/gsea/msigdb/index.jsp ): the c7 immunologic signatures. The two negative control pathways were generated with 200 randomly selected genes from all unique genes included in the c7 immunologic signatures. These selected genes in each pathway were shown in Supplementary Table 2 . To determine the pathway activity scores, we normalized the RNA-expression raw counts by dividing raw counts of each gene and each cell by the sum of raw counts of each corresponding cell. The normalized expression values of the genes belonging to the same pathway were then log scaled, summed for each cell, and served as the pathway activity score in that cell. The activity scores of each pathway were scaled by their tenth roots for a better representation.
For the CD8 + T cells from BCC samples, the top 10% genes with the highest expression variations across all cells were used to calculate the diffusion components. The R package ‘ destiny ’ (version 3.0.1) was used to compute a neighborhood graph using 40 neighbors and the first 20 principal components. We then employed SCINA 45 to detect naive CD8 + T cells with the marker gene IL7R and five genes with the highest correlation with IL7R. Three randomly selected naive T cells were used as the root cell for diffusion pseudotime prediction with the ‘ dpt ’ function in the ‘ destiny ’ package using all 20 diffusion components and a window width of 0.1 to decide the branch cutoff.
As described before in tessa, the TCRs were grouped into K networks. In each network, the TCR distances between the center TCR and the non-center TCRs were defined as d t , their expression distances were defined as d e . tessa assumed a linear regression relationship between d t and d e , which is
where k = 1, …, K represents the k-th network. We defined the unexplained variations as,
The unexplained variations were calculated separately for each of the networks in each dataset.
In Extended Data Fig. 4 we performed a series of benchmarking analysis using GLIPH 10 . We performed the analysis on six datasets including the four Healthy-CD8 datasets from 10x Genomics, the Glanville 10 dataset, and the Dash dataset 11 ( Supplementary Table 1 ). The command ‘ gliph --tcr TCR_TABLE --gccutoff = n ’ was used to generate clusters from the TCR sequences of these datasets. We adjusted the value of the “ gccutoff ” parameter from 0.5 to 5 with a step-length of 0.5 and calculated the ‘network purities’ for each choice of the parameter.
All computations and statistical analyses were carried out in the R computing environment (version 3.5.1). We employed SCINA 45 to detect the CD8 + T cells and CD4 + T cells from single T cell sequencing data, based on two gene signatures that are genes specifically expressed in the CD8 + T cells and the CD4 + T cells, respectively. Within each single cell dataset to be analyzed, we defined the CD8 gene signature as the 10 genes with the highest correlation with CD8A, and the CD4 gene signature as top 10 genes most highly correlated with CD4. For all boxplots appearing in this study, box boundaries represent interquartile ranges, whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile range, and the line in the middle of the box represents the median. The t-SNE analysis was performed with the ‘ Rtsne ’ package (version 0.15). Specifically, for Fig. 3ab and Fig. 4ab , we used the RNA expression of the T cells as the input. For Fig. 1e and Extended Data Fig. 5ab , we used the embedded TCR sequences as the input. PCA preprocessing was applied to both types of data, and the first 50 Principle Components (the default parameter of the function ‘Rtsne’ ) were employed to calculate the 2-dimensional (default) t-SNE representations, and they were plotted as principles ‘tSNE-1’ and ‘tSNE-2’. We applied Pearson correlation tests for all correlation analyses. Student’s T-test with two tails was used to calculate all the P-values (unless otherwise specified). The function ‘ geom_smooth ’ (method=‘lm’) in the package ‘ ggplot2 ’ (version 3.1.0) was applied to calculate the regression trend lines and 95% confidence intervals. The one-sided jonckheere trend test was applied to calculate the P-value in the analysis of Fig. 2e , with the function ‘ jonckheere.test ’ in the package ‘ clinfun ’ (version 1.0.15). The hierarchical clustering was performed with the ‘hclust’ function (method = ‘manhattan’) from the package ‘stats’ .
The bulk RNA-Seq datasets used for deriving TCRs and then for the auto-encoder training are publicly available at https://gdc.cancer.gov/about-data/publications/panimmune (TCGA 23 ), https://www.iedb.org/database_export_v3.php (IEDB), and http://friedmanlab.weizmann.ac.il/McPAS-TCR/ (McPAS 25 ). We made the Kidney-bulkRNA 24 dataset available in csv format at https://github.com/jcao89757/TESSA/tree/master/Tessa_released_data . All scRNA-seq/TCR-seq datasets are publicly available. The NSCLC-1 and healthy-PBMC-1 datasets are available on the 10X website https://support.10xgenomics.com/single-cell-vdj/datasets/2.2.0 . The healthy-CD8 1-4 datasets are available on https://www.10xgenomics.com/resources/application-notes/a-new-way-of-exploring-immunity-linking-highly-multiplexed-antigen-recognition-to-immune-repertoire-and-phenotype/ . The healthy-PBMC-2 dataset is also available on the 10X website https://support.10xgenomics.com/single-cell-vdj/datasets/3.0.0 . The NSCLC-2 26 , CRC 27 , and HCC 28 datasets are downloaded from the European Genome-Phenome Archive (EGA) under accession numbers, EGAS00001002430, EGAS00001002791, and EGAS00001002072, respectively. The Breast 1-5 29 datasets are available on the Gene Expression Omnibus (GEO) under accession numbers, {"type":"entrez-geo","attrs":{"text":"GSE114727","term_id":"114727"}} GSE114727 and {"type":"entrez-geo","attrs":{"text":"GSE114724","term_id":"114724"}} GSE114724 . The Melanoma 30 , BCC 31 and ECCITE-seq 16 datasets are also on the GEO database under study numbers, {"type":"entrez-geo","attrs":{"text":"GSE123139","term_id":"123139"}} GSE123139 , {"type":"entrez-geo","attrs":{"text":"GSE113590","term_id":"113590"}} GSE113590 and {"type":"entrez-geo","attrs":{"text":"GSE126310","term_id":"126310"}} GSE126310 . The Glanville 10 dataset is downloaded from https://doi.org/10.1038/nature22976 . The Dash 11 dataset is available in the NCBI Sequence Read Archive (SRA) under accession number SRP101659. The details of the data used, including sample size, role in the analysis, and references, are shown in Supplementary Table 1 . All scRNA-seq data were involved in Fig. 2 (directly or indirectly mentioned), the BCC scRNA-seq data were used in Fig. 3 , and all scRNA-seq data were used in Fig. 4 .
The tessa model: https://github.com/jcao89757/tessa (doi: 10.5281/zenodo.4161819 ) 46
The SCINA model: https://github.com/jcao89757/SCINA (doi: 10.3390/genes10070531 ) 45
Please refer to Life Sciences Reporting Summary regarding detailed information on experimental design.
Supplementary Note 1 Detailed description of tessa, along with simulation and diagnostic analyses
Supplementary Note 2 More bioinformatics analyses and discussion of tessa
Supplementary Table 2 The genes in the T cell pathways used in this study.
We would like to thank Dr. LHR Xu for his valuable input on the manuscript writing. This study was supported by the National Institutes of Health (NIH) [CCSG 5P30CA142543/TW, R15GM131390/XW], and Cancer Prevention Research Institute of Texas [CPRIT RP190208/TW].
COMPETING INTERESTS
The authors declare no conflicts of interest.
火爆的凳子 · 在Gorm查询中给主表起别名 7 月前 |