Sparse latent factor regression analysis was applied to each dataset [38,39,40,41]. This reduces the dimensionality of the complex gene expression array dataset, on the assumption that many of the probe sets on the expression array chip are highly interrelated (targeting the same genes or genes in the same pathways). Dimension reduction is performed by constructing factors (groups of genes with related expression values). These factors are used in a sparse linear regression framework to explain the variation seen in all of the probe sets. By default, most of the coefficients in this linear regression are zero. Thus, a small number (e.g., 50) of factors explain the variation seen in any single dataset. Factor loadings are defined as the coefficients of the factor regression, and, to explore the biological relevance of any particular factor, we examine the genes that are “in” that factor, that is, the genes that show significantly non-zero factor loadings. “Factor scores” are defined as the vector that best describes the co-expression of the genes in a particular factor. Both factor loadings and factor scores are fit to the data concurrently, and the full details of the process can be found in the supplementary statistical analysis section. While 50 factors were used for the results reported here, we also considered 20, 30 and 40, with minimal effect on the significant factor loadings.
Notably, the initial models built to determine factors that distinguish symptomatic infected individuals from asymptomatic individuals were derived using an unsupervised process (i.e., the model classified subjects based on gene expression pattern alone, without a priori knowledge of infection status). Our statistical model is unsupervised, and thus seeks to describe the statistical properties of the expression data without using labeled data.
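To make the structure of such a model concrete, the following minimal Python sketch simulates expression data of the form X = A S + E, where A is a sparse matrix of factor loadings, S holds the factor scores, and E is noise, and then lists the genes that are “in” a factor. It illustrates the structure of the model only, not the fitting procedure used in this analysis; the function names, matrix sizes, and noise level are hypothetical choices made for illustration.

```python
import random


def simulate_sparse_factor_data(n_genes=200, n_factors=5, n_samples=30,
                                genes_per_factor=10, noise_sd=0.1, seed=0):
    """Draw toy data from X = A S + E with a sparse loading matrix A."""
    rng = random.Random(seed)

    # Loadings A (genes x factors): zero except for a few genes per factor.
    loadings = [[0.0] * n_factors for _ in range(n_genes)]
    for k in range(n_factors):
        for g in rng.sample(range(n_genes), genes_per_factor):
            loadings[g][k] = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 2.0)

    # Scores S (factors x samples): one value per factor per sample.
    scores = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
              for _ in range(n_factors)]

    # Expression X (genes x samples): linear combination of factors plus noise.
    expression = [
        [sum(loadings[g][k] * scores[k][j] for k in range(n_factors))
         + rng.gauss(0.0, noise_sd)
         for j in range(n_samples)]
        for g in range(n_genes)
    ]
    return loadings, scores, expression


def genes_in_factor(loadings, k, threshold=0.0):
    """Indices of genes whose loading on factor k exceeds threshold in magnitude."""
    return [g for g, row in enumerate(loadings) if abs(row[k]) > threshold]


loadings, scores, expression = simulate_sparse_factor_data()
members = genes_in_factor(loadings, k=0)
print(len(members), "of", len(loadings), "genes load on factor 0")
```

In a real analysis the loadings and scores are unknown and must be estimated jointly from the expression matrix; a significance criterion on the estimated loadings would then take the place of the simple magnitude threshold assumed here.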
Such unsupervised algorithms may uncover statistical characteristics that distinguish symptomatic and asymptomatic subjects, but this relationship is inferred a posteriori. The unsupervised models are not explicitly designed to perform classification. The specific unsupervised model employed here corresponds to Bayesian factor analysis. This model represents the gene-expression values of each sample as a linear combination of factors. Within the model we impose that each factor is sparse, meaning that only a relatively small fraction of the genes have non-zero values in the factor loading. This sparseness seeks to map each factor to a biological pathway by identifying genes that are co-expressed, and each pathway is assumed to be represented by a small fraction of the total number of genes. The number of factors appropriate for the data is inferred using a statistical tool termed the beta process [15]. We have found that, for the virus data considered here, the factor score associated with one of these factors is a good marker as to
Figure S3 Cross-validation of H1N1 (Top) and H3N2 (Bottom) derived factors. (PDF)
Figure S4 Genes comprising the discriminative Factor for Influenza infection are involved in canonical antiviral pathways, such as the STAT-1 dependent portions of Interferon-response and dsRNA-induced innate signaling depicted here (top), and the IRF-7 and RIG-I, MDA-5 dependent portions of Interferon-response and ssRNA-induced innate signaling (bottom, www.genego.com). Pathways impacted by genes from the discriminative Factors are marked with a red target symbol. (PDF)
Figure S Temporal development of the comb.