This is a tutorial and brief overview of different approaches to infer or adjust for cell proportions using DNA-methylation data inferred from peripheral blood.

**Download the example data set:**

(1) Visit this website https://www.microsoft.com/en-us/download/details.aspx?id=52345.

(2) Click on “Download”. This will download a file by the name **FaSTLMM.Py.102.zip.**

(3) Unzip the file and the example data can be found in …\FaSTLMM.Py.102\FastLmm.Py\doc\FastLmm-EWASher\demo, by the name

– **input_data.txt** ( 27k DNA-methylation data on 204 subjects)

– **input_phenotype.txt** (cancer status of 204 subjects)

– **covariates.txt** ( other covariates related to 204 subjects)

**Methods free of reference data:**

## RefFreeEWAS

RefFreeEWAS [1] does not need any external validation data sets and has the potential to adjust for cell mixture arising from any other tissue in addition to blood.This method is similar to surrogate variable analysis and it uses permutation based methods to draw statistical inferences.

Guide:

(a) Install the **CRAN** package “**RefFreeEWAS” **in **R** using

install.packages("RefFreeEWAS")

(b) Use the following **R** code to implement **RefFreeEWAS**

library(RefFreeEWAS) # Replace below "..." with the directory containing the data y = read.table("...\input_data.txt",sep="\t",header=T) y1 = y[,-1];rownames(y1) = y[,1]; miss = which(is.na(y1),arr.ind=TRUE)[,1]; y1 = y1[-miss,]; edata = as.matrix(log2(y1/(1-y1)); covar1 = read.table("input_phenotype.txt",header=FALSE); covar2<-read.table("covariates.txt",header=FALSE) x1 = as.numeric(covar1[,2]); x2<-as.numeric(covar2[,3]) x3<-as.numeric(covar2[,4]) x4<-as.numeric(covar2[,5]) x5<-as.numeric(covar2[,6]) x6<-as.numeric(covar2[,7]) tmp1 = lm(t(edata)~x1); dim3 = EstDimRMT(cbind(t(coef(tmp1)), t(resid(tmp1))), FALSE)$dim; test1 = RefFreeEwasModel(y2,cbind(1,x1,x2,x3,x4,x5,x6),dim3); testBoot1 = BootRefFreeEwasModel(test1, 50); res1 = summary(testBoot1); est1 = res1[, 2, 1, 'mean']; est.sd1 = res1[, 2, 1, 'sd']; Q21 = (est1/est.sd1)^2; numCov = 1; P01 = 2*(pt(-sqrt(Q21),(204-(dim3+6)))); fdr.pvalue = p.adjust(P01, method = "fdr", n = length(P01)); i = which(fdr.pvalue <= 0.05); est_ewas1 = est1[i]; sd_ewas1 = est.sd1[i]; P0_ewas1 = P01[i]; fdr_ewas1 = fdr.pvalue1[i]; cpg_ewas1 = cbind(est_ewas1,sd_ewas1,P0_ewas1,fdr_ewas1); write.table(cpg_ewas1,file="...\filname.csv",sep=",");

**filname.csv** will have list of CpG found to be significantly associated with the primary covariate (contained in file “input_phenotype.txt”) after adjusting for ** cellular heterogeneity** using

**RefFreeEWAS.**

## SVA

Surrogate variable analysis (SVA) estimates potential confounding factors based on singular value decompositions (SVD) of residuals [2].

Guide:

(a) Install the bioconductor package “**sva**“, “**limma**” and CRAN package “**MASS**” in **R** using:

source("https://bioconductor.org/biocLite.R"); biocLite(c("sva","limma")); install.packages("MASS");

(b) Use the following **R** code to implement **SVA**

lib = c("MASS","sva","limma"); lapply(lib, require, character.only = TRUE); mod1<-model.matrix(~x1+x2+x3+x4+x5+x6) mod01<-model.matrix(~x2+x3+x4+x5+x6) svobj1= sva(edata,mod1,mod01,n.sv=NULL,method="two-step"); modSv1 = cbind(mod1,svobj1$sv); fit1 = lmFit(edata,modSv1,method="robust"); fite1 = eBayes(fit1); tab1 = topTable(fite1, coef = "x1",number=length(y1[,1]), p.val=0.05,adjust = "fdr"); write.table(tab1,file="...\filename.csv",sep=",")

**filname.csv** will have list of CpG found to be significantly associated with the primary covariate (contained in file “input_phenotype.txt”) after adjusting for ** cellular heterogeneity** using

**SVA.**

## FaST-LMM-EWASher

FaST-LMM-EWASher [3] applies the maximum likelihood approach in linear mixed models and optimize spectral decomposition to estimate cell types [4].

Guide:

(a) The instruction to download data above will also download the **python** version of **FasT-LMM-EwaSher **method. The python code of this method uses the following libraries “numpy”, “scipy”, “scikit-learn” and “pylab”. The following website https://datanitro.com/blog/python_on_windows provides a tutorial to download and install these libraries.

(b) Follow the instruction in **README.txt** file contained in “**…\FaSTLMM.Py.102\FastLmm.Py\doc\FastLmm-EWASher**” directory (this directory is part of the FasT-LMM-EWASher package downloaded above). To run FaST-LMM-EWASher on the data set (obtained above), go to windows command line (You can get to the command line by typing “cmd” in the Windows search bar), navigate to the demo directory (“…\FaSTLMM.Py.102\FastLmm.Py\doc\FastLmm-EWASher\demo”) and then use:

python ..\..\..\fastlmm\ewasher\src\fastlmm-ewasher.py input_data.txt input_phenotype.txt -covar covariates.txt

FaST-LMM-EWASher creates two output folders, results/ and tmp/. The final results are in the folder **results/** that contains a file named **out_ewasher.txt**, which includes P-values and test statistics for each CpG site computed by the method in FaST-LMM-EWASher.

**Methods that require reference data:**

## Houseman et al.

Houseman et al. [5] developed a method for cell type correction that capitalizes on the idea that differentially methylated regions (DMRs) can serve as a signature for the distribution of different types of white blood cells. It uses these DMRs as a surrogate in a regression calibration based technique to identify the cell mixture distribution. Regression calibration technique can lead to bias estimate, thus external validation data is used to calibrate the model and to correct for the bias [6]. Validation data set used by Houseman et al. method consists of DNA-methylation data and sorted cell types of 46 white blood cell samples from Infinium HumanMethylation27 Beadchip and AllCells^{ Ⓡ }, LLC (Emeryville, CA) respectively. Their method was specifically for the Illumina 27k beadchip array.

Guide:

(a) Visit this website

https://drive.google.com/folderview?id=0B6IN3-9RV-LBeUt5bENMSXlsY2s&usp= sharing

to download all the relevant files

(b) Specify the input data in the **R** code by the name “**Rcodes_Cell_mixture.R**” on the line that reads following:

beta=t(as.matrix(read.table(".../input_data.txt",sep="\t",header=T)))

(c) Run the above R code by using

source("Rcodes_Cell_mixture.R")

inside **R** console (make sure the working directory contains all of the files downloaded in step (a).(d) This program will create a file by the name “Ewasher_Houseman_cell.csv”, which will have cell proportions for the 204 subjects contained in the example data set obtained above.(e) The estimated cell proportions will be used as adjusting co-variate in a linear model described below:

library(limma); library(MASS); cell = read.csv("Ewasher_data_housecell.csv",header=T); cell1 = cell[,2];cell2 = cell[,3];cell3 = cell[,4];cell4 = cell[,5]; cell5 = cell[,6];cell6 = cell[,7] mod11 = model.matrix(~x1+x2+x3+x4+x5+x6+cell2+cell3+cell4+cell5+cell6); fit1 = lmFit(y12,mod11,method="robust"); fite1 = eBayes(fit1); tab1 = topTable(fite1, coef = "x1",number=length(y1[,1]), p.val=0.05,adjust = "fdr"); write.table(tab1,file="Ewasher_data_houseman_CpG.csv",sep=",");

The file **Ewasher_data_houseman_CpG.csv **will have a list of CpG sites whose association with primary covariate **x1** (input_phenotye.txt) is adjusted for underlying cell proportions.

## Extended Houseman et al.

The method by Jaffe and Irizarry [7] was adapted from the Houseman et al. [5] method and is tailored for Illumina450k along with 27k array. The algorithm in Houseman et al. identifies 500 CpG sites used to estimate cell mixture proportions from the Illumina 27k array. The modification of Jaffe and Irizarry was motivated because of the existence of probe SNPs in the 500 CpG sites and the inconsistency of CpG sites between the 27k and 450k arrays. In addition, the flow-sorted data of the six adult male subjects were used as references [8] .

Guide:

(a) Install the required packages in **R** using:

source("https://bioconductor.org/biocLite.R") biocLite(c("minfi","quadprog","FlowSorted.Blood.450k", "IlluminaHumanMethylation450kmanifest", "IlluminaHumanMethylation450kanno.ilmn12.hg19"));

(b) Below is the R code to obtain cell estimates for the example data set:

lib = c("minfi","quadprog","FlowSorted.Blood.450k", "IlluminaHumanMethylation450kmanifest", "IlluminaHumanMethylation450kanno.ilmn12.hg19"); lapply(lib, require, character.only = TRUE); grSet1=read.table("input_data.txt",header=T); grSet=grSet1[,-1]; rownames(grSet)=grSet1[,1]; grSet=data.matrix(grSet); referenceMset = get('FlowSorted.Blood.450k.compTable'); cell = c("CD8T","CD4T", "NK","Bcell","Mono","Gran","Eos"); compData = minfi:::pickCompProbes(referenceMset, cellTypes=cell); coefs = compData$coefEsts; coefs = coefs[ intersect(rownames(grSet), rownames(coefs)), ]; rm(referenceMset); counts = minfi:::projectCellType(grSet[rownames(coefs), ], coefs); rownames(counts) = colnames(grSet); write.table(counts,file="Ewasher_data_minficell.csv",sep=",");

The file by the name **Ewasher_data_minficell.csv **will contain the cell proportions estimated using the approach extended from the **Houseman et al.** method and implemented in the **R** package **minfi.**

(c) The estimated cell proportions will be used as adjusting co-variate in a linear model described below:

cell = read.csv("Ewasher_data_minficell.csv",header=T); cell1 = cell[,1];cell2 = cell[,3];cell3 = cell[,4];cell5 = cell[,6]; cell6 = cell[,7];cell7 = cell[,8]; mod11 = model.matrix(~x1+x2+x3+x4+x5+x6+cell2+cell3+cell4+cell5+cell6+cell7); fit1 = lmFit(y12,mod11,method="robust"); fite1 = eBayes(fit1); tab1 = topTable(fite1, coef = "x1",number=length(y1[,1]), p.val=0.05,adjust = "fdr"); write.table(tab1,file="Ewasher_data_minfi_CpG.csv",sep=",");

The file **Ewasher_data_minfi_CpG.csv **will have a list of CpG sites whose association with primary covariate **x1** (input_phenotye.txt) is adjusted for underlying cell proportions.

## Refactor

ReFACTor, uses a variant form of principal component analysis (PCA) to adjust for the cell type effects. Principal components are based on top 500 most informative CpG sites. The identified principal components can be used in the model to adjust for the confounding effects of cell types.

Guide:

(a) Download the Refactor source code from following link: http://www.cs.tau.ac.il/~heran/cozygene/software/refactor.html

(b) We modified the code obtained from above to exclude the CpG sites with standard deviation in the lowest 5th percentile.

sd1=apply(y1, 1, sd, na.rm = F) include = which(sd1>=quantile(sd1,0.05)) O = y1[include,] edata1 = edata[include,] cpgnames <- rownames(O) for (site in 1:nrow(O)) { model <- lm(O[site,] ~ x1+x2+x3+x4+x5+x6) O_adj[site,] = residuals(model) } O = O_adj print('Running a standard PCA...') pcs = prcomp(scale(t(O))); coeff = pcs$rotation score = pcs$x print('Compute a low rank approximation of input data and rank sites...') x = score[,1:7]%*%t(coeff[,1:7]); An = scale(t(O),center=T,scale=F) Bn = scale(x,center=T,scale=F) An = t(t(An)*(1/sqrt(apply(An^2,2,sum)))) Bn = t(t(Bn)*(1/sqrt(apply(Bn^2,2,sum)))) # Find the distance of each site from its low rank approximation. distances = apply((An-Bn)^2,2,sum)^0.5 ; dsort = sort(distances,index.return=T); ranked_list = dsort$ix print('Compute ReFACTor components...') sites = ranked_list[1:500]; pcs = prcomp(scale(t(O[sites,]))); first_score <- score[,1:7]; score = pcs$x mod=model.matrix(~x1+x2+x3+x4+x5+x6+first_score) fit1<-lmFit(edata1,mod,method="ls") fite1<-eBayes(fit1) tab1 <- topTable(fite1, coef = "x1",number=length(y1[,1]), p.val=0.05,adjust = "fdr") write.table(tab1,file="Ewasher_data_Refactor_results.csv",sep=",",row.names=T)

## RefFreeCellMix

RefFreeCellMix is a function available in R package RefFreeEWAS. It uses a variant of non-negative matrix factorization to decompose the total methylation sites into CpG-specific methylation states for a pre-specified number of cell types and subject-specific cell-type distributions.

Guide:

(a) Install R-package RefFreeEWAS version 2.0

(b) Use the code below:

library(RefFreeEWAS) cell<-RefFreeCellMix(y2,mu0=NULL,K=7,iters=5,Yfinal=NULL,verbose=TRUE) mod1<-model.matrix(~x1+x2+x3+x4+x5+x6+cell$Omega) fit1<-lmFit(y2,mod1,method="robust") fite1<-eBayes(fit1) tab1 <- topTable(fite1, coef = "x1",number=length(O[,1]), p.val=0.05,adjust = "fdr") write.table(tab1,file="Ewasher_data_RefFreeCellMix_results.csv", sep=",",row.names=T)

## RUV

The method of removing unwanted variation (RUV) uses information from reference database, but it does not estimate cell type proportions. Instead, this approach bases on the information of negative control probes and performs factor analysis on these probes to identify factors due to unmeasured confounders. These factors are then included in subsequent analyses to adjust for cell type effects.

Guide:

(a) Top 600 CpG sites associated with the blood cell types for 27k example data was obtained from the link: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-86 under the heading Electronic supplementary material. It can also be downloaded from google drive link mentioned under the Houseman’s method, the name of file is “WBC-Analysis-110319.RData”.

(b) Code below identifies principal components related to the dna-methylation of 600 CpG sites.

library(data.table) library(psych) library(pracma) beta<-read.csv("Top_600_Ewasher_27k.csv",header=T) beta1=beta[,-1] rownames(beta1)=beta[,1] beta1_t<-t(beta1) pca_mval<-prcomp(beta1_t,center=T,scale.=T) plot(pca_mval, type = "l") ###Based upon scree plot we can choose number of PC's## rawLoadings <- pca_mval$rotation[,1:7] %*% diag(pca_mval$sdev, 7, 7) rotatedLoadings <- varimax(rawLoadings,normalize = TRUE, eps = 1e-5)$loadings invLoadings <- t(pracma::pinv(rotatedLoadings)) scores <- scale(beta1_t) %*% invLoadings ###x1,x2,x3,x4,x5 and x6 are primary and secondary covariates, ###PC1,PC2, PC3 and PC4 are principal components ############################################################## mod<-model.matrix(~x1+x2+x3+x4+x5+x6+PC1+PC2+PC3+PC4) fit<-lmFit(edata,mod,method="robust") fite<-eBayes(fit) tab <- topTable(fite, coef = "x1",number=length(edata[,1]), p.val=0.05,adjust = "fdr")

**References**:

(1) Houseman EA, Molitor J, Marsit CJ: **Reference-free cell mixture adjustments in analysis of DNA methylation data**. *Bioinformatics (Oxford, England) *2014, **30**(10):1431-1439.

(2) Leek JT, Storey JD: **Capturing heterogeneity in gene expression studies by surrogate variable analysis**. *PLoS genetics *2007, **3**(9):e161.

(3) Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J: **Epigenome-wide association studies without the need for cell-type composition**. *Nature methods *2014, **11**(3):309-311.

(4) Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D: **FaST linear mixed models for genome-wide association studies**. *Nature methods *2011, **8**(10):833-835.

(5) Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, Wiencke JK, Kelsey KT: **DNA methylation arrays as surrogate measures of cell mixture distribution**. *BMC bioinformatics *2012, **13**:86.

(6) Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM: **Measurement error in nonlinear models: a modern perspective**: CRC press; 2012.

(5) Jaffe AE, Irizarry RA: **Accounting for cellular heterogeneity is critical in epigenome-wide association studies**. *Genome Biol *2014, **15**(2):R31.

(6) Reinius LE, Acevedo N, Joerink M, Pershagen G, Dahlen SE, Greco D, Soderhall C, Scheynius A, Kere J: **Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility**. *PloS one *2012, **7**(7):e41361.

(7) Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J: **Epigenome-wide association studies without the need for cell-type composition**. *Nature methods *2014, **11**(3):309-311.

(8) Lippert C, Listgarten J, Liu Y, Kadie CM, Davidson RI, Heckerman D: **FaST linear mixed models for genome-wide association studies**. *Nature methods *2011, **8**(10):833-835.