PgmNr P2095: Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster.

Authors:
Y. Lin 1 ; K. Golovnina 2 ; Z.-X. Chen 2 ; H. Lee 2 ; Y. Serrano Negron 1 ; H. Sultana 2 ; B. Oliver 2 ; S. Harbison 1


Institutes:
1) National Heart, Lung, and Blood Institute, Bethesda, MD; 2) National Institute of Diabetes and Digestive and Kidney Diseases, Bethesda, MD.


Abstract:

To determine whether heritable differences in gene expression could be detected among individual flies, we performed a multi-factor experiment using RNA extracted from 768 flies. We harvested RNA from individual flies of 16 inbred lines from the Drosophila Genetic Reference Panel. These flies were reared in three separate biological replicates. The RNA was successfully sequenced for more than 98% of the flies, and the genotype and sex of each sample were verified using a ‘bar code’ and Spearman correlation, respectively. After this verification, 726 samples remained for further analysis. To identify the optimal approach for detecting differential gene expression among genotype, sex, environment, and their interactions, we investigated the effects of three filtering strategies, eight normalization methods, and two statistical approaches. We assessed differential gene expression among factors and also performed a statistical power analysis using the eight biological replicates per genotype, environment, and sex in our data set. We found that two to five biological replicates were required for adequate statistical power, depending on the factors analyzed. Some common normalization methods, such as Total Count, Quantile, and RPKM normalization, did not align the data across samples. Analyses applying the Median, Quantile, and Trimmed Mean of M-values normalization methods were sensitive to the removal or retention of genes with low expression in the data set. The two statistical approaches, a generalized linear model with a negative binomial distribution and an ANOVA model, yielded strikingly different results.
Our favored analysis approach was to normalize the read counts using the DESeq method, to apply a generalized linear model assuming a negative binomial distribution using either edgeR or DESeq software, and to remove genes with very low read counts after the statistical analysis.
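
As an illustration of the normalization step named above, the following is a minimal sketch of the median-of-ratios calculation that underlies DESeq size factors. It is not the authors' pipeline and does not call the DESeq or edgeR packages; the function names `size_factors` and `normalize` and the toy count matrix are hypothetical, and the choice to skip genes with a zero count in any sample is an assumption made so that the geometric mean is defined.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of genes, each a list of raw read counts per sample.
    Assumption: genes with a zero count in any sample are skipped,
    because their geometric mean across samples is zero.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c <= 0 for c in gene):
            continue
        # Log of the gene's geometric mean across all samples.
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # The size factor is the median ratio of a sample to the reference.
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts):
    """Divide each raw count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(gene, factors)] for gene in counts]


# Hypothetical toy data: sample 2 was sequenced twice as deeply as sample 1.
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 0]]
toy_factors = size_factors(toy_counts)
toy_normalized = normalize(toy_counts)
```

In this toy case the second size factor is twice the first, so the normalized counts of the two samples coincide for every gene. Note that the all-zero gene is kept through normalization rather than filtered beforehand, in line with the recommendation to remove genes with very low read counts only after the statistical analysis.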