Principal component analysis of the entire dataset (115 subjects). First, the data are filtered to retain genes whose coefficient of variation is at or above the 90th percentile. All samples are then plotted by their scores on the first two principal components. (A) Samples are identified by batch: batch 1 (black), batch 2 (red), batch 3 (green), batch 4 (blue), batch 5 (cyan), and batch 6 (magenta). (B) Samples are identified by disease severity (FVC%, see text): normal (black), mild disease (blue), moderate disease (green), severe disease (red), unknown (magenta); and by analytic subset: training set (open circles), validation set (closed squares). (C) Samples are again identified by disease severity, here by DLCO%; the color code is the same as in panel B. (D) Samples are identified by family history: normal (black), familial idiopathic pulmonary fibrosis (cyan), sporadic idiopathic pulmonary fibrosis (magenta).
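
The filtering and projection steps named in this caption can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `cv_filter_pca`, the genes-by-samples layout, the assumption of strictly positive expression values, the choice of SVD on gene-centered data, and the synthetic matrix are all assumptions made for the example.

```python
import numpy as np


def cv_filter_pca(expr, percentile=90.0, n_components=2):
    """Filter genes by coefficient of variation, then project samples onto PCs.

    expr: genes x samples array of positive expression values (assumed layout).
    Returns (scores, explained_ratio, keep_mask).
    """
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1)
    std = expr.std(axis=1, ddof=1)
    # CV = std / mean per gene; genes with zero mean are assigned CV 0.
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean != 0)
    # Keep genes whose CV is at or above the chosen percentile of all CVs.
    keep = cv >= np.percentile(cv, percentile)
    filtered = expr[keep]
    # Put samples in rows and center each retained gene across samples.
    X = filtered.T - filtered.T.mean(axis=0)
    # PCA via SVD: sample scores are U * S; variance is proportional to S**2.
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components], keep


# Synthetic stand-in data: 1000 hypothetical genes x 115 samples.
rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=50.0, size=(1000, 115))
scores, explained, keep = cv_filter_pca(expr)
```

Each row of `scores` gives one sample's coordinates on the first two principal components; coloring those points by batch, severity category, or family history would reproduce the layout of panels A to D under these assumptions.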