can prove only that (45.4) has a positive bias. But even this weaker result requires an additional assumption, for multivariate $\theta$, that the loss of information is the same for all components of $\theta$. This additional requirement for multivariate $\theta$ was both unexpected and troublesome, because in practice there is little reason to expect that the loss of information will be the same for different parameters.

All these complications vividly demonstrate both the need for and the challenges of the multi-phase inference framework. By multi-phase, our motivation is not merely that there are multiple parties involved, but more critically that the phases are sequential in nature. Each phase takes the output of its immediate predecessor as its input, but with little knowledge of how the other phases operate. This lack of mutual knowledge leads to uncongeniality, which makes any single-model framework inadequate, for the reasons stated before.

45.3.2 Data pre-processing, curation and provenance

Taking this multi-phase perspective but going beyond the MI setting, we (Blocker and Meng, 2013) recently explored the steps needed for building a theoretical foundation for pre-processing in general, with motivating applications from microarrays and astrophysics. We started with a simple but realistic two-phase setup, where for the pre-processing phase the input is $Y$ and the output is $T(Y)$, which in turn becomes the input of the analysis phase. The pre-processing is done under an "observation model" $P_Y(Y \mid X, \xi)$, where $X$ represents the ideal data we do not have (e.g., the true expression level for each gene), because we observe only a noisy version of it, $Y$ (e.g., the observed probe-level intensities), and where $\xi$ is the model parameter characterizing how $Y$ is related to $X$, including how noise was introduced into the observation process (e.g., background contamination). The downstream analyst has a "scientific model" $P_X(X \mid \theta)$, where $\theta$ is the scientific estimand of interest (e.g., capturing the organism's patterns of gene expression). To the analyst, both $X$ and $Y$ are missing, because only $T(Y)$ is made available; for example, $T(Y)$ could be a background-corrected, normalized, or aggregated version of $Y$. The analyst's task is then to infer $\theta$ based on $T(Y)$ alone.

Given such a setup, an obvious question is: what $T(Y)$ should the pre-processor produce or keep in order to ensure that the analyst's inference about $\theta$ will be as sharp as possible? If we ignore practical constraints, the answer seems rather trivial: choose $T(Y)$ to be a (minimal) sufficient statistic for
\[
P_Y(y \mid \theta, \xi) \;=\; \int P_Y(y \mid x;\, \xi)\, P_X(x \mid \theta)\, \mu(dx). \tag{45.7}
\]
(A worked instance of (45.7), and a simulation sketch of the two-phase pipeline, are given at the end of this section.) But this does not address the real problem at all. There are thorny issues in dealing with the nuisance (to the analyst) parameter $\xi$, as well as issues of computational feasibility and cost. But most critically, because of the separation of the phases, the scientific model $P_X(X \mid \theta)$ and hence the marginal
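To make (45.7) concrete, here is a minimal worked instance; the Gaussian forms below are illustrative assumptions, not part of the original development. Suppose the scientific model is $X_i \overset{\text{iid}}{\sim} N(\theta, \tau^2)$ and the observation model adds independent noise, $Y_i \mid X_i \sim N(X_i, \sigma^2)$, so that $\xi = \sigma^2$, with $\tau^2$ and $\sigma^2$ taken as known. The integral in (45.7) is then a convolution of Gaussians:
\[
P_Y(y_i \mid \theta, \xi) = \int \phi(y_i \mid x, \sigma^2)\, \phi(x \mid \theta, \tau^2)\, dx = \phi(y_i \mid \theta, \sigma^2 + \tau^2),
\]
where $\phi(\cdot \mid \mu, v)$ denotes the $N(\mu, v)$ density. For $n$ independent observations the marginal model is normal with known variance $\sigma^2 + \tau^2$, so $T(Y) = \bar{Y}$ is minimal sufficient for $\theta$: in this idealized case the pre-processor loses nothing about $\theta$ by passing the analyst a single number.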
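The two-phase setup can also be read as a small data-processing pipeline. The following Python sketch simulates the Gaussian instance above; every name and numerical value in it is an illustrative assumption, and it is a toy rendering of the setup rather than the method of Blocker and Meng (2013).

```python
import numpy as np

rng = np.random.default_rng(0)

# Scientific model P_X(X | theta): the ideal data the analyst cares about.
theta, tau = 2.0, 1.0   # estimand and its (assumed known) spread
# Observation model P_Y(Y | X, xi): what the pre-processor actually sees.
sigma = 0.5             # xi: (assumed known) observation-noise scale
n = 1000

# Nature: ideal data X, observed by no one.
X = rng.normal(theta, tau, size=n)

# Phase 1 (pre-processor): observes only the noisy Y and passes downstream
# a summary T(Y). Marginally Y | theta ~ N(theta, tau^2 + sigma^2), so the
# sample mean is a (minimal) sufficient statistic for theta.
Y = X + rng.normal(0.0, sigma, size=n)
T = Y.mean()

# Phase 2 (analyst): sees only T(Y), not X or Y, and infers theta.
theta_hat = T                          # MLE of theta given T(Y)
se = np.sqrt((tau**2 + sigma**2) / n)  # its standard error
print(f"theta_hat = {theta_hat:.3f} +/- {1.96 * se:.3f}")
```

Even this toy makes the real difficulty visible: the analyst's standard error depends on $\xi = \sigma^2$, a quantity only the pre-processor is positioned to know.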
