14.06.2013 Views

Databases and Systems



query on that band. As the maps in GDB increasingly come to span multiple levels of resolution all the way down to sequence features, errors such as these become more serious.

Nonlinear Alignment

There are a number of reasons why points in the dispersion plot do not fall on the line y=x. These include measurement error in mapping and mistakes made in mapping or data entry, as in the outliers in figure 2, which reflect dramatic disagreements between two maps as to the position of a locus. More importantly, distances in one map may have a nonlinear relationship to distances in another. For example, genetic and physical distances typically have a nonlinear relationship because recombination occurs more frequently near the middle of the chromosomal arms, causing a high ratio of genetic to physical distance in those regions, but a low ratio in the pericentromeric regions where recombination is inhibited. This nonlinear relationship between different types of map distances means that the regression line will be a more accurate transformation in some regions than in others.

One solution to the problem of nonlinearity is to use a nonlinear transformation function that warps the maps into better correspondence. The best transformation function would be the curve defined by the common markers, which we can approximate by a piecewise-linear function. However, the transformation must be monotonic (nondecreasing) if it is to preserve marker orders when transforming a map into universal coordinates, and we have seen how order discrepancies between maps can introduce outliers in the plots. To remove these, we select a maximal subset of the common markers that are order-consistent between the two maps. Such a subset is called a longest monotonic chain. A piecewise-linear function is then defined over that set of points. Figure 5 shows the dispersion for a set of maps aligned with the longest chain method as compared with linear regression. The reduction in dispersion relative to the linear transformation is apparent.
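The selection of a longest monotonic chain and the piecewise-linear function built on it can be sketched as follows. This is a minimal illustration, not the implementation used for GDB: the function names are hypothetical, the marker positions are made-up numbers, and the sketch assumes a strictly increasing chain (no two chain markers share a position on either map) so that every linear segment has a well-defined slope.

```python
from bisect import bisect_left, bisect_right


def longest_monotonic_chain(points):
    """Return a longest subset of (x, y) points that increases in both x and y.

    points: positions of the common markers, x on one map and y on the other.
    Assumption: the chain is strictly increasing in both coordinates.
    """
    # Sort by x; for equal x put larger y first so that two markers with the
    # same x can never both enter a chain that is strictly increasing in y.
    pts = sorted(points, key=lambda p: (p[0], -p[1]))
    tail_y = []    # tail_y[k]: smallest final y of any chain of length k + 1
    tail_idx = []  # index in pts of the point that ends that chain
    prev = [None] * len(pts)
    for i, (_, y) in enumerate(pts):
        k = bisect_left(tail_y, y)
        prev[i] = tail_idx[k - 1] if k > 0 else None
        if k == len(tail_y):
            tail_y.append(y)
            tail_idx.append(i)
        else:
            tail_y[k] = y
            tail_idx[k] = i
    # Walk back from the end of one longest chain (ties are broken arbitrarily).
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]


def piecewise_linear(chain):
    """Build a function that interpolates linearly between chain points."""
    if len(chain) < 2:
        raise ValueError("need at least two chain points")
    xs = [x for x, _ in chain]
    ys = [y for _, y in chain]

    def transform(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("position lies beyond the ends of the chain")
        j = min(bisect_right(xs, x), len(xs) - 1)  # right end of the segment
        x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return transform


# Made-up marker positions; (20, 50) is out of order relative to its neighbours.
markers = [(0, 0), (10, 12), (20, 50), (30, 25), (40, 41), (50, 60)]
chain = longest_monotonic_chain(markers)
to_other_map = piecewise_linear(chain)
print(chain)             # the order-inconsistent marker (20, 50) is dropped
print(to_other_map(20))  # interpolated between (10, 12) and (30, 25)
```

The chain search here is the standard longest-increasing-subsequence technique, which runs in O(n log n) time; it returns one longest chain and makes no attempt to choose among equally long alternatives.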

Special care must be taken at the ends of maps, where there may be loci that occur beyond the ends of the piecewise linear function. The function must be extrapolated to handle such points. Two obvious solutions are to use the regression line slope or the slope of the last linear segment; we have found the former to be more robust.
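The two extrapolation choices can be compared with a short sketch. The names below are hypothetical, the data are made up, and the regression slope is assumed to be an ordinary least-squares slope fitted to the chain points themselves; the source does not say which points the regression is fitted to.

```python
from bisect import bisect_right


def regression_slope(points):
    """Ordinary least-squares slope of y on x (assumed fitting method)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def extrapolating_transform(chain, end_slope):
    """Piecewise-linear map over chain, extended past both ends with end_slope.

    chain: (x, y) points sorted and strictly increasing in both coordinates.
    end_slope: slope used beyond the first and last chain points; it should be
    positive so that the extended function stays monotonic.
    """
    xs = [x for x, _ in chain]
    ys = [y for _, y in chain]

    def transform(x):
        if x < xs[0]:
            return ys[0] + end_slope * (x - xs[0])
        if x > xs[-1]:
            return ys[-1] + end_slope * (x - xs[-1])
        j = min(bisect_right(xs, x), len(xs) - 1)  # right end of the segment
        x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

    return transform


# Made-up chain whose last segment is much steeper than the overall trend.
chain = [(0, 0), (10, 10), (20, 40)]
last_segment_slope = (40 - 10) / (20 - 10)

by_regression = extrapolating_transform(chain, regression_slope(chain))
by_last_segment = extrapolating_transform(chain, last_segment_slope)
print(by_regression(25), by_last_segment(25))
```

In this toy case the last segment is steeper than the overall trend, so extending it places a locus at position 25 further out than the regression slope does. A single short or noisy end segment can dominate the extrapolation in this way, which is one plausible reading of why the regression slope was found to be more robust.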

Although longest chain piecewise linear alignment is clearly an improvement over linear regression, it can probably be improved further. The algorithm chooses arbitrarily among equally long longest chains; it should be possible to somehow average among all possible longest chains. Also, it might be worth considering other measures of length besides number of points, such as distance covered, or composite measures that include both number of points and distance spanned
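One way to experiment with such alternative length measures is a quadratic-time dynamic program that scores a chain by a weighted sum of its number of points and the distance it spans on the first map. This is only a sketch of the idea: the function name, the two weights, and the example positions are all hypothetical, and nothing here has been evaluated on real maps.

```python
def best_scoring_chain(points, point_weight=1.0, span_weight=0.0):
    """Return a monotonic chain maximising a composite score.

    score = point_weight * (number of points)
          + span_weight * (x of last point - x of first point)

    Both weights are hypothetical tuning parameters. With span_weight = 0
    this reduces to a longest chain by number of points.
    """
    pts = sorted(points, key=lambda p: (p[0], -p[1]))
    n = len(pts)
    if n == 0:
        return []
    score = [point_weight] * n  # best score of a chain ending at pts[i]
    prev = [None] * n
    for i in range(n):
        for j in range(i):
            # pts[j] may precede pts[i] only if both coordinates increase.
            if pts[j][0] < pts[i][0] and pts[j][1] < pts[i][1]:
                # Spans of consecutive steps add up to the total span.
                cand = (score[j] + point_weight
                        + span_weight * (pts[i][0] - pts[j][0]))
                if cand > score[i]:
                    score[i] = cand
                    prev[i] = j
    i = max(range(n), key=score.__getitem__)  # ties broken arbitrarily
    chain = []
    while i is not None:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]


# Made-up positions: a tight cluster of three markers conflicts with one
# distant marker at (100, 100).
markers = [(0, 0), (40, 200), (41, 201), (42, 202), (100, 100)]
by_count = best_scoring_chain(markers, point_weight=1.0, span_weight=0.0)
by_span = best_scoring_chain(markers, point_weight=0.0, span_weight=1.0)
print(by_count)  # four points, spanning 42 units
print(by_span)   # two points, spanning 100 units
```

The example shows how the two measures can disagree: counting points favours the tight cluster, while distance covered favours the two widely separated markers. A composite weighting would sit between these extremes.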
