* Least squares criterion (identical image intensities)
* Simple correlation (not identical but linearly correlated; good for monomodal images)
* Multimodal images:



  * MR (Magnetic Resonance)
  * CT (Computed Tomography)
  * PET (Positron Emission Tomography)
  * SPECT (Single Photon Emission Computed Tomography)



* Woods criterion (manual segmentation)
* Robust estimators (SPECT/MR)




Mutual Information 

Given two images X and Y, their joint probability density function (joint pdf) P( i, j ) can be defined by simple normalization of their 2D histogram.



Given the marginal probability density functions P_{x}( i ) = Σ_{j} P( i, j ) and P_{y}( j ) = Σ_{i} P( i, j ),



then the mutual information between X and Y is given by:

I( X, Y ) = Σ_{i,j} P( i, j ) log [ P( i, j ) / ( P_{x}( i ) P_{y}( j ) ) ]

The mutual information measure can be considered very general, since it makes very few assumptions about the relationship between image intensities: it assumes neither linear nor even functional correlation, only statistical dependence.
Pitfall: mutual information treats intensity values in a purely qualitative way, without any notion of proximity in intensity space (nearby intensities convey spatial information).
In a real image, a tissue is never represented by a single intensity value, but rather by a certain interval. Fig. 1 in [14] shows a synthetic situation to which mutual information is not well adapted.
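As an illustrative sketch (not code from [14]), mutual information can be estimated from the normalized 2D histogram exactly as defined above; the bin count and the test images here are arbitrary choices:

```python
import numpy as np

def mutual_information(x, y, bins=32):
    # Joint pdf P(i, j): normalized 2D histogram of paired intensities.
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1)            # marginal P_x(i)
    p_y = p_xy.sum(axis=0)            # marginal P_y(j)
    nz = p_xy > 0                     # sum only over nonzero cells (0 log 0 = 0)
    px_py = np.outer(p_x, p_y)
    return float(np.sum(p_xy[nz] * np.log(p_xy[nz] / px_py[nz])))

rng = np.random.default_rng(0)
a = rng.normal(size=(64, 64))
# An image shares far more information with itself than with unrelated noise.
print(mutual_information(a, a), mutual_information(a, rng.normal(size=(64, 64))))
```

Any form of statistical dependence raises the score, which is why no linear or even functional model needs to be assumed.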



Images as Random Variables 

Statistical concepts have proven to be powerful tools for the design and computation of similarity measures; therefore, in this paper we "artificially" consider images as random variables. This corresponds to interpreting an image histogram as a probability density function (pdf). Moreover, we consider the 2D histogram of an image pair as their joint pdf. This means that when randomly selecting a voxel in image X, the probability of obtaining an intensity i is proportional to the number of voxels N_{i} in X having the intensity i:

P( i ) = N_{i} / N,  with N the total number of voxels        (1)
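A minimal sketch of eq. (1) on a toy 3×3 "image" (the array values are invented for illustration):

```python
import numpy as np

x = np.array([[0, 0, 1],
              [2, 1, 0],
              [0, 2, 2]])            # toy image, intensities in {0, 1, 2}
counts = np.bincount(x.ravel())      # N_i = number of voxels with intensity i
pdf = counts / x.size                # P(i) = N_i / N, here [4/9, 2/9, 3/9]
print(pdf)
```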



In order to define the joint pdf of an image pair, the
authors consider two images (X,Y) and
a spatial transformation T that maps the
grid of Y, Ω_{y}, to the
grid of X, Ω_{x}.
X and Y take
their intensity values from a finite set A
which can be assumed to be the same for the two images. Typically A
= {0 ... 255}. 

X : Ω_{x} → A,        Y : Ω_{y} → A        (2)



By applying transformation T
to image Y,
a new mapping is defined from the transformed positions of Y
to A: 

Y∘T^{-1} : T(Ω_{y}) → A        (3)



The points of T(Ω_{y}) that do not have eight neighbors in Ω_{x} are rejected. T(Ω_{y})* denotes the subset of accepted points, over which X can be interpolated. The image pair is then defined as the following random couple:

( X, Y_{T} ) : T(Ω_{y})* → A × A,   ω ↦ ( X( ω ), Y∘T^{-1}( ω ) )        (4)



Then, the joint pdf of X and Y_{T} is defined in a similar way as it was done for a single image in (1):

P_{T}( i, j ) = Card{ ω ∈ T(Ω_{y})* :  X( ω ) = i,  Y_{T}( ω ) = j } / Card( T(Ω_{y})* )        (5)



The marginal pdf's of the images X
and Y_{T}
are entirely determined by the joint pdf P_{T}(i,j).
However, they are not equal a priori to those obtained by considering single images. Due to interpolation,
they depend on the transformation T: 

P_{T,x}( i ) = Σ_{j} P_{T}( i, j ),        P_{T,y}( j ) = Σ_{i} P_{T}( i, j )        (6)
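A small sketch of eqs. (5)-(6), assuming the overlapping voxel pairs have already been collected into two aligned 1D arrays (the sample values are invented):

```python
import numpy as np

x = np.array([0, 0, 1, 2, 1, 0])     # X intensities at the accepted points
y = np.array([1, 1, 2, 0, 2, 1])     # Y_T intensities at the same points
bins = 3
joint = np.zeros((bins, bins))
for i, j in zip(x, y):               # accumulate the 2D histogram
    joint[i, j] += 1
p = joint / joint.sum()              # joint pdf P_T(i, j), eq. (5)
p_x = p.sum(axis=1)                  # marginal P_{T,x}(i), eq. (6)
p_y = p.sum(axis=0)                  # marginal P_{T,y}(j), eq. (6)
print(p_x, p_y)
```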





Geometry of Random Variables

In the previous section the joint pdf of two images was defined. The task now is to find some kind of dependence between two random variables when their joint pdf is known. For this purpose, the geometry of the L^{2} space provides a simple method for quantifying the functional dependence between two random variables. L^{2} is defined as the space of square-integrable real variables, that is, the variables which verify:

E( X^{2} ) < ∞

where E denotes the expectation operator. L^{2} is a Hilbert space with respect to the dot product ⟨X, Y⟩ = E( X Y ). Thus, the corresponding squared norm is the second-order moment of a variable:

||X||^{2} = ⟨X, X⟩ = E( X^{2} )        (7)



The L^{2} norm is closely related to the classical notions of expectation, variance and standard deviation; therefore, equation (7) can be rewritten as:

||X||^{2} = E( X )^{2} + Var( X )        (8)



Due to its Hilbertian structure, L^{2} has interesting geometric properties. The notion of orthogonality between two variables can be defined as:

X ⊥ Y  ⇔  E( X Y ) = 0        (9)



The meaning of orthogonality in L^{2} relates to the notion of independence, but in a less restrictive way. Two variables X and Y are said to be independent if their joint pdf is equal to the product of their marginal pdf's, that is, P( i, j ) = P_{x}( i ) P_{y}( j ). It can be shown that for two such variables E( X Y ) = E( X ) E( Y ), thus:

⟨ X − E(X), Y − E(Y) ⟩ = E( X Y ) − E( X ) E( Y ) = 0



The converse, however, is false. Orthogonality in L^{2} is a weaker constraint than independence. It may be seen as a notion of
independence on the average. In a general way, the angle between two variables X
and Y
is defined thanks to a basic property of dot products: 

cos( X, Y ) = ⟨ X, Y ⟩ / ( ||X|| ||Y|| )        (10)



Expectation 

L^{2} contains the one-dimensional space Δ of deterministic variables, i.e. variables which are constant on the state space Ω. Given a variable X, its expectation is:

E( X ) = Σ_{ω ∈ Ω} P( ω ) X( ω )        (11)



Therefore, E(X) is nothing but the orthogonal projection of X onto Δ. In the sense of the L^{2} norm, it is the constant variable which best approximates X (classical notion of mean).



Figure 1. Geometric interpretation of expectation. E(X)
is the orthogonal projection of X
onto the constant direction Δ. 
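A quick numeric check (not from [14]) of this projection property: the mean is the constant c minimizing the L^{2} distance E[ (X − c)^{2} ], here found by a brute-force grid search over invented sample values:

```python
import numpy as np

x = np.array([1.0, 2.0, 6.0, 3.0])            # a small sample of X
cs = np.linspace(0.0, 8.0, 801)               # candidate constants c
dists = [np.mean((x - c) ** 2) for c in cs]   # squared L2 distance to each c
best = cs[int(np.argmin(dists))]
print(best, x.mean())                         # both close to 3.0
```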



Correlation Coefficient 

A quick method to approximate the degree of dependence
between two variables is to compute their correlation coefficient. Given two variables X
and Y,
it is defined as: 

ρ( X, Y ) = Cov( X, Y ) / √( Var( X ) Var( Y ) )        (12)



From a geometric point of view, we can write: 

Cov( X, Y ) = ⟨ X − E(X), Y − E(Y) ⟩,        Var( X ) = ||X − E(X)||^{2}        (13)



Using equation (10), the correlation coefficient between X and Y can be interpreted in a geometric way. Let α denote the angle between X − E(X) and Y − E(Y). We have:

ρ( X, Y ) = cos( α )        (14)



Figure 2. Geometric interpretation of the correlation
coefficient. We have ρ(X,Y)
= cos(α). The constant (or deterministic) direction is denoted by Δ. 



We see that the smaller the angle α, the larger ρ(X,Y); it reaches 1 when X − E(X) and Y − E(Y) are collinear. That is to say, the correlation coefficient measures the linear dependence between two variables. Since we want to take into account general relationships between X and Y, possibly nonlinear and noninvertible, this is not a good measure of functional dependence.
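A small illustration of this limitation on synthetic data: Y = X^{2} is fully determined by X, yet on a symmetric range the correlation coefficient vanishes:

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 1001)
y = x ** 2                       # perfect functional dependence, but not linear
rho = np.corrcoef(x, y)[0, 1]
print(rho)                       # essentially 0: rho misses the dependence
```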



Conditional Expectation 

Evaluating the functional dependence between two variables
comes down to an interpolation problem with no constraints. Suppose we want to estimate a variable Y from another variable X.
A natural approach would be: (1) find the function that best fits
Y
among all possible functions of X;
(2) quantify the quality of the estimate with respect to Y.
The notion of conditional expectation provides a straightforward method for such an evaluation, without having to
test every possible function of X.
If X
and Y
are not independent, knowing an event X
= x should provide some new information about Y.
Any event X
= x induces a conditional pdf for Y,
that is: 

P( Y = j | X = x ) = P( x, j ) / P_{x}( x )        (15)



Then, the corresponding
a posteriori expectation of Y
is: 

E( Y | X = x ) = Σ_{j} j P( Y = j | X = x )        (16)



To any possible realization of X there corresponds an a posteriori expectation of Y. Thus, a function of X can be defined, which is the conditional expectation of Y in terms of X:

E( Y | X ) :  x ↦ E( Y | X = x )        (17)



Notice that E( Y | X ) is also a random variable. It is easy to verify that it is an unbiased estimate, i.e.:

E[ E( Y | X ) ] = E( Y )        (18)



The conditional expectation's major interest is that it is the optimal approximator in the sense of the L^{2} norm. In [14], appendix B, A. Roche et al. show that E( Y | X ) is the measurable function of X that has the smallest distance to Y:

∀ Φ,        || Y − E( Y | X ) || ≤ || Y − Φ( X ) ||        (19)
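For a discrete X this is easy to check numerically: the per-class mean of Y is E( Y | X ), and no other function of X gets closer to Y in the L^{2} sense. The data below are synthetic, and Φ(X) = X^{2} is just one competing function:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=1000)                     # X takes values {0, 1, 2}
y = x.astype(float) ** 2 + rng.normal(scale=0.1, size=1000)
cond_mean = np.array([y[x == i].mean() for i in range(3)])
e_y_given_x = cond_mean[x]                            # the random variable E(Y | X)
# Unbiasedness, eq. (18): E[E(Y | X)] = E(Y).
print(np.isclose(e_y_given_x.mean(), y.mean()))
# E(Y | X) is at least as close to Y as the competing function phi(X) = X^2.
print(np.mean((y - e_y_given_x) ** 2) <= np.mean((y - x.astype(float) ** 2) ** 2))
```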





Total Variance Theorem 

A geometric interpretation of the conditional expectation is now presented. For this purpose, consider the subspace L_{x} of all possible functions Φ of X (provided that they remain in L^{2}):

L_{x} = { Φ( X ) :  Φ : A → ℝ,  Φ( X ) ∈ L^{2} }        (20)



Every constant variable is a (constant) function of X,
so that: 

Δ ⊂ L_{x}        (21)



Figure 3. Geometric interpretation of the conditional expectation. It is the orthogonal projection onto L_{x}.



As the conditional expectation E( Y | X ) minimizes the distance between Y and L_{x}, E( Y | X ) is the orthogonal projection of Y onto L_{x}. This is due to the Hilbertian structure of L^{2}. This simple geometrical property allows us to easily compute the distance between Y and L_{x}. Indeed, Y − E( Y | X ) is orthogonal to any vector of L_{x} by definition of the orthogonal projection. It can be noted that:

⟨ Y − E( Y | X ),  E( Y | X ) − E( Y ) ⟩ = 0        (22)



Therefore, the triangle whose vertices are Y, E(Y) and E( Y | X ) is right-angled at E( Y | X ). Applying the Pythagorean theorem, we retrieve a result known as the total variance theorem:

|| Y − E( Y ) ||^{2} = || Y − E( Y | X ) ||^{2} + || E( Y | X ) − E( Y ) ||^{2}        (23)



since E[ E( Y | X ) ] = E( Y ), and since

|| Y − E( Y | X ) ||^{2} = E_{x}[ Var( Y | X = x ) ]        (24)



the equation (23) can be rewritten as: 

Var( Y ) = Var[ E( Y | X ) ] + E_{x}[ Var( Y | X = x ) ]        (25)



where E_{x} is the operator defined by:

E_{x}[ f( x ) ] = Σ_{x} P_{x}( x ) f( x )        (26)



This may be seen as an energy conservation equation. The
variance of Y
is decomposed as a sum of two "energy" terms: 



1. Var[ E( Y | X ) ], the variance of the conditional expectation E( Y | X ). It measures the part of Y which is predicted by X.



2. Conversely, the term E_{x}[ Var( Y | X = x ) ], which is called the conditional variance, represents the squared distance of Y to the space L_{x}. It measures the part of Y which is functionally independent of X.
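The decomposition can be checked numerically on synthetic data (a sketch, not code from [14]); for a discrete X, both terms are computed from per-class statistics:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=2000)                        # discrete X
y = np.sin(x) + rng.normal(scale=0.2, size=2000)         # Y partly explained by X
classes = np.unique(x)
p_x = np.array([(x == i).mean() for i in classes])       # P_x(i)
m_i = np.array([y[x == i].mean() for i in classes])      # E(Y | X = i)
v_i = np.array([y[x == i].var() for i in classes])       # Var(Y | X = i)
explained = np.sum(p_x * (m_i - np.sum(p_x * m_i)) ** 2) # Var[E(Y | X)]
residual = np.sum(p_x * v_i)                             # E_x[Var(Y | X = x)]
print(np.isclose(explained + residual, y.var()))         # the two terms sum to Var(Y)
```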



Design of the Correlation Ratio 

Based on the previous analysis, a measure of functional dependence between X and Y can be designed. Accounting for the interpretation of the total variance theorem in terms of energy, it seems natural to compare the "explained" energy of Y with its total energy. This leads to the definition of the correlation ratio between X and Y.

η( Y | X ) = Var[ E( Y | X ) ] / Var( Y )        (27)



The correlation ratio also has a simple geometric interpretation. Let θ denote the angle between Y − E(Y) and the space L_{x}. By definition, θ is also the angle between Y − E(Y) and E( Y | X ) − E( Y ) (Fig. 3). Then we have:

η( Y | X ) = cos^{2}( θ )        (28)



Unlike the correlation coefficient which measures the linear
dependence between two variables, the correlation ratio measures their functional dependence. It takes on
values between 0 and 1. A value near 1 indicates a high functional dependence, while a value near 0 indicates a
low functional dependence. The two extreme cases are: 

η( Y | X ) = 1  ⇔  Y = Φ( X )  (total functional dependence),        η( Y | X ) = 0  ⇔  E( Y | X ) = E( Y )  (no functional dependence)        (29)



By nature, the correlation ratio is asymmetric, since the two variables do not fundamentally play the same role in the functional relationship. Thus, in general:

η( Y | X ) ≠ η( X | Y )

Properties of the correlation ratio:

0 ≤ η( Y | X ) ≤ 1,        ρ( X, Y )^{2} ≤ η( Y | X )

This last inequality holds with equality if and only if E( Y | X ) is linear with respect to X, i.e. if

E( Y | X ) = a X + b  for some constants a and b


Computation of the Correlation Ratio 

In order to compute η( Y_{T} | X ) for a given transformation T, in practice the following equation is used:

1 − η( Y_{T} | X ) = E_{x}[ Var( Y_{T} | X = x ) ] / Var( Y_{T} )
Thus, having defined the joint pdf of X and Y_{T}, we compute η( Y_{T} | X ) using:

1 − η( Y_{T} | X ) = ( 1 / σ^{2} ) Σ_{i} P_{T,x}( i ) σ_{i}^{2}
with:

m_{i} = ( 1 / P_{T,x}( i ) ) Σ_{j} j P_{T}( i, j ),        σ_{i}^{2} = ( 1 / P_{T,x}( i ) ) Σ_{j} j^{2} P_{T}( i, j ) − m_{i}^{2}

m = Σ_{j} j P_{T,y}( j ),        σ^{2} = Σ_{j} j^{2} P_{T,y}( j ) − m^{2}

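The computation above can be sketched directly from a joint pdf array; the function name and the toy pdfs below are my own, and degenerate cases (Var(Y) = 0) are not handled:

```python
import numpy as np

def correlation_ratio(p):
    """eta(Y | X) from a joint pdf p[i, j] of intensities i (in X) and j (in Y)."""
    j = np.arange(p.shape[1])
    p_x = p.sum(axis=1)                            # P_x(i)
    p_y = p.sum(axis=0)                            # P_y(j)
    m = np.sum(j * p_y)                            # m = E(Y)
    var = np.sum(j ** 2 * p_y) - m ** 2            # sigma^2 = Var(Y)
    nz = p_x > 0                                   # skip empty classes of X
    m_i = (p[nz] @ j) / p_x[nz]                    # m_i = E(Y | X = i)
    s2_i = (p[nz] @ j ** 2) / p_x[nz] - m_i ** 2   # sigma_i^2 = Var(Y | X = i)
    return 1.0 - np.sum(p_x[nz] * s2_i) / var      # 1 - E_x[Var(Y|X=x)] / Var(Y)

# Y a deterministic function of X -> eta = 1; independent uniform pair -> eta = 0.
p = np.zeros((3, 3))
p[0, 2] = p[1, 0] = p[2, 1] = 1 / 3
print(correlation_ratio(p), correlation_ratio(np.full((2, 2), 0.25)))
```

Note that only the conditional moments m_{i} and σ_{i}^{2} are needed, so the measure is computed in one pass over the joint histogram.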