Word list

Finn Årup Nielsen
Neurobiology Research Unit, Rigshospitalet


Word list with short explaination in the areas of neuroinformatics and statistics.

abundance matrix
A data matrix $ {\bf X}(N \times P)$ that contains actual numbers of occurrences or proportions [Kendall, 1971] according to [Mardia et al., 1979, exercise 13.4.5].
activation function
The nonlinear function in the output of the unit in a neural network. Can be a threshold function, a piece-wise linear function or a sigmoidal function, e.g., hyperbolic tangent or logistic sigmoid. If the activation function is on the output of the neural network it can be regarded as a link function.
active learning
1: The same as focusing [MacKay, 1992b]. 2: supervised learning [Haykin, 1994]
adaptive principal components extraction
APEX. Artificial neural network structure with feed-forward and lateral connections to compute principal components [Kung and Diamantaras, 1990].
adjacency matrix
A binary and square matrix $ {\bf A}(N \times N)$ describing the connections in a graph consisting of $ N$ nodes.
Akaike's Information criteria
AIC. Also called ``information criteria A''

AIC$\displaystyle = -2 \log[ \underset{\boldsymbol \theta}{\max} \, p({\bf X}\vert{\boldsymbol \theta})] + 2 P,$ (1)

where $ P$ is the number of parameters of the model.
analysis of covariance
A type of (univariate) analysis of variance where some of the independent variables are supplementary/non-interesting -- usually confounds/nuisance variable -- and used to explain variation in the dependent variable [Pearce, 1982].
anti-Hebbian learning
Modeling by employing a constraint.
asymmetric divergence
Equivalent to the relative entropy.
author cocitation analysis
Analysis of the data formed when a (scientific) paper cites two different authors. The performed with, e.g., cluster analysis [McCain, 1990]. An overview of author cocitation analysis is available in [Lunin and White, 1990].
Modeling with the input the same as the output.
1: The method to find the (first-order) derivative of a multilayer neural network. 2: The method of adjusting (optimizing) the parameters in a multilayer neural network.
Bayes factor
A ratio of evidences

$\displaystyle B = \frac{p({\bf X}\vert{\cal H}_1)}{p({\bf X}\vert{\cal H}_2)},$ (2)

where $ {\cal H}_1$ and $ {\cal H}_2$ are hypotheses, models or hyperparameters. See, e.g., [Kass and Raftery, 1995] and attributed to Turing and Jeffreys.
Bayesian information criterion
(BIC) Also called Schwarz (Bayesian) (information) criterion (SBC)

BIC$\displaystyle = -2 \log[ \underset{\boldsymbol \theta}{\max} \, p({\bf X}\vert{\boldsymbol \theta})] + P \log(N)$ (3)

where the first term includes the maximum log likelihood, $ P$ is the number of parameters and $ N$ is the number of objects. Written in another normalization

BIC$\displaystyle = \log[ \underset{\boldsymbol \theta}{\max} \, p({\bf X}\vert{\boldsymbol \theta})] - \frac{P}{2} \log(N)$ (4)

1: A threshold unit in a neural network. 2: A models inability to model the true system. 3: The difference between the mean estimated model and the true model. 4: The difference between the mean of an estimator and the true value.
bias-variance trade-off
The compromise between the simplicity of the model (which causes bias) and complexity (which causes problems for estimation and results in variance) [Geman et al., 1992].
binary matrix
A matrix with elements as either one or zero. Also called a (0,1)-matrix.
A resampling scheme that samples with replacement in the sample, and it is useful to assess the accurary of an estimate. See, e.g., [Zoubir and Boashash, 1998].
Burt matrix
A product matrix of an indicator matrix [Burt, 1950] [Jackson, 1991, p. 225]: $ {\bf X}^{\sf T}{\bf X}$
canonical correlation analysis
A type of multivariate analysis.
canonical variable analysis
A set of different multivariate analyses.
canonical variate analysis
Canonical correlation analysis for discrimination, i.e., with categorical variables.
central moment
k'th-order sample central moment

$\displaystyle m_k = \frac{1}{n} \sum_n^N (x_n - \bar{x})^k$ (5)

1: In SPM99 a region in a thresholded SPM which voxels are connected. 2: In cluster analysis a set of voxels (or other objects) assigned together and associated with a ``center''.
cluster analysis
An unsupervised method to group data points. A specific method is K-means.
1: The tendency of data points to be unequally distributed in, e.g., space or time. 2: cluster analysis.
coefficient of variance
The standard deviation normalized by the mean

coefficient of variance$\displaystyle = \frac{\sigma_x}{\bar{x}}.$ (6)

Sometimes found with the abbreviations COV or CoV.
1: Any mental process. 2: A mental process that is not sensor-motoric or emotional. 3: The process involved in knowing, or the act of knowing (Encyclopaedia Britannica Online)
complete likelihood
(Complete-data likelihood) The joint propability density of observed and unobserved variables

$\displaystyle p({\bf X}, {\bf Y}\vert{\boldsymbol \theta}),$ (7)

where $ {\bf X}$ is observed and $ {\bf Y}$ is hidden (data/parameters) and $ {\boldsymbol \theta}$ is the parameters.
conditional (differential) entropy

$\displaystyle h({\bf y}\vert{\bf x}) = - \int p({\bf x},{\bf y}) \ln p({\bf y}\vert{\bf x})\, d{\bf x}\, d{\bf y}$ (8)

conditional probability density function

$\displaystyle p({\bf x} \vert {\bf y}) = \frac{p({\bf x},{\bf y})}{p({\bf y})}$ (9)

Here the probability density for $ {\bf x}$ given y.
(usual meaning) A nuisance variable that is correlated with the variable of interest.
conjugate prior
(Also ``natural conjugate prior'') An informative prior ``which have a functional form which integrates naturally with data measurements, making the inferences have an analytically convenient form'' [MacKay, 1995]. Used for regularization.
An estimator is consistent if the variance of the estimate goes to zero as more data (objects) are gathered.
In (general) linear modeling: A vector $ {\bf c}$ for a linear combination of parameters that sum to zero. Vectors associated with estimable linear combinations of the parameters that do not sum to zero are sometimes also called -- against definition -- `contrasts' [Nichols, 2002].
cost function
The function that is optimized. Can be developed via maximum likelihood from a distribution assumption between the target and the model output. Other names are Lyapunov function (dynamical systems), energy function (physics), Hamiltonian (statistical mechanics), objective function (optimization theory), fitness function (evolutionary biology). [Hertz et al., 1991, page 21-22].

$\displaystyle -\int p({\bf x}) \ln q({\bf x}) d{\bf x}$ (10)

Equivalent to the sum of the relative entropy and the entropy (of the distribution the relative entropy is measured with respect to). Can also be regarded as the average negative log-likelihood, e.g., with $ q({\bf x})$ as a modeled density and $ p({\bf x})$ as the true unknown density [Bishop, 1995, pages 58-59].
Empirical method to assess model performance, where the data set is split in two: One set that is used to estimate the parameters of the model and the other set that is used to
data matrix
A matrix $ {\bf X}(N \times P)$ in which $ N$ number of $ P$-dimensional multivariate measurements are represented, see, e.g., [Mardia et al., 1979, sec. 1.3].
In the statistical sense: If two random variables are dependent then one of the random variables convey information about the value of the other. While correlation is only related to the probability density function with the two first moments, dependency is related to all the moments.
dependent variable
The variable to be explained/predicted from the independent variable. The output of the model. Often denoted $ {\bf y}$.
design matrix
A matrix containing the ``independent'' variables in a multivariate regression analysis.
differential entropy
Entropy for continuous distributions

$\displaystyle h({\bf x}) = -\int p({\bf x}) \ln p({\bf x}) d{\bf x}$ (11)

Often just called ``entropy''.
directed divergence
Equivalent to relative entropy according to [Haykin, 1994].
An eigenvector associated with a principal component from a principal component analysis that can be interpreted as an image or volume.
empirical Bayes
Bayesian technique where the prior is specified from the sample information
A measure for the information content or degree of disorder.
elliptic distribution / elliptically contour distribution
A family of distribution with elliptical contours.

$\displaystyle p({\bf x}) = \vert{\boldsymbol \Sigma}\vert^{-1/2} \psi \left[ ({...
...mbol \mu})^{\sf T}{\boldsymbol \Sigma}^{-1} ({\bf x}-{\boldsymbol \mu}) \right]$ (12)

The distribution contains the Gaussian distribution, the multivariate $ t$ distribution, the contaminated normal, multivariate Cauchy and multivariate logistic distribution
The procedure to find the model or the parameters of the model. In the case or parameters: a single value for each parameter in, e.g., maximum likelihood estimation, or a distribution of the parameters in Bayesian technique.
The probability of the data given the model or hyperparameters [MacKay, 1992a]. ``Likelihood for hyperparameters'' or ``likelihood for models'', e.g., $ p({\bf X}\vert{\cal M})$. Found by integrating the parameters of the model

$\displaystyle p({\bf X}\vert{\cal M}) = \int p({\bf X}\vert{\cal M}, {\boldsymbol \theta}) \, p({\boldsymbol \theta}\vert{\cal M}) \, d{\boldsymbol \theta}$ (13)

This is the same as ``integrated likelihood'' or ``marginal likelihood''.
expectation maximization
A special group of optimization methods for models with unobserved data [Dempster et al., 1977].
explorative statistics
Statistics with the aim of generating hypotheses rather than testing hypothesis.
feed-forward neural network
A nonlinear model.
finite impulse response (model)
A linear model with a finite number of input lags and a single output.
fixed effect model
Model where the parameters are not random [Conradsen, 1984, section 5.2.2]. See also random effects model and mixed effects model.
Reduction of the number of voxels analyzed using a F-test.
Frobenius norm
Scalar descriptor of a matrix [Golub and Van Loan, 1996, equation 2.3.1]. The same as the square root of the sum of the singular values [Golub and Van Loan, 1996, equation 2.5.7].

$\displaystyle \Vert {\bf X} \Vert _{F} = \sqrt{\sum_n^N{\sum_p^P {\vert x_{np}\vert^2}}}.$ (14)

full width half maximum
Used in specification of filter width, related to the standard deviation for a Gaussian kernel by

FWHM$\displaystyle = \sqrt{8 \ln 2}\ \sigma \approx 2.35 \sigma.$ (15)

functional integration
The notion that brain region interact or combine in solving a mental task. Brain data in this connection is usually modeled with multivariate statistical methods [Friston, 1997].
functional segregation
The issue in analyses of brain data with univariate statistical methods [Friston, 1997].
functional volumes modeling
Meta-analysis in Talairach space, -- a term coined in [Fox et al., 1997].
Gauss-Newton (method)
Newton-like optimization method that uses the ``inner product'' Hessian.
general linear hypothesis
Hypotheses associated with the multivariate regression model of the form [Mardia et al., 1979, section 6.3]

$\displaystyle {\bf C}_1{\bf B}{\bf M}_1 = {\bf D}.$ (16)

In many case the the hypothesis is of a simple type testing only for difference between the rows

$\displaystyle {\bf C}_1{\bf B} = {\bf0}.$ (17)

In the univariate case it becomes yet simpler

$\displaystyle {\bf c}_1{\bf b} = 0.$ (18)

general linear model
A type of multivariate regression analysis, usually where the independent variables are a design matrix.
generalized additive models
A group of nonlinear models proposed by [Hastie and Tibshirani, 1990]: Each input variable is put through a non-linear function. All transformed input variable is then used in a multivariate linear regression, and usually with a logistic sigmoid function on the output. In the notation of [Bishop, 1995]:

$\displaystyle y = g\left(\sum_i \phi_i\left(x_i\right) + w_0 \right)$ (19)

generalized inverse
An ``inverse'' of a square or rectangular matrix. If $ {\bf A}^-$ denotes the generalized inverse of $ \bf A$ then it will satisfy some of the following Penrose equations (Moore-Penrose conditions) [Golub and Van Loan, 1996, section 5.5.4] and [Ben-Israel and Greville, 1980]:

$\displaystyle \bf A A^-A$ $\displaystyle = \bf A$ $\displaystyle (I)$ (20)
$\displaystyle \bf A^- A A^-$ $\displaystyle = \bf A^-$ $\displaystyle (II)$ (21)
$\displaystyle \bf (A A^-)^*$ $\displaystyle = \bf A A^-$ $\displaystyle (III)$ (22)
$\displaystyle \bf (A^- A)^*$ $\displaystyle = \bf A^- A$ $\displaystyle (IV),$ (23)

where $ ^*$ denotes the conjugate, i.e., transposed matrix if $ \bf A$ is real. A matrix that satisfy all equations is uniquely determined and called the Moore-Penrose inverse and (often) denoted as $ \bf
generalized least squares
1: Regression with correlated (non-white) noise, thus were the noise covariance matrix is not diagonal, see, e.g., [Mardia et al., 1979, section 6.6.2]. 2: The Gauss-Newton optimization method.
generalized linear model
An ``almost'' linear model where the output of the model has a link function for modeling non-Gaussian distributed data.
Good-Turing frequency estimation
A type of regularized frequency estimation [Good, 1953]. Often used in word frequency analysis [Gale and Sampson, 1995].
Gram matrix
A Gram matrix is Hermitian and constructed from a data matrix $ {\bf X}(N \times P)$ with $ N$ $ P$-dimensional vectors $ {\bf x}_n$

$\displaystyle {\bf G} = {\bf XX}^{\sf T},$ (24)

i.e., the elements of the Gram matrix $ {\bf G}(N \times N)$ are the dot products between all possible vectors $ {\bf x}_{n_1} \cdot
{\bf x}_{n_2}$.
gyrification index
An index for the degree of folding of the cortex

GI$\displaystyle = \frac{\text{Total surface length}}{\text{Outer surface contour}}$ (25)

There are no folding for GI$ =1$ and folding for GI$ >1$.
hard assignment
Assignment of a data point to one (and only one) specific component, e.g., cluster in cluster analysis or mixture component in mixture modeling.
Hebbian learning
Estimation in a model where the magnitude of a parameter is determined on how much it is ``used''.
hemodynamic response function
The coupling between neural and vascular activity.
Modeling where the input and output are different, e.g., ordinary regression analysis.
Stationarity under translation.
A parameter in a model that is used in the estimation of the model but has no influence on the response of the estimated model if changed. An example of a hyperparameter is weight decay
A problem is ill-posed if the singular values of the associated matrix gradually decay to zero [Hansen, 1996], cf. rank deficient
Algorithms maximizing the mutual information (between outputs) [Becker and Hinton, 1992].
incidence matrix
A binary matrix with size (nodes $ \times$ links) [Mardia et al., 1979, p. 383]: $ x_{np} \in \{0, 1\}$. Also called indicator matrix.
incomplete likelihood
A marginal density marginalized over hidden data/parameters.

$\displaystyle p({\bf X}\vert{\boldsymbol \theta})$ $\displaystyle = \int p({\bf X}, {\bf Y}\vert{\boldsymbol \theta}) \, d{\bf Y}$ (26)
  $\displaystyle = p({\bf X}, {\bf Y}\vert{\boldsymbol \theta}) / p({\bf Y}\vert{\bf X}, {\bf\theta})$ (27)

indicator matrix
A data matrix with elements one or zero [Jackson, 1991, p. 224]: $ x_{np} \in \{0, 1\}$.
information criterion
Includes Akaike's information criteria (AIC), Bayesian information criterion (BIC), (Spiegelhalter's) Bayesian deviance information criterion (DIC), ...
The external occipital protuberance of the skull (Webster). Used as marker in EEG. Opposite the nasion. See also: nasion, peri-auri
integrated likelihood
The same as the evidence.
International Consortium for Brain Mapping
(Abbreviation: ICBM) Group of research institutes. They have developed a widely used brain template know as the ICBM or MNI template.
In the framework of input-system-output: To find the input from the system and the output.
Stationarity under rotation.
Karhunen-Lóeve transformation
The same as principal component analysis. The word is usually used in communication theory.
kernel density estimation
Also known as Parzen windows and probabilistic neural network.
Cluster analysis algorithm with hard assignment.
k-nearest neighbor
A classification technique
Information that has been placed in a context.
Kullback-Leibler distance
Equivalent to relative entropy.
Normalized fourth order central moment. The univariate Fisher kurtosis $ \gamma_2$ is defined as

$\displaystyle \gamma_2 = \frac{\mu_4}{\mu_2^2} - 3,$ (28)

where $ \mu_4$ and $ \mu_2$ are the fourth and second central moment, respectively.
latent class analysis
Also called latent class decomposition. A decomposition of a contingency table (multivariate categorical) into $ K$ latent classes, see, e.g., [Hofmann, 2000]. The observed variables are often called manifest variables.

$\displaystyle P({\bf d}, {\bf w}) = \sum^K_k P(z_k)\, P({\bf d}\vert z_k)\, P({\bf w}\vert z_k)$ (29)

latent semantic indexing
Also ``latent semantic analysis''. Truncated singular value decomposition applied on a bag-of-words data matrix [Deerwester et al., 1990].
lateral orthogonalization
Update method (``anti-Hebbian learning'') or connections between the units in the same layer in a neural network which impose orthogonality between the different units, i.e., a kind of deflation technique. Used for connectionistic variations of singular value decomposition, principal component analysis and partial least squares among others, see, e.g., [Hertz et al., 1991, pages 209-210] and [Diamantaras and Kung, 1996, section 6.4].
In the framework of input-system-output: To find the system from (a set of) inputs and outputs. In some uses the same as estimation and training.
A cross-validation scheme where each data point in turn is kept in the validation/test set while the rest is used for training the model parameters.
A function where the data is fixed and the parameters are allowed to vary

$\displaystyle p({\bf X}\vert{\boldsymbol \theta}).$ (30)

link function
A (usually monotonic) function on the output of a linear model that is used in generalized linear models to model non-Gaussian distributions.
Number for the readability of a text [Björnsson, 1971].
logistic (sigmoid) (function)
A monotonic function used as a link function to convert a variable in the range $ ]-\infty; \infty[$ to the range $ [0; 1]$ suitable for interpretation as a probability

$\displaystyle y = \frac{1}{1+\exp(-x)}$ (31)

A non-linear subspace in a high dimensional space. A hyperplane is an example on a linear manifold.
Mann-Whitney test
A non-parametric test for a translational difference between to distributions by the application of a rank transformation. Essentially the same as the Wilcoxon rank-sum test, though another statistics is used: The `Mann-Whitney $ U$-statistics'
marginal likelihood
The same as the evidence and the ``integrated likelihood''.
marked point process
A point process where each point has an extra attribute apart from its location, e.g., a value for its magnitude.
Markov chain Monte Carlo
Sampling technique in simulation and Bayesian statistics.
mass-univariate statistics
Univariate statistics when applied to several variables.
matching prior
A prior in Bayesian statistics that is selected so posterior credible sets resemble frequentist probabilities.
maximal eigenvector
The eigenvector associated with the largest eigenvector.
maximum a posteriori
Maximum likelihood estimation ``with prior''.
maximum likelihood
A statistical estimation principal with optimization of the likelihood function
Robust statistics.
Metropolis-Hasting algorithm
Sampling technique for Markov chain Monte Carlo. [Metropolis et al., 1953]
mixed effects model
Type of ANOVA that consists of random and fixed effects [Conradsen, 1984, section 5.3]. See also fixed effects model and random effects model.
Moore-Penrose inverse
A generalized inverse that satisfy all of the ``Penrose equations'' and is uniquely defined [Penrose, 1955].
multiple regression analysis
A type of multivariate regression analysis where there is only one response variable ( $ {\bf Y} = {\bf
multivariate analysis
Statistics with more than one variable is each data set, as opposed to univariate statistics.
multivariate analysis of variance
(MANOVA) Type of analysis of variance with multiple measures for each `object' or experimental design. One-way MANOVA can be formulated as [Mardia et al., 1979, section 12.2]

$\displaystyle {\bf x}_{ij} = {\boldsymbol \mu} + {\boldsymbol \tau}_j + {\boldsymbol \epsilon}_{ij},$ (32)

where $ \mathbf{x}$ is observations (`responses' or `outcomes'), $ {\boldsymbol \mu}$ is the general effect, $ {\boldsymbol \tau}_j$ are the condition effects and $ {\boldsymbol \epsilon}$ is the noise often assumed to be independent Gaussian distributed. The most common test considers the difference in condition effect

$\displaystyle H_o : {\boldsymbol \tau}_1 = ... = {\boldsymbol \tau}_K.$ (33)

multivariate regression analysis
A type of linear multivariate analysis using the following model

$\displaystyle {\bf Y} = {\bf X}{\bf B}+{\bf U}$ (34)

$ {\bf Y}$ is an observed matrix and $ {\bf X}$ is a known matrix. $ {\bf B}$ is the parameters and $ {\bf U}$ is the noise matrix. The model is either called multivariate regression model (if $ {\bf X}$ is observed) or general linear model (if $ {\bf X}$ is ``designed''). [Mardia et al., 1979, chapter 6]
multivariate regression model
A type of multivariate regression analysis where the known matrix ($ {\bf X}$) is observed.
mutual information
Originally called ``information rate'' [Shannon, 1948].
$\displaystyle I({\bf y};{\bf x}) =
\int p({\bf x},{\bf y}) \ln \frac{p({\bf x}\vert{\bf y})}{p({\bf
x})}\, d{\bf x}\, d{\bf y}$     (35)
$\displaystyle = \int p({\bf x},{\bf y}) \ln \frac{p({\bf y}\vert{\bf x})}{p({\bf
y})}\, d{\bf x}\, d{\bf y}$     (36)
$\displaystyle = \int p({\bf x},{\bf y}) \ln
\frac{p({\bf x},{\bf y})}{p({\bf x})p({\bf y})}\, d{\bf x}\, d{\bf y}$     (37)

The bridge of the nose. Opposite the inion. See also: inion, and periauricular points. Used as reference point in EEG.
neural network
A model inspired by biological neural networks.
neurological convention
Used in connection with transversal or axial images of the brain to denote the left side of the image is to the left side of the brain, - as opposed to the ``radiological convention''.
non-informative prior
A prior with little effect on the posterior, e.g., an uniform prior or Jeffreys' prior. Often improper, i.e., not normalizable.
non-parametric (model/modeling)
1: A non-parametric model is a model where the parameters do not have a direct physical meaning. 2: A model with no direct parameters. [Rasmussen and Ghahramani, 2001]:
[...] models which we do not necessarily know the roles played by individual parameters, and inference is not primarily targeted at the parameters themselves, but rather at the predictions made by the models
``Outlierness''. How ``surprising'' an object is.
a variable of no interest in the modeling that makes the estimation of the variable of interest more difficult. See also confound.
A single ``data point'' or ``example''. A single instance of an observation of one or more variables.
Used to find the point estimate of a parameter. ``Optimization'' is usually used when the estimation requires iterative parameter estimation, e.g., in connection with nonlinear models.
The same as multidimensional scaling.
For matrices: Unitary matrix with no correlation among the (either column or row) vectors.
For matrices: Unitary matrix with no correlation among the (either column or row) vectors.
pairwise interaction point process
A spatial point process that is a specialization of the Gibbsian point process [Ripley, 1988, page 50]

$\displaystyle p({\bf x})$ $\displaystyle = \exp[-U({\bf x})]$ (38)
$\displaystyle U({\bf x})$ $\displaystyle = a_0 + \sum_i \psi({\bf x}_i) + \sum_{i<j} \phi({\bf x}_i, {\bf x}_j)$ (39)

partial least squares (regression)
Multivariate analysis technique usually with multiple response variables [Wold, 1975]. Much used in chemometrics.
Parzen window
A type of probability density function model [Parzen, 1962] where a window (a kernel) is placed at every object. The name is used in the pattern recognition literature and more commonly known as kernel density estimation.
penalized discriminant analysis
(Linear) discriminant analysis with regularization.
A (multilayer) feed-forward neural network [Rosenblatt, 1962].
See also: nasion, inion
The power spectrum $ {\bf z}$ of a finite signal $ {\bf x}$ where $ {\bf y}$ is the complex frequency spectrum and $ m= 1,\ldots,M$

$\displaystyle y(m)$ $\displaystyle = \sum_{n=1}^N x(n) \exp \left[ -j \frac{2\pi}{N} (n-1)(m-1) \right]$ (40)
$\displaystyle z(m)$ $\displaystyle = 1/N \vert y(m) \vert^2$ (41)

If only the lower part of the power spectrum is taken

$\displaystyle z(m)$ $\displaystyle = 2/N \vert y(m) \vert^2.$ (42)

The term is also use for the power spectrum of a rectangular-windowed signal.
The notion of a single word having several meanings. The opposite of synonymy.
preliminary principal component analysis
Principal component analysis made prior to a supervised modeling, e.g., an artificial neural network analysis or canonical variate analysis.
In the framework input-system-output: To find the output from the input and system.
principal component analysis
(PCA) An unsupervised multivariate analysis that identifies an orthogonal transformation and remaps objects to a new subspace with the transformation [Pearson, 1901], [Mardia et al., 1979, page 217].

$\displaystyle {\bf Y} = \left( {\bf X} -{\bf 1}\bar{\bf x}^{\sf T} \right) {\bf G}.$ (43)

An individual elemenent in $ y_{nk}$ is called a score. The $ k$'th column in $ {\bf Y}$ is called the $ k$'th principal component.
principal component regression
Regularized regression by principal component analysis [Massy, 1965].
principal coordinate analysis
A method in multidimensional scaling. Similar to principal component analysis on a distance matrix if the distance measure is Euclidean [Mardia et al., 1979, section 14.3].
principal covariate regression
Multivariate analysis technique [de Jong and Kiers, 1992].
principal manifold
Generalization of principal curves, a non-linear version of principal component analysis. [DeMers and Cottrell, 1993]
Distribution of parameters before observations have been made. Types of priors are: uniform, non-informative, Jeffrey's, reference, matching, informative, (natural) conjugate.
probabilistic neural network
A term used to denote kernel density estimation [Specht, 1990].
probabilistic principal component analysis
Principal component analysis with modeling of a isotropic noise [Tipping and Bishop, 1997]. The same as sensible principal component analysis.
profile likelihood
A likelihood where some of the variables -- e.g., nuisance variables -- are maximized [Berger et al., 1999, equation 2]

$\displaystyle \tilde{\cal L}({\boldsymbol \theta}_1) = \underset{ {\boldsymbol\theta}_2 }{\max} \, {\cal L}({\boldsymbol \theta}_1, {\boldsymbol \theta}_2 ).$ (44)

radiological convention
Used in connection with transversal or axial images of the brain to denote the right side of the image is to the left side of the brain, - as opposed to the ``neurological convention''.
random effects model
Type of ANOVA where the parameters (on the ``first'' level'') are regarded as random [Conradsen, 1984, section 5.2.3]. See also fixed effects model and mixed effects model.
The size of a subspace span by the vectors in a matrix
rank deficient
A matrix is said to be rank deficient if there is a well-defined gap between large and small singular values [Hansen, 1996], cf. ill-posed.
The method of stabilizing the model estimation.
relative entropy
A distance measure between two distributions [Kullback and Leibler, 1951].

$\displaystyle K(p\vert\vert q) = \int p({\bf x}) \ln \frac{p({\bf x})}{q({\bf x})}\, d{\bf x}$ (45)

Other names are cross-entropy, Kullback-Leibler distance (or information criterion), asymetric divergence, directed divergence.
In cluster analysis and mixture modeling the weight of assignment to a particular cluster $ k$ for a specific data point $ {\bf x}_n$. With probabilistic modelling this can be interpreted as a posterior probability

$\displaystyle P(k \vert {\bf x}_n).$ (46)

restricted canonical correlation analysis
(RCC) Canonical correlation analysis with restriction on the parameters, e.g., non-negativity [Das and Sen, 1994,Das and Sen, 1996].
robust statistics
Statistics designed to cope with outliers.
A part of an experiment consisting of several measurements, e.g., several scans. The measurements in a run is typically done with a fixed frequency. Multiple runs can be part of a session and a run might consist of one or more trials or events.
saliency map
A map of which inputs are important for predicting the output.
saturation recovery
Type of MR pulse sequence.
self-organizing learning
The same as unsupervised learning
sensible principal component analysis
Principal component analysis with modeling of an isotropic noise [Roweis, 1998]. The same as probabilistic principal component analysis [Tipping and Bishop, 1997]
A part of the experiment: An experiment might consists of multiple sessions with multiple subjects and every session contain one or more runs.
S-shaped. Often used in connection about the non-linear function in an artificial neural network.
The left/right asymmetry of a distribution. The usual definition is

$\displaystyle \gamma_1 = \sqrt{b_1} = \frac{\mu_3}{\mu_2^{3/2}}.$ (47)

where $ \mu_3$ and $ \mu_2$ are the third and second central moment. Multivariate skewness can be defined as [Mardia et al., 1979, p. 21, 148+]

$\displaystyle b_{1,P}$ $\displaystyle = \frac{1}{N^2} \sum_{n,m}^N g^3_{n,m},$ (48)
$\displaystyle g_{n,m}$ $\displaystyle = ({\bf x}_n - \mathbf{\bar{x}})^{\sf T} {\bf S}^{-1} ({\bf x}_m - \mathbf{\bar{x}}).$ (49)

slice timing correction
In fMRI: Correction for the difference in sampling between slices in a scan. Slices can acquired interleaved/non-interleaved and ascending/descending.
A vector function that is a generalization of the logistic sigmoid activation function suitable to transform the variables in a vector from the interval $ ]-\infty; \infty[$ to $ [0; 1]$ so they can be used as probabilities [Bridle, 1990]

$\displaystyle y_p = \frac{\exp(x_p)}{\sum_{p'} \exp(x_{p'})}.$ (50)

The sparseness of the matrix is the number of non-zero elements to the total number of elements.

Another definition is a function of the $ L_1$ and $ L_2$ norm [Hoyer, 2004, p. 1460]

$\displaystyle s({\bf x}) = \frac{ \sqrt{n} - \left(\sum \vert x_i \vert \right) \left/ \sqrt{\sum x_i^2 } \right. }{\sqrt{n} - 1}.$ (51)

spatial independent component analysis
Type of independent component analysis in functional neuroimaging [Petersson et al., 1999]. See also temporal component analysis.
A value extracted from a data set, such as the empirical mean or the empirical variance.
statistical parametric images
A term used by Peter T. Fox et al. to denote the images that are formed by statistical analysis of functional neuroimages.
statistical parametric mapping
The process of getting statistical parametric maps: Sometimes just denoting voxel-wise t-tests, other times ANCOVA GLM modeling with random fields modeling, and sometimes also including the preprocessing: realignment, spatial normalization, filtering, ...
statistical parametric maps
A term used by Karl J. Friston and others to denote the images that are formed by statistical analysis of functional neuroimages, especially those formed from the program SPM.
A distribution with negative kurtosis, e.g., the $ [0; 1]$-uniform distribution. Also called `platykurtic'.
A statistics (function) $ {\bf t}({\bf x})$ is sufficient if it is ``enough'' to estimate the parameters of the model ``well'', -- or more specifically: Enough to described the likelihood function within a scaling factor. The likelihood can then be factorized:

$\displaystyle p({\bf X}\vert{\boldsymbol \theta}) = g({\bf t}({\bf X}), {\boldsymbol \theta}) h({\bf X}).$ (52)

A distribution with positive kurtosis, i.e., a heavy-tailed distribution such as the Laplace distribution. Also called `leptokurtic'.
supervised (learning/pattern recognition)
Estimation of a model to estimate a ``target'', e.g., classification (the target is the class label) or regression (the target is the dependent variable) estimation with labeled data.
The notion that several words have the same meaning. The opposite of polysemy.
The part of the physical world under investigation. Interacts with the environment through input and output.
system identification
In the framework of input-system-output: To find the system from (a set of) inputs and outputs. The same as learning, although system identification usually refers to parametric learning.
temporal independent component analysis
Type of independent component in functional neuroimaging [Petersson et al., 1999]. See also spatial independent component analysis
test set
Part of a data set used to test the performance (fit) of a model. If the estimate should be unbiased the test set should be independent of the training and validation set.
time-activity curve
The curve generated in connection with dynamic positron emission tomography (PET) images
total least square
Multivariate analysis estimation technique
Term used in connection with neural networks to denote parameter estimation (parameter optimization). Sometimes called learning.
training set
Part of a data set used to fit the parameters of the model (not the hyperparameters). See also test and validation set.
An element of a psychological experiment usually consisting of a stimulus and a response.
truncated singular valued decomposition
Singular value decomposition of a matrix $ {\bf X}(N \times P)$ where only a number of components, say $ K <$   rank$ ({\bf X})$, are maintained

$\displaystyle \mathbf{\hat{X}}_K = {\bf U}{\bf L}_K{\bf V}^{\sf T},$ (53)

where the diagonal $ {\bf L}_K$ is $ [ l_1, l_2, l_3, 0, \ldots, 0 ]$. The truncated SVD matrix is the $ K$-ranked matrix with minimum 2-norm and Frobenius norm of the difference between all $ K$-ranked matrices and $ {\bf X}$.
univariate statistics
Statistics with only one response variable, as opposed to multivariate analysis.
unsupervised learning
Learning with only one set of data, -- there is no target involved. Cluster analysis and principal component analysis is usually regarded as unsupervised.
variational Bayes
An extension of the EM algorithm were both the hidden variables and the parameters are associated with probability density functions (likelihood and posteriors).

With observed $ {\bf y}$, hidden $ {\bf x}$ and parameters $ {\boldsymbol \theta}$

$\displaystyle \ln p({\bf y}\vert m)$ $\displaystyle \geq \int q({\bf x}) \, q({\boldsymbol \theta}) \ln \frac{p({\bf ...
...t m)}{q({\bf x})\, q({\boldsymbol \theta})}\, d{\bf x} \, d{\boldsymbol \theta}$ (54)

A 3-dimensional pixel. The smallest picture element in a volumetric image.
validation set
Part of a data set used to tune hyperparameters.
vector space model
In information retrieval: Representation of a document as a vector where each element is (usually) associated with a word [Salton et al., 1975]. The same as the bag-of-words representation.
E.g., the model parameters of a neural network.
white noise
Noise that is independent (in the time dimension).
Also called ``standard score'' and denotes a random variable transformed so the mean is zero and the standard deviation is one. For a normal distributed random variable the transformation is:

$\displaystyle z_x = \frac{x - \mu_x}{\sigma_x}$ (55)


Becker and Hinton, 1992
Becker, S. and Hinton, G. E. (1992).
A self-organizing neural network that discovers surfaces in random-dot stereograms.
Nature, 355(6356):161-163.

Ben-Israel and Greville, 1980
Ben-Israel, A. and Greville, T. N. E. (1980).
Generalized Inverses: Theory and Applications.
Robert E. Krieger Publishing Company, Huntington, New York, reprint edition edition.

Berger et al., 1999
Berger, J. O., Liseo, B., and Wolpert, R. L. (1999).
Integrated likelihood methods for eliminating nuisance parameters (with discussion).
Statistical Science, 14:1-28. http://ftp.isds.duke.edu/WorkingPapers/97-01.ps. CiteSeer: http://citeseer.ist.psu.edu/berger99integrated.html.

Bishop, 1995
Bishop, C. M. (1995).
Neural Networks for Pattern Recognition.
Oxford University Press, Oxford, UK. ISBN 0198538642 [ bibliotek.dk | isbn.nu ] .

Björnsson, 1971
Björnsson, C. H. (1971).
GEC, Copenhagen.

Bridle, 1990
Bridle, J. S. (1990).
Probalistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition.
In Fogelman Soulié, F. and Hérault, J., editors, Neurocomputing: Algorithms, Architectures and Application, pages 227-236. Springer-Verlag, New York.

Burt, 1950
Burt, C. (1950).
The factorial analysis of qualitative data.
British Journal of Statistical Psychology (Stat. Sec.), 3:166-185.

Conradsen, 1984
Conradsen, K. (1984).
En Introduktion til Statistik.
IMSOR, DTH, Lyngby, Denmark, 4. edition.
In Danish.

Das and Sen, 1994
Das, S. and Sen, P. K. (1994).
Restricted canonical correlations.
Linear Algebra and its Applications, 210:29-47. http://www.sciencedirect.com/science/article/B6V0R-463GRJ4-3/2/2570d4ea8024361049d6def4f4bb9716.

Das and Sen, 1996
Das, S. and Sen, P. K. (1996).
Asymptotic distribution of restricted canonical correlations and relevant resampling methods.
Journal of Multivariate Analysis, 56(1):1-19. DOI: 10.1006/jmva.1996.0001. http://www.sciencedirect.com/science/article/B6WK9-45NJFKB-14/2/44c1e203190336d7ebbb18bc09cda8b7.

de Jong and Kiers, 1992
de Jong, S. and Kiers, H. A. L. (1992).
Principal covariates regression.
Chemometrics and Intelligent Laboratory Systems, 14:155-164.

Deerwester et al., 1990
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
Indexing by latent semantic analysis.
Journal of the American Society for Information Science, 41(6):391-407. http://www.si.umich.edu/~furnas/POSTSCRIPTS/LSI.JASIS.paper.ps. CiteSeer: http://citeseer.ist.psu.edu/deerwester90indexing.html.

DeMers and Cottrell, 1993
DeMers, D. and Cottrell, G. W. (1993).
Nonlinear dimensionality reduction.
In Hanson, S. J., Cowan, J. D., and Lee Giles, C., editors, Advances in Neural Information Processing Systems: Proceedings of the 1992 Conference, pages 580-587, San Mateo, CA. Morgan Kaufmann Publishers.

Dempster et al., 1977
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the EM algorithm.
Journal of the Royal Statistical Society, Series B, 39:1-38.

Diamantaras and Kung, 1996
Diamantaras, K. I. and Kung, S.-Y. (1996).
Principal Component Neural Networks: Theory and Applications.
Wiley Series on Adaptive and Learning Systems for Signal Processing, Communications, and Control. Wiley, New York. ISBN 0471054364 [ bibliotek.dk | isbn.nu ] .

Fox et al., 1997
Fox, P. T., Lancaster, J. L., Parsons, L. M., Xiong, J.-H., and Zamarripa, F. (1997).
Functional volumes modeling: Theory and preliminary assessment.
Human Brain Mapping, 5(4):306-311. http://www3.interscience.wiley.com/cgi-bin/abstract/56435/START.

Friston, 1997
Friston, K. J. (1997).
Basic concepts and overview.
In SPMcourse, Short course notes, chapter 1. Institute of Neurology, Wellcome Department of Cognitive Neurology. http://www.fil.ion.ucl.ac.uk/spm/course/notes.html.

Gale and Sampson, 1995
Gale, W. A. and Sampson, G. (1995).
Good-Turing frequency estimation without tears.
Journal of Quantitative Linguistics, 2:217-237.

Geman et al., 1992
Geman, S., Bienenstock, E., and Doursat, R. (1992).
Neural networks and the bias/variance dilemma.
Neural Computation, 4(1):1-58.

Golub and Van Loan, 1996
Golub, G. H. and Van Loan, C. F. (1996).
Matrix Computation.
John Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, Maryland, third edition. ISBN 0801854148 [ bibliotek.dk | isbn.nu ] .

Good, 1953
Good, I. J. (1953).
The population frequencies of species and the estimation of population parameters.
Biometrika, 40(3 and 4):237-264.

Hansen, 1996
Hansen, P. C. (1996).
Rank-Deficient and Discrete Ill-Posed Problems.
Polyteknisk Forlag, Lyngby, Denmark. ISBN 8750207849 [ bibliotek.dk | isbn.nu ] .
Doctoral Dissertation.

Hastie and Tibshirani, 1990
Hastie, T. J. and Tibshirani, R. J. (1990).
Generalized Additive Models.
Chapman & Hall, London.

Haykin, 1994
Haykin, S. (1994).
Neural Networks.
Macmillan College Publishing Company, New York. ISBN 0023527617 [ bibliotek.dk | isbn.nu ] .

Hertz et al., 1991
Hertz, J., Krogh, A., and Palmer, R. G. (1991).
Introduction to the Theory of Neural Computation.
Addison-Wesley, Redwood City, Califonia, 1st edition.
Santa Fe Institute.

Hofmann, 2000
Hofmann, T. (2000).
Learning the similarity of documents: An information geometric approach to document retrieval and categorization.
In Solla, S. A., Leen, T. K., and Müller, K.-R., editors, Advances in Neural Information Processing Systems 12, pages 914-920, Cambridge, Massachusetts. MIT Press. ISSN 1049-5258 [ bibliotek.dk ] . ISBN 0262194503 [ bibliotek.dk | isbn.nu ] .

Hoyer, 2004
Hoyer, P. O. (2004).
Non-negative matrix factorization with sparseness constraints.
Journal of Machine Learning Research, 5:1457-1469. http://www.jmlr.org/papers/volume5/hoyer04a/hoyer04a.pdf.

Jackson, 1991
Jackson, J. E. (1991).
A user's guide to principal components.
Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. John Wiley & Sons, New York. ISBN 0471622672 [ bibliotek.dk | isbn.nu ] .

Kass and Raftery, 1995
Kass, R. E. and Raftery, A. E. (1995).
Bayes factors.
Journal of the American Statistical Association, 90(430):773-795. A review about Bayes factors.

Kendall, 1971
Kendall, D. G. (1971).
Seriation from abundance matrices.
In Hodson, F. R., Kendall, D. G., and Tantu, P., editors, Mathematics in the Archeological and Historical Sciences, pages 215-251. Edinburgh University Press.

Kullback and Leibler, 1951
Kullback, S. and Leibler, R. A. (1951).
On information and sufficiency.
Annals of Mathematical Statistics, 22:79-86.

Kung and Diamantaras, 1990
Kung, S. Y. and Diamantaras, K. I. (1990).
A neural network learning algorithm for adaptive principal component extraction (APEX).
In International Conference on Acoustics, Speech, and Signal Processing, pages 861-864. Albuquerque, NM.

Lunin and White, 1990
Lunin, L. F. and White, H. D. (1990).
Author cocitation analysis, introduction.
Journal of the American Society for Information Science, 41(6):429-432.

MacKay, 1992a
MacKay, D. J. C. (1992a).
Bayesian interpolation.
Neural Computation, 4(3):415-447.

MacKay, 1992b
MacKay, D. J. C. (1992b).
Information-based objective functions for active data selection.
Neural Computation, 4(4):590-604. ftp://wol.ra.phy.cam.ac.uk/pub/www/mackay/selection.nc.ps.gz. CiteSeer: http://citeseer.ist.psu.edu/47461.html.

MacKay, 1995
MacKay, D. J. C. (1995).
Developments in probabilistic modelling with neural networks--ensemble learning.
In Kappen, B. and Gielen, S., editors, Neural Networks: Artificial Intelligence and Industrial Applications. Proceedings of the 3rd Annual Symposium on Neural Networks, Nijmegen, Netherlands, 14-15 September 1995, pages 191-198, Berlin. Springer.

Mardia et al., 1979
Mardia, K. V., Kent, J. T., and Bibby, J. M. (1979).
Multivariate Analysis.
Probability and Mathematical Statistics. Academic Press, London. ISBN 0124712525 [ bibliotek.dk | isbn.nu ] .

Massy, 1965
Massy, W. F. (1965).
Principal component analysis in exploratory data research.
Journal of the American Statistical Association, 60:234-256.

McCain, 1990
McCain, K. W. (1990).
Mapping authors in intellectual space: A technical overview.
Journal of the American Society for Information Science, 41(6):433-443.

Metropolis et al., 1953
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953).
Equations of state calculations by fast computing machine.
Journal of Chemical Physics, 21:1087-1091(1092?).

Nichols, 2002
Nichols, T. E. (2002).
Visualizing variance with percent change threshold.
Technical note, Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA. http://www.sph.umich.edu/fni-stat/PCT/PCT.pdf.

Parzen, 1962
Parzen, E. (1962).
On the estimation of a probability density function and mode.
Annals of Mathematical Statistics, 33:1065-1076.

Pearce, 1982
Pearce, S. C. (1982).
Analysis of covariance.
In Kotz, S., Johnson, N. L., and Read, C. B., editors, Encyclopedia of Statistical Science, volume 1, pages 61-69. John Wiley & Sons. ISBN 0471055468 [ bibliotek.dk | isbn.nu ] .

Pearson, 1901
Pearson, K. (1901).
On lines and planes of closest fit to systems of points in space.
The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 2:559-572.

Penrose, 1955
Penrose, R. (1955).
A generalized inverse for matrices.
Proc. Cambridge Philos. Soc., 51:406-413.

Petersson et al., 1999
Petersson, K. M., Nichols, T. E., Poline, J.-B., and Holmes, A. P. (1999).
Statistical limitations in functional neuroimaging. i. non-inferential methods and statistical models.
Philosophical Transactions of the Royal Society - Series B - Biological Sciences, 354(1387):1239-1260.

Rasmussen and Ghahramani, 2001
Rasmussen, C. E. and Ghahramani, Z. (2001).
Occam's razor.
In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems, Boston, MA. MIT Press. http://nips.djvuzone.org/djvu/nips13/RasmussenGhahramani.djvu.

Ripley, 1988
Ripley, B. D. (1988).
Statistical inference for spatial process.
Cambridge University Press, Cambridge, UK. ISBN 0521352347 [ bibliotek.dk | isbn.nu ] .

Rosenblatt, 1962
Rosenblatt, F. (1962).
Principals of Neurodynamics.
Spartan, New York.

Roweis, 1998
Roweis, S. (1998).
EM algorithms for PCA and SPCA.
In Jordan, M. I., Kearns, M. J., and Solla, S. A., editors, Advances in Neural Information Processing Systems 10: Proceedings of the 1997 Conference. MIT Press. http://www.gatsby.ucl.ac.uk/~roweis/papers/empca.ps.gz. ISBN 0262100762 [ bibliotek.dk | isbn.nu ] .

Salton et al., 1975
Salton, G., Wong, A., and Yang, C. S. (1975).
A vector space model for automatic indexing.
Communication of the ACM, 18:613-620.

Shannon, 1948
Shannon, C. E. (1948).
A mathematical theory of communication.
Bell System Technical Journal, 27:379-423, 623-656.

Specht, 1990
Specht, D. F. (1990).
Probabilistic neural networks.
Neural Networks, 3(1):109-118.

Tipping and Bishop, 1997
Tipping, M. E. and Bishop, C. M. (1997).
Probabilistic principal component analysis.
Technical Report NCRG/97/010, Neural Computing Research Group, Aston University, Aston St, Birmingham, B4 7ET, UK.

Wold, 1975
Wold, H. (1975).
Soft modeling by latent variables, the nonlinear iterative partial least squares approach.
In Cani, J., editor, Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett. Academic Press.

Zoubir and Boashash, 1998
Zoubir, A. M. and Boashash, B. (1998).
The bootstrap and its application in signal processing.
IEEE Signal Processing Magazine, pages 56-76. An article with an introduction to bootstrap that rather closely follows the Efron and Tibshirani Bootstrap book.

Finn Årup Nielsen 2010-04-23