Skip to main content
Internet Archive's 25th Anniversary Logo

Full text of "Principal support vector machines for linear and nonlinear sufficient dimension reduction"

See other formats




The Anjials of Statistics 

2011, Vol. 39, No. 6, 3182-3210 

DOI: 10.1214/11-AOS932 

@ Institute of Mathematical Statistics, 2011 


"^ . By Bing Li^, Andreas Artemiou and Lexin Li^ 

^ ' Pennsylvania State University, Michigan Technological University 

and North Carolina State University 

We introduce a principal support vector machine (PSVM) ap- 
CO , proach that can be used for both hnear and nonlinear sufficient di- 

mension reduction. The basic idea is to divide the response variables 
into slices and use a modified form of support vector machine to find 
L^ , the optimal hyperplanes that separate them. These optimal hyper- 

r^ • planes are then aligned by the principal components of their normal 

• ' vectors. It is proved that the aligned normal vectors provide an un- 

1 -^ . biased, y^n-consistent, and asymptotically normal estimator of the 

j^ ' sufficient dimension reduction space. The method is then generalized 

to nonlinear sufficient dimension reduction using the reproducing ker- 
nel Hilbert space. In that context, the aligned normal vectors become 
functions and it is proved that they are unbiased in the sense that 
they are functions of the true nonlinear sufficient predictors. We com- 
^ ■ pare PSVM with other sufficient dimension reduction methods by 

^^ ' simulation and in real data analysis, and through both comparisons 

O^ , firmly establish its practical advantages. 


1. Introduction. With the increase of computer power in storing and 
(3 ' processing data, high dimensional data have become increasingly prevalent 

CN I across many disciplines. The demand for effective methods to extract use- 

ful information from such data has led inevitably to dimension reduction, 
an area that has undergone tremendous development during the past two 
^ . Let X be a p-dimensional predictor and y be a response variable. In 

its classical form, sufficient dimension reduction (SDR) [Li (1991, 1992), 
Cook and Weisberg (1991), Cook (1998)] identifies a small number of linear 

Received October 2010; revised September 2011. 

^Supported in part by NSF Grants DMS-07-04621 and DMS-08-06058. 
^Supported in part by NSF Grant DMS- 11-06668. 
AMS 2000 subject classifications. 62-09, 62G08, 62H12. 

Key words and phrases. Contour regression, invariant kernel, inverse regression, prin- 
cipal components, reproducing kernel Hilbert space, support vector machine. 

This is an electronic reprint of the original article published by the 
Institute of Mathematical Statistics in The Annals of Statistics^ 
2011, Vol. 39, No. 6, 3182-3210. This reprint differs from the original in 
pagination and typographic detail. 



combinations of predictors that can replace the original predictor vector X 
without loss of information on the conditional distribution of Y given X. In 
other words, the objective is to find a, p x d {d <p) matrix r] such that the 
following conditional independence holds: 

(1) y_lLX|T7'^X. 

In this relation, the identifiable parameter is the subspace spanned by the 
columns of 77, rather than t] itself. The intersection of all subspaces satis- 
fying (1), provided itself satisfies (1), is called the central subspace, and is 
denoted by 5y|x [Cook (1994)]. Cook (1996) and Yin, Li and Cook (2008) 
showed that Syvx. uniquely exists under very mild conditions. Thus, we 
assume its existence throughout this article. Many methods have been pro- 
posed for this problem since the publication of the original works. See, for 
example. Cook and Li (2002), Xia et al. (2002), Yin and Cook (2002), Fung 
et al. (2002), Li, Zha and Chiaromonte (2005), Cook and Ni (2005), Li and 
Wang (2007), Li and Dong (2009). 

A more general sufficient dimension reduction problem, as formulated in 
Cook (2007), is to seek an arbitrary function cji:W ^Mr such that 

(2) yxx|0(x). 

We refer to this problem as nonlinear sufficient dimension reduction, and 
any one-to-one function of 4>i^) as the nonlinear sufficient predictor. Several 
recent works pioneered estimation procedures for nonlinear dimension reduc- 
tion of this type, including Wu (2008), Wu, Liang and Mukherjee (2008), 
Wang (2008) and Yeh, Huang and Lee (2009), by extending sliced inverse 
regression [SIR; Li (1991)] from different angles. 

In this paper, we propose a sufficient dimension reduction method, to be 
called the principal support vector machine (PSVM), that can extract the 
sufficient predictors in both problems (1) and (2). Let (Xi, Yi), . . . , (X„, y„) 
be a sample of (X, Y). The basic idea of PSVM is to divide Xi, . . . , X„ into 
several slices according to the values of the responses, and then use sup- 
port vector machine [SVM; Vapnik (1998)] to find the optimal hyperplanes 
that separate these slices. The optimal hyperplanes are then aligned by ap- 
plying principal component analysis to their normal vectors. We show that 
the principal components are, in fact, an unbiased estimator of the central 
subspace 5y|x- This idea is then extended to the nonlinear dimension reduc- 
tion problem (2) via the reproducing kernel Hilbert space [RKHS; Aronszajn 
(1950), Hsing and Ren (2009)]. In this context, the normal vectors in the 
linear case become functions in the RKHS. It is shown that the normal func- 
tions thus derived are functions of (f) in the general problem (2). This is, to 
our knowledge, the first result of this type. 

Our proposal is noticeably different from the existing SDR methods in 
the following respects. First, PSVM is developed under, and for, a uni- 


fied framework of linear and nonlinear sufficient dimension reduction. Such 
a standpoint allows us to formulate some theoretical properties, such as 
unbiasedness, more rigorously and generally than previous works. Second, 
PSVM improves the accuracy for sufficient dimension reduction, for the fol- 
lowing reason. It is well known that a regression surface is more accurately 
estimated at the center of the data cloud than at the outskirt. However, 
an inverse regression based method, such as SIR, tends to downweight the 
slice means near the center due to their shorter lengths. Since PSVM relies 
on separating hyperplanes rather than slice means, it makes better use of 
the central portion of the data than inverse regression. This improvement 
is clearly demonstrated in our numerical studies. Finally, PSVM establishes 
a firm connection between sufficient dimension reduction and the acclaimed 
machine learning technique, support vector machine, both of which have 
been extensively used in high dimensional data analysis. This combination 
brings fresh insights and further advances to both subjects. Along with the 
theoretical development of PSVM, we develop a more complete asymptotic 
theory for SVM than previously given, and introduce the notion of invariant 
kernel for SVM. Meanwhile, we expect some inherent advantages of SVM to 
benefit sufficient dimension reduction estimation. For instance, SVM tends 
to be more robust against outliers than a typical moment method. This is 
because the separating hyperplanes are largely determined by the support 
vectors lying in the interior of the data cloud, as a result an observation far 
away from the data cloud has less influence than a typical moment-based 
estimator. In this sense, SVM behaves more like a median than a mean. It 
is also expected to help address several challenging issues facing the existing 
SDR methods, such as small-n-large-p and presence of categorical predic- 
tors. However, due to limited space these potential advantages cannot be 
fully discussed within this paper. Some of them, such as robustness and 
categorical predictors, are further explored in Artemiou (2010). 

The rest of the paper is organized as follows. In Section 2, we illustrate the 
basic idea of PSVM by examples and figures, and give intuitions about why 
it works. In Section 3, we formally introduce the linear PSVM and study its 
population-level properties in terms of its unbiasedness as an estimator of 
the central subspace. In Section 4, we develop the estimation procedures 
for the linear PSVM, and describe how to implement it using standard 
SVM packages. In Section 5, we generalize the linear PSVM to the kernel 
PSVM to solve the nonlinear sufficient dimension reduction problem, and 
establish its unbiasedness in this general setting. In Section 6, we develop 
an algorithm to implement the kernel PSVM, and introduce the notion of 
invariant kernel. In Section 7, we study the asymptotic properties for the 
linear PSVM estimator. Though the identified subspaces are asymptotically 
consistent, they are almost surely incorrect for finite sample sizes. Thus, in 
Section 8, we compare the linear and kernel PSVM with other dimension 


reduction methods in finite sample by simulation. In Section 9, we apply 
it to analyze a data set concerning the recognition of vowels, and make 
further comparisons in the practical setting. All the proofs are given in 
a complementary document published online by The Annals of Statistics. 

2. Principal support vector machine: The basic idea. The idea of the 
principal support vector machine arises from an interplay of several ideas: 
sliced inverse regression, support vector machine, and contour regression 
[Li, Zha and Chiaromonte (2005)]. In this section, we illustrate this idea by 
two simple examples that cover both linear and nonlinear dimension reduc- 
tion. Throughout this paper, X represents a random vector; X^ represents 
the rth component of X; Xj represents the ith random vector from a sam- 
ple Xi, . . . ,X.„, and Xir represents the rth component of Xj. 

First, consider the regression model 

(3) Y = f{Xi + 2X2)+e, 

where e _IL (Xi,X2). This is a linear sufficient dimension reduction problem, 
in which the central subspace is spanned by (1,2) G R^. Note that the 
contours for the regression function is the set {(xi, X2) : xi + 2x2 = c}, which 
is uniquely associated with the vector (1,2)^. Based on this intuition, Li, Zha 
and Chiaromonte (2005) introduced the contour regression, which estimates 
the contour directions by the directions in X that are aligned with the 
smallest increments in Y. 

Here, we propose to identify the contours by the separating hyperplanes 
derived from the support vector machine as applied to different slices of X, 
formed according to the values of Y. Let 5i = {Xj :Yi < c} and 52 = {Xj : 
Yi > c} for some constant c. We use SVM to obtain the optimal separating 
hyperplane of Si and S2, and repeat the process to obtain several hyper- 
planes. Intuitively, the normals of these hyperplanes are roughly aligned with 
the directions in which the regression surface varies — directions that form 
the central subspace. We use the principal components of these normals to 
estimate the central subspace. A related idea is Loh (2002), who proposed 
to divide each individual predictor according to the mean of Y and assess 
the importance of that predictor by its degree of separation. 

As an illustration, we generate 100 replications from model (3) where / 
is taken to be the identity mapping. We divide Xi , . . . , Xioo into 4 slices 
according to the 25th, 50th, 75th sample quantiles oiYi, . . . ,Yn, as indicated 
in Figure 1 by differently colored dots. Application of SVM between these 
slices yields three hyperplanes, represented by the solid lines on the right 
panel, which closely resemble the contours derived from the true model, as 
shown on the left panel. Clearly, the normals of the three hyperplanes give 
close estimate of the central subspace. 



Fig. 1. Linear contours for model Y — 2Xi + X2 + e. Left panel: true contours; right 
panel: contours based on linear SVM. Contour levels are three evenly spaced sample quan- 
tiles ofYi,...,Y„. The sample size is n— 100 . 

We can apply the same idea to sufficient nonlinear dimension reduction. 



y = /(Xi+X|) + e, 

where / is an unknown function. The contours of this function are of the 
form {(xi, 3:2) : xi +x| = c}, which are no longer hyperplanes in M?. However, 
if we map x to a higher dimensional space of functions of x that is rich 
enough to contain xi + x^, then the contours become hyperplanes again. 
We can apply SVM at that level to find the optimal hyperplanes, and then 
map them back to the x-space to extract the nonlinear predictor. Usually, in 
conjunction with mapping a low-dimensional regressor to a high-dimensional 
regressor, a Tikhonov-type regularization is applied, so that the overfitting 
tendency of increased dimension is counteracted by the regularization. 

As in the linear case we generate 100 replications from model (4) and use 
the same set of quantiles to slice the response. The curves in the left panel 
in Figure 2 are the true contours computed from the function y = xi + x"^. 
Those in the right panel are obtained by first applying kernel SVM (with 
Gaussian radial basis) to find hyperplanes in R^^'^ and then mapping them 
back to R^. Clearly, any function of (xi,X2) that generates the contours in 
the right panel would closely resemble the true predictor xi + x"^, modulo 
a monotone transformation. 

3. PSVM for linear sufHcient dimension reduction. We first develop 
PSVM for linear sufficient dimension reduction. We begin with a population- 
level formulation of SVM, since it is usually described at the sample level. 


Fig. 2. Nonlinear contours for model Y = Xi + X2 + e. Left panel: true contours; right 
panel: contours based on kernel SVM with gauss radial kernel. Contour levels are three 
evenly spaced sample quantiles ofYi,...,Y„. The sample size is n= 100 . 

which is not the best way to set up our problem. For now, assume Y to be 
a binary random variable taking values —1 and 1. The soft-margin SVM is 
defined through the following optimization: 

minimize ip tp -\ — > ^j 



among {tp,t,^) G 


subject to Ci > 0, Yi [-0 ' (Xi - X) - t] > 1 - ^i, i = 1, . . . , n, 

where A is a positive constant often referred to as the "cost." See Vapnik 
[(1998), page 411] for the intuitions behind this construction. If {ip*,t*,^*) 
is the solution to (5), then the set {x : -0* x = t*} is the optimal hyperplane 
that separates {Xj ■.Yi = —1} and {Xj : 1^ = 1}. 

Although this representation defines the algorithm, it does not tell us what 
objective function is minimized at the population level. To see things more 
clearly, we first carry out the optimization for a fixed {ip,t). This amounts 
to minimizing Yl^=i C« subject to ^j > max{0, 1 — Yiltp (Xj — X) — t]}. The 
optimal solution is ^* = {1 — Yi[ip (Xj — X) — t]}"*" where a"*" = max(a,0). 
Substituting ^* into (5), we have 






This corresponds to the following objective function at the population level: 

(7) v^"^^ + A^[i-y(V'^(x-^x)-t)]+. 

The hyperplane that minimizes this criterion can be viewed as that which 
best separates the conditional distributions of X|y = —1 and X|y = 1. 


Jiang, Zhang and Cai (2008) used a slight variation of representation (7) 
to derive the asymptotic distribution of SVM. We also use a representation 
similar to (7) but with two important modifications, as we describe below. 
Return now to sufficient dimension reduction problem (1) where Y is 
an arbitrary random variable (in particular, it can be either continuous or 
categorical). Let Oy be the support of Y and let Ai and A2 be disjoint 
subsets of Qy. Let Y be the discrete random variable defined by 

(8) Y = I{YeA2)-l{YeAi). 

We introduce the following objective function for linear SDR: 

(9) LiiP, t) = ip'^'Sip + XE{1 - y[^^(X - EX) - t]}+, 

where S = var(X). Compared with (7), we have made two modifications. 
First, we allow Y to take the value 0, so that we can use a pair of disjoint 
subsets that are not a partition of fiy. Second, we have inserted 5] in the 
first term of (9). This is so that the objective function transforms in a desired 
manner. We will return to this point in Section 6. 

We now establish the unbiasedness of the normal vector for the opti- 
mal separating hyperplane in SVM as an estimator of the central subspace. 
Let Fn be the empirical distribution based on the sample (Xi,Yi ),..., 
(X.n,Yn), Fq be the true distribution of (X.,Y), and T be a statistic that 
can be expressed as a matrix- valued function of the distribution of (X, 1"). 
In our context, we say T(F„) is an unbiased estimator of 5y|x, if it satisfies 

(10) span[T(Fo)] C 5y|x. 

Theorem 1. Suppose -^(XIt^^X) is a linear function of rj^l^, where rj 
is as defined in (1). If [if)* ,t*) minimizes the objective function (9) among 
all {ip, t)eWx R, then tp* G 5y|x. 

The linearity condition on £^(X|t7^X) in the theorem is well known and 
generally assumed in the SDR literature. See, for example, Li and Duan 
(1989), Li (1991) and Li and Dong (2009). It implies 

(11) ii;(v.'^x|r,Tx) = ^Tp;(s)x, 

where Pr,(I]) is the projection matrix r]{r]^'Sr])~^r]^'S [Cook (1998)]. It 
is satisfied when X is elliptically symmetric [Eaton (1986)], and is approx- 
imately satisfied when p is large [Hall and Li (1993)]. Interestingly, as we 
show in Section 5, this assumption is no longer needed for the unbiasedness 
in the more general setting of nonlinear sufficient dimension reduction. 

Here we note that, though Theorem 1 is far from a trivial generalization, 
the type of argument used in the proof is somewhat standard in the SDR 


literature. See, for example, Li and Duan (1989), Cook (1998) and Cook 
and Li (2002). It is possible to extend the above theorem to more general 
objective functions. For example, the theorem still holds if a i— t- a"*" in the 
objective function is replaced by any convex function u{a). 

4. Estimation procedure for linear PSVM. 

4.1. Estimation. We propose two ways to generate the set of pairs of 
slices for PSVM. One, which we call "left versus right" (LVR), repeatedly 
divides the predictors into two groups according to a set of cutting points 
for the response. The other, which we call "one versus another" (OVA), 
partitions the predictors into several slices and pairs up all possible slices. 
We summarize the estimation procedure as follows. 

1. Compute the sample mean X and sample variance matrix S. 

2. (LVR) Let qr,r = 1,. . . ,h — 1, be h — 1 dividing points. For example, 
they can be equally spaced sample percentiles of {Yi, . . . ,Yn}. Let 

(12) Y; = I{Yi > Qr) - I{Yi < qr) 

and let (■0^, i^)) r = 1, . . . , /i — 1, be the minimizer of 

(13) V'^i]V^ + A^„{i-y^[(x-x)^^-t]}+. 

2'. (OVA) Apply SVM to each pair of slices from the h slices. More specif- 
ically, let (7o = ™in{li, . • • ,5^} and (//j = max{Yi, . . . ,y„}. For each (r, s) 
satisfying 1 < r < s < /i, let 

fr = I{qs-l <Yi< Qs) - I{qr-i <Yi< q^). 

Let {il}j.g,trs) be the minimizer of the objective function 

'lP'^t^p + XEn{l - y"1(X - X)^rp - t]}+ . 

3. Let vi, . . . , Vrf be the d leading eigenvectors of either one of the ma- 

h-l h h 

(14) Mn = Y,^r'4'J or Mn = Y,Yl ^rs¥rs- 

r=l r=l s=r+l 

We use subspace spanned by v = (vi, . . . , v^) to estimate 5y|x- 

Based on our experiences, LVR works best when the response is a con- 
tinuous variable, where Y being larger or smaller has a concrete physical 
meaning; OVA works best when the response is categorical, where the val- 
ues of Y are simply labels of classes, such as different vowels in our example 
in Section 9. Our numerical studies also suggest that the estimation re- 


suits are not overly sensitive to the choice of the number of slices h, though 
a larger h often works better. 

Standard packages for SVM minimize the objective function (6) instead 
of (13). However, they can be modified to suit our procedure. Let C = S^'^i/^ 
and Z = I]"i/2(x - X). Then (13) becomes 

(15) c^c + AK[i-y^(z^C-t)]+. 

We can apply standard packages to minimize (15) to obtain ^, whose trans- 
formation S~^'^^ is the desired minimizer of (13). We use the kernlab 
package in R to solve problem (15). See Karatzoglou and Meyer (2006) for 
an exposition of this package. 

4.2. Order determination. Estimating the dimension d of the central 
subspace is a vital ingredient of sufficient dimension reduction estimation. 
Here, we propose a cross-validated BIC procedure [Schwarz (1978)] for this 
purpose. The BIC component of this procedure is an extension of a criterion 
introduced by Wang and Yin (2008), and is also related to Zhu, Miao and 
Peng (2006). We refer to this combined procedure as CVBIC. 

Let M„ be one of the matrices in (14), and let Aj(M„) be its ith largest 
eigenvalue. Let Gn{k) = Yl,i=i ^i(^n) — ci{n)c2{k) , where ci{n) is a sequence 
of positive numbers or random variables that converge (in probability) to 0, 

and C2{k) is a nonrandom increasing function of k. Let d be the maximizer 
of Gn{k) over {0,...,p}. In Section 7, we show that P{d = d) — )■ 1. The 
standard choices of ci(n) and C2(fe) are ci(n) oc n~^/^log(n) and C2{k) = /c, 
so that the penalty term is CQn~^''^\og{n)k, where cq > is a constant (or 
random variable) of order 0(1) [or Op(l)]. Since the eigenvalues Aj(M„) may 
differ for different problems, it is sensible to make cq comparable to their 
magnitude. One reasonable choice is to make cq proportional to Ai(M„), 
leading to the following BIC-type criterion: 


(16) Y. ^*(^") - aAi(M„)n-V2 iog(„)^. 

We now turn to the choice of a. Though this choice does not affect the 
consistency of (i, it does affect its finite-sample performance. Moreover, from 
our experience this choice is also sensitive to p, d, and the regression model. 
For these reasons, it is important to have a systematic way of choosing a. The 
SVM used in our setting suggests naturally the cross-validation, because the 
former provides a set of labels to validate. We outline the CVBIC procedure 
as follows, using LVR as an illustration. 

First, divide the data into a training set and a testing set, denoted by 

{(Xi,yi), . . . , (x„,, y;j}, {(Xi, n), . . . , (x„„y„,)}. 


Apply the PSVM to the training set with dividing points qi, . . . ,qh~i to 
obtain a set normal vectors ■0;^, . . . ,i/'/^_i. Let M„j = X^j=i '^'^'i^J . Second, 
for a fixed a, maximize the criterion (16), with M„, replaced by M„j, to 
obtain an integer k. Let vj^, . . . , v^ be the k leading eigenvectors of Mm and 
transform the testing predictors Xj to X^- = (vi, . . . , Vfc)'''Xj, i = 1, . . . ,n2- 
Third, let Lj = I{Yi > q^) — I{Yi < qr) be the true label of Yi in the testing 

set. Apply SVM to (X^ ,Li), . . . , (KuJ ,Ln2) to predict Li,. . . ^Ln^. Repeat 
this process for all dividing points and record the total number of misclas- 
sifications. The optimal a is the one that minimizes the total number of 
misclassifications. Finally, substitute the optimal a into (16) and maximize 
it again using the full data to estimate d. In Section 8.3, we investigate the 
numerical performance of CVBIC under a variety of combinations of p, d, n 
and regression models. 

4.3. Special features of linear PSVM. As we conclude the exposition of 
the linear PSVM, we mention some special features of this method. One is 
that it shares the similar limitation with SIR when dealing with regression 
functions that are symmetric about the origin. If the regression function 
is /(||X||), then all slices of the form {Xj:1^ G 5} are roughly concentric 
spheres in R^, which no hyperplane in W can separate. However, as we 
shall see in Sections 5 and 8, this is remedied by the kernel PSVM, because 
when mapped into higher dimensional feature space the slices become linear 

Another is that when dealing asymmetric regression functions, the linear 
PSVM tends to work better than SIR for the following reason. Recall that 
SIR is based, roughly, on the principal components of the slice mean vec- 
tors of the form £^(X|y G /S) — E(X.), where S is an interval in Qy This 
determines that it downweights the slice means near the center of the data 
cloud, where the Euclidean norm of ii^(X|y £ S) — E(X.) is smaller. How- 
ever, it is well known that the regression function £'(y|X) tends to be more 
accurately estimated near the center of the data cloud [see, e.g., Kutner, 
Nachtsheim and Neter (2004), Section 2.4]. In comparison, the linear PSVM 
relies on the normals of the separating hyperplanes of the slices, which does 
not downweight the data near the center. As we will see from our simulation 
studies in Section 8, this brings substantial improvement to the estimate. We 
should point out, however, that there is an important exception. As shown 
in Cook (2007) and Cook and Forzani (2008), under the assumption that Y 
has a finite support and X|y has a conditional multivariate normal distri- 
bution where var(X|y) is independent of Y, SIR is the maximum likelihood 
estimate of the central subspace. In this case, no regular estimate can be 
more efficient than SIR. The mentioned advantage of linear PSVM applies 
mainly to the forward regression setting where the conditional distribution 
of X|y is typically non-Gaussian. 


5. Kernel PSVM for nonlinear dimension reduction. In this section, we 
extend the PSVM to nonlinear sufficient dimension reduction as defined 
by (2). We first develop the objective function by generalizing the linear 
PSVM objective function (9), and then establish the unbiasedness of the 
proposed nonlinear PSVM estimator. 

Before proceeding further, we note that the function in relation (2) is 
not unique in the strict sense, but is unique modulo injective transforma- 
tions. Again, the situation is parallel to linear sufficient dimension reduction 
problem (1), where 77 X is only unique modulo injective linear transfor- 
mations. Any injective linear transformation of 77 X is an equivalent linear 
predictor, because it does not change the linear subspace. Likewise, for non- 
linear SDR, any injective transformation of is an equivalent sufficient 
predictor, because it does not change conditional independence (2). 

Let ^ be a Hilbert space of functions of X. In analogy to the linear 
objective function (9), consider Ai'H x M-^M"'" defined by 

(17) A('0, t) = variV'CX)] + XE[1 - Y{ij(X) - Eij(X) - i)]+, 

where Y is as defined in (8). To see that this is indeed a generalization 
of (9), consider the bilinear form from H x 7^ to M defined by 6(/i,/2) = 
cov[/i(X),/2(X)]. Under the assumption that the mapping 

(18) n^L^iPj^), f^f 

is continuous, the bilinear form b induces a bounded and self-adjoint operator 
Tj-.'H^'H such that (/i, 11/2)-^ = b{fi, f2), where (•, •)-^ is the inner product 
in Ti. See, for example, Conway (1990), Theorem 2.2, and Fukumizu, Bach 
and Jordan (2004). The objective function (17) can now be rewritten as 

(19) A(V, t) = (V', SV)w + XE[1 - y(VXX) - Ei;{X) - t)] + . 

Thus, A(^,t) is a generalization of L('ip,t) with the matrix S replaced by 
the operator S, the linear function ip X replaced by an arbitrary function ^ 
in Ti, and the inner product in R^ replaced by the inner product in Ti. For 
the usual kernel SVM, the population-level objective function is 

iij, ij)n + XE[1 - y(^(X) - Ei;{X) - t)] + . 

Comparing with (19), we see a parallel modification to the linear case. The 
significance of this modification is further discussed in Section 6. 

We now establish that, if (■i/'*,t*) is the minimizer of A(^,t), then ip* 
is necessarily a function of the sufficient predictor 0(X) in the nonlinear 
problem problem (2). This is a generalization of the notion unbiasedness in 
the linear setting. Our definition of unbiasedness (10) in the linear sufficient 
dimension reduction setting is equivalent to 

(20) [T(Fo)]'^X is a linear function of r]~^X. 


It is the statement (20) that is more readily generahzed to the nonhnear 
sufficient dimension reduction setting: we simply require i/' to be a function 
of the sufficient predictor 0(X) in (2). The following definition makes this 
notion rigorous. For a generic random element U, let 0"{U} denote the a- 
field generated by U. 

Definition 1. A function ip gT-L is unbiased for nonlinear sufficient 
dimension reduction (2) if it has a version that is measurable a{cf)(X.)}. 

The reason that we only require a version of "0 to be measurable (t{0(X)} 
is that the L2-metric ignores measure zero sets. 

Theorem 2. Suppose the mapping (18) is continuous and: 

1. % is a dense subset of L2{Pyi), 

2. y_lLX|0(X). 

If {ip* ^t*) minimizes (19) among all {il^,t) G "H x M, then i/;*(X) is unbiased. 

Condition 1 is satisfied by some commonly used reproducing kernel Hilbert 
spaces. For example, if ^ is a reproducing kernel Hilbert space based on the 
Gaussian radial basis, then the collection of functions {c + g : c £ M, g £ Q} 
is dense in L2{Pk)- See Fukumizu, Bach and Jordan (2009). 

It is important to note that in this more general setting we no longer 
require any linearity assumption that resembles the one assumed in The- 
orem 1. In contrast, the kernel sliced inverse regression developed by Wu 
(2008) and Wu, Liang and Mukherjee (2008), and functional sliced inverse 
regression by Hsing and Ren (2009) all require a version of the linearity 
condition to hold in the reproducing kernel Hilbert space. 

The notion of unbiasedness for sufficient dimension reduction is more akin 
to Fisher consistency than to unbiasedness in the classical setting. While un- 
biasedness in the classical setting can exclude many useful statistics, Fisher 
consistency often guarantees correct asymptotic behavior without putting 
undue restrictions on the expectation. Moreover, an estimator that is not 
Fisher consistent is clearly undesirable, because it is guaranteed not to con- 
verge to the true parameter. For these reasons unbiasedness for linear SDR 
is a useful criterion, even though some useless estimators (such as 0) are 
unbiased. Unbiasedness for nonlinear SDR plays the parallel role, except 
that it only requires the estimator to be an arbitrary, rather than a linear, 
function of the true predictor. This relaxation also allows us to establish the 
unbiasedness of PSVM without evoking the linearity condition. 

Theorem 2 assumes that A{ip,t) attains its minimum in ^ x M. We think 
this is a reasonable assumption for the following reasons. As shown be- 
low, A(^,t) is lower semicontinuous with respect to the weak topology 


in ?^ X M. Since any closed, bounded, and convex set in a Hilbert space 
is compact with respect to the weak topology [Weidmann (1980), Theo- 
rem 4.25, Conway (1990), Corollary V.1.5], by the generalized Weierstrass 
theorem [Kurdila and Zabarankin (2005), Section 7.3], A('i/',t) attains its 
minimum within any such set in ?^ x R. The next proposition establishes 
this fact. Let %' be the Hilbert space ?^ x M endowed with the inner product 

Proposition 1. IfH is an RKHS with its kernel k satisfying £'k(X, 
X) < oo, then K{il),t) is lower seniicontinuous with respect to the weak topol- 
ogy in %' , and attains its minimum in any closed, hounded, and convex set 
in %' . 

6. Estimation of kernel PSVM and invariant kernel. The purpose of 
this section is twofold. First, because we have modified ('i/',V')w to {ip,T,ip)'}{ 
in the kernel SVM objective function, we can no longer use the standard 
SVM packages to solve for tp* . Therefore, we reformulate the minimization 
of A('0,t) as quadratic programming that can be solved by available com- 
puter packages. Second, in deriving this quadratic programming problem, we 
gain more insights into the meaning and significance of this modification. 
As we shall see, by replacing {ip,ip)H by {'tp,T,ip)'}{, we are in effect making 
SVM invariant with respect to the marginal distribution of X. Intuitively, 
since we are using SVM to make inference about the conditional distribution 
of y|X, it is plausible that the procedure does not depend on the marginal 
distribution of X. 

Let 7^ be a linear space of functions from Ox to M spanned by J^n = 
{■01, • • • , V'fc}- The choice of these functions will be discussed later, but it 
will ensure £'„['i/'j(X)] = 0, so that ipi{x) = ilJi{'x) — Eni^i(X.). Let 

/Vi(Xi) ••• V'fc(Xi)' 

(21) *= : •.. : 

\V'l(X„) ••• IpkC^n 

Then the sample version of the objective function (19) is 


(22) A(c) = n'^c^^^^c + Xn'^ V[l - Yi{^Jc - t)] + 

where ^^ = (V'i(Xj), . . . ,'0fc(Xj)) and c G M^'. We minimize A(c) among 
all c. 

In the following, y = (yi, . . . ,yn)^ and a,/3,^ G M". The symbol < rep- 
resents componentwise inequality. The symbol represents the Hadamard 
product between matrices. For a matrix A of full column rank. Pa is the 
projection A(A A)~^A . The symbols and 1 represent, respectively, the 
n-dimensional vectors whose entries are and 1. 


Theorem 3. Ifc* minimizes k{c) overM.^ , thenc* = i(*'^*)~^*"''(y0 
a*), where a* is the solution to the quadratic programming problem: 

maximize 1 a — j{aQy) P^(Q!0y) 

subject to < a < XI, a y = 0. 

Note that the quadratic programming problem (23) differs from that of 
the standard kernel SVM, where the projection P^ is replaced by the ker- 
nel matrix K„ = {^(i, j) : i, j = 1, . . . , n} for some positive definite bivariate 
mapping k : Ox x Ox — > M. The kernel matrix K„ uniquely determines the 
sample estimate of the covariance operator S, which bears the information 
about the shape of the marginal distribution of X. By replacing K„ with P^ , 
we are, in effect, removing the information about X. For this reason we call 
the matrix P^ an invariant kernel 

For the function class Ti, we use the reproducing kernel Hilbert space 
based on the mapping k. Common choices of k include the polynomial kernel 
k(xi,X2) = (x^X2 + c)'^', where r is a positive integer, and the Gaussian radial 

II II 2 

kernel k(xi,X2) = e~'^"^^~^^" , where 7 > 0. Let 

(24) n^ = {cq + cik(-,Xi) H h c„k(-,X„) : cq, . . . , c„ e M} 

with inner product specified by (K(-,a),K(-,b)) =K(a,b). In the standard 
kernel SVM, it is a common practice to use all functions in T-Lk as Ti. How- 
ever, the invariant nature of our kernel, P*, determines that we cannot 
use all those functions, because if so then P^ becomes nearly an identity 
matrix (note that if P^ were an identity matrix then the objective func- 
tion in (23) would become independent of Xi, . . . ,X„). We instead use the 
principal functions of the linear operator S„, as defined by (V'i,5]„'02) = 
cov„[V'i(X),^2(X)], as our basis -F„. Here cov„(-,-) denotes sample covari- 
ance. Let Q„ = I„ — Jn/n, where I^ is the n x n identity matrix and J„ is 
the n X n matrix whose entries are 1. The next proposition tells us how to 
find the eigenfunctions of S„. Its proof is easy and omitted. 

Proposition 2. Letw = {wi,...,Wn), V'w = J^'^d^i'^^^i) - ^ni^i'^,^)]- 
The following statements are equivalent: 

1. w is an eigenvector of the matrix Q„K„Q„ with eigenvalue X; 

2. -i/^w is an eigenfunction of the operator S„, with eigenvalue X/n. 

If Xj^O, then either statement implies (V'w(Xi), . . . ,'0w(X,i)) = Xw^ . 

Although the eigenvectors of Q„K„Q„ and the eigenfunctions of S„ are 
similar objects, it is the latter that can be evaluated at any x, not just the 
observed Xi, . . . ,X„. This property is important for prediction. Essentially, 


we use the first k eigenfunctions (j)i,. . .^(pk of S„ as tlie functions in T^ ■ 
This is equivalent to using {aicpi, . . . , a^cpk} = {V'l) • • • ) "^fc} for any nonzero 
ai, . . . ,afc. We choose Oj to satisfy aj(^j(Xi), . . . ,'0i(X„))''' = Wj, where Wj 
is the eigenvector of Q„K„Q„, corresponding to its ith eigenvalue Aj. Thus 
ai = 1/Aj. With this choice, ^ is simply (wi, . . . , ■w^). The choice of number 
of basis functions, k, should allow sufficient flexibility but not as large as n; 
our experiences indicate that the choice of k in the range n/3 ~ 2n/3 works 
well. We summarize the kernel PSVM estimation procedure as follows. 

1. (Optional) Marginally standardize Xi,...,X„. Let fir and a^ be the 
sample mean and sample variance Xir,. . . ,Xnr- Reset Xir to be {Xir — 
fir) far- The purpose of this step is so that the kernel k treats different 
components of Xj more or less equally. This step can be omitted if the 
components of Xj have similar variances. 

2. Choose a kernel k and the number of basis functions k (say k = n/2). 
Compute * = (wi, . . . , w^) and P^ from Q„K„Q„. 

3. Divide the sample according to LVR or OVA, each yielding a set of 
slices. For each pair of slices, solve the quadratic programming problem 
in Theorem 3 using the P^ computed from step 2. This gives coefficient 
vectors cj, . . . , c! e R'', where h = h-l for LVR and h = (0 for OVA. 

4. Compute the first d eigenvectors, vi, ... jV^i, of the matrix ^^^^c*c* . 
Denote the rth component of of v^ as Vsr- 

5. The sth sufficient predictor evaluated at x is fsiV'i(x) H l-fsfcV'fc(x), 

where ipr{'^) = A~^^"^^ u;ri[K(x, Xj) — £^„k(x, X)]. If step 1 is used, then x 
should be marginally standardized by the fir and ar computed from that 

Many computing packages are available to solve the quadratic programming 
problem in step 3. We use the ipop program in the kernlab package in R. 
See Karatzoglou et al. (2004). If the Gaussian radial kernel is used in step 2, 
then we recommend choosing 7 as 

1 " 

^2/ i<j,j=2 

Alternatively, we can use the population version of the above quantity, 
(26) 7 = l/(i?l|X-X'||)2, 

where X and X' are independent A^(0, Ip) random vectors. This quantity can 
be easily evaluated by Monte Carlo. In Section 8, we use (26) for large-scale 
simulations to avoid repeated evaluations of (25), whereas in Section 9 we 
use (25) for the real data analysis, where it needs to be calculated only once. 
Some authors recommend sample median in (25). See Gretton et al. (2005) 
and Fukumizu, Bach and Jordan (2009). This does not make a significant 
difference in our examples. 


7. Asymptotic analysis of linear PSVM. In this section, we give a com- 
prehensive asymptotic analysis of linear PSVM estimator introduced in Sec- 
tions 3 and 4. This is developed in three parts. First, we derive the influence 
function for the normal vector -0 based on two slices. In this part, we em- 
ploy some asymptotic properties of SVM developed recently by Jiang, Zhang 
and Cai (2008). In the second part, we derive the asymptotic distribution 
of the linear PSVM estimator, (vi, . . . , v^^), defined in Section 4.1. In the 
third part, we establish the consistency of the order determination criterion 
introduced in Section 4.2. 

7.1. Influence function for support vector machine. The asymptotic re- 
sults of Jiang, Zhang and Cai (2008) are largely applicable here except for 
three places: our SVM involves an additional S; our A is fixed but the A in 
their paper depends on n; they did not derive the explicit form of the hessian 
matrix — and hence neither the asymptotic variance — but we are interested 
in the explicit asymptotic distribution. The first two points are minor but the 
third needs nontrivial additional work. We only consider the case where Y is 
defined through a partition {yli,j42} of ily- Thus, our results only apply to 
the LVR scheme. The asymptotic analysis the OVA scheme can be carried 
out similarly, and is omitted. 

We first develop some notation. Let 6 = (t/>^,t)^, Z = (X~'',y)'^, X* = 
(X^, -1)^, E* =diag(5],0). Then 

(27) V^SV + A[l - YixJiP - t)]+ = 9^-^*0 - A(l - e^X*Y)+. 

We denote this function by m(0,Z). Let ilz be the support of Z and 
let h:6 X i7z — ^ I^^ be a function of (0,Z). Let Dg denote the {p + 1)- 
dimensional column vector of differential operators {d/d6i,. . . ,d/d6p-^i) . 
The next theorem gives the gradient of the support vector machine objective 
function E[m{0,7i)]. 

Theorem 4. Suppose, for each y = —1, 1, the distribution of X|y = y 
is dominated by the Lebesgue measure and £'(||X|p) <oo. Then 

(28) DeE[m{e, Z)] = (2V'^S, 0)^ - A^[X*y/(l - e'^X.*Y > 0)]. 

We now present the hessian matrix of support vector machine, which leads 
to the asymptotic variance of 0. To our knowledge, this is the first time that 
the asymptotic variance is explicitly given. This result is then used to derive 
the asymptotic distribution of the linear PSVM estimator. 

Theorem 5. Suppose X has a convex and open support and its condi- 
tional distributions given Y = 1 and Y = —1 are dominated by the Lebesgue 
measure. Suppose, moreover: 


1. for any linearly independent ip,S & W, y = —1,1, and ti G M, the fol- 
lowing function is continuous: 

u H> £;(X*|V'^X = n,<5^X = v,Y = y)/^Tx|5Tx^y(w|?^,y); 

2. for anyi = 1, . . . ,p, andy= —1, 1, i/iere is a nonnegative function Ci{v,y) 
with E[ci(y,Y)\Y] < oo such that 

vE{Xi\il;^X = u,S'^X = v,Y = y)f^T^^gT^Yi'^\^^y) ^ Ci(f ,y); 

3. there is a nonnegative function co{v,y) with E[co(y,Y)\Y] < oo such 
that /^Tx|5Tx,y(^ib,y) < co{v,y). 

Then the function h^ DeE[m{9,Z)] is differentiable in all directions with 
derivative matrix 

2diag(5],0) + A Y, P{Y = y)f^T^^y{t + y\y)E{^*^*^\i^'^^ = t + y). 

Furthermore, if the function (■0,t) i->/ ,Tx|y(i + ?/|?/)-E'(X*X*^|'j/'^X = t + y) 
is continuous, then Dg[m{6,Z)] is jointly differentiable with respect to 6. 

Joint differentiability and directional differentiability are sometimes ref- 
ered to as Frechet differentiability and Gateaux differentiability. The latter 
is generally weaker than the former. In a finite-dimensional space, having 
continuous directional derivative in all directions implies joint differentia- 
bility [Bickel et al. (1993), page 453]. The next theorem gives the influence 
function for support vector machine. 

Theorem 6. If the conditions in Theorems 4 o-nd 5 are satisfied, then 
6 = 90- H-i{(2^([^,0)^ - XEn[X.*YI{l - YO'^X* > 0)]} + op{n-^^'^), 
where H is hessian matrix given by Theorem 5. 

The proof is similar to that of Jiang, Zhang and Cai (2008) and is omitted. 
Alternatively, one can prove it by applying Theorem 5.23 of van der Vaart 

7.2. Asymptotic distribution of (vi,. . . ,Vd). Consider a fixed division 
point Qr, where r G {1, . . . ,h — 1}. Let y be as defined in (12), and Z'' = 
(X^,y'')^. Let 0or = {-ipor^torV be the minimizer of E[m{6,Z'')], and 6r = 
{tpj ,tr) be the minimizer of En[m{0,Zi'^)]. Let H,. be the hessian matrix 
of E[m{6, Z*")], and let F,. be the first p rows of H~"^. By Theorem 6, 

(29) ^, = -0or - Sr(6'or, Z"^) + opin-^/^), 


where s^(0, Z*") = Fr[(2t/>~^5], 0)"^ - XX*Y''I{1 - Y^'e'^X* > 0)]. Let 

h-l h-1 

r=l r=l 

For a matrix A G ]R'"i^'''2, let Kri,r2 ^ I^''^''^^^^''^ be the commutation matrix 
defined by the relation Kri,r2 vec(A) = vec(A ). See Magnus and Neudecker 
(1979). Two properties of K,.i,r-2 that will prove useful for our purpose are 
that Kr^,r2 = ^J2,ri ^nd that for any B G W^''''^ 

(30) A®B = Kr,,^3(B®A)Kr4,r2. 

We now present the asymptotic distribution of M„. 

Theorem 7. Under the assumptions in Theorems 4 and 5, -y/nvec(M„ — 
Mq) converges to multivariate normal with mean and variance 



(V + Kp,p) Yl Y.^'^Ori'Ot ® A,i)(Ip2 + K 
r=l t=l 

where A^t = ^[s^(6>or, ZOs7(6>oi, Z*)]. 

This result leads directly to the asymptotic distribution of V = (vi , . . . , 
V(^). Since, by Theorem 1, span(Mo) ^ 5y|x, we have rank(Mo) < d. We 
make the working assumption that rank(Mo) = d. This means we exclude 
the situations where the regression surface is symmetric about the origin. 
Since Mq is positive semi-definite, it has the spectral decomposition UDU"'', 
where U is a p x d matrix whose columns are the eigenvectors of Mq cor- 
responding to nonzero eigenvalues, and D is a d x d diagonal matrix with 
diagonal elements being the nonzero eigenvalues. The following corollary is 
a direct consequence of Theorem 7 and Bura and Pfeiffer (2008). Its proof 
is omitted. 

Corollary 1. Under the assumptions in Theorems 4 o,nd 5 and 
rank(Mo) = d, -y/nvec(V — Vq) — > N{Q, T), where T is the pd x pd matrix 

(D-^U^ Ip)(Ip2 + Kp,p) ^ ^(^OrV'M » Art)(V + Kp,p)(UD^l 

r=l i=l 


It is possible to refine the PSVM estimator by introducing weights to M, 
Take the LVR scheme for example. Let ^ = {ipi, ■ ■ ■ jipj^^i). Let A be an 
h — 1 by h — 1 matrix. Rather than working with Mnj we could base the 
spectral decomposition on a weighted matrix M„(A) = ^ A^. Let v(A) = 


(vi(A), . . . , Vrf(A)) be the first d eigenvectors of M„(A). One way to deter- 
mine the optimal A is by minimizing a real- valued monotone function (say 
trace) of the asymptotic variance matrix of vec[v(A)], which can be ex- 
tracted from the asymptotic distribution. This type of argument was used 
in Li (2000, 2001) to construct optimal estimating equations. Alternatively, 
one can develop an optimal procedure using the minimum distance approach 
introduced by Cook and Ni (2005). We leave these to future research. 

7.3. Consistency of the BIC-type criterion. In the following, we say a se- 
quence of random variables Wn converges in probability to infinity {Wn — )■ oo) 
if, for any K > 0, lim„^oo -P(|^n| > K) = 1. Let d is the maximizer of Gn{k) 
over {0, . . . ,p} as defined in Section 4.2. 

Theorem 8. Suppose P{ci{n) > 0) = 1, ci(n) -;> 0, n^/^ci(n) -^ oo, 
and C2{k) is an increasing function of k. Under the conditions in Theo- 
rems 1, 4 o-nd 5 and rank(Mo) = d, we have lim„_>>oo P{d = d) = 1. 

Note that we have again made the working assumption rank(Mo) = d. 
However, even when this assumption is violated the theorem still holds 
with d replaced by the rank of Mq. 

8. Simulation studies. In this section, we compare the linear and kernel 
PSVM with four other methods based on the idea of inverse regression: SIR, 
the sliced average variance estimator [SAVE; Cook and Weisberg (1991)], 
directional regression [DR; Li and Wang (2007)], and kernel sliced inverse 
regression [Wu (2008)]. We also investigate the performance of the CVBIC 
for order determination. 

8.1. Linear dimension reduction. We use the following models: 

Model I: Y = Xi/[0.5 + (X2 + 1)^] + ae. 

Model II: Y = Xi {Xi + X2 + 1) + ae, 

Model III: Y = (Xf + xlf^ \og{Xl + Xlf'^ + ae, 

where X ~ iV(0,Ip), p = 10,20,30, e ~ iV(0,l) and a = 0.2. The sample 
size n is taken to be 100. The first two models, which are taken from Li 
(1991), are asymmetric about 0, but the last one is symmetric about 0. As 
we have discussed in Section 4.3, linear PSVM, like SIR, does not work 
when the regression surface is symmetric about 0. The first two examples 
show how the linear PSVM compares with other methods in the situations 
where it works. The purpose of the last model is to provide a benchmark 
of error when it fails, so that we can gauge how the kernel PSVM improves 
the situation in the next comparison. 


Table 1 

Estimated means and simulation standard errors (in parentheses) of the distance 

measure (31) and mean computation times (in second) of linear sufficient dimension 

reduction methods 






Linear PSVM 



0.84 (0.22) 
1.14 (0.18) 
1.31 (0.14) 

1.55 (0.19) 
1.93 (0.05) 
1.96 (0.03) 

1.02 (0.23) 
1.32 (0.17) 
1.48 (0.11) 

0.65 (0.17) 
0.93 (0.16) 
1.17 (0.14) 



1.20 (0.27) 
1.51 (0.19) 
1.67 (0.16) 

1.43 (0.16) 
1.72 (0.15) 
1.84 (0.12) 

1.17 (0.23) 
1.46 (0.14) 
1.63 (0.12) 

0.85 (0.25) 
1.26 (0.23) 
1.58 (0.17) 



1.80 (0.13) 
1.89 (0.08) 
1.93 (0.05) 

0.87 (0.21) 
1.46 (0.20) 
1.72 (0.12) 

0.85 (0.20) 
1.45 (0.20) 
1.71 (0.12) 

1.65 (0.16) 
1.85 (0.10) 
1.93 (0.05) 






To evaluate the performance of each method, we use the distance measure 
suggested by Li, Zha and Chiaromonte (2005). Specifically, let Si and ^2 be 
two subspaces of W . Then 

(31) dist(cSi,52) = ||P5i-P52ll, 

where P^^ and P^j are orthogonal projections on to Si and ^2, and || • || is 
a matrix norm. In the following the Frobenius norm is used. 

For SAVE and DR, we use /i = 4 slices, and for SIR, we use /i = 8 slices, 
having roughly the same number of points. Our choices of h are in line with 
the usual practice in the SDR literature for such a sample size. For methods 
such as SAVE and DR that involve the second-order inverse moment, h is 
suggested to be chosen smaller than that for methods such as SIR which 
only involve the first-order inverse moment [Li and Zhu (2007)]. For the lin- 
ear PSVM, the cost A is taken to be 1. The number of division points {qr) 
is 20. We have tried some other numbers of division points and obtained 
very similar results. In general, our experiences suggest that a relatively 
large number of dividing points is preferable. The results are presented in 
Table 1. The entries are of the form a{h) where a is the mean, and h is the 
standard deviation of the distance criterion (31) calculated from 200 simu- 
lated samples. The last row in Table 1 records the CPU time (in seconds) 
each method uses for Model I with p = 10 (on a Dell OptiPlex 745 desktop 
computer with speed 2.66 GHz). 

Table 1 shows that the linear PSVM consistently performs better than 
the other methods in all cases for models I and II. The intuition behind this 
improvement is explained in Section 4.3. Also, as expected, the linear PSVM 
and SIR do not perform well for Model III because of the syinmetry of the 


regression function. However, as we will see in the next comparison, this 
defect is no longer present in the kernel PSVM. The linear PSVM requires 
more computing time than the classical methods, mainly because it needs 
to process more dividing points, and, for each dividing point, the full data 
(rather than a slice of data) are processed. 

8.2. Nonlinear dimension reduction. As we have mentioned. Model III 
is symmetric about 0, and the linear PSVM fails. To a certain degree, the 
shape of regression surface of Model II is also symmetric about 0. We now 
use these two models to investigate the performance of the kernel PSVM for 
nonlinear sufficient dimension reduction. In terms of linear dimension reduc- 
tion, Model III has two sufficient predictors, Xi, X2, but in terms of nonlin- 
ear dimension reduction, it has only one sufficient predictor, (Xf + X2)^'^, 
or any monotone function of it. The kerenl PSVM is designed to recover 
a monotone transformation of {Xf + X|)^' ^ without having to assume any 
regression model. In doing so, it solves two problems at one stroke — further 
reducing the dimension from 2 to 1, and avoiding the difficulty of SIR in 
dealing with symmetric responses. 

To illustrate the idea, in Figure 3 we present the 2-D and 3-D scatter 
plots for Y versus the nonlinear and linear predictors obtained by different 
methods. The upper left panel is the 2-D scatter plot for Y versus the true 
nonlinear predictor (Xf -l-Xl)^'^; the upper right panel is the 2-D scatter 
plot of Y versus the first kerenl PSVM predictor; the lower panels are 3-D 
scatter plots for Y versus the first two predictors from SAVE and DR. We 
can see that all three methods capture the right shape of the regression 
function, but kernel PSVM only requires one predictor and its sufficient 
plot appears sharper (bearing in mind that the upper right panel only has 
to resemble a monotone transformation of the upper left panel). 

To make a more precise comparison, we need to design a new criterion 
that can compare one nonlinear predictor with two linear predictors; the 
criterion (31) is no longer suitable for this purpose. Since the nonlinear suf- 
ficient predictor estimates a monotone function of {Xf +X|)^'^, we use the 
absolute value of Spearman's correlation to measure their closeness, which 
is invariant under monotone transformation [Kutner, Nachtsheim and Neter 
(2004), page 87]. To measure the closeness between two linear predictors and 
the true nonlinear predictor {Xf + Xl)"^'^, let (^n, . . . , Uin), {U21, ■ ■ ■ , U2n) 
represent the two linear predictors obtained by SAVE or DR. These predic- 
tors estimate linear combinations of Xii,X2i but do not specify Xn, X2i 
themselves. We therefore regress Tj = Xf^ + X|j on 

{{l,Uu,U2^,Ul,Ul)■.i = l,...,n}. 

If Uii and U2i are (linearly independent) linear combinations oi Xn and X2i, 
then this regression is guaranteed to recover the true predictor Tj regard- 
less of the specific form of the linear combinations. Let Ti,...,r„ be the 



kerenl PSVM predictor 


^^ o 


o (b 

"^ ° 

o o 


^;^^^^, ^ 

X o --g^- 

^ o 


y^ 1 1 1 1 

0.0 0.2 0.4 0.6 0.8 1.0 

0.0 0.2 0.4 0.6 o.a 

Fig. 3. Comparison between linear and nonlinear sufficient dimension reduction meth- 
ods. Upper left panel: true nonlinear predictor \/Xf + X| versus Y ; upper right panel: 
first (nonlinear) PSVM predictor versus Y ; lower left: first two SAVE predictors versus Y ; 
lower right panel: first two DR predictors versus Y . 

fitted responses of tliis regression. We use the absolute values of Spearman's 
correlation between Ti and Tj to measure the performance of SAVE and DR. 
We compute these numbers for 200 simulation samples, and tabulate their 
means and standard deviations in Table 2. Note that large numbers represent 
better performance, and all numbers are between and 1. The SAVE and DR 
estimators are computed in exactly the same way as in the linear dimension 
reduction comparison. For the kernel PSVM, the cost is 1, the number of 
division points is still 20, the kernel is the Gaussian radial basis, and the 
number of principal eigenfunctions of E„ is taken to be 60. The parameter 7 
is calculated by (26), which are approximately 0.0526, 0.0257 and 0.0169 for 
p = 10,20,30, respectively. We see that the kernel PSVM actually performs 
better than SAVE and DR, even though it uses only one predictor. It also 
performs better than KSIR. Moreover, the accuracy of the kernel PSVM 
remains reasonably high for larger p, where the accuracies of SAVE, DR, 
and KSIR drop considerably. 


Table 2 

Estimated means and simulation standard errors (in parentheses) of Spearman 

correlations of linear and nonlinear sufficient dimension reduction 


Model II 

Model III 





























































8.3. Estimation of structural dimension. We now investigate the perfor- 
mance of the CVBIC order-determination procedure for a variety of combi- 
nations of {p,d,n). We still use models I and II, but, to include different d, 
we add the following models which both have d=l: 

Model IV: Y = Xi/[0.5 + {Xi + 1)^] + ae, 

Model V: Y = Xi (2Xi + 1) + ae. 

These are derived from models I and II by replacing X2 by Xi. 

We apply CVBIC in conjunction with PSVM to Models I, II, IV and V, 
with {d,n,p) ranging over the set {1,2} x {200,300,400,500} x {10,20,30}. 
The training and testing sample sizes are ni = n2 = n/2. We take 20 divid- 
ing points Qr as equally-spaced sample quantiles of li , . . . , Yn-^ . As a com- 
parison we also apply the order-determination procedure for SIR based on 
Theorem 5.1 of Li (1991) with significant level a = 0.05. The results are 
presented in Table 3, where the entries are the percentage of correct esti- 
mation of d out of 200 simulated samples for each of the 48 combinations of 
(model, p,n). Table 3 shows that CVBIC works very well, with percentage of 
correct estimation reaching as high as 100% for sample size of 200 (training 
sample size 100). In almost all cases, PSVM compares favorably with SIR 
for order determination. Also clear from the table is the trend of increasing 
accuracy for both methods as n increases. 

9. Application and further discussions. We now compare the kernel PSVM 
with SIR, SAVE, and DR in a real data analysis concerning recognition of 
vowels. The data can be found in the UCI Machine Learning Repository 
( The response variable Y is 
a categorical variable of 11 levels, representing different vowel sounds. The 
predictor X is a 10-dimensional vector describing the features of a sound. For 


Table 3 
Rate of correct order determination by SIR and PSVM in % 



n ■ 

= 200 

n = 

= 300 

n ■ 

= 400 


= 500 






























































































































clear presentation, we select only three vowels: the sounds in heed^ head and 
hud, with training and testing sample sizes being 144 and 126, respectively. 

For each dimension reduction method, we find a set of sufficient predic- 
tors from the training data, and evaluate them at the testing set, resulting 
in a sufficient plot for the testing set. Given that the testing data are inde- 
pendent of the training data from which the sufficient predictors is derived, 
the degree of separation of the vowels in the sufficient plot objectively re- 
flects the discriminating power of a dimension reduction method. The four 
scatter plots in Figure 4 present the ffi'st two predictors found by SIR (upper 
left panel), SAVE (upper right panel), DR (lower left panel), and the kerenl 
PSVM (lower right panel). For the kernel PSVM, the OVA scheme is used. 
The basis functions are the first 40 eigenfunctions of the operator S„ derived 
from the Gaussian radial kernel, whose parameter 7 is calculated by (25). 
The cost A is 1. We have varied the number of eigenfunctions (from 10 to 60) 
and the cost (from 0.5 to 20), but they do not seem to result in significant 
difference in the degree of separation in the test data. 

From Figure 4, we see that the kernel PSVM achieves much better sep- 
aration of the three vowels in the test data than the other three methods. 
The second best performer is DR, followed by SIR and SAVE. It is also 
interesting to note that the various degrees of separation are also reflected 
in the sufficient plots; that is, the distance between heed and hud is larger 
than those between heed and head, and head and hud. 

We would like to comment that classification, though important, is not 
the sole purpose for sufficient dimension reduction, and that linear and non- 
linear sufficient dimension reductions have their own strengths in reducing, 


























8 _^ e 

° oo 








'--;: o o o 


oO "in 


8 ° 

^° S 



° = 







o y 

Fig. 4. Firsi too predictors based on SIR, SAVE, DR, and the kernel PSVM plotted for 
the vowel recognition testing data set. Green, red and blue colors indicate the vowel sounds 
in heed, head, hud. 

discriminating, visualizing, and interpreting high-dimensional data. To illu- 
minate the point, consider an example where variation, rather than loca- 
tion, is the differentiating characteristic. Let y be a bernoulli variable with 
P[Y = 1) = P{Y = 0) = 1/2 and 

(X|y = y)~iV 

where cr^(O) = 1 and (7^(1) = 10. Let (Xi, Yi), . . . , (X„,y„) be a sample from 
this model, where n = 200 and p = 10. For simplicity, we fix the number of 
cases of y = 1 at n/2, because this has no bearing on our problem. In this 
case, the central subspace is span(ei, 62), where e^ = (0, . . . , 1, . . . , 0)"'" with 
the 1 occupying the ith position. 

We apply SAVE and the kernel PSVM and the results are shown in Fig- 
ure 5, where the top panel shows the scatter plot for the true sufficient 
predictors Xi and X2, the lower left panel shows the first two SAVE predic- 





o o 







o % 




+ -ft 

/ °° 













o oS> 

O OqO 











o o 










-2-10 1 

SAVE predictorl 

Fig. 5. Variation as the differentiating characteristic. Blue o represents the Y = Q cases 
and red + represents the Y — 1 cases. 

tors, and the lower right panel shows the boxplot of a single kernel PSVM 
predictor. Since for a single variable we cannot produce a scatter plot, for 
clarity we use a boxplot to represent the predictor. The value of the kernel 
PSVM predictor is represented by the height in the boxplot; the two boxes 
represents the two groups. All three plots are based on the testing data. 
What is interesting is that kernel PSVM in some sense "translates" the 
difference in variation into the difference in location. The intuitive reason 
is that there is a quadratic — and hence variance — component in the kernel 
mapping, but in the mapped high-dimensional space the variance compo- 
nent is treated as an augmented part of feature vector [as in {x,x'^)]. Of 
course this is only a simplification of the situation: there is still significant 
difference in variation in the kernel PSVM predictor for the two groups. 

In this case, linear dimension reduction methods such as SAVE have a def- 
inite advantage, both for their clear separation of variation and for their good 



Si* ° 


iP o 

O O ° O + Or 

T 1 1 1 1 1 r 

-3-2-10 1 2 3 


' I 

' I 








Fig. 6. Degrees of separation by SAVE (upper panels) and kernel PSVM (lower panels) 
for higher dimensions: p = 60 (left panels), p — 80 (middle panels) and p — 100 (right 

interpretability. In the meantime, this example also shows that kernel PSVM 
is capable of differentiating variation, to the degree comparable to SAVE, 
but its interpretability is not as direct as SAVE. 

Another desirable feature of the kernel PSVM is that its accuracy is more 
stable than the classical methods as the dimension p increases. Figure 6 
shows the sufficient predictors derived from SAVE and kernel PSVM for 
p = 60,80,100 (from left to right). The upper panels are the scatter plots 
for the first two SAVE predictors, and the lower panels are the boxplots 
representing the single kernel PSVM predictor. Again, all plots are based 
on testing data. We see that SAVE gradually loses its discriminating power 
as p is increased to 100, whereas the discriminating power of kernel PSVM 
remains reasonably strong. 

Acknowledgments. We are very grateful to three referees and an Asso- 
ciate Editor, whose many useful comments and suggestions greatly broad- 
ened and deepened an earlier version of this work. 



Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc. 68 337- 
404. MR0051437 

Artemiou, a. a. (2010). Topics on supervised and unsupervised dimension reduction. 
Ph.D. thesis, Pennsylvania State Univ., University Park, PA. 

BiCKEL, P., Klaassen, C. a. J., RiTOV, Y. and Wellner, J. (1993). Efficient and 
Adaptive Inference in Semi-Parametric Models. Johns Hopkins Univ. Press, Baltimore. 

BuRA, E. and Pfeiffer, R. (2008). On the distribution of the left singular vectors of 
a random matrix and its applications. Statist. Probab. Lett. 78 2275-2280. MR2462662 

Conway, ,J. B. (1990). A Course in Functional Analysis, 2nd ed. Graduate Texts in 
Mathematics 96. Springer, New York. MR1070713 

Cook, R. D. (1994). Using dimension-reduction subspaces to identify important inputs 
in models of physical systems. In Proc. Section on Physical and Engineering Sciences 
18-25. Amer. Statist. Assoc, Alexandria, VA. 

Cook, R. D. (1996). Graphics for regressions with a binary response. J. Amer. Statist. 
Assoc. 91 983-992. MR1424601 

Cook, R. D. (1998). Regression Graphics: Ideas for Studying Regressions Through Graph- 
ics. Wiley, New York. MR1645673 

Cook, R. D. (2007). Fisher lecture: Dimension reduction in regression. Statist. Sci. 22 
1-26. MR2408655 

Cook, R. D. andFORZANi, L. (2008). Principal fitted components for dimension reduction 
in regression. Statist. Sci. 23 485-501. MR2530547 

Cook, R. D. and Ll, B. (2002). Dimension reduction for conditional mean in regression. 
Ann. Statist. 30 455-474. MR1902895 

Cook, R. D. and Nl, L. (2005). Sufficient dimension reduction via inverse regression: 
A minimum discrepancy approach. J. Amer. Statist. Assoc. 100 410-428. MR2160547 

Cook, R. D. and Weisberg, S. (1991). Discussion of "Sliced inverse regression for di- 
mension reduction," by K.-C. Li. J. Amer. Statist. Assoc. 86 316-342. 

Eaton, M. L. (1986). A characterization of spherical distributions. J. Multivariate Anal. 
20 272-276. MR0866075 

FuKUMizu, K., Bach, F. R. and Jordan, M. I. (2004). Dimensionality reduction for 
supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res. 5 
73-99. MR2247974 

Fukumizu, K., Bach, F. R. and Jordan, M. I. (2009). Kernel dimension reduction in 
regression. Ann. Statist. 37 1871-1905. MR2533474 

Fung, W. K., He, X., Liu, L. and Shi, P. (2002). Dimension reduction based on canonical 
correlation. Statist. Sinica 12 1093-1113. MR1947065 

Gretton, a., Bousquet, O., Smola, A. and Scholkopf, B. (2005). Measuring sta- 
tistical dependence with Hilbert-Schmidt norms. In 16th International Conference on 
Algorithmic Learning Theory (S. Jain, H. U. Simon and E. Tomita, eds.). Lecture 
Notes in Computer Science 3734 63-77. Springer, Berlin. MR2255909 

Hall, P. and Ll, K.-C. (1993). On almost linearity of low-dimensional projections from 
high-dimensional data. Ann. Statist. 21 867-889. MR1232523 

Hsing, T. and Ren, H. (2009). An RKHS formulation of the inverse regression dimension- 
reduction problem. Ann. Statist. 37 726-755. MR2502649 

Jiang, B., Zhang, X. and Cai, T. (2008). Estimating the confidence interval for pre- 
diction errors of support vector machine classifiers. J. Mach. Learn. Res. 9 521-540. 

Karatzoglou, a. and Meyer, D. (2006). Support vector machines in R. J. Stat. Softw. 
15 9. 


Karatzoglou, a., Smola, A., Hornik, K. and Zeileis, A. (2004). Kernlab — an S4 

package for kernel methods in R. J. Stat. Software 11 9. 
KuRDlLA, A. J. and Zabarankin, M. (2005). Convex Functional Analysis. Birkhauser, 

Basel. MR2145612 
KuTNER, M. H., Nachtsheim, C. J. and Neter, J. (2004). Applied Linear Regression 

Models, 4th ed. McGraw-Hill/Irwin, Boston. 
Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). 

J. Amer. Statist. Assoc. 86 316-342. MR1137117 
Li, K.-C. (1992). On principal Hessian directions for data visualization and dimension 

reduction: Another application of Stein's lemma. J. Amer. Statist. Assoc. 87 1025- 

1039. MR1209564 
Li, B. (2000). Nonparametric estimating equations based on a penalized information cri- 
terion. Canad. J. Statist. 28 621-639. MR1793116 
Li, B. (2001). On quasi likelihood equations with non-parametric weights. Scand. J. Stat. 

28 577-602. MR1876501 
Li, B. and Dong, Y. (2009). Dimension reduction for nonelliptically distributed predic- 
tors. Ann. Statist. 37 1272-1298. MR2509074 
Li, K.-C. and Duan, N. (1989). Regression analysis under link violation. Ann. Statist. 17 

1009-1052. MR1015136 
Li, B. and Wang, S. (2007). On directional regression for dimension reduction. J. Amer. 

Statist. Assoc. 102 997-1008. MR2354409 
Li, B., Zha, H. and Chiaromonte, F. (2005). Contour regression: A general approach 

to dimension reduction. Ann. Statist. 33 1580-1616. MR2166556 
Li, Y. and Zhu, L.-X. (2007). Asymptotics for sliced average variance estimation. Ann. 

Statist. 35 41-69. MR2332268 
LOH, W.-Y. (2002). Regression trees with unbiased variable selection and interaction 

detection. Statist. Sinica 12 361-386. MR1902715 
Magnus, J. R. and Neudecker, H. (1979). The commutation matrix: Some properties 

and applications. Ann. Statist. 7 381-394. MR0520247 
SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. 

VAN DER Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series m Statistical and 

Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge. MR1652247 
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. MR1641250 
Wang, Y. (2008). Nonlinear dimension reduction in feature space. Ph.D. thesis, Pennsyl- 
vania State Univ., University Park, PA. 
Wang, Q. and Yin, X. (2008). A nonlinear multi-dimensional variable selection method 

for high dimensional data: Sparse MAVE. Comput. Statist. Data Anal. 52 4512-4520. 

Weidmann, J. (1980). Linear Operators in Hilbert Spaces. Graduate Texts m Mathematics 

68. Springer, New York. MR0566954 
Wu, H.-M. (2008). Kernel sliced inverse regression with applications to classification. 

J. Comput. Graph. Statist. 17 590-610. MR2528238 
Wu, Q., Liang, F. and Mukherjee, S. (2008). Regularized sliced inverse regression for 

kernel models. Technical report, Duke Univ., Durham, NC. 
XiA, Y., Tong, H., Li, W. K. and Zhu, L.-X. (2002). An adaptive estimation of dimen- 
sion reduction space. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 363-410. MR1924297 
Yeh, Y.-R., Huang, S.-Y. and Lee, Y.-Y. (2009). Nonlinear dimension reduction with 

kernel sliced inverse regression. IEEE Transactions on Knowledge and Data Engineering 

21 1590-1603. 



Yin, X. and Cook, R. D. (2002). Dimension reduction for the conditional fcth moment 
in regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 159-175. MR1904698 

Yin, X., Li, B. and Cook, R. D. (2008). Successive direction extraction for estimating 
the central subspace in a multiple-index regression. J. Multivariate Anal. 99 1733-1757. 

Zhu, L., Miao, B. and Peng, H. (2006). On sliced inverse regression with high- 
dimensional covariates. J. Amer. Statist. Assoc. 101 630-643. MR2281245 

B. Li 

Department of Statistics 

Pennsylvania State University 

326 Thomas Building 

University Park, Pennsylvania 16802 


E-MAIL; bing@stat.psu.cdu 

A. Artemiou 

Department of Mathematical Sciences 

Michigan Technological University 

Fisher Hall, Room 306 

1400 Townsend Drive 

Houghton, Michigan 49931 


E-MAIL: aarteniio@mtu.cdu 

L. Li 

Department of Statistics 

North Carolina State University 

2311 Stinson Drive 

Campus Box 8203 

Raleigh, North Carolina 27695-8203