1 Introduction
Least squares approximations are ubiquitously used in numerical computation when trying to reconstruct an unknown function defined on some domain from its observations at a limited amount of points . In its simplest form the method amounts to minimizing the least squares fit
(1.1) 
over a set of functions that are subject to certain constraints expressing a prior on the unknown function . There are two classical approaches for imposing such constraints:

Add a penalty term to the least squares fit. Classical instances include norms of reproducing kernel Hilbert spaces or norms that promote sparsity of when expressed in a certain basis of functions.

Limit the search of to a space of finite dimension . Classical instances include spaces of algebraic or trigonometric polynomials, wavelets, or splines.
The present paper is concerned with the second approach, in which approximability by the space may be viewed as a prior on the unknown function. We measure accuracy in the Hilbertian norm
(1.2) 
where is a probability measure over . We denote by the associated inner product. The error of best approximation is defined by
(1.3) 
and is attained by , the orthogonal projection of onto . Since the least squares approximation is picked in , it is natural to compare with . In particular, the method is said to be nearoptimal (or instance optimal with constant ) if the comparison
(1.4) 
holds for all , where is some fixed constant.
The present paper is motivated by applications where the sampling points are not prescribed and can be chosen by the user. Such a situation typically occurs when evaluation of
is performed either by a computer simulation or a physical experiment, depending on a vector of input parameters
that can be set by the user. This evaluation is typically costly and the goal is to obtain a satisfactory surrogate model from a minimal number of evaluations(1.5) 
For a given probability measure and approximation space of interest, a relevant question is therefore whether instance optimality can be achieved with sample size that is moderate, ideally linear in .
Recent results of [10, 16] for polynomial spaces and [6] in a general approximation setting show that this objective can be achieved by certain random sampling schemes in the more general framework of weighted least squares methods. The approximation is then defined as the solution to
(1.6) 
where is a positive function and the are independently drawn according to a probability measure , that satisfy the constraint
(1.7) 
The choice of a sampling measure that differs from the error norm measure appears to be critical in order to obtain instance optimal approximations with an optimal sampling budget.
In particular, it is shown in [10, 6], that there exists an optimal choice of such that the weighted least squares is stable with high probability and instance optimal in expectation, under the nearlinear regime up to logarithmic factors. The optimal sampling measure and weights are given by
(1.8) 
where is the socalled Christoffel function defined by
(1.9) 
with any orthonormal basis of .
In many practical applications, the space is picked within a family that has the nestedness property
(1.10) 
and accuracy is improved by raising the dimension . The sequence may either be a priori defined, or adaptively generated, which means that the way is refined into may depend on the result of the least squares computation. Example of such hierarchical adaptive or nonadaptive schemes include in particular:

Sparse polynomial approximation in high dimension, which is relevant for the treatment of certain parametric and stochastic PDEs, see [4].
In this setting, we are facing the difficulty that the optimal measure defined by (1.8) varies together with .
In order to maintain an optimal sampling budget, one should avoid the option of drawing a new sample
(1.11) 
of increasing size at each step . In the particular case where are the univariate polynomials of degree , and a Jacobi type measure on , it is known that converges weakly to the equilibrium measure defined by
(1.12) 
with a uniform equivalence
(1.13) 
see [13, 17]. This suggests the option of replacing all by the single , as studied in [16]. Unfortunately, such an asymptotic behaviour is not encountered for most general choices of spaces . An equivalence of the form (1.13) was proved in [14] for sparse multivariate polynomials, however with a ratio that increases exponentially with the dimension, which theoretically impacts in a similar manner the sampling budget needed for stability.
In this paper, we discuss sampling strategies that are based on the observation that the optimal measure enjoys the mixture property
(1.14) 
As noticed in [12], this leads naturally to sequential sampling strategies, where the sample is recycled for generating . The main contribution of this paper is to analyze such a sampling strategy and prove that the two following properties can be jointly achieved in an expectation or high probability sense:

Stability and instance optimality of weighted least squares hold uniformly over all .

The total sampling budget after step is linear in up to logarithmic factors.
The rest of the paper is organized as follows. We recall in §2 stability and approximation estimates from
[10, 6] concerning the weighted least squares method in a fixed space . We then describe in §3 a sequential sampling strategy based on (1.14) and establish the above optimality properties 1) and 2). These optimality properties are numerically illustrated in §4 for the algorithm and two of its variants.2 Optimal weighted least squares
We denote by the discrete Euclidean norm defined by
(2.1) 
and by the associated inner product. The solution to (1.6) may be thought of as an orthogonal projection of onto for this norm. Expanding
(2.2) 
in the basis of , the coefficient vector is solution to the linear system
(2.3) 
where is the Gramian matrix for the inner product with entries
(2.4) 
and the vector has entries . The solution always exists and is unique when is invertible.
Since are drawn independently according to , the relation (1.7) implies that , and in particular
(2.5) 
The stability and accuracy analysis of the weighted least squares method can be related to the amount of deviation between and its expectation measured in the spectral norm. Recall that for matrices , this norm is defined as . This deviation also describe the closeness of the norms and over the space , since one has
(2.6) 
Note that this closeness also implies a bound
(2.7) 
on the condition number of .
Following [6], we use the particular value , and define as the solution to (1.6) when and set otherwise. The probability of the latter event can be estimated by a matrix tail bound, noting that
(2.8) 
where the are
independent realizations of the rank one random matrix
(2.9) 
where is distributed according to . The matrix Chernoff bound, see Theorem 1.1 in [18], gives
(2.10) 
where is an almost sure bound for . Since , it follows that is always larger than . With the choice given by (1.8) for the sampling measure, one has exactly , which leads to the following result.
Theorem 2.1.
Assume that the sampling measure and weight function are given by (1.8). Then for any , the condition
(2.11) 
implies the following stability and instance optimality properties:
(2.12) 
The probabilistic inequality (2.12) induces an instance optimality estimate in expectation, since in the event where , one has
(2.13)  
and therefore
(2.14) 
By a more refined reasoning, the constant can be replaced by which tends to as becomes large, see [6].
In summary, when using the optimal sampling measure defined by (1.8), stability and instance optimality can be achieved in the near linear regime
(2.15) 
where controls the probability of failure. To simplify notation later, we set .
3 An optimal sequential sampling procedure
In the following analysis of a sequential sampling scheme, we assume a sequence of nested spaces and corresponding basis functions to be given such that for each , is an orthonormal basis of . In practical applications, such spaces may either be fixed in advance, or adaptively selected, that is, the choice of depends on the computation of the weighted least squares approximation (1.6) for . In view of the previous result, one natural objective is to genererate sequences of samples distributed according to the different measures , with
(3.1) 
for some prescribed .
The simplest option for generating such sequences would be directly drawing samples from for each separately. Since we ask that is proportional to up to logarithmic factors, this leads to a total cost after step given by
(3.2) 
which increases faster than quadratically with . Instead, we want to recycle the existing samples in generating to arrive at a scheme such that the total cost remains comparable to , that is, close to linear in .
To this end, we use the mixture property (1.14). In what follows, we assume a procedure for sampling from each update measure to be available. In the univariate case, standard methods are inversion transform sampling or rejection sampling. These methods may in turn serve in the multivariate case when the
are tensor product basis functions on a product domain and
is itself of tensor product type, since are then product measures that can be sampled via their univariate factor measures. We first observe that, in order to draw distributed according to , for some fixed , we can proceed as follows:(3.3) 
Let now be an increasing sequence representing the prescribed size of the samples . Suppose that we are given i.i.d. according to . In order to obtain the new sample i.i.d. according to , we can proceed as stated in the following Algorithm 1.
Algorithm 1 requires a fixed number of samples from and an additional number of samples from
. The latter is a random variable that can be expressed as
(3.4) 
where for each fixed , the are i.i.d. Bernoulli random variables with . Moreover, is a collection of independent random variables. This immediately gives an expression for the total cost after successive applications of Algorithm 1, beginning with samples from , as the random variable
(3.5) 
We now focus on the particular choice
(3.6) 
as in (2.15), for a prescribed . This particular choice ensures that, for all ,
(3.7) 
where denotes the Gramian for according to (2.4).
We first estimate for this choice of . For this purpose, we note that and use the following lemma.
Lemma 3.1.
For and ,
(3.8) 
Proof:.
For the upper bound, we note that and
(3.9)  
By the Stirling bound for ,
(3.10) 
Combining this with (3.9) gives
(3.11)  
where the inequality for is verified for the choice of in (2.11) by direct evaluation for and monotonicity. For the lower bound, we estimate
(3.12) 
Using monotonicity and integration by parts with , we obtain
(3.13)  
Moreover,
(3.14) 
and using this in (3.13) gives
(3.15) 
where
(3.16) 
The latter inequality can be directly verified for the first few values of and then follows for by monotonicity. Thus, we obtain
(3.17) 
which concludes the proof of the lemma. ∎
As an immediate consequence, we obtain an estimate of the total cost in expectation by
Theorem 3.2.
With , the total cost after steps of Algorithm 1 satisfies
(3.18) 
We next derive estimates for the probability that the upper bound in the above estimate is exceeded substantially by . The following Chernoff inequality can be found in [1]. For convenience of the reader, we give a standard short proof following [15].
Lemma 3.3.
Let , and be a collection of independent Bernoulli random variables with for all . Set . Then, for all ,
In particular, for all ,
Proof:.
For and , by Markov’s inequality, and using that , ,
Now take . Moreover, for , one has . ∎
We now apply this result to the random part of the total sampling costs as defined in (3.5), and obtain the following probabilistic estimate.
Theorem 3.4.
With , the total cost after step of Algorithm 1 satisfies, for any ,
(3.19)  
with .
Proof:.
With the choice we are ensured that, for each value of separately, the failure probability is bounded by . When the intermediate results in each step are used to drive an adaptive selection of the sequence of basis functions, however, it will typically be of interest to ensure the stronger uniform statement that with high probability, for all . To achieve this, we now consider a slight modification of the above results with dependent choice of failure probability to ensure that remains wellconditioned with high probability jointly for all . We define the sequence of failure probabilities
(3.21) 
for a fixed , and now analyze the repeated application of Algorithm 1, now using
(3.22) 
samples for . Note that differs only by a further term of order from .
Proof:.
For the upper bound in (3.23), note that . Using (3.10),
(3.24) 
Thus, for all ,
where we have used (3.9), (3.11) together with (3.24). Let us deal with the lower bound. For all ,
(3.25) 
Moreover using (3.15) and (3.16)
Thus, combining the previous bound together with (3) and the definition of concludes the proof of the lemma. ∎
As a consequence, analogously to (3.18) we have
(3.26) 
Using the above lemma as in Theorem 3.4 combined with a union bound, we arrive at the following uniform stability result.
Theorem 3.6.
In summary, applying Algorithm 1 successively to generate the samples , we can ensure that holds uniformly for all steps with probability at least . The corresponding total costs for generating can exceed a fixed multiple of only with a probability that rapidly approaches zero as increases.
4 Numerical illustration
In our numerical tests, we consider two different types of orthonormal bases and corresponding target measures on :

On the one hand, we consider the case where is equal to , is the standard Gaussian measure on and the vector space spanned by the Hermite polynomials normalized to one with respect to the norm up to degree , for all . This case is an instance of polynomial approximation method where the function and its approximants from might be unbounded in , as well as the Christoffel function .

On the other hand, we consider the case where with the uniform measure on and the approximation spaces are generated by Haar wavelets refinement. The Haar wavelets are of the form with and , and . In adaptive approximation, the spaces are typically generated by including as the scale level grows, the values of such that the coefficient of for the approximated function is expected to be large. This can be described by growing a finite tree within the hierarchical structure induced by the dyadic indexing of the Haar wavelet family: the indices or can be selected only if has already been. In our experiment, we generate the spaces by letting such a tree grow at random, starting from . This selection of is done once and used for all further tests. The resulting sampling measures exhibit the local refinement behavior of the corresponding approximation spaces .
As seen further, although the spaces and measures are quite different, the cost of the sampling algorithm behaves similarly in these two cases. While we have used univariate domains for the sake of numerical simplicity, we expect a similar behaviour in multivariate cases, since our results are also immune to the spatial dimension . As already mentioned, the sampling method is then facilitated when the functions are tensor product functions and is a product measure.
(a) Hermite polynomials  (b) Haar wavelets 
The sampling measures are shown in Figure 1. In case (ii), the measures are uniform measures on dyadic intervals in , from which we can sample directly, using operations per sample. In case (i), we have instead . Several strategies for sampling from these densities are discussed in [6]. For our following tests, we use inverse transform sampling: we draw uniformly distributed in , and obtain a sample from by solving . To solve these equations, in order to ensure robustness, we use a simple bisection scheme. Each point value of can be obtained using operations, exploiting the fact that for ,
where is the standard Gaussian density and
its cumulative distribution function. The bisection thus takes
operations to converge to accuracy. For practical purposes, the sampling for case (i) can be accelerated by precomputation of interpolants for
.In each of our numerical tests, we consider the distributions of and resulting from the sequential generation of sample sets for by Algorithm 1
. In each case, the quantiles of the corresponding distributions are estimated from 1000 test runs. Figure
2 shows the results for Algorithm 1 in case (i) with as in (3.6), where we choose . We know from Theorem 2.1 that, for each , one has with probability greater than . In fact, this bound appears to be fairly pessimistic, since no sample with is encountered in the present test. In Figure 2(b), we show the ratio . Recall that is defined in 3.5 as the total number of samples used in the repeated application of Algorithm 1 to produce , and thus provides a comparison to the costs of directly sampling from . The results are in agreement with Theorems 3.2 and 3.4, which show in particular that with very high probability.(a) Gramian condition numbers  (b) Cost comparison 
Using the same setup with as in (3.22), where , leads to the expected improved uniformity in . The corresponding results are shown in Figure 3, with sampling costs that are in agreement with (3.26) and Theorem 3.6. Since the effects of replacing by are very similar in all further tests, we only show results for with in what follows.
(a) Gramian condition numbers  (b) Cost comparison 
While the simple scheme in Algorithm 1 already ensures nearoptimal costs with high probability, there are some practical variants that can yield better quantitative performance. A first such variant is given in Algorithm 2. Instead of deciding for each previous sample separately whether it will be reused, here a queue of previous samples is kept, from which these are extracted in order until the previous sample set is exhausted. Clearly, the costs of this scheme are bounded from above by those of Algorithm 1.
As expected, in the results for Algorithm 2 applied in case (i), which are shown in Figure 4, we find an estimate of the distribution of that is essentially identical to the one for Algorithm 1 in Figure 2(a). The costs, however, are substantially more favorable than the ones in Figure 2(b): using Algorithm 2, the successive sampling of uses only a small fraction of additional samples when compared to directly sampling only .
Comments
There are no comments yet.