calibration - Université de Neuchâtel

Colloque sur les méthodes de sondage
en l'honneur de Jean-Claude Deville
(Conference on survey sampling methods in honour of Jean-Claude Deville)
Neuchâtel
24-26 June 2009
A twenty-year story of
calibration at Insee …
and elsewhere
Olivier Sautory (Cepe-Insee)
Contents
A story of calibration at the French NSI Insee
based upon :
• papers by Jean-Claude Deville…
• … and papers by his colleagues
• software developed for calibration…
• … and its use at Insee
and a survey of a few articles on calibration
but (of course !) much less detailed than :
Särndal, C.-E. (2007), The calibration approach in survey theory and
practice, Survey methodology, vol 33, n°2, 99-119
(93 references…)
I. Prehistory of calibration
I.1 The REDRE software
Lemel Y. (1976). Une généralisation de la méthode du quotient pour
le redressement des enquêtes par sondage, Annales de l'Insee, n°22-23, 273-281.
Survey on health : sample s of households (k) of size n + sample of
all the persons living in the selected households.
Objective : to find weights wk such that :
• for the categorical variable X (size of household) measured on the
households, with categories $X_1 \dots X_j \dots X_J$ :
$$\sum_{k \in s} w_k \, \mathbb{1}_{\{X(k) = X_j\}} = M_j = \#\ \text{households in } U \text{ in category } X_j \qquad (j = 1 \dots J)$$
• idem for categorical variables X', X", etc. …
… and such that :
• for the categorical variable Z (sex × age group) measured on the
persons, with categories $Z_1 \dots Z_i \dots Z_I$ :
$$\sum_{k \in s} w_k \, z_i(k) = N_i = \#\ \text{persons in } U \text{ in category } Z_i \qquad (i = 1 \dots I)$$
where $z_i(k)$ = number of members of household k in category $Z_i$
• idem for categorical variables Z', Z", etc.
Solution : wk = weight for household k and for each member of the
household
→ Problem of integrated weighting (household / person), not a
problem of estimation
Calibration equations :
$$(E) \qquad \underset{(c,n)}{{}^t Z} \; \underset{(n,1)}{w} = \underset{(c,1)}{v}$$
with :
• c = number of independent constraints
• Z = matrix (n,c) containing the $\mathbb{1}_{\{X(k) = X_j\}}$ and the $z_i(k)$
• w = vector containing the weights $w_k$
• v = vector containing the known population marginal counts $M_j$ and $N_i$
There are infinitely many solutions in w (if c < n). We seek to minimize the
variance of the weights :
$$\min_{w_k} \sum_{k \in s} w_k^2 \quad \text{subject to } (E)$$
$$\rightarrow \quad w = Z \, ({}^t Z \, Z)^{-1} \, v$$
i.e. weights of the GREG estimator (if sampling weights equal)
The software REDRE computes weights wk, satisfying calibration
equations for categorical and numerical variables.
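As a rough numerical illustration of the closed form $w = Z({}^tZ\,Z)^{-1}v$, the following Python sketch uses NumPy on a made-up sample of five households. The matrix `Z` and the totals `v` are invented for the example; nothing here is taken from Lemel's paper or from the REDRE code.

```python
import numpy as np

# Hypothetical sample of n = 5 households (values invented for illustration).
# Columns of Z: 1-person household indicator, 2+-person household indicator,
# number of women in the household, number of men in the household.
Z = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 2, 1],
    [0, 1, 1, 2],
], dtype=float)

# Known population counts (also invented): M_1, M_2, N_women, N_men.
v = np.array([40.0, 60.0, 110.0, 95.0])

# Weights with minimal sum of squares subject to tZ w = v:
#   w = Z (tZ Z)^{-1} v
w = Z @ np.linalg.solve(Z.T @ Z, v)

print("weights:", w)
print("calibrated totals:", Z.T @ w)
```

The same weight is attached to a household and to each of its members, which is why one row of `Z` can carry both household indicators and person counts.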
I.2 Generalized regression estimator (GREG)
Särndal C.E. (1982). Implications of survey design for generalized
regression estimation of linear functions. Journal of Statistical
Planning and Inference, 7, 155-170.
Population U, sample s, inclusion probabilities πk
Objective : to estimate a total $T_Y = \sum_{k \in U} y_k$
Horvitz-Thompson estimator : $\hat T_{Y,HT} = \sum_{k \in s} \frac{1}{\pi_k} \, y_k = \sum_{k \in s} d_k \, y_k$
Auxiliary variables : $x_k' = (x_{1k} \dots x_{jk} \dots x_{Jk})$, $k \in s$
Known population totals : $T_X = (T_{X_1} \dots T_{X_j} \dots T_{X_J})'$
By a linear assisting model (with "weightings" or "scale factors" $q_k$) :
$$\hat T_{Y,GREG} = \hat T_{Y,HT} + (T_X - \hat T_{X,HT})' \, \hat B_s$$
$$\hat B_s = \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1} \Big( \sum_{k \in s} d_k q_k x_k y_k \Big)$$
We can write :
$$\hat T_{Y,GREG} = \sum_{k \in s} w_k y_k = \sum_{k \in s} (1 + q_k \lambda' x_k) \, d_k y_k$$
where
$$\lambda' = (T_X - \hat T_{X,HT})' \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1}$$
The weights $w_k$ are calibrated to the population totals $T_X$ :
$$\sum_{k \in s} w_k x_k = T_X$$
See also :
Fuller W. (2002). Regression estimation for survey samples, Survey
Methodology, 28, 5-23
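The weight form of the GREG estimator can be checked numerically. The sketch below, in Python with NumPy, computes $w_k = d_k(1 + q_k \lambda' x_k)$ and compares $\sum w_k y_k$ with the regression form $\hat T_{Y,HT} + (T_X - \hat T_{X,HT})' \hat B_s$. The function name `greg_weights` and all data values are assumptions made for the illustration.

```python
import numpy as np

def greg_weights(d, X, T_X, q=None):
    """Return w_k = d_k (1 + q_k lambda' x_k) with
    lambda = (sum d_k q_k x_k x_k')^{-1} (T_X - T_X_HT)."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    T_X_ht = X.T @ d                              # Horvitz-Thompson totals of x
    M = (X * (d * q)[:, None]).T @ X              # sum d_k q_k x_k x_k'
    lam = np.linalg.solve(M, np.asarray(T_X, dtype=float) - T_X_ht)
    return d * (1.0 + q * (X @ lam))

# Invented sample: sampling weights, auxiliary variables (constant + one
# numerical variable), variable of interest, known totals.
d = np.array([10.0, 10.0, 20.0, 20.0, 40.0])
X = np.array([[1, 2], [1, 3], [1, 1], [1, 4], [1, 5]], dtype=float)
y = np.array([3.0, 5.0, 2.0, 7.0, 9.0])
T_X = np.array([105.0, 360.0])

w = greg_weights(d, X, T_X)

# Regression form of the same estimator (q_k = 1).
B_hat = np.linalg.solve((X * d[:, None]).T @ X, X.T @ (d * y))
t_greg = d @ y + (T_X - X.T @ d) @ B_hat

print("calibrated totals:", X.T @ w)
print("weighted form:", w @ y, " regression form:", t_greg)
```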
I.3 The 1st paper by J.-C. Deville about calibration
Deville J.-C. (1988). Estimation linéaire et redressement sur information
auxiliaire d'enquête par sondage, Essais en l'honneur d'Edmond
Malinvaud, 915-927.
• to determine new weights "close to" the sampling weights (so that the
linear estimators using these weights are "almost" unbiased) and
which satisfy calibration equations on known population auxiliary
totals.
• minimization of a quadratic distance between weights
→ generalization of the procedure by Lemel (any sampling weights +
individual weights unrelated to the sampling weights)
• links with the GREG estimator
• application : calibration on the marginal counts of a two-way frequency
table
Conclusion of the article
The search for good-quality weights is a long-standing concern of
survey practitioners. From another point of view, theory has for some
years been much concerned with regression or "generalized regression"
estimators. Curiously, the link between these two lines of work, simple
as it is, does not seem to have been well understood nor, above all,
well exploited.
Yet it gives a clear understanding of the efficiency that can be expected
from a good weighting of survey data, and helps in choosing the most
relevant adjustment variables.
Other developments can be expected, in particular for tackling
problems related to nonresponse or to incomplete sampling
frames.
II. The JASA papers
Article submitted to JASA (November 1989)
Deville J.-C. & Särndal C.-E. Calibration estimators and generalized
raking techniques in survey sampling
Published articles
Deville J.-C. & Särndal, C.-E. (1992). Calibration estimation in
survey sampling, Journal of the American Statistical Association, 87,
n°418, 375-382.
Deville J.-C., Särndal C.-E. & Sautory O. (1993). Generalized raking
procedures in survey sampling, Journal of the American Statistical
Association, 88, n°423, 1013-1020.
II.1. The calibration approach
Population U, sample s, inclusion probabilities πk
Total to estimate : $T_Y = \sum_{k \in U} y_k$
Horvitz-Thompson estimator : $\hat T_{Y,HT} = \sum_{k \in s} \frac{1}{\pi_k} \, y_k = \sum_{k \in s} d_k \, y_k$
Auxiliary variables : $x_k' = (x_{1k} \dots x_{jk} \dots x_{Jk})$, $k \in s$
(numerical variables or category indicators)
Known population totals : $T_X = (T_{X_1} \dots T_{X_j} \dots T_{X_J})'$, with $T_{X_j} = \sum_{k \in U} x_{jk}$.
Objective : compute new weights $w_k$ "close to" the sampling
weights $d_k$ which satisfy the equations :
$$\forall j = 1 \dots J \qquad \sum_{k \in s} w_k \, x_{jk} = T_{X_j}$$
→ calibration estimator :
$$\hat T_{Y,cal} = \sum_{k \in s} w_k \, y_k$$
II.2. Derivation of the calibration weights
We consider for each element k a positive distance function $G_k(w, d_k)$,
differentiable with respect to w, convex, with $G_k(d_k, d_k) = 0$. We seek the $w_k$ such that :
$$\min_{w_k} \sum_{k \in s} G_k(w_k, d_k) \quad \text{subject to} \quad \sum_{k \in s} w_k x_k = T_X$$
We denote : $F_k(u) = \frac{1}{d_k} \, g_k^{-1}(u)$, where $g_k(w) = \frac{\partial G_k}{\partial w}(w, d_k)$
Hence : $w_k = d_k F_k(x_k' \lambda)$ where λ is the solution of : $\sum_{s} d_k F_k(x_k' \lambda) \, x_k = T_X$
Usually :
$$G_k(w, d_k) = \frac{d_k}{q_k} \, G\Big(\frac{w}{d_k}\Big), \qquad q_k = \text{scale factor} > 0$$
We denote : $g(w / d_k) = q_k \, g_k(w)$, $F(u) = g^{-1}(u)$
Hence : $w_k = d_k F(q_k x_k' \lambda)$ where λ is the solution of : $\sum_{s} d_k F(q_k x_k' \lambda) \, x_k = T_X$
II.3. Link with the GREG estimator
We choose :
$$G_k(w, d_k) = \frac{1}{2 q_k} \, \frac{(w - d_k)^2}{d_k}, \quad \text{i.e.} \quad G\Big(\frac{w}{d_k}\Big) = \frac{1}{2} \Big(\frac{w}{d_k} - 1\Big)^2$$
Hence : $g(w / d_k) = w / d_k - 1$, $F(u) = 1 + u$
→ linear method
Then :
$$\lambda = \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1} (T_X - \hat T_{X,HT})$$
$$\hat T_{Y,cal} = \hat T_{Y,GREG}$$
II.4. Other distance functions
Deville, J.-C. and Särndal, C.-E. (1992) consider 6 other functions G,
particularly :
• $G_k(w, d_k) = \frac{1}{q_k} \big( w \log(w / d_k) - w + d_k \big)$ → $F(u) = \exp(u)$,
i.e. the exponential method, which gives weights that can be obtained by
the raking ratio algorithm (Deming and Stephan) when all
the auxiliary variables are categorical.
• functions G, and consequently functions F, that guarantee weight
ratios satisfying L ≤ wk/dk ≤ U, where L and U are two constants
chosen by the statistician, such that L < 1 < U (bounded methods)
Example :
$$F(u) = \begin{cases} 1 + u & \text{if } L - 1 \le u \le U - 1 \\ L & \text{if } u < L - 1 \\ U & \text{if } u > U - 1 \end{cases}$$
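The three calibration functions F mentioned in this section can be written down directly. The Python sketch below (NumPy) is only an illustration: the function names and the bounds L = 0.5 and U = 1.5 are assumptions, not Calmar's naming or defaults.

```python
import numpy as np

def F_linear(u):
    """Linear method: F(u) = 1 + u (weights may become negative)."""
    return 1.0 + np.asarray(u, dtype=float)

def F_exponential(u):
    """Exponential method: F(u) = exp(u) (always positive)."""
    return np.exp(np.asarray(u, dtype=float))

def F_truncated_linear(u, L, U):
    """Bounded example from the slide: 1 + u clipped to [L, U]."""
    return np.clip(1.0 + np.asarray(u, dtype=float), L, U)

u = np.linspace(-1.0, 1.0, 5)          # -1, -0.5, 0, 0.5, 1
print(F_linear(u))
print(F_exponential(u))
print(F_truncated_linear(u, L=0.5, U=1.5))
```

All three satisfy F(0) = 1, so that λ = 0 leaves the sampling weights unchanged.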
III. The CALMAR software
III.1 The SAS macro Calmar
CALMAR (CALibration on MARgins) = SAS macro program which
offers 4 calibration methods (1990) (numerical or categorical auxiliary
variables).
Sautory O. (1991). Redressements d'échantillons d'enquêtes auprès des
ménages par calage sur marges. Journées de Méthodologie Statistique,
jms.insee.fr
Vector λ is determined by the solution of the non-linear system
resulting from the calibration equations :
$$\sum_{k \in s} d_k F(q_k x_k' \lambda) \, x_k = T_X$$
The system is solved by Newton's iterative method (the first
iteration always gives the solution of the linear method).
In practice : convergence is achieved in a few iterations, with the stopping criterion
$$\max_{k \in s} \left| \frac{w_k^{(i+1)}}{d_k} - \frac{w_k^{(i)}}{d_k} \right| < \varepsilon$$
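The Newton iteration and its stopping rule can be sketched as follows in Python with NumPy. This is not Calmar's code: the function names `calibrate` and `rake`, the tolerance and the toy 2 × 2 data are assumptions. As a check, the exponential method is compared with a plain raking ratio on the two margins; one column indicator is left out of `X` so that the calibration variables are linearly independent.

```python
import numpy as np

def calibrate(d, X, T_X, F, dF, q=None, eps=1e-10, max_iter=50):
    """Solve sum_k d_k F(q_k x_k' lam) x_k = T_X for lam by Newton's method."""
    d = np.asarray(d, dtype=float)
    X = np.asarray(X, dtype=float)
    T_X = np.asarray(T_X, dtype=float)
    q = np.ones_like(d) if q is None else np.asarray(q, dtype=float)
    lam = np.zeros(X.shape[1])
    ratio = np.ones_like(d)                       # w_k / d_k at lam = 0
    for _ in range(max_iter):
        u = q * (X @ lam)
        residual = T_X - X.T @ (d * F(u))
        jacobian = (X * (d * q * dF(u))[:, None]).T @ X
        lam = lam + np.linalg.solve(jacobian, residual)
        new_ratio = F(q * (X @ lam))
        done = np.max(np.abs(new_ratio - ratio)) < eps   # stopping criterion
        ratio = new_ratio
        if done:
            return d * ratio
    raise RuntimeError("Newton iterations did not converge")

def rake(d, row, col, row_totals, col_totals, n_iter=500):
    """Plain raking ratio: rescale to row totals, then to column totals."""
    w = np.asarray(d, dtype=float).copy()
    for _ in range(n_iter):
        w *= (row_totals / np.bincount(row, weights=w))[row]
        w *= (col_totals / np.bincount(col, weights=w))[col]
    return w

# Invented sample classified by two binary variables.
row = np.array([0, 0, 0, 1, 1, 1, 0, 1])
col = np.array([0, 0, 1, 0, 1, 1, 1, 0])
d = np.array([10.0, 12.0, 8.0, 15.0, 9.0, 11.0, 14.0, 21.0])
row_totals = np.array([48.0, 58.0])
col_totals = np.array([56.0, 50.0])

X = np.column_stack([row == 0, row == 1, col == 0]).astype(float)
T_X = np.array([48.0, 58.0, 56.0])

w_lin = calibrate(d, X, T_X, F=lambda u: 1.0 + u, dF=np.ones_like)
w_exp = calibrate(d, X, T_X, F=np.exp, dF=np.exp)
w_rake = rake(d, row, col, row_totals, col_totals)

print("linear     :", w_lin)
print("exponential:", w_exp)
print("raking     :", w_rake)
```

Starting from λ = 0 with F(0) = F'(0) = 1, the first Newton step is exactly the linear-method λ, which is the property stated above.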
III.2 Negative weights
Huang E. & Fuller W. (1978). Nonnegative regression estimation for
sample survey data. Proceedings, Social statistics section, American
Statistical Association, 300-305.
The exponential method and the bounded methods (with L > 0) guarantee
positive weights.
Park M. & Fuller W. (2005). Towards nonnegative regression weights
for survey samples. Survey Methodology, 31, 93-101.
III.3. Extreme weights
The bounded methods avoid extremely large weight ratios and
excessively small or negative weight ratios → they are the most widely used at
Insee.
But L and U cannot be chosen arbitrarily : to guarantee a solution to the
calibration equations, L must not exceed a maximal value Lmax (< 1), and
U cannot be less than a minimal value Umin (> 1).
In practice Lmax and Umin are determined by successive trials.
Remark : Calmar doesn't use an optimization program under constraints,
since the constraints are taken into account in the definition of the F
function.
See also :
Théberge A. (2000). Calibration and restricted weights. Survey
Methodology, 26, p. 113-122.
IV. Extensions of calibration
IV.1 Integrated weighting (households – persons)
Caron N. & Sautory O. (1996). Calage sur des échantillons de ménages,
d'individus, d'individus-Kish, issus d'une même enquête, communication
aux Journées de Statistique de l'ASU, Québec.
Social surveys where households and persons are interviewed.
• households m (sample $s_M$), weights $d_m = 1/\pi_m$
auxiliary variables $x_m$, totals $T_X$
• all the members i in household m (sample $s_I$), weights $d_{m,i} = d_m$
auxiliary variables $z_{m,i}$, totals $T_Z$
• one (Kish) individual $k_m$ in each household m, selected by SRS
among the $e_m$ "eligible" members (sample $s_k$), weight $d_{k_m} = e_m d_m$
auxiliary variables $v_{k_m}$, totals $T_V$
Objectives
• to produce the same weights for all members of a household
• to ensure consistency in the statistics obtained for the various data files
Method : we calculate for each household m :
• the totals of the individual variables : $z_m = \sum_{(m,i) \in \text{household } m} z_{m,i}$
• the estimated totals of the Kish-individual variables : $\hat v_m = e_m \, v_{k_m}$
Vector of calibration variables for household m : $(x_m, z_m, \hat v_m)$
Vector of totals : $(T_X, T_Z, T_V)$
Calibration equations for the sample of households :
$$\sum_{m \in s_M} d_m \, F(x_m' \lambda + z_m' \mu + \hat v_m' \gamma) \, (x_m, z_m, \hat v_m) = (T_X, T_Z, T_V)$$
→ calibration weights : $w_m$, and $w_{m,i} = w_m$, $w_{k_m} = e_m w_m$
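The household-level aggregation can be illustrated with a small Python sketch (NumPy). It covers only the household and person files, leaves out the Kish individuals, and uses the linear method for brevity; the data, the helper `linear_calibration` and the variable names are assumptions made for the example.

```python
import numpy as np

def linear_calibration(d, X, T_X):
    """Linear method: w = d (1 + x' lam), lam = (sum d x x')^{-1} (T_X - T_X_HT)."""
    lam = np.linalg.solve((X * d[:, None]).T @ X, T_X - X.T @ d)
    return d * (1.0 + X @ lam)

# Person file (invented): household index and sex of each person.
hh = np.array([0, 1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 5, 5])
female = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
z_pers = np.column_stack([female, 1.0 - female])      # z_{m,i}

# Household file (invented): sampling weights and size-class indicators x_m.
d_m = np.array([50.0, 50.0, 100.0, 100.0, 100.0, 100.0])
size = np.bincount(hh)
x_m = np.column_stack([size == 1, size > 1]).astype(float)

# z_m = sum of the person variables over the members of household m.
z_m = np.zeros((len(d_m), z_pers.shape[1]))
np.add.at(z_m, hh, z_pers)

# Known totals (invented): households by size class, persons by sex.
T_X = np.array([160.0, 350.0])
T_Z = np.array([620.0, 560.0])

# One calibration at household level on (x_m, z_m).
w_m = linear_calibration(d_m, np.hstack([x_m, z_m]), np.concatenate([T_X, T_Z]))
w_pers = w_m[hh]                                      # w_{m,i} = w_m

print("household totals:", x_m.T @ w_m)
print("person totals   :", z_pers.T @ w_pers)
```

Because every member inherits the household weight, the person-level totals are reproduced automatically once the household-level equations hold.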
• method can be used with Calmar, with some SAS programming
• method automated in Calmar2 (2003)
• can be used in surveys that involve cluster or two-stage sampling,
where there is auxiliary information about the clusters (or PSUs)
and the SSUs (for example : establishments – employees)
• in the case households − persons, an alternative to the method
proposed by :
Lemaître G. & Dufour J. (1987). An integrated method for
weighting persons and families. Survey Methodology, 13, 199-207.
See also :
Estevao V. & Särndal C.-E. (2006). Survey estimates by calibration
on complex auxiliary information, International Statistical Review,
74, 127-147.
Isaki C., Tsay J. & Fuller W. (2004). Weighting sample data subject
to independent controls. Survey Methodology, 30, 39-49.
IV.2 Calibration in two-phase sampling
Dupont F. (1995). Alternative adjustments when there are several
levels of auxiliary information. Survey Methodology, 21, 125-135.
population U
first-phase sample s1, sampling weights d1k
second-phase sample s2, conditional sampling weights d2k.
Auxiliary information
• auxiliary variables x1 known for k ∈ s1 (hence for k ∈ s2), and
population total $T_{X_1}$ known.
• auxiliary variables x2 known for k ∈ s1 (hence for k ∈ s2)
Two main strategies
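(1) single step : calibration on x1 from s2 to U and on x2 from s2 to s1
$$\sum_{k \in s_2} d_{1k} d_{2k} \, F(x_{1k}' \gamma_1 + x_{2k}' \gamma_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat T_{X_2}) \quad \text{where} \quad \hat T_{X_2} = \sum_{k \in s_1} d_{1k} x_{2k}$$
(2) two steps :
first, calibration on x1 from s1 to U ( → intermediate weights)
$$\sum_{k \in s_1} d_{1k} \, F(x_{1k}' \lambda_1) \, x_{1k} = T_{X_1} \quad \rightarrow \text{weights } w_{1k}$$
then, calibration on x1 from s2 to U and on x2 from s2 to s1
$$\sum_{k \in s_2} d_{1k} d_{2k} \, F(x_{1k}' \mu_1 + x_{2k}' \mu_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat T^*_{X_2}) \quad \text{where} \quad \hat T^*_{X_2} = \sum_{k \in s_1} w_{1k} x_{2k}$$
or :
$$\sum_{k \in s_2} w_{1k} d_{2k} \, F(x_{1k}' \mu_1 + x_{2k}' \mu_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat T^*_{X_2}) \quad \text{where} \quad \hat T^*_{X_2} = \sum_{k \in s_1} w_{1k} x_{2k}$$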
F. Dupont compares these procedures, and examines links with a
GREG approach.
See also :
Hidiroglou M. & Särndal C.-E. (1998). Use of auxiliary information
for two-phase sampling. Survey Methodology, 24, 11-20.
Estevao V. & Särndal C.-E. (2002). The ten cases of auxiliary
information for calibration in two-phase sampling. Journal of Official
Statistics, 18, 233-255.
Simulations show that the two-step procedure is not always better
than the single step procedure (x1 and x2 weakly correlated, and y
highly correlated with x2).
Estevao V. & Särndal C.-E. (2006). Survey estimates by calibration
on complex auxiliary information, International Statistical Review,
74, 127-147.
IV.3 Calibration on more complex parameters
Calibration on a ratio
$$R = \frac{\sum_{k \in U} x_{1k}}{\sum_{k \in U} x_{2k}} = \frac{T_{X_1}}{T_{X_2}} \quad \text{known}$$
We define the "auxiliary variable" zk = x1k − R x2k , and the "known
population total" used in the calibration is TZ = 0.
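A minimal Python sketch (NumPy) of this device, using the linear method and invented data: calibrating on $z_k = x_{1k} - R\,x_{2k}$ with total 0 forces the weighted ratio to equal R. The helper `linear_calibration` and the value R = 0.45 are assumptions for the example.

```python
import numpy as np

def linear_calibration(d, X, T_X):
    """Linear method: w = d (1 + x' lam), lam = (sum d x x')^{-1} (T_X - T_X_HT)."""
    lam = np.linalg.solve((X * d[:, None]).T @ X, T_X - X.T @ d)
    return d * (1.0 + X @ lam)

# Invented sample.
d = np.array([10.0, 10.0, 20.0, 20.0, 40.0])
x1 = np.array([2.0, 3.0, 1.0, 4.0, 5.0])
x2 = np.array([4.0, 5.0, 3.0, 9.0, 12.0])
R = 0.45                                   # known population ratio T_X1 / T_X2

# Auxiliary variable z_k = x_1k - R x_2k, with "known total" T_Z = 0.
z = x1 - R * x2
w = linear_calibration(d, z[:, None], np.array([0.0]))

ratio_before = (d @ x1) / (d @ x2)
ratio_after = (w @ x1) / (w @ x2)
print("ratio with sampling weights  :", ratio_before)
print("ratio with calibrated weights:", ratio_after)
```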
Calibration on a distribution function
Ren R. (2002). Estimation de la fonction de répartition et des fractiles
d'une population finie. Journées de Méthodologie Statistique,
jms.insee.fr
See also :
Harms T. & Duchesne P. (2006). On calibration estimation for
quantiles. Survey Methodology, 32, 37-52.
IV.4 Calibration on uncertain auxiliary information
Objective : how to use uncertain auxiliary information, for example
when the information comes from another survey sample (independent
from the sample to calibrate) ?
Deville J.-C. (1999). Calage simultané de plusieurs enquêtes.
Communication présentée au symposium 1999 de Statistique Canada.
Deville J.-C. & Sautory O. (2008). Calage d'une enquête sur une
information auxiliaire incertaine. Communication présentée aux 40èmes
Journées de Statistique de la Société Française de Statistique à Ottawa.
We have two unbiased estimates of a vector of totals of J "auxiliary
variables" $T_X$ :
• from the sample s : $\hat T_{X,HT}$, with variance $V$
• from another source : $\tilde T_X$, with variance $\tilde V$
We seek an unbiased linear estimator of the following type ("GREG") :
$$\hat T_Y = \hat T_{Y,HT} + (\tilde T_X - \hat T_{X,HT})' B \quad \text{with } Var(\hat T_Y) \text{ minimal}$$
We obtain $B_{opt} = W^{-1} \Gamma$, where :
$W = Var(\hat T_{X,HT} - \tilde T_X) = V + \tilde V$, estimated by $\hat W = \hat V + \hat{\tilde V}$
$\Gamma = Cov(\hat T_{X,HT}, \hat T_{Y,HT})$, estimated by $\hat \Gamma$
→ estimator
$$\hat T_Y \approx \hat T_{Y_0} = \hat T_{Y,HT} + (\tilde T_X - \hat T_{X,HT})' \, \hat W^{-1} \hat \Gamma$$
Computation of $\hat T_{Y_0}$
We seek the best unbiased linear estimator of $T_X$, denoted $T_X^*$, as a linear
combination of $\hat T_{X,HT}$ and $\tilde T_X$, where A is a square matrix of size J :
$$T_X^* = A \, \tilde T_X + (I - A) \, \hat T_{X,HT} \quad \text{with } Var(T_X^*) \text{ minimal}$$
We obtain :
$$A = V W^{-1} \quad \text{and} \quad I - A = \tilde V W^{-1}$$
and it can be shown that if we perform the calibration of the sample on
the totals $T_X^*$, then the calibrated estimator $\hat T_{Y,cal} = \hat T_{Y_0}$
→ we perform the calibration on
$$\hat A \, \tilde T_X + (I - \hat A) \, \hat T_{X,HT} \quad \text{where } \hat A = \hat V \hat W^{-1}$$
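The combined totals $T_X^*$ are easy to compute once the two variance matrices are available. The Python sketch below (NumPy) uses invented estimates and variances for J = 2, and assumes, as in the setting described above, that the two sources are independent; the variable names are illustrative only.

```python
import numpy as np

# Invented estimates of the same two totals and their variance matrices.
T_hat = np.array([1000.0, 520.0])                 # from the sample s
V = np.array([[400.0, 60.0], [60.0, 225.0]])
T_tilde = np.array([1040.0, 500.0])               # from the other source
V_tilde = np.array([[100.0, 10.0], [10.0, 900.0]])

W = V + V_tilde                                   # Var(T_hat - T_tilde)
A = V @ np.linalg.inv(W)                          # weight given to T_tilde
I = np.eye(2)

# Totals on which the sample would then be calibrated.
T_star = A @ T_tilde + (I - A) @ T_hat

# Variance of the combination (independent sources assumed).
var_star = A @ V_tilde @ A.T + (I - A) @ V @ (I - A).T

print("A =\n", A)
print("T* =", T_star)
print("Var(T*) =\n", var_star)
```

The less precise the sample estimate of a total (large V relative to the other source's variance), the closer $T_X^*$ moves to the external figure.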
V. Calibration for nonresponse adjustment
V.1 Direct (conventional) calibration
A "usual" method at Insee : to adjust a sample by a calibration
technique without a preliminary adjustment for nonresponse.
Dupont F. (1993). Calage et redressement de la non-réponse totale.
Journées de Méthodologie Statistique, jms.insee.fr
Method c : direct calibration on the set r of respondents
$$\sum_{k \in r} d_k \, F(x_k' \lambda) \, x_k = T_X \quad \text{or} \quad \sum_{k \in r} \alpha \, d_k \, F(x_k' \lambda) \, x_k = T_X, \qquad \alpha = \frac{\sum_{k \in s} d_k}{\sum_{k \in r} d_k}$$
Method d : calibration after adjustment for nonresponse
(a) adjustment for nonresponse
$p_k$ = response probabilities (conditionally on s), estimated with a
response model and an estimation technique (ML) → $\hat p_k$
(b) calibration using the adjusted design weights
$$\sum_{k \in r} \frac{d_k}{\hat p_k} \, F(x_k' \mu) \, x_k = T_X$$
Comparison between c ("direct") and d ("traditional")
We suppose :
• adjustment for nonresponse using a GLM $p_k = \frac{1}{H(z_k' \beta)}$, where H is one of the usual
calibration functions
($z_k$ : nonresponse explanatory variables, known for k ∈ s)
• $z_k$ variables included in the calibration variables $x_k$
Then : c and d are "similar" (c can be seen as an implicit estimation
of a response model)
Remark : c and d are identical if F = H = exp and $x_k = z_k$,
or if there is only one categorical variable X = Z (HRG method, i.e. homogeneous response groups, +
poststratification)
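The last case of the remark (a single categorical variable X = Z) can be verified on a toy example. In the Python sketch below (NumPy), method c poststratifies the respondents directly, while method d first divides the weights by a response rate estimated within each category and then poststratifies. The data are invented, and using the weighted response rate as $\hat p_k$ is an assumption of the sketch.

```python
import numpy as np

# Invented sample: category of X = Z, sampling weight, response indicator.
group = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
d = np.array([10.0, 10.0, 15.0, 15.0, 20.0, 20.0, 20.0, 25.0, 25.0, 30.0])
responded = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1], dtype=bool)
N_h = np.array([60.0, 150.0])          # known population counts per category

g_r, d_r = group[responded], d[responded]

# Method c: direct calibration (here, poststratification) of the respondents.
sum_r = np.bincount(g_r, weights=d_r)
w_c = d_r * (N_h / sum_r)[g_r]

# Method d, step (a): response rate estimated within each category.
sum_s = np.bincount(group, weights=d)
p_hat = sum_r / sum_s
d_adj = d_r / p_hat[g_r]
# Method d, step (b): poststratification of the adjusted weights.
w_d = d_adj * (N_h / np.bincount(g_r, weights=d_adj))[g_r]

print("method c:", w_c)
print("method d:", w_d)
```

The within-category response rate cancels in step (b), which is why the two sets of weights coincide here.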
V.2 Generalized calibration
Deville J.-C. (1998). La correction de la non-réponse par calage ou par
échantillonnage équilibré. Article présenté au congrès de l'AFCAS,
Sherbrooke, Québec.
A new approach for calibration, an alternative to the "distance
minimization" approach, more general … and useful for adjustment for
nonresponse.
Calibration functions :
$F(z_k' \lambda)$ such that $F(0) = 1$
where $z_k$ : vector of p variables known for k ∈ s
λ : vector of p adjustment parameters
Calibration equations :
$$\sum_{s} d_k \, F(z_k' \lambda) \, x_k = T_X$$
Solution in λ → $w_k = d_k F(z_k' \lambda)$
$\hat T_{Y,cal} = \sum_{k \in s} w_k y_k$ asymptotically equivalent to
$$\hat T_{Y,GREGi} = \hat T_{Y,HT} + (T_X - \hat T_{X,HT})' \, \hat B_{szx} \quad \text{(obtained with } F(z_k' \lambda) = 1 + z_k' \lambda \text{)}$$
where
$$\hat B_{szx} = \Big( \sum_{k \in s} d_k z_k x_k' \Big)^{-1} \Big( \sum_{k \in s} d_k z_k y_k \Big)$$
$\hat B_{szx}$ = coefficients of the regression of $y_k$ on the $x_k$ using the
instrumental variables $z_k$, weighted by the $d_k$.
See also :
Estevao V. & Särndal C.-E. (2000). A functional approach to calibration.
Journal of Official Statistics, 16, 379-399.
Kott P. (2006). Using calibration weighting to adjust for nonresponse
and coverage errors. Survey Methodology, 32, 133-142.
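For the linear case $F(z_k'\lambda) = 1 + z_k'\lambda$, generalized calibration has a closed form that can be checked numerically. The Python sketch below (NumPy) computes the weights and compares $\sum w_k y_k$ with the instrumental-variable regression form given in V.2. The function name `instrument_weights` and all data values are assumptions made for the illustration.

```python
import numpy as np

def instrument_weights(d, X, Z, T_X):
    """Weights w_k = d_k (1 + z_k' lam), with lam solving
    (sum d_k x_k z_k') lam = T_X - T_X_HT."""
    lam = np.linalg.solve((X * d[:, None]).T @ Z, T_X - X.T @ d)
    return d * (1.0 + Z @ lam)

# Invented sample: calibration variables x (constant + one variable),
# instruments z (constant + one variable), variable of interest y.
d = np.array([10.0, 10.0, 20.0, 20.0, 20.0, 20.0])
X = np.column_stack([np.ones(6), [2.0, 3.0, 1.0, 4.0, 5.0, 2.0]])
Z = np.column_stack([np.ones(6), [1.0, 3.0, 2.0, 4.0, 6.0, 2.0]])
y = np.array([3.0, 5.0, 2.0, 8.0, 9.0, 4.0])
T_X = np.array([104.0, 300.0])

w = instrument_weights(d, X, Z, T_X)

# Instrumental-variable regression coefficients B_szx and the GREG-type form.
B_szx = np.linalg.solve((Z * d[:, None]).T @ X, Z.T @ (d * y))
t_greg_i = d @ y + (T_X - X.T @ d) @ B_szx

print("calibrated totals:", X.T @ w)
print("weighted form:", w @ y, " regression form:", t_greg_i)
```

With Z = X the function reduces to the ordinary linear method with $q_k = 1$.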
V.3 Direct generalized calibration
$$(E) \qquad \sum_{k \in r} d_k \, H(z_k' \beta) \, x_k = T_X$$
where H is a calibration function
(E) can be interpreted as a generalized calibration equation after
nonresponse adjustment
Response model : $p_k = \frac{1}{H(z_k' \beta_0)}$
$$T_X = \sum_{r} d_k \, H(z_k' \hat\beta) \, x_k = \sum_{r} d_k \, H(z_k' \beta_0) \, \frac{H[z_k'(\beta_0 + \lambda)]}{H(z_k' \beta_0)} \, x_k = \sum_{r} d_k \, \frac{1}{p_k} \, F(z_k' \lambda) \, x_k$$
where $\hat\beta = \beta_0 + \lambda$
F calibration function defined by : $F(z_k' \lambda) = \frac{H[z_k'(\beta_0 + \lambda)]}{H(z_k' \beta_0)}$ ( F(0) = 1 )
instruments : $\operatorname{grad} F(z_k' \lambda) \big|_{\lambda = 0} = \frac{H'(z_k' \beta_0)}{H(z_k' \beta_0)} \, z_k = z_k^*$
Properties of the method
• performs a nonresponse adjustment even when the variables that
cause the non-response are known only for respondents
• handles situations where the non-response factors are variables of
interest ("non ignorable" response mechanism)
• results in a lower nonresponse bias (thanks to the zk), and a smaller
variance (thanks to the xk).
This method is implemented in Calmar2 (by J. Le Guennec).
Sautory O. (2003). Calmar2 : a new version of the Calmar calibration
adjustment program. Proceedings of the Statistics Canada Symposium
2003.
See also :
Kott P. & Chang T. (2008). Can calibration be used to adjust for
"nonignorable" nonresponse ? Joint Statistical Meetings, 2008, Denver.
Application 1
Le Guennec J. (2002). Application du calage généralisé à la correction de
la non-réponse : expérimentation. Journées de Méthodologie Statistique,
jms.insee.fr
Study based on simulations, using data from a survey of living
conditions carried out in 1996 ; sample drawn from the 1990 census data
file.
Calibration variables X (also nonresponse factors…) : size of household
(1 person/2 and +), occupation of the head of household (working/non
working), residence (Paris/provinces), nationality of the head of
household (French/foreigner) in 1990 (sampling frame).
Instrumental variables Z : same variables measured in 1996 (survey)
Results : large reduction of the nonresponse bias and reduction of the MSE,
all the larger as the variable of interest is more strongly correlated with the
nonresponse factors
Application 2
Bardaji J. & Le Guennec J. (2002). Mise en œuvre du calage généralisé
dans une enquête de la DARES. Journées de Méthodologie Statistique,
jms.insee.fr
Survey to evaluate the effectiveness of a plan "contrat emploi consolidé"
(CEC) (continuation of the plan called CES).
One year fixed-term contract, renewable by amendment for one year,
salary partially borne by the state, maximum length of time : 5 years.
Early completion : non-renewal or breach, on the employer's or the employee's
initiative.
Variable of interest : situation (working/unemployed/non working) at the end of the plan
Hypothesis : working persons respond less (not yet concerned),
unemployed persons respond less (plan = failure) : non ignorable
response mechanism
Nonresponse explanatory factors in the sampling frame : gender, age,
weekly work-time, type of contract, etc.
and a variable rupture : 5 years in CEC, CEC broken in the course of the
year, amendment non renewed.
Traditional calibration using these X variables
→ % working increases, % non working decreases, % unemployed
stable
Instrumental variable fin : left the employer willingly, lay-off or non
renewal by the employer, other.
Generalized calibration with the X variables ; the Z variables are the
variable fin and the X variables, except the variable rupture.
→ % working stable, % non working decreases, % unemployed increases
Thank you for your attention !
