calibration - Université de Neuchâtel
Colloque sur les méthodes de sondage en l'honneur de Jean-Claude Deville
Neuchâtel, 24-26 June 2009

A twenty-year story of calibration at Insee … and elsewhere
Olivier Sautory (Cepe-Insee)
[email protected]

Contents
A story of calibration at the French NSI Insee, based upon:
• papers by Jean-Claude Deville…
• … and papers by his colleagues
• software developed for calibration…
• … and its use at Insee
and a survey of a few articles on calibration, but (of course!) much less detailed than:
Särndal, C.-E. (2007). The calibration approach in survey theory and practice, Survey Methodology, vol. 33, n°2, 99-119 (93 references…)

I. Prehistory of calibration

I.1 The REDRE software
Lemel Y. (1976). Une généralisation de la méthode du quotient pour le redressement des enquêtes par sondage, Annales de l'Insee, n°22-23, 273-281.

Survey on health: sample $s$ of $n$ households ($k$), plus the sample of all the persons living in the selected households.

Objective: to find weights $w_k$ such that:
• for the categorical variable X (size of household) measured on the households, with categories $X_1 \dots X_j \dots X_J$:
$$\sum_{k \in s} w_k \, \mathbf{1}_{X(k) = X_j} = M_j = \#\,\text{households in } U \text{ in category } X_j \qquad (j = 1 \dots J)$$
• idem for categorical variables X', X'', etc.
… and such that:
• for the categorical variable Z (sex × age group) measured on the persons, with categories $Z_1 \dots Z_i \dots Z_I$:
$$\sum_{k \in s} w_k \, z_i(k) = N_i = \#\,\text{persons in } U \text{ in category } Z_i \qquad (i = 1 \dots I)$$
where $z_i(k)$ = number of members of household $k$ in category $Z_i$
• idem for categorical variables Z', Z'', etc.
Solution: $w_k$ = weight for household $k$ and for each member of the household
→ Problem of integrated weighting (household / person), not a problem of estimation

Calibration equations:
$$\text{(E)} \qquad {}^{t}Z \, w = v$$
with ${}^{t}Z$ of dimension $(c, n)$, $w$ of dimension $(n, 1)$ and $v$ of dimension $(c, 1)$, where:
• $c$ = number of independent constraints
• $Z$ = matrix $(n, c)$ containing the $\mathbf{1}_{X(k) = X_j}$ and the $z_i(k)$
• $w$ = vector containing the weights $w_k$
• $v$ = vector containing the known population marginal counts $M_j$ and $N_i$

There is an infinity of solutions in $w$ (if $c < n$). We seek to minimize the variance of the weights:
$$\min_{w_k} \sum_{k \in s} w_k^2 \quad \text{subject to (E)}$$
$$\rightarrow \quad w = Z \, ({}^{t}Z \, Z)^{-1} \, v$$
i.e. the weights of the GREG estimator (if the sampling weights are equal).

The software REDRE computes weights $w_k$ satisfying calibration equations for categorical and numerical variables.

I.2 Generalized regression estimator (GREG)
Särndal C.-E. (1982). Implications of survey design for generalized regression estimation of linear functions. Journal of Statistical Planning and Inference, 7, 155-170.

Population $U$, sample $s$, inclusion probabilities $\pi_k$
Objective: to estimate a total $T_Y = \sum_{k \in U} y_k$
Horvitz-Thompson estimator:
$$\hat{T}_{Y,HT} = \sum_{k \in s} \frac{1}{\pi_k} \, y_k = \sum_{k \in s} d_k \, y_k$$
Auxiliary variables: $x_k' = (x_{1k} \dots x_{jk} \dots x_{Jk})$, $k \in s$
Known population totals: $T_X = (T_{X_1} \dots T_{X_j} \dots T_{X_J})'$

By a linear assisting model (with "weightings" or "scale factors" $q_k$):
$$\hat{T}_{Y,GREG} = \hat{T}_{Y,HT} + (T_X - \hat{T}_{X,HT})' \, \hat{B}_s, \qquad \hat{B}_s = \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1} \Big( \sum_{k \in s} d_k q_k x_k y_k \Big)$$
We can write:
$$\hat{T}_{Y,GREG} = \sum_{k \in s} w_k y_k = \sum_{k \in s} (1 + q_k \lambda' x_k) \, d_k y_k \quad \text{where} \quad \lambda' = (T_X - \hat{T}_{X,HT})' \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1}$$
The weights $w_k$ are calibrated to the population totals $T_X$:
$$\sum_{k \in s} w_k x_k = T_X$$
See also:
Fuller W. (2002). Regression estimation for survey samples, Survey Methodology, 28, 5-23.

I.3 The 1st paper by J.-C. Deville about calibration
Deville J.-C. (1988). Estimation linéaire et redressement sur information auxiliaire d'enquête par sondage, Essais en l'honneur d'Edmond Malinvaud, 915-927.
• to determine new weights "close to" the sampling weights (so that the linear estimators using these weights are "almost" unbiased) and which satisfy calibration equations on known population auxiliary totals
• minimization of a quadratic distance between weights → generalization of the procedure by Lemel (any sampling weights + individual weights unrelated to the sampling weights)
• links with the GREG estimator
• application: calibration on the marginal counts of a two-way frequency table

Conclusion of the article (translated from the French)
The search for good-quality weights is a long-standing concern of survey practitioners. From another point of view, theory has for some years paid a great deal of attention to regression or "generalized regression" estimators. Curiously, the link between these two lines of work, simple as it is, does not seem to have been well understood, nor above all well exploited. Yet it makes it possible to understand clearly the efficiency that can be expected from a good weighting of survey data, and to choose the most relevant adjustment variables. Other developments can be expected, in particular to approach the problems related to nonresponse or to incomplete sampling frames.

II. The JASA papers

Article submitted to JASA (November 1989)
Deville J.-C. & Särndal C.-E. Calibration estimators and generalized raking techniques in survey sampling

Published articles
Deville J.-C. & Särndal C.-E. (1992). Calibration estimation in survey sampling, Journal of the American Statistical Association, 87, n°418, 375-382.
Deville J.-C., Särndal C.-E. & Sautory O. (1993). Generalized raking procedures in survey sampling, Journal of the American Statistical Association, 88, n°423, 1013-1020.

II.1. The calibration approach
Population $U$, sample $s$, inclusion probabilities $\pi_k$
Total to estimate: $T_Y = \sum_{k \in U} y_k$
Horvitz-Thompson estimator: $\hat{T}_{Y,HT} = \sum_{k \in s} \frac{1}{\pi_k} \, y_k = \sum_{k \in s} d_k y_k$
Auxiliary variables: $x_k' = (x_{1k} \dots x_{jk} \dots x_{Jk})$, $k \in s$ (numerical variables or category indicators)
Known population totals: $T_X = (T_{X_1} \dots T_{X_j} \dots T_{X_J})'$, with $T_{X_j} = \sum_{k \in U} x_{jk}$

Objective: compute new weights $w_k$ "close to" the sampling weights $d_k$ which satisfy the calibration equations:
$$\forall j = 1 \dots J \qquad \sum_{k \in s} w_k x_{jk} = T_{X_j}$$
→ calibration estimator: $\hat{T}_{Y,cal} = \sum_{k \in s} w_k y_k$

II.2. Derivation of the calibration weights
We consider for each element $k$ a positive distance function $G_k(w, d_k)$, differentiable with respect to $w$, convex, with $G_k(d_k, d_k) = 0$.
We seek the $w_k$ such that:
$$\min_{w_k} \sum_{k \in s} G_k(w_k, d_k) \quad \text{subject to} \quad \sum_{k \in s} w_k x_k = T_X$$
We denote:
$$F_k(u) = \frac{1}{d_k} \, g_k^{-1}(u), \quad \text{where} \quad g_k(w) = \frac{\partial G_k(w, d_k)}{\partial w}$$
Hence: $w_k = d_k F_k(x_k' \lambda)$, where $\lambda$ is the solution of:
$$\sum_{k \in s} d_k F_k(x_k' \lambda) \, x_k = T_X$$
Usually:
$$G_k(w, d_k) = \frac{d_k}{q_k} \, G\Big(\frac{w}{d_k}\Big), \qquad q_k = \text{scale factor} > 0$$
We denote: $g(w / d_k) = q_k \, g_k(w)$ and $F(u) = g^{-1}(u)$.
Hence: $w_k = d_k F(q_k x_k' \lambda)$, where $\lambda$ is the solution of:
$$\sum_{k \in s} d_k F(q_k x_k' \lambda) \, x_k = T_X$$

II.3. Link with the GREG estimator
We choose:
$$G_k(w, d_k) = \frac{1}{2 q_k} \, \frac{(w - d_k)^2}{d_k}, \quad \text{i.e.} \quad G\Big(\frac{w}{d_k}\Big) = \frac{1}{2} \Big(\frac{w}{d_k} - 1\Big)^2$$
Hence: $g(w / d_k) = w / d_k - 1$ and $F(u) = 1 + u$ → linear method
Then:
$$\lambda = \Big( \sum_{k \in s} d_k q_k x_k x_k' \Big)^{-1} (T_X - \hat{T}_{X,HT}), \qquad \hat{T}_{Y,cal} = \hat{T}_{Y,GREG}$$

II.4. Other distance functions
Deville J.-C. & Särndal C.-E. (1992) consider 6 other functions $G$, in particular:
• $G_k(w, d_k) = \frac{1}{q_k} \big( w \log(w / d_k) - w + d_k \big)$ → $F(u) = \exp(u)$: the exponential method, which gives weights that can be obtained by the raking ratio algorithm (Deming and Stephan) in the case where all the auxiliary variables are categorical.
• functions $G$, and consequently functions $F$, that guarantee weight ratios satisfying $L \le w_k / d_k \le U$, where $L$ and $U$ are two constants chosen by the statistician, such that $L < 1 < U$ (bounded methods)
Example:
$$F(u) = \begin{cases} 1 + u & \text{if } L - 1 \le u \le U - 1 \\ L & \text{if } u < L - 1 \\ U & \text{if } u > U - 1 \end{cases}$$

III. The CALMAR software

III.1 The SAS macro Calmar
CALMAR (CALibration on MARgins) = SAS macro program (1990) which offers 4 calibration methods (numerical or categorical auxiliary variables).
Sautory O. (1991). Redressements d'échantillons d'enquêtes auprès des ménages par calage sur marges. Journées de Méthodologie Statistique, jms.insee.fr

The vector $\lambda$ is determined by solving the nonlinear system resulting from the calibration equations:
$$\sum_{k \in s} d_k F(q_k x_k' \lambda) \, x_k = T_X$$
The system is solved by Newton's iterative method (the first iteration always gives the solution of the linear method).
In practice, convergence is achieved in a few iterations. The iterations stop when:
$$\max_{k \in s} \left| \frac{w_k^{(i+1)}}{d_k} - \frac{w_k^{(i)}}{d_k} \right| < \varepsilon$$

III.2 Negative weights
Huang E. & Fuller W. (1978). Nonnegative regression estimation for sample survey data. Proceedings, Social Statistics Section, American Statistical Association, 300-305.
The exponential method and the bounded methods (with $L > 0$) guarantee positive weights.
Park M. & Fuller W. (2005). Towards nonnegative regression weights for survey samples. Survey Methodology, 31, 93-101.

III.3. Extreme weights
The bounded methods avoid extremely large weight ratios and excessively small or negative weight ratios → they are the most used methods at Insee.
But $L$ and $U$ cannot be chosen arbitrarily: to guarantee a solution to the calibration equations, $L$ must not exceed a maximal value $L_{max}$ (< 1), and $U$ cannot be less than a minimal value $U_{min}$ (> 1).
In practice $L_{max}$ and $U_{min}$ are determined by successive trials.
Remark: Calmar does not use a constrained optimization program, since the constraints are taken into account in the definition of the function $F$.
See also:
Théberge A. (2000).
Calibration and restricted weights. Survey Methodology, 26, 113-122.

IV. Extensions of calibration

IV.1 Integrated weighting (households – persons)
Caron N. & Sautory O. (1996). Calage sur des échantillons de ménages, d'individus, d'individus-Kish, issus d'une même enquête. Paper presented at the Journées de Statistique de l'ASU, Québec.

Social surveys where households and persons are interviewed:
• households $m$ (sample $s_M$), weights $d_m = 1 / \pi_m$, auxiliary variables $x_m$, totals $T_X$
• all the members $i$ of household $m$ (sample $s_I$), weights $d_{m,i} = d_m$, auxiliary variables $z_{m,i}$, totals $T_Z$
• one (Kish) individual $k_m$ in each household $m$, selected by simple random sampling among the $e_m$ "eligible" members (sample $s_K$), weight $d_{k_m} = e_m d_m$, auxiliary variables $v_{k_m}$, totals $T_V$

Objectives
• to produce the same weights for all members of a household
• to ensure consistency in the statistics obtained from the various data files

Method: we calculate for each household $m$:
• the totals of the individual variables: $z_m = \sum_{i \in \text{household } m} z_{m,i}$
• the estimated totals of the Kish individual variables: $\hat{v}_m = e_m v_{k_m}$
Vector of calibration variables for household $m$: $(x_m, z_m, \hat{v}_m)$
Vector of totals: $(T_X, T_Z, T_V)$
Calibration equations for the sample of households:
$$\sum_{m \in s_M} d_m F(x_m' \lambda + z_m' \mu + \hat{v}_m' \gamma) \, (x_m, z_m, \hat{v}_m) = (T_X, T_Z, T_V)$$
→ calibration weights: $w_m$, and $w_{m,i} = w_m$, $w_{k_m} = e_m w_m$

• the method can be used with Calmar, with some SAS programming
• the method is automated in Calmar2 (2003)
• it can be used in surveys that involve cluster or two-stage sampling, where there is auxiliary information about the clusters (or PSUs) and the SSUs (for example: establishments / employees)
• in the households / persons case, it is an alternative to the method proposed by:
Lemaître G. & Dufour J. (1987). An integrated method for weighting persons and families. Survey Methodology, 13, 199-207.
See also:
Estevao V. & Särndal C.-E. (2006).
Survey estimates by calibration on complex auxiliary information, International Statistical Review, 74, 127-147.
Isaki C., Tsay J. & Fuller W. (2004). Weighting sample data subject to independent controls. Survey Methodology, 30, 39-49.

IV.2 Calibration in two-phase sampling
Dupont F. (1995). Alternative adjustments when there are several levels of auxiliary information. Survey Methodology, 21, 125-135.

Population $U$
First-phase sample $s_1$, sampling weights $d_{1k}$
Second-phase sample $s_2$, conditional sampling weights $d_{2k}$

Auxiliary information
• auxiliary variables $x_1$ known for $k \in s_1$ (hence for $k \in s_2$), and population total $T_{X_1}$ known
• auxiliary variables $x_2$ known for $k \in s_1$ (hence for $k \in s_2$)

Two main strategies
(1) single step: calibration on $x_1$ from $s_2$ to $U$ and on $x_2$ from $s_2$ to $s_1$
$$\sum_{k \in s_2} d_{1k} d_{2k} F(x_{1k}' \gamma_1 + x_{2k}' \gamma_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat{T}_{X_2}) \quad \text{where} \quad \hat{T}_{X_2} = \sum_{k \in s_1} d_{1k} x_{2k}$$
(2) two steps: calibration on $x_1$ from $s_1$ to $U$ (→ intermediate weights), then calibration on $x_1$ from $s_2$ to $U$ and on $x_2$ from $s_2$ to $s_1$
$$\sum_{k \in s_1} d_{1k} F(x_{1k}' \lambda_1) \, x_{1k} = T_{X_1} \quad \rightarrow \quad \text{weights } w_{1k}$$
$$\sum_{k \in s_2} d_{1k} d_{2k} F(x_{1k}' \mu_1 + x_{2k}' \mu_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat{T}^{*}_{X_2}) \quad \text{where} \quad \hat{T}^{*}_{X_2} = \sum_{k \in s_1} w_{1k} x_{2k}$$
or:
$$\sum_{k \in s_2} w_{1k} d_{2k} F(x_{1k}' \mu_1 + x_{2k}' \mu_2) \, (x_{1k}, x_{2k}) = (T_{X_1}, \hat{T}^{*}_{X_2}) \quad \text{where} \quad \hat{T}^{*}_{X_2} = \sum_{k \in s_1} w_{1k} x_{2k}$$

F. Dupont compares these procedures and examines the links with a GREG approach.
See also:
Hidiroglou M. & Särndal C.-E. (1998). Use of auxiliary information for two-phase sampling. Survey Methodology, 24, 11-20.
Estevao V. & Särndal C.-E. (2002). The ten cases of auxiliary information for calibration in two-phase sampling. Journal of Official Statistics, 18, 233-255.

Simulations show that the two-step procedure is not always better than the single-step procedure (case where $x_1$ and $x_2$ are weakly correlated, and $y$ is highly correlated with $x_2$).
Estevao V. & Särndal C.-E. (2006). Survey estimates by calibration on complex auxiliary information, International Statistical Review, 74, 127-147.
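
The two strategies of IV.2 can be sketched numerically. The Python/NumPy fragment below is only an illustration under stated assumptions: it uses the linear method $F(u) = 1 + u$ with $q_k = 1$, the function name `linear_calibration` is hypothetical, and all data, sample sizes and population totals are invented. For the two-step strategy it uses the variant that starts from the weights $w_{1k} d_{2k}$.

```python
import numpy as np

def linear_calibration(d, X, totals):
    """Linear method F(u) = 1 + u: return w_k = d_k (1 + x_k' lam)."""
    lam = np.linalg.solve(X.T @ (X * d[:, None]), totals - X.T @ d)
    return d * (1.0 + X @ lam)

# Hypothetical data: every value below is invented for illustration.
rng = np.random.default_rng(1)
n1 = 400
d1 = np.full(n1, 25.0)                                   # first-phase weights d_1k
x1 = np.column_stack([np.ones(n1), rng.uniform(0, 10, n1)])
x2 = rng.uniform(0, 5, (n1, 1))                          # observed on s1, no known total
T_x1 = np.array([10_000.0, 52_000.0])                    # assumed known totals of x1

s2 = rng.random(n1) < 0.5                                # second-phase sample (mask on s1)
d2 = np.full(n1, 2.0)                                    # conditional weights d_2k
X_s2 = np.hstack([x1[s2], x2[s2]])

# (1) single step: start from d_1k d_2k, target (T_x1, sum over s1 of d_1k x_2k)
T_x2_hat = x2.T @ d1
w_single = linear_calibration(d1[s2] * d2[s2], X_s2,
                              np.concatenate([T_x1, T_x2_hat]))

# (2) two steps: calibrate s1 on T_x1, then s2 on (T_x1, sum over s1 of w_1k x_2k)
w1 = linear_calibration(d1, x1, T_x1)
T_x2_star = x2.T @ w1
w_two = linear_calibration(w1[s2] * d2[s2], X_s2,
                           np.concatenate([T_x1, T_x2_star]))
```

In both cases the final weights on $s_2$ reproduce the known totals of $x_1$ exactly; they differ in the estimated total of $x_2$ that they reproduce and in the starting weights.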
IV.3 Calibration on more complex parameters

Calibration on a ratio
$$R = \frac{\sum_{k \in U} x_{1k}}{\sum_{k \in U} x_{2k}} = \frac{T_{X_1}}{T_{X_2}} \quad \text{known}$$
We define the "auxiliary variable" $z_k = x_{1k} - R \, x_{2k}$; the "known population total" used in the calibration is $T_Z = 0$.

Calibration on a distribution function
Ren R. (2002). Estimation de la fonction de répartition et des fractiles d'une population finie. Journées de Méthodologie Statistique, jms.insee.fr
See also:
Harms T. & Duchesne P. (2006). On calibration estimation for quantiles. Survey Methodology, 32, 37-52.

IV.4 Calibration on uncertain auxiliary information
Objective: how to use uncertain auxiliary information, for example when the information comes from another survey sample (independent of the sample to be calibrated)?
Deville J.-C. (1999). Calage simultané de plusieurs enquêtes. Paper presented at the Statistics Canada Symposium 1999.
Deville J.-C. & Sautory O. (2008). Calage d'une enquête sur une information auxiliaire incertaine. Paper presented at the 40èmes Journées de Statistique de la Société Française de Statistique, Ottawa.
We have two unbiased estimates of a vector of totals $T_X$ of $J$ "auxiliary variables":
• from the sample $s$: $\hat{T}_{X,HT}$, with variance $V$
• from another source: $\tilde{T}_X$, with variance $\tilde{V}$

We seek an unbiased linear estimator of the following ("GREG") type:
$$\hat{T}_Y = \hat{T}_{Y,HT} + (\tilde{T}_X - \hat{T}_{X,HT})' B \quad \text{with } \mathrm{Var}(\hat{T}_Y) \text{ minimal}$$
We obtain $B_{opt} = W^{-1} \Gamma$, where:
$W = \mathrm{Var}(\hat{T}_{X,HT} - \tilde{T}_X) = V + \tilde{V}$, estimated by $\hat{W} = \hat{V} + \hat{\tilde{V}}$
$\Gamma = \mathrm{Cov}(\hat{T}_{X,HT}, \hat{T}_{Y,HT})$, estimated by $\hat{\Gamma}$
→ estimator
$$\hat{T}_Y \approx \hat{T}_{Y0} = \hat{T}_{Y,HT} + (\tilde{T}_X - \hat{T}_{X,HT})' \, \hat{W}^{-1} \hat{\Gamma}$$

Computation of $\hat{T}_{Y0}$
We seek the best unbiased linear estimator of $T_X$, denoted $T_X^{*}$, as a linear combination of $\hat{T}_{X,HT}$ and $\tilde{T}_X$, where $A$ is a square matrix of size $J$:
$$T_X^{*} = A \, \tilde{T}_X + (I - A) \, \hat{T}_{X,HT} \quad \text{with } \mathrm{Var}(T_X^{*}) \text{ minimal}$$
We obtain:
$$A = V W^{-1} \quad \text{and} \quad I - A = \tilde{V} W^{-1}$$
and it can be shown that if we calibrate the sample on the totals $T_X^{*}$, then the calibrated estimator is $\hat{T}_{Y,cal} = \hat{T}_{Y0}$
→ we perform the calibration on $\hat{A} \, \tilde{T}_X + (I - \hat{A}) \, \hat{T}_{X,HT}$, where $\hat{A} = \hat{V} \hat{W}^{-1}$

V. Calibration for nonresponse adjustment

V.1 Direct (conventional) calibration
A "usual" method at Insee: to adjust a sample by a calibration technique without a preliminary adjustment for nonresponse.
Dupont F. (1993). Calage et redressement de la non-réponse totale. Journées de Méthodologie Statistique, jms.insee.fr

Method 1: direct calibration on the sample $r$ of respondents
$$\sum_{k \in r} d_k F(x_k' \lambda) \, x_k = T_X \quad \text{or} \quad \sum_{k \in r} \alpha \, d_k F(x_k' \lambda) \, x_k = T_X, \qquad \alpha = \frac{\sum_{k \in s} d_k}{\sum_{k \in r} d_k}$$
Method 2: calibration after adjustment for nonresponse
(a) adjustment for nonresponse: $p_k$ = response probabilities (conditional on $s$), estimated with a response model and an estimation technique (ML) → $\hat{p}_k$
(b) calibration using the adjusted design weights:
$$\sum_{k \in r} \frac{d_k}{\hat{p}_k} F(x_k' \mu) \, x_k = T_X$$

Comparison between method 1 ("direct") and method 2 ("traditional")
We suppose:
• adjustment for nonresponse using a GLM, $p_k = 1 / H(z_k' \beta)$, where $H$ is one of the usual calibration functions ($z_k$ = nonresponse explanatory variables, known for $k \in s$)
• the $z_k$ variables are included in the calibration variables $x_k$
Then: methods 1 and 2 are "similar" (method 1 can be seen as an implicit estimation of a response model)
Remark: methods 1 and 2 are identical if $F = H = \exp$ and $x_k = z_k$, or if there is only one categorical variable $X = Z$ (homogeneous response groups (HRG) method + poststratification)

V.2 Generalized calibration
Deville J.-C. (1998). La correction de la non-réponse par calage ou par échantillonnage équilibré. Paper presented at the AFCAS congress, Sherbrooke, Québec.

A new approach to calibration, an alternative to the "distance minimization" approach, more general … and useful for nonresponse adjustment.
Calibration functions: $F(z_k' \lambda)$ such that $F(0) = 1$, where
$z_k$: vector of $p$ variables known for $k \in s$
$\lambda$: vector of $p$ adjustment parameters
Calibration equations:
$$\sum_{k \in s} d_k F(z_k' \lambda) \, x_k = T_X$$
Solution in $\lambda$ → $w_k = d_k F(z_k' \lambda)$

$\hat{T}_{Y,cal} = \sum_{k \in s} w_k y_k$ is asymptotically equivalent to
$$\hat{T}_{Y,GREGi} = \hat{T}_{Y,HT} + (T_X - \hat{T}_{X,HT})' \, \hat{B}_{szx}$$
(obtained with $F(z_k' \lambda) = 1 + z_k' \lambda$), where
$$\hat{B}_{szx} = \Big( \sum_{k \in s} d_k z_k x_k' \Big)^{-1} \Big( \sum_{k \in s} d_k z_k y_k \Big)$$
$\hat{B}_{szx}$ = coefficients of the regression of $y_k$ on the $x_k$ using the instrumental variables $z_k$, weighted by the $d_k$.

See also:
Estevao V. & Särndal C.-E. (2000). A functional approach to calibration. Journal of Official Statistics, 16, 379-399.
Kott P. (2006). Using calibration weighting to adjust for nonresponse and coverage errors. Survey Methodology, 32, 133-142.
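
As a rough illustration of the generalized calibration equations of V.2, the Python/NumPy sketch below solves $\sum_{k \in s} d_k F(z_k' \lambda) x_k = T_X$ by Newton's method started at $\lambda = 0$. It is not the Calmar2 implementation: the function name `generalized_calibration`, the relative stopping rule on the calibration residual, and all data are assumptions made for the example, and it requires as many instruments as calibration variables ($p = J$).

```python
import numpy as np

def generalized_calibration(d, X, Z, T_x, F=np.exp, dF=np.exp,
                            tol=1e-10, max_iter=50):
    """Solve sum_k d_k F(z_k' lam) x_k = T_x for lam (Newton, lam0 = 0)."""
    lam = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        u = Z @ lam
        residual = X.T @ (d * F(u)) - T_x
        if np.max(np.abs(residual)) < tol * max(1.0, np.max(np.abs(T_x))):
            break
        # Jacobian of the calibration equations: sum_k d_k F'(u_k) x_k z_k'
        jac = X.T @ (Z * (d * dF(u))[:, None])
        lam = lam - np.linalg.solve(jac, residual)
    return d * F(Z @ lam), lam

# Hypothetical data: invented values, instruments z close to x.
rng = np.random.default_rng(0)
n = 200
d = np.full(n, 5.0)
x_aux = rng.uniform(1.0, 3.0, n)
z_aux = x_aux + rng.normal(0.0, 0.2, n)
X = np.column_stack([np.ones(n), x_aux])
Z = np.column_stack([np.ones(n), z_aux])
# Totals built from a known parameter so that a solution is sure to exist.
T_x = X.T @ (d * np.exp(Z @ np.array([0.05, -0.03])))

w_exp, lam_exp = generalized_calibration(d, X, Z, T_x)           # F = exp
w_lin, lam_lin = generalized_calibration(d, X, Z, T_x,           # F(u) = 1 + u
                                         F=lambda u: 1.0 + u,
                                         dF=np.ones_like)
```

With $F(u) = 1 + u$ the first Newton step already solves the equations, which gives the weights of the instrumental-variable GREG estimator; with $F = \exp$ the weights stay positive by construction.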
V.3 Direct generalized calibration
$$\text{(E)} \qquad \sum_{k \in r} d_k H(z_k' \beta) \, x_k = T_X \qquad \text{where } H \text{ is a calibration function}$$
(E) can be interpreted as a generalized calibration equation after nonresponse adjustment.
Response model: $p_k = 1 / H(z_k' \beta_0)$
$$T_X = \sum_{k \in r} d_k H(z_k' \hat{\beta}) \, x_k = \sum_{k \in r} d_k H(z_k' \beta_0) \, \frac{H[z_k' (\beta_0 + \lambda)]}{H(z_k' \beta_0)} \, x_k = \sum_{k \in r} \frac{d_k}{p_k} \, F(z_k' \lambda) \, x_k \qquad \text{where } \hat{\beta} = \beta_0 + \lambda$$
$F$ is the calibration function defined by:
$$F(z_k' \lambda) = \frac{H[z_k' (\beta_0 + \lambda)]}{H(z_k' \beta_0)} \qquad (F(0) = 1)$$
Instruments:
$$\mathrm{grad} \, F(z_k' \lambda) \big|_{\lambda = 0} = \frac{H'(z_k' \beta_0)}{H(z_k' \beta_0)} \, z_k = z_k^{*}$$

Properties of the method
• performs a nonresponse adjustment even when the variables that cause the nonresponse are known only for respondents
• handles situations where the nonresponse factors are variables of interest ("non-ignorable" response mechanism)
• results in a lower nonresponse bias (thanks to the $z_k$) and a smaller variance (thanks to the $x_k$)

This method is implemented in Calmar2 (by J. Le Guennec).
Sautory O. (2003). Calmar2: a new version of the Calmar calibration adjustment program. Proceedings of the Statistics Canada Symposium 2003.
See also:
Kott P. & Chang T. (2008). Can calibration be used to adjust for "nonignorable" nonresponse? Joint Statistical Meetings, 2008, Denver.

Application 1
Le Guennec J. (2002). Application du calage généralisé à la correction de la non-réponse : expérimentation. Journées de Méthodologie Statistique, jms.insee.fr
Study based on simulations, using data from a survey of living conditions carried out in 1996; sample drawn from the 1990 census data file.
Calibration variables X (also nonresponse factors…): size of household (1 person / 2 or more), occupation of the head of household (working / not working), residence (Paris / provinces), nationality of the head of household (French / foreign), all as of 1990 (sampling frame).
Instrumental variables Z: the same variables measured in 1996 (survey)
Results: large reduction of the nonresponse bias and decrease of the MSE, all the larger as the variable of interest is more strongly correlated with the nonresponse factors.

Application 2
Bardaji J. & Le Guennec J. (2002). Mise en œuvre du calage généralisé dans une enquête de la DARES. Journées de Méthodologie Statistique, jms.insee.fr
Survey to evaluate the effectiveness of the "contrat emploi consolidé" (CEC) scheme (continuation of the scheme called CES).
One-year fixed-term contract, renewable by amendment for one year, salary partially borne by the state, maximum duration: 5 years.
Early termination: non-renewal or breach, on the employer's or the employee's initiative.
Variable of interest: situation (working / unemployed / not working) at the end of the scheme
Hypothesis: working persons respond less (not yet concerned), unemployed persons respond less (scheme = failure): non-ignorable response mechanism

Nonresponse explanatory factors in the sampling frame: gender, age, weekly working time, type of contract, etc., and a variable rupture: 5 years in CEC, CEC broken in the course of the year, amendment not renewed.
Traditional calibration using these X variables
→ % working increases, % not working decreases, % unemployed stable
Instrumental variable fin: left the employer willingly, lay-off or non-renewal by the employer, other.
Generalized calibration with the X variables; the Z variables are the variable fin and the X variables, except the variable rupture.
→ % working stable, % not working decreases, % unemployed increases

Thank you for your attention!