What's happening if there is "B" in the general linear modeling output?
Paper 75006

Shilong Kuang, Kelley Blue Book Inc., Irvine, CA

ABSTRACT

The SAS® GLM procedure is an efficient tool for statistical data analysis, especially when the predictors include categorical variables (also called class variables). In SAS practice, the letter "B" sometimes shows up next to the parameter estimates in the output, and you may wonder what is happening in the model. What causes it? Can we still trust the model? Can we verify the modeling output with an alternative procedure? What should we watch for in similar situations in the future? In this paper, we answer these questions with intuitive examples, accompanied by the corresponding statistical theory. With these in mind, you can turn general linear modeling into a more powerful tool. SAS® ROCKS!

Keywords: General linear model, estimable function, singular matrix, generalized inverse.

INTRODUCTION

The SAS® GLM procedure analyzes data within the framework of general linear models. When using the GLM procedure, we may see a “weird” message in the output like the following:

Note: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

You may wonder what is happening in the model. Is there anything wrong with it? To what extent can we trust the modeling output? We begin our demonstration with the following simple example.

EXAMPLE TO START

Consider the following one-way classified data, with one categorical variable “group” that has 3 levels (A, B, C).
SAS Code 1

    data example_1;
      input group $ y @@;
      cards;
    A 6.4 A 6.2 B 8.4 B 8.7 C 3.4 C 3.1
    ;
    run;

    /* One-way model; SOLUTION prints the parameter estimates, */
    /* E prints the general form of the estimable functions.   */
    proc glm data=example_1;
      class group;
      model y=group / solution e;
    run;

SAS Output 1

    Parameter    Estimate    Standard Error   t Value   Pr > |t|
    Intercept    3.25  B     0.1354006        24        0.0002
    group A      3.05  B     0.1914854        15.93     0.0005
    group B      5.3   B     0.1914854        27.68     0.0001
    group C      0     B     .                .         .

How do we interpret the letter “B” in the output? Write the model in the form Y = Xβ + ε, with design matrix X and parameter vector β:

\[
X = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{pmatrix},
\qquad
\beta = \begin{pmatrix} u \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{pmatrix}
\]

X has only 3 linearly independent rows, while there are 4 parameters in (u, τ1, τ2, τ3). In other words, the system of linear equations

\[
\begin{cases}
y_1 = u + \tau_1 \\
y_2 = u + \tau_2 \\
y_3 = u + \tau_3
\end{cases}
\]

has only 3 independent equations in 4 unknowns, so there is no unique solution. In fact, there are infinitely many solutions. The design matrix X is not of full column rank: it has 4 columns, but its rank (the common dimension of its row space and its column space) is only 3. This explains the note: “the X'X matrix has been found to be singular”. The singularity indicates that the model as defined has too many parameters: it is over-parameterized!

To overcome this “embarrassing” situation, the GLM parameterization by default takes the last level of the categorical variable in alphabetical order, here “C” of “group”, as the reference level and sets its effect to zero. This restriction explains why we see the estimate 0 for the parameter “group C” in the output above.
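To make the "infinitely many solutions" concrete, here is a small worked check. It uses only the three group means implied by the data (6.30, 8.55 and 3.25); the constant c and the alternative restriction τ1 = 0 are illustrative choices, not something PROC GLM prints.

```latex
% Every solution of the normal equations must reproduce the group means:
%   u + tau_1 = 6.30,  u + tau_2 = 8.55,  u + tau_3 = 3.25.
\[
(u,\ \tau_1,\ \tau_2,\ \tau_3) = (3.25 - c,\ 3.05 + c,\ 5.30 + c,\ c),
\qquad c \in \mathbb{R}
\]
% c = 0 is the GLM default (level C as reference):
\[
c = 0:\quad (3.25,\ 3.05,\ 5.30,\ 0)
\]
% c = -3.05 corresponds to taking level A as reference (tau_1 = 0):
\[
c = -3.05:\quad (6.30,\ 0,\ 2.25,\ -3.05)
\]
% Both give the same fitted means and the same contrast:
\[
u + \tau_1 = 6.30,\quad u + \tau_2 = 8.55,\quad u + \tau_3 = 3.25,\quad
\tau_1 - \tau_2 = -2.25
\]
```

The individual values of u and the τ's change with c, but the group means and the differences between levels do not. This is why the flagged estimates depend on the chosen restriction while the fitted model itself can still be trusted.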
ALTERNATIVE WAY TO DOUBLE CHECK THE OUTPUT

We can use PROC REG as an alternative procedure to verify the output, assigning a dummy variable to each level of the categorical variable except the last one, as follows:

SAS Code 2

    /* Dummy coding with level C as the reference (d1=0, d2=0) */
    data example_1;
      set example_1;
      if group='C' then do; d1=0; d2=0; end;
      else if group='A' then do; d1=1; d2=0; end;
      else if group='B' then do; d1=0; d2=1; end;
    run;

    proc reg data=example_1;
      model y=d1 d2;
    run;

The parameter estimates are the same as in the previous GLM output:

SAS Output 2

    Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
    Intercept   1    3.25                 0.1354           24        0.0002
    d1          1    3.05                 0.19149          15.93     0.0005
    d2          1    5.3                  0.19149          27.68     0.0001

The modeling equation gives Y(group='C') = intercept = 3.25, which agrees with the previous GLM output: Y(group='C') = intercept + “group C” parameter estimate = 3.25 + 0 = 3.25.

MORE STATISTICS EXPLANATION

Since the design matrix X in the previous example is not of full rank, X'X has no unique inverse. The absence of a unique solution indicates that at least some parameters in the model cannot be estimated; they are said to be non-estimable!

DEFINITION OF ESTIMABLE FUNCTION

In a general linear model, a linear combination λ'β of parameters is said to be estimable if there exists a linear function a'Y that is unbiased for λ'β, that is, E(a'Y) = λ'β. If no such function exists, the linear combination λ'β is defined as a non-estimable function.

LEMMA 1

λ'β is estimable if and only if λ' = a'X for some a'; that is, λ' is in the row space of X.

Proof:

Part-1, we prove the sufficiency: given λ' = a'X for some a', then

\[
E(a'Y) = a'E(Y) = a'X\beta = \lambda'\beta ,
\]

so a'Y is unbiased for λ'β.

Part-2, for the necessity: given that λ'β is estimable, by the definition there exists some linear combination a'Y that is unbiased for λ'β, that is, E(a'Y) = λ'β for any β. It follows that

\[
E(a'Y) = a'E(Y) = a'X\beta = \lambda'\beta \quad \text{for any } \beta .
\]

That is, a'Xβ = λ'β for any β. Therefore, a'X = λ'.
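The row-space criterion can be tried out directly in PROC GLM with ESTIMATE statements. The following is a minimal sketch, assuming the data set example_1 from SAS Code 1; the labels are arbitrary and chosen here only for illustration. The coefficient vector (0, 1, -1, 0) equals the first row of X minus the third row, and (1, 1, 0, 0) is the first row itself, so both requests lie in the row space of X. The vector (0, 1, 0, 0) does not, so the last request should be reported as non-estimable rather than given a value.

```sas
proc glm data=example_1;
  class group;
  model y = group / solution;

  /* tau_A - tau_B: coefficients (0, 1, -1, 0), in the row space of X */
  estimate 'A - B' group 1 -1 0;

  /* u + tau_A (mean of level A): coefficients (1, 1, 0, 0) */
  estimate 'mean of A' intercept 1 group 1 0 0;

  /* tau_A alone: coefficients (0, 1, 0, 0), not in the row space of X */
  estimate 'tau A alone' group 1 0 0;
run;
quit;
```

From the estimates in SAS Output 1, the first two requests should come out as 3.05 - 5.30 = -2.25 and 3.25 + 3.05 = 6.30, whichever level serves as the reference.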
LEMMA 2

λ'β is estimable if and only if there is a solution ξ such that X'Xξ = λ (this equation is usually called the conjugate normal equation).

Proof:

Part-1, we prove the sufficiency: if there is a solution ξ such that X'Xξ = λ, let a = Xξ. Then

\[
\lambda' = \xi' X'X := a'X .
\]

Therefore, by Lemma 1, λ'β is estimable.

Part-2, for the necessity: assume λ'β is estimable. By Lemma 1, there exists some a' such that λ' = a'X, that is, λ = X'a. By the defining property of a generalized inverse, we have

\[
(X'X)(X'X)^{-}X'X = X'X
\;\Rightarrow\;
\left[ (X'X)(X'X)^{-}X' - X' \right]_{p \times n} X_{n \times p} = O_{p \times p} .
\]

Let L := (X'X)(X'X)^{-}X' - X'. Since LX = O and L' = X[((X'X)^{-})'X'X - I], it is easy to show that

\[
LL' = O_{p \times p} \;\Rightarrow\; L = O_{p \times n} .
\]

That is, (X'X)(X'X)^{-}X' = X'. We then have

\[
(X'X)(X'X)^{-}X'a = X'a = \lambda . \qquad (*)
\]

Therefore, there is a solution ξ such that X'Xξ = λ, where ξ = (X'X)^{-}X'a.

COROLLARY

The necessary and sufficient condition for a linear combination λ'β of parameters to be estimable is

\[
(X'X)(X'X)^{-}\lambda = \lambda .
\]

Proof: If λ'β is estimable, then by Lemma 1 and Lemma 2 we have λ = X'a and equation (*) holds: (X'X)(X'X)^{-}X'a = X'a = λ. That is, (X'X)(X'X)^{-}λ = λ. Conversely, if (X'X)(X'X)^{-}λ = λ, then ξ = (X'X)^{-}λ solves X'Xξ = λ, so λ'β is estimable by Lemma 2.

EXAMPLE APPLICATION

Let us use our first example to demonstrate how to apply the corollary. Consider

\[
Y_{ij} = u + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n_i .
\]

The corresponding design matrix, its transpose, and the parameter vector are:

\[
X = \begin{pmatrix}
1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\
1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\
\vdots  & \vdots  & \vdots  &        & \vdots  \\
1_{n_a} & 0_{n_a} & 0_{n_a} & \cdots & 1_{n_a}
\end{pmatrix},
\qquad
X' = \begin{pmatrix}
1'_{n_1} & 1'_{n_2} & 1'_{n_3} & \cdots & 1'_{n_a} \\
1'_{n_1} & 0'_{n_2} & 0'_{n_3} & \cdots & 0'_{n_a} \\
\vdots   & \vdots   & \vdots   &        & \vdots   \\
0'_{n_1} & 0'_{n_2} & 0'_{n_3} & \cdots & 1'_{n_a}
\end{pmatrix},
\qquad
\beta = \begin{pmatrix} u \\ \tau_1 \\ \vdots \\ \tau_a \end{pmatrix}
\]

Then,

\[
X'X = \begin{pmatrix}
n      & n_1    & n_2    & \cdots & n_a    \\
n_1    & n_1    & 0      & \cdots & 0      \\
n_2    & 0      & n_2    & \cdots & 0      \\
\vdots & \vdots & \vdots &        & \vdots \\
n_a    & 0      & 0      & \cdots & n_a
\end{pmatrix},
\qquad \text{where } n = n_1 + n_2 + \cdots + n_a ,
\]

and we choose one of its generalized inverses:

\[
(X'X)^{-} = \begin{pmatrix}
0      & 0             & 0      & \cdots & 0             \\
0      & \frac{1}{n_1} & 0      & \cdots & 0             \\
\vdots & \vdots        & \ddots &        & \vdots        \\
0      & 0             & 0      & \cdots & \frac{1}{n_a}
\end{pmatrix}_{(a+1) \times (a+1)}
\quad \text{and} \quad
(X'X)(X'X)^{-} = \begin{pmatrix}
0      & 1      & 1      & \cdots & 1      \\
0      & 1      & 0      & \cdots & 0      \\
\vdots & \vdots & \ddots &        & \vdots \\
0      & 0      & 0      & \cdots & 1
\end{pmatrix}_{(a+1) \times (a+1)}
\]

By the corollary (which tends to be the more convenient criterion in practice), a linear combination λ'β of parameters is estimable if and only if (X'X)(X'X)^{-}λ = λ, where λ' = (λ_0, λ_1, ..., λ_a). That is,

\[
(\lambda_0, \lambda_1, \ldots, \lambda_a) = \left( \sum_{i=1}^{a} \lambda_i ,\ \lambda_1, \ldots, \lambda_a \right) .
\]

Therefore, the necessary and sufficient condition for a linear combination λ'β of parameters to be estimable is

\[
\lambda_0 = \sum_{i=1}^{a} \lambda_i . \qquad (**)
\]

For instance, we can check whether τ1 − τ2 is estimable. Since

\[
\tau_1 - \tau_2 = (0, 1, -1, 0, \ldots, 0)\,(u, \tau_1, \ldots, \tau_a)' ,
\]

we have λ_0 = 0, λ_1 = 1, λ_2 = −1, λ_3 = 0, ..., λ_a = 0, which satisfies equation (**): λ_0 = 0 = 1 − 1. Therefore, τ1 − τ2 is estimable.

We can similarly check that none of τ1, τ2, τ3 is estimable on its own, since in each case

\[
\lambda_0 = 0 \neq \sum_{i=1}^{a} \lambda_i = 1 .
\]

This explains again the note mentioned at the beginning: “Terms whose estimates are followed by the letter 'B' are not uniquely estimable”, and why the estimates of the group A, group B, and group C parameters in the GLM output all have the letter “B” attached.

CONCLUSION

In this paper, we investigated the “weird” letter “B” in the output of the SAS® GLM procedure. Through intuitive examples, we answered a series of questions: Is there anything wrong with the model? To what extent can we trust the output? What alternative strategies can we take? Furthermore, we provided the statistical theory needed to prove our statements rigorously.

ACKNOWLEDGMENTS

I would like to thank my former Statistics professor, Dr. Daniel Jeske, for his help at the beginning of my statistics career. I would also like to thank our Vice-President, Mr. Shawn Hushman, for his trust and SAS training support.

REFERENCE

1. John O. Rawlings, Sastry G. Pantula, David A. Dickey. Applied Regression Analysis - A Research Tool, 2nd edition. Springer-Verlag New York, Inc., 1998. ISBN 0-387-98454-2.
2. Leslie A. Christensen. Introduction to Building a Linear Regression Model. SAS Users Group International (SUGI), Proceedings 22, Statistics and Analytics.

RECOMMENDED READING

SAS Usage Note 38384: How to interpret the results of the SOLUTION option in the MODEL statement of PROC GLM? http://support.sas.com/kb/38/384.html
SAS Usage Note 22585: Why is the X'X matrix found to be singular in the PROC GLM Output? http://support.sas.com/kb/22/585.html
SAS Study Notes: http://www.dumblittledoctor.com/sas_tutorial_home.php

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Please contact the author at:

Name: Dr. Shilong Kuang
Enterprise: Kelley Blue Book, Inc.
Address: 195 Technology Drive
City, State ZIP: Irvine, CA 92618
E-mail: [email protected]
Web:

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.