Paper 75006
What's happening if there is "B" in the general linear modeling output?
Shilong Kuang, Kelley Blue Book Inc., Irvine, CA
ABSTRACT
The SAS® GLM procedure is an efficient tool in statistical data analysis, especially when we have categorical
variables (also called class variables) as predictors. In SAS® practice, you may sometimes see the letter "B"
in the parameter estimate output and wonder what is happening in the model. What is the cause of this? Can we
still trust our model? Can we verify the modeling output with an alternative procedure? What should we pay
more attention to in similar situations in the future?
In this paper, we answer those questions by demonstrating some intuitively understandable examples, with the
corresponding theoretical statistical background attached. With those in mind, you can easily turn general linear
modeling into a more powerful tool! SAS® ROCKS!
Keywords: General linear model, estimable function, singular matrix, generalized inverse.
INTRODUCTION
The SAS® GLM procedure analyzes data within the framework of general linear models. When we use the GLM
procedure in SAS, we may see a “weird” message in the output like the following:
Note: The X'X matrix has been found to be singular, and a generalized inverse was used to solve the
normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.
You may wonder what is happening in the model. Is there anything wrong with the model? To what extent can we
trust the modeling output? We begin our demonstration with the following simple example.
EXAMPLE TO START
We consider the following one-way classified data, with one categorical variable “group” that has 3 levels (A, B, C)
and two observations per level.
SAS Code 1
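data example_1;
   /* group: class variable with levels A, B, C; y: response */
   input group $ y @@;
   cards;
A 6.4 A 6.2
B 8.4 B 8.7
C 3.4 C 3.1
;
run;

/* SOLUTION prints the parameter estimates; E prints the general form of the estimable functions */
proc glm data=example_1;
   class group;
   model y = group / solution e;
run;
quit;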
SAS Output 1
Parameter    Estimate        Standard Error   t Value   Pr > |t|
Intercept    3.25     B      0.1354006        24        0.0002
group A      3.05     B      0.1914854        15.93     0.0005
group B      5.3      B      0.1914854        27.68     0.0001
group C      0        B      .                .         .
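How do we interpret the letter “B” in the output?
If we write down the model in the form $Y = X\beta + \varepsilon$, the design matrix $X$ and the parameter vector $\beta$ are
\[
X = \begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 \\
1 & 0 & 0 & 1 \\
1 & 0 & 0 & 1
\end{pmatrix},
\qquad
\beta = \begin{pmatrix} u \\ \tau_1 \\ \tau_2 \\ \tau_3 \end{pmatrix}.
\]
We can see there are only 3 independent rows in $X$, with 4 parameters in $(u, \tau_1, \tau_2, \tau_3)$.
In other words, in the system of linear equations
\[
\begin{cases}
y_1 = u + \tau_1 \\
y_2 = u + \tau_2 \\
y_3 = u + \tau_3
\end{cases}
\]
there are only 3 independent equations with 4 variables, so we know there is no unique solution. In fact, there
are infinitely many solutions.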
The design matrix X is not a full-rank matrix: it has 4 columns, but its rank (the common dimension of its row
space and its column space) is only 3. Consequently the 4×4 matrix X'X also has rank 3, which explains the note:
“the X'X matrix has been found to be singular”. This singularity indicates that the model as defined has too many
parameters; it is over-parameterized!
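The rank deficiency can also be checked numerically. The following is a minimal sketch, assuming SAS/IML is licensed. The matrix X is typed in by hand from the example above, and the names XtX and rankX are introduced here for illustration only. It computes the rank through the Moore-Penrose inverse, since the trace of ginv(X)*X equals the rank of X.

```sas
proc iml;
   /* Design matrix of Example 1: intercept column plus one indicator column per group */
   X = {1 1 0 0,
        1 1 0 0,
        1 0 1 0,
        1 0 1 0,
        1 0 0 1,
        1 0 0 1};
   XtX   = X` * X;                       /* 4 x 4 cross-product matrix X'X      */
   rankX = round(trace(ginv(X) * X));    /* rank of X via the Moore-Penrose inverse */
   print XtX, rankX;
quit;
```

With 4 columns and a rank of 3, X'X cannot be inverted in the ordinary sense, which is the situation the note describes.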
To overcome this “embarrassing” situation, the default GLM parameterization treats the last level “C” (in
alphabetical order) of the categorical variable “group” as the reference level and sets its effect to zero. This
restriction explains why we see the 0 estimate for the parameter “group C” in the previous modeling output.
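As a small worked illustration (computed by hand from the six data values above, not taken from the SAS output), the group means are 6.3, 8.55, and 3.25. Every least-squares solution only has to reproduce these means, so the solutions form a one-parameter family indexed by an arbitrary constant c, a symbol introduced here for illustration:

```latex
\[
\bar{y}_A = \tfrac{6.4 + 6.2}{2} = 6.3, \qquad
\bar{y}_B = \tfrac{8.4 + 8.7}{2} = 8.55, \qquad
\bar{y}_C = \tfrac{3.4 + 3.1}{2} = 3.25
\]
\[
(u, \tau_1, \tau_2, \tau_3) = (c,\; 6.3 - c,\; 8.55 - c,\; 3.25 - c), \qquad c \in \mathbb{R}
\]
\[
c = 3.25:\ (3.25,\ 3.05,\ 5.30,\ 0) \quad \text{(the GLM solution)}, \qquad
c = 0:\ (0,\ 6.3,\ 8.55,\ 3.25)
\]
```

Both choices of c give the same fitted values $u + \tau_i$; the reference-level restriction simply picks the member of the family with $\tau_3 = 0$.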
ALTERNATIVE WAY TO DOUBLE CHECK THE OUTPUT
We can use PROC REG as an alternative to verify the output, assigning a dummy variable to each level of
the categorical variable (except the last level), as follows:
SAS Code 2
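data example_1;
   set example_1;
   /* Level C is the reference level: both dummy variables are 0 */
   if group='C' then do; d1=0; d2=0; end;
   else if group='A' then do; d1=1; d2=0; end;   /* d1 flags level A */
   else if group='B' then do; d1=0; d2=1; end;   /* d2 flags level B */
run;

proc reg data=example_1;
   model y = d1 d2;
run;
quit;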
Parameter estimate output is the same as the previous GLM output:
SAS Output 2
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1    3.25                 0.1354           24        0.0002
d1          1    3.05                 0.19149          15.93     0.0005
d2          1    5.3                  0.19149          27.68     0.0001
Both procedures give the same fitted value for the reference level. From PROC REG, Y(group=’C’) = intercept = 3.25,
which is the same as the GLM output: Y(group=’C’) = intercept + “group C” parameter estimate = 3.25 + 0 = 3.25.
MORE STATISTICS EXPLANATION
Since the design matrix X in the previous example is not full rank, X’X is singular and its ordinary inverse does
not exist, so the normal equations have no unique solution. The absence of a unique solution indicates that at
least some parameters in the model cannot be estimated uniquely, and they are said to be non-estimable!
DEFINITION OF ESTIMABLE FUNCTION
In a general linear model, a linear combination $\lambda'\beta$ of parameters is said to be estimable if there exists a linear
function $a'Y$ that is unbiased for $\lambda'\beta$, that is, $E(a'Y) = \lambda'\beta$. If no such function exists, then the linear
combination $\lambda'\beta$ is defined as a non-estimable function.
LEMMA 1
$\lambda'\beta$ is estimable if and only if $\lambda' = a'X$ for some $a'$; that is, $\lambda'$ is in the row space of $X$.
Proof:
Part 1, we prove sufficiency: given $\lambda' = a'X$ for some $a'$, then
\[ E(a'Y) = a'E(Y) = a'X\beta = \lambda'\beta , \]
so $a'Y$ is unbiased for $\lambda'\beta$ and $\lambda'\beta$ is estimable.
Part 2, we prove necessity: given that $\lambda'\beta$ is estimable, by the definition of estimability there exists some linear
combination $a'Y$ that is unbiased for $\lambda'\beta$, that is, $E(a'Y) = \lambda'\beta$ for any $\beta$. It follows that
\[ E(a'Y) = a'E(Y) = a'X\beta = \lambda'\beta \quad \text{for any } \beta . \]
That is, $a'X\beta = \lambda'\beta$ for any $\beta$. Therefore, $a'X = \lambda'$.
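A hand-worked instance of Lemma 1 for the six-observation example may help. The vector a below is one possible choice made for illustration, not one given in the paper; it places $\lambda' = (0, 1, -1, 0)$, the contrast $\tau_1 - \tau_2$, in the row space of X:

```latex
\[
a' = \left( \tfrac12,\ \tfrac12,\ -\tfrac12,\ -\tfrac12,\ 0,\ 0 \right)
\quad\Longrightarrow\quad
a'X = (0,\ 1,\ -1,\ 0) = \lambda'
\]
\[
a'Y = \tfrac12 (6.4 + 6.2) - \tfrac12 (8.4 + 8.7) = 6.3 - 8.55 = -2.25
\]
```

The value $-2.25$ agrees with the difference of the two GLM estimates, $3.05 - 5.3$, as it should for an estimable function.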
LEMMA 2
$\lambda'\beta$ is estimable if and only if there is a solution $\xi$ such that $X'X\xi = \lambda$ (the last equation is usually called
the conjugate normal equation).
Proof: Part 1, we prove sufficiency: if there is a solution $\xi$ such that $X'X\xi = \lambda$, then
\[ \lambda' = \xi'X'X := a'X , \quad \text{where } a = X\xi . \]
Therefore, by Lemma 1, we know $\lambda'\beta$ is estimable.
Part 2, we prove necessity: assume $\lambda'\beta$ is estimable. By Lemma 1, there exists some $a'$ such that $\lambda' = a'X$, that is,
$\lambda = X'a$. By the generalized inverse property, we have
\[ (X'X)(X'X)^{-}X'X = X'X \;\Rightarrow\; \left[ (X'X)(X'X)^{-}X' - X' \right]_{p \times n} X_{n \times p} = O_{p \times p} . \]
Let $L := (X'X)(X'X)^{-}X' - X'$. Since $LX = O$, it is easy to show that $LL' = O_{p \times p}$, which implies $L = O_{p \times n}$. That is,
\[ (X'X)(X'X)^{-}X' = X' . \]
We then have
\[ (X'X)(X'X)^{-}X'a = X'a = \lambda . \tag{*} \]
Therefore, there is a solution $\xi$ such that $X'X\xi = \lambda$, where $\xi = (X'X)^{-}X'a$.
COROLLARY
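The necessary and sufficient condition for a linear combination $\lambda'\beta$ of parameters to be estimable is
\[ (X'X)(X'X)^{-}\lambda = \lambda . \]
Proof: Based on Lemma 1 and Lemma 2, if $\lambda'\beta$ is estimable then $\lambda = X'a$ for some $a$ and the equation (*) is satisfied:
$(X'X)(X'X)^{-}X'a = X'a = \lambda$. That is, $(X'X)(X'X)^{-}\lambda = \lambda$. Conversely, if $(X'X)(X'X)^{-}\lambda = \lambda$, then
$\xi = (X'X)^{-}\lambda$ solves $X'X\xi = \lambda$, so $\lambda'\beta$ is estimable by Lemma 2.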
EXAMPLE APPLICATION
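Let us use our first example to demonstrate how to apply the previous conclusion in the corollary. Consider
\[ Y_{ij} = u + \tau_i + \varepsilon_{ij}, \quad \text{where } i = 1, 2, \ldots, a \text{ and } j = 1, 2, \ldots, n_i . \]
The corresponding design matrix, its transpose, and the parameter vector are
\[
X = \begin{pmatrix}
1_{n_1} & 1_{n_1} & 0_{n_1} & \cdots & 0_{n_1} \\
1_{n_2} & 0_{n_2} & 1_{n_2} & \cdots & 0_{n_2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1_{n_a} & 0_{n_a} & 0_{n_a} & \cdots & 1_{n_a}
\end{pmatrix},
\quad
X' = \begin{pmatrix}
1_{n_1}' & 1_{n_2}' & 1_{n_3}' & \cdots & 1_{n_a}' \\
1_{n_1}' & 0_{n_2}' & 0_{n_3}' & \cdots & 0_{n_a}' \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0_{n_1}' & 0_{n_2}' & 0_{n_3}' & \cdots & 1_{n_a}'
\end{pmatrix},
\quad
\beta = \begin{pmatrix} u \\ \tau_1 \\ \vdots \\ \tau_a \end{pmatrix},
\]
where $1_{n_i}$ and $0_{n_i}$ denote column vectors of $n_i$ ones and $n_i$ zeros. Then,
\[
X'X = \begin{pmatrix}
n & n_1 & n_2 & \cdots & n_a \\
n_1 & n_1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
n_a & 0 & 0 & \cdots & n_a
\end{pmatrix},
\]
where $n = n_1 + n_2 + \cdots + n_a$, and we choose one of its generalized inverses:
\[
(X'X)^{-} = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 \\
0 & \frac{1}{n_1} & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \frac{1}{n_a}
\end{pmatrix}_{(a+1) \times (a+1)}
\quad \text{and} \quad
(X'X)(X'X)^{-} = \begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}_{(a+1) \times (a+1)} .
\]
By the corollary (the corollary seems to be more useful in most situations!), a linear combination $\lambda'\beta$ of parameters is
estimable if and only if $(X'X)(X'X)^{-}\lambda = \lambda$, where $\lambda' = (\lambda_0, \lambda_1, \ldots, \lambda_a)$. That is,
\[ (\lambda_0, \lambda_1, \ldots, \lambda_a) = \Big( \sum_{i=1}^{a} \lambda_i, \lambda_1, \ldots, \lambda_a \Big) . \]
Therefore, the necessary and sufficient condition for a linear combination $\lambda'\beta$ of parameters to be estimable is
\[ \lambda_0 = \sum_{i=1}^{a} \lambda_i . \tag{**} \]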
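The corollary can also be checked numerically on the six-observation example. The sketch below assumes SAS/IML is available. It uses the Moore-Penrose inverse returned by GINV rather than the particular generalized inverse chosen above, which should not matter because the condition depends only on whether λ lies in the column space of X'X. The names H, lam1 to lam3, and d1 to d3 are introduced here for illustration.

```sas
proc iml;
   /* Design matrix of Example 1 (a = 3 groups, 2 observations each) */
   X = {1 1 0 0,
        1 1 0 0,
        1 0 1 0,
        1 0 1 0,
        1 0 0 1,
        1 0 0 1};
   XtX = X` * X;
   H   = XtX * ginv(XtX);          /* (X'X)(X'X)^- using the Moore-Penrose inverse */

   lam1 = {0, 1, -1, 0};           /* tau1 - tau2 : lambda0 = 0 equals sum = 0 */
   lam2 = {0, 1,  0, 0};           /* tau1 alone  : lambda0 = 0 differs from sum = 1 */
   lam3 = {1, 1,  0, 0};           /* u + tau1    : lambda0 = 1 equals sum = 1 */

   /* Largest absolute deviation of H*lambda from lambda; near 0 means estimable */
   d1 = max(abs(H*lam1 - lam1));
   d2 = max(abs(H*lam2 - lam2));
   d3 = max(abs(H*lam3 - lam3));
   print d1 d2 d3;
quit;
```

Under these assumptions d1 and d3 should be zero up to rounding error, while d2 should be clearly nonzero.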
For instance, we can check whether $\tau_1 - \tau_2$ is estimable. Since $\tau_1 - \tau_2 = (0, 1, -1, 0, \ldots, 0)\,(u, \tau_1, \ldots, \tau_a)'$, we have
$\lambda_0 = 0, \lambda_1 = 1, \lambda_2 = -1, \lambda_3 = 0, \ldots, \lambda_a = 0$, which satisfies the equation (**): $\lambda_0 = 0 = \sum_{i=1}^{a} \lambda_i$.
Therefore, $\tau_1 - \tau_2$ is estimable.
We can similarly check that each of $\tau_1, \tau_2, \tau_3$ is not estimable separately, since $\lambda_0 = 0 \neq \sum_{i=1}^{a} \lambda_i = 1$. The intercept
$u$ alone is not estimable either, since $\lambda_0 = 1 \neq \sum_{i=1}^{a} \lambda_i = 0$. This
explains again the note message mentioned in the beginning: “Terms whose estimates are followed by the
letter 'B' are not uniquely estimable”, and why the estimates of the intercept and of the group A, group B, and
group C parameters in the GLM output all have the letter “B” attached.
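In practice, the same check can be handed to PROC GLM through ESTIMATE statements. This is a sketch that assumes the example_1 data set from SAS Code 1 is still available; the labels are arbitrary. The first two functions satisfy condition (**), while the third does not, so SAS is expected to flag it as not estimable instead of reporting a value.

```sas
proc glm data=example_1;
   class group;
   model y = group / solution;
   /* tau1 - tau2: lambda0 = 0 and the group coefficients sum to 0 */
   estimate 'A vs B'     group 1 -1 0;
   /* u + tau1: lambda0 = 1 and the group coefficients sum to 1 */
   estimate 'Mean of A'  intercept 1 group 1 0 0;
   /* tau1 alone: lambda0 = 0 but the group coefficients sum to 1 */
   estimate 'tau1 alone' group 1 0 0;
run;
quit;
```

From the parameter estimates shown earlier, 'A vs B' should come out as 3.05 − 5.3 = −2.25 and 'Mean of A' as 3.25 + 3.05 = 6.3.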
CONCLUSION
In this paper, we investigate the “weird” letter “B” output in the SAS® GLM procedure. Through demonstrations with
intuitively understandable examples, we are able to answer a series of questions: Is there anything wrong with the
model? To what extent can we trust the output? What alternative strategies can we take?
Furthermore, we provide the theoretical statistical explanation to rigorously prove our statements.
ACKNOWLEDGMENTS
I would like to thank my former Statistics professor Dr. Daniel Jeske for his help at the beginning of my
statistics career. I would also like to thank our Vice-President Mr. Shawn Hushman for his trust and for supporting
my SAS training.
REFERENCES
1. Applied Regression Analysis - A Research Tool, John O. Rawlings, Sastry G. Pantula, David A. Dickey, 2nd
edition, ISBN 0-387-98454-2, 1998 Springer-Verlag New York, Inc.
2. Introduction to Building a Linear Regression Model, Leslie A. Christensen, SAS
Users Group International (SUGI), Proceedings 22, Statistics and Analytics.
RECOMMENDED READING
SAS Usage Note 38384: How to interpret the results of the SOLUTION option in the MODEL statement of PROC
GLM? http://support.sas.com/kb/38/384.html.
SAS Usage Note 22585: Why is the X'X matrix found to be singular in the PROC GLM Output?
http://support.sas.com/kb/38/384.html.
SAS Study Notes: http://www.dumblittledoctor.com/sas_tutorial_home.php.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Please contact the author at:
Name: Dr. Shilong Kuang
Enterprise: Kelley Blue Book, Inc.
Address: 195 Technology Drive
City, State ZIP: Irvine, CA 92618
E-mail: [email protected]
Web:
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.