Training Feed-Forward Neural Networks with
Monotonicity Requirements
Antoine Mahul 1 Alexandre Aussem 2
Research Report LIMOS/RR-04-11
June 2004
1 [email protected]
2 [email protected]
Abstract
In this paper, we adapt the classical learning algorithm for feed-forward neural networks when monotonicity is required in the input-output mapping. Such requirements arise, for instance, when prior knowledge of the process being observed is available. Monotonicity can be imposed by adding suitable penalization terms to the error function. The objective function, however, depends non-linearly on the first-order derivatives of the network mapping. We show that these derivatives can easily be obtained by an extension of the standard back-propagation procedure. This yields a computationally efficient algorithm with little overhead compared to back-propagation.
Keywords: Neural networks, Monotonicity, Non-linear optimization, Penalty methods.
Résumé (English translation)
In this article, we adapt the classical learning algorithm for feed-forward neural networks to the case where monotonicity is required in the relation to be learned. Such requirements arise, for example, when a priori knowledge about the observed process is available. Monotonicity can be imposed by adding suitable penalization terms to the error function. However, the objective function then depends non-linearly on the derivatives of the function represented by the network. We show that these derivatives can easily be obtained by an extension of the standard back-propagation algorithm. This leads to an efficient algorithm with a small computational overhead compared with back-propagation.
Keywords: Neural networks, Monotonicity, Non-linear optimization, Penalty methods.
LIMOS / Blaise Pascal University (FRANCE)
Research Report RR-04-11
Introduction
By virtue of their universal approximation capabilities, feed-forward neural networks are good candidates for the approximation of continuous non-linear mappings. However, inclusion of prior knowledge into the network training can lead to significant improvements in the network performance,
especially when the amount of training data is limited.
For interpolation problems, monotonicity requirements can easily be imposed by adding suitable penalization terms to the error function, in much the same way as smoothness is imposed by adding suitable regularization terms. The new error function, however, depends non-linearly on the first-order derivatives of the network mapping, so the standard back-propagation algorithm cannot be applied directly. In this paper, we derive a computationally efficient learning algorithm, for a feed-forward network of arbitrary topology, which can be used to minimize such penalized error functions. The derivatives with respect to the weights for a multi-layer perceptron are obtained by an extension of the back-propagation procedure.
As an example [5], consider the optimization problems that typically arise in Traffic Engineering, where the overall operational performance of the communication network is to be maximized while respecting some predefined Quality of Service (QoS) requirements. Unfortunately, the QoS, usually expressed in terms of average response time and/or loss rate, is difficult to express analytically in terms of the incoming traffic characteristics. It is therefore particularly appealing to train an MLP as a "black box" on simulation data to evaluate the QoS values. As delay, loss and jitter are monotonically increasing with respect to the incoming traffic rates, including this prior knowledge in the training procedure is important. It may also be a stringent requirement for the optimization algorithms to converge.
1 Learning with monotonicity requirements
We consider in this paper feed-forward networks of arbitrary topology. We first introduce our notation.
1.1 Definitions and notations
Let $N$ be the set of neurons, $N_{in} \subset N$ the subset of input neurons and $N_{out} \subset N$ the subset of output neurons. $A$ is the set of arcs, and $w_c = w_{ij}$ is the weight of an arc $c = (i, j) \in A$. For a given input vector $x$ and weight vector $w$, $s_i(x, w)$ and $a_i(x, w)$ are the input value and the activity of a neuron $i \in N$, and $f_i$ is its activation function. The values of the input neurons are set to $x = \{s_i,\ i \in N_{in}\}$.
Inputs and outputs are determined by the relations:
$$ s_i(x, w) = \sum_{k \in N,\ (k,i) \in A} w_{ki}\, a_k(x, w) \qquad \forall i \notin N_{in} \tag{1} $$
$$ a_i(x, w) = f_i\big(s_i(x, w)\big) \qquad \forall i \in N \tag{2} $$
We also write $a'_i(x, w) = f'_i(s_i(x, w))$ and $a''_i(x, w) = f''_i(s_i(x, w))$.
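Definition. For a given input $x$ and a given weight vector $w$, the Jacobian matrix $J(x, w)$ of a neural network is defined by:
$$ J(x, w) = \big( J_{ij}(x, w) \big)_{i \in N_{out},\ j \in N_{in}} \quad \text{with} \quad J_{ij}(x, w) = \frac{\partial y_i}{\partial x_j} = \frac{\partial a_i}{\partial s_j}(x, w) \qquad \forall i \in N_{out},\ \forall j \in N_{in} $$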
1.2 Learning problem subject to monotonicity constraints
An output value $y_i$ ($i \in N_{out}$) of the neural network is increasing (resp. decreasing) on a compact set $K \subset \mathbb{R}^{|N_{in}|}$ with respect to an input value $x_j$ ($j \in N_{in}$) if the corresponding element $J_{ij}(x, w) = \partial y_i / \partial x_j$ of the Jacobian matrix is non-negative (resp. non-positive) for all $x \in K$.
So, a monotonic increase constraint on the regression function is:
$$ J_{ij}(x, w) \ge 0 \qquad i \in N_{out},\ j \in N_{in},\ \forall x \in K \tag{C} $$
Learning is classically achieved by a gradient descent method. Generally, we seek to minimize the quadratic estimation error $E_q$ on a given base $B$ of examples ($w \in \mathbb{R}^{|A|}$ is the weight vector of the neural network):
$$ (P_0) \qquad \min_{w \in \mathbb{R}^{|A|}} E_q(w) = \sum_{(x,y) \in B}\ \sum_{i \in N_{out}} \big( a_i(x, w) - y_i \big)^2 $$
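With the monotonicity constraints, the learning problem becomes:
$$ (P_1) \qquad \begin{cases} \min\ E(w) = \displaystyle\sum_{(x,y) \in B}\ \sum_{i \in N_{out}} \big( a_i(x, w) - y_i \big)^2 \\ \text{s.t. } J_{ij}(x, w) \ge 0 \qquad \forall x \in K,\ \forall i \in N_{out},\ \forall j \in N_{in} \qquad (C_1) \\ w \in \mathbb{R}^{|A|} \end{cases} $$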
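The monotonicity requirement is difficult to enforce on the whole domain $K$. So we consider the restricted problem, in which the constraints are imposed only at the examples of the base $B$:
$$ (P_2) \qquad \begin{cases} \min\ E(w) = \displaystyle\sum_{(x,y) \in B}\ \sum_{i \in N_{out}} \big( a_i(x, w) - y_i \big)^2 \\ \text{s.t. } J_{ij}(x, w) \ge 0 \qquad \forall (x, \cdot) \in B,\ \forall i \in N_{out},\ \forall j \in N_{in} \qquad (C_2) \\ w \in \mathbb{R}^{|A|} \end{cases} $$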
1.3 Penalty method
The principles of the penalty method (see [1, 6] for details) used to solve problem $(P_2)$ are briefly recalled in this section. Let $(P)$ be a non-linear optimization problem with constraints:
$$ (P) \qquad \begin{cases} \min\ f(x) \\ \text{s.t. } g_i(x) \ge 0 \qquad \forall i \in [1..m] \\ x \in \mathbb{R}^n \end{cases} $$
Penalty methods solve a sequence of unconstrained subproblems that iteratively approach the infinite penalty function ($\sigma(x) = 0$ if all constraints are satisfied and $\sigma(x) = \infty$ otherwise). The subproblems $(SP_k)$ are:
$$ (SP_k) \qquad \min_{x \in \mathbb{R}^n} \Phi(x, \mu) = f(x) + \mu \sum_{i=1}^{m} \varphi\big(g_i(x)\big) $$
where $\varphi$ is a function from $\mathbb{R}$ to $\mathbb{R}$. For instance,
• if $\varphi = \varphi_1 : x \mapsto \frac{1}{2} \min(0, x)^2$, $\Phi$ is the quadratic penalty function,
• if $\varphi = \varphi_2 : x \mapsto -\log(x)$, $\Phi$ is the logarithmic barrier function.
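The two choices of $\varphi$ above, together with the derivatives $\varphi'$ needed later in Eq. (4), can be sketched in Python as follows. The function names are illustrative and not from the report; returning infinity outside the barrier's domain is a convention assumed here so that a line search can reject infeasible points.

```python
import math

def phi_quadratic(x):
    """Quadratic penalty 0.5 * min(0, x)^2: zero whenever the constraint x >= 0 holds."""
    return 0.5 * min(0.0, x) ** 2

def dphi_quadratic(x):
    """Derivative of the quadratic penalty: min(0, x)."""
    return min(0.0, x)

def phi_log_barrier(x):
    """Logarithmic barrier -log(x); infinite for x <= 0 (assumed convention)."""
    if x <= 0.0:
        return math.inf
    return -math.log(x)

def dphi_log_barrier(x):
    """Derivative of the barrier, -1/x, defined only for strictly feasible x > 0."""
    if x <= 0.0:
        raise ValueError("the barrier derivative is undefined for x <= 0")
    return -1.0 / x
```

The quadratic penalty is defined everywhere and only charges violated constraints, whereas the barrier requires a strictly feasible starting point, which is what motivates the initialization heuristic of Section 3.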
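The penalty function $\Phi(w, \mu)$ associated with problem $(P_2)$ is then ($w \in \mathbb{R}^{|A|}$ is the vector of the weights of the neural network):
$$ \begin{aligned} \Phi(w, \mu) &= E(w) + \mu \sum_{(x,y) \in B}\ \sum_{i \in N_{out}}\ \sum_{j \in N_{in}} \varphi\big(J_{ij}(x, w)\big) \\ &= \sum_{(x,y) \in B} \Bigg( \underbrace{\sum_{i \in N_{out}} \big( a_i(x, w) - y_i \big)^2}_{E(x, w)} + \mu \underbrace{\sum_{i \in N_{out}}\ \sum_{j \in N_{in}} \varphi\big(J_{ij}(x, w)\big)}_{P(x, w)} \Bigg) \\ &= \sum_{(x,y) \in B} \big( E(x, w) + \mu P(x, w) \big) \end{aligned} $$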
At each iteration of the penalty method, we have to solve the following subproblem:
$$ (P_k) \qquad \min_{w} \Phi(w, \mu_k) = \sum_{(x,y) \in B} \big( E(x, w) + \mu_k P(x, w) \big) $$
The sequence $\mu_k$ must satisfy $\lim_{k \to \infty} \mu_k = \infty$ in the case of penalty functions and $\lim_{k \to \infty} \mu_k = 0$ in the case of barrier functions.
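To see how the sequence of subproblems approaches the constrained solution, here is a minimal sketch on a hypothetical one-dimensional problem that is not from the report: minimize $f(x) = x^2$ subject to $g(x) = x - 1 \ge 0$, whose solution is $x = 1$. With the quadratic penalty, each subproblem has a closed-form minimizer, so no inner optimizer is needed.

```python
def penalized_minimizer(mu):
    """Minimizer of Phi(x, mu) = x^2 + mu * 0.5 * min(0, x - 1)^2.

    For x < 1 the objective is x^2 + (mu / 2) * (x - 1)^2, whose stationary
    point x = mu / (2 + mu) is always below 1, so it is the global minimizer.
    """
    return mu / (2.0 + mu)

# Increasing penalty parameters mu_k -> infinity (the schedule is an assumption).
mus = [10.0 ** k for k in range(6)]
iterates = [penalized_minimizer(mu) for mu in mus]
```

Every iterate is slightly infeasible and the violation shrinks as $\mu_k$ grows, which is the typical behaviour of an exterior penalty method; a barrier method would instead approach the solution from inside the feasible set.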
In order to apply a gradient descent method to these subproblems, we must be able to compute the gradient $\nabla \Phi(w, \mu)$:
$$ \nabla \Phi(w, \mu) = \sum_{(x,y) \in B} \big( \nabla E(x, w) + \mu \nabla P(x, w) \big) \tag{3} $$
The term $\nabla E$ can be computed by the back-propagation algorithm. In the next section we focus on the penalty term $\nabla P$, whose components are:
$$ \frac{\partial P}{\partial w_c}(x, w) = \sum_{i \in N_{out}}\ \sum_{j \in N_{in}} \varphi'\big(J_{ij}(x, w)\big)\, \frac{\partial J_{ij}(x, w)}{\partial w_c} \qquad \forall c \in A,\ \forall x \in B \tag{4} $$
2 Forward-backward algorithm for gradient computation
In this section, the derivation of the gradient $\nabla P$ is presented in detail, in the same way as was done in [3]. Let $x$ be an input vector of the neural network and $w$ a weight vector. To simplify the notation, we omit $x$ and $w$ in the sequel.
2.1 The jacobian matrix
The calculation of $J_{ij}$ can be performed either forwards or backwards, as shown here.
Proposition 1 (Forward relation for Jacobian elements). For all $(i, j) \in N \times N_{in}$, we have
$$ J_{ij} = a'_i \begin{cases} \delta^i_j & \text{if } i \in N_{in}, \\ \displaystyle\sum_{k,\ (k,i) \in A} w_{ki} J_{kj} & \text{otherwise,} \end{cases} \tag{5} $$
where $\delta^i_j$ is the Kronecker symbol.
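Proof. Equation (5) can be demonstrated by using (2) in the definition of $J_{ij}$:
$$ J_{ij} = \frac{\partial a_i}{\partial s_j} = \frac{\partial a_i}{\partial s_i} \frac{\partial s_i}{\partial s_j} = a'_i \frac{\partial s_i}{\partial s_j} $$
Then, according to (1), we have
$$ \frac{\partial s_i}{\partial s_j} = \sum_{k,\ (k,i) \in A} w_{ki} \underbrace{\frac{\partial a_k}{\partial s_j}}_{J_{kj}} \qquad \forall (i, j) \in N \times N_{in} $$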
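The forward relation (5) can be sketched in Python as follows. The conventions are assumptions made for this illustration and are not stated in the report: neurons are numbered in topological order with the inputs first, input neurons use the identity activation (so $a'_i = 1$), every other neuron uses tanh, and the small 2-2-1 network is hypothetical.

```python
import math

def forward_jacobian(x, weights, n_in, n_neurons):
    """Return activities a and J[i][j] = d a_i / d s_j for every neuron i and input j.

    weights maps an arc (k, i) to w_ki. Neurons 0..n_in-1 are inputs with the
    identity activation; all other neurons use tanh (assumptions of this sketch).
    """
    a = list(x) + [0.0] * (n_neurons - n_in)
    # Input rows: J_ij = delta_ij. Other rows are filled in topological order.
    J = [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_neurons)]
    for i in range(n_in, n_neurons):
        preds = [(k, w) for (k, target), w in weights.items() if target == i]
        s_i = sum(w * a[k] for k, w in preds)           # relation (1)
        a[i] = math.tanh(s_i)                           # relation (2)
        da_i = 1.0 - a[i] ** 2                          # a'_i for tanh
        for j in range(n_in):
            kappa_ij = sum(w * J[k][j] for k, w in preds)
            J[i][j] = da_i * kappa_ij                   # relation (5)
    return a, J

# Hypothetical network: inputs 0 and 1, hidden neurons 2 and 3, output 4.
example_weights = {(0, 2): 0.5, (1, 2): -0.3, (0, 3): 0.8, (1, 3): 0.2,
                   (2, 4): 1.0, (3, 4): -0.7}
activities, jacobian = forward_jacobian([0.3, -0.2], example_weights, 2, 5)
```

The row of `jacobian` belonging to the output neuron holds the quantities constrained in (C); a central finite difference on the inputs is a convenient way to check such an implementation.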
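Proposition 2 (Backward relation for Jacobian elements). For all $(i, j) \in N_{out} \times N$,
$$ J_{ij} = \begin{cases} a'_j\, \delta^i_j & \text{if } j \in N_{out}, \\ a'_j \displaystyle\sum_{k,\ (j,k) \in A} w_{jk} J_{ik} & \text{otherwise.} \end{cases} \tag{6} $$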
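Proof. Let $(i, j) \in N_{out} \times N$.
$$ J_{ij} = \frac{\partial a_i}{\partial s_j} = \frac{\partial a_i}{\partial a_j} \frac{\partial a_j}{\partial s_j} = a'_j \frac{\partial a_i}{\partial a_j} $$
So if $j \notin N_{out}$, by using the classical formulation of back-propagation:
$$ \frac{\partial a_i}{\partial a_j} = \sum_{k,\ (j,k) \in A} \frac{\partial a_i}{\partial s_k} \underbrace{\frac{\partial s_k}{\partial a_j}}_{w_{jk}} = \sum_{k,\ (j,k) \in A} w_{jk} \frac{\partial a_i}{\partial s_k} $$
Finally, if $j \in N_{out}$, then $\partial a_i / \partial a_j = \delta^i_j$.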
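The backward relation (6) can be sketched in the same spirit. The sketch below assumes a single output neuron, topological numbering with inputs first, identity activation on inputs and tanh elsewhere; none of these choices comes from the report.

```python
import math

def backward_jacobian(x, weights, n_in, n_neurons, out):
    """Return activities a and J[j] = d a_out / d s_j for every neuron j.

    weights maps an arc (j, k) to w_jk. Assumes one output neuron `out` with no
    outgoing arcs, identity inputs and tanh elsewhere (assumptions of this sketch).
    """
    a = list(x) + [0.0] * (n_neurons - n_in)
    da = [1.0] * n_neurons                      # a'_j = 1 on identity inputs
    for i in range(n_in, n_neurons):
        s_i = sum(w * a[k] for (k, target), w in weights.items() if target == i)
        a[i] = math.tanh(s_i)
        da[i] = 1.0 - a[i] ** 2
    J = [0.0] * n_neurons
    J[out] = da[out]                            # case j in N_out of relation (6)
    for j in reversed(range(n_neurons)):        # reverse topological order
        if j == out:
            continue
        J[j] = da[j] * sum(w * J[k] for (source, k), w in weights.items() if source == j)
    return a, J

# Hypothetical network: inputs 0 and 1, hidden neurons 2 and 3, output 4.
net = {(0, 2): 0.5, (1, 2): -0.3, (0, 3): 0.8, (1, 3): 0.2, (2, 4): 1.0, (3, 4): -0.7}
acts, sens = backward_jacobian([0.3, -0.2], net, 2, 5, out=4)
```

A single backward sweep yields the sensitivity of the output with respect to every neuron, including the inputs, which is why the backward form is the natural one when there are few outputs.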
2.2 Differentiation of the jacobian matrix
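For $i \in N_{out}$, $j \in N_{in}$ and $c \in A$, we calculate $\dfrac{\partial J_{ij}}{\partial w_c}$.
Remark. Let $\kappa_{ij} = \dfrac{\partial s_i}{\partial s_j}$, defined for $(i, j) \in N \times N_{in}$. Then we can write, according to the proof of Proposition 1:
$$ J_{ij} = a'_i\, \kappa_{ij} \tag{7} $$
and
$$ \kappa_{ij} = \begin{cases} \delta^i_j & \text{if } i \in N_{in}, \\ \displaystyle\sum_{k,\ (k,i) \in A} w_{ki} J_{kj} & \text{otherwise.} \end{cases} \tag{8} $$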
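Proposition 3. Write $\nu^k_{ij} = \dfrac{\partial J_{ij}}{\partial s_k}$ for all $i \in N_{out}$, $j \in N$ and $k \in N_{in}$ (these are the index sets on which the relation (6) for $J_{ij}$ and the quantity $\kappa_{jk}$ are defined). We have the following back-propagation relation:
$$ \nu^k_{ij} = \begin{cases} a''_j\, \kappa_{jk}\, \delta^i_j & \text{if } j \in N_{out}, \\ \displaystyle\sum_{l,\ (j,l) \in A} w_{jl} \big( a''_j\, \kappa_{jk}\, J_{il} + a'_j\, \nu^k_{il} \big) & \text{otherwise.} \end{cases} \tag{9} $$
Proof. This can easily be demonstrated by using the back-propagation formula (6). Let $i \in N_{out}$, $j \in N$ and $k \in N_{in}$. If $j \notin N_{out}$, then
$$ \begin{aligned} \frac{\partial J_{ij}}{\partial s_k} &= \frac{\partial}{\partial s_k} \Bigg[ a'_j \sum_{l,\ (j,l) \in A} w_{jl} J_{il} \Bigg] = \frac{\partial a'_j}{\partial s_k} \sum_{l,\ (j,l) \in A} w_{jl} J_{il} + a'_j \sum_{l,\ (j,l) \in A} w_{jl} \frac{\partial J_{il}}{\partial s_k} \\ &= \sum_{l,\ (j,l) \in A} w_{jl} \Big( a''_j\, J_{il}\, \frac{\partial s_j}{\partial s_k} + a'_j \frac{\partial J_{il}}{\partial s_k} \Big) \\ &= \sum_{l,\ (j,l) \in A} w_{jl} \big( a''_j\, J_{il}\, \kappa_{jk} + a'_j\, \nu^k_{il} \big) \end{aligned} $$
Then, if $j \in N_{out}$:
$$ \frac{\partial J_{ij}}{\partial s_k} = \frac{\partial}{\partial s_k} \big[ a'_j\, \delta^i_j \big] = \delta^i_j \frac{\partial a'_j}{\partial s_k} = \delta^i_j \frac{\partial a'_j}{\partial s_j} \frac{\partial s_j}{\partial s_k} = \delta^i_j\, a''_j\, \kappa_{jk} $$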
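Proposition 4. For $i \in N_{out}$, $j \in N_{in}$ and $c = (k, l) \in A$, we have
$$ \frac{\partial J_{ij}}{\partial w_{kl}} = J_{il} J_{kj} + a_k\, \nu^j_{il} \tag{10} $$
Proof. As $j \in N_{in}$, we can write
$$ \frac{\partial J_{ij}}{\partial w_{kl}} = \frac{\partial}{\partial w_{kl}} \frac{\partial a_i}{\partial s_j} = \frac{\partial}{\partial s_j} \frac{\partial a_i}{\partial w_{kl}} $$
Now
$$ \frac{\partial a_i}{\partial w_{kl}} = \underbrace{\frac{\partial a_i}{\partial s_l}}_{J_{il}}\ \overbrace{\frac{\partial s_l}{\partial w_{kl}}}^{a_k} = a_k J_{il} $$
So
$$ \frac{\partial J_{ij}}{\partial w_{kl}} = \frac{\partial}{\partial s_j} \big[ a_k J_{il} \big] = J_{il} \frac{\partial a_k}{\partial s_j} + a_k \frac{\partial J_{il}}{\partial s_j} = J_{il} J_{kj} + a_k \frac{\partial J_{il}}{\partial s_j} $$
and the last derivative is $\nu^j_{il}$ by definition.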
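As a sanity check of (10), consider a hypothetical three-neuron chain that is not taken from the report: an input neuron 0 with identity activation, a hidden neuron 1 and an output neuron 2, both with the same activation $f$, so that $s_1 = w_{01} x$, $a_1 = f(s_1)$, $s_2 = w_{12} a_1$ and $a_2 = f(s_2)$. The only Jacobian element is
$$ J_{20} = f'(s_2)\, w_{12}\, f'(s_1)\, w_{01}. $$
Differentiating directly with respect to $w_{12}$, and using $\partial s_2 / \partial w_{12} = a_1$, gives
$$ \frac{\partial J_{20}}{\partial w_{12}} = f'(s_2)\, f'(s_1)\, w_{01} + f''(s_2)\, a_1\, w_{12}\, f'(s_1)\, w_{01}. $$
Relation (10) with $(k, l) = (1, 2)$ and $j = 0$ reads $J_{22} J_{10} + a_1 \nu^0_{22}$, where $J_{22} = a'_2$ by (6), $J_{10} = a'_1 w_{01}$ by (5), and $\nu^0_{22} = a''_2\, \kappa_{20}$ with $\kappa_{20} = w_{12} J_{10}$ by (8). Hence
$$ J_{22} J_{10} + a_1 \nu^0_{22} = a'_2\, a'_1\, w_{01} + a_1\, a''_2\, w_{12}\, a'_1\, w_{01}, $$
which matches the direct computation term by term.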
2.3 Differentiation of the penalty term
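We now turn our attention to Eq. (4):
$$ \frac{\partial P}{\partial w_c} = \sum_{(i,j) \in N_{out} \times N_{in}} \varphi'(J_{ij})\, \frac{\partial J_{ij}}{\partial w_c} $$
Remark. For $(i, j) \in N \times N_{in}$, define $J^+_{ij}$ as
$$ J^+_{ij} = \sum_{k \in N_{out}} J_{ki}\, \varphi'(J_{kj}) $$
From (6), $J^+_{ij}$ can be calculated by back-propagation:
$$ J^+_{ij} = \begin{cases} a'_i\, \varphi'(J_{ij}) & \text{if } i \in N_{out}, \\ a'_i \displaystyle\sum_{k,\ (i,k) \in A} w_{ik} J^+_{kj} & \text{otherwise.} \end{cases} \tag{11} $$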
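Remark. Similarly, for $(i, j) \in N \times N_{in}$, define
$$ \nu^+_{ij} = \sum_{k \in N_{out}} \nu^j_{ki}\, \varphi'(J_{kj}) $$
From (9), we obtain a chain rule for the computation of $\nu^+_{ij}$:
$$ \nu^+_{ij} = \begin{cases} a''_i\, \kappa_{ij}\, \varphi'(J_{ij}) & \text{if } i \in N_{out}, \\ \displaystyle\sum_{k,\ (i,k) \in A} w_{ik} \big( a''_i\, \kappa_{ij}\, J^+_{kj} + a'_i\, \nu^+_{kj} \big) & \text{otherwise.} \end{cases} \tag{12} $$
Using Eq. (4) and Eq. (10), we finally obtain:
Proposition 5. For all $(k, l) \in A$,
$$ \frac{\partial P}{\partial w_{kl}} = \sum_{j \in N_{in}} \big( J_{kj}\, J^+_{lj} + a_k\, \nu^+_{lj} \big) \tag{13} $$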
2.4 Algorithm
We can now establish a forward-backward algorithm (a detailed version is given in the appendix) to compute the gradient $\nabla P(x)$ for a given input vector $x$:
Algorithm 1 (Principle of the algorithm for computing the penalty term gradient).
1. (forward propagation) Compute, in topological order (from inputs to outputs), $\kappa_{ij}$ and $J_{ij}$ using (8) and (7), for all $(i, j) \in N \times N_{in}$,
2. (backward propagation) Compute, in reverse topological order (from outputs to inputs), $J^+_{ij}$ and $\nu^+_{ij}$ using (11) and (12), for all $(i, j) \in N \times N_{in}$,
3. (final step) Use (13) to compute the overall gradient.
Like the standard back-propagation algorithm for error gradient computation, the complexity of this algorithm is linear in the number of synaptic weights.
3 Heuristic for a feasible initialization of feed-forward neural networks
In barrier methods, the initial solution must satisfy the constraints. We propose in this section a heuristic method to construct an initial weight vector $w$ meeting all monotonicity constraints (all terms of the Jacobian matrix will be non-negative).
Consider again relation (5):
$$ J_{ij} = \frac{\partial a_i}{\partial s_j} = a'_i \sum_{k,\ (k,i) \in A} w_{ki} J_{kj} $$
Assuming a monotonically increasing activation function, $a'_i$ is positive. For every hidden neuron $i$ and for all $j \in N_{in}$, we have
$$ w_{ki} J_{kj} \ge 0,\ \forall k \mid (k, i) \in A \quad \Longrightarrow \quad J_{ij} \ge 0. $$
So, we can use this relation to define a simple heuristic for initializing weights that satisfy the monotonicity constraints. For all nodes $i \in N$, define $u_i$ by
$$ u_i = \begin{cases} -1 & \text{if } J_{ij} \le 0,\ \forall j \in N_{in}, \\ 1 & \text{if } J_{ij} \ge 0,\ \forall j \in N_{in}, \\ 0 & \text{otherwise.} \end{cases} $$
The sign of the initial weights may be heuristically defined as follows:
Algorithm 2 (Weight initialization (see Figure 1)).
step 1. Set the following values for $u_i$:
if $i \in N_{in}$ or $i \in N_{out}$, then $u_i = 1$,
if $i$ is a bias neuron, then $u_i = 0$,
if $i$ is a hidden neuron, then $u_i \in \{-1, 1\}$.
step 2. Determine the sign of each weight $w_{ij}$, $(i, j) \in A$, with these rules:
if $u_i u_j > 0$, then $w_{ij} > 0$,
if $u_i u_j < 0$, then $w_{ij} < 0$,
if $u_i u_j = 0$, then the sign of $w_{ij}$ does not matter.
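The sign rules of the heuristic can be sketched as follows. The report fixes only the signs, so the magnitude distribution, the `scale` parameter and the small example network (two inputs, one bias neuron, two hidden neurons, one output) are assumptions made for this illustration.

```python
import random

def init_weights(arcs, u, rng, scale=0.5):
    """Give each arc (i, j) a weight whose sign is the sign of u_i * u_j.

    Magnitudes are drawn uniformly in (0, scale]; when u_i * u_j == 0 (for
    example on arcs leaving a bias neuron) the sign is drawn at random.
    """
    weights = {}
    for (i, j) in arcs:
        magnitude = scale * (1.0 - rng.random())   # strictly positive
        sign = u[i] * u[j]
        if sign == 0:
            sign = rng.choice([-1, 1])
        weights[(i, j)] = sign * magnitude
    return weights

# Hypothetical network: inputs 0 and 1, bias 2, hidden 3 and 4, output 5.
u = {0: 1, 1: 1, 2: 0, 3: -1, 4: 1, 5: 1}
arcs = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 4), (2, 4), (3, 5), (4, 5), (2, 5)]
initial_weights = init_weights(arcs, u, random.Random(0))
```

With these signs, every path from an input to the output crosses either two negative weights (through the hidden neuron with $u = -1$) or two positive ones, so each path contributes a non-negative term to the Jacobian when the activation functions are increasing.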
Figure 1: Heuristic initialization of weights in a feed-forward neural network.
4 Numerical results
To illustrate learning with monotonicity requirements, we consider a toy regression problem: an increasing one-dimensional function perturbed by a simple wavelet. We use a batch gradient algorithm to solve the unconstrained optimization subproblems, with the Armijo rule for the line search at each iteration (or "epoch"). In the case of the barrier function, it is essential not to cross the barrier during the line search. The Armijo rule, which preserves the convergence of descent methods [2], is: for scalars $s > 0$, $\sigma \in [0, \frac{1}{2}]$ and $\beta \in [0, 1]$, we choose $\eta_t = \beta^m s$, where $m$ is the smallest non-negative integer such that:
$$ \Phi(w_t, \mu) - \Phi(w_t + \beta^m s\, d_t, \mu) \ge -\sigma\, \beta^m s\, \nabla \Phi(w_t, \mu)^T d_t $$
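A minimal backtracking implementation of this rule might look as follows. The default parameter values and the cap on the number of reductions are assumptions, as is the convention that the objective returns infinity outside the barrier's domain, which makes the sufficient-decrease test reject any step that crosses the barrier.

```python
import math

def armijo_step(phi, w, grad, d, s=1.0, sigma=0.1, beta=0.5, max_reductions=50):
    """Return eta = beta^m * s for the smallest m satisfying the Armijo condition.

    phi maps a weight list to a float (math.inf when infeasible), grad is the
    gradient of phi at w and d is a descent direction.
    """
    slope = sum(g * di for g, di in zip(grad, d))   # grad^T d, negative for descent
    phi_w = phi(w)
    eta = s
    for _ in range(max_reductions):
        trial = [wi + eta * di for wi, di in zip(w, d)]
        if phi_w - phi(trial) >= -sigma * eta * slope:
            return eta
        eta *= beta
    raise RuntimeError("no admissible step found")

def toy_barrier_objective(w):
    """Hypothetical objective w - log(w), infinite outside w > 0."""
    return w[0] - math.log(w[0]) if w[0] > 0.0 else math.inf

# Starting at w = 2 with d = -gradient, the first two trial points are infeasible.
eta = armijo_step(toy_barrier_objective, [2.0], [0.5], [-0.5], s=8.0)
```

In this toy run the steps 8 and 4 would land at or beyond the barrier and are rejected, and the step 2 is accepted, so the iterate stays strictly feasible.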
We consider here a non-linear one-dimensional regression problem whose target data are generated by the function $y = x^3 - 5x \exp(-100 x^2) + \varepsilon$, where $\varepsilon$ is a uniform random variable in the range $[-0.05, 0.05]$.
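For readers who want to reproduce a comparable data set, the generating function can be sketched as below. The report does not state how the inputs were sampled; drawing them uniformly in $[-1, 1]$ (the range shown on the figure axes) and the random seeds are assumptions.

```python
import math
import random

def target(x):
    """Noise-free target: x^3 perturbed by the wavelet 5 x exp(-100 x^2)."""
    return x ** 3 - 5.0 * x * math.exp(-100.0 * x ** 2)

def make_base(n_patterns, rng):
    """Draw (x, y) patterns with uniform noise in [-0.05, 0.05] added to the target."""
    base = []
    for _ in range(n_patterns):
        x = rng.uniform(-1.0, 1.0)        # assumed input distribution
        noise = rng.uniform(-0.05, 0.05)
        base.append((x, target(x) + noise))
    return base

learning_base = make_base(250, random.Random(0))
validation_base = make_base(1000, random.Random(1))
```

The wavelet makes the noise-free target locally decreasing around the origin, which is precisely what the monotonicity constraints are meant to smooth out in favour of the increasing trend $x^3$.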
We have considered a Multi-Layer Perceptron with 5 hidden units (so we have 16 parameters). The MLP is trained on a learning base of 250 patterns, and we use a second base of 1000 patterns for validation. We performed three training runs with the same initialization (using our heuristic): the first without monotonicity requirements (classical learning), the second with a penalty term for the increasing requirement, and the last with a barrier term.
| Method | Quadratic Error (MSE), learning | Quadratic Error (MSE), validation | Monotonicity Error (MSME), learning | Monotonicity Error (MSME), validation | Non-increasing patterns (%), learning | Non-increasing patterns (%), validation | Epochs |
| classic | 7.47 · 10^-4 | 9.05 · 10^-4 | 0.967 | 1.139 | 6% | 8.3% | 3377412 |
| penalty | 4.18 · 10^-3 | 5.40 · 10^-2 | 0 | 9.67 · 10^-16 | 0% | 0.2% | 3983 |
| barrier | 4.19 · 10^-3 | 5.47 · 10^-2 | 0 | 0 | 0% | 0% | 130759 |
| f(x) = x^3 | 4.22 · 10^-3 | 5.27 · 10^-3 | - | - | - | - | - |
Table 1: Learning results with 250 patterns in the learning base.
Table 1 summarizes our experiment. For each learning process, we give, for the learning and validation bases:
• the Mean Square Error (MSE):
$$ \text{MSE} = \frac{1}{|B|} \sum_{(x,y) \in B}\ \sum_{i \in N_{out}} \big( a_i(x, w) - y_i \big)^2 $$
• the Mean Square Monotonicity Error (MSME):
$$ \text{MSME} = \frac{1}{|B|} \sum_{(x,y) \in B}\ \sum_{i \in N_{out}}\ \sum_{j \in N_{in}} \min\big(0, J_{ij}(x, w)\big)^2 $$
• the rate of patterns where the neural evaluation is non-increasing,
• and the number of epochs of each learning (we stop when $\dfrac{\| w_{k+1} - w_k \|_\infty}{\| w_k \|_\infty} < 10^{-8}$).
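The three error measures above can be computed from per-pattern outputs and Jacobians as in the following sketch. The data layout (one list of outputs and one Jacobian matrix per pattern) is an assumption, and a pattern is counted here as non-increasing when at least one Jacobian entry is strictly negative, which is one possible reading of the report's wording.

```python
def mse(predictions, targets):
    """Sum of squared output errors over all patterns, divided by |B|."""
    total = sum((p - t) ** 2
                for pred, targ in zip(predictions, targets)
                for p, t in zip(pred, targ))
    return total / len(predictions)

def msme(jacobians):
    """Sum of min(0, J_ij)^2 over patterns, outputs and inputs, divided by |B|."""
    total = sum(min(0.0, entry) ** 2
                for J in jacobians for row in J for entry in row)
    return total / len(jacobians)

def non_increasing_rate(jacobians):
    """Percentage of patterns with at least one strictly negative Jacobian entry."""
    violating = sum(1 for J in jacobians
                    if any(entry < 0.0 for row in J for entry in row))
    return 100.0 * violating / len(jacobians)
```

For a single-input, single-output network such as the one used in the experiment, each Jacobian is a 1 x 1 matrix, so the MSME reduces to the mean of the squared negative slopes.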
Figure 2: Learning results of a 5-hidden-unit MLP with a learning base of 250 patterns. The figure has three panels (classical learning, learning with penalty function, learning with barrier function); each plots f(x) = x^3, the learning patterns and the neural estimations for x between -1 and 1.
We also include in the table the MSE obtained by the function $f(x) = x^3$ on the learning and validation bases. We can see that the MLP trained with monotonicity requirements is very close to the underlying function (see also Figure 2). Monotonicity requirements are respected on the learning base for training with penalty and barrier functions, and violations are negligible on the unseen patterns of the validation base. These violations can be reduced by using a larger learning base.
Learning with monotonicity constraints converges faster than unconstrained learning, particularly when we use the quadratic penalty function. The results obtained with a barrier term and with a penalty term are close. However, in this experiment the barrier function also ensures monotonicity for all examples of the validation base.
Conclusion
A learning algorithm satisfying monotonicity requirements for feed-forward neural networks has been presented and successfully applied to a simple regression problem. A rigorous optimization scheme based on penalty methods is used to solve the constrained learning problem. Other optimization techniques, such as the augmented Lagrangian method or trust region algorithms, could also be considered. We also propose a heuristic to initialize a neural network so that it satisfies the monotonicity constraints. We show with a simple example that introducing constraints derived from a priori knowledge of the underlying function can drastically enhance the generalization capacity of the neural network and speed up the learning process.
We have used this learning approach to solve a routing problem with delay constraints in telecommunication networks (see [9, 5]). The neural network was trained to estimate delays induced by the load of the network, which is an unknown increasing function of the traffic rates. This neural estimator was then used in a routing optimization scheme for which the increasing behaviour of the delay function is a fundamental condition.
References
[1] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic
Press, 1982.
[2] Dimitri P. Bertsekas. Nonlinear Programming: Second Edition. Athena Scientific, 1999.
[3] Christopher M. Bishop. Curvature-driven smoothing: a learning algorithm for feed-forward
networks. IEEE Transactions on Neural Networks, 4(5), September 1993.
[4] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,
1995.
[5] Christophe Duhamel, Antoine Mahul, and Alexandre Aussem. Routing with neural-based
QoS constraints. In Proceedings of the first International Network Optimization Conference
(INOC’2003), pages 201–206, Evry-Paris, France, October 2003.
[6] Roger Fletcher. Penalty functions. In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The State of the Art, Bonn 1982, pages 87–114. Springer-Verlag, 1983.
[7] Jouko Lampinen and Arto Selonen. Using background knowledge in multilayer perceptron learning. In M. Frydrych, J. Parkkinen, and A. Visa, editors, Proceedings of the 10th Scandinavian Conference on Image Analysis SCIA'97, volume 2, pages 545–549, 1997.
[8] Antoine Mahul and Alexandre Aussem. Distributed neural networks for QoS estimation in
communication network. International Journal of Computational Intelligence and Applications,
3(3):297–308, 2003.
[9] Antoine Mahul and Alexandre Aussem. Neural-based quality of service estimation in MPLS routers. In Supplementary Proceedings of the International Conference on Artificial Neural Networks (ICANN'2003), pages 390–393, Istanbul, Turkey, June 2003.
[10] Joseph Sill and Yaser S. Abu-Mostafa. Monotonic hints. Advances in Neural Information Processing Systems, 9:634, 1997.
Appendix
We give here a detailed version of the algorithm for computation of the penalty term gradient.
/* Computation of the values to propagate */
/* (identity activation on the inputs, consistent with J_ii = 1 and relation (8)) */
for all i ∈ N_in, do
    a_i ← x_i, a'_i ← 1 and a''_i ← 0
    J_ii ← 1, κ_ii ← 1
    for all j ∈ N_in such that j ≠ i, do J_ij ← 0, κ_ij ← 0
end for
/* Forward propagation */
for all i ∉ N_in in topological order, do
    s_i ← Σ_{k,(k,i) ∈ A} w_ki a_k
    a_i ← f_i(s_i), a'_i ← f'_i(s_i) and a''_i ← f''_i(s_i)
    for all j ∈ N_in, do
        κ_ij ← Σ_{k,(k,i) ∈ A} w_ki J_kj
        J_ij ← a'_i κ_ij
    end for
end for
/* Computation of the values to back-propagate */
for all i ∈ N_out, do
    for all j ∈ N_in, do
        J+_ij ← a'_i ϕ'(J_ij)
        ν+_ij ← a''_i κ_ij ϕ'(J_ij)
    end for
end for
/* Backward propagation */
for all i ∉ N_out in reverse topological order, do
    for all j ∈ N_in, do
        x ← Σ_{k,(i,k) ∈ A} w_ik J+_kj
        y ← Σ_{k,(i,k) ∈ A} w_ik ν+_kj
        J+_ij ← a'_i x
        ν+_ij ← a''_i κ_ij x + a'_i y
    end for
end for
/* Final step: gradient computation */
for all c = (k, l) ∈ A, do
    x ← 0
    for all j ∈ N_in, do
        x ← x + J_kj J+_lj + a_k ν+_lj
    end for
    ∂P/∂w_kl ← x
end for
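The forward-backward procedure can be turned into runnable Python as in the sketch below, which follows relations (7), (8), (11), (12) and (13). It is an illustration under stated assumptions rather than the authors' implementation: neurons are numbered in topological order with the inputs first, inputs use the identity activation, all other neurons use tanh, and output neurons have no outgoing arcs. The names `penalty_and_gradient`, `phi` and `dphi` are hypothetical.

```python
import math

def penalty_and_gradient(x, weights, n_in, n_neurons, outputs, phi, dphi):
    """Return P(x, w) and a dict mapping each arc (k, l) to dP/dw_kl."""
    inputs = range(n_in)
    preds = {i: [] for i in range(n_neurons)}
    succs = {i: [] for i in range(n_neurons)}
    for (k, i), w in weights.items():
        preds[i].append((k, w))
        succs[k].append((i, w))

    # Input neurons: a_i = x_i, a'_i = 1, a''_i = 0, J_ij = kappa_ij = delta_ij.
    a = list(x) + [0.0] * (n_neurons - n_in)
    d1 = [1.0] * n_neurons
    d2 = [0.0] * n_neurons
    kappa = [[1.0 if i == j else 0.0 for j in inputs] for i in range(n_neurons)]
    J = [row[:] for row in kappa]

    # Forward propagation: relations (8) and (7).
    for i in range(n_in, n_neurons):
        a[i] = math.tanh(sum(w * a[k] for k, w in preds[i]))
        d1[i] = 1.0 - a[i] ** 2              # tanh'
        d2[i] = -2.0 * a[i] * d1[i]          # tanh''
        for j in inputs:
            kappa[i][j] = sum(w * J[k][j] for k, w in preds[i])
            J[i][j] = d1[i] * kappa[i][j]

    # Backward propagation: relations (11) and (12).
    J_plus = [[0.0] * n_in for _ in range(n_neurons)]
    nu_plus = [[0.0] * n_in for _ in range(n_neurons)]
    for i in reversed(range(n_neurons)):
        for j in inputs:
            if i in outputs:
                J_plus[i][j] = d1[i] * dphi(J[i][j])
                nu_plus[i][j] = d2[i] * kappa[i][j] * dphi(J[i][j])
            else:
                sx = sum(w * J_plus[k][j] for k, w in succs[i])
                sy = sum(w * nu_plus[k][j] for k, w in succs[i])
                J_plus[i][j] = d1[i] * sx
                nu_plus[i][j] = d2[i] * kappa[i][j] * sx + d1[i] * sy

    # Final step: relation (13).
    grad = {(k, l): sum(J[k][j] * J_plus[l][j] + a[k] * nu_plus[l][j] for j in inputs)
            for (k, l) in weights}
    penalty = sum(phi(J[i][j]) for i in outputs for j in inputs)
    return penalty, grad
```

A finite-difference comparison of the returned gradient against perturbed evaluations of the penalty is a simple way to validate such an implementation on a small network before using it in training.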
