Utama Deep Learning: Fundamentals, Theory and Applications

Deep Learning: Fundamentals, Theory and Applications

, , ,

The purpose of this edited volume is to provide a comprehensive overview on the fundamentals of deep learning, introduce the widely-used learning architectures and algorithms, present its latest theoretical progress, discuss the most popular deep learning platforms and data sets, and describe how many deep learning methodologies have brought great breakthroughs in various applications of text, image, video, speech and audio processing.

Deep learning (DL) has been widely considered as the next generation of machine learning methodology. DL attracts much attention and also achieves great success in pattern recognition, computer vision, data mining, and knowledge discovery due to its great capability in learning high-level abstract features from vast amount of data. This new book will not only attempt to provide a general roadmap or guidance to the current deep learning methodologies, but also present the challenges and envision new perspectives which may lead to further breakthroughs in this field.

This book will serve as a useful reference for senior (undergraduate or graduate) students in computer science, statistics, electrical engineering, as well as others interested in studying or exploring the potential of exploiting deep learning algorithms. It will also be of special interest to researchers in the area of AI, pattern recognition, machine learning and related areas, alongside engineers interested in applying deep learning models in existing or new practical applications.

Tahun:
2019
Edisi:
1st ed.
Penerbit:
Springer International Publishing
Bahasa:
english
Halaman:
168
ISBN 13:
978-3-030-06073-2
Series:
Cognitive Computation Trends 2
File:
PDF, 4.43 MB
Unduh (pdf, 4.43 MB)

You may be interested in Powered by Rec2Me

 
 
You can write a book review and share your experiences. Other readers will always be interested in your opinion of the books you've read. Whether you've loved the book or not, if you give your honest and detailed thoughts then people will find new books that are right for them.
Cognitive Computation Trends 2
Series Editor: Amir Hussain

Kaizhu Huang · Amir Hussain
Qiu-Feng Wang
Rui Zhang Editors

Deep Learning:
Fundamentals,
Theory and
Applications

Cognitive Computation Trends
Volume 2

Series Editor
Amir Hussain
School of Computing
Edinburgh Napier University
Edinburgh, UK

Cognitive Computation Trends is an exciting new Book Series covering cuttingedge research, practical applications and future trends covering the whole spectrum
of multi-disciplinary fields encompassed by the emerging discipline of Cognitive
Computation. The Series aims to bridge the existing gap between life sciences,
social sciences, engineering, physical and mathematical sciences, and humanities.
The broad scope of Cognitive Computation Trends covers basic and applied work
involving bio-inspired computational, theoretical, experimental and integrative
accounts of all aspects of natural and artificial cognitive systems, including:
perception, action, attention, learning and memory, decision making, language
processing, communication, reasoning, problem solving, and consciousness.

More information about this series at http://www.springer.com/series/15648

Kaizhu Huang • Amir Hussain • Qiu-Feng Wang
Rui Zhang
Editors

Deep Learning:
Fundamentals, Theory
and Applications

123

Editors
Kaizhu Huang
Xi’an Jiaotong-Liverpool University
Suzhou, China

Qiu-Feng Wang
Xi’an Jiaotong-Liverpool University
Suzhou, China

Amir Hussain
School of Computing
Edinburgh Napier University
Edinburgh, UK
Rui Zhang
Xi’an Jiaotong-Liverpool University
Suzhou, China

ISSN 2524-5341
ISSN 2524-535X (electronic)
Cognitive Computation Trends
ISBN 978-3-030-06072-5
ISBN 978-3-030-06073-2 (eBook)
https://doi.org/10.1007/978-3-030-06073-2
Library of Congress Control Number: 2019930405
© Springer Nature Switzerland AG 2019
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitati; on,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

Over the past 10 years, deep learning has attracted a lot of attention, and many
exciting results have been achieved in various areas, such as speech recognition,
computer vision, handwriting recognition, machine translation, and natural language understanding. Rather surprisingly, the performance of machines has even
surpassed humans’ in some specific areas. The fast development of deep learning
has already started impacting people’s lives; however, challenges still exist. In
particular, the theory of successful deep learning has yet to be clearly explained,
and realization of state-of-the-art performance with deep learning models requires
tremendous amounts of labelled data. Further, optimization of deep learning models
can require substantially long times for real-world applications. Hence, much effort
is still needed to investigate deep learning theory and apply it in various challenging
areas. This book looks at some of the problems involved and describes, in depth, the
fundamental theories, some possible solutions, and latest techniques achieved by
researchers in the areas of machine learning, computer vision, and natural language
processing.
The book comprises six chapters, each preceded by an introduction and followed
by a comprehensive list of references for further reading and research. The chapters
are summarized below:
Density models provide a framework to estimate distributions of the data, which
is a major task in machine learning. Chapter 1 introduces deep density models
with latent variables, which are based on a greedy layer-wise unsupervised learning
algorithm. Each layer of deep models employs a model that has only one layer of
latent variables, such as the Mixtures of Factor Analyzers (MFAs) and the Mixtures
of Factor Analyzers with Common Loadings (MCFAs).
Recurrent Neural Networks (RNN)-based deep learning models have been
widely investigated for the sequence pattern recognition, especially the Long Shortterm Memory (LSTM). Chapter 2 introduces a deep LSTM architecture and a
Connectionist Temporal Classification (CTC) beam search algorithm and evaluates
this design on online handwriting recognition.
Following on above deep learning-related theories, Chapters 3, 4, 5 and 6
introduce recent advances on applications of deep learning methods in several
v

vi

Preface

areas. Chapter 3 overviews the state-of-the-art performance of deep learning-based
Chinese handwriting recognition, including both isolated character recognition and
text recognition.
Chapters 4 and 5 describe application of deep learning methods in natural
language processing (NLP), which is a key research area in artificial intelligence
(AI). NLP aims at designing computer algorithms to understand and process natural
language in the same way as humans do. Specifically, Chapter 4 focuses on
NLP fundamentals, such as word embedding or representation methods via deep
learning, and describes two powerful learning models in NLP: Recurrent Neural
Networks (RNN) and Convolutional Neural Networks (CNN). Chapter 5 addresses
deep learning technologies in a number of benchmark NLP tasks, including entity
recognition, super-tagging, machine translation, and text summarization.
Finally, Chapter 6 introduces Oceanic data analysis with deep learning models,
focusing on how CNNs are used for ocean front recognition and LSTMs for sea
surface temperature prediction, respectively.
In summary, we believe this book will serve as a useful reference for senior
(undergraduate or graduate) students in computer science, statistics, electrical
engineering, as well as others interested in studying or exploring the potential of
exploiting deep learning algorithms. It will also be of special interest to researchers
in the area of AI, pattern recognition, machine learning, and related areas, alongside
engineers interested in applying deep learning models in existing or new practical
applications. In terms of prerequisites, readers are assumed to be familiar with basic
machine learning concepts including multivariate calculus, probability and linear
algebra, as well as computer programming skills.
Suzhou, China
Edinburgh, UK
Suzhou, China
Suzhou, China
March 2018

Kaizhu Huang
Amir Hussain
Qiu-Feng Wang
Rui Zhang

Contents

1 Introduction to Deep Density Models with Latent Variables . . . . . . . . . . .
Xi Yang, Kaizhu Huang, Rui Zhang, and Amir Hussain
2

Deep RNN Architecture: Design and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . .
Tonghua Su, Li Sun, Qiu-Feng Wang, and Da-Han Wang

3

Deep Learning Based Handwritten Chinese Character and Text
Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Xu-Yao Zhang, Yi-Chao Wu, Fei Yin, and Cheng-Lin Liu

4

Deep Learning and Its Applications to Natural Language
Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Haiqin Yang, Linkai Luo, Lap Pong Chueng, David Ling,
and Francis Chin

1
31

57

89

5

Deep Learning for Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . 111
Jiajun Zhang and Chengqing Zong

6

Oceanic Data Analysis with Deep Learning Models . . . . . . . . . . . . . . . . . . . . . 139
Guoqiang Zhong, Li-Na Wang, Qin Zhang, Estanislau Lima, Xin Sun,
Junyu Dong, Hui Wang, and Biao Shen

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

vii

Chapter 1

Introduction to Deep Density Models
with Latent Variables
Xi Yang, Kaizhu Huang, Rui Zhang, and Amir Hussain

Abstract This chapter introduces deep density models with latent variables which
are based on a greedy layer-wise unsupervised learning algorithm. Each layer of the
deep models employs a model that has only one layer of latent variables, such as the
Mixtures of Factor Analyzers (MFAs) and the Mixtures of Factor Analyzers with
Common Loadings (MCFAs). As the background, MFAs and MCFAs approaches
are reviewed. By the comparison between these two approaches, sharing the
common loading is more physically meaningful since the common loading is
regarded as a kind of feature selection or reduction matrix. Importantly, MCFAs can
remarkably reduce the number of free parameters than MFAs. Then the deep models
(deep MFAs and deep MCFAs) and their inferences are described, which show that
the greedy layer-wise algorithm is an efficient way to learn deep density models
and the deep architectures can be much more efficient (sometimes exponentially)
than shallow architectures. The performance is evaluated between two shallow
models, and two deep models separately on both density estimation and clustering.
Furthermore, the deep models are also compared with their shallow counterparts.
Keywords Deep density model · Mixture of factor analyzers · Common
component factor loading · Dimensionality reduction

1.1 Introduction
Density models provide a framework for estimating distributions of the data
and therefore emerge as one of the central theoretical approaches for designing
machines (Rippel and Adams 2013; Ghahramani 2015). One of the essential
X. Yang · K. Huang () · R. Zhang
Xi’an Jiaotong-Liverpool University, Suzhou, China
e-mail: xi.yang@xjtlu.edu.cn; kaizhu.huang@xjtlu.edu.cn; rui.zhang02@xjtlu.edu.cn
A. Hussain
School of Computing, Edinburgh Napier University, Edinburgh, UK
e-mail: a.hussain@napier.ac.uk
© Springer Nature Switzerland AG 2019
K. Huang et al. (eds.), Deep Learning: Fundamentals, Theory and Applications,
Cognitive Computation Trends 2, https://doi.org/10.1007/978-3-030-06073-2_1

1

2

X. Yang et al.

probabilistic methods is to adopt latent variables, which reveals data structure
and explores features for subsequent discriminative learning. The latent variable
models are widely used in machine learning, data mining and statistical analysis.
In the recent advances in machine learning, a central task is to estimate the deep
architectures for modeling the type of data which has the complex structure such as
text, images, and sounds. The deep density models have always been the hot spot
for constructing sophisticated density estimates. The existed models, probabilistic
graphical models, not only prove that the deep architectures can create a better prior
for complex structured data than the shallow density architectures theoretically,
but also has practical significance in prediction, reconstruction, clustering, and
simulation (Hinton et al. 2006; Hinton and Salakhutdinov 2006). However, they
often encounter computational difficulties in practice due to a large number of free
parameters and costly inference procedures (Salakhutdinov et al. 2007).
To this end, the greedy layer-wise learning algorithm is an efficient way to
learn deep architectures which consist of many latent variables layers. With this
algorithm, the first layer is learned by using a model that has only one layer of
latent variables. Especially, the same scheme can be extended to train the following
layers at a time. Compared with previous methods, this deep latent variable model
has fewer free parameters by sharing the parameters between successive layers,
and a simpler inference procedure due to a concise objective function. The deep
density models with latent variables are often used to solve unsupervised tasks. In
many applications, the parameters of these density models are determined by the
maximum likelihood, which typically adopts the expectation-maximization (EM)
algorithm (Ma and Xu 2005; Do and Batzoglou 2008; McLachlan and Krishnan
2007). In Sect. 1.4, we shall see that EM is a powerful and elegant method to find
the maximum likelihood solutions for models with latent variables.

1.1.1 Density Model with Latent Variables
Density estimation is a major task in machine learning. In general, the most
commonly used method is Maximum Likelihood Estimate
(MLE). In this way, we
N
1
can establish a likelihood function L(μ, Σ) =
n=1 ln p(yn |μ, Σ). However,
the derivation of directly calculating the likelihood functions has computational
difficulties because of the very high dimensions of Σ. Thus, a set of variables x
is defined to govern multiple y, and when the distribution of p(x) is found, p(y)
can be determined by the joint distribution over y and x. Typically, the covariates
Σ are ruled out. In this setting, x is assumed to affect the manifest variables
(observable variables), but it is not directly observable. Thus, x is so-called the latent
variable (Loehlin 1998). Importantly, the introduction of latent variables allows
the formation of complicated distributions from simpler components. In statistics,

1y
n

- N observation data, μ - mean, Σ - covariates

1 Introduction to Deep Density Models with Latent Variables

3

Table 1.1 Different types of latent variable models. We fucose on the Latent Profile Analysis and
the Latent Class Analysis, where the latent variables are treated as a multinomial distribution
Manifest variables
Continuous
Discrete

Latent variables
Continuous
Factor analysis
Latent trait analysis

Discrete
Latent profile analysis
Latent class analysis

latent variable models are probabilistic models that represent the existence of a set
of latent variables and relate them to manifest variables (Vermunt and Magidson
2004). These models are typically grouped according to whether the nature of the
manifest and latent variables are discrete or continuous (see Table 1.1) (Galbraith
et al. 2002; Everett 2013). Whilst, they are also widely applied to analyze data,
especially in the presence of repeated observations, multi-level data, and panel
data (Bishop 1998). In this chapter, we shall see the finite mixture models which
based on discrete latent variables (details in Sect. 1.2). This finite mixture model
is a convex combination of more than one density functions and comes up with an
expression of heterogeneity in a finite number of latent classes, such as the Mixtures
of Factor Analyzers (MFAs) (Ghahramani and Hinton 1996) and the Mixtures of
Factor Analyzers with Common Loadings (MCFAs) (Baek et al. 2010).

1.1.2 Deep Architectures via Greedy Layer-Wise Learning
Algorithm
The latent variable models are usually expressed as a shallow architecture which
just has one visible layer and one single hidden layer, such as a directed acyclic
graph (see in Fig. 1.1). As shown in Sects. 1.2.4.1 and 1.2.4.2, shallow architectures
has achieved good performances in practice. Despite all this, shallow architectures
also have a serious problem which is very inefficient in dealing with the complex
structural data or large amounts of hidden units (Bengio et al. 2007). In contrast,
the deep architectures have multiple hidden layers to help in learning better
representations of the complicated hidden units. A directed deep architecture is
illustrated in Fig. 1.2. Also, deep architectures reduce the need for hand-crafted
features which is a very time-consuming process requiring expert knowledge.
Fig. 1.1 A directed acyclic
graph represents a shallow
density model which has only
one hidden layer

4

X. Yang et al.

Fig. 1.2 A multi-layer
directed acyclic graph
represents a deep density
model which has three hidden
layers and one visible layer

However, it is difficult to train all layers at once when a multi-layered architecture
to build generative models of data (Arnold and Ollivier 2012).
To this end, the greedy layer-wise procedures are developed for training deep
generative models by using the same criterion for optimizing each layer starting
from the bottom and for transferring the problem upwards to the next layer, socalled the layer-wise learning algorithm (Arnold and Ollivier 2012). Thus, the
latent variable models can be an ideal criterion for a deep architecture, as well as
providing a framework for building more complex probability distributions (Bengio
et al. 2009). Specifically, although a deep latent variable model has many layers
of latent variables, the learning procedure is just using a model has only one layer
of latent variables for optimizing each layer (Tang et al. 2012). Besides, the deep
architectures can also be used for a finite mixture model, which means each subpopulation is assumed to come from the manifest variables having more than one
distributions. Deep density models also have very important practical significance
in many disciplines, especially, attracted considerable attention in unsupervised
learning (Patel et al. 2015; Tang et al. 2013; Yang et al. 2017).

1.1.3 Unsupervised Learning
In practical applications, the (deep) density models with latent variables are typically used for solving unsupervised tasks, especially in dimensionality reduction,
and clustering. The (deep) density model with latent variables is a probabilistic
algorithm which based on modeling the density of the observed data and assumes
that there are some potential/unobservable processes to manage data generation.
Clustering is a method of putting similar data points together, and probabilistic
unsupervised learning builds a generation model that describes the item’s clustering.
Wherein, each cluster can be represented by a precise component distribution
(multiple distributions for a deep model). Hence, each data point is generated from
the selection of a component matters (Law et al. 2004; Figueiredo and Jain 2002).

1 Introduction to Deep Density Models with Latent Variables

5

The idea is that all the same types of data points belonging to the same cluster (all
from the same distribution) are more or less equivalent, and any differences between
them are accidental (Bishop 2006).
In dimensionality reduction, the primary role of the latent variables is to
allow a complicated distribution over observed variables constructed from simpler
conditional distributions. Importantly, the dimensional of latent variables are always
lower than the observable variables. Therefore, the higher-dimensional observable
variables are reduced to the low-dimensional latent variables to represent a model.

1.2 Shallow Architectures of Latent Variable Models
This section introduces the density models with latent variables that explicitly shows
how inferences of the models correspond to operations in a single layer of a deep
architecture. Density models are widely used in data mining, pattern recognition,
machine learning, and statistical analysis. As a basic model, Factor analysis (FA)
is commonly used to define a appropriate density distribution of data to promote
the well-known mixtures of Gaussians (Everitt 1984). MFAs are improved density
estimators that model by allowing each Gaussian in the mixture to be represented
in a different lower-dimensional manifold. In particular, MFAs simultaneously
perform clustering and dimensionality reduction of the data (Montanari and Viroli
2011; McLachlan and Peel 2000). Exploiting similar ideas of MFAs, MCFAs are
proposed as another mixture framework by sharing a common component loading.
Importantly, unlike MFAs which specify a multivariate standard normal prior for
the latent factors over all the components, MCFAs exploit on each latent unit
Gaussian density distributions whose mean and covariance could be learned from
data. On the other hand, sharing a common factor loading for all the components
reduces the parameters significantly. While, since a common factor loading can be
considered as a feature selection or dimensionality reduction matrix, it can be well
justified and physically more meaningful (Baek and McLachlan 2011; Tortora et al.
2016). Since MFAs and MCFAs are implemented involving steps that attribute the
postulated sub-populations to individual observations, these models are typically
used in unsupervised learning or clustering procedures, and achieved impressive
performance (Mclanchlan et al. 2003; Yang et al. 2017).

1.2.1 Notation
We begin by considering the problem of identifying groups/clusters of data points in
a multidimensional space. Suppose we have a dataset y consisting of N observations
which have a p-dimensional vector {y1 , y2 , . . . , yp } of each feature variable. By
introducing latent variables, the manifest distribution p(y) can be signified in terms
of a q-dimensional vector {z1 , z2 , . . . , zq } of latent variables z, where q is a small

6

X. Yang et al.

number than p (Smaragdis et al. 2006). Through this process, the joint distribution
P (y, z) is decomposed into the conditional distribution of the feature variables given
the latent variables and the product of the marginal distribution of the latent variables
p(z) (Bishop 1998).
P (y, z) = p(y|z)p(z) = p(z)

p


p(yi |z).

(1.1)

i=1

As shown in Fig. 1.3, the latent variable models can be graphically represented by a
directed graph.
By considering a mixture of C multinomial distributions, the density of y could
be modeled as a finite mixture model
P (y; θ ) =

C


πc p(y|c; θ c ), where

c = 1, . . . , C.

(1.2)

c=1

These distributions are referred to as components of this model (or sub-populations
of observations) and describe the p-variate density function (Figueiredo and Jain
2002). The mixture distribution can be expressed by a simple graph, as shown in
Fig. 1.4. These components must belong to the same parametric family of distributions but with different parameters θ . θ c denotes the observations’ parameters
which are associated with the component c. For instance, the mixture components
belong to Gaussian distributions, and then each component has different means and
variances. Additionally, the parameters are random variables in a Bayesian setting,
and prior probability distributions p(y|c; θ c ) are placed over the variables (Bishop
1998; Loehlin 1998). π c = p(c) denotes the C mixture weights and satisfies the
requirement that probabilities sum to 1.
Fig. 1.3 The factorization
property of Eq. 1.1 can be
expressed in terms of a
directed graph, in which the
feature variables yi are
independent of the latent
variables z
Fig. 1.4 A simple mixture
model is expressed in terms
of a Bayesian network

1 Introduction to Deep Density Models with Latent Variables

7

Fig. 1.5 A mixture of latent variables model is graphically expressed by the Bayesian network.
The feature variables yi are conditionally independent of the given mixture weights c and latent
variables z

Then, a mixture of latent variables model can be obtained by combining ingredients from the technical of mixture models with the idea of latent variables (Fokoué
2005). Consequently, the joint distribution P (y, z) is derived as follows and the
corresponding Bayesian network is shown in Fig. 1.5.
P (y, z) =

C


p(y|p(c), z)p(c)p(z|c) =

c=1

C

c=1

πc p(zc )

p


p(yi |z).

(1.3)

i=1

This mixture model is explicitly computed with discrete latent variables. Hence, the
integral is replaced by a sum.

1.2.2 Mixtures of Factor Analyzers
We first introduce a globally nonlinear latent variable model which is referred
to as Mixture of Factor Analyzers (MFAs). MFAs are considered as extending
the traditional Factor Analysis by ideas of the analysis of the finite mixture of
distributions. To separate the observations independently into c non-overlapping
components, the MFAs approach is modeled as
y=

C


μc + Wc z +  c

with probability π c (c = 1, . . . , C),

(1.4)

c=1

where μc is a p-dimensional mean vector associated with the component c; Wc
is a p × q factor loading matrix of the cth component, where q < p; and πc =
p(c) denotes the mixing proportion. Each (unobservable) factor z is distributed
independently by a normal distribution with zero mean and a q × q identity
covariance matrix, N(0, Iq ). The noise model  c is also independent and assumed
as a Gaussian distribution with zero mean and a q-dimensional diagonal covariance
matrix, N (0, Ψ c ). Given the latent variables and the component indicator variables,
the conditional distribution of observations p(y|c, z) follows a Gaussian distribution
which has mean Wc z + μc and variance Ψ c .

8

X. Yang et al.

The joint distribution is given by p(y|c, z)p(z|c)p(c), and the density of the
observed data P (y; θ ) is obtained by summing up all mixture components. Then,
the MFAs model is obtained by marginalizing over the latent variables

p(y|c) = z p(y|c, z)p(z|c)dz = N(y; μc , Wc WTc + Ψ c ),
(1.5)
C
C
P (y; θ ) = c=1 πc p(y|c, z)p(z|c) = c=1 πc N (y; μc , Wc WTc + Ψ c ), (1.6)
where θ denotes the model parameter vector. It consists of the mixture weight
π c , the means of the component μc , the factor loading Wc and the covariance of
component matrices Ψ c (Montanari and Viroli 2011)
⎞C
πc
⎟
⎜
μc
⎟
.
θc = ⎜
⎝ Wc (vec) ⎠
Ψ c (diag) c=1
⎛

(1.7)

Therefore, the MFAs can be considered as a globally nonlinear latent variable
model, where C Gaussian factors are fitted on the data.

1.2.2.1

Maximum Likelihood

After determining the “best-fitting” distribution (Eq. 1.6), the parameters will be
estimated for that distribution. Maximum likelihood estimation (MLE) is a common
technique for estimating the parameters of a probability distribution. In other words,
the MLE can be known as to maximize the sample likelihood by estimating the
values of the parameters. These unknown parameters are contained in a vector θ (as
Eq. 1.7). Loosely speaking, the goal of MLE is to maximize a likelihood function
of the sample data. Suppose we have a sample of independent and identically
distributed (iid) random variables, {y1 ; y2 ; . . . ; yN }, which is described by the
MFAs model (Eq. 1.6). The basic likelihood function is defined as
P (y; θ ) =

C

c=1

πc

N


N(yn ; μc , Wc WTc + Ψ c ).

(1.8)

i=1

Since maximizing the product is very tedious, logarithms are often used to turn
multiplications into summation. It will be equivalent to maximise the log-likelihood
since the logarithm is an increasing function (Montanari and Viroli 2011)
L(θ ) = log P (y; θ ) =

N
C 

c=1 n=1

log{πc N(yn |μc , Wc WTc + Ψ c )},

θ̂ = arg max L(θ ).
θ

(1.9)
(1.10)

1 Introduction to Deep Density Models with Latent Variables

9

Here, θ̂ denotes the estimated parameters when the empirical log-likelihood has a
maximum value.

1.2.2.2

Maximum A Posteriori

Maximum A Posteriori (MAP) is also a method for estimating variables in the
probability distributions settings. The MAP is closely related to the MLE, which can
be regarded as a regularization of the MLE. However, the MAP is more interested in
a posterior distribution, not only the likelihood. For inference, MAP usually comes
up with Bayesian setting, and the posterior over the components can be found by
recalling Bayes rule
q(z, c|y) = q(z|y, c)q(c|y),
q(c|y) =

p(y|c)p(c)
C
h=1 p(y|h)p(h)

∝ p(y|c)p(c).

(1.11)
(1.12)

More concretely, the posterior over the latent factors is also a multivariate Gaussian
density on z given y and c:
q(z|y, c) = N(z; κ c , V−1
c ),

(1.13)

where
T
V−1
c = I + Wc Ψ c Wc ,
T −1
κ c = V−1
c Wc Ψ c (y − μc ).

(1.14)

A point estimate can be made by selecting the component c with maximum a
posterior probability
ĉ = arg max p(c)p(c|y).
c

(1.15)

Then, the likelihood in the MLE (Eq. 1.10) is replaced by the posterior. Consequently, MAP estimation will back at MLE equation
θ̂ MAP = arg max
θ

N
C 

c=1 n=1

log{p(y|c)p(c)} = θ̂ MLE .

(1.16)

10

X. Yang et al.

1.2.3 Mixtures of Factor Analyzers with Common Factor
Loadings
Suppose the observations y follow a Gaussian distribution N (y; μc , Σ c ). As a
special case of MFAs, MCFAs further assumes the additional restrictions:
μc = Aξ c ;

Σ c = AΩ c AT + Ψ ;

Ψc = Ψ;

Wc = AKc .

(1.17)

Thereby, we get a new linear model by rewriting Eq. 1.4
y=

C


Azc + 

with probability π c (c = 1, . . . , C).

(1.18)

c=1

Conventionally, A is the p × q common loading matrix, and zc is refered to as
hidden variables or latent factors. Different with the latent factor of MFAs, this
common loading matrix can be regarded as a global transformation matrix, and the
Gaussian distributions over each latent factor whose mean and covariance could be
learned from data. These settings not only reduce the parameters significantly but
also estimate the density accurately. Besides, a multivariate standard Gaussian prior
of MFAs may limit the flexibility.
To use this framework, the MCFAs are defined to a directed generative model, as
follows
p(c) = πc ,

C


πc = 1;

c=1

 ∼ N(0, Ψ );

y|c, z ∼ N(Azc , Ψ ),

(1.19)

where Ψ is a p × p diagonal matrix which represents the variances of the
independent noise. c denotes the component indicator over the C total components
of the mixture, and p(c) = πc is a mixing proportion. The prior of the latent factor
z given c follows a Gaussian density distribution:
p(z|c) = N(zc ; ξ c , Ω c ).
Here, ξ c is a q-dimension vector and Ω c is a q × q positive definite symmetric
matrix. With the above definitions, the density of y given c can be written as a
shallow form by integrating out the latent variables z
p(y|c) =



z p(y|c, z)p(z|c)dz

μc = Aξ c ,

= N (y; μc , Σ c ),

Σ c = AΩ c AT + Ψ .

(1.20)

1 Introduction to Deep Density Models with Latent Variables

11

Finally, the marginal density of MCFAs model on observed data y is then given by
a mixture which is obtained by summing the joint distribution p(y|c, z)p(z|c)p(c)
p(y) =

C


p(c)p(y|c),

c=1

P (y; θ c ) =

C


πc N(y; Aξ c , AΩ c AT + Ψ ).

c=1

Apparently, the model of MCFAs is also a multivariate Gaussian with constrained
mean and covariance. Here, the parameter vector consists of the global parameters,
the common loading matrix Ac , the covariance matrix Ψ , and the local parameters
for each mixture component
⎛

⎞C
πc
θ c = ⎝ ξ c (vec) ⎠
.
Ω c (Sym) c=1
1.2.3.1

(1.21)

Maximun Likelihood

In Sect. 1.2.2.1, the MLE of MFAs was introduced. We now turn to the MLE of
MCFAs with same inference procedure. A sample of iid random variables was
described by the MCFAs model (Eq. 1.21). The log-likelihood function is given by
L(θ) = log P (y; θ ) =

N
C 


log{πc N(yn |Aξ c , AΩ c AT + Ψ )},

(1.22)

c=1 n=1

When the empirical log-likelihood has a maximum, the maximizing a function is
shown as follows
θ̂ = arg max L(θ).
θ

(1.23)

A general technique for finding maximum likelihood estimators in latent variable
models is the Expectation-maximization(EM) algorithm. In Sect. 1.4, we shall see
that the Maximum likelihood learning can straightforward use the EM algorithm.

1.2.3.2

Maximum A Posteriori

In Maximum A Posteriori (MAP), the formulation of the posterior can be expressed
using the Bayes rule:

12

X. Yang et al.

q(c|y; θ c ) =

πc N(y; Aξ c , AΩ c AT + Ψ )
.
C

T
πh N(y; Aξ h , AΩ h A + Ψ )

(1.24)

h=1

The component c can be selected by maximizing a posterior probability
ĉ = arg max p(c)q(c|y; θ c ).
c

(1.25)

The posterior of the latent factors is also a multivariate Gaussian density on z
given the observations y and the component c:
q(z|y, c) = N(z; κ c , V−1
c ),

(1.26)

where
−1
T −1 A,
V−1
c = Ωc + A Ψ
T −1 (y − Aξ ).
κ c = ξ c + V−1
c
c A Ψ

(1.27)

More concretely, the posterior probability over the components of the mixture
can be found by p(c|y; θ c ) = pr{ωc = 1|y}, where ωc = 1 if y belongs to the cth
component, otherwise ωc = 0.

1.2.4 Unsupervised Learning
The MFAs model performs the dimensionality reduction of data by making locally
linear assumptions. Therefore, the model transforms the points of different clusters
into different sub-spaces, as show in Fig. 1.6. The principle is to maximize the
similarity of points from the same cluster. Importantly, the MCFAs model performs
the dimensionality reduction of data by making global linear assumptions, which
has capability to transform the different clusters into a sub-space, as shown in
Fig. 1.7. Different with MFAs, MCFAs not only follow the principle of maximizing
the similarity but also to maximize the distance between clusters.
We conduct extensive experiments on a variety of datasets to evaluate the
performance of both algorithms, including artificial data, gray images, and digitized
aerial image. The following datasets are used in our empirical experiments.
• ULC-3: The urban land cover (ULC) data is used to classify a high resolution
aerial image which consists of 3 types with 273 training samples, 77 test samples
and 147 attributes (Johnson and Xie 2013; Johnson 2013).
• Coil-4-proc: This dataset contains images for 4 objects discarding the background and each object has 72 samples (Nene et al. 1996). The images are down

1 Introduction to Deep Density Models with Latent Variables

13

Fig. 1.6 The sketch of clustering and dimensionality reduction. MFAs drop the data points of
different clusters into different sub-spaces and cluster them at the same time. Different color
represents different cluster

Fig. 1.7 The sketch of clustering and dimensionality reduction. MCFAs performs the dimensionality reduction and clustering simultaneously.Different color represents different cluster

sampled into 32 by 32 pixels and then reshaped to a 1024-dimensional vector.
There are just 248 samples in the training set and 40 samples in the test set.
• Leuk72_3k: This dataset is an artificial dataset including 3 classes which have
been drawn from randomly generated Gaussian mixtures. The Leuk72_3k has
only 54 training samples and 18 test samples with 39 attributes.
• USPS1-4: This handwriting digit data contains 1 to 4 digits images of size 16 by
16 pixels. Each image is reshaped to a 256-dimensional vector. The training set
includes 100 samples of each digit and the test set also consists of 100 of each
digit.

X. Yang et al.

Average Log-likelihood (Training set)

14

-100

ULC-3

Dataset
Coil-4-proc
Leuk72_3k

USPS1-4

-101
-102
-103
MFAs

-104

MCFAs

Fig. 1.8 Performance on various real data (on Training Set) in terms of the log-likelihood (the
larger, the better)

1.2.4.1

Empirical Results

Finite mixture models are able to present arbitrarily complex probability density
functions, which fact makes them an excellent choice for representing complex class-conditional (Figueiredo and Jain 2002). Empirically, the average loglikelihood as a criterion for examining the quality of the density estimates after
modeling the density of the observed data. The empirical results are demonstrated
for both MFAs and MCFAs. The model parameters are estimated over the training
data by maximizing the log-likelihood value. By multiple trials, the average loglikelihood value on the training data is shown in Fig. 1.8. In order to intuitively
observe the trend of the entire experimental results, the image results have to be
treated as logarithmic, which is used to controlled all the results in the same range.
On the testing data, the log-likelihood value obtained using only the model has
been trained and did not update the model parameters, and the results are shown in
Fig. 1.9. From the comparison of both results, the log-likelihood values obtained by
the MCFAs model are lower than that of the MFAs on all datasets. Consequently,
sharing a common loading can improve the true log-likelihood dramatically.

1.2.4.2

Clustering

Then, the clustering error rate is demonstrated of both training and testing datasets
and the best results are reported from multiple trials. In the experiments, both
methods have been initialized by random assortment. Also, the number of mixes
is set to be the same as the number of real categories. Comparing both MFAs and
MCFAs results on the training data in Fig. 1.10, the results of these two methods are
not significantly different on the Coil-4-proc dataset, and even get the same result
on the Leuk72_3k dataset. However, comparing the results in Fig. 1.11, the MCFA

Average Log-likelihood (Testing set)

1 Introduction to Deep Density Models with Latent Variables

-100

ULC-3

Dataset
Coil-4-proc
Leuk72_3k

15

USPS1-4

-101
-102
-103
MFAs

-10

MCFAs

4

Fig. 1.9 Performance on various real data (on Testing Set) in terms of the log-likelihood (the
larger, the better)
0.35
Error Rate (Training set)

MFAs

0.3

MCFAs

0.25
0.2
0.15
0.1
0.05
0

ULC-3

Leuk72_3k
Coil-4-proc
Dataset

USPS1-4

Fig. 1.10 Clustering error rate on 4 datasets. The best result is reported from each model on
training set

model still maintains good performance on the test dataset. On the whole, the results
of MCFAs are consistently better than the MFAs.

1.3 Deep Architectures with Latent Variables
In the above section, we defined the shallow architectures and also exhibit inferences
to optimize a single layer. We now consider how to extend the shallow models
by defining the deep density models. Firstly, the deep architectures needs to have
many layers of latent variables. Secondly, parameters should be learned by the
efficient greedy layer-wise algorithms. In this section, two deep density models
will be described, Deep Mixtures of Factor Analyzers (DMFAs) who adopts an

16

X. Yang et al.
0.35
Error Rate (Testing set)

MFAs

0.3

MCFAs

0.25
0.2
0.15
0.1
0.05
0

ULC-3

Leuk72_3k
Coil-4-proc
Dataset

USPS1-4

Fig. 1.11 Clustering error rate on 4 datasets. The best result is reported from each model on testing
sets

MFAs in each hidden layer (McLachlan and Peel 2000; Tang et al. 2012) and
Deep Mixtures of Factor Analyzers with Common loadings (DMCFAs) who adopts
an MCFAs (Yang et al. 2017). Therefore, both deep models are the directed
generative models and use the layer-wise training procedure as approximations.
The observation vector and the first hidden layer are treated as an MFAs/MCFAs
model, and learning parameters are used by this unsupervised method. After fixing
the first layer parameters, the priors of next layer MFAs/MCFAs are replaced by
sampling the hidden units of the current layer MFA/MCFAs The same scheme can
be extended to train the following layers. Compared with the shallow models with
same scale mixtures (Collapse Models), the deep models have fewer free parameters
and a simpler inference procedure. On one hand, the components of adjacent layers
share the parameters; on the other hand, large-scale mixtures can cause the objective
function of a shallow model to be too complexity.

1.3.1 Deep Mixtures of Factor Analyzers
Having formulated the MFAs model, we now show how to construct the MFAs into
a deep architecture. In a shallow model, each FA in MFAs has an isotropic Gaussian
prior in its factor, as well as a Gaussian posterior over each training sample.
However, the posterior is generally non-Gaussian when a posterior is aggregated
for many training samples. If we replace the prior of each FA with a separate mixed
model, this mixture model can learn to model an aggregated posterior rather than
model an isotropic Gaussian, and the sketches are shown in Fig. 1.12. Therefore,
it can improve a variational lower bound on the log probability of the training
data (McLachlan and Peel 2000; Tang et al. 2012). According to this method, Tang
et al. (2012) construct a DMFAs model by replacing the FA in the mixture with an
MFAs model and even substitute the FA of the next mixtures. The graphical model

1 Introduction to Deep Density Models with Latent Variables

17

Fig. 1.12 The sketch of a mixture of the DMFAs’ higher layer for clustering and dimensionality
reduction. Left: The aggregated posterior of a component in the lower layer is not a Gaussian
distribution. Right: The higher layer has an ability to model a better aggregated posterior of the
lower layer
Fig. 1.13 Graphical models
of a two-layer DMFA. The
DMFAs is a deep directed
graphical model utilizing the
multi-layer factor analyzers
which are developed by
adopting a MFAs model in
each hidden layer

of two-layer DMCFAs is visualized in Fig. 1.13. Importantly, the pivotal method is
to sample the data regarding the posterior distributions of the current layer and treat
it as the training data for the next layer.
The observations y and the factors z in the first hidden layer are treated as the
MFAs, and the first layer parameters θ (1) are used by this unsupervised method to
learn. 
The aggregated posterior over z factor with a specific component c is given
n
(1) and ĉ are treated as training
by N1 N
n=1 p(z , cn = c|yn ). For the second layer, z
data by sampling posterior distribution (Eqs. 1.13 and 1.15). Then a more powerful
MFAs prior replaces the standard multivariate normal prior
(2)
p(z|c) = PMF A (z(1)
c ; θ cs ).

(1.28)

The same scheme can be extended to training the following layers.
In the second layer, some new symbols need to be defined: z(1) is a q-dimension
vector as the data input to the second layer; θ (2)
cs emphasizes that a new parameters

18

X. Yang et al.

vector in the second layer which is specific to component c of the first layer
MFAs2 : The layer factors are denoted as a d-dimension vector z(2) ; s is a new
sub-component
indicator variable, and the total number of sub-components is S
satisfying S = C
c=1 Mc ; mc = 1, . . . , Mc denotes the number of sub-components
corresponding to the cth first layer component; The second layer mixing proportions

(2)
(2)
p(s) = πs are defined as p(cs )p(s|cs ), where Ss=1 πs = 1 and cs denotes
the sub-components corresponding to the c component. Then, the DMFAs prior is
written as follows
p(z; c) = p(c)p(mc |c)p(z|mc ).

(1.29)

The density of vectors z(1) follows the joint density over z(2) and s:
p(z(1) , z(2) , s) = p(z(1) , c|z(2) , s)p(z(2) |s)p(s),
(2)

(2)

(1.30)

(2)

p(z(1) , c|s, z(2) ) = N(z(1) ; Ws z(2) + μs , Ψ s ),
p(z(2) |s) = N(0, I).

(1.31)

(2)

(2)

(2)

q×d , Ψ
q×q , μ
Here, the new parameter vector θ (2)
s ∈R
s ∈
cs consists of Ws ∈ R
q
(2)
d
3
R , z ∈ R . Specifically, since every s is just allowed to belong to one and only
one c, we obtain the Gaussian density on the observed data y given z(1) and c
(1)
(1)
+ μ(1)
p(y|c, z(1) ) = N(y; W(1)
c z
c , Ψ c ),
(1)

(1)

where, Wc ∈ Rp×q , Ψ c
parameters.

1.3.1.1

(1)

∈ Rp×p , μc

(1.32)

∈ Rp , z(1) ∈ Rq denote the first layer

Inference

For inference, the posterior distribution is computed in a similar fashion with
Eqs. 1.11 and 1.13
q(z(2) , s|z(1) , c) = q(z(2) |z(1) , c, s)q(s|z(1) , c)
−1

(2)
),
= N(z(2) ; κ (2)
cs , Vcs

(1.33)

where
(2)−1

Vcs

(2)T

(2)−1

= Id + Wcs Ψ cs

−1

T

(2)

Wc s ,

−1

(2)
(2)
(2)
κ (2)
W(2)
(z(1) − W(2)
cs Ψ cs
cs = Vcs
cs μc ).

2 The
3d

superscript represents which layer these variables belongs to
denotes the d-dimension subspace of second layer, where d < q.

(1.34)

1 Introduction to Deep Density Models with Latent Variables

19

Here, The subscript emphasizes the sub-component s which is specific to component
c of the first layer, and Id is a d-dimensional identity matrix. The posterior over the
components can be found as follows
q(s|z(1) , c) ∝ p(z(1) , c|s)p(s),
ŝ =

(1.35)

arg max p(s)q(s|z(1) ).
s

(1.36)

For the second layer, p(z(1) ) and ĉ are treated as input data and initial labels
when the first layer parameters is fixed and the Eq. 1.10 is maximized. According
to the Eq. 1.28, the DMFAs formulation seeks to find a better prior p(z|c) =
(1) (1)
(1)
PMF A (z(1) |ĉ; θ (2)
cs ). Given the new data vectors {z1 ; z2 ; . . . ; zq }, maximizing
the Eq. 1.8 with respect to the second layer parameters is equivalent to maximizing
the density function P (z(1) |ĉ; θ (2)
cs ). The basic likelihood objective function of the
second layer is defined as
P (z(1) |ĉ; θ (2)
cs ) =


s∈c

q

πs
i=1

T

(2)
(2) (2)
{N(z(2)
+ Ψ (2)
s )}.
i |μs , Ws Ws

(1.37)

It is worth to denote that, at the second layer, each mixture model of C components
derived from the first layer can be updated separately, since S second layer
parameters are non-overlapping and just allowed to belong to one and only one
of the first layer component.
θ̂ cs (2) = arg max log P (z(1) ; θ (2)
cs ).

(1.38)

θcs (2)

Despite the good performance in practice, the DMFAs model still has many
drawbacks. Specifically, this model uses different loading matrices for different
components, which may lead to over-fitting in practical applications. Meanwhile, it
also inherits the shortcoming of MFAs, that is, assuming the prior of each potential
factor follows a standard Gaussian distribution, which may limit the flexibility and
accuracy.

1.3.1.2

Collapse Model

While it is true that DMFAs can also be collapsed into a shallow form by integrating
out the latent factors. According to Eq. 1.5, we obtain the collapse model after the
first layer factors z(1) are integrated out:
p(y|z(2) , s) =

z(1)

p(y|c, z(1) )p(z(1) |s, z(2) )p(z(2) |s)dz(1)
T

(2) (2)
(1)
(1) (2) (1)
= N (y; W(1)
+ μ(2)
+ Ψ (1)
c (Ws z
s ) + μ c , Wc Ψ s W c
c ).

(1.39)

20

X. Yang et al.

Then the final shallow form is obtained by further integrating out z(2) :
p(y|s) =


z(2)

(1) (2)

p(y|z(2) , s)dz(2) = N (y; ms , Σ s ),

(1)

(1)

(2)

(2)T

ms = Wc μs + μc , Σ s =Wc (ψ (2)
s +Ws Ws

(1)T

)Wc

(1.40)
(1)

+Ψ c . (1.41)

Finally, the marginal density of the shallowed model on observed data y is then
given by a mixture of Gaussians:
p(y) =

S


p(s)p(y|s) =

s=1

S


πs N(y; ms , Σ s ).

(1.42)

s=1

Conventionally, θ s = {πs , μs , Ws , Ψ s , Wc , μc , Ψ c }S,C
s=1,c=1 represent the parameters of the shallow form of DMFAs.
In this case, the posterior probability of the shallowed MFA which collapses from
a two-layer DMFAs for the sth mixture is given by p(s|y) = πs p(y|s)/p(y). We are
also interested in the posterior distribution of the latent factor zs which is collapsed
to a shallow form
q(z(2) , s|y) = N(zs ; κ s , V−1
s ),
(2)

T

T

(1.43)
−1

(2) (2)
−1 + W(1) Ψ (1) A(1) ,
+ Ψ (2)
V−1
c
c
s )
c
s = (Ws Ws
(2)T

κ s = Ws

μs + V−1
s A
(2)

(1)T

Ψ

(1)−1

(y − μc ).

(1.44)
(1.45)

1.3.2 Deep Mixtures of Factor Analyzers with Common Factor
Loadings
The same scheme was extended to training the DMCFAs model. By illustrating a
mixture of the higher layer in Fig. 1.14, we can clearly see that a mixture model
can better model a non-Gaussian posterior component in the lower layer. The key
improvement is that the same load matrix is used for each mixture component that
uses the MCFAs as an aggregated prior. This would potentially further improve the
performance. Since different loading matrices are used in the DMFAs, the number
of parameters may be unmanageable when the data has a larger feature dimensional
and/or smaller observations. Moreover different load matrices may not be physically
essential or even unnecessary (Yang et al. 2017).
Figure 1.15 presents an illustration of two-layer DMCFA’s graphical model
which is constructed with two global parameters, the factor loading, and noise
covariance matrices in the first layer. As shown in the graphical model, the common
parameters are set in every latent unit at the next level. Based on this setting,
the number of free parameters can be dramatically reduced compared to DMFAs,

1 Introduction to Deep Density Models with Latent Variables

21

Fig. 1.14 The sketch of a mixture of the higher layer of DMCFAs. Left: The aggregated posterior
of a component in the lower layer is not a Gaussian distribution. Right: The higher layer has an
ability to model a better aggregated posterior of the lower layer
Fig. 1.15 Graphical models
of a two-layer DMFA.The
DMCFAs is a deep directed
graphical model utilizing the
multi-layer factor analyzers
which are developed by
adopting a MCFAs model in
each hidden layer

even though MCFAs introduce the mean and variance matrices of latent factors.
Therefore, DMCFAs model could be considerably interested in the tasks with a
large number of clusters or with insufficient instances.
Speaking in a more abstract term, we defined the DMCFAs model in the
mathematical expressions. Firstly, an MCFAs prior is adopted to replace the prior
of latent factors in the first layer.
(1)

p(z|c) = PMCF A (zc ; θ (2)
c ).

(1.46)

Here, new symbols are the same as the definition of DMFAs. Let’s be more concrete,
the DMCFAs model can be written as
p(s) = πs(2) ,

S

(2)
s=1 πs

=1

(1.47)

(2)
p(z(2) |s) = N(z(2) ; ξ (2)
s , Ω s ),

(1.48)

(2)
(2)
p(z(1) , c|s, z(2) ) = N(z(1) ; A(2)
c z , Ψ c ),

(1.49)

p(y|c, z(1) ) = N(y; A(1) z(1) , Ψ (1) ).

(1.50)

22

X. Yang et al.
(2)

(2)

Here, A(1) ∈ Rp×q , Ψ (1) ∈ Rp×p , z(1) ∈ Rq , Ac ∈ Rq×d , Ψ c ∈ Rq×q , z(2) ∈
(2)
d
d×d .4 Conventionally, the joint distribution with the second
Rd , ξ (2)
s ∈ R , Ωs ∈ R
layer latent variables is given by
p(z(1) , z(2) , s) = p(z(1) , c|z(2) , s)p(z(2) |s)p(s),

(1.51)

The same scheme can be extended to train the following layers of MCFAs.
Up to this point, we can roughly calculate the total number of free parameters.
In a two-layer DMCFAs model, the total number of free parameters is in the order
of pq + cqd which is much smaller than that of cpq + sqd in DMFAs, where
the number of dimensions q  p and the number of components s is often in the
order of c2 . In summary, a typical two-layer DMCFAs model usually uses only 1/c
parameters of DMFAs.

1.3.2.1

Inference

For the MAP inference, the formulations of the posterior distribution are a similar
fashion in Sect. 1.2.3.2 with respect to the second layer parameters5
−1

(2)
q(z(2) , s|z(1) , c) = N(z(2) ; κ (2)
),
cs , Vcs

(1.52)

where
(2)−1

Vcs

(2)−1

= Ωs

−1

(2)T

+ Ac
T

(2)−1

Ψc

(2)

Ac ,

−1

(2)
(2)
(2)
κ (2)
A(2)
Ψ (2)
(z(1) − A(2)
c
c
cs = ξ c + Vcs
c ξ c ).

(1.53)

The posterior of the components can be found as follows
q(s|z(1) , c) ∝ p(z(1) , c|s)p(s),
ŝ =

arg max p(s)q(s|z(1) ).
s

(1.54)
(1.55)

In MLE, the likelihood of the mixture model corresponding to the cth component
derived from the first layer is estimated concerning the new observations z(1)
c and
parameters θ (2)
cs

4 In

the second layer, the sub-components corresponding to a component of the first layer share a
common loading and a variance of the independent noise, Ac(2) and Ψ c(2) has the subscript c.
5 The subscript emphasizes the sub-component s which is specific to a component c of the first
layer.

1 Introduction to Deep Density Models with Latent Variables

(2)
P (z(1)
c ; θ cs ) =



πs

s∈c

q


23

T

(2)

(2)
(2) (2) (2)
{N(zi |A(2)
+ Ψ (2)
c ξ c , Ac Ω s Ac
c )}, (1.56)

i=1

where s is just allowed to belong to one and only one c, and q denotes the number
of dimensions of the new observations. Parameters of the mixture model on the new
(1)
observations zc can be updated
(2)
θ̂ cs (2) = arg max log P (z(1)
c ; θ cs ).

(1.57)

θcs (2)

1.3.2.2

Collapse Model

Although a DMCFAs model can be collapsed back into a standard shallow MCFA
by multiplying the factor loading matrices at each layer, the learning of these
two models is entirely different since the lower layer shares the parameters with
the components of the upper layers in the deep model. Therefore, DMCFAs are
more efficient and straightforward than the shallow form, which attributes to the
conditional distribution of the components in the previously hidden layer which is
not modeled using the parameters of the following hidden layers. Moreover, the
over-fitting risk and the computational cost of learning can be significantly reduced
by sharing the factor loadings among the layers.
After the first layer factors z(1) are integrated out, we obtain a multivariate
Gaussian density
T

(2)
(1) (2) (1)
p(y|z(2) , s) = N(y; A(1) (A(2)
+ Ψ (1) ).
c z ), A Ψ c A

(1.58)

Then a standard MCFAs model is obtained by further integrating out z(2) :
p(y|s) =


z(2)

p(y|z(2) , s)p(z(2) |s)dz(2) = N (y; ms , Σ s ),

(2)

ms = A(1) (Ac ξ (2)
s ),

(2)

(2)

(2)T

Σ s =A(1) (Ac Ω s Ac

(2)

(1.59)

T

+Ψ c )A(1) +Ψ (1) . (1.60)

Finally, the marginal density is then given by a mixture of Gaussians
p(y) =

S

s=1

p(s)p(y|s) =

S


πs N(y; ms , Σ s ).

(1.61)

s=1

Conventionally, θ cs = {πs , A, Ψ , Ac , Ψ c , ξ s , Ω s }S,C
s=1,c=1 represents the parameters
of this shallow MCFAs. The posterior distribution of the latent factor zs can also be
collapsed to a shallow form
q(zs , s|y) = N(zs ; κ s , V−1
s ),

(1.62)

24

X. Yang et al.
−1

−1 + A(1) Ψ (1) A(1) ,
V−1
s = (Ac Ω s Ac + Ψ c )
(2)

(1)T

κs = A

(2)

T

(2)

ξ s + V−1
s A

(1)T

Ψ

(1)−1

(y − ms ).

(1.63)
(1.64)

1.3.3 Unsupervised Learning

Log-likelihood (Training set)

We evaluate the performance of the proposed DMFAs and DMCFAs in comparison with their shallow counterparts on the datasets described in Sect. 1.2.4. For the density evaluation, the average log-likelihood is used to examine the quality of the density estimates produced by the deep density models and by the standard density models obtained by collapsing the deep models. In the empirical analyses, all results are computed in logarithmic scale, which makes the results on different datasets easier to compare. The average log-likelihood on the training data over multiple trials is shown in Fig. 1.16. The results on the testing data, obtained without updating the parameters, are shown in Fig. 1.17. We can observe that the deep models improve the true log-likelihood of the standard models dramatically.

To assess model-based clustering, we compute the error rate (the lower, the better) on four real datasets, comparing the performance of the deep density models with that of the standard models with the same scale of mixtures. In the experiments, all methods are initialized by random grouping, and the numbers of clusters are set according to the real classes of the data. We also ran many trials, reducing the data to different dimensions, and chose the best result; the results shown here are the best from each model in the most appropriate dimensions. Figures 1.18 and 1.19 show that the models using the common loading outperform the other competitors on both the training set and the testing set.

Fig. 1.16 Performance on various real data (training set) in terms of the log-likelihood (the larger, the better). DMFAs and DMCFAs are all set to two layers. S-MFA and S-MCFA denote the shallow forms obtained by collapsing the respective deep models


Fig. 1.17 Performance on various real data (testing set) in terms of the log-likelihood (the larger, the better). DMFAs and DMCFAs are all set to two layers. S-MFAs and S-MCFAs denote the shallow forms obtained by collapsing the respective deep models


Fig. 1.18 Clustering error rate on the training datasets. Both DMFAs and DMCFAs have a two-layer architecture. The shallow forms S-MFA and S-MCFA have the same scale of mixtures as the deep models

Furthermore, as clearly observed, the deep models consistently yield better performance than their same-scale shallow counterparts.

Fig. 1.19 Clustering error rate on the testing datasets. Both DMFAs and DMCFAs have a two-layer architecture. The shallow forms S-MFA and S-MCFA have the same scale of mixtures as the deep models

1.4 Expectation-Maximization Algorithm
In density models with incomplete data or hidden variables, the expectation-maximization (EM) algorithm makes parameter estimation possible (Do and Batzoglou 2008). As a general-purpose iterative strategy, the EM algorithm alternates between two steps. Given the current model, the expectation step (E-step) predicts the probability distribution over completions of the missing data. There
is usually no need to establish a probability distribution over these completions explicitly; it is often only necessary to calculate "expected" sufficient statistics over these completions. The maximization step (M-step) re-estimates the model parameters using these completions, which can be thought of as "maximization" of the expected log-likelihood of the data (Do and Batzoglou 2008).
Given the observed variable y, the latent variable z, and the parameters θ, the mathematical expressions are as follows:
$$\text{E-step:}\quad Q\big(\theta|\theta^{(k)}\big) = \mathbb{E}_{q(z|y;\theta^{(k)})}\big[\log p(y,z|\theta)\big] = \int_z q\big(z|y;\theta^{(k)}\big)\log p(y,z|\theta)\,dz; \tag{1.65}$$

$$\text{M-step:}\quad \theta^{(k+1)} = \arg\max_{\theta}\, Q\big(\theta|\theta^{(k)}\big). \tag{1.66}$$

Then, we introduce the convergence proof of the latent-variable-based EM algorithm. Based on Jensen's inequality, it is easy to prove that the EM algorithm repeatedly constructs new lower bounds F(q, θ), where q denotes the posterior of the latent variable, and then solves for the parameters.
$$L(\theta) = \log \int_z p(y,z|\theta)\,dz = \log \int_z q(z|y;\theta)\,\frac{p(y,z|\theta)}{q(z|y;\theta)}\,dz \tag{1.67}$$

$$\geq \int_z q(z|y;\theta)\log\frac{p(y,z|\theta)}{q(z|y;\theta)}\,dz = F(q,\theta), \tag{1.68}$$

$$F(q,\theta) = \int_z q(z|y;\theta)\log p(y,z|\theta)\,dz - \int_z q(z|y;\theta)\log q(z|y;\theta)\,dz \tag{1.69}$$

$$= \big\langle \log p(y,z|\theta) \big\rangle_{q(z|y;\theta)} + H(q). \tag{1.70}$$


Fig. 1.20 Illustration of the EM algorithm. The E-step calculates q with θ^{(k+1)} fixed, raising the lower bound from F(q; θ^{(k+1)}) to F(q; θ^{(k+2)}), where F(q; θ^{(k+2)}) is equal to L(θ) at θ^{(k+1)}. The M-step migrates from θ^{(k+1)} to θ^{(k+2)} with q fixed, and then finds the maximum of F(q; θ^{(k+2)})

In simple terms, the EM process calculates a lower-bound function of the current latent-variable distribution according to the current parameters, obtains new parameters by optimizing this function, and then continues the loop (as shown in Fig. 1.20).
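As a minimal illustration of this E-step/M-step loop (Eqs. 1.65 and 1.66), the sketch below runs EM for a one-dimensional Gaussian mixture. The function name and the fixed iteration count are our own choices, and the example is deliberately much simpler than the factor-analysis mixtures of this chapter.

```python
import numpy as np

def em_gmm_1d(y, K=2, iters=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture: alternate the E-step (posterior
    responsibilities) and the M-step (closed-form parameter updates)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                                   # mixing weights
    mu = rng.choice(y, size=K, replace=False).astype(float)    # initial means
    var = np.full(K, y.var() + 1e-6)                           # initial variances
    for _ in range(iters):
        # E-step: responsibilities q(z | y; theta^(k)) for every sample/component
        dens = np.exp(-0.5 * (y[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: maximize the expected complete-data log-likelihood
        Nk = resp.sum(axis=0)
        pi = Nk / len(y)
        mu = (resp * y[:, None]).sum(axis=0) / Nk
        var = (resp * (y[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    return pi, mu, var
```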
The deep model involves an efficient and simple optimization algorithm to perform inference and learning, the so-called greedy layer-wise unsupervised algorithm. Thanks to this layer-wise algorithm, the EM algorithm can typically be used to estimate the parameters of each mixture in each layer and to find a local maximum of the log-likelihood. Importantly, a simple objective function can be fed to the EM algorithm in a layer-wise style, instead of the massive objective function of a shallow model with the same scale of mixtures. For instance, the expected log-likelihood of the mixture in the first layer is as follows:
$$Q\big(\theta|\theta^{(k)}\big) = \sum_{c=1}^{C}\int_z q(z,c|y;\theta_c)\,\ln p(y,z,c|\theta_c)\,dz = \mathbb{E}_{q(z,c|y;\theta_c)}\big[\ln p(y|c,z) + \ln p(z|c) + \ln \pi_c\big]. \tag{1.71}$$

During the M-step, the parameters θ_c^{(k+1)} (at the (k+1)-th iteration) are updated by setting the partial derivative of the expected log-likelihood with respect to each parameter to zero:

$$\frac{\partial\, \mathbb{E}_{q(z,c|y;\theta^{old})}\big[L(\theta_c)\big]}{\partial \theta_c} = 0. \tag{1.72}$$

The higher layer is able to model a better aggregated posterior of the first layer; by variational inference, any increase in the bound will improve the true log-likelihood of the model when the bound is tight. Therefore, training the deep model is better than training a shallow model.
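The layer-wise idea can be sketched with off-the-shelf EM-trained components. In the sketch below, scikit-learn's FactorAnalysis and GaussianMixture serve only as stand-ins for the per-layer mixtures of this chapter (they are not MFA/MCFA components), and the dimensions are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import GaussianMixture

def greedy_layerwise_fit(Y, d1=10, d2=3, n_clusters=4):
    """Greedy layer-wise training: fit layer 1 by EM, pass its posterior factor
    means to layer 2, then fit a mixture on the top-layer factors."""
    layer1 = FactorAnalysis(n_components=d1).fit(Y)    # EM on the observations
    Z1 = layer1.transform(Y)                           # aggregated posterior factors
    layer2 = FactorAnalysis(n_components=d2).fit(Z1)   # EM on the layer-1 factors
    Z2 = layer2.transform(Z1)
    top = GaussianMixture(n_components=n_clusters).fit(Z2)
    return layer1, layer2, top
```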


1.5 Conclusion
In summary, this chapter introduces novel deep density models with latent variables. We detail the two standard probabilistic models, MFAs and MCFAs, and the deep models deduced from them. Compared with previous deep density models, the greedy layer-wise algorithm enjoys an easy inference procedure and a significantly smaller number of free parameters. Experimental results compare deep models and shallow models in both density evaluation and clustering. The empirical results showed that the deep models can significantly improve the true log-likelihood of the standard models. Moreover, the deep models also consistently deliver better clustering performance than the shallow models. For more practical applications, the deep density models are also useful for creating a good prior that can be used for tasks such as image denoising, inpainting, or tracking animate motion.
Acknowledgements The work reported in this paper was partially supported by the following: National Natural Science Foundation of China (NSFC) under grant no. 61473236; Natural Science Fund for Colleges and Universities in Jiangsu Province under grant no. 17KJD520010; Suzhou Science and Technology Program under grants no. SYG201712 and SZS201613; Jiangsu University Natural Science Research Programme under grant no. 17KJB520041; and Key Program Special Fund in XJTLU (KSF-A-01).

References
Arnold L, Ollivier Y (2012) Layer-wise learning of deep generative models. CoRR, abs/1212.1524
Baek J, McLachlan GJ (2011) Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27(9):1269–1276
Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor
loadings: applications to the clustering and visualization of high-dimensional data. IEEE Trans
Pattern Anal Mach Intell 32(7):1298–1309
Bengio Y, Lamblin P, Popovici D, Larochelle H (2007) Greedy layer-wise training of deep
networks. In: Advances in neural information processing systems, pp 153–160
Bengio Y et al (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127
Bishop CM (1998) Latent variable models. In: Learning in graphical models. Springer, New York,
pp 371–403
Bishop C (2006) Pattern recognition and machine learning. Springer, New York
Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotech
26(8):897–899
Everitt BS (1984) Factor analysis. Springer Netherlands, Dordrecht, pp 13–31
Everitt BS (2013) An introduction to latent variable models. Springer Science & Business Media, Berlin
Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern
Anal Mach Intell 24(3):381–396
Fokoué E (2005) Mixtures of factor analyzers: an extension with covariates. J Multivar Anal
95(2):370–384
Galbraith JI, Moustaki I, Bartholomew DJ, Steele F (2002) The analysis and interpretation of
multivariate data for social scientists. Chapman & Hall/CRC Press, Boca Raton


Ghahramani Z (2015) Probabilistic machine learning and artificial intelligence. Nature
521(7553):452
Ghahramani Z, Hinton G (1996) The em algorithm for mixtures of factor analyzers. Technical
Report CRG-TR-96-1, University of Toronto, pp 11–18. http://www.gatsby.ucl.ac.uk/.zoubin/
papers.html
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks.
Science 313(5786):504–507
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18(7):1527–1554
McLachlan GJ, Peel D, Bean RW (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41:379–388
Johnson B (2013) High resolution urban land cover classification using a competitive multi-scale
object-based approach. Remote Sens Lett 4(2):131–140
Johnson B, Xie Z (2013) Classifying a high resolution image of an urban area using super-object
information. ISPRS J Photogrammetry Remote Sens 83:40–49
Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using
mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166
Loehlin JC (1998) Latent variable models: an introduction to factor, path, and structural analysis.
Lawrence Erlbaum Associates Publishers
Ma J, Xu L (2005) Asymptotic convergence properties of the em algorithm with respect to the
overlap in the mixture. Neurocomputing 68:105–129
McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. John Wiley & Sons,
Hoboken
McLachlan GJ, Peel D (2000) Mixtures of factor analyzers. In: International Conference on
Machine Learning (ICML), pp 599–606
Montanari A, Viroli C (2011) Maximum likelihood estimation of mixtures of factor analyzers.
Comput Stat Data Anal 55(9):2712–2723
Nene SA, Nayar SK, Murase H (1996) Columbia object image library (COIL-20). Technical Report CUCS-005-96
Patel AB, Nguyen T, Baraniuk RG (2015) A probabilistic theory of deep learning. arXiv preprint
arXiv:1504.00641
Rippel O, Adams RP (2013) High-dimensional probability estimation with deep density models.
CoRR, abs/1302.5125
Salakhutdinov R, Mnih A, Hinton GE (2007) Restricted boltzmann machines for collaborative
filtering. In: Machine Learning, Proceedings of the Twenty-Fourth International Conference
(ICML), Corvallis, Oregon, USA, 20–24 June 2007, pp 791–798
Smaragdis P, Raj B, Shashanka M (2006) A probabilistic latent variable model for acoustic
modeling. Adv Models Acoust Process NIPS 148:8–1
Tang Y, Salakhutdinov R, Hinton GE (2012) Deep mixtures of factor analysers. In: Proceedings
of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland,
UK, June 26–July 1 2012
Tang Y, Salakhutdinov R, Hinton G (2013) Tensor analyzers. In: International Conference on
Machine Learning, pp 163–171
Tortora C, McNicholas PD, Browne RP (2016) A mixture of generalized hyperbolic factor
analyzers. Adv Data Anal Classif 10(4):423–440
Vermunt JK, Magidson J (2004) Latent class analysis. The sage encyclopedia of social sciences
research methods, pp 549–553
Yang X, Huang K, Goulermas JY, Zhang R (2017) Joint learning of unsupervised dimensionality reduction and Gaussian mixture model. Neural Process Lett 45(3):791–806
Yang X, Huang K, Zhang R (2017) Deep mixtures of factor analyzers with common loadings:
a novel deep generative approach to clustering. In: International Conference on Neural
Information Processing. Springer, pp 709–719

Chapter 2

Deep RNN Architecture: Design
and Evaluation
Tonghua Su, Li Sun, Qiu-Feng Wang, and Da-Han Wang

Abstract Sequential data labelling tasks, such as handwriting recognition, face two critical challenges. One is the great need for a large-scale, well-annotated training dataset. The other is how to effectively encode both the current signal segment and its contextual dependency. Both demand much human effort. Motivated to relieve such issues, this chapter presents a systematic investigation of architecture design strategies for recurrent neural networks from two different perspectives: the pipeline perspective and the micro core perspective. From the pipeline perspective, a deep end-to-end recognition architecture is proposed, allowing fast adaptation to new tasks with minimal human intervention. Firstly, the deep architecture of the RNN gives a highly nonlinear feature representation. Through hierarchical representations, effective and complex features are expressed in terms of other, simpler ones. Moreover, a hybrid unit is used to encode the long contextual trajectories; it comprises a BLSTM (bidirectional Long Short-Term Memory) layer and an FFS (feed-forward subsampling) layer. Secondly, the CTC (Connectionist Temporal Classification) objective function makes it possible to train the model without annotating the data at the character level. Thirdly, a modified CTC beam search algorithm wisely integrates the linguistic constraints during decoding in a unified formulation. In such an end-to-end style, a thin and compact network with high accuracy can be designed and made ready for industry-scale applications. From the micro core perspective, a new neuron structure named gated memory unit (GMU) is presented, inspired by both LSTM and GRU. GMU preserves the constant error carousel (CEC), which is devoted to enhancing a smooth information flow. GMU also borrows both the cell structure of LSTM and the interpolation gates of GRU. Our methods are evaluated on CASIA-OLHWDB 2.x, a publicly available Chinese handwriting database. Comparison with state-of-the-art methods shows that: (1) our pipeline demonstrates over 30% relative error reduction on the test set in terms of both correct rate and accurate rate; (2) better results can be achieved by our pipeline under certain feasible assumptions on the competition set; (3) GMU is a potential choice for handwriting recognition tasks.

Keywords Deep learning architecture · Neuron design · Recurrent neural network · Handwriting recognition · Large-category labelling

Part of this chapter is reprinted from:
IEEE Proceedings, Li Sun, "GMU: A Novel RNN Neuron and Its Application to Handwriting Recognition", 2017, with permission from IEEE
IEEE Proceedings, Li Sun, "Deep LSTM Networks for Online Chinese Handwriting Recognition", 2016, with permission from IEEE

T. Su (✉) · L. Sun
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
e-mail: thsu@hit.edu.cn

Q.-F. Wang
Xi'an Jiaotong-Liverpool University, Suzhou, China

D.-H. Wang
School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, China

© Springer Nature Switzerland AG 2019
K. Huang et al. (eds.), Deep Learning: Fundamentals, Theory and Applications,
Cognitive Computation Trends 2, https://doi.org/10.1007/978-3-030-06073-2_2

2.1 Introduction
Online Chinese handwriting recognition is a typical sequential data-labelling task. During writing, the trace of the running pen (or, more generally, writing instrument) tip is recorded by sensing devices. Each textline can be viewed as a temporal sequence of tensors, also called a trajectory. Online handwriting recognition plays a significant role in entering Chinese text. Compared with other input methods, it is more natural to human beings, especially to those Chinese people who are not good at Pinyin.

The pioneering works on online Chinese handwriting recognition were independently studied in 1966 by two groups from MIT and the University of Pittsburgh (Liu 1966; Zobrak 1966). Since then, extensive attention has been received, owing to the popularity of personal computers and smart devices. Successful applications have been found in pen-based text input (Wang et al. 2012a), overlapped input methods (Zou et al. 2011; Lv et al. 2013), and note digitalization and retrieval (Zhang et al. 2014).

In recent years, the combination of recurrent neural networks (RNN) and Connectionist Temporal Classification (CTC) has become a powerful sequential labelling tool (Graves et al. 2009a). Messina et al. utilized this tool to transcribe offline Chinese handwriting (Messina and Louradour 2015) and its potential has been demonstrated. More recently, convolutional LSTM has been applied to online Chinese handwriting recognition (Xie et al. 2016) with pleasing results.

Recurrent neurons are the building blocks of RNNs. Their individual capabilities and combinations thereof can form different network architectures, resulting in different modeling capabilities. Although RNNs can approximate any sequence-to-sequence mapping in theory, decades of studies suggest that vanilla recurrent neural networks have difficulty capturing long-term dependencies (Bengio et al. 1994). This chapter aims to present RNN design solutions from both the pipeline perspective and the micro core neuron perspective. At the pipeline level, a novel end-to-end recognition method


is proposed based on deep recurrent neural networks. Instead of rendering the trajectories in offline mode (Xie et al. 2016; Liu et al. 2015), we explore the possible capacities of the raw pen trajectory using a deep architecture. The architecture mixes bidirectional LSTM layers and feed-forward subsampling layers, which are used to encode the long contextual history of the trajectories. We also follow the CTC objective function, making it possible to train the model without alignment information between input trajectories and output strings. As suggested in (Liu et al. 2015), we utilize explicit linguistic constraints to boost our end-to-end systems in online mode. To integrate the linguistic constraints wisely, a modified CTC beam search algorithm is devised for decoding. At the micro core level, we propose a new kind of recurrent neuron called the Gated Memory Unit (GMU), based on the principle of a highway channel. The GMU neuron combines two components: one is used to keep the history context and the other to generate the neuron activation. We retain the memory cell structure of LSTM to constitute a transmission channel, and borrow the interpolation gate of GRU to regulate the flow of information. At both levels, the proposed method is evaluated on a publicly available database.

2.2 Related Works
2.2.1 Segmentation-Free Handwriting Recognition
Most approaches to online Chinese handwriting recognition follow the segmentation-recognition integrated strategy (Cheriet et al. 2007). Firstly, the textline is over-segmented into primitive segments. Next, consecutive segments are merged and assigned a list of candidate classes by a character classifier (Liu et al. 2013), which forms a segmentation-recognition lattice. Last, path searching is executed to identify an optimal path by wisely integrating the character classification scores, geometric context and linguistic context (Wang et al. 2012b). Great success has been achieved with this strategy during the past few years.

If there is big, well-annotated training data, the segmentation-recognition integrated strategy may be the first choice. Many works have been conducted to optimize its key factors, such as character recognition (Zhou et al. 2016), confidence transformation (Wang and Liu 2013), and parameter learning methods (Zhou et al. 2013, 2014). In (Yin et al. 2013), these three modules are termed by Vision Objects Ltd. as the segmentation, recognition, and interpretation experts. They are separately trained and their weighting for the task is scored through a global discriminative training process. The complex training phases involve intense expert intervention.

However, construction of such a system is nontrivial. As a precondition, a large-scale training dataset needs to be annotated at the character level. The widespread Chinese handwriting databases, such as HIT-MW (Su et al. 2007) and CASIA-OLHWDB (Liu et al. 2011), took years to be annotated by hand. Moreover, features for classification are handcrafted by experts. In addition, different


components of the system are developed separately, probably requiring different data for tuning. Clearly, this requires great human effort.

As an alternative, the segmentation-free strategy is promising for solving the above issues. There is no need to explicitly segment textlines into characters, so labeling character boundaries in the training data is unnecessary. Previously, hidden Markov models (HMMs) were successfully used to recognize Chinese handwriting (Su et al. 2009) in a segmentation-free way.

Nowadays, RNNs, especially LSTM, have become powerful tools for sequential labelling tasks. Messina et al. utilized such a tool to transcribe offline Chinese handwriting (Messina and Louradour 2015); the preliminary results are comparable to those ever reported. More recently, convolutional LSTM has been applied to online Chinese handwriting recognition (Xie et al. 2016) with pleasing results. The system includes four kinds of layers. The input trajectory is first converted into textline images using a path signature layer. Then convolutional layers are concatenated to learn features. After that, the features are fed to multiple layers of bidirectional LSTM. And finally, CTC beam search is used to transform the output of the network into labels in the transcription layer.

Instead of rendering the trajectories in offline mode (Xie et al. 2016; Liu et al. 2015), we explore the possible capacities of the original pen trajectory using a deep architecture. As suggested in (Liu et al. 2015), we utilize explicit linguistic constraints to boost our end-to-end systems in online mode.

2.2.2 Variants of RNN Neuron
Since the early 1990s, the pursuit of sophisticated recurrent neurons has been a significant topic. Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are two kinds of commonly used sophisticated recurrent neurons. Currently these RNN neurons are widely used in many research fields, including natural language processing (Cho et al. 2014a), speech recognition (Amodei et al. 2016; Graves et al. 2013), and image recognition (Russakovsky et al. 2015). The state-of-the-art character recognition methods (Graves et al. 2009b; Graves and Schmidhuber 2009; Graves 2012a) increasingly use recurrent neural networks and have made a lot of progress. Briefly speaking, there are two reasons for their success. Firstly, they introduce special gates which can keep longer contextual dependencies, such as the forget gate (for LSTM) or the update gate (for GRU). Secondly, they build a highway channel to provide shortcuts for the smooth propagation of history information and the back propagation of error information.

LSTM is one of the earliest proposed sophisticated recurrent neurons. The original LSTM structure (Hochreiter and Schmidhuber 1997) comprises one or more self-connected memory cells, an input gate and an output gate. The structure establishes constant error carousels (CEC), which facilitate both the storage of history information to encode long-term dependencies and the gradient back propagation to prevent the gradient decay problem. The input gate is used for controlling the ratio

of information to enter the memory cell. The output gate is used to control the exposed portion of the memory cell content as the neuron activation. To further enhance the ability of LSTM, a forget gate was added (Gers et al. 2000), which performs a reset operation if the previous history should be discarded. A more recent enhancement of LSTM is peephole connections (Gers et al. 2002), which help to improve the precise timing and counting ability of the internal state.

Fig. 2.1 LSTM unfolded in time steps

LSTM neurons can be unfolded in time steps, as shown in Fig. 2.1. Here the cell is deliberately depicted as a channel flowing from left to right. During the forward propagation process, the contents of the memory cell are stored in the channel, and during the back-propagation process, the gradient of the objective function can flow back through the channel. The memory content is updated using Eq. 2.1:

$$c_t^k = f_t^k\, c_{t-1}^k + \tilde{c}_t^k, \tag{2.1}$$

where the forget gate's output value f_t^k determines how much of the history content (c_{t-1}^k) is preserved. In case f_t^k is close to 0, almost all history information is discarded; otherwise the history information is partially retained. We can even control the input gate to maximize the usage of history information. The contents of the memory cell are not fully exposed to the hidden state; instead, the exposure is controlled by the value of the output gate, o_t^k.


Fig. 2.2 GRU unfolded in time steps

Another type of sophisticated recurrent neuron is the Gated Recurrent Unit (GRU) (Cho et al. 2014a, b), unfolded in time steps in Fig. 2.2. GRU uses a simpler memory block than LSTM. Each GRU block consists of an input transformation, a reset gate (Gr) and an update gate (Gz). The reset gate controls short-term contextual dependencies using a self-connected structure, and the update gate controls long-term contextual dependencies using an interpolation between the history information and the current activation. To adaptively capture dependencies of different lengths, one just tunes those two gates. Moreover, since the unit outputs (hidden states) are self-connected, they become the channel for historical information. Similar to the LSTM cell, the channel ensures smooth gradient back propagation, but it differs from LSTM. The new unit output is updated using Eq. 2.2:

$$h_t^k = \big(1 - z_t^k\big)\, h_{t-1}^k + z_t^k\, \hbar_t^k. \tag{2.2}$$

In case z_t^k is close to 1, the hidden state is forced to ignore the previous hidden states; otherwise, the previous information is retained through the interpolation coefficient. To achieve maximum usage of history information, one can even allow z_t^k to be close to zero. Different from LSTM, the contents of the memory are exposed fully without any reservation.
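A matching sketch of one GRU step, with hypothetical parameter names, shows the interpolation of Eq. 2.2 directly:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step: reset gate r, update gate z, candidate activation, then the
    interpolation h_t = (1 - z_t) h_{t-1} + z_t h~_t of Eq. 2.2."""
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate (short-term dependencies)
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate (long-term dependencies)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate activation
    return (1.0 - z) * h_prev + z * h_cand
```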


2.3 Datasets
We use an online Chinese handwriting database, CASIA-OLHWDB (Liu et al. 2011), to train and evaluate our models. This database comprises trajectories of both isolated characters and lines of text. The part comprising only isolated characters has three subsets, named OLHWDB 1.0∼1.2, and the other part also has three subsets: OLHWDB 2.0∼2.2. OLHWDB 2.0∼2.2 are divided into a train set and a test set by the authors of (Liu et al. 2011). The train set includes 4072 pages from 815 people, while the test set includes 1020 pages from 204 people. There is also a held-out subset used for the ICDAR 2013 Chinese handwriting recognition competition (denoted as the comp set) (Yin et al. 2013). Characteristics of these datasets are summarized in Table 2.1. Our systems are evaluated on the test set or the comp set. Note that there are 5 characters in the test set which are not covered by the train set, while around 2.02% of the characters in the comp set are not covered by the train set.

Certain preprocessing is performed before the experiments. Firstly, we determine a character set that covers all samples occurring in CASIA-OLHWDB 2.x. In total there are 2765 unique character terms (including 2764 terms and one {blank}). Secondly, we remove the manually annotated points in the samples: in order to retain the true handwriting behavior, the cutting points added to segment the ligatures are removed. Thirdly, we derive a compact feature representation. The input trajectory is represented in raw form. At each timestep, it consists of the first-order differences of the x, y coordinates, along with a state indicating whether the pen is lifted. For the start of the trajectory, the first two dimensions are set to zeroes.
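A small sketch of this raw representation is given below. It assumes the strokes have already been concatenated into one (T, 2) coordinate array with a 0/1 pen-lift flag per timestep; this input format is our assumption, not the database's native storage.

```python
import numpy as np

def trajectory_features(points, pen_lifted):
    """Build the 3-D per-timestep representation: (dx, dy, pen state).
    points: (T, 2) x, y coordinates; pen_lifted: (T,) 0/1 flags."""
    feats = np.zeros((points.shape[0], 3), dtype=float)
    feats[1:, 0:2] = points[1:] - points[:-1]   # first-order differences; first step stays zero
    feats[:, 2] = pen_lifted
    return feats
```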
We want to construct n-gram models to express the linguistic constraints. A large amount of textual data was collected from the electronic news of People's Daily (in HTML format). We just filter out the HTML markup and no human effort is involved, although the data contain some noisy lines. The texts range from 1999 to 2005, including 193,842,123 characters. We estimate our n-gram language models using the SRILM toolkit (Stolcke 2002). All characters outside of the 7356 classes are removed. Bigram and trigram models are considered, respectively, in base-10 logarithm scale. Both of them are pruned with a threshold of 7e-7.
Table 2.1 Characteristics of the CASIA-OLHWDB 2.0∼2.2 datasets

              Subsets of OLHWDB 2.0∼2.2
Items         train          test         comp
#pages        4072           1020         300
#lines        41,710         10,510       3432
#chars        1,082,220      269,674      91,576
#classes      2650           2631         1375


2.4 Proposed Deep Neural Network
The proposed system aims to map a fed pen trajectory X (= x1 x2 . . . xT) to a textual string l following an end-to-end paradigm, and to learn an optimal map. This process is illustrated in Fig. 2.3. Here we let an LSTM model encode the map. Suppose such a map has been learned; then we can feed in an unseen trajectory textline. The LSTM network is propagated from the input layer to the output layer through the hidden layers. As the final step, CTC beam search is performed to output the most probable strings. The performance on each trajectory is measured based on the alignment between l and the ground truth z.

How to derive an optimal map is left to the training process of Fig. 2.3. We
generate the needed LSTM models by minimizing the empirical loss function:

$$\mathcal{L}(S) = \sum_{(X,z)\in S} \mathcal{L}(X,z), \tag{2.3}$$

with S denoting the training set. Because the problem is a multi-class classification, we use L(X, z) = −ln P(z|X) as the per-example loss function, which can be effectively computed by CTC and the forward-backward algorithm. Training data from the CASIA-OLHWDB database (Liu et al. 2011) is used to derive the classification

Fig. 2.3 Flowchart of the system


model. We also estimate n-gram language models from the textual corpus in advance, which will be used to facilitate the search process during decoding.

2.4.1 Architecture
We develop a deep recurrent neural network (RNN) to derive the mapping function, specialized for processing a sequence of vectors. Figure 2.4 shows a typical instance of the unfolded recurrent neural network. Besides the input and output layers, there are hidden layers and subsampling layers. The history information is encoded through the recurrent hidden layers. The hidden variable vector at time step t in the n-th hidden layer, h_n^t, is iteratively computed from both the (n-1)-th hidden variable vector h_{n-1}^t (with h_0^t = x^t) and the previous hidden variable vector h_n^{t-1}:

$$\mathbf{h}_n^t = \tanh\big(\mathbf{W}_n^{I}\,\mathbf{h}_{n-1}^{t} + \mathbf{U}_n^{H}\,\mathbf{h}_n^{t-1}\big), \tag{2.4}$$

where W_n^I is the weight matrix between the (n-1)-th hidden layer and the n-th hidden layer, and U_n^H is the self-connected recurrent weight matrix. Since h_n^t is defined in a recursive manner, a dependency on the input vectors ranging from the first to the t-th time step is possible.
To further benefit from the geometric context both from left to right and vice versa, we also consider a bidirectional recurrent neural network (BRNN) (Schuster and Paliwal 1997), where each recurrent layer is replaced with a forward layer and a backward layer. The forward pass is the same as usual, while the backward pass processes data from t = T to 1. The recurrent formulation results in the sharing of parameters through all time steps (Graves 2012b).

Fig. 2.4 The architecture of the unfolded BRNN (with 5 hidden layers)
Subsampling layers are worth exploring considering the large length of the trajectories and the variability in writing styles. Herein we devise feedforward-like neuron units which allow inputs from a window of consecutive timesteps to be collapsed. It is a simplified version of (Graves 2012b). The k-th hidden variable at time step w × t in the n-th subsampling layer is computed as:

$$h_{n(k)}^{w\times t} = \tanh\!\left(\sum_{i=1}^{w} \mathbf{w}_{n(k)}^{i} \cdot \mathbf{h}_{n-1}^{w\times t+i}\right), \tag{2.5}$$

where w is the window size and w_{n(k)}^i is the weight vector from the (n-1)-th layer to the k-th unit in the n-th layer.
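A sketch of such a subsampling layer over non-overlapping windows (in the spirit of Eq. 2.5), with one weight matrix per window position, is shown below; the array shapes are our assumptions.

```python
import numpy as np

def subsample_layer(H_prev, W):
    """Collapse every window of w consecutive timesteps into one output timestep.
    H_prev: (T, D_in) activations of layer n-1; W: (w, D_in, D_out) weights,
    one matrix per position inside the window (cf. Eq. 2.5)."""
    w = W.shape[0]
    T = (H_prev.shape[0] // w) * w                        # drop the incomplete last window
    windows = H_prev[:T].reshape(-1, w, H_prev.shape[1])  # (T // w, w, D_in)
    pre = np.einsum('twi,wio->to', windows, W)            # weighted sum over the window
    return np.tanh(pre)
```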
The network output vector at time t can be computed as:

$$\mathbf{y}^t = \mathrm{softmax}\big(\mathbf{a}^t\big) = \mathrm{softmax}\big(\mathbf{W}^{O}\,\mathbf{h}_L^{t}\big), \tag{2.6}$$

where W^O is the weight matrix from the last hidden layer to the output layer. Note that one extra output node is reserved for {null}, which will be used in the CTC algorithm.
Standard RNNs suffer from the vanishing gradient problem (Goodfellow et al. 2016). In our RNN network, LSTM memory blocks (Hochreiter and Schmidhuber 1997) are used to replace the nodes of the RNN in the hidden layers. Herein each memory block includes three gates, one cell, and three peephole connections. The forget gate's output value determines how much of the history content is preserved: in case it is close to 0, almost all history information is discarded; otherwise the history information is partially retained. We can even control the input gate to maximize the usage of history information. The contents of the memory cell are not fully exposed to the hidden state; instead, the exposure is controlled by the value of the output gate. All of the gates are nonlinear summation units. In general, the input activation function is the tanh function and the gate activation function is the sigmoid function, which squeezes the gate data between 0 and 1.

2.4.2 Learning
Connectionist Temporal Classification (CTC) is a powerful objective function for sequential data labelling. It allows RNNs to be trained without requiring any prior alignment between input and target sequences (Graves and Jaitly 2014). Assuming all labels are drawn from an alphabet A, we define the extended alphabet A′ = A ∪ {null}. If the output is null, no label is emitted at that timestep. A CTC mapping function is defined to remove repeated labels and then delete the nulls from each output sequence. Provided that the null label is injected into the label sequence z, we get a new sequence z′ of length 2|z| + 1.
CTC uses a forward-backward algorithm (Graves 2012b) to sum over all possible alignments and determine the conditional probability Pr(z|X) of the target sequence given the input sequence. For a labelling z, the forward variable α(t, u) is defined as the summed probability of all length-t paths that are mapped by the CTC mapping function onto the length-⌊u/2⌋ prefix of z. Conversely, the backward variable β(t, u) is defined as the summed probability of all paths starting at t + 1 that complete z when appended to any path contributing to α(t, u). Thus the conditional probability Pr(z|X) can be expressed as:

$$\Pr(z|X) = \sum_{u=1}^{|z'|} \alpha(t,u)\,\beta(t,u). \tag{2.7}$$

Substituting Eq. 2.7 into Eq. 2.3, we obtain the empirical loss function. Defining the sensitivity signal at the output layer as $\delta_k^t \triangleq \partial \mathcal{L}/\partial a_k^t$, we can get:

$$\delta_k^t = y_k^t - \frac{1}{\Pr(z|X)}\sum_{u\in B(z,k)} \alpha(t,u)\,\beta(t,u), \tag{2.8}$$

where $B(z,k) = \{u : z'_u = k\}$. Based on Eq. 2.8, the sensitivity signals at the other layers can be easily backpropagated through the network.
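As an illustration of the forward recursion underlying Eq. 2.7, the sketch below computes ln Pr(z|X) from per-timestep softmax outputs. It works directly in probability space, so a log-space or rescaled version would be needed for long sequences in practice; it assumes a non-empty label sequence and is not the authors' implementation.

```python
import numpy as np

def ctc_forward_logprob(probs, label, null=0):
    """CTC forward algorithm on the extended sequence z' (labels interleaved
    with nulls). probs: (T, K) softmax outputs; label: list of label indices."""
    ext = [null]
    for l in label:
        ext += [l, null]                       # z', length 2|z| + 1
    T, U = probs.shape[0], len(ext)
    alpha = np.zeros((T, U))
    alpha[0, 0] = probs[0, null]
    if U > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for u in range(U):
            a = alpha[t - 1, u]                # stay on the same position
            if u > 0:
                a += alpha[t - 1, u - 1]       # advance by one position
            if u > 1 and ext[u] != null and ext[u] != ext[u - 2]:
                a += alpha[t - 1, u - 2]       # skip the null between different labels
            alpha[t, u] = a * probs[t, ext[u]]
    return float(np.log(alpha[T - 1, U - 1] + alpha[T - 1, U - 2]))
```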

2.4.3 Decoding
Beam search is employed to integrate the linguistic constraints, with modifications made to integrate the constraints wisely. To reduce the bias of noisy estimation, the extension probability Pr(k, y, t) is rescaled, while the pseudocode in Algorithm 1 falls into a framework similar to that proposed in (Graves and Jaitly 2014). Define Pr−(y, t), Pr+(y, t) and Pr(y, t) as the null, non-null and total probabilities assigned to the partial output transcription y at time t, respectively. The probability Pr(k, y, t) of extending y by label k at time t can be expressed as follows:

$$\Pr(k,y,t) = \Pr(k,t|X)\,\big[\Pr(k|y)\big]^{\gamma} \begin{cases} \Pr^{-}(y,t-1), & y^e = k, \\ \Pr(y,t-1), & \text{otherwise,} \end{cases}$$

where Pr(k, t|X) is the CTC emission probability of k at t, Pr(k|y) is the linguistic transition probability from y to y + k, y^e is the final label in y, and γ is the language weight.
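In the log domain the rescaled extension score is just a weighted sum, as in this tiny sketch; it illustrates the rescaling described above, not the authors' Algorithm 1, and all argument names are ours.

```python
def extension_logprob(log_p_net, log_p_lm, gamma, log_pr_null, log_pr_total, extends_same_label):
    """Log-domain Pr(k, y, t): emission + gamma * LM transition + the appropriate
    prefix term (null-ending probability if k repeats the final label of y)."""
    prefix = log_pr_null if extends_same_label else log_pr_total
    return log_p_net + gamma * log_p_lm + prefix
```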


Algorithm 1: Modified CTC Beam Search

2.4.4 Experimental Setup
We set 2765 units for the output layer. One is reserved as the null indicator; the others cover all characters occurring in the train set, test set and comp set of OLHWDB 2.0∼2.2. We also select isolated samples from OLHWDB 1.x, with no more than 700 samples per class, hoping to alleviate the out-of-vocabulary problem of the comp set.

Three different networks are investigated, as provided in Table 2.2. The type column reflects the number of hidden layers. The setting column describes the size of each layer. The suffixes have specific meanings: I-input layer, S-subsampling layer, L-LSTM layer and O-output layer. The second column summarizes the overall number of network parameters (in millions). The three networks are deliberately tuned to be comparable in the number of parameters. Our LSTM layers are very thin compared to (Xie et al. 2016), where 1024 nodes are assigned.

The system is developed from scratch on our own. All probabilities except the language model are expressed in natural logarithm scale. The learning rate is driven by the AdaDelta algorithm (Zeiler 2012) with a momentum of 0.95. During CTC beam search decoding, the emission probabilities are pruned with the reciprocal of the number of output units as the threshold, and the beam size is fixed at 64 in our evaluation. Both a CPU version and a GPU version are implemented. Using a mini-batch size of 32 textlines, a speedup of more than 200X can be achieved by our GPU version on a Tesla K40.


Table 2.2 Three different network setups

Type       #para     Setting
3-layer    1.63 M    3I-64L-128S-192L-2765O
5-layer    1.50 M    3I-64L-80S-96L-128S-160L-2765O
6-layer    1.84 M    3I-48L-64S-96L-128L-144S-160L-2765O

To overcome the data sparsity, the training process is divided into four stages. In the first stage, isolated samples are used to train the networks with a mini-batch size of 128 characters; this is cycled for 5 epochs. The next stage is run on all training data from OLHWDB 2.x with a mini-batch size of 32 textlines and is repeated for 10 epochs. The third stage is executed on the same data as the first stage for 10 epochs. At the final stage, 95% of the training data from OLHWDB 2.x is used to fine-tune the networks for no more than 20 epochs. The rest of the training data is reserved for validation of the best model.

The output of a recognizer is compared with the reference transcription and two metrics, the correct rate (CR) and accurate rate (AR), are calculated to evaluate the results. Supposing the numbers of substitution errors (S_e), deletion errors (D_e), and insertion errors (I_e) are known, CR and AR are defined respectively as:

$$CR = \frac{N_t - S_e - D_e}{N_t}, \qquad AR = \frac{N_t - S_e - D_e - I_e}{N_t}, \tag{2.9}$$

where N_t is the total number of characters in the reference transcription.
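Equivalently, as a two-line helper (the error counts would come from an edit-distance alignment, which is not shown here):

```python
def cr_ar(n_total, n_sub, n_del, n_ins):
    """Correct rate and accurate rate of Eq. 2.9."""
    cr = (n_total - n_sub - n_del) / n_total
    ar = (n_total - n_sub - n_del - n_ins) / n_total
    return cr, ar
```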

2.4.5 Results
The fourth training stage is illustrated in Fig. 2.5, where the y axis shows the AR on the validation set. The best model in this phase is selected and used to evaluate on the test set or comp set afterwards.

The rescaling intensity of the language models is determined based on their performance on the validation set. In Fig. 2.6, the role of the character-level trigram is given. We can see that it is stable when the language weight ranges from 1.0 to 1.8, while the standard decoding algorithm uses a value of around 2.3. A similar observation can be made for the bigram. In the following experiments, we simply fix it at 1.0.

We investigate the role of different language models as well as network depth on both the test set and the comp set, as shown in Table 2.3. Though the three networks have almost the same number of parameters, the deeper networks generally work better on both sets. On the other hand, we can see a steady increase in performance when stronger linguistic constraints are imposed. Comparing the two sets, the improvement from using language models on the comp set is more remarkable.


Fig. 2.5 The fourth training stage, evaluated on the validation set

Fig. 2.6 Investigation of language model (trigram) weight on the validation set

Finally, we compare our results with previous works, as shown in Table 2.4. The systems in the first four rows use the segmentation-recognition integrated strategy. The fifth row employs a hybrid CNN-LSTM architecture with a specific feature extraction layer. The best results on the test set are achieved by our system. Compared with (Xie et al. 2016), absolute error reductions of 1.15% and 1.71% are observed in CR and AR, respectively. Compared with the best results from the segmentation-recognition integrated strategy (Zhou et al. 2014), around 2.3% absolute error reductions are made in both CR and AR. Our system achieves a slightly lower result than the segmentation-recognition integrated strategy on the comp set; it is severely challenging for the segmentation-free strategy to perform well on the comp set since there is a noticeable out-of-vocabulary problem. If we remove all out-of-vocabulary characters from the comp set, our system achieves 94.65% and 95.65% in AR and CR, respectively, even outperforming the best system of Vision Objects (Yin et al. 2013).

Table 2.3 Role of language models on both test set and comp set (%)

            No LM              bigram             trigram
Type        AR       CR        AR       CR        AR       CR
Test set
3-layer     94.81    95.52     96.13    96.84     96.83    97.45
5-layer     95.08    95.63     96.34    96.92     97.04    97.58
6-layer     95.30    95.82     96.45    97.00     97.05    97.55
comp set
3-layer     88.06    89.40     91.42    92.87     93.00    94.28
5-layer     88.91    90.00     92.05    93.20     93.37    94.44
6-layer     89.12    90.18     92.12    93.25     93.40    94.43

Table 2.4 Comparisons with previous results (%)

                                   Test set           comp set
Methods                            AR       CR        AR        CR
Wang2012 (Wang et al. 2012a)       91.97    92.76     –         –
Zhou2013 (Zhou et al. 2013)        93.75    94.34     94.06     94.62
Zhou2014 (Zhou et al. 2014)        94.69    95.32     94.22     94.76
VO-3 (Yin et al. 2013)             –        –         94.49     95.03
Xie2016 (Xie et al. 2016)          95.34    96.40     92.88a    95.00a
Our method                         97.05    97.55     93.40     94.43
Our methoda                        –        –         94.65a    95.65a

a Removing all characters in the comp set which are not covered by the train set


2.4.6 Error Analysis
We inspect the recognition result of each text line. We arrange the ARs of all textlines into 11 intervals and count the number of textlines that fall into each interval, as shown in Fig. 2.7. The width of each interval is 5% when the AR is larger than 50%. There are 113 text lines in the comp set whose ARs are lower than 75%, and we inspect all of them to get some insights.

Generally, the errors are caused by three challenging issues, as shown in Fig. 2.8. First of all, there are cursively written samples, where two or more characters are even joined as a ligature. Moreover, severely skewed or slanted textlines are not easy to deal with, since there are limited training samples covering such phenomena. In addition, there are inadequate samples for English letters/words. It may be helpful to further reduce or alleviate such issues.

Fig. 2.7 Illustration of the ARs using interval histogram

Fig. 2.8 Typical errors and reasons in the comp set (errors are marked with [·]). (a) Cursive or skewed textline. (b) English words


2.5 Proposed RNN Neuron
2.5.1 Architecture
A new recurrent neuron named the Gated Memory Unit (GMU) is proposed, which combines the memory cell of LSTM and the interpolation gate of GRU. A GMU consists of a memory cell and two gates (a memorial gate and an operating gate). The memory cell plays the role of the CEC, which is used for the transmission of historical information and error messages. The memorial gate is mainly responsible for updating the contents of the memory cell; it controls how much of the history should be retained. The operating gate is primarily responsible for hidden state generation; it mixes the history and the current input transformation using learned weights. These two gates can be run independently, which makes the unit easy to parallelize.

In Fig. 2.9, the GMU is unfolded in time steps. During forward propagation, the history information is saved into the cell. Since the cell is self-connected, long dependencies can be encoded. The cell also serves as a CEC channel in the back propagation process, avoiding the gradient vanishing problem. Compared with the standard RNN neuron, GMU uses an additive component to achieve the evolution from time t to time t + 1. The standard RNN state always comes from the former state and the current information. In contrast, GMU can keep the previous contents and recursively integrate fresh information.

Fig. 2.9 GMU unfolded in time steps



Similar to LSTM, GMU uses the memory cell as a highway channel. The hidden state is derived from the contents of the memory cell with a proper scaling operation. The difference is that the range of information in the GMU cell is ensured to be stable, since an interpolation is used.

Similar to GRU, GMU uses an interpolation structure to merge the history information and the current information; both of them have a simple arithmetic component. Different from GRU, GMU uses the memory cell to store history information, rather than the activation value itself, which better guarantees the independence of the memory cell and the hidden state.

2.5.2 Forward Propagation
Memorial gate in GMU uses a logistic sigmoid transformation:

$$a_t^k = \sum_{l=1}^{I} w_{l,k}^{x,m}\, x_t^l + \sum_{j=1}^{H} w_{j,k}^{h,m}\, h_{t-1}^j + w_k^{c,m}\, c_{t-1}^k,$$

$$m_t^k = \sigma\big(a_t^k\big).$$

Similarly, operating gate also employs a sigmoid:

$$b_t^k = \sum_{l=1}^{I} w_{l,k}^{x,p}\, x_t^l + \sum_{j=1}^{H} w_{j,k}^{h,p}\, h_{t-1}^j + w_k^{c,p}\, c_{t-1}^k,$$

$$p_t^k = \sigma\big(b_t^k\big).$$

Input transformation can be computed as:

$$\tilde{i}_t^k = \sum_{l=1}^{I} w_{l,k}^{x,i}\, x_t^l + \sum_{j=1}^{H} w_{j,k}^{h,i}\, h_{t-1}^j,$$

$$i_t^k = h\big(\tilde{i}_t^k\big).$$

The hidden state or activation is given by interpolation:

$$h_t^k = \big(1 - m_t^k\big)\, c_{t-1}^k + m_t^k\, i_t^k.$$

Memory cell is updated using:

$$c_t^k = \big(1 - p_t^k\big)\, i_t^k + p_t^k\, c_{t-1}^k.$$
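Putting the forward equations together, one GMU time step can be sketched as follows; the parameter names and the dictionary layout are hypothetical, and the w^{c,·} entries are the peephole vectors above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gmu_step(x, h_prev, c_prev, P):
    """One GMU time step following the forward equations above.
    P maps names to parameters: 'Wxm','Whm','Wxp','Whp','Wxi','Whi' are weight
    matrices and 'wcm','wcp' are peephole vectors."""
    m = sigmoid(P['Wxm'] @ x + P['Whm'] @ h_prev + P['wcm'] * c_prev)  # memorial gate
    p = sigmoid(P['Wxp'] @ x + P['Whp'] @ h_prev + P['wcp'] * c_prev)  # operating gate
    i = np.tanh(P['Wxi'] @ x + P['Whi'] @ h_prev)                      # input transformation
    h = (1.0 - m) * c_prev + m * i                                     # hidden state (interpolation)
    c = (1.0 - p) * i + p * c_prev                                     # memory cell update
    return h, c
```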

2.5.3 Backward Propagation
As the critical step is to derive the gradients, we list them as follows. Firstly, define the derivative with respect to the hidden state as

$$\epsilon_t^{h,k} \overset{\mathrm{def}}{=} \frac{\partial L}{\partial h_t^k}.$$

Also define the derivative with respect to the memory cell as

$$\epsilon_t^{c,k} \overset{\mathrm{def}}{=} \frac{\partial L}{\partial c_t^k}.$$

The gradient of the memory cell can be derived as:

$$\epsilon_t^{c,k} = \epsilon_t^{h,k}\big(1 - m_t^k\big) + \epsilon_{t+1}^{c,k}\, p_t^k + \delta_{t+1}^{p,k}\, w_k^{c,p} + \delta_{t+1}^{m,k}\, w_k^{c,m}.$$

The gradient of the operating gate is:

$$\delta_t^{p,k} = \sigma'\big(b_t^k\big)\, \epsilon_t^{c,k}\big(i_t^k - c_{t-1}^k\big).$$

The gradient of the memorial gate:

$$\delta_t^{m,k} = \sigma'\big(a_t^k\big)\, \epsilon_t^{h,k}\big(c_{t-1}^k - i_t^k\big).$$

The gradient of the input transformation:

$$\delta_t^{\tilde{i},k} = h'\big(\tilde{i}_t^k\big)\Big(\epsilon_t^{h,k}\, m_t^k + \epsilon_t^{c,k}\big(1 - p_t^k\big)\Big).$$

2.5.4 Experimental Setup
Unlike Sect. 2.4.4, this section simplifies the experimental process. Firstly, we do not consider the comp set; we just set 2750 units for the output layer. Secondly, we don't