Deep Learning: Fundamentals, Theory and Applications
Kaizhu Huang, Amir Hussain, QiuFeng Wang, Rui Zhang
The purpose of this edited volume is to provide a comprehensive overview of the fundamentals of deep learning, introduce the widely used learning architectures and algorithms, present its latest theoretical progress, discuss the most popular deep learning platforms and data sets, and describe how deep learning methodologies have brought great breakthroughs in various applications of text, image, video, speech and audio processing.
Deep learning (DL) is widely considered the next generation of machine learning methodology. DL has attracted much attention and achieved great success in pattern recognition, computer vision, data mining, and knowledge discovery owing to its capability to learn high-level abstract features from vast amounts of data. This book not only attempts to provide a general roadmap and guidance to current deep learning methodologies, but also presents the challenges and envisions new perspectives that may lead to further breakthroughs in this field.
This book will serve as a useful reference for senior (undergraduate or graduate) students in computer science, statistics and electrical engineering, as well as others interested in studying or exploring the potential of deep learning algorithms. It will also be of special interest to researchers in AI, pattern recognition, machine learning and related areas, alongside engineers interested in applying deep learning models to existing or new practical applications.
Cognitive Computation Trends 2 Series Editor: Amir Hussain Kaizhu Huang · Amir Hussain QiuFeng Wang Rui Zhang Editors Deep Learning: Fundamentals, Theory and Applications Cognitive Computation Trends Volume 2 Series Editor Amir Hussain School of Computing Edinburgh Napier University Edinburgh, UK Cognitive Computation Trends is an exciting new Book Series covering cuttingedge research, practical applications and future trends covering the whole spectrum of multidisciplinary fields encompassed by the emerging discipline of Cognitive Computation. The Series aims to bridge the existing gap between life sciences, social sciences, engineering, physical and mathematical sciences, and humanities. The broad scope of Cognitive Computation Trends covers basic and applied work involving bioinspired computational, theoretical, experimental and integrative accounts of all aspects of natural and artificial cognitive systems, including: perception, action, attention, learning and memory, decision making, language processing, communication, reasoning, problem solving, and consciousness. More information about this series at http://www.springer.com/series/15648 Kaizhu Huang • Amir Hussain • QiuFeng Wang Rui Zhang Editors Deep Learning: Fundamentals, Theory and Applications 123 Editors Kaizhu Huang Xi’an JiaotongLiverpool University Suzhou, China QiuFeng Wang Xi’an JiaotongLiverpool University Suzhou, China Amir Hussain School of Computing Edinburgh Napier University Edinburgh, UK Rui Zhang Xi’an JiaotongLiverpool University Suzhou, China ISSN 25245341 ISSN 2524535X (electronic) Cognitive Computation Trends ISBN 9783030060725 ISBN 9783030060732 (eBook) https://doi.org/10.1007/9783030060732 Library of Congress Control Number: 2019930405 © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitati; on, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Preface Over the past 10 years, deep learning has attracted a lot of attention, and many exciting results have been achieved in various areas, such as speech recognition, computer vision, handwriting recognition, machine translation, and natural language understanding. Rather surprisingly, the performance of machines has even surpassed humans’ in some specific areas. 
The fast development of deep learning has already started impacting people’s lives; however, challenges still exist. In particular, the theory of successful deep learning has yet to be clearly explained, and realization of stateoftheart performance with deep learning models requires tremendous amounts of labelled data. Further, optimization of deep learning models can require substantially long times for realworld applications. Hence, much effort is still needed to investigate deep learning theory and apply it in various challenging areas. This book looks at some of the problems involved and describes, in depth, the fundamental theories, some possible solutions, and latest techniques achieved by researchers in the areas of machine learning, computer vision, and natural language processing. The book comprises six chapters, each preceded by an introduction and followed by a comprehensive list of references for further reading and research. The chapters are summarized below: Density models provide a framework to estimate distributions of the data, which is a major task in machine learning. Chapter 1 introduces deep density models with latent variables, which are based on a greedy layerwise unsupervised learning algorithm. Each layer of deep models employs a model that has only one layer of latent variables, such as the Mixtures of Factor Analyzers (MFAs) and the Mixtures of Factor Analyzers with Common Loadings (MCFAs). Recurrent Neural Networks (RNN)based deep learning models have been widely investigated for the sequence pattern recognition, especially the Long Shortterm Memory (LSTM). Chapter 2 introduces a deep LSTM architecture and a Connectionist Temporal Classification (CTC) beam search algorithm and evaluates this design on online handwriting recognition. Following on above deep learningrelated theories, Chapters 3, 4, 5 and 6 introduce recent advances on applications of deep learning methods in several v vi Preface areas. Chapter 3 overviews the stateoftheart performance of deep learningbased Chinese handwriting recognition, including both isolated character recognition and text recognition. Chapters 4 and 5 describe application of deep learning methods in natural language processing (NLP), which is a key research area in artificial intelligence (AI). NLP aims at designing computer algorithms to understand and process natural language in the same way as humans do. Specifically, Chapter 4 focuses on NLP fundamentals, such as word embedding or representation methods via deep learning, and describes two powerful learning models in NLP: Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN). Chapter 5 addresses deep learning technologies in a number of benchmark NLP tasks, including entity recognition, supertagging, machine translation, and text summarization. Finally, Chapter 6 introduces Oceanic data analysis with deep learning models, focusing on how CNNs are used for ocean front recognition and LSTMs for sea surface temperature prediction, respectively. In summary, we believe this book will serve as a useful reference for senior (undergraduate or graduate) students in computer science, statistics, electrical engineering, as well as others interested in studying or exploring the potential of exploiting deep learning algorithms. It will also be of special interest to researchers in the area of AI, pattern recognition, machine learning, and related areas, alongside engineers interested in applying deep learning models in existing or new practical applications. 
In terms of prerequisites, readers are assumed to be familiar with basic machine learning concepts including multivariate calculus, probability and linear algebra, as well as computer programming skills. Suzhou, China Edinburgh, UK Suzhou, China Suzhou, China March 2018 Kaizhu Huang Amir Hussain QiuFeng Wang Rui Zhang Contents 1 Introduction to Deep Density Models with Latent Variables . . . . . . . . . . . Xi Yang, Kaizhu Huang, Rui Zhang, and Amir Hussain 2 Deep RNN Architecture: Design and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . Tonghua Su, Li Sun, QiuFeng Wang, and DaHan Wang 3 Deep Learning Based Handwritten Chinese Character and Text Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XuYao Zhang, YiChao Wu, Fei Yin, and ChengLin Liu 4 Deep Learning and Its Applications to Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Haiqin Yang, Linkai Luo, Lap Pong Chueng, David Ling, and Francis Chin 1 31 57 89 5 Deep Learning for Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . 111 Jiajun Zhang and Chengqing Zong 6 Oceanic Data Analysis with Deep Learning Models . . . . . . . . . . . . . . . . . . . . . 139 Guoqiang Zhong, LiNa Wang, Qin Zhang, Estanislau Lima, Xin Sun, Junyu Dong, Hui Wang, and Biao Shen Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 vii Chapter 1 Introduction to Deep Density Models with Latent Variables Xi Yang, Kaizhu Huang, Rui Zhang, and Amir Hussain Abstract This chapter introduces deep density models with latent variables which are based on a greedy layerwise unsupervised learning algorithm. Each layer of the deep models employs a model that has only one layer of latent variables, such as the Mixtures of Factor Analyzers (MFAs) and the Mixtures of Factor Analyzers with Common Loadings (MCFAs). As the background, MFAs and MCFAs approaches are reviewed. By the comparison between these two approaches, sharing the common loading is more physically meaningful since the common loading is regarded as a kind of feature selection or reduction matrix. Importantly, MCFAs can remarkably reduce the number of free parameters than MFAs. Then the deep models (deep MFAs and deep MCFAs) and their inferences are described, which show that the greedy layerwise algorithm is an efficient way to learn deep density models and the deep architectures can be much more efficient (sometimes exponentially) than shallow architectures. The performance is evaluated between two shallow models, and two deep models separately on both density estimation and clustering. Furthermore, the deep models are also compared with their shallow counterparts. Keywords Deep density model · Mixture of factor analyzers · Common component factor loading · Dimensionality reduction 1.1 Introduction Density models provide a framework for estimating distributions of the data and therefore emerge as one of the central theoretical approaches for designing machines (Rippel and Adams 2013; Ghahramani 2015). One of the essential X. Yang · K. Huang () · R. Zhang Xi’an JiaotongLiverpool University, Suzhou, China email: xi.yang@xjtlu.edu.cn; kaizhu.huang@xjtlu.edu.cn; rui.zhang02@xjtlu.edu.cn A. 
Hussain School of Computing, Edinburgh Napier University, Edinburgh, UK email: a.hussain@napier.ac.uk © Springer Nature Switzerland AG 2019 K. Huang et al. (eds.), Deep Learning: Fundamentals, Theory and Applications, Cognitive Computation Trends 2, https://doi.org/10.1007/9783030060732_1 1 2 X. Yang et al. probabilistic methods is to adopt latent variables, which reveals data structure and explores features for subsequent discriminative learning. The latent variable models are widely used in machine learning, data mining and statistical analysis. In the recent advances in machine learning, a central task is to estimate the deep architectures for modeling the type of data which has the complex structure such as text, images, and sounds. The deep density models have always been the hot spot for constructing sophisticated density estimates. The existed models, probabilistic graphical models, not only prove that the deep architectures can create a better prior for complex structured data than the shallow density architectures theoretically, but also has practical significance in prediction, reconstruction, clustering, and simulation (Hinton et al. 2006; Hinton and Salakhutdinov 2006). However, they often encounter computational difficulties in practice due to a large number of free parameters and costly inference procedures (Salakhutdinov et al. 2007). To this end, the greedy layerwise learning algorithm is an efficient way to learn deep architectures which consist of many latent variables layers. With this algorithm, the first layer is learned by using a model that has only one layer of latent variables. Especially, the same scheme can be extended to train the following layers at a time. Compared with previous methods, this deep latent variable model has fewer free parameters by sharing the parameters between successive layers, and a simpler inference procedure due to a concise objective function. The deep density models with latent variables are often used to solve unsupervised tasks. In many applications, the parameters of these density models are determined by the maximum likelihood, which typically adopts the expectationmaximization (EM) algorithm (Ma and Xu 2005; Do and Batzoglou 2008; McLachlan and Krishnan 2007). In Sect. 1.4, we shall see that EM is a powerful and elegant method to find the maximum likelihood solutions for models with latent variables. 1.1.1 Density Model with Latent Variables Density estimation is a major task in machine learning. In general, the most commonly used method is Maximum Likelihood Estimate (MLE). In this way, we N 1 can establish a likelihood function L(μ, Σ) = n=1 ln p(yn μ, Σ). However, the derivation of directly calculating the likelihood functions has computational difficulties because of the very high dimensions of Σ. Thus, a set of variables x is defined to govern multiple y, and when the distribution of p(x) is found, p(y) can be determined by the joint distribution over y and x. Typically, the covariates Σ are ruled out. In this setting, x is assumed to affect the manifest variables (observable variables), but it is not directly observable. Thus, x is socalled the latent variable (Loehlin 1998). Importantly, the introduction of latent variables allows the formation of complicated distributions from simpler components. In statistics, 1y n  N observation data, μ  mean, Σ  covariates 1 Introduction to Deep Density Models with Latent Variables 3 Table 1.1 Different types of latent variable models. 
We fucose on the Latent Profile Analysis and the Latent Class Analysis, where the latent variables are treated as a multinomial distribution Manifest variables Continuous Discrete Latent variables Continuous Factor analysis Latent trait analysis Discrete Latent profile analysis Latent class analysis latent variable models are probabilistic models that represent the existence of a set of latent variables and relate them to manifest variables (Vermunt and Magidson 2004). These models are typically grouped according to whether the nature of the manifest and latent variables are discrete or continuous (see Table 1.1) (Galbraith et al. 2002; Everett 2013). Whilst, they are also widely applied to analyze data, especially in the presence of repeated observations, multilevel data, and panel data (Bishop 1998). In this chapter, we shall see the finite mixture models which based on discrete latent variables (details in Sect. 1.2). This finite mixture model is a convex combination of more than one density functions and comes up with an expression of heterogeneity in a finite number of latent classes, such as the Mixtures of Factor Analyzers (MFAs) (Ghahramani and Hinton 1996) and the Mixtures of Factor Analyzers with Common Loadings (MCFAs) (Baek et al. 2010). 1.1.2 Deep Architectures via Greedy LayerWise Learning Algorithm The latent variable models are usually expressed as a shallow architecture which just has one visible layer and one single hidden layer, such as a directed acyclic graph (see in Fig. 1.1). As shown in Sects. 1.2.4.1 and 1.2.4.2, shallow architectures has achieved good performances in practice. Despite all this, shallow architectures also have a serious problem which is very inefficient in dealing with the complex structural data or large amounts of hidden units (Bengio et al. 2007). In contrast, the deep architectures have multiple hidden layers to help in learning better representations of the complicated hidden units. A directed deep architecture is illustrated in Fig. 1.2. Also, deep architectures reduce the need for handcrafted features which is a very timeconsuming process requiring expert knowledge. Fig. 1.1 A directed acyclic graph represents a shallow density model which has only one hidden layer 4 X. Yang et al. Fig. 1.2 A multilayer directed acyclic graph represents a deep density model which has three hidden layers and one visible layer However, it is difficult to train all layers at once when a multilayered architecture to build generative models of data (Arnold and Ollivier 2012). To this end, the greedy layerwise procedures are developed for training deep generative models by using the same criterion for optimizing each layer starting from the bottom and for transferring the problem upwards to the next layer, socalled the layerwise learning algorithm (Arnold and Ollivier 2012). Thus, the latent variable models can be an ideal criterion for a deep architecture, as well as providing a framework for building more complex probability distributions (Bengio et al. 2009). Specifically, although a deep latent variable model has many layers of latent variables, the learning procedure is just using a model has only one layer of latent variables for optimizing each layer (Tang et al. 2012). Besides, the deep architectures can also be used for a finite mixture model, which means each subpopulation is assumed to come from the manifest variables having more than one distributions. 
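To make the layer-wise procedure concrete, the following schematic sketch (in Python) trains such a model one layer at a time, from the bottom up. The helpers fit_single_layer and sample_posterior are hypothetical placeholders for fitting a one-layer latent variable model (e.g., by EM) and for drawing factors from its posterior; they do not refer to any particular library.

# Schematic sketch of greedy layer-wise training for a deep latent-variable model.
# fit_single_layer and sample_posterior are hypothetical helpers (not library calls).
def train_deep_density_model(data, layer_sizes):
    """Train one latent-variable layer at a time, bottom up."""
    layers = []
    layer_input = data                      # observations y for the first layer
    for latent_dim in layer_sizes:
        # Fit a single-layer model (e.g., an MFAs/MCFAs via EM) to the current input.
        model = fit_single_layer(layer_input, latent_dim)
        layers.append(model)
        # The sampled posterior factors become the "observations" (the aggregated
        # posterior) that the next layer must model.
        layer_input = sample_posterior(model, layer_input)
    return layers

The essential point of the sketch is that only a single-layer model is ever optimized at once, while the sampled posterior factors play the role of the observed data for the layer above.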
Deep density models also have very important practical significance in many disciplines, especially, attracted considerable attention in unsupervised learning (Patel et al. 2015; Tang et al. 2013; Yang et al. 2017). 1.1.3 Unsupervised Learning In practical applications, the (deep) density models with latent variables are typically used for solving unsupervised tasks, especially in dimensionality reduction, and clustering. The (deep) density model with latent variables is a probabilistic algorithm which based on modeling the density of the observed data and assumes that there are some potential/unobservable processes to manage data generation. Clustering is a method of putting similar data points together, and probabilistic unsupervised learning builds a generation model that describes the item’s clustering. Wherein, each cluster can be represented by a precise component distribution (multiple distributions for a deep model). Hence, each data point is generated from the selection of a component matters (Law et al. 2004; Figueiredo and Jain 2002). 1 Introduction to Deep Density Models with Latent Variables 5 The idea is that all the same types of data points belonging to the same cluster (all from the same distribution) are more or less equivalent, and any differences between them are accidental (Bishop 2006). In dimensionality reduction, the primary role of the latent variables is to allow a complicated distribution over observed variables constructed from simpler conditional distributions. Importantly, the dimensional of latent variables are always lower than the observable variables. Therefore, the higherdimensional observable variables are reduced to the lowdimensional latent variables to represent a model. 1.2 Shallow Architectures of Latent Variable Models This section introduces the density models with latent variables that explicitly shows how inferences of the models correspond to operations in a single layer of a deep architecture. Density models are widely used in data mining, pattern recognition, machine learning, and statistical analysis. As a basic model, Factor analysis (FA) is commonly used to define a appropriate density distribution of data to promote the wellknown mixtures of Gaussians (Everitt 1984). MFAs are improved density estimators that model by allowing each Gaussian in the mixture to be represented in a different lowerdimensional manifold. In particular, MFAs simultaneously perform clustering and dimensionality reduction of the data (Montanari and Viroli 2011; McLachlan and Peel 2000). Exploiting similar ideas of MFAs, MCFAs are proposed as another mixture framework by sharing a common component loading. Importantly, unlike MFAs which specify a multivariate standard normal prior for the latent factors over all the components, MCFAs exploit on each latent unit Gaussian density distributions whose mean and covariance could be learned from data. On the other hand, sharing a common factor loading for all the components reduces the parameters significantly. While, since a common factor loading can be considered as a feature selection or dimensionality reduction matrix, it can be well justified and physically more meaningful (Baek and McLachlan 2011; Tortora et al. 2016). Since MFAs and MCFAs are implemented involving steps that attribute the postulated subpopulations to individual observations, these models are typically used in unsupervised learning or clustering procedures, and achieved impressive performance (Mclanchlan et al. 2003; Yang et al. 2017). 
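To give a rough sense of how much sharing the common loading reduces the number of free parameters, the following back-of-the-envelope count compares the two models. It is a sketch only: it keeps the dominant terms, treats the noise covariance as a p-dimensional diagonal, and ignores the C - 1 mixing proportions; exact counts depend on the precise parameterization.

# Rough free-parameter counts for MFAs versus MCFAs (dominant terms only).
def mfa_param_count(p, q, C):
    # Per component: loading W_c (p x q), mean mu_c (p), diagonal noise Psi_c (p).
    return C * (p * q + p + p)

def mcfa_param_count(p, q, C):
    # Shared: common loading A (p x q) and diagonal noise Psi (p).
    # Per component: factor mean xi_c (q) and symmetric covariance Omega_c (q(q+1)/2 entries).
    return p * q + p + C * (q + q * (q + 1) // 2)

print(mfa_param_count(1024, 10, 4))   # 49152
print(mcfa_param_count(1024, 10, 4))  # 11524

For instance, with p = 1024, q = 10 and C = 4 components, the component-specific loadings of MFAs already account for roughly four times as many parameters as the entire MCFAs model.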
1.2.1 Notation We begin by considering the problem of identifying groups/clusters of data points in a multidimensional space. Suppose we have a dataset y consisting of N observations which have a pdimensional vector {y1 , y2 , . . . , yp } of each feature variable. By introducing latent variables, the manifest distribution p(y) can be signified in terms of a qdimensional vector {z1 , z2 , . . . , zq } of latent variables z, where q is a small 6 X. Yang et al. number than p (Smaragdis et al. 2006). Through this process, the joint distribution P (y, z) is decomposed into the conditional distribution of the feature variables given the latent variables and the product of the marginal distribution of the latent variables p(z) (Bishop 1998). P (y, z) = p(yz)p(z) = p(z) p p(yi z). (1.1) i=1 As shown in Fig. 1.3, the latent variable models can be graphically represented by a directed graph. By considering a mixture of C multinomial distributions, the density of y could be modeled as a finite mixture model P (y; θ ) = C πc p(yc; θ c ), where c = 1, . . . , C. (1.2) c=1 These distributions are referred to as components of this model (or subpopulations of observations) and describe the pvariate density function (Figueiredo and Jain 2002). The mixture distribution can be expressed by a simple graph, as shown in Fig. 1.4. These components must belong to the same parametric family of distributions but with different parameters θ . θ c denotes the observations’ parameters which are associated with the component c. For instance, the mixture components belong to Gaussian distributions, and then each component has different means and variances. Additionally, the parameters are random variables in a Bayesian setting, and prior probability distributions p(yc; θ c ) are placed over the variables (Bishop 1998; Loehlin 1998). π c = p(c) denotes the C mixture weights and satisfies the requirement that probabilities sum to 1. Fig. 1.3 The factorization property of Eq. 1.1 can be expressed in terms of a directed graph, in which the feature variables yi are independent of the latent variables z Fig. 1.4 A simple mixture model is expressed in terms of a Bayesian network 1 Introduction to Deep Density Models with Latent Variables 7 Fig. 1.5 A mixture of latent variables model is graphically expressed by the Bayesian network. The feature variables yi are conditionally independent of the given mixture weights c and latent variables z Then, a mixture of latent variables model can be obtained by combining ingredients from the technical of mixture models with the idea of latent variables (Fokoué 2005). Consequently, the joint distribution P (y, z) is derived as follows and the corresponding Bayesian network is shown in Fig. 1.5. P (y, z) = C p(yp(c), z)p(c)p(zc) = c=1 C c=1 πc p(zc ) p p(yi z). (1.3) i=1 This mixture model is explicitly computed with discrete latent variables. Hence, the integral is replaced by a sum. 1.2.2 Mixtures of Factor Analyzers We first introduce a globally nonlinear latent variable model which is referred to as Mixture of Factor Analyzers (MFAs). MFAs are considered as extending the traditional Factor Analysis by ideas of the analysis of the finite mixture of distributions. To separate the observations independently into c nonoverlapping components, the MFAs approach is modeled as y= C μc + Wc z + c with probability π c (c = 1, . . . 
, C), (1.4) c=1 where μc is a pdimensional mean vector associated with the component c; Wc is a p × q factor loading matrix of the cth component, where q < p; and πc = p(c) denotes the mixing proportion. Each (unobservable) factor z is distributed independently by a normal distribution with zero mean and a q × q identity covariance matrix, N(0, Iq ). The noise model c is also independent and assumed as a Gaussian distribution with zero mean and a qdimensional diagonal covariance matrix, N (0, Ψ c ). Given the latent variables and the component indicator variables, the conditional distribution of observations p(yc, z) follows a Gaussian distribution which has mean Wc z + μc and variance Ψ c . 8 X. Yang et al. The joint distribution is given by p(yc, z)p(zc)p(c), and the density of the observed data P (y; θ ) is obtained by summing up all mixture components. Then, the MFAs model is obtained by marginalizing over the latent variables p(yc) = z p(yc, z)p(zc)dz = N(y; μc , Wc WTc + Ψ c ), (1.5) C C P (y; θ ) = c=1 πc p(yc, z)p(zc) = c=1 πc N (y; μc , Wc WTc + Ψ c ), (1.6) where θ denotes the model parameter vector. It consists of the mixture weight π c , the means of the component μc , the factor loading Wc and the covariance of component matrices Ψ c (Montanari and Viroli 2011) ⎞C πc ⎟ ⎜ μc ⎟ . θc = ⎜ ⎝ Wc (vec) ⎠ Ψ c (diag) c=1 ⎛ (1.7) Therefore, the MFAs can be considered as a globally nonlinear latent variable model, where C Gaussian factors are fitted on the data. 1.2.2.1 Maximum Likelihood After determining the “bestfitting” distribution (Eq. 1.6), the parameters will be estimated for that distribution. Maximum likelihood estimation (MLE) is a common technique for estimating the parameters of a probability distribution. In other words, the MLE can be known as to maximize the sample likelihood by estimating the values of the parameters. These unknown parameters are contained in a vector θ (as Eq. 1.7). Loosely speaking, the goal of MLE is to maximize a likelihood function of the sample data. Suppose we have a sample of independent and identically distributed (iid) random variables, {y1 ; y2 ; . . . ; yN }, which is described by the MFAs model (Eq. 1.6). The basic likelihood function is defined as P (y; θ ) = C c=1 πc N N(yn ; μc , Wc WTc + Ψ c ). (1.8) i=1 Since maximizing the product is very tedious, logarithms are often used to turn multiplications into summation. It will be equivalent to maximise the loglikelihood since the logarithm is an increasing function (Montanari and Viroli 2011) L(θ ) = log P (y; θ ) = N C c=1 n=1 log{πc N(yn μc , Wc WTc + Ψ c )}, θ̂ = arg max L(θ ). θ (1.9) (1.10) 1 Introduction to Deep Density Models with Latent Variables 9 Here, θ̂ denotes the estimated parameters when the empirical loglikelihood has a maximum value. 1.2.2.2 Maximum A Posteriori Maximum A Posteriori (MAP) is also a method for estimating variables in the probability distributions settings. The MAP is closely related to the MLE, which can be regarded as a regularization of the MLE. However, the MAP is more interested in a posterior distribution, not only the likelihood. For inference, MAP usually comes up with Bayesian setting, and the posterior over the components can be found by recalling Bayes rule q(z, cy) = q(zy, c)q(cy), q(cy) = p(yc)p(c) C h=1 p(yh)p(h) ∝ p(yc)p(c). 
(1.11) (1.12) More concretely, the posterior over the latent factors is also a multivariate Gaussian density on z given y and c: q(zy, c) = N(z; κ c , V−1 c ), (1.13) where T V−1 c = I + Wc Ψ c Wc , T −1 κ c = V−1 c Wc Ψ c (y − μc ). (1.14) A point estimate can be made by selecting the component c with maximum a posterior probability ĉ = arg max p(c)p(cy). c (1.15) Then, the likelihood in the MLE (Eq. 1.10) is replaced by the posterior. Consequently, MAP estimation will back at MLE equation θ̂ MAP = arg max θ N C c=1 n=1 log{p(yc)p(c)} = θ̂ MLE . (1.16) 10 X. Yang et al. 1.2.3 Mixtures of Factor Analyzers with Common Factor Loadings Suppose the observations y follow a Gaussian distribution N (y; μc , Σ c ). As a special case of MFAs, MCFAs further assumes the additional restrictions: μc = Aξ c ; Σ c = AΩ c AT + Ψ ; Ψc = Ψ; Wc = AKc . (1.17) Thereby, we get a new linear model by rewriting Eq. 1.4 y= C Azc + with probability π c (c = 1, . . . , C). (1.18) c=1 Conventionally, A is the p × q common loading matrix, and zc is refered to as hidden variables or latent factors. Different with the latent factor of MFAs, this common loading matrix can be regarded as a global transformation matrix, and the Gaussian distributions over each latent factor whose mean and covariance could be learned from data. These settings not only reduce the parameters significantly but also estimate the density accurately. Besides, a multivariate standard Gaussian prior of MFAs may limit the flexibility. To use this framework, the MCFAs are defined to a directed generative model, as follows p(c) = πc , C πc = 1; c=1 ∼ N(0, Ψ ); yc, z ∼ N(Azc , Ψ ), (1.19) where Ψ is a p × p diagonal matrix which represents the variances of the independent noise. c denotes the component indicator over the C total components of the mixture, and p(c) = πc is a mixing proportion. The prior of the latent factor z given c follows a Gaussian density distribution: p(zc) = N(zc ; ξ c , Ω c ). Here, ξ c is a qdimension vector and Ω c is a q × q positive definite symmetric matrix. With the above definitions, the density of y given c can be written as a shallow form by integrating out the latent variables z p(yc) = z p(yc, z)p(zc)dz μc = Aξ c , = N (y; μc , Σ c ), Σ c = AΩ c AT + Ψ . (1.20) 1 Introduction to Deep Density Models with Latent Variables 11 Finally, the marginal density of MCFAs model on observed data y is then given by a mixture which is obtained by summing the joint distribution p(yc, z)p(zc)p(c) p(y) = C p(c)p(yc), c=1 P (y; θ c ) = C πc N(y; Aξ c , AΩ c AT + Ψ ). c=1 Apparently, the model of MCFAs is also a multivariate Gaussian with constrained mean and covariance. Here, the parameter vector consists of the global parameters, the common loading matrix Ac , the covariance matrix Ψ , and the local parameters for each mixture component ⎛ ⎞C πc θ c = ⎝ ξ c (vec) ⎠ . Ω c (Sym) c=1 1.2.3.1 (1.21) Maximun Likelihood In Sect. 1.2.2.1, the MLE of MFAs was introduced. We now turn to the MLE of MCFAs with same inference procedure. A sample of iid random variables was described by the MCFAs model (Eq. 1.21). The loglikelihood function is given by L(θ) = log P (y; θ ) = N C log{πc N(yn Aξ c , AΩ c AT + Ψ )}, (1.22) c=1 n=1 When the empirical loglikelihood has a maximum, the maximizing a function is shown as follows θ̂ = arg max L(θ). θ (1.23) A general technique for finding maximum likelihood estimators in latent variable models is the Expectationmaximization(EM) algorithm. In Sect. 
1.4, we shall see that the Maximum likelihood learning can straightforward use the EM algorithm. 1.2.3.2 Maximum A Posteriori In Maximum A Posteriori (MAP), the formulation of the posterior can be expressed using the Bayes rule: 12 X. Yang et al. q(cy; θ c ) = πc N(y; Aξ c , AΩ c AT + Ψ ) . C T πh N(y; Aξ h , AΩ h A + Ψ ) (1.24) h=1 The component c can be selected by maximizing a posterior probability ĉ = arg max p(c)q(cy; θ c ). c (1.25) The posterior of the latent factors is also a multivariate Gaussian density on z given the observations y and the component c: q(zy, c) = N(z; κ c , V−1 c ), (1.26) where −1 T −1 A, V−1 c = Ωc + A Ψ T −1 (y − Aξ ). κ c = ξ c + V−1 c c A Ψ (1.27) More concretely, the posterior probability over the components of the mixture can be found by p(cy; θ c ) = pr{ωc = 1y}, where ωc = 1 if y belongs to the cth component, otherwise ωc = 0. 1.2.4 Unsupervised Learning The MFAs model performs the dimensionality reduction of data by making locally linear assumptions. Therefore, the model transforms the points of different clusters into different subspaces, as show in Fig. 1.6. The principle is to maximize the similarity of points from the same cluster. Importantly, the MCFAs model performs the dimensionality reduction of data by making global linear assumptions, which has capability to transform the different clusters into a subspace, as shown in Fig. 1.7. Different with MFAs, MCFAs not only follow the principle of maximizing the similarity but also to maximize the distance between clusters. We conduct extensive experiments on a variety of datasets to evaluate the performance of both algorithms, including artificial data, gray images, and digitized aerial image. The following datasets are used in our empirical experiments. • ULC3: The urban land cover (ULC) data is used to classify a high resolution aerial image which consists of 3 types with 273 training samples, 77 test samples and 147 attributes (Johnson and Xie 2013; Johnson 2013). • Coil4proc: This dataset contains images for 4 objects discarding the background and each object has 72 samples (Nene et al. 1996). The images are down 1 Introduction to Deep Density Models with Latent Variables 13 Fig. 1.6 The sketch of clustering and dimensionality reduction. MFAs drop the data points of different clusters into different subspaces and cluster them at the same time. Different color represents different cluster Fig. 1.7 The sketch of clustering and dimensionality reduction. MCFAs performs the dimensionality reduction and clustering simultaneously.Different color represents different cluster sampled into 32 by 32 pixels and then reshaped to a 1024dimensional vector. There are just 248 samples in the training set and 40 samples in the test set. • Leuk72_3k: This dataset is an artificial dataset including 3 classes which have been drawn from randomly generated Gaussian mixtures. The Leuk72_3k has only 54 training samples and 18 test samples with 39 attributes. • USPS14: This handwriting digit data contains 1 to 4 digits images of size 16 by 16 pixels. Each image is reshaped to a 256dimensional vector. The training set includes 100 samples of each digit and the test set also consists of 100 of each digit. X. Yang et al. Average Loglikelihood (Training set) 14 100 ULC3 Dataset Coil4proc Leuk72_3k USPS14 101 102 103 MFAs 104 MCFAs Fig. 
1.8 Performance on various real data (on Training Set) in terms of the loglikelihood (the larger, the better) 1.2.4.1 Empirical Results Finite mixture models are able to present arbitrarily complex probability density functions, which fact makes them an excellent choice for representing complex classconditional (Figueiredo and Jain 2002). Empirically, the average loglikelihood as a criterion for examining the quality of the density estimates after modeling the density of the observed data. The empirical results are demonstrated for both MFAs and MCFAs. The model parameters are estimated over the training data by maximizing the loglikelihood value. By multiple trials, the average loglikelihood value on the training data is shown in Fig. 1.8. In order to intuitively observe the trend of the entire experimental results, the image results have to be treated as logarithmic, which is used to controlled all the results in the same range. On the testing data, the loglikelihood value obtained using only the model has been trained and did not update the model parameters, and the results are shown in Fig. 1.9. From the comparison of both results, the loglikelihood values obtained by the MCFAs model are lower than that of the MFAs on all datasets. Consequently, sharing a common loading can improve the true loglikelihood dramatically. 1.2.4.2 Clustering Then, the clustering error rate is demonstrated of both training and testing datasets and the best results are reported from multiple trials. In the experiments, both methods have been initialized by random assortment. Also, the number of mixes is set to be the same as the number of real categories. Comparing both MFAs and MCFAs results on the training data in Fig. 1.10, the results of these two methods are not significantly different on the Coil4proc dataset, and even get the same result on the Leuk72_3k dataset. However, comparing the results in Fig. 1.11, the MCFA Average Loglikelihood (Testing set) 1 Introduction to Deep Density Models with Latent Variables 100 ULC3 Dataset Coil4proc Leuk72_3k 15 USPS14 101 102 103 MFAs 10 MCFAs 4 Fig. 1.9 Performance on various real data (on Testing Set) in terms of the loglikelihood (the larger, the better) 0.35 Error Rate (Training set) MFAs 0.3 MCFAs 0.25 0.2 0.15 0.1 0.05 0 ULC3 Leuk72_3k Coil4proc Dataset USPS14 Fig. 1.10 Clustering error rate on 4 datasets. The best result is reported from each model on training set model still maintains good performance on the test dataset. On the whole, the results of MCFAs are consistently better than the MFAs. 1.3 Deep Architectures with Latent Variables In the above section, we defined the shallow architectures and also exhibit inferences to optimize a single layer. We now consider how to extend the shallow models by defining the deep density models. Firstly, the deep architectures needs to have many layers of latent variables. Secondly, parameters should be learned by the efficient greedy layerwise algorithms. In this section, two deep density models will be described, Deep Mixtures of Factor Analyzers (DMFAs) who adopts an 16 X. Yang et al. 0.35 Error Rate (Testing set) MFAs 0.3 MCFAs 0.25 0.2 0.15 0.1 0.05 0 ULC3 Leuk72_3k Coil4proc Dataset USPS14 Fig. 1.11 Clustering error rate on 4 datasets. The best result is reported from each model on testing sets MFAs in each hidden layer (McLachlan and Peel 2000; Tang et al. 2012) and Deep Mixtures of Factor Analyzers with Common loadings (DMCFAs) who adopts an MCFAs (Yang et al. 2017). 
Therefore, both deep models are the directed generative models and use the layerwise training procedure as approximations. The observation vector and the first hidden layer are treated as an MFAs/MCFAs model, and learning parameters are used by this unsupervised method. After fixing the first layer parameters, the priors of next layer MFAs/MCFAs are replaced by sampling the hidden units of the current layer MFA/MCFAs The same scheme can be extended to train the following layers. Compared with the shallow models with same scale mixtures (Collapse Models), the deep models have fewer free parameters and a simpler inference procedure. On one hand, the components of adjacent layers share the parameters; on the other hand, largescale mixtures can cause the objective function of a shallow model to be too complexity. 1.3.1 Deep Mixtures of Factor Analyzers Having formulated the MFAs model, we now show how to construct the MFAs into a deep architecture. In a shallow model, each FA in MFAs has an isotropic Gaussian prior in its factor, as well as a Gaussian posterior over each training sample. However, the posterior is generally nonGaussian when a posterior is aggregated for many training samples. If we replace the prior of each FA with a separate mixed model, this mixture model can learn to model an aggregated posterior rather than model an isotropic Gaussian, and the sketches are shown in Fig. 1.12. Therefore, it can improve a variational lower bound on the log probability of the training data (McLachlan and Peel 2000; Tang et al. 2012). According to this method, Tang et al. (2012) construct a DMFAs model by replacing the FA in the mixture with an MFAs model and even substitute the FA of the next mixtures. The graphical model 1 Introduction to Deep Density Models with Latent Variables 17 Fig. 1.12 The sketch of a mixture of the DMFAs’ higher layer for clustering and dimensionality reduction. Left: The aggregated posterior of a component in the lower layer is not a Gaussian distribution. Right: The higher layer has an ability to model a better aggregated posterior of the lower layer Fig. 1.13 Graphical models of a twolayer DMFA. The DMFAs is a deep directed graphical model utilizing the multilayer factor analyzers which are developed by adopting a MFAs model in each hidden layer of twolayer DMCFAs is visualized in Fig. 1.13. Importantly, the pivotal method is to sample the data regarding the posterior distributions of the current layer and treat it as the training data for the next layer. The observations y and the factors z in the first hidden layer are treated as the MFAs, and the first layer parameters θ (1) are used by this unsupervised method to learn. The aggregated posterior over z factor with a specific component c is given n (1) and ĉ are treated as training by N1 N n=1 p(z , cn = cyn ). For the second layer, z data by sampling posterior distribution (Eqs. 1.13 and 1.15). Then a more powerful MFAs prior replaces the standard multivariate normal prior (2) p(zc) = PMF A (z(1) c ; θ cs ). (1.28) The same scheme can be extended to training the following layers. In the second layer, some new symbols need to be defined: z(1) is a qdimension vector as the data input to the second layer; θ (2) cs emphasizes that a new parameters 18 X. Yang et al. 
vector in the second layer which is specific to component c of the first layer MFAs2 : The layer factors are denoted as a ddimension vector z(2) ; s is a new subcomponent indicator variable, and the total number of subcomponents is S satisfying S = C c=1 Mc ; mc = 1, . . . , Mc denotes the number of subcomponents corresponding to the cth first layer component; The second layer mixing proportions (2) (2) p(s) = πs are defined as p(cs )p(scs ), where Ss=1 πs = 1 and cs denotes the subcomponents corresponding to the c component. Then, the DMFAs prior is written as follows p(z; c) = p(c)p(mc c)p(zmc ). (1.29) The density of vectors z(1) follows the joint density over z(2) and s: p(z(1) , z(2) , s) = p(z(1) , cz(2) , s)p(z(2) s)p(s), (2) (2) (1.30) (2) p(z(1) , cs, z(2) ) = N(z(1) ; Ws z(2) + μs , Ψ s ), p(z(2) s) = N(0, I). (1.31) (2) (2) (2) q×d , Ψ q×q , μ Here, the new parameter vector θ (2) s ∈R s ∈ cs consists of Ws ∈ R q (2) d 3 R , z ∈ R . Specifically, since every s is just allowed to belong to one and only one c, we obtain the Gaussian density on the observed data y given z(1) and c (1) (1) + μ(1) p(yc, z(1) ) = N(y; W(1) c z c , Ψ c ), (1) (1) where, Wc ∈ Rp×q , Ψ c parameters. 1.3.1.1 (1) ∈ Rp×p , μc (1.32) ∈ Rp , z(1) ∈ Rq denote the first layer Inference For inference, the posterior distribution is computed in a similar fashion with Eqs. 1.11 and 1.13 q(z(2) , sz(1) , c) = q(z(2) z(1) , c, s)q(sz(1) , c) −1 (2) ), = N(z(2) ; κ (2) cs , Vcs (1.33) where (2)−1 Vcs (2)T (2)−1 = Id + Wcs Ψ cs −1 T (2) Wc s , −1 (2) (2) (2) κ (2) W(2) (z(1) − W(2) cs Ψ cs cs = Vcs cs μc ). 2 The 3d superscript represents which layer these variables belongs to denotes the ddimension subspace of second layer, where d < q. (1.34) 1 Introduction to Deep Density Models with Latent Variables 19 Here, The subscript emphasizes the subcomponent s which is specific to component c of the first layer, and Id is a ddimensional identity matrix. The posterior over the components can be found as follows q(sz(1) , c) ∝ p(z(1) , cs)p(s), ŝ = (1.35) arg max p(s)q(sz(1) ). s (1.36) For the second layer, p(z(1) ) and ĉ are treated as input data and initial labels when the first layer parameters is fixed and the Eq. 1.10 is maximized. According to the Eq. 1.28, the DMFAs formulation seeks to find a better prior p(zc) = (1) (1) (1) PMF A (z(1) ĉ; θ (2) cs ). Given the new data vectors {z1 ; z2 ; . . . ; zq }, maximizing the Eq. 1.8 with respect to the second layer parameters is equivalent to maximizing the density function P (z(1) ĉ; θ (2) cs ). The basic likelihood objective function of the second layer is defined as P (z(1) ĉ; θ (2) cs ) = s∈c q πs i=1 T (2) (2) (2) {N(z(2) + Ψ (2) s )}. i μs , Ws Ws (1.37) It is worth to denote that, at the second layer, each mixture model of C components derived from the first layer can be updated separately, since S second layer parameters are nonoverlapping and just allowed to belong to one and only one of the first layer component. θ̂ cs (2) = arg max log P (z(1) ; θ (2) cs ). (1.38) θcs (2) Despite the good performance in practice, the DMFAs model still has many drawbacks. Specifically, this model uses different loading matrices for different components, which may lead to overfitting in practical applications. Meanwhile, it also inherits the shortcoming of MFAs, that is, assuming the prior of each potential factor follows a standard Gaussian distribution, which may limit the flexibility and accuracy. 
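Before turning to the collapsed form, a minimal sketch of the layer-wise update described by Eqs. 1.28, 1.37 and 1.38 may be helpful. Here posterior_factors_and_labels and fit_mfa are hypothetical helpers standing in for the first-layer MAP inference (Eqs. 1.13 and 1.15) and an EM-based MFAs fit; the factor matrix is assumed to be a NumPy array so that Boolean indexing by component works.

# Sketch of the second-layer update for a two-layer DMFAs, following the layer-wise recipe.
# posterior_factors_and_labels and fit_mfa are hypothetical helpers, not library calls.
def train_second_layer(first_layer, data, n_subcomponents_per_c):
    # 1. Infer first-layer factors z^(1) and MAP component labels c-hat for each sample.
    z1, c_hat = posterior_factors_and_labels(first_layer, data)
    second_layer = {}
    # 2. Each first-layer component owns its own, non-overlapping second-layer mixture,
    #    so the S sub-mixtures can be fitted independently.
    for c in range(first_layer.n_components):
        z1_c = z1[c_hat == c]               # aggregated posterior samples for component c
        second_layer[c] = fit_mfa(z1_c, n_components=n_subcomponents_per_c[c])
    return second_layer

The design choice reflected here is exactly the one stated above: because every sub-component s belongs to one and only one first-layer component c, the C second-layer mixtures can be learned separately on their own aggregated posterior samples.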
1.3.1.2 Collapse Model While it is true that DMFAs can also be collapsed into a shallow form by integrating out the latent factors. According to Eq. 1.5, we obtain the collapse model after the first layer factors z(1) are integrated out: p(yz(2) , s) = z(1) p(yc, z(1) )p(z(1) s, z(2) )p(z(2) s)dz(1) T (2) (2) (1) (1) (2) (1) = N (y; W(1) + μ(2) + Ψ (1) c (Ws z s ) + μ c , Wc Ψ s W c c ). (1.39) 20 X. Yang et al. Then the final shallow form is obtained by further integrating out z(2) : p(ys) = z(2) (1) (2) p(yz(2) , s)dz(2) = N (y; ms , Σ s ), (1) (1) (2) (2)T ms = Wc μs + μc , Σ s =Wc (ψ (2) s +Ws Ws (1)T )Wc (1.40) (1) +Ψ c . (1.41) Finally, the marginal density of the shallowed model on observed data y is then given by a mixture of Gaussians: p(y) = S p(s)p(ys) = s=1 S πs N(y; ms , Σ s ). (1.42) s=1 Conventionally, θ s = {πs , μs , Ws , Ψ s , Wc , μc , Ψ c }S,C s=1,c=1 represent the parameters of the shallow form of DMFAs. In this case, the posterior probability of the shallowed MFA which collapses from a twolayer DMFAs for the sth mixture is given by p(sy) = πs p(ys)/p(y). We are also interested in the posterior distribution of the latent factor zs which is collapsed to a shallow form q(z(2) , sy) = N(zs ; κ s , V−1 s ), (2) T T (1.43) −1 (2) (2) −1 + W(1) Ψ (1) A(1) , + Ψ (2) V−1 c c s ) c s = (Ws Ws (2)T κ s = Ws μs + V−1 s A (2) (1)T Ψ (1)−1 (y − μc ). (1.44) (1.45) 1.3.2 Deep Mixtures of Factor Analyzers with Common Factor Loadings The same scheme was extended to training the DMCFAs model. By illustrating a mixture of the higher layer in Fig. 1.14, we can clearly see that a mixture model can better model a nonGaussian posterior component in the lower layer. The key improvement is that the same load matrix is used for each mixture component that uses the MCFAs as an aggregated prior. This would potentially further improve the performance. Since different loading matrices are used in the DMFAs, the number of parameters may be unmanageable when the data has a larger feature dimensional and/or smaller observations. Moreover different load matrices may not be physically essential or even unnecessary (Yang et al. 2017). Figure 1.15 presents an illustration of twolayer DMCFA’s graphical model which is constructed with two global parameters, the factor loading, and noise covariance matrices in the first layer. As shown in the graphical model, the common parameters are set in every latent unit at the next level. Based on this setting, the number of free parameters can be dramatically reduced compared to DMFAs, 1 Introduction to Deep Density Models with Latent Variables 21 Fig. 1.14 The sketch of a mixture of the higher layer of DMCFAs. Left: The aggregated posterior of a component in the lower layer is not a Gaussian distribution. Right: The higher layer has an ability to model a better aggregated posterior of the lower layer Fig. 1.15 Graphical models of a twolayer DMFA.The DMCFAs is a deep directed graphical model utilizing the multilayer factor analyzers which are developed by adopting a MCFAs model in each hidden layer even though MCFAs introduce the mean and variance matrices of latent factors. Therefore, DMCFAs model could be considerably interested in the tasks with a large number of clusters or with insufficient instances. Speaking in a more abstract term, we defined the DMCFAs model in the mathematical expressions. Firstly, an MCFAs prior is adopted to replace the prior of latent factors in the first layer. (1) p(zc) = PMCF A (zc ; θ (2) c ). 
(1.46) Here, new symbols are the same as the definition of DMFAs. Let’s be more concrete, the DMCFAs model can be written as p(s) = πs(2) , S (2) s=1 πs =1 (1.47) (2) p(z(2) s) = N(z(2) ; ξ (2) s , Ω s ), (1.48) (2) (2) p(z(1) , cs, z(2) ) = N(z(1) ; A(2) c z , Ψ c ), (1.49) p(yc, z(1) ) = N(y; A(1) z(1) , Ψ (1) ). (1.50) 22 X. Yang et al. (2) (2) Here, A(1) ∈ Rp×q , Ψ (1) ∈ Rp×p , z(1) ∈ Rq , Ac ∈ Rq×d , Ψ c ∈ Rq×q , z(2) ∈ (2) d d×d .4 Conventionally, the joint distribution with the second Rd , ξ (2) s ∈ R , Ωs ∈ R layer latent variables is given by p(z(1) , z(2) , s) = p(z(1) , cz(2) , s)p(z(2) s)p(s), (1.51) The same scheme can be extended to train the following layers of MCFAs. Up to this point, we can roughly calculate the total number of free parameters. In a twolayer DMCFAs model, the total number of free parameters is in the order of pq + cqd which is much smaller than that of cpq + sqd in DMFAs, where the number of dimensions q p and the number of components s is often in the order of c2 . In summary, a typical twolayer DMCFAs model usually uses only 1/c parameters of DMFAs. 1.3.2.1 Inference For the MAP inference, the formulations of the posterior distribution are a similar fashion in Sect. 1.2.3.2 with respect to the second layer parameters5 −1 (2) q(z(2) , sz(1) , c) = N(z(2) ; κ (2) ), cs , Vcs (1.52) where (2)−1 Vcs (2)−1 = Ωs −1 (2)T + Ac T (2)−1 Ψc (2) Ac , −1 (2) (2) (2) κ (2) A(2) Ψ (2) (z(1) − A(2) c c cs = ξ c + Vcs c ξ c ). (1.53) The posterior of the components can be found as follows q(sz(1) , c) ∝ p(z(1) , cs)p(s), ŝ = arg max p(s)q(sz(1) ). s (1.54) (1.55) In MLE, the likelihood of the mixture model corresponding to the cth component derived from the first layer is estimated concerning the new observations z(1) c and parameters θ (2) cs 4 In the second layer, the subcomponents corresponding to a component of the first layer share a common loading and a variance of the independent noise, Ac(2) and Ψ c(2) has the subscript c. 5 The subscript emphasizes the subcomponent s which is specific to a component c of the first layer. 1 Introduction to Deep Density Models with Latent Variables (2) P (z(1) c ; θ cs ) = πs s∈c q 23 T (2) (2) (2) (2) (2) {N(zi A(2) + Ψ (2) c ξ c , Ac Ω s Ac c )}, (1.56) i=1 where s is just allowed to belong to one and only one c, and q denotes the number of dimensions of the new observations. Parameters of the mixture model on the new (1) observations zc can be updated (2) θ̂ cs (2) = arg max log P (z(1) c ; θ cs ). (1.57) θcs (2) 1.3.2.2 Collapse Model Although a DMCFAs model can be collapsed back into a standard shallow MCFA by multiplying the factor loading matrices at each layer, the learning of these two models is entirely different since the lower layer shares the parameters with the components of the upper layers in the deep model. Therefore, DMCFAs are more efficient and straightforward than the shallow form, which attributes to the conditional distribution of the components in the previously hidden layer which is not modeled using the parameters of the following hidden layers. Moreover, the overfitting risk and the computational cost of learning can be significantly reduced by sharing the factor loadings among the layers. After the first layer factors z(1) are integrated out, we obtain a multivariate Gaussian density T (2) (1) (2) (1) p(yz(2) , s) = N(y; A(1) (A(2) + Ψ (1) ). 
c z ), A Ψ c A (1.58) Then a standard MCFAs model is obtained by further integrating out z(2) : p(ys) = z(2) p(yz(2) , s)p(z(2) s)dz(2) = N (y; ms , Σ s ), (2) ms = A(1) (Ac ξ (2) s ), (2) (2) (2)T Σ s =A(1) (Ac Ω s Ac (2) (1.59) T +Ψ c )A(1) +Ψ (1) . (1.60) Finally, the marginal density is then given by a mixture of Gaussians p(y) = S s=1 p(s)p(ys) = S πs N(y; ms , Σ s ). (1.61) s=1 Conventionally, θ cs = {πs , A, Ψ , Ac , Ψ c , ξ s , Ω s }S,C s=1,c=1 represents the parameters of this shallow MCFAs. The posterior distribution of the latent factor zs can also be collapsed to a shallow form q(zs , sy) = N(zs ; κ s , V−1 s ), (1.62) 24 X. Yang et al. −1 −1 + A(1) Ψ (1) A(1) , V−1 s = (Ac Ω s Ac + Ψ c ) (2) (1)T κs = A (2) T (2) ξ s + V−1 s A (1)T Ψ (1)−1 (y − ms ). (1.63) (1.64) 1.3.3 Unsupervised Learning Loglikelihood (Training set) We evaluate the performance of the proposed DMFAs and DMCFAs in comparison with their shallow counterparts on the datasets shown in Sect. 1.2.4. For the density evaluation, the average loglikelihood is conducted to examine the quality of the density estimates produced by the deep density models and the standard density model by collapsing the deep models. When we do empirical analyses, all the results are all the results are logarithmically computed which makes the results of different datasets more conveniently compared. The average loglikelihood value on training data is shown in Fig. 1.16 by multiple trials. The results of the testing data are also obtained without updating parameters, as shown in Fig. 1.17. we can observe that the deep models can improve the true loglikelihood of the standard model dramatically. To assess the modelbased clustering, we compute the error rate (the lower, the better) on 4 real datasets for comparing the performance of deep density models with the standard models with same scale mixtures. In the experiment, all the methods are initialized by random grouping, and the numbers of clusters are set according to the real classes of data. Also, we have done many trials to choose the best result by dropping attributes into different dimensions. The results shown here are the best from each model in the most appropriate dimensions. Figures 1.18 and 1.19 show that the results of the models using the common loading outperform other competitors in both the training set and the testing set. Furthermore, as clearly 100 ULC3 Dataset Coil4proc Leuk72_3k USPS14 101 102 103 104 DMFAs DMCFAs SMFAs SMCFAs Fig. 1.16 Performance on various real data (on Training Set) in terms of the loglikelihood (the larger, the better). DMFAs and DMCFAs are all set in two layers. SMFA and SMCFA denote respectively the shallow form by collapsing the deep models Loglikelihood (Testing set) 1 Introduction to Deep Density Models with Latent Variables 100 ULC3 Dataset Coil4proc Leuk72_3k 25 USPS14 101 102 103 104 DMFAs DMCFAs SMFAs SMCFAs Fig. 1.17 Performance on various real data (on Testing Set) in terms of the loglikelihood (the larger, the better).DMFAs and DMCFAs are all set in two layers. SMFAs and SMCFAs denote respectively the shallow form by collapsing the deep models Error Rate (Training set) 0.35 DMFAs DMCFAs SMFAs SMCFAs 0.3 0.25 0.2 0.15 0.1 0.05 0 ULC3 Leuk72_3k Coil4proc Dataset USPS14 Fig. 1.18 Clustering error rate on training datasets. Both DMFAs and DMCFAs are two layers architecture. 
The shallow forms SMFA and SMCFA have the same scale mixtures of the deep models observed, deep models consistently propose the better performance than their same scale shallow models. 1.4 ExpectationMaximization Algorithm In density models who have the incomplete data or hidden variables, expectationmaximization (EM) algorithms make the parameter estimation possible (Do and Batzoglou 2008). As the general purpose iterative strategy, the EM algorithm alternates between two steps. Given the current model, the expected step (step E) is used to predict the probability distribution over completions of missing data. There 26 X. Yang et al. Error Rate (Testing set) 0.35 DMFAs DMCFAs SMFAs SMCFAs 0.3 0.25 0.2 0.15 0.1 0.05 0 ULC3 Leuk72_3k Coil4proc Dataset USPS14 Fig. 1.19 Clustering error rate on testing datasets. Both DMFAs and DMCFAs are two layers architecture. The shallow forms SMFA and SMCFA have the same scale mixtures of the deep models is usually no need to establish a probability distribution over these completions explicitly, and it is often necessary to calculate “expected” sufficient statistics over completions. The maximization step (Mstep) is used for reestimating the model parameters by utilizing these completions, which can be thought of as “maximization” of the expected loglikelihood of the data (Do and Batzoglou 2008). When observing variable y, latent variable z, and parameter θ are given, the mathematical expression is as follows: E − step : Q(θ θ (k) ) = Eq(zy;θ) [log p(y, zθ (k) )dz]; (1.65) z M − step : θ (k+1) = arg max Q(θθ (k) ). (1.66) θ Then, we introduce the convergence proof of the latent variablebased EM algorithm. Based on the Jensen inequality, it is easy to prove that the EM algorithm repeatedly constructs new lower bounds F (q, θ ) where q denotes the posterior of the latent variable, and then solving the parameters. L(θ) = log p(y, zθ )dz = log z ≥ q(zy; θ ) log z F (q, θ ) = q(zy; θ ) z p(y, zθ ) dz q(zy; θ ) p(y, zθ ) dz = F (q, θ ), q(zy; θ ) q(zy; θ ) log p(y, zθ )dz − z = log p(y, zθ )q(zy;θ) + H (q). (1.67) (1.68) q(zy; θ ) log q(zy; θ )dz (1.69) z (1.70) 1 Introduction to Deep Density Models with Latent Variables 27 Fig. 1.20 Illustration of EM algorithm. The E step is to calculate q by fixing θ (k+1) , and then raise the lower bound from F (q; θ (k+1) ) to F (q; θ (k+2) ), where F (q; θ (k+2) ) is equal to L(θ) at θ (k+1) . M step is to migrated from θ (k+1) to θ (k+2) when q is fixed, and then to find the maximum F (q; θ (k+2) ) In simple terms, the process of EM is to calculate a lower bound function of the current latent variables distribution according to the current parameters and to obtain new parameters by optimizing this function, and then continue the loop (as shown in Fig. 1.20). In the deep model, there is involved an efficient and simple optimization algorithm to perform inference and learning, socalled the greedy layerwise unsupervised algorithm. Due to this layerwise algorithm, the EM algorithm can typically be used to estimate the parameters of each mixture in each layer to find a local maximum of the loglikelihood. Importantly, a simple objective function can be fed to the EM algorithm in a layerwise style to instead of the massive objective function of a shallow model with same scale mixtures. For instance, the expectation loglikelihood of the mixture in the first layer is shown as follows Q(θθ (k) ) = C q(z, cy; θ c ) ln p(y, z, cθ c )dz c=1 z = Eq(z,cy;θ c ) [ln p(yc, z) + ln p(zc) + ln πc ]. 
(1.71) During Mstep, the parameters θ k+1 (at (k + 1)th iteration) are updated by solving c the partial differentiation of the expectation loglikelihood equation over each parameter ∂Eq(z,cy,θ old ) [L(θ c )] = 0. ∂θ c (1.72) The higher layer has an ability to model a better aggregated posterior of the first layer, with variational inference, any increase in the bound will improve the true loglikelihood of the model when the bound is tight. Therefore, training the deep model is better than training a shallow model. 28 X. Yang et al. 1.5 Conclusion In summary, this chapter introduces the novel deep density models with latent variables. We detail the two standard probabilistic models, MFAs and MCFAs, and the deep models with which they are deduced. Compared with previous deep density models, the greedy layerwise algorithm enjoys an easy inference procedure and a significantly smaller number of free parameters. Experimental results are compared between deep models and shallow models in both density evaluation and clustering. Empirical results showed that the deep models could significantly improve the true loglikelihood of the standard model. Moreover, deep models also consistently propose the better performance than shallow models on clustering. For more practical applications, the deep density models also useful for creating a good prior that can be used for tasks such as image denoising and inpainting or tracking animate motion. Acknowledgements The work reported in this paper was partially supported by the following: National Natural Science Foundation of China (NSFC) under grant no.61473236; Natural Science Fund for Colleges and Universities in Jiangsu Province under grant no.17KJ D520010; Suzhou Science and Technology Program under grant no.SY G201712, SZS201613; Jiangsu University Natural Science Research Programme under grant no.17KJ B520041; and Key Program Special Fund in XJTLU (KSF − A − 01). References Arnold L, Ollivier Y (2012) Layerwise learning of deep generative models. CoRR, abs/1212.1524 Baek J, McLachlan GJ (2011) Mixtures of common tfactor analyzers for clustering highdimensional microarray data. Bioinformatics 27(9):1269–1276 Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of highdimensional data. IEEE Trans Pattern Anal Mach Intell 32(7):1298–1309 Bengio Y, Lamblin P, Popovici D, Larochelle H (2007) Greedy layerwise training of deep networks. In: Advances in neural information processing systems, pp 153–160 Bengio Y et al (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127 Bishop CM (1998) Latent variable models. In: Learning in graphical models. Springer, New York, pp 371–403 Bishop C (2006) Pattern recognition and machine learning. Springer, New York Do CB, Batzoglou S (2008) What is the expectation maximization algorithm? Nat Biotech 26(8):897–899 Everitt BS (1984) Factor analysis. Springer Netherlands, Dordrecht, pp 13–31 Everett BS (2013) An introduction to latent variable models. Springer Science & Business Media, Berlin Figueiredo M, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396 Fokoué E (2005) Mixtures of factor analyzers: an extension with covariates. J Multivar Anal 95(2):370–384 Galbraith JI, Moustaki I, Bartholomew DJ, Steele F (2002) The analysis and interpretation of multivariate data for social scientists. 
Chapman & Hall/CRC Press, Boca Raton 1 Introduction to Deep Density Models with Latent Variables 29 Ghahramani Z (2015) Probabilistic machine learning and artificial intelligence. Nature 521(7553):452 Ghahramani Z, Hinton G (1996) The em algorithm for mixtures of factor analyzers. Technical Report CRGTR961, University of Toronto, pp 11–18. http://www.gatsby.ucl.ac.uk/.zoubin/ papers.html Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507 Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554 Mclanchlan GJ, Peel D, Bean RW (2003) Modelling highdimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41:379–388 Johnson B (2013) High resolution urban land cover classification using a competitive multiscale objectbased approach. Remote Sens Lett 4(2):131–140 Johnson B, Xie Z (2013) Classifying a high resolution image of an urban area using superobject information. ISPRS J Photogrammetry Remote Sens 83:40–49 Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166 Loehlin JC (1998) Latent variable models: an introduction to factor, path, and structural analysis. Lawrence Erlbaum Associates Publishers Ma J, Xu L (2005) Asymptotic convergence properties of the em algorithm with respect to the overlap in the mixture. Neurocomputing 68:105–129 McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. John Wiley & Sons, Hoboken McLachlan GJ, Peel D (2000) Mixtures of factor analyzers. In: International Conference on Machine Learning (ICML), pp 599–606 Montanari A, Viroli C (2011) Maximum likelihood estimation of mixtures of factor analyzers. Comput Stat Data Anal 55(9):2712–2723 Nene SA, Nayar SK, Murase H (1996) Columbia object image library (coil20). Technical report, Technical Report CUCS00596 Patel AB, Nguyen T, Baraniuk RG (2015) A probabilistic theory of deep learning. arXiv preprint arXiv:1504.00641 Rippel O, Adams RP (2013) Highdimensional probability estimation with deep density models. CoRR, abs/1302.5125 Salakhutdinov R, Mnih A, Hinton GE (2007) Restricted boltzmann machines for collaborative filtering. In: Machine Learning, Proceedings of the TwentyFourth International Conference (ICML), Corvallis, Oregon, USA, 20–24 June 2007, pp 791–798 Smaragdis P, Raj B, Shashanka M (2006) A probabilistic latent variable model for acoustic modeling. Adv Models Acoust Process NIPS 148:8–1 Tang Y, Salakhutdinov R, Hinton GE (2012) Deep mixtures of factor analysers. In: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26–July 1 2012 Tang Y, Salakhutdinov R, Hinton G (2013) Tensor analyzers. In: International Conference on Machine Learning, pp 163–171 Tortora C, McNicholas PD, Browne RP (2016) A mixture of generalized hyperbolic factor analyzers. Adv Data Anal Classif 10(4):423–440 Vermunt JK, Magidson J (2004) Latent class analysis. The sage encyclopedia of social sciences research methods, pp 549–553 Yang X, Huang K, Goulermas JY, Zhang R (2004) Joint learning of unsupervised dimensionality reduction and gaussian mixture model. Neural Process Lett 45(3):791–806 (2017) Yang X, Huang K, Zhang R (2017) Deep mixtures of factor analyzers with common loadings: a novel deep generative approach to clustering. In: International Conference on Neural Information Processing. 
Springer, pp 709–719 Chapter 2 Deep RNN Architecture: Design and Evaluation Tonghua Su, Li Sun, QiuFeng Wang, and DaHan Wang Abstract Sequential data labelling tasks, such as handwriting recognition, face two critical challenges. One is in great needs of a largescale wellannotated training data. The other is how to effectively encode both the current signal segment and the contextual dependency. Both needs many human efforts. Motivated to relieve such issues, this chapter presents a systematic investigation on architecture design strategies for recurrent neural networks in two different perspectives: the pipeline perspective and micro core perspective. From the pipeline perspective, a deep endtoend recognition architecture is proposed, allowing a fast adaptation to new tasks with minimal human intervention. Firstly, the deep architecture of RNN gave a high nonlinear feature representation. Through hierarchical representations, effective and complex features are expressed in terms of other, simpler ones. Moreover, a hybrid unit is used to encode the long contextual trajectories, which comprise of a BLSTM (bidirectional Long ShortTerm Memory) layer and a FFS (feed forward subsampling) layer. Secondly, the CTC (Connectionist Temporal Classification) objective function makes it possible to train the model without annotating the data in character level. Thirdly, a modified CTC beam search algorithm integrates the linguistic constraints wisely during decoding in a unified formulation. In such an endtoend Part of this chapter is reprinted from: IEEE Proceedings, Li Sun, “GMU: A Novel RNN Neuron and Its Application to Handwriting Recognition” 2017 with permission from IEEE IEEE Proceedings, Li Sun, “Deep LSTM Networks for Online Chinese Handwriting Recognition”, 2016, with permission from IEEE T. Su () · L. Sun School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China email: thsu@hit.edu.cn Q.F. Wang Xi’an JiaotongLiverpool University, Suzhou, China D.H. Wang School of Computer and Information Engineering, Xiamen University of Technology, Xiamen, China © Springer Nature Switzerland AG 2019 K. Huang et al. (eds.), Deep Learning: Fundamentals, Theory and Applications, Cognitive Computation Trends 2, https://doi.org/10.1007/9783030060732_2 31 32 T. Su et al. style, a thin and compact network with high accuracy can be designed and ready to industryscale applications. From the micro core perspective, a new neuron structure named gated memory unit (GMU) is presented, inspired by both LSTM and GRU. GMU preserves the constant error carousels (CEC) which is devoted to enhance a smooth information flow. GMU also lends both the cell structure of LSTM and the interpolation gates of GRU. Our methods are evaluated on CASIAOLHWDB 2.x, a publicly available Chinese handwriting database. Comparing with stateoftheart methods, it shows that: (1) our pipeline demonstrates over 30% relative error reductions on test set in terms of both correct rate and accurate rate; (2) better results can be achieved by our pipeline at certain feasible assumption on competition set; (3) GMU is of potential choice in handwriting recognition tasks. Keywords Deep learning architecture · Neuron design · Recurrent neural network · Handwriting recognition · Largecategory labelling 2.1 Introduction Online Chinese handwriting recognition is a typical sequential datalabelling task. During writing, the running pen (or more generally writing instrument) tip is recorded by sensing devices. 
Each textline can be viewed as a temporal sequence of tensor, also called trajectory. Online handwriting recognition plays a significant role for entering Chinese text. Compared with other input methods, it is more natural to human beings, especially to those Chinese people who are not good at Pinyin. The pioneer works on online Chinese handwriting recognition were independently studied in 1966 by two groups from MIT and University of Pittsburgh (Liu 1966; Zobrak 1966). Since then, extensive concerns have been received, owing to the popularity of personal computers and smart devices. Successful applications have been found in penbased text input (Wang et al. 2012a), overlapped input method(Zou et al. 2011; Lv et al. 2013), and notes digitalization and retrieval (Zhang et al. 2014). In recent years, combination of recurrent neural network (RNN) and Connectionist Temporal Classification (CTC) becomes a powerful sequential labeling tool (Graves et al. 2009a). Messina et al. utilize this tool to transcribe offline Chinese handwriting (Messina and Louradour 2015) and its potentials are proved. More recently, convolutional LSTM is applied to online Chinese handwriting recognition (Xie et al. 2016) with pleasing results. The recurrent neurons are the building blocks of RNN. Their own capabilities and combinations thereof can form different network architectures, resulting in different modeling capabilities. Although RNN can approximate any sequence to sequence mapping in theory, decades of studies suggest that vanilla recurrent neural networks are difficult to capture the longterm dependencies (Bengio et al. 1994). This chapter aims to present RNN design solutions from both pipeline perspective and micro core neuron perspective. In the pipeline level, a novel endtoend recognition method 2 Deep RNN Architecture: Design and Evaluation 33 is proposed based on deep recurrent neural networks. Instead of rendering the trajectories as offline mode (Xie et al. 2016; Liu et al. 2015), we explore the possible capacities of raw pen trajectory using a deep architecture. The architecture mixes bidirectional LSTM layers and feed forward subsampling layers, which are used to encode the long contextual history trajectories. We also follow the CTC objective function, making it possible to train the model without alignment information between input trajectories and output strings. As suggested in (Liu et al. 2015), we utilize the explicit linguistic constraints to boost our endtoend systems in online mode. To integrate the linguistic constraints wisely, a modified CTC beam search algorithm is devised for decoding. In the micro core level, we propose a new kind of recurrent neuron called Gated Memory Unit (GMU) based on the principle of highway channel. The GMU neuron combines two components. One is used to keep history context and the other to generate the neuron activation. We retain the memory cell structure of LSTM to constitute a transmission channel, and lend the interpolation gate of GRU to regulate the flow of information. At both levels, the proposed method is evaluated on a publicly available database. 2.2 Related Works 2.2.1 SegmentationFree Handwriting Recognition Most approaches for online Chinese handwriting recognition fall into segmentationrecognition integrated strategy (Cheriet et al. 2007). Firstly, the textline is oversegmented into primitive segments. Next, consecutive segments are merged and assigned a list of candidate classes by a character classifier (Liu et al. 
2013), which forms segmentationrecognition lattice. Last, path searching is executed to identify an optimal path by wisely integrating character classification scores, geometric context and linguistic context (Wang et al. 2012b). Great success of such strategy has been achieved during the past few years. If there are big wellannotated training data, segmentationrecognition integrated strategy may be the first choice. Many works are conducted to optimize any of the key factors, such as character recognition (Zhou et al. 2016), confidence transformation (Wang and Liu 2013), parameter learning methods (Zhou et al. 2013, 2014). In (Yin et al. 2013), such three modules are termed by Vision Objects Ltd. as segmentation, recognition, and interpretation experts. They are separately trained and their weighting to the task is scored through a global discriminant training process. The complex training phases involves intense expert intervention. However, construction of such a system is nontrivial. As a precondition, a largescale training data needs to be annotated in character level. The widespread Chinese handwriting databases, such as HITMW (Su et al. 2007) and CASIAOLHWDB (Liu et al. 2011), had been taken years to be annotated by hand. Moreover, features for classification are derived through experts’ handcraft. In addition, different 34 T. Su et al. components of the system are developed separated, probably requiring different data for tuning. Definitely, it needs great human efforts. As an alternation, segmentationfree strategy is promising to solve above issues. There is no need to explicitly segment textlines into characters. Thus labeling each character boundaries in training data is unnecessary. Previously, hidden markov models (HMMs) have been successfully used to recognize Chinese handwriting (Su et al. 2009) in a segmentationfree way. Nowadays, RNN especially LSTM became powerful tool to sequential labelling tasks. Messina et al. utilize this tool to transcribe offline Chinese handwriting (Messina and Louradour 2015). The preliminary results are comparable to those ever reported. More recently, convolutional LSTM is applied to online Chinese handwriting recognition (Xie et al. 2016) with pleasing results. The system includes four kinds of layers. The input trajectory is first converted into textline images using a path signature layer. Then convolutional layers are concatenated to learn features. After that, the features are fed to multiple layers of bidirectional LSTM. And finally, the CTC beam search is used to transform the output of network to labels in transcription layer. Instead of rendering the trajectories as offline mode (Xie et al. 2016; Liu et al. 2015), we explore the possible capacities of original pen trajectory using a deep architecture. As suggested in (Liu et al. 2015), we utilize the explicit linguistic constraints to boost our endtoend systems in online mode. 2.2.2 Variants of RNN Neuron Since the early 1990s, pursuit of sophisticated recurrent neurons has been a significant topic. Long Shortterm Memory (LSTM) and Gated Recurrent Unit (GRU) are two kinds of commonly used sophisticated recurrent neurons. Currently these RNN neurons are widely used in many research fields, including natural language processing (Cho et al. 2014a), speech recognition (Amodei et al. 2016; Graves et al. 2013), image recognition (Russakovsky et al. 2015). The stateoftheart character recognition methods (Graves et al. 
2009b; Graves and Schmidhuber 2009; Graves 2012a) increasingly use recurrent neural networks and have made considerable progress. Briefly speaking, there are two reasons for their success. Firstly, they introduce special gates that can retain longer contextual dependencies, such as the forget gate (in LSTM) or the update gate (in GRU). Secondly, they build a highway channel that provides shortcuts for the smooth forward propagation of history information and the backward propagation of error signals.

LSTM is one of the earliest sophisticated recurrent neurons. The original LSTM structure (Hochreiter and Schmidhuber 1997) comprises one or more self-connected memory cells, an input gate and an output gate. The structure establishes constant error carousels (CEC), which facilitate both the storage of history information, so that long-term dependencies can be encoded, and the back-propagation of gradients, so that the gradient-decay problem is prevented. The input gate controls the ratio of information allowed to enter the memory cell, and the output gate controls the portion of the memory cell content that is exposed as the neuron activation. To further enhance the ability of LSTM, a forget gate was added (Gers et al. 2000), which resets the cell when the previous history should be discarded. A more recent enhancement of LSTM is peephole connections (Gers et al. 2002), which help to improve the precise timing and counting ability of the internal state.

Fig. 2.1 LSTM unfolded in the time steps

LSTM neurons can be unfolded in time steps, as shown in Fig. 2.1. Here the cell is deliberately depicted as a channel flowing from left to right. During forward propagation, the contents of the memory cell are stored in this channel; during back-propagation, the gradient of the objective function flows back along the same channel. The memory content is updated using Eq. 2.1:

$c_t^k = f_t^k\, c_{t-1}^k + \tilde{c}_t^k$   (2.1)

where the forget gate's output value $f_t^k$ determines how much of the history content ($c_{t-1}^k$) is preserved. If $f_t^k$ is close to 0, almost all history information is discarded; otherwise the history information is partially retained. We can even control the input gate to maximize the usage of history information. The contents of the memory cell are not fully exposed to the hidden state; instead, the exposure is controlled by the value of the output gate $o_t^k$.

Fig. 2.2 GRU unfolded in the time steps

Another type of sophisticated recurrent neuron is the Gated Recurrent Unit (GRU) (Cho et al. 2014a, b), unfolded in time steps in Fig. 2.2. GRU uses a simpler memory block than LSTM. Each GRU block consists of an input transformation, a reset gate and an update gate. The reset gate controls short-term contextual dependencies using a self-connected structure, and the update gate controls long-term contextual dependencies using an interpolation between the history information and the current activation. To adaptively capture dependencies of different lengths, one only needs to tune these two gates. Moreover, since the unit outputs (hidden states) are self-connected, they become the channel for historical information. Similar to the LSTM cell, this channel ensures smooth gradient back-propagation. GRU nevertheless differs from LSTM: the new unit output is updated using Eq. 2.2:

$h_t^k = (1 - z_t^k)\, h_{t-1}^k + z_t^k\, \tilde{h}_t^k$   (2.2)

If $z_t^k$ is close to 1, the hidden state is forced to ignore the previous hidden states; otherwise, the previous information is retained through the interpolation coefficient. To achieve maximum usage of history information, one can even push $z_t^k$ close to zero. Different from LSTM, the contents of the memory are exposed fully, without any reservation.
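To make the gating arithmetic of Eqs. 2.1 and 2.2 concrete, the following sketch implements a single LSTM step and a single GRU step for one layer. It follows the common formulations of Hochreiter and Schmidhuber (1997, with a forget gate) and Cho et al. (2014a); the weight names, the dictionary of parameters and the omission of peephole connections are illustrative assumptions, not the exact parameterization used in this chapter.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p holds input weights W*, recurrent weights U* and biases b*.
    Peephole connections are omitted for brevity."""
    i = sigmoid(p["Wi"] @ x_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ x_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ x_t + p["Ug"] @ h_prev + p["bg"])   # input transformation
    c = f * c_prev + i * g      # Eq. 2.1: gated history plus gated new content (i*g plays the role of c~_t)
    h = o * np.tanh(c)          # exposure of the cell is controlled by the output gate
    return h, c

def gru_step(x_t, h_prev, p):
    """One GRU time step with reset gate r and update gate z."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])              # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])              # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate activation
    return (1.0 - z) * h_prev + z * h_tilde                              # Eq. 2.2: interpolation
```

The two routines make the structural contrast visible: LSTM keeps a separate cell state and scales its exposure with an output gate, whereas GRU interpolates directly between the previous hidden state and the candidate activation.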
2.3 Datasets

We use an online Chinese handwriting database, CASIA-OLHWDB (Liu et al. 2011), to train and evaluate our models. This database comprises trajectories of both isolated characters and lines of text. The part containing only isolated characters has three subsets, named OLHWDB 1.0∼1.2, and the textline part also has three subsets: OLHWDB 2.0∼2.2. OLHWDB 2.0∼2.2 is divided into a train set and a test set by the authors of (Liu et al. 2011). The train set includes 4072 pages from 815 writers, while the test set includes 1020 pages from 204 writers. There is also a held-out subset used for the ICDAR 2013 Chinese handwriting recognition competition (denoted as the comp set) (Yin et al. 2013). Characteristics of these datasets are summarized in Table 2.1. Our systems are evaluated on the test set or the comp set. Note that there are 5 characters in the test set that are not covered by the train set, while around 2.02% of the characters in the comp set are not covered by the train set.

Certain preprocessing is performed before the experiments. Firstly, a character set is determined to cover all samples occurring in CASIA-OLHWDB 2.x; in total there are 2765 unique character terms (2764 character classes plus one {blank}). Secondly, the manually annotated points in the samples are removed: in order to retain the true handwriting behavior, the cutting points that were added to segment ligatures are discarded. Thirdly, a compact feature representation is derived. The input trajectory is represented in raw form: at each timestep it consists of the first-order differences of the x and y coordinates, along with a state indicating whether the pen is lifted. For the start of the trajectory, the first two dimensions are set to zero.

We construct n-gram models to express the linguistic constraints. A large amount of textual data is collected from electronic news of People's Daily (in HTML format). We simply filter out the HTML markup; no human effort intervenes, although the data contain some noisy lines. The texts range from 1999 to 2005 and include 193,842,123 characters. We estimate our n-gram language models using the SRILM toolkit (Stolcke 2002). All characters outside the 7356 classes are removed. Bigram and trigram models are considered, respectively, in base-10 logarithm scale. Both are pruned with a threshold of 7e-7.

Table 2.1 Characteristics of the CASIA-OLHWDB 2.0∼2.2 subsets

Items      train      test     comp
#pages     4072       1020     300
#lines     41,710     10,510   3432
#chars     1,082,220  269,674  91,576
#classes   2650       2631     1375
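As a concrete illustration of the raw trajectory representation described above, the sketch below converts a recorded textline, taken here as a sequence of (x, y, pen-down) points, into the three-dimensional sequence of first-order coordinate differences plus a pen state. The 0/1 pen-state convention and the function name are assumptions for illustration; the chapter does not spell out the exact encoding.

```python
import numpy as np

def trajectory_to_features(points):
    """Convert a pen trajectory into the raw 3-D representation of Sect. 2.3.

    points: list of (x, y, pen_down) tuples, pen_down being 1 while writing and
            0 while the pen is lifted (an assumed encoding).
    Returns an array of shape (T, 3): [dx, dy, pen_state] per timestep, with the
    first two dimensions set to zero at the start of the trajectory.
    """
    feats = np.zeros((len(points), 3), dtype=np.float32)
    prev_x, prev_y = points[0][0], points[0][1]
    for t, (x, y, pen_down) in enumerate(points):
        if t > 0:                    # first-order differences; zeros at t = 0
            feats[t, 0] = x - prev_x
            feats[t, 1] = y - prev_y
        feats[t, 2] = pen_down       # state indicating whether the pen touches the surface
        prev_x, prev_y = x, y
    return feats

# Example: a tiny two-stroke trajectory
traj = [(10, 20, 1), (12, 21, 1), (15, 23, 0), (15, 30, 1)]
print(trajectory_to_features(traj))
```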
2.4 Proposed Deep Neural Network

The proposed system aims to map a fed pen trajectory X (= x1 x2 . . . xT) to a textual string l in an end-to-end paradigm, and to learn an optimal map. This process is illustrated in Fig. 2.3. Here we let an LSTM model encode the map. Suppose such a map has been learned; we can then feed it an unseen trajectory textline. The LSTM network is propagated from the input layer to the output layer through the hidden layers, and, as the final step, CTC beam search is performed to output the most probable strings. The performance on each trajectory is measured based on the alignment between l and the ground truth z.

How to derive an optimal map is left to the training process of Fig. 2.3. We generate the needed LSTM models by minimizing the empirical loss function:

$L(S) = \sum_{(X,z)\in S} L(X, z)$,   (2.3)

with S denoting the training set. Because the problem is a multi-class classification, we use $L(X, z) = -\ln P(z|X)$ as the example loss function, which can be computed effectively by CTC and the forward-backward algorithm. Training data from the CASIA-OLHWDB database (Liu et al. 2011) is used to derive the classification model. We also estimate n-gram language models from a textual corpus in advance, which are used to facilitate the search process during decoding.

Fig. 2.3 Flowchart of the system (training path: training trajectory → arrange the trajectory → train LSTM; prediction path: test trajectory → arrange the trajectory → forward propagation → CTC beam search with an n-gram → string)

2.4.1 Architecture

We develop a deep recurrent neural network (RNN) to derive the mapping function, specialized for processing a sequence of vectors. Fig. 2.4 shows a typical instance of the unfolded recurrent neural network. Besides the input and output layers, there are hidden layers and subsampling layers. The history information is encoded through the recurrent hidden layers. The hidden variable vector at time step t in the nth hidden layer, $h_n^t$, is iteratively computed from both the (n−1)th hidden variable vector $h_{n-1}^t$ (with $h_0^t = x_t$) and the previous hidden variable vector $h_n^{t-1}$:

$h_n^t = \tanh\!\left(W_{I_n} h_{n-1}^t + U_{H_n} h_n^{t-1}\right)$,   (2.4)

where $W_{I_n}$ is the weight matrix between the (n−1)th and the nth hidden layer and $U_{H_n}$ is the self-connected recurrent weight matrix. Since $h_n^t$ is defined in a recursive manner, a dependency on the input vectors ranging from the first to the tth time step is possible.

To further benefit from the geometric context both from left to right and vice versa, we also consider a bidirectional recurrent neural network (BRNN) (Schuster and Paliwal 1997), where each recurrent layer is replaced with a forward layer and a backward layer. The forward pass is the same as usual, while the backward pass processes data from t = T to 1. The recurrent formulation results in the sharing of parameters through all time steps (Graves 2012b).

Fig. 2.4 The architecture of the unfolded BRNN (with 5 hidden layers)

Subsampling layers are worth exploring considering the large length of trajectories and the variability of writing styles. Herein we devise feed-forward-like neuron units which allow inputs from a window of consecutive timesteps to be collapsed; this is a simplified version of (Graves 2012b). The kth hidden variable at time step w × t in the nth subsampling layer is computed as:

$h_{n(k)}^{w\times t} = \tanh\!\left(\sum_{i=1}^{w} w_{n(k)}^{i} \cdot h_{n-1}^{w\times t+i}\right)$,   (2.5)

where w is the window size and $w_{n(k)}^{i}$ is the weight vector from the (n−1)th layer to the kth unit in the nth layer. The network output vector at time t is computed as:

$y^t = \mathrm{softmax}(a^t) = \mathrm{softmax}(W_O\, h_L^t)$,   (2.6)

where $W_O$ is the weight matrix from the last hidden layer to the output layer. Note that one extra output node is reserved for {null}, which will be used in the CTC algorithm.
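The following NumPy sketch spells out Eqs. 2.4, 2.5 and 2.6 for a single unidirectional stack: a tanh recurrent layer, a window-collapsing subsampling layer and the softmax output layer. It is a minimal illustration of the layer arithmetic only; the actual system replaces the tanh recurrence with LSTM blocks and uses bidirectional layers, and the weight shapes and names here are assumptions.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def recurrent_layer(X, W_I, U_H):
    """Eq. 2.4: h_n^t = tanh(W_I h_{n-1}^t + U_H h_n^{t-1}); X is (T, in_dim)."""
    T, H = X.shape[0], U_H.shape[0]
    Hs = np.zeros((T, H))
    h_prev = np.zeros(H)
    for t in range(T):
        h_prev = np.tanh(W_I @ X[t] + U_H @ h_prev)
        Hs[t] = h_prev
    return Hs

def subsampling_layer(X, W, w):
    """Eq. 2.5: collapse every window of w consecutive timesteps.
    W has shape (out_dim, w, in_dim): one weight vector per unit and window position."""
    T = X.shape[0] // w
    out = np.zeros((T, W.shape[0]))
    for t in range(T):
        window = X[t * w:(t + 1) * w]                    # w consecutive input vectors
        out[t] = np.tanh(np.einsum("kwi,wi->k", W, window))
    return out

def output_layer(X, W_O):
    """Eq. 2.6: per-timestep softmax over the character classes plus {null}."""
    return np.array([softmax(W_O @ x) for x in X])

# Toy forward pass: 8 timesteps of 3-D features -> 5 output classes (incl. {null})
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 3))
h1 = recurrent_layer(feats, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)))
h2 = subsampling_layer(h1, rng.normal(size=(6, 2, 4)), w=2)   # halves the sequence length
probs = output_layer(h2, rng.normal(size=(5, 6)))
print(probs.shape)   # (4, 5)
```

The subsampling layer is the part that keeps the network "thin": it shortens the sequence before the more expensive recurrent layers above it.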
Standard RNNs suffer from the vanishing gradient problem (Goodfellow et al. 2016). In our network, LSTM memory blocks (Hochreiter and Schmidhuber 1997) are therefore used to replace the plain RNN nodes in the hidden layers. Each memory block includes three gates, one cell, and three peephole connections. The forget gate's output value determines how much of the history content is preserved: if it is close to 0, almost all history information is discarded; otherwise the history information is partially retained. We can even control the input gate to maximize the usage of history information. The contents of the memory cell are not fully exposed to the hidden state; instead, the exposure is controlled by the value of the output gate. All gates are nonlinear summation units. In general, the input activation function is the tanh function, and the gate activation function is the sigmoid function, which squeezes the gate values between 0 and 1.

2.4.2 Learning

Connectionist Temporal Classification (CTC) is a powerful objective function for sequential data labelling. It allows RNNs to be trained without requiring any prior alignment between input and target sequences (Graves and Jaitly 2014). Assuming all labels are drawn from an alphabet A, define the extended alphabet A' = A ∪ {null}. If the output is null, no label is emitted at that timestep. A CTC mapping function is defined to remove repeated labels and then delete the nulls from each output sequence. Provided that the null label is injected into the label sequence z, we get a new sequence z' of length 2|z| + 1.

CTC uses a forward-backward algorithm (Graves 2012b) to sum over all possible alignments and determine the conditional probability Pr(z|X) of the target sequence given the input sequence. For a labelling z, the forward variable α(t, u) is defined as the summed probability of all length-t paths that are mapped by the CTC mapping function onto the length-⌊u/2⌋ prefix of z. Conversely, the backward variable β(t, u) is defined as the summed probability of all paths starting at t + 1 that complete z when appended to any path contributing to α(t, u). Thus the conditional probability Pr(z|X) can be expressed as:

$\Pr(z|X) = \sum_{u=1}^{|z'|} \alpha(t, u)\,\beta(t, u)$,   (2.7)

Substituting Eq. 2.7 into Eq. 2.3, we obtain the empirical loss function. Defining the sensitivity signal at the output layer as $\delta_k^t \triangleq \partial L / \partial a_k^t$, we get:

$\delta_k^t = y_k^t - \dfrac{1}{\Pr(z|X)} \sum_{u \in B(z,k)} \alpha(t, u)\,\beta(t, u)$,   (2.8)

where $B(z, k) = \{u : z'_u = k\}$. Based on Eq. 2.8, the sensitivity signals at the other layers can easily be back-propagated through the network.

2.4.3 Decoding

Beam search is employed to integrate the linguistic constraints, and modifications are made to integrate them wisely. To reduce the bias of noisy estimation, the extension probability Pr(k, y, t) is rescaled, while the pseudocode in Algorithm 1 follows the framework proposed in (Graves and Jaitly 2014). Define Pr−(y, t), Pr+(y, t) and Pr(y, t) as the null, non-null and total probabilities assigned to a partial output transcription y at time t, respectively. The probability Pr(k, y, t) of extending y by label k at time t can be expressed as:

$\Pr(k, y, t) = \Pr(k, t|X)\,\Pr(k|y)^{\gamma} \times \begin{cases} \Pr^{-}(y, t-1), & y_e = k \\ \Pr(y, t-1), & \text{otherwise} \end{cases}$

where Pr(k, t|X) is the CTC emission probability of k at t, as given by the network outputs in Eq. 2.6, Pr(k|y) is the linguistic transition from y to y + k, y_e is the final label in y, and γ is the language weight.

Algorithm 1: Modified CTC Beam Search
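As a small illustration of the rescaled extension probability, the snippet below computes Pr(k, y, t) from the network emission, the n-gram transition and the running prefix probabilities. Working in log space, treating γ as an exponent on the language-model probability, and the helper names are implementation assumptions; the full beam-search bookkeeping of Algorithm 1 (pruning, merging of prefixes, maintaining Pr− and Pr+) is not reproduced here.

```python
import math

def extension_logprob(log_emit_k_t, log_lm_k_given_y, log_pr_minus_y, log_pr_total_y,
                      k, y, gamma=1.0):
    """Rescaled extension probability of prefix y by label k at time t (log space).

    log_emit_k_t      : log Pr(k, t | X), the CTC emission of label k at time t
    log_lm_k_given_y  : log Pr(k | y), the character n-gram transition
    log_pr_minus_y    : log Pr^-(y, t-1), probability of y ending with a null
    log_pr_total_y    : log Pr(y, t-1), total probability of y
    gamma             : language weight, applied as an exponent on Pr(k | y)
    """
    # If k repeats the final label of y, only paths ending in null may extend y with k.
    prefix_term = log_pr_minus_y if (len(y) > 0 and y[-1] == k) else log_pr_total_y
    return log_emit_k_t + gamma * log_lm_k_given_y + prefix_term

# Toy usage with made-up numbers
lp = extension_logprob(log_emit_k_t=math.log(0.3),
                       log_lm_k_given_y=math.log(0.05),
                       log_pr_minus_y=math.log(0.4),
                       log_pr_total_y=math.log(0.7),
                       k="国", y="中", gamma=1.0)
print(lp)
```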
2.4.4 Experimental Setup

We set 2765 units for the output layer. One is reserved as the null indicator; the others cover all characters occurring in the train set, test set and comp set of OLHWDB 2.0∼2.2. We also select isolated samples from OLHWDB 1.x, with no more than 700 samples per class, hoping to alleviate the out-of-vocabulary problem of the comp set. Three different networks are investigated, as provided in Table 2.2. The type column reflects the number of hidden layers. The setting column describes the size of each layer; the suffix has a specific meaning: I—input layer, S—subsampling layer, L—LSTM layer and O—output layer. The second column summarizes the overall number of network parameters (in millions). The three networks are deliberately tuned to be comparable in the number of parameters. Our LSTM layers are very thin compared to (Xie et al. 2016), where 1024 nodes are assigned.

Table 2.2 Three different network setups

Type      #para    Setting
3-layer   1.63 M   3I-64L-128S-192L-2765O
5-layer   1.50 M   3I-64L-80S-96L-128S-160L-2765O
6-layer   1.84 M   3I-48L-64S-96L-128L-144S-160L-2765O

The system is developed from scratch on our own. All probabilities except the language model are expressed in natural logarithm scale. The learning rate is driven by the AdaDelta algorithm (Zeiler 2012) with a momentum of 0.95. During CTC beam search decoding, the emission probabilities are pruned with the reciprocal of the number of output units as the threshold, and the beam size is fixed at 64 in our evaluation. Both a CPU version and a GPU version are implemented. Using a mini-batch size of 32 textlines, a speedup of more than 200X is achieved by our GPU version on a Tesla K40.

To overcome data sparsity, the training process is divided into four stages. In the first stage, isolated samples are used to train the networks with a mini-batch size of 128 characters; this is cycled for 5 epochs. The next stage is run on all training data from OLHWDB 2.x with a mini-batch size of 32 textlines and is repeated for 10 epochs. The third stage is executed on the same data as the first stage for 10 epochs. At the final stage, 95% of the training data from OLHWDB 2.x is used to fine-tune the networks for no more than 20 epochs; the rest of the training data is reserved for validating the best model.

The output of a recognizer is compared with the reference transcription, and two metrics, the correct rate (CR) and accurate rate (AR), are calculated to evaluate the results. Supposing the numbers of substitution errors ($S_e$), deletion errors ($D_e$), and insertion errors ($I_e$) are known, CR and AR are defined respectively as:

$\mathrm{CR} = \dfrac{N_t - S_e - D_e}{N_t}, \qquad \mathrm{AR} = \dfrac{N_t - S_e - D_e - I_e}{N_t}$,   (2.9)

where $N_t$ is the total number of characters in the reference transcription.
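A common way to obtain $S_e$, $D_e$ and $I_e$ is an edit-distance alignment between the recognized string and the reference. The sketch below (a standard implementation, not necessarily the authors' exact tool) counts the three error types by dynamic programming and then applies Eq. 2.9.

```python
def error_counts(ref, hyp):
    """Count substitutions, deletions and insertions by Levenshtein alignment.
    ref, hyp: sequences of characters (reference transcription and recognizer output)."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total errors, subs, dels, ins) aligning ref[:i] with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)                  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)                  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag, up, left = cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            cands = [
                (diag[0] + match, diag[1] + match, diag[2], diag[3]),  # match/substitution
                (up[0] + 1, up[1], up[2] + 1, up[3]),                  # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),          # insertion
            ]
            cost[i][j] = min(cands)
    _, se, de, ie = cost[n][m]
    return se, de, ie

def cr_ar(ref, hyp):
    """Eq. 2.9: CR = (Nt - Se - De)/Nt, AR = (Nt - Se - De - Ie)/Nt."""
    se, de, ie = error_counts(ref, hyp)
    nt = len(ref)
    return (nt - se - de) / nt, (nt - se - de - ie) / nt

print(cr_ar("今天天气很好", "今天天汽好了"))  # toy example
```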
2.4.5 Results

The fourth training stage is illustrated in Fig. 2.5, where the y axis shows the AR on the validation set. The best model in this phase is selected and then used to evaluate on the test set or the comp set.

Fig. 2.5 The fourth training stage evaluated on the validation set (AR (%) against training epochs for the 3-layer, 5-layer and 6-layer networks)

The rescaling intensity of the language models is determined based on their performance on the validation set. In Fig. 2.6, the role of the character-level trigram is given. We can see that performance is stable when the language weight ranges from 1.0 to 1.8, while the standard decoding algorithm uses a value around 2.3. A similar observation can be made for the bigram. In the following experiments, we simply fix the weight at 1.0.

Fig. 2.6 Investigation of the language model (trigram) weight on the validation set (AR (%) against LM weight for the three networks)

We investigate the role of different language models as well as network depth, both on the test set and the comp set, as shown in Table 2.3. Though the three networks have almost the same number of parameters, the deeper networks generally work better on both sets. On the other hand, we can see a steady increase in performance when stronger linguistic constraints are imposed. Comparing the two sets, the improvement from using language models is more remarkable on the comp set.

Table 2.3 Role of language models on both test set and comp set (%)

            No LM            bigram           trigram
Type        AR      CR       AR      CR       AR      CR
Test set
 3-layer    94.81   95.52    96.13   96.84    96.83   97.45
 5-layer    95.08   95.63    96.34   96.92    97.04   97.58
 6-layer    95.30   95.82    96.45   97.00    97.05   97.55
comp set
 3-layer    88.06   89.40    91.42   92.87    93.00   94.28
 5-layer    88.91   90.00    92.05   93.20    93.37   94.44
 6-layer    89.12   90.18    92.12   93.25    93.40   94.43

Finally, we compare our results with previous works, as shown in Table 2.4. The systems in the first four rows use the segmentation-recognition integrated strategy. The fifth row employs a hybrid CNN and LSTM architecture in which a specific feature extraction layer is presented. The best results on the test set are achieved by our system. Compared with (Xie et al. 2016), 1.15% and 1.71% absolute error reductions are observed in CR and AR, respectively. Compared with the best results of the segmentation-recognition integrated strategy (Zhou et al. 2014), around 2.3% absolute error reductions are made in both CR and AR. Our system achieves a slightly lower result than the segmentation-recognition integrated strategy on the comp set. It is severely challenging for the segmentation-free strategy to perform on the comp set, since there is a noticeable out-of-vocabulary problem. If we remove all out-of-vocabulary characters from the comp set, our system achieves 94.65% and 95.65% in AR and CR, respectively, even outperforming the best system of Vision Objects (Yin et al. 2013).

Table 2.4 Comparisons with previous results (%)

                                   Test set        comp set
Methods                            AR      CR      AR       CR
Wang2012 (Wang et al. 2012a)       91.97   92.76   –        –
Zhou2013 (Zhou et al. 2013)        93.75   94.34   94.06    94.62
Zhou2014 (Zhou et al. 2014)        94.69   95.32   94.22    94.76
VO-3 (Yin et al. 2013)             –       –       94.49    95.03
Xie2016 (Xie et al. 2016)          95.34   96.40   92.88*   95.00*
Our method                         97.05   97.55   93.40 (94.65^a)   94.43 (95.65^a)

^a After removing all characters in the comp set that are not covered by the train set

2.4.6 Error Analysis

We inspect the recognition result of each textline. We arrange the ARs of all textlines into 11 intervals and count the number of textlines that fall into each interval, as shown in Fig. 2.7. The width of each interval is 5% when the AR is greater than 50%. There are 113 textlines in the comp set whose ARs are lower than 75%, and we inspect all of them to gain some insights. Generally, errors are caused by three challenging issues, as shown in Fig. 2.8. First of all, there are cursively written samples, where two or more characters may even be joined as a ligature. Moreover, severely skewed or slanted textlines are not easy to deal with, since there are limited training samples covering such phenomena. In addition, there are inadequate samples for English letters/words. It may be helpful to further reduce or alleviate these issues.

Fig. 2.7 Illustration of the ARs using an interval histogram (frequency of textlines per AR range)

Fig. 2.8 Typical errors and reasons in comp set (errors are marked with [.]). (a) cursive or skew textline.
(b) English words 2 Deep RNN Architecture: Design and Evaluation 47 2.5 Proposed RNN Neuron 2.5.1 Architecture A new recurrent neuron named Gated Memory Unit (GMU) is proposed, which combines the memory cell of LSTM and the interpolation gate of GRU. GMU consists of a memory cell, two gates (memorial gate and operating gate). Memory cell plays a role of CEC, which is used for transmission of the historical information and error messages. Memorial gate is mainly responsible for updating the contents of the memory cell, which controls how much of the history should be retained. Operating gate is primarily responsible for hidden state generation. It mixes the history and current input transformation using learned weights. Those two gates can be run independently, which makes it easy to parallel. In Fig. 2.9, GMU is unfolded in the time steps. During its forward propagating, the history information is saved into cell. Since the cell is selfconnected, long dependencies are possibly encoded. Cell also serves as a CEC channel in the back propagation process, avoiding gradient vanishing problems. Compared with standard RNN neuron, GMU uses additive component to achieve the evolution from time t to time t + 1. Standard RNN state always come from the former state and current information. Unlikely, GMU can keep the previous contents and recursively integrates fresh information. xt x t+1 i kt i kt Gm m kt c t1 Gp 1m kt ct p kt 1m kt+1 m kt+1 Gm Gp 1p kt+1 1p kt hkt hkt+1 ht t Fig. 2.9 GMU is unfolded in the time steps i kt+1 i kt+1 t+1 p kt+1 c t+1 48 T. Su et al. Similar to LSTM, GMU also uses Memory Cell as a highway channel. The hidden state is derived from the contents of memory cell with proper scaling operation. Their difference is that the range of information in GMU cell is ensured to be stable since an interpolation is used. Similar to GRU, GMU use interpolation structure to merge the history information and current information. Both of them have a simple arithmetic component. Different to GRU, GMU uses memory cell to store history information, rather than the activation value itself, better guarantee the independence of memory cell and hidden state. 2.5.2 Forward Propagation Memorial gate in GMU uses a logistic sigmoid transformation: atk = I m,x l wl,k xt + H j h,m k wj,k ht−1 + wkc,m ∗ ct−1 j =1 l=1 mkt = σ atk Similarly, operating gate also employs a sigmoid: btk = I x,p wl,k xtl + H h,p j c,p wj,k ht−1 + wk j =1 l=1 ptk = σ btk Input transformation can be computed as: ∼k it = I l=1 x,i l wi,k xt + H j h,i wj,k ht−1 j =1 k ∼ itk = h i t The hidden state or activation is given by interpolation: k + mkt itk hkt = 1 − mkt ct−1 k ∗ ct−1 2 Deep RNN Architecture: Design and Evaluation 49 Memory cell is updated using: k ctk = 1 − ptk itk + ptk ct−1 2.5.3 Backward Propagation As the critical step is to derive the gradients, we list them as follows. Firstly, define the derivative to hidden state as h,k t def = ∂L ∂hkt Also define the derivative memory cell as c,k t def = ∂L ∂ctk The gradient of memory cell can be derived: c,k t = h,k t 1 − mkt + c,k k t+1 pt p,k c,p + δt+1 wk m,k c,m + δt+1 wk The gradient of operating gate is: p,k δt =σ btk c,k t k itk − ct−1 atk c,k h k ct−1 − itk The gradient of memorial gate: δtm,k = σ The gradient of input transformation: ∼ δti ,k =h ∼k it h,k k t mt + c,k t 1 − ptk 2.5.4 Experimental Setup Unlikely to subsection 4.4, this section simplifies the experimental process. Firstly, we don’t consider the comp set. We just set 2750 units to the output layer. 
Secondly, we don’t