Introduction:
Continual learning stands as a pivotal concept in machine learning, allowing models to adapt and evolve as new data streams in. In this technical exploration, we delve into the mechanics of sequential optimization, focusing on its application in the context of Edge computing. By leveraging lightweight techniques, we enable models to keep learning quickly without sacrificing memory or computational resources. Through detailed examples covering basic optimization, image warping, and neural network classification, we will show how sequential optimization applies to continual learning.
The Challenge of Continual Learning:
Continual learning addresses a fundamental challenge: how to imbue machine learning models with the capacity to accumulate knowledge over time while retaining past learnings. Unlike static learning approaches, which often suffer from rigidity and stagnation, continual learning aims to facilitate adaptability, improvement, and resource efficiency. However, it must contend with the inherent constraints of evolving data distributions, limited computational resources, and the risk of catastrophic forgetting.
Techniques for Continual Learning:
State-of-the-art continual learning involves building retraining strategies that mix old and new data in different proportions. Please refer to the vocabulary at the end of this post for key terms and ideas.
To surmount the above challenges, sequential optimization emerges as a cornerstone technique in the arsenal of continual learning. By dynamically updating models with new observations as they arrive, sequential optimization enables continual adaptation without necessitating retraining on the entire dataset. This approach not only conserves memory but also facilitates real-time learning in resource-constrained environments such as Edge devices.
Sequential Optimization:
Sequential optimization, also known as step-by-step estimation, differential adjustment, or Kalman filtering, is a least squares method in which the current estimate is derived from the previous estimate plus a correction term, using the rules of matrix partitioning (Tobey 1930; Tienstra 1956; Schmid & Schmid 1965; Kalman 1960).
Sequential optimization serves both to chop a large optimization problem into smaller accumulating ones and to update a set of unknown parameters as new observations become available. The similarities to the continual learning paradigm are obvious: faster deployment of solutions, adaptability to new data, improvement with experience, and lightweight resource requirements.
Let’s start with the formulation of basic least squares optimization before deriving the sequential updating scheme. The figure below starts at the top-left corner by defining the vector of unknowns X holding u values, the vector of observations L holding n values, and the model F holding r equations tying the unknowns and the observations.
When the model F is not linear, it can be linearized using a Taylor series and optimized in iterations, as shown in the top-right part of the figure. The model can then be written in linear matrix notation, encompassing the design matrices A and B as well as the misclosure vector W.
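Since the figure itself is not reproduced in this text, here is one standard way of writing the linearized model for reference (following the least squares notation of Wells & Krakiwsky, listed under Resources; sign conventions in the original figure may differ):

```latex
A\,\delta + B\,V + W = 0,
\qquad
A = \left.\frac{\partial F}{\partial X}\right|_{X^{0}},\quad
B = \frac{\partial F}{\partial L},\quad
W = F(X^{0}, L)
```

where delta is the correction to the approximate values X0 of the unknowns and V is the vector of observation residuals.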
Moving to the lower-right part of the figure and introducing the square weight matrix P, derived from the variance-covariance matrix of the observations, we minimize the squares of the observation residuals to arrive at the final solution for the unknowns X. Note that K here is a helper vector and V is the vector of residuals. If the model F can explicitly separate the unknowns from the observations, the set of solution equations reduces to the simpler version shown in the frame in the lower-left corner.
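Under the same notation, the solution equations described above take the following standard form (a reconstruction based on the Wells & Krakiwsky derivation, not a copy of the original figure):

```latex
% Combined case (lower-right frame):
M = \left(B\,P^{-1}B^{\top}\right)^{-1},\qquad
\hat{\delta} = -\left(A^{\top} M A\right)^{-1} A^{\top} M\, W,\qquad
K = -M\left(A\hat{\delta} + W\right),\qquad
V = P^{-1} B^{\top} K

% Explicit case (lower-left frame), F = AX - L = 0:
\hat{X} = \left(A^{\top} P A\right)^{-1} A^{\top} P\, L,\qquad
V = A\hat{X} - L
```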
Carrying the same matrix notation, the sequential formulation is expressed in terms of updating estimates: the solution at the current step k (not to be confused with the helper vector K) is given as a correction to the previous estimate at step k-1, as seen in the frame below.
Note that beyond the vectors K, X, and V, we now need to update the inverse of the normal matrix N from cycle to cycle; this matrix serves as the memory term, carrying knowledge from the old data into the updating optimization, balanced by the weight matrix P of the current observations.
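In equation form, one common way of writing this update for the explicit model F = AX - L (a reconstruction consistent with the description above, not a copy of the original frame) is:

```latex
\hat{X}_{k} = \hat{X}_{k-1}
  + N_{k-1}^{-1} A_{k}^{\top}
    \left(P_{k}^{-1} + A_{k} N_{k-1}^{-1} A_{k}^{\top}\right)^{-1}
    \left(L_{k} - A_{k}\hat{X}_{k-1}\right)

N_{k}^{-1} = N_{k-1}^{-1}
  - N_{k-1}^{-1} A_{k}^{\top}
    \left(P_{k}^{-1} + A_{k} N_{k-1}^{-1} A_{k}^{\top}\right)^{-1}
    A_{k} N_{k-1}^{-1}
```

where N = A-transpose P A is the normal matrix accumulated over all past observations.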
The formulation above is lightweight and does not require heavy resources. Note that the single matrix inversion per cycle is of size (n x n), where n is the number of new observations in the current update.
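For concreteness, here is a minimal NumPy sketch of one such update cycle; the function and variable names are ours:

```python
import numpy as np

def sequential_update(x_prev, Ninv_prev, A_k, L_k, P_k):
    """One cycle of sequential least squares for the explicit model F = AX - L.

    x_prev    : (u,)   previous estimate of the unknowns
    Ninv_prev : (u, u) inverse normal matrix, the 'memory' term
    A_k       : (n, u) design matrix of the n new observations
    L_k       : (n,)   new observation vector
    P_k       : (n, n) weight matrix of the new observations
    """
    # The matrices inverted here are only (n x n), with n the number of new
    # observations per cycle; in practice the inverse of P_k (the observation
    # covariance) is often available directly.
    S = np.linalg.inv(P_k) + A_k @ Ninv_prev @ A_k.T
    G = Ninv_prev @ A_k.T @ np.linalg.inv(S)        # gain
    x_new = x_prev + G @ (L_k - A_k @ x_prev)       # corrected estimate
    Ninv_new = Ninv_prev - G @ A_k @ Ninv_prev      # updated memory term
    return x_new, Ninv_new
```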
Example 1: Simple Optimization
Let's test those equations on a trivial example first. Suppose we have a beam of length l that we measure with a tape ten times. Our vector of observations L contains all ten measurements, so it is of size (10x1), as seen in the diagram below.
Our model in this case is the trivial F = AX - L = 0, as the single unknown (u = 1) is simply multiplied by the design matrix A, which here is a vector of ten ones. We expect the least squares solution to return exactly the average of all observations as the best estimate of our single unknown, with an error estimate equal to the standard observation error divided by the square root of ten. When applying the sequential formulation, we start with a single observation and let our single unknown, the length of the beam, update with every new observation we introduce.
As the plot in the diagram shows, the sequential solution is improving with every step, converging exactly to the least squares solution of the whole set of observations.
The equations thus function as expected in this simple example: we reach the same solution whether we go direct or sequential, and every new observation improves the accuracy of the updated solution.
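As an illustration, the sketch below runs the beam example end to end using the sequential_update function from above; the true length, noise level, and random seed are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
true_length, sigma = 2.0, 0.01                 # illustrative values
measurements = true_length + sigma * rng.normal(size=10)

# Start from the first observation: u = 1 unknown, A is a single one
x = np.array([measurements[0]])
Ninv = np.array([[sigma**2]])                  # variance of the first estimate
A = np.array([[1.0]])
P = np.array([[1.0 / sigma**2]])               # weight of one observation

for obs in measurements[1:]:
    x, Ninv = sequential_update(x, Ninv, A, np.array([obs]), P)
    print(f"estimate: {x[0]:.5f}, std: {np.sqrt(Ninv[0, 0]):.5f}")

print("direct least squares (mean):", measurements.mean())  # matches final x
```

After all ten updates, the standard deviation printed above equals sigma divided by the square root of ten, exactly as predicted for the direct solution.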
Example 2: Single Camera Resection
Our next example is closer to real life and shows that the same equations work even as the model gets more complex. We perform a single camera resection, deriving the six exterior orientation parameters of the camera (our unknowns, u = 6) from measurements of the pixel locations of several points in space.
Our model F is given in the above diagram, tying the unknowns (embedded in the rotation matrix R and translation vector T) to the observations: u, v on the image and X, Y, Z in the captured space. This model is not linear, so it must be linearized and solved in iterations. Again, we apply direct least squares to all observation points and compare it to growing the solution sequentially, adding the observations of one new point at a time.
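Since the diagram is not reproduced here, a standard form of the resection model is the pair of collinearity equations (assuming a known focal length f; the exact conventions in the original diagram may differ):

```latex
u = f\,\frac{r_{11}(X - X_{0}) + r_{12}(Y - Y_{0}) + r_{13}(Z - Z_{0})}
            {r_{31}(X - X_{0}) + r_{32}(Y - Y_{0}) + r_{33}(Z - Z_{0})},
\qquad
v = f\,\frac{r_{21}(X - X_{0}) + r_{22}(Y - Y_{0}) + r_{23}(Z - Z_{0})}
            {r_{31}(X - X_{0}) + r_{32}(Y - Y_{0}) + r_{33}(Z - Z_{0})}
```

where the r_ij are the elements of R and (X0, Y0, Z0) is the camera position given by T.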
For simple testing of the model, we take a planar whiteboard and measure crosses drawn on it, evaluating our solution by how well the whiteboard plane warps onto the image plane. The short video clip shows how, after the first four measured points, we can establish an initial solution, and then improve the alignment of the image to our whiteboard with every new measured point.
Again, sequential optimization improves with every additional measurement and converges to the same solution as the direct approach. It is important to note that, like every step-wise process of an integrative nature, sequential optimization suffers from an unavoidable drift that may eventually affect stability.
Example 3: Image Classification Network
Our third example looks at a simple neural network aimed at classifying Fashion-MNIST items, based on a database of input images of size 28x28.
Let's assume we are training a simple three-layer network, as seen in the above diagram, consisting of fully connected layers and ReLU activations. Our initial database has 50,000 tagged items with which we train the network. Later, 2,500 new samples become available for improving the network's accuracy. Let's focus on the last fully connected layer and use the new data to improve its weights and biases.
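As a rough sketch of the setup, the forward pass might look like the following; only the final 512-to-10 layer is stated above, so the 28x28-to-784 flattening and the hidden-layer widths are our assumptions:

```python
import numpy as np

def forward(x, params):
    """Forward pass of the assumed three-layer network. Only the final
    512-to-10 layer is stated in the text; the other sizes are guesses."""
    h1 = np.maximum(0.0, x @ params["W1"] + params["b1"])   # 784 -> 512, ReLU
    h2 = np.maximum(0.0, h1 @ params["W2"] + params["b2"])  # 512 -> 512, ReLU
    logits = h2 @ params["W3"] + params["b3"]               # 512 -> 10
    return h2, logits   # h2 is the activation vector feeding the last layer
```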
We can either re-train the last layer directly on all 52,500 samples, or sequentially optimize the weights and biases of that layer based on the new 2,500 items only.
Our unknowns for this 512-to-10 layer are 512x10 weights plus 10 biases, a total of 5,130 unknowns. Our observations are the 2,500 labels (converted to one-hot vectors of length 10), a total of 25,000 observations.
In the next diagram, we see that the linear model can be written in matrix notation so that the design matrix A is built by duplicating the vector of 512 activations of the previous layer, with an identity matrix of size (10x10) appended on its right to accommodate the 10 biases that are part of the vector of unknowns.
Our model becomes F = AX - L, and we can now apply the same set of equations of the sequential scheme. Results show that the same improvement is achieved directly and sequentially (see below).
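A minimal sketch of this last-layer update, reusing the sequential_update function from earlier; the unit weight matrix and the per-sample update granularity are our assumptions:

```python
import numpy as np

def design_rows(h):
    """Design-matrix block for one sample: h is the (512,) activation vector
    of the previous layer. The unknowns X stack the 10 weight columns
    (512 values each) followed by the 10 biases, so A is (10, 5130)."""
    return np.hstack([np.kron(np.eye(10), h.reshape(1, -1)), np.eye(10)])

def update_last_layer(x, Ninv, activations, labels):
    """Fold new samples into the last layer one at a time, using the
    sequential_update() sketch from earlier in this post."""
    P = np.eye(10)                          # unit weights, our assumption
    for h, y in zip(activations, labels):
        A = design_rows(h)                  # (10, 5130)
        L = np.eye(10)[y]                   # one-hot target vector
        x, Ninv = sequential_update(x, Ninv, A, L, P)
    return x, Ninv
```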
We can take this example one step further and assume we start with only the initial 2,500 items. We then deploy our network, knowing its accuracy is still sub-optimal. With every new batch of 2,500 items, we compare full re-training on all items accumulated up to that point against the gradual improvement achieved with sequential optimization.
The plot below shows that accuracy is even slightly better along the continual path (which we try to explain later), and that only after about 15 such steps of adding new batches of 2,500 samples does drift kick in, causing the gradual solution to lose stability.
The sequential updating step is about 100x faster than the full re-train and requires far less memory.
Conclusion:
Continual learning stands poised as a transformative paradigm in machine learning, offering a pathway toward adaptive, resource-efficient models capable of thriving in dynamic environments. By harnessing the power of sequential optimization techniques, we empower Edge devices to improve alongside evolving data streams, paving the way for a new era of intelligent, responsive systems. As AI tasks increasingly demand adaptability and agility, continual learning emerges as a central pillar in shaping the future of artificial intelligence.
Continual learning vocabulary
i.i.d. (Independent and Identically Distributed):
refers to a set of data points that are assumed to be drawn independently from the same probability distribution. Each data point is identically distributed, meaning that they share the same underlying statistical properties, such as when training models on a random sample of data.
Catastrophic Forgetting:
occurs when a model trained on multiple tasks forgets information about previously learned tasks as it learns new ones. This phenomenon can lead to a significant drop in performance on old tasks when adapting to new ones.
Replay:
a technique used to mitigate catastrophic forgetting. Replay involves periodically revisiting old data (samples from previous tasks) during training; thus, the model can retain knowledge about earlier tasks while learning new ones.
Experience replay is a specific form of replay used in reinforcement learning. It involves storing past experiences (state-action-reward-next state tuples) in a replay buffer. During training, the agent samples from this buffer to learn from a mixture of recent and past experiences. Experience replay helps stabilize training and improve sample efficiency.
Online Continual Learning:
specifically focuses on learning from a continuous stream of data or tasks in an online fashion. Models adapt incrementally as new data arrives, without the need for explicit task boundaries or batch processing.
Strategy:
refers to an approach or methodology used to address the challenges posed by learning multiple tasks sequentially. Strategies can include architectural modifications, regularization techniques, memory-based methods, and replay mechanisms.
Resources
Avalanche
Avalanche: an End-to-End Library for Continual Learning (continualai.org)
Sequential optimization
Wells DE, Krakiwsky EJ, The Method of Least Squares, Lecture Notes 18, University of New Brunswick, May 1971