Few Shot Learning

One of the drawbacks of neural networks is the huge amount of training data they require. Although this allows them to achieve state-of-the-art performance in a large range of tasks, often you may be in a situation where not enough data is available for a specific problem.

This is a position that a big online grocery retailer could easily encounter. For example, a potential Ocado app could enable shopping by taking pictures of products: once a product is recognised, it is immediately added to your shopping basket. But Ocado.com’s product range is constantly expanding and changing (54,000 SKUs and counting), so getting lots of data quickly for every new product would be challenging.

What would be needed in this case is a method of learning from a handful of stock product images for new items as they are added to the catalog.

Machine learning has made advances in generalising from just a few examples per class, known as few shot learning [1][2][3][4]. Although few shot learning methods can generalise very quickly to novel classes, they are normally evaluated on only a few classes at a time. This is a drawback when you consider that Ocado.com has tens of thousands of products that could need to be classified.

We therefore devised a two-step solution, which outperformed both of our baselines by ~7% on our experimental dataset.

To define the problem more precisely, we have a dataset of 236 classes, referred to as \(D\), for which we can use all the available data for training. Then we have a set of 50 classes, \(\hat{D}\), for which we only use 5 images per class to train our neural networks. The rest of the data in those 50 classes is used for evaluation.
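The split of \(\hat{D}\) described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name `few_shot_split`, the toy image identifiers, and the choice of picking the 5 training images at random with a fixed seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def few_shot_split(samples, shots=5, seed=0):
    """Split (image_id, label) pairs into `shots` training images per class
    and an evaluation set holding everything else."""
    by_class = defaultdict(list)
    for image_id, label in samples:
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for label in sorted(by_class):
        ids = list(by_class[label])
        rng.shuffle(ids)  # assumption: the few shots are chosen at random
        train += [(i, label) for i in ids[:shots]]
        evaluation += [(i, label) for i in ids[shots:]]
    return train, evaluation


# Toy stand-in for D-hat: 3 classes with 8 images each (hypothetical identifiers).
toy = [(f"img_{c}_{n}", c) for c in ("apple", "bread", "milk") for n in range(8)]
train, evaluation = few_shot_split(toy, shots=5)
```

With 8 images per class, each class contributes 5 images to `train` and the remaining 3 to `evaluation`.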


Our setup consists of a two-step process.

Firstly, we narrow down the candidate classes from either 236 or 50, depending on which set of classes we are using, to just 5 potential correct answers.

The second step involves training a Matching Network to distinguish between the 5 potential classes, referred to as the support set, and select the overall most likely solution.

Diving into both of these stages in more detail:

Support Set Extraction

In this stage we want to reduce the pool of classes that an image, \(\hat{x}\), could belong to down to just 5 options. To accomplish this, we start from an Inception-v3 network, \(S(\hat{x}, \theta)\), with pre-trained ImageNet weights. We then retrain it on the set of classes \(D\) to perform standard image classification. Once the network is trained, we record its top 5 predictions for every datapoint in \(D\).

For \(\hat{D}\) we select 5 examples per class, which represent the data we are allowed to use for training. We then take \(S(\hat{x}, \theta)\), remove its final layer, and freeze the weights \(\theta\).

We now add a new layer with weights \(W\), which acts on the feature vector given by \(S(\hat{x}, \theta)\). We experimented with two options: a simple linear classification layer, and a cosine similarity layer as proposed in [2]. The cosine similarity layer’s weights act as prototypes for every class, and classification is based on the similarity between the feature vector of \(\hat{x}\) and each weight vector in \(W\).
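To make the difference between the two heads concrete, here is a small sketch in plain Python. It assumes the feature vector from the frozen network has already been computed; the function names, the toy 3-dimensional features and the weight values are illustrative assumptions, and any scaling factor or training procedure used in [2] is left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def linear_scores(features, W, bias):
    """Linear head: one dot product plus bias per class."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(W, bias)]


def cosine_scores(features, W):
    """Cosine head: each row of W is a class prototype, and the score is
    the cosine similarity between the features and that prototype."""
    return [cosine(features, row) for row in W]


# Toy feature vector standing in for the output of the frozen network.
features = [1.0, 0.0, 1.0]
# Two class prototypes (hypothetical values).
W = [[2.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]

scores = cosine_scores(features, W)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

Because the cosine head normalises both vectors, only the direction of a prototype matters: the first prototype points the same way as `features`, so it scores close to 1 regardless of its length.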

The weights \(W\) are then optimised for image classification using the 5 training samples per class in \(\hat{D}\). Ultimately, we found that the linear classification layer performed equivalently to the cosine similarity layer in our case, so we used the linear layer in our final solution. Once \(W\) is trained, we obtain the top 5 predictions for every datapoint in \(\hat{D}\).
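Extracting the top 5 predictions from a vector of class scores amounts to a sort. A minimal sketch, where `top_k_classes` and the example scores are hypothetical and stand in for the classifier output:

```python
def top_k_classes(scores, k=5):
    """Return the indices of the k highest-scoring classes, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical scores over 7 classes for a single image.
scores = [0.1, 2.3, -0.5, 1.7, 0.9, 3.0, 0.2]
candidates = top_k_classes(scores, k=5)
```

The resulting `candidates` list holds the class indices that go on to form the support set in the next stage.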

Matching Network

Once the top 5 class predictions for a given datapoint are identified, we use them to form the support set in a Matching Network architecture. Matching Networks were originally proposed in [1]; here we give a concise overview of their mechanism, and the paper should be consulted for the full details.

Matching Networks make a prediction \(\hat{y}\) based on,

$${\huge \hat{y} = \sum^k_{i=1} a(\hat{x}, x_i)y_i}$$

where \(a\) acts as a similarity measure between the test point \(\hat{x}\) and \(x_{i}\), an image drawn from the support set, and \(y_{i}\) is the one-hot label vector for \(x_{i}\).

We can define \(a\) more explicitly as

$${\huge a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum^k_{j=1} e^{c(f(\hat{x}), g(x_j))}}}$$

in which \(f\) and \(g\) are neural networks and \(c\) is a cosine similarity. Thus, we optimise \(f\) and \(g\) to predict a class for \(\hat{x}\) based on the cosine similarity of its embedding with those of the support set. The overall structure of the solution is shown below in Figure 2.
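The two equations above can be sketched in a few lines of plain Python. The sketch assumes the embeddings \(f(\hat{x})\) and \(g(x_i)\) have already been computed, and it uses one support image per candidate class purely for illustration; the function names and toy 2-dimensional embeddings are assumptions, not the setup used in the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity c(u, v) (0.0 if either vector has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def attention(query_emb, support_embs):
    """a(x_hat, x_i): softmax over cosine similarities to the support set."""
    sims = [cosine(query_emb, s) for s in support_embs]
    m = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def matching_predict(query_emb, support_embs, support_labels, num_classes):
    """y_hat = sum_i a(x_hat, x_i) * y_i, with y_i a one-hot label vector."""
    weights = attention(query_emb, support_embs)
    y_hat = [0.0] * num_classes
    for w, label in zip(weights, support_labels):
        y_hat[label] += w  # adding w * one_hot(label)
    return y_hat


# Toy embeddings: f(x_hat) for the query, g(x_i) for 5 support images.
query = [1.0, 0.0]
support = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5], [0.2, -0.9]]
labels = [0, 1, 2, 3, 4]  # one support image per candidate class

y_hat = matching_predict(query, support, labels, num_classes=5)
predicted = max(range(len(y_hat)), key=y_hat.__getitem__)
```

Since the attention weights sum to 1, `y_hat` is a probability distribution over the candidate classes, and the prediction is the class that accumulates the most attention.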

We can also add what [1] refers to as Full Context Embeddings (FCE): additional neural networks that change the embeddings of \(x_1, \dots, x_k\) and \(\hat{x}\) conditioned on the whole support set. However, in our experiments these additional networks did not improve performance. Most likely this is because our dataset is small compared to ImageNet / mini-ImageNet scale, which is where the FCE benefit was seen in the original paper.



We evaluated our full solution and compared it to two baselines:

  • Baseline Classifier: Rather than return the top 5 predictions in the yellow block in Figure 2, simply return the top prediction and skip the Matching Network entirely. 

  • Baseline Embeddings: Directly use \(S(\hat{x},\theta)\) as \(g\) and \(f\).

In the table below, we can see that our full solution of support set extraction followed by a Matching Network outperformed the two baselines by ~7%.



The use of a Matching Network acting over the top \(k\) predictions of an underlying image classifier provided a notable performance boost. It would be interesting to see whether this pre-processing step of extracting a support set generalises to ImageNet, where we can test across a much larger number of classes.

By Giulio Zizzo, PhD Intern, Ocado Technology


[1] Vinyals, Oriol, et al. "Matching networks for one shot learning." Advances in Neural Information Processing Systems. 2016.
[2] Chen, Wei-Yu, et al. "A closer look at few-shot classification." arXiv preprint arXiv:1904.04232. 2019.
[3] Snell, Jake, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning." Advances in Neural Information Processing Systems. 2017.
[4] Finn, Chelsea, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks." Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017.
