A Bayesian view of data augmentation.

After my lecture on Principled Bayesian Workflow for a group of machine learners back in August, a discussion arose about data augmentation. The comments were about how it made the data more informative. I questioned that, as there is only so much information in the data; given the model assumptions, it is all in the likelihood. So simply modifying the data should not increase the information but only possibly decrease it (if the modification is non-invertible).

Later, when I actually saw an example of data augmentation and thought about it more carefully, I changed my mind. I now realise that background knowledge is being brought to bear on how the data is modified. So data augmentation is just a way of being Bayesian by incorporating prior probabilities. Right?

Then, thinking some more, it all became trivial, as the equations below show.

P(u|x) ~ P(u) * P(x|u)   [Bayes with just the real data.]
       ~ P(u) * P(x|u) * P(ax|u)   [Add the augmented-data term.]
P(u|x,ax) ~ P(u) * P(x|u) * P(ax|u)   [That's just the posterior given x and ax.]
P(u|x,ax) ~ P(u) * P(ax|u) * P(x|u)   [Change the order of x and ax.]

Now, augmented data is not real data and should not be conditioned on as if it were real. Arguably it is just part of (re)making the prior specification, turning P(u) into P.au(u) = P(u) * P(ax|u).

So change the notation to P(u|x) ~ P.au(u) * P(x|u).
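
To convince yourself the algebra really is that trivial, here is a minimal numeric sketch (a toy Gaussian-mean setup I made up for illustration, nothing from the discussion above): compute the posterior on a grid once treating the augmented data ax as extra data, and once folding P(ax|u) into the prior as P.au(u). The two routes give the same posterior.

```python
import numpy as np

# Toy setup (my own illustration, not from the post): u is the unknown mean of a
# Gaussian with known sd = 1, x is the real data, and ax is "augmented" data that
# background knowledge says can also be treated as informative about u.
rng = np.random.default_rng(0)
x = np.array([1.2, 0.8, 1.5])               # real observations
ax = x + rng.normal(0.0, 0.1, size=x.size)  # augmented observations (perturbed copies)

u_grid = np.linspace(-2.0, 4.0, 2001)       # grid over the parameter u
prior = np.exp(-0.5 * (u_grid / 2.0) ** 2)  # P(u): Normal(0, sd = 2), unnormalized

def lik(data, u):
    """Unnormalized Gaussian likelihood P(data | u) with sd = 1, evaluated on the grid."""
    return np.exp(-0.5 * ((data[:, None] - u[None, :]) ** 2).sum(axis=0))

# Route 1: treat ax as extra data -> P(u|x,ax) ~ P(u) * P(x|u) * P(ax|u)
post_as_data = prior * lik(x, u_grid) * lik(ax, u_grid)
post_as_data /= np.trapz(post_as_data, u_grid)

# Route 2: fold ax into the prior -> P.au(u) ~ P(u) * P(ax|u), then P(u|x) ~ P.au(u) * P(x|u)
prior_au = prior * lik(ax, u_grid)
post_as_prior = prior_au * lik(x, u_grid)
post_as_prior /= np.trapz(post_as_prior, u_grid)

print(np.allclose(post_as_data, post_as_prior))  # True: same posterior either way
```

The choice of how ax is generated is, of course, exactly where the background knowledge enters.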

If you data augment (and you are using likelihood-based ML, implicitly starting with P(u) = 1), you are being a Bayesian whether you like it or not.
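
Continuing the toy sketch above (reusing x, ax, u_grid and lik), the same grid shows that point: with a flat P(u), maximizing the likelihood of real plus augmented data is just a MAP estimate under the induced prior P.au(u) ~ P(ax|u), and the P(ax|u) term pulls the estimate away from the real-data MLE the way a prior would.

```python
# Continuing the toy sketch above (flat prior P(u) = 1, so drop the prior term).
mle_real = u_grid[np.argmax(lik(x, u_grid))]                    # ML on the real data only
mle_aug  = u_grid[np.argmax(lik(x, u_grid) * lik(ax, u_grid))]  # ML on real + augmented data
map_ind  = u_grid[np.argmax(lik(ax, u_grid) * lik(x, u_grid))]  # MAP with induced prior P.au(u) ~ P(ax|u)

print(mle_aug == map_ind)  # True: "ML with augmentation" is MAP under the induced prior
print(mle_real, mle_aug)   # the P(ax|u) term shifts the estimate, as a prior would
```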

So I googled a bit and asked a colleague in ML about the above. They said it made sense to them once they thought about it, but that it was not immediately obvious. They also said it was not common knowledge – so here it is.

Now, better googling turns up more, such as this comment: "Augmentation is also a form of adding prior knowledge to a model; e.g. images are rotated, which you know does not change the class label." And this paper, A Kernel Theory of Modern Data Augmentation by Dao et al., where the introduction states: "Data augmentation can encode prior knowledge about data or task-specific invariances, act as regularizer to make the resulting model more robust, and provide resources to data-hungry deep learning models." Although the connection to Bayes does not seem to be discussed in either.

Further scholarship likely would lead me to consider deleting this post, but what’s the fun in that?