(1/2) The intuition behind PAC-learning and VC dimension (with little math)

Tamara Alexandra Cucumides
3 min read · Apr 29, 2021

The math behind machine learning, and data science in general, can be quite challenging: it requires knowledge of optimization, probability theory, statistics, functional analysis and some other topics… For this reason a lot of people decide to skip the theoretical foundations of data science and go straight to the code (thanks, sklearn!), which might work in some cases, but it's not ideal.

In this post I will talk about the foundations of “learning”, and specifically about the PAC-learning framework, at an intuitive level. In the next post I will write about VC-dimension and we will discuss how to decide whether something is learnable or not. We will also give some theoretical insight into the well-known fact that if you want to train a complex model, you need a lot of data.

PAC-learning

What is learning? In the usual framework, a learning algorithm is an algorithm that, given some data, learns a function that is able to correctly classify it (for simplicity, let’s focus only on the classification task). PAC learning is a framework where we allow probably approximately correct learning, that is, our learning algorithm is probably approximately correct (which doesn’t sound very nice, does it?). Now I will try to explain why PAC learning is something reasonable.

Suppose we have a collection S of n points with their labels and we want to “learn” this data. We go to our learning algorithm, it takes those n points as input, and it returns a classifier h that takes a point as input and returns a predicted label. How do we know that h is a good classifier? The first thing we could require is that the classifier correctly classifies every point of S… but that isn’t very reasonable, is it? How would you feel if someone said that your model isn’t good because it doesn’t have 100% accuracy on the training set?
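
To make this setup concrete, here is a minimal sketch in Python; the toy data, the “true” rule, and the choice of sklearn’s DecisionTreeClassifier as the learner are all my own illustrative assumptions, not part of the framework:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# A toy labeled sample S of n points: x in R^2, labels in {0, 1}
rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-1, 1, size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the "true" rule the learner tries to recover

# The learning algorithm takes S = (X, y) as input and returns a classifier h
learner = DecisionTreeClassifier(max_depth=3)
h = learner.fit(X, y)

# h takes a point as input and returns a predicted label
print(h.predict([[0.2, 0.5]]))
```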

This is where the PAC concepts arise:

  • the approximately correct part says that we call a classifier good enough when it is somewhat close to being perfect, that is, when the amount of mislabeled data is bounded
  • the probably part says that learning is not guaranteed: the learning algorithm will output a nice classifier with some high probability, but not every time

The second point is also reasonable… just because that decision tree fit poorly on that dataset, it doesn’t mean that the algorithm sucks, does it?

Now that we have the intuition, let’s jump into the formal definition. First I will introduce some notation: a class C of classifiers will be, for example, the class of decision trees with fixed hyper-parameters, or the class of linear rules of the type y = mx + b, etc. A classifier is any function that assigns a label to a point. Also, the error between two classifiers f and h will be the probability that, for a point x drawn from a distribution D, we have f(x) ≠ h(x).
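
This error can be estimated in code by sampling points from D and counting disagreements. In the sketch below, the uniform distribution on [-1, 1]² and the two threshold rules are illustrative assumptions of mine, not part of the definition:

```python
import numpy as np

def estimate_error(f, h, sample_from_D, n_samples=50_000):
    """Monte Carlo estimate of err(f, h) = P_{x ~ D}[f(x) != h(x)]."""
    X = sample_from_D(n_samples)
    return np.mean([f(x) != h(x) for x in X])

# Illustrative choices: D uniform on [-1, 1]^2, f and h two slightly different linear rules
rng = np.random.default_rng(1)
sample_from_D = lambda m: rng.uniform(-1, 1, size=(m, 2))
f = lambda x: x[0] + x[1] > 0
h = lambda x: x[0] + x[1] > 0.1

print(estimate_error(f, h, sample_from_D))  # small but nonzero disagreement probability
```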

Definition: PAC-learning. A class of boolean classifiers C is PAC-learnable if there exists an algorithm L such that for every f in C, for every probability distribution D, and for every ε and δ in (0, 0.5), the algorithm L, on input ε and δ and a random sample picked from D (labeled by f), outputs with probability at least 1 - δ a classifier h that differs from f in at most ε.

Summarizing, a learning algorithm under the PAC framework is an algorithm that, given a dataset S labeled by a target classifier f, outputs with high probability a classifier h that differs little from f.
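
As a rough empirical illustration of “probably approximately correct” (the target f, the distribution D, the shallow decision tree and the parameter values below are all toy choices of mine), we can repeat the experiment of drawing a fresh sample, training h, and checking how often its error against f stays below ε:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
epsilon, delta = 0.1, 0.05           # accuracy and confidence parameters
n, n_trials, n_test = 200, 500, 20_000

f = lambda X: (X[:, 0] > 0).astype(int)   # toy target: label 1 iff the first coordinate is positive

successes = 0
for _ in range(n_trials):
    # draw a fresh random sample S from D (uniform on [-1, 1]^2), labeled by f, and train h on it
    X_train = rng.uniform(-1, 1, size=(n, 2))
    h = DecisionTreeClassifier(max_depth=3).fit(X_train, f(X_train))

    # estimate err(f, h) = P_{x ~ D}[f(x) != h(x)] on a large fresh sample
    X_test = rng.uniform(-1, 1, size=(n_test, 2))
    err = np.mean(h.predict(X_test) != f(X_test))
    successes += err <= epsilon

# "probably approximately correct": the fraction of trials with error <= epsilon
# should be at least 1 - delta once n is large enough for this learner
print(successes / n_trials, "should be at least", 1 - delta)
```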

In the next post we will see how these parameters ε and δ, together with the complexity of the model, affect the amount of data we need to guarantee PAC learning.
