Calculating Sample Size Using VC Dimensions: A Machine Learning Way

How, and why, do we need to know the optimal sample size?

Sreeram A S K
Towards Data Science


The data collection process is a tedious task in itself and can span years. In domains like healthcare, it involves high costs and long timelines. For example, procuring information about a particular disease, say 100 data points, may require screening two or three times the desired number of patients. Obtaining patient data is not only time-consuming but also comes at a very high price. Collecting huge amounts of data when a little would do is wasteful; at the same time, failing to meet the data requirements will not yield good results when a machine learning algorithm is applied. Approximating an optimal sample size is therefore a decisive task, and one that needs to be settled before performing any analysis on the data.

Here I will discuss an interesting technique to approximate the sample size using the concept of VC dimensions, tweaking it slightly for our purposes. Before proceeding further, I would like to outline a statistical way of approximating the sample size, point out its limitations, and thereby motivate the VC-dimension way of doing it.

Statistical Sample Size estimation

To explain the statistical way of estimating the sample size, I take reference from the work presented here by Prof. Lisa Sullivan of Boston University. I recommend that readers go through the given link before going any further, as I summarize the methods proposed in it only briefly. The proposed method estimates a sample size that keeps the margin of error E, caused by the variance present in the data, within a chosen limit. To determine the optimal sample size, one must first choose the desired margin of error, which is typically the job of a domain expert and varies with the problem. The formula to calculate the sample size in this way is given below.
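(The formula appears as an image in the original article; the standard expression it refers to, consistent with the definitions that follow, is

n = (Z · σ / E)²

i.e., the sample size is the square of the Z value times the standard deviation divided by the margin of error.)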

This formula generates the sample size ‘n’ required to ensure that the margin of error, E, does not exceed a specified value. To solve for n, we must supply values for “Z,” “σ,” and “E.”

‘σ’ is the sample standard deviation

‘E’ is the desired margin of error

‘Z’ is the value from the standard normal (Z) distribution for the desired confidence level (e.g., 1.96 for 95% confidence)

From the above formula, it’s important to observe that ’n’, the sample size, depends on the chosen margin of error “E”.
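As a quick illustration, here is a minimal sketch of the calculation above in Python (my own example, not code from the referenced work):

```python
import math

def statistical_sample_size(z: float, sigma: float, margin_of_error: float) -> int:
    """Sample size n = (Z * sigma / E)^2, rounded up to the next whole subject."""
    n = (z * sigma / margin_of_error) ** 2
    return math.ceil(n)

# Example: 95% confidence (Z = 1.96), standard deviation 15, margin of error 3
print(statistical_sample_size(z=1.96, sigma=15.0, margin_of_error=3.0))  # -> 97
```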

The statistical way of determining the sample size, though it sounds logical, comes with a very clear innate limitation.

Statistical sample size estimation does not give a way of choosing a sample size that can reduce the error, or increase the accuracy, of predictive machine learning models applied to the data.

In fact, the sample size estimated this way does not in any way take into account the machine learning model or algorithm that will be applied to the data. This makes the statistical method rather naive and less helpful.

Is there a way of choosing the sample size that can reduce the error of the machine learning model applied to the data? Or is there a way of determining the expected error of a machine learning model for a particular sample size? The answers to these questions are tightly knit with the concept of the VC dimension.

The method of VC dimensions gives a way of determining a sample size that can reduce the error of any machine learning model applied to it.

VC dimensions

The concept of the VC dimension paves a way for determining the test error based on the machine learning algorithm chosen to be applied to the data and the training error associated with it.

Let me first briefly explain the concept of the Vapnik–Chervonenkis (VC) dimension and how it works.

Note: My explanation of the VC dimension is deliberately brief, as the main focus of this article is on calculating the sample size using it rather than on the concept itself.

The VC dimension measures the capacity of a statistical classification algorithm, i.e., the richness of the set of functions it can learn. Simply put, it measures the power of a classification model.

The typical definition of the VC dimension of a statistical classification algorithm is: the cardinality of the largest set of points that can be shattered by the algorithm.

Shattering

Here I directly borrow the lucid definition of shattering from Wikipedia.

“A classification model ‘f ’ with some parameter vector ‘θ’ is said to shatter a set of data points {X1, X2…, Xn} if, for all assignments of labels to those points, there exists a ‘θ’ such that the model ‘f ’ makes no errors when evaluating that set of data points.”


Thus the largest number of points that an algorithm can shatter defines its VC dimension.
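To make shattering concrete, here is a small sketch of my own (assuming scikit-learn is available) that checks whether three non-collinear points in the plane are shattered by a linear classifier, by trying every possible labelling:

```python
import itertools
import numpy as np
from sklearn.svm import SVC

# Three non-collinear points in the plane
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

shattered = True
for labels in itertools.product([0, 1], repeat=len(X)):
    y = np.array(labels)
    if len(set(labels)) == 1:
        continue  # a single-class assignment is trivially separable
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # near hard-margin linear classifier
    if not np.array_equal(clf.predict(X), y):
        shattered = False
        break

print("3 points shattered by a linear classifier:", shattered)
```

Repeating the check with a fourth point fails for some labelling (an XOR-style assignment), which is why the VC dimension of a linear classifier in two dimensions is 3.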

Now, moving to the central theme of this article, how VC dimensions can come in handy in identifying the right sample size, let me introduce the formula that gives a probabilistic upper bound on the test error generated by an algorithm.
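(The bound appears as an image in the original article; the standard form of Vapnik's bound, consistent with the quantities described in the next paragraph, is

test error ≤ training error + √( ( D · (ln(2N/D) + 1) − ln(η/4) ) / N )

which holds with probability at least 1 − η.)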

Though at first look it may seem scary, let me unpack it part by part. The formula considers the training error of a model, the VC dimension ‘D’ of the applied algorithm, the size of the sample on which the model is applied, ’N’, and a confidence parameter ‘η’ with 0 ≤ η ≤ 1 (the bound holds with probability 1 − η).

Hence, simply put, the above formula takes a model’s VC dimension, the sample size, and the training error it obtained, and produces a probabilistic upper bound on the test error that can be expected given all of these factors. Up to this point I have not introduced anything novel beyond the general concept and workings of the VC dimension. In fact, we have already reached the answer without quite noticing it: a small tweak of the above formula gives us the anticipated approximation of the sample size.

We rearrange the above formula so that we can solve for ’N’, the sample size. Since our aim is no longer an upper bound on the test error but a sample size that keeps the test error acceptable, we choose a test error we can tolerate for the problem at hand and sweep across a range of values of ‘N’ until the bound reaches the desired test error. So instead of solving for the test error as before, we now solve for the sample size.

In general, the stricter the test error we demand, the larger the value of ‘N’ becomes. Thus the lower the test error we want, the bigger our sample size should be. The resulting sample size approximation is therefore one that keeps the test error within our chosen bound.
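Below is a minimal sketch of this sweep (my own illustration, not code from the original article), assuming the standard Vapnik bound shown earlier, a hypothetical training error, and a hypothetical VC dimension for the chosen model:

```python
import math

def vc_test_error_bound(train_error: float, vc_dim: int, n: int, eta: float = 0.05) -> float:
    """Probabilistic upper bound on the test error for sample size n (Vapnik's bound)."""
    complexity = (vc_dim * (math.log(2 * n / vc_dim) + 1) - math.log(eta / 4)) / n
    return train_error + math.sqrt(complexity)

def required_sample_size(target_test_error: float, train_error: float,
                         vc_dim: int, eta: float = 0.05, n_max: int = 10_000_000) -> int:
    """Sweep over N until the bound drops below the tolerated test error."""
    n = vc_dim  # start where ln(2N/D) is non-negative
    while n <= n_max:
        if vc_test_error_bound(train_error, vc_dim, n, eta) <= target_test_error:
            return n
        n = int(n * 1.1) + 1  # grow N geometrically to keep the sweep cheap
    raise ValueError("Target test error not reachable within n_max samples")

# Example: a model with an assumed VC dimension of 10 and 5% training error,
# a tolerated test error of 15%, and eta = 0.05
print(required_sample_size(target_test_error=0.15, train_error=0.05, vc_dim=10))
```

As the example shows, tightening the target test error forces the sweep to run further before the bound is satisfied, which is exactly the behaviour described above.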

This way of approximating the sample size is extremely simple and also flexible with respect to the machine learning algorithm we apply to the data: the approximated sample size varies from model to model.

Hope this was informative!

*some of the definitions and figures are sourced from Wikipedia and the internet.
