When the dataset is small, features are your friends

Feature engineering can compensate for the lack of data.

Published in

Towards Data Science

7 min read · Apr 3, 2023

In the rapidly evolving world of Artificial Intelligence (AI), data has become the lifeblood of countless innovative applications and solutions. Indeed, large datasets are often considered the backbone of robust and accurate AI models. However, what happens when the dataset at hand is relatively small? In this article, we explore the critical role of feature engineering in overcoming the limitations posed by small datasets.

Toy dataset

Our journey starts with creating the dataset. In this example, we will perform a simple signal classification. The dataset has two classes: sine waves that complete one full cycle over the sampling window (frequency 1) belong to class 0, and sine waves that complete two cycles (frequency 2) belong to class 1. The code for signal generation is presented below. It generates a sine wave, adds Gaussian noise, and applies a random circular shift that acts as a random phase shift. The noise and the phase shift make the signals diverse, so the classification problem becomes non-trivial (albeit still easy with correct feature engineering).

import numpy as np

def signal0(samples_per_signal, noise_amplitude):
    # Class 0: one full sine cycle over x in [0, 4]
    x = np.linspace(0, 4.0, samples_per_signal)
    y = np.sin(x * np.pi * 0.5)
    # Additive Gaussian noise
    n = np.random.randn(samples_per_signal) * noise_amplitude
    
    s = y + n
    
    # Random circular shift, which acts as a random phase shift
    shift = np.random.randint(low=0, high=int(samples_per_signal / 2))
    s = np.concatenate([s[shift:], s[:shift]])
    
    return np.asarray(s, dtype=np.float32)

def signal1(samples_per_signal, noise_amplitude):
    # Class 1: two full sine cycles over x in [0, 4]
    x = np.linspace(0, 4.0, samples_per_signal)
    y = np.sin(x * np.pi)
    n = np.random.randn(samples_per_signal) * noise_amplitude
    
    s = y + n
    
    shift = np.random.randint(low=0, high=int(samples_per_signal / 2))
    s = np.concatenate([s[shift:], s[:shift]])
    
    return np.asarray(s, dtype=np.float32)
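
The two generators differ only in the frequency of the sine wave, so they could be folded into a single function. The sketch below is one possible way to do that, not the code used in the experiments; the name `make_signal` and the `cycles` and `rng` parameters are assumptions introduced for illustration. It uses `np.roll`, which performs the same circular shift as the slicing and concatenation above.

```python
import numpy as np


def make_signal(cycles, samples_per_signal, noise_amplitude, rng=None):
    """Noisy sine wave with `cycles` full periods and a random circular shift.

    Hypothetical helper: cycles=1 mirrors signal0, cycles=2 mirrors signal1.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.linspace(0, 4.0, samples_per_signal)
    y = np.sin(x * np.pi * 0.5 * cycles)
    noise = rng.standard_normal(samples_per_signal) * noise_amplitude
    # Same range of shifts as in the original generators (upper bound excluded)
    shift = rng.integers(0, samples_per_signal // 2)
    # np.roll(s, -shift) is equivalent to concatenating s[shift:] and s[:shift]
    return np.roll(y + noise, -shift).astype(np.float32)
```

Passing a seeded generator, for example `make_signal(2, 100, 0.1, rng=np.random.default_rng(0))`, makes the generated signal reproducible.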

Deep Learning performance

Convolutional Neural Networks (CNNs) are the state-of-the-art models for signal processing, so let’s create one. This particular network contains two one-dimensional convolutional layers and two fully connected ones. The code is listed below.

import torch
from torch import nn

class Network(nn.Module):
    
    def __init__(self, signal_size):
        
        # Kernel size scales with the signal length, with a minimum of 3
        c = int(signal_size / 10)
        if c < 3:
            c = 3
        
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 8, c),
            nn.ReLU(),
            nn.AvgPool1d(2),
            nn.Conv1d(8, 16, c),
            nn.ReLU(),
            nn.AvgPool1d(2),
            nn.ReLU(),
            nn.Flatten()
        )
        
        # Infer the flattened feature size by passing a dummy signal through the CNN
        l = 0
        with torch.no_grad():
            s = torch.randn((1, 1, signal_size))
            o = self.cnn(s)
            l = o.shape[1]
        
        self.head = nn.Sequential(
            nn.Linear(l, 2 * l),
            nn.ReLU(),
            nn.Linear(2 * l, 2),
            nn.ReLU(),
            nn.Softmax(dim=1)
        )
        
    def forward(self, x):
        
        x = self.cnn(x)
        x = self.head(x)
        
        return x

CNNs can process the raw signal directly. However, because of their parameter-heavy architecture, they tend to need a lot of data. To begin with, let’s assume we have enough data to train a neural network. I used the signal generators to create a dataset of 200 signals. Each experiment was repeated ten times to reduce the influence of randomness. The code is shown below:

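from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from skorch import NeuralNetClassifier
from skorch.callbacks import LRScheduler
from torch.optim.lr_scheduler import CyclicLR

SAMPLES_PER_SIGNAL = 100
SIGNALS_IN_DATASET = 200  # set to 20 for the small-dataset experiment
NOISE_AMPLITUDE = 0.1
REPEAT_EXPERIMENT = 10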

X, Y = [], []

stop = int(SIGNALS_IN_DATASET / 2)
for i in range(SIGNALS_IN_DATASET):

    if i < stop:
        x = signal0(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 0
    else:
        x = signal1(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 1

    X.append(x.reshape(1,-1))
    Y.append(y)

X = np.concatenate(X)
Y = np.array(Y, dtype=np.int64)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.1)

accs = []
train_accs = []

for i in range(REPEAT_EXPERIMENT):

    net = NeuralNetClassifier(
        lambda: Network(SAMPLES_PER_SIGNAL),
        max_epochs=200,
        criterion=nn.CrossEntropyLoss(),
        lr=0.1,
        callbacks=[
            ('lr_scheduler', LRScheduler(policy=CyclicLR, base_lr=0.0001, max_lr=0.01, step_size_up=10)),
        ],
        verbose=False,
        batch_size=128
    )

    net = net.fit(train_x.reshape(train_x.shape[0], 1, SAMPLES_PER_SIGNAL), train_y)
    pred = net.predict(test_x.reshape(test_x.shape[0], 1, SAMPLES_PER_SIGNAL))
    acc = accuracy_score(test_y, pred)
    
    print(f"{i} - {acc}")


    accs.append(acc)
    
    pred_train = net.predict(train_x.reshape(train_x.shape[0], 1, SAMPLES_PER_SIGNAL))
    train_acc = accuracy_score(train_y, pred_train)
    train_accs.append(train_acc)
    
    print(f"Train Acc: {train_acc}, Test Acc: {acc}")
    
accs = np.array(accs)
train_accs = np.array(train_accs)

print(f"Average acc: {accs.mean()}")
print(f"Average train acc: {train_accs.mean()}")
print(f"Average acc where training was successful: {accs[train_accs > 0.6].mean()}")
print(f"Training success rate: {(train_accs > 0.6).mean()}")

The CNN obtained a test accuracy of 99.2%, which was to be expected from a state-of-the-art model. However, this metric was computed only over the experiment runs in which training was successful. By “successful,” I mean that accuracy on the training dataset exceeded 60%. In this example, weight initialization is make-or-break for training: CNNs are complicated models, and an unfortunate random initialization can sometimes prevent them from learning at all. The training success rate was 70%.
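
One plausible mechanism behind such all-or-nothing runs is the ReLU placed directly before the Softmax in the network head. This is an assumption, not something verified in these experiments. If an unlucky initialization makes both outputs of the last linear layer negative, the ReLU clamps them to zero, the Softmax returns equal probabilities, and small changes to the weights no longer change the output. The NumPy sketch below illustrates the effect with made-up numbers; `negative_logits` and `nudged_logits` are hypothetical values, not taken from a trained model.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


# Hypothetical outputs of the last Linear layer for one input signal
negative_logits = np.array([-1.3, -0.2])
# The same outputs after a small, made-up weight update
nudged_logits = negative_logits + np.array([0.1, -0.1])

p_before = softmax(relu(negative_logits))
p_after = softmax(relu(nudged_logits))

# Both predictions are identical, so the update had no effect on the output
print(p_before, p_after)
```

Because the prediction does not react to the update, the loss gives no signal for improving the weights, which would be consistent with runs that stay near chance-level training accuracy.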

Now, let’s see what happens when the dataset is small. I reduced the number of signals in the dataset to 20. As a result, the CNN obtained 71.4% test accuracy, a drop of 27.8 percentage points. That is not acceptable. So what can we do now? The dataset needs to be larger to use state-of-the-art models, but in industrial applications, acquiring more data is either infeasible or, at the very least, very expensive. Should we drop the project and move on?

No. When the dataset is small, features are your friends.

Feature Engineering

This particular example involves classifying signals based on their frequency, so we can apply the good old Fourier Transform. The Fourier Transform decomposes the signal into a sum of sine waves, each described by its frequency, amplitude, and phase. As a result, we can use the magnitude of each component to examine how much each frequency contributes to the signal. Such a data representation should simplify the task enough for the small dataset to suffice. The Fourier Transform also structures the data so that we can use simpler models, such as the Random Forest classifier.
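
To make this concrete, here is a minimal sketch that computes the magnitude spectrum of a noise-free signal from each class and looks up the strongest frequency bin. It assumes the same sampling grid as the generators above and uses `np.fft.rfft`, which returns only the non-negative frequencies of a real-valued signal; the variable names are illustrative.

```python
import numpy as np

samples_per_signal = 100  # same length as in the experiments
x = np.linspace(0, 4.0, samples_per_signal)

class0 = np.sin(x * np.pi * 0.5)  # one cycle over the window
class1 = np.sin(x * np.pi)        # two cycles over the window

# Magnitude of each frequency component
spectrum0 = np.abs(np.fft.rfft(class0))
spectrum1 = np.abs(np.fft.rfft(class1))

# Index of the dominant frequency bin for each class
peak0 = int(np.argmax(spectrum0))
peak1 = int(np.argmax(spectrum1))
print(peak0, peak1)
```

The dominant bin is 1 for class 0 and 2 for class 1, so in the frequency domain the two classes are separated by the position of a single peak. A circular shift changes only the phase of the components, not their magnitudes, which is why the magnitude spectrum is insensitive to the random phase shift.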

Figure: signals transformed into spectra. On the left is the spectrum of a signal from class 0, and on the right is the spectrum of a signal from class 1. The plots use a logarithmic scale for better visibility; the models in this example received the spectra on a linear scale.

The code for transforming the signal and training the Random Forest classifier is shown below:

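# Assumption: the original import of fft is not shown; scipy.fft.fft would work the same way
from numpy.fft import fft
from sklearn.ensemble import RandomForestClassifier

X, Y = [], []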

stop = int(SIGNALS_IN_DATASET / 2)
for i in range(SIGNALS_IN_DATASET):

    if i < stop:
        x = signal0(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 0
    else:
        x = signal1(SAMPLES_PER_SIGNAL, NOISE_AMPLITUDE)
        y = 1

    # Transforming signal into spectrum: magnitude of the FFT of the first half of the samples
    x = np.abs(fft(x[:int(SAMPLES_PER_SIGNAL / 2)]))
    
    X.append(x.reshape(1,-1))
    Y.append(y)

X = np.concatenate(X)
Y = np.array(Y, dtype=np.int64)

train_x, test_x, train_y, test_y = train_test_split(X, Y, test_size=0.1)

accs = []
train_accs = []

for i in range(REPEAT_EXPERIMENT):
    model = RandomForestClassifier()
    model.fit(train_x, train_y)
    
    pred = model.predict(test_x)
    acc = accuracy_score(test_y, pred)
    
    print(f"{i} - {acc}")


    accs.append(acc)
    
    pred_train = model.predict(train_x)
    train_acc = accuracy_score(train_y, pred_train)
    train_accs.append(train_acc)
    
    print(f"Train Acc: {train_acc}, Test Acc: {acc}")
    
accs = np.array(accs)
train_accs = np.array(train_accs)

print(f"Average acc: {accs.mean()}")
print(f"Average train acc: {train_accs.mean()}")
print(f"Average acc where training was successful: {accs[train_accs > 0.6].mean()}")
print(f"Training success rate: {(train_accs > 0.6).mean()}")

The Random Forest classifier achieved 100% test accuracy on both the 20-signal and the 200-signal datasets, and the training success rate was also 100% for each. As a result, we obtained even better results than with the CNN while requiring less data, all thanks to feature engineering.

Risk of overfitting

Although feature engineering is a powerful tool, one must also remember to remove unnecessary features from the input data. The more features the input vectors contain, the higher the chance of overfitting, especially in small datasets. Each unnecessary feature risks introducing random fluctuations that the machine learning model may mistake for important patterns. The less data in the dataset, the higher the risk that random fluctuations create a correlation that doesn’t exist in the real world.
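
The effect is easy to reproduce with synthetic data. The sketch below is an illustration under stated assumptions, not part of the original experiments: it generates 1,000 features of pure noise and measures how strongly the best of them correlates with balanced binary labels, once for 20 samples and once for 2,000. The function name `max_spurious_correlation` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def max_spurious_correlation(n_samples, n_features=1000):
    """Largest |correlation| between balanced binary labels and pure-noise features."""
    labels = np.repeat([0.0, 1.0], n_samples // 2)
    noise = rng.standard_normal((n_samples, n_features))
    # Center both sides, then compute the Pearson correlation per feature
    labels_c = labels - labels.mean()
    noise_c = noise - noise.mean(axis=0)
    corr = (labels_c @ noise_c) / (
        np.linalg.norm(labels_c) * np.linalg.norm(noise_c, axis=0)
    )
    return float(np.abs(corr).max())


small = max_spurious_correlation(20)
large = max_spurious_correlation(2000)
print(f"20 samples: {small:.2f}, 2000 samples: {large:.2f}")
```

With only 20 samples, at least one of the noise features ends up noticeably correlated with the labels purely by chance, while with 2,000 samples the strongest correlation stays close to zero.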

One mechanism that may help prune overly large feature sets is a search heuristic such as the genetic algorithm. Feature pruning can be expressed as the task of finding the smallest number of features that still allows successful training of the machine-learning model. A candidate solution can be encoded as a binary vector whose length equals the number of features: a “0” means that the feature is removed from the dataset, and a “1” means that it is kept. The fitness function of such a vector is then the sum of the machine-learning model’s accuracy on the pruned dataset and the count of zeros in the vector multiplied by a suitably small weight.
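
The encoding and fitness function described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the names `fitness` and `prune_features`, the weight of 0.01, the use of 3-fold cross-validated accuracy, and the simple selection, one-point crossover, and bit-flip mutation scheme are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def fitness(mask, X, y, zero_weight=0.01):
    """Accuracy on the kept features plus a small reward per removed feature."""
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return 0.0  # no features left, nothing to train on
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    accuracy = cross_val_score(model, X[:, mask], y, cv=3).mean()
    return accuracy + zero_weight * np.count_nonzero(~mask)


def prune_features(X, y, population_size=8, generations=5,
                   mutation_rate=0.1, seed=0):
    """Search for a binary feature mask with a very small genetic algorithm."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Each row is one candidate: 1 keeps the feature, 0 removes it
    population = rng.integers(0, 2, size=(population_size, n_features))
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in population])
        # Selection: the better half survives unchanged and acts as parents
        order = np.argsort(scores)[::-1]
        parents = population[order[: population_size // 2]]
        children = []
        while len(children) < population_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            # One-point crossover followed by bit-flip mutation
            cut = rng.integers(1, n_features)
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate
            children.append(np.where(flip, 1 - child, child))
        population = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m, X, y) for m in population])
    return population[np.argmax(scores)].astype(bool)
```

Calling `mask = prune_features(train_x, train_y)` returns a boolean vector, and `train_x[:, mask]` is the pruned dataset. The weight on the zero count has to stay small enough that removing a feature never outweighs a real loss of accuracy.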

This is only one of many ways to remove unnecessary features, but it is quite powerful.

Conclusion

Although the example presented is relatively simple, it illustrates typical problems with applying Artificial Intelligence systems in industry. Currently, Deep Neural Networks can do almost everything we desire, provided there is enough data. However, data is usually scarce and expensive. So, industrial applications of Artificial Intelligence usually involve extensive feature engineering to simplify the problem and, as a result, reduce the amount of data needed to train the model.

Thanks for reading. The code for this example is available at: https://github.com/aimagefrombydgoszcz/Notebooks/blob/main/when_dataset_is_small_features_are_your_friend.ipynb

All images unless otherwise noted are by the author.

Written by Krzysztof Pałczyński