Random Selection
This method is the easiest. All we have to do is randomly split the data into train and validation sets.

Let’s say we want to assign 80% of our data to the training set and the rest to the validation set.

The third command randomly generates 22 numbers whose maximum value is the number of rows in the dataset (32). We then apply “train_index” to the dataset to create the training and validation sets.
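The kind of split described above can be sketched as follows. This is only an illustrative Python version using the standard library: the toy 32-row `data` list, the `train_fraction` value, and the seed are assumptions made for the example. The number of sampled indices follows from the fraction, so with 80% of 32 rows this sketch draws 25 of them; a different fraction gives a different count.

```python
import random

# Toy stand-in for a 32-row dataset (an assumption, not the article's data).
data = [{"row": i, "value": i * 1.5} for i in range(32)]

train_fraction = 0.8                        # assumed from the 80% target
n_train = int(train_fraction * len(data))   # 25 when there are 32 rows

rng = random.Random(42)                     # fixed seed so the split is repeatable
# Sample row positions without replacement.
train_index = set(rng.sample(range(len(data)), n_train))

# Rows whose position was sampled go to training, the rest to validation.
train_set = [row for i, row in enumerate(data) if i in train_index]
validation_set = [row for i, row in enumerate(data) if i not in train_index]

print(len(train_set), len(validation_set))  # 25 7
```

Sampling positions rather than rows keeps the two sets disjoint by construction: every row lands in exactly one of them.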

Next, we train the models on the train_set and see how well they perform on the validation_set.
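As a rough illustration of that train-then-validate step, here is a small Python sketch. Everything in it is hypothetical: the synthetic data, the two toy models (`fit_mean` and `fit_line`), and the choice of RMSE as the score are assumptions for the example, not the models used in this article.

```python
import math
import random

rng = random.Random(0)
# Hypothetical data: y is roughly linear in x, plus noise.
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1)) for x in range(32)]

# Random split: 25 rows for training, the remaining 7 for validation.
train_index = set(rng.sample(range(len(data)), 25))
train_set = [p for i, p in enumerate(data) if i in train_index]
validation_set = [p for i, p in enumerate(data) if i not in train_index]

def fit_mean(points):
    """Baseline model: always predict the mean of y in the training set."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y

def fit_line(points):
    """Least-squares line y = slope * x + intercept."""
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return lambda x: slope * x + (mean_y - slope * mean_x)

def rmse(model, points):
    """Root mean squared error of a model on a set of (x, y) points."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in points) / len(points))

# Fit on the training set only, then score on the held-out validation set.
models = {"mean": fit_mean(train_set), "line": fit_line(train_set)}
scores = {name: rmse(model, validation_set) for name, model in models.items()}
print(scores)
```

The important detail is that the models never see the validation rows during fitting, so the scores reflect performance on unseen data.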

The advantage of this method is ease of use: it takes only five lines of code to create a training set and a test set. But simplicity comes with an issue. Since we split the data randomly, how do we know whether the training set represents the population well?

The code above will create four different training sets. Let’s visualize them to see whether they are the same.

They are different.
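The same comparison can also be made numerically. The sketch below is a hypothetical Python version: it draws four training sets from a toy 32-value column using four assumed seeds and prints the mean and standard deviation of each, so the differences between the random splits are visible without a plot.

```python
import random
import statistics

data = [float(v) for v in range(32)]  # hypothetical 32-row column
n_train = 25                          # assumed training-set size
seeds = (1, 2, 3, 4)                  # one seed per training set

training_sets = []
for seed in seeds:
    rng = random.Random(seed)
    # A different seed gives a different random sample of row positions.
    index = sorted(rng.sample(range(len(data)), n_train))
    training_sets.append([data[i] for i in index])

# Summarize each training set; differing summaries mean differing samples.
for seed, train in zip(seeds, training_sets):
    print(seed, round(statistics.mean(train), 2), round(statistics.stdev(train), 2))
```

If the summary statistics vary noticeably between seeds, a model trained on one split may not behave the same way on another, which is exactly the concern raised above.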