The Random Sample Consensus (RANSAC) algorithm performs linear regression by separating the input data into inliers and outliers. It then estimates a model from the inliers alone, leaving it unaffected by the data classified as outliers. The algorithm works as follows:
- Select a random subset of the training data and fit a linear model to it.
- Divide the full dataset into inliers and outliers using a user-provided residual threshold, and calculate the model's score.
- Repeat, either for a maximum number of iterations or until a certain number of inliers is reached (both limits are user-provided). If an iteration's model has more inliers than the best model so far, it becomes the new best model.
A final model is then fitted to the inliers of the best model found by the algorithm.
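The steps above can be sketched as a toy implementation for a 1-D linear model. This is an illustration of the algorithm's structure, not the scikit-learn implementation used below; the function name and defaults are ours:

```python
import numpy as np

def ransac_line(X, y, min_samples=10, residual_threshold=0.1,
                max_trials=100, stop_n_inliers=None, rng=None):
    """Toy RANSAC for fitting a line y = a*x + b (illustration only)."""
    rng = np.random.default_rng(rng)
    best_inliers = None
    for _ in range(max_trials):
        # 1. Fit a line to a random subset of the data.
        idx = rng.choice(len(X), size=min_samples, replace=False)
        coef = np.polyfit(X[idx], y[idx], deg=1)
        # 2. Split all points into inliers/outliers by residual threshold.
        residuals = np.abs(np.polyval(coef, X) - y)
        inliers = residuals < residual_threshold
        # 3. Keep the model with the most inliers seen so far.
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
        # 4. Optionally stop early once enough inliers are found.
        if stop_n_inliers is not None and best_inliers.sum() >= stop_n_inliers:
            break
    # Final model: refit on the inliers of the best model.
    return np.polyfit(X[best_inliers], y[best_inliers], deg=1)
```

Because the final fit uses only the consensus set, a handful of gross outliers has no effect on the returned coefficients.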
Experiment Design
Here we test how the RANSAC algorithm performs on the univariate smooth dataset. We use the scikit-learn implementation of RANSAC for the actual algorithm and skopt.forest_minimize for the hyperparameter optimization.
For RANSAC, we optimize the following hyperparameters:
- min_samples: the minimum number of samples drawn from the original data. We draw uniformly from [10, \lfloor \frac{m}{2} \rfloor], where m is the dataset size.
- residual_threshold: the maximum residual for a data point to be classified as an inlier. We draw from a base-10 log-uniform distribution on the interval [0.0001, 0.1].
- max_trials: the maximum number of iterations for random sample selection, with a default of 100. We draw uniformly from [50, 1000].
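These three hyperparameters map directly onto the scikit-learn estimator. A minimal usage sketch, with illustrative values and synthetic data standing in for the experiment's dataset:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 1))
y = 3.0 * X.ravel() + rng.normal(0, 0.005, size=500)
y[:50] += 2.0  # corrupt 10% of the targets to simulate outliers

# The three hyperparameters tuned in the experiment (values here are illustrative):
model = RANSACRegressor(
    min_samples=10,            # samples drawn per trial
    residual_threshold=0.05,   # max residual for a point to count as an inlier
    max_trials=100,            # random-sampling iterations
    random_state=0,
)
model.fit(X, y)
# estimator_ is the final linear model refit on the consensus set;
# inlier_mask_ marks which points it was fitted on.
print(model.estimator_.coef_[0], model.inlier_mask_.sum())
```

Despite 10% of the targets being shifted far off the line, the recovered slope stays close to the true value of 3.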
We perform the hyperparameter optimization with 50 calls to an objective function that uses 3-fold cross-validation. We run the optimization routine 25 times on newly generated synthetic datasets, with sizes growing by 2,000 from 4,000 to 24,000.
Results
Here we observe the error for models trained on datasets of increasing size:
Because the RANSAC model is linear, it converges to a good fit quickly, even with limited data, although given enough data a non-linear model will eventually outperform any linear one. Here we see an example of the fit on our largest dataset:
Conclusion
During the optimization process, allowing a higher min_samples parameter helped the RANSAC algorithm create more consistent models. RANSAC is a good choice when data is very noisy and limited. As a linear model, it has limitations on non-linear data, but it runs very quickly and handles noise well.