Kenneth Steimel

Student of Computational Linguistics and High Performance Computing


Scikit-Learn Random Forest Parallelism inside of GridSearch

Just wanted to give a heads up: with newer versions of scikit-learn (I believe starting with 0.20), the random forest implementation uses a single job by default, as the documentation indicates.

However, the documentation notes that this holds unless a joblib.parallel_backend context is in use. The problem is that if you set the n_jobs parameter in GridSearchCV to use multiple processes, the number of jobs you specify also becomes the effective n_jobs for the RandomForestClassifier running inside the grid search.
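As a concrete illustration, here is a minimal sketch of the problematic configuration; the dataset and parameter grid are placeholders I made up, not from a real experiment:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# n_jobs is left unset on the forest itself, but per the behavior
# described above, each of the 16 grid-search workers can in turn
# fan out into its own 16-process forest fit.
search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    n_jobs=16,
)
search.fit(X, y)
```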

What this means is that running GridSearchCV with n_jobs=16 will actually start 16^2, or 256, processes. This causes a lot of cache thrashing: because there are far more processes than cores, data sitting in cache is rapidly flushed and then brought back in. Cache thrashing causes a severe hit to performance. A cache hit can be an order of magnitude faster than a cache miss, and overprovisioning this many processes turns the majority of data retrievals into cache misses.

To rectify this, I would explicitly set the n_jobs parameter when creating the RandomForestClassifier and then set the n_jobs parameter for GridSearchCV so that the two multiply to the number of physical cores available on the machine.
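For example, on a hypothetical 16-core machine, one split that satisfies this rule is four grid-search workers times four forest workers; the data and grid below are again placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# 4 grid-search workers x 4 forest workers = 16 processes total,
# matching the 16 physical cores assumed here.
search = GridSearchCV(
    RandomForestClassifier(n_jobs=4),
    param_grid,
    n_jobs=4,
)
search.fit(X, y)
```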

Ideally, I would set n_jobs in RandomForestClassifier to 1 and set the n_jobs parameter for GridSearchCV to the number of cores available. This ensures that the work units assigned to each worker are as large as possible. Since joblib, which provides the parallel backend for scikit-learn, creates separate Python processes to circumvent the GIL, there is considerable overhead to spawning each worker, so the best course of action is to use the longest-running workers possible.
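Here is a sketch of that preferred configuration, assuming os.cpu_count() is an acceptable stand-in for the core count (note that it reports logical rather than physical cores, so on a hyperthreaded machine you may want half that value):

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10]}

# os.cpu_count() returns the logical core count; adjust if your
# machine is hyperthreaded and you want one worker per physical core.
n_cores = os.cpu_count()

search = GridSearchCV(
    RandomForestClassifier(n_jobs=1),  # each fit stays single-process
    param_grid,
    n_jobs=n_cores,  # one long-running worker per core
)
search.fit(X, y)
```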