Just wanted to give a heads-up: with newer versions of scikit-learn (I believe starting with 0.20), the random forest implementation uses 1 job by default, as the documentation indicates.
However, the documentation says that this holds unless a `joblib.parallel_backend` context is being used. The problem is that if you set the `n_jobs` parameter of `GridSearchCV` to use multiple processes, the number of jobs you specify also becomes the effective `n_jobs` of the `RandomForestClassifier` running inside the grid search.
What this means is that running `GridSearchCV` with `n_jobs=16` will actually start 16 × 16 = 256 processes. This causes severe cache thrashing: with so many processes contending for each core, data sitting in cache is rapidly evicted and reloaded. Cache thrashing is a serious performance hit, since a cache hit can be an order of magnitude faster than a cache miss, and oversubscribing the machine this heavily turns the majority of data retrievals into cache misses.
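As a minimal sketch of the oversubscribed configuration described above (the dataset, parameter grid, and the 16-core machine are assumptions for illustration, not from the original):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_jobs is left unset on the forest; as described above, each grid-search
# worker's forest can end up parallelizing as well.
rf = RandomForestClassifier(random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

# 16 grid-search workers, each potentially running a parallelized forest:
# this is the oversubscribed case, up to 16 * 16 = 256 processes.
search = GridSearchCV(rf, param_grid, n_jobs=16, cv=5)
search.fit(X, y)
```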
To rectify this, I would explicitly set the `n_jobs` parameter when creating the `RandomForestClassifier` and then set the `n_jobs` parameter of `GridSearchCV` accordingly, so that the two multiply to the number of physical cores available on the machine.
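A sketch of that balanced setting (the 4 × 4 split and the assumed 16 physical cores are illustrative values):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rf = RandomForestClassifier(n_jobs=4, random_state=0)             # 4 processes per fit
search = GridSearchCV(rf, {"n_estimators": [100, 300]}, n_jobs=4, cv=5)  # 4 concurrent fits
search.fit(X, y)  # 4 * 4 = 16 workers total, matching the assumed 16 physical cores
```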
Ideally, I would set `n_jobs` in `RandomForestClassifier` to 1 and set the `n_jobs` parameter of `GridSearchCV` to the number of cores available. This ensures that the work units assigned to each worker are as large as possible. Since joblib, which provides the parallel backend for scikit-learn, spawns separate Python processes as a way to circumvent the GIL, worker creation carries considerable overhead, so the best course of action is to run as few, and as long-lived, workers as possible.
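And a sketch of that ideal setting (the dataset and grid are again assumptions; note that `os.cpu_count()` reports logical CPUs, so halve it on a hyper-threaded machine if you want to match physical cores):

```python
import os

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pin the forest to a single process so each grid-search worker
# is one long-running job.
rf = RandomForestClassifier(n_jobs=1, random_state=0)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}

# One grid-search worker per CPU reported by the OS.
search = GridSearchCV(rf, param_grid, n_jobs=os.cpu_count(), cv=5)
search.fit(X, y)
```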