Three datasets are used in the benchmark:
1. RCV1 (https://scikit-learn.org/0.18/datasets/rcv1.html)
2. webspam (https://chato.cl/webspam/datasets/uk2007/)
3. news20 (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)

Dataset | Samples | Classes | Features |
---|---|---|---|
RCV1 | 697614 | 2 | 47236 |
webspam | 350000 | 2 | 254 |
news20 | 19928 | 20 | 62061 |