

Data Sets for Machine Learning in Python

(v.) round up, herd, or take charge of (e.g. livestock, cats, or just-downloaded data sets!)

skdata is a library of data sets for machine learning experiments, with modules that

  1. download data sets,
  2. load them as directly as possible as Python data structures, and
  3. provide protocols for machine learning tasks via convenient views.

What data sets does it provide? Browse the list of data sets.


Here’s how skdata helps you evaluate an SVM (e.g. scikit-learn’s LinearSVC) as a classifier for the UCI “Iris” data set:

# Create a suitable view of the Iris data set.
# (For larger data sets, this can trigger a download the first time)
from skdata.iris.view import KfoldClassification
iris_view = KfoldClassification(5)

# Create a learning algorithm based on scikit-learn's LinearSVC
# that will be driven by commands from the `iris_view` object.
from sklearn.svm import LinearSVC
from skdata.base import SklearnClassifier
learning_algo = SklearnClassifier(LinearSVC)

# Drive the learning algorithm from the data set view object.
# (An iterator interface is sometimes also available,
#  so you don't have to give up control flow completely.)
iris_view.protocol(learning_algo)

# The learning algorithm keeps track of what it did when under
# control of the iris_view object. This base example is useful for
# internal testing and demonstration. Use a custom learning algorithm
# to track and save the statistics you need.
for loss_report in learning_algo.results['loss']:
    print(loss_report['task_name'] +
          (": err = %0.3f" % (loss_report['err_rate'])))
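The control flow in this example, where a view object drives a learning algorithm and the algorithm records its own results, can be illustrated without skdata installed. The following is a minimal sketch under stated assumptions: `ToyKfoldView`, `MajorityClassLearner`, and their `best_model` and `loss` methods are hypothetical names chosen for illustration, not skdata's actual interface, and the labels are made-up toy data.

```python
from collections import Counter


class MajorityClassLearner:
    """Hypothetical learning algorithm: predicts the most common training label."""

    def __init__(self):
        # The algorithm keeps its own record of what it was asked to do.
        self.results = {"loss": []}

    def best_model(self, train_labels):
        # "Training" is just finding the most frequent label.
        return Counter(train_labels).most_common(1)[0][0]

    def loss(self, model, test_labels, task_name):
        # Fraction of test labels that differ from the predicted label.
        errors = sum(1 for y in test_labels if y != model)
        err_rate = errors / len(test_labels)
        self.results["loss"].append({"task_name": task_name, "err_rate": err_rate})
        return err_rate


class ToyKfoldView:
    """Hypothetical view: splits labels into k folds and drives a learner."""

    def __init__(self, labels, k):
        self.labels = list(labels)
        self.k = k

    def protocol(self, algo):
        n = len(self.labels)
        for fold in range(self.k):
            # Every k-th example, starting at `fold`, is held out for testing.
            test_idx = set(range(fold, n, self.k))
            train = [y for i, y in enumerate(self.labels) if i not in test_idx]
            test = [y for i, y in enumerate(self.labels) if i in test_idx]
            model = algo.best_model(train)
            algo.loss(model, test, task_name="fold_%d" % fold)


# Made-up toy labels, for illustration only.
labels = ["a", "a", "a", "b", "a", "a", "b", "a", "a", "a"]
view = ToyKfoldView(labels, k=5)
learner = MajorityClassLearner()
view.protocol(learner)

for report in learner.results["loss"]:
    print("%s: err = %0.3f" % (report["task_name"], report["err_rate"]))
```

The point of the design is that the view decides which data the learner sees and when, while the learner decides what to record. Swapping in a learner that logs timing or per-class errors would need no change to the view.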

Note that you can also use the skdata.iris.dataset module to get raw un-standardized access to the Iris data set via Python objects. This is an skdata convention: dataset submodules give raw access, and view submodules implement standardized views and protocols.
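The dataset/view split can be made concrete with a small sketch that does not use skdata itself. `RawFlowerDataset` and `ClassificationView` are hypothetical stand-ins, and the four rows are illustrative values only: the first class exposes records close to how they might appear in a source file, and the second turns them into a standardized feature matrix and integer label vector.

```python
class RawFlowerDataset:
    """Hypothetical 'dataset' layer: records exposed close to their source form."""

    def __init__(self):
        # Values stay as strings under their original field names,
        # as they might when read straight from a CSV file.
        self.meta = [
            {"sepal_length": "5.1", "sepal_width": "3.5", "name": "setosa"},
            {"sepal_length": "7.0", "sepal_width": "3.2", "name": "versicolor"},
            {"sepal_length": "6.3", "sepal_width": "3.3", "name": "virginica"},
            {"sepal_length": "4.9", "sepal_width": "3.0", "name": "setosa"},
        ]


class ClassificationView:
    """Hypothetical 'view' layer: standardized features and integer labels."""

    feature_names = ("sepal_length", "sepal_width")

    def __init__(self, dataset):
        # Map each distinct class name to a stable integer index.
        self.label_names = sorted({row["name"] for row in dataset.meta})
        # Convert raw string fields to floats in a fixed column order.
        self.X = [[float(row[f]) for f in self.feature_names]
                  for row in dataset.meta]
        self.y = [self.label_names.index(row["name"]) for row in dataset.meta]


view = ClassificationView(RawFlowerDataset())
print(view.X[0], view.y, view.label_names)
```

Keeping the raw layer separate means code that needs the original fields can still reach them, while experiments that only need features and labels use the view.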


The recommended installation method is via PyPI, with either pip install skdata or easy_install skdata (you probably want to use pip if you have it).

If you want to stay up to date with the development tip then use git:

git clone <skdata repository URL> \
&& ( cd skdata && python setup.py develop )


Documentation is maintained on the skdata wiki.