How to Import Datasets in Python using the sklearn Module

In this article, we show how to import datasets in Python using the sklearn module.

Many Python modules ship with built-in datasets.

We can practice on these datasets without having to create our own data.

The sklearn module has several datasets that we can use.

In the example below, we import the diabetes dataset from the sklearn module.

>>> from sklearn import datasets
>>> diabetes = datasets.load_diabetes()
>>> diabetes.keys()
dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
>>> print(diabetes['DESCR'])
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)

So you can see that we have imported data related to diabetes.
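The keys listed by diabetes.keys() can be used to pull out the individual parts of the dataset. The following is a minimal sketch of that; the variable names X, y, and names are our own choices for illustration, not part of sklearn.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# The returned object can be indexed like a dictionary using the keys shown by keys().
X = diabetes["data"]               # feature matrix, one row per patient
y = diabetes["target"]             # disease progression one year after baseline
names = diabetes["feature_names"]  # short column labels such as 'age' and 'bmi'

print(X.shape)  # (442, 10)
print(y.shape)  # (442,)
print(names)
```

The features and the target are stored as separate arrays, which is the layout most sklearn estimators expect.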

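The dataset records a quantitative measure of disease progression one year after baseline. Ten variables are used to predict this outcome: age, sex, body mass index (bmi), average blood pressure (bp), and six blood serum measurements (s1 through s6). The labels printed for s1, s4, and s5 come from an older scikit-learn release; newer releases describe them as total serum cholesterol, total cholesterol / HDL, and possibly the log of serum triglycerides level, respectively.

The target, described as column 11, is that measure of disease progression. It is stored separately under the 'target' key rather than as an eleventh column of 'data'.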

There are 442 instances of data within this dataset.
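The instance count and the scaling note in the description can be checked directly. The sketch below assumes NumPy is installed (sklearn depends on it) and uses the return_X_y=True option of load_diabetes; col_means and col_sumsq are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_diabetes

# return_X_y=True returns the feature matrix and target directly as a tuple.
X, y = load_diabetes(return_X_y=True)

col_means = X.mean(axis=0)        # should be close to 0 for every column
col_sumsq = (X ** 2).sum(axis=0)  # should be close to 1 for every column

print(len(X))                     # number of instances
print(np.round(col_means, 6))
print(np.round(col_sumsq, 6))
```

If the description holds, every column mean is approximately 0 and every column sum of squares is approximately 1.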

So this is one example.

Below are more examples of datasets from the sklearn module.

Datasets from sklearn module
load_boston	Load and return the boston house-prices dataset
load_iris	Load and return the iris dataset
load_digits	Load and return the digits dataset
load_linnerud	Load and return the physical exercise linnerud dataset
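
The other loaders in this table are called the same way as load_diabetes. As a rough sketch, the loop below loads three of them and prints the shape of each dataset; the loaders and shapes dictionaries are our own helper names. load_boston is left out here because it is not available in every scikit-learn release.

```python
from sklearn import datasets

# Map a readable name to each loader function from the table.
loaders = {
    "iris": datasets.load_iris,
    "digits": datasets.load_digits,
    "linnerud": datasets.load_linnerud,
}

shapes = {}
for name, loader in loaders.items():
    bunch = loader()  # each loader returns a dictionary-like object
    shapes[name] = (bunch.data.shape, bunch.target.shape)
    print(name, bunch.data.shape, bunch.target.shape)
```

Note that the linnerud dataset has a two-dimensional target, since it predicts three physiological measurements at once.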

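Below is the code for another dataset, the Boston house-prices dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in version 1.2, so this example only runs on older releases.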

>>> from sklearn import datasets
>>> houseprices = datasets.load_boston()
>>> houseprices.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
>>> print(houseprices['DESCR'])
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

You can see that the Boston house-prices dataset has 506 instances.

The dataset has 13 attributes (variables), and the target is the median value of owner-occupied homes, in thousands of dollars.
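Because load_boston is not present in every scikit-learn release, code that relies on it can guard the import. The following is only a sketch: load_boston_if_available is a hypothetical helper name of our own, and it returns None when the loader is missing instead of substituting another dataset.

```python
import warnings


def load_boston_if_available():
    """Return the Boston dataset, or None if this sklearn release lacks load_boston."""
    try:
        from sklearn.datasets import load_boston
    except ImportError:
        # The loader is not provided by the installed scikit-learn release.
        return None
    with warnings.catch_warnings():
        # Some releases that still ship the loader emit a deprecation FutureWarning.
        warnings.simplefilter("ignore", FutureWarning)
        return load_boston()


houseprices = load_boston_if_available()
if houseprices is None:
    print("load_boston is not available in this scikit-learn release")
else:
    print(houseprices.data.shape)
```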

These are example datasets that you can use when you want to work with data, such as when testing different machine learning algorithms to see which one predicts most effectively.
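
As one illustration of testing different algorithms, the sketch below compares two regression models on the diabetes dataset using 5-fold cross-validation. The choice of LinearRegression and Ridge, the alpha value, and the R^2 scoring are our own assumptions for the example, not a recommendation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Two candidate models to compare on the same data.
models = {
    "linear regression": LinearRegression(),
    "ridge (alpha=1.0)": Ridge(alpha=1.0),
}

scores = {}
for name, model in models.items():
    # Mean R^2 across 5 cross-validation folds; higher is better.
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {scores[name]:.3f}")
```

The same loop works for any other sklearn regressor added to the models dictionary.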

And this is how to import datasets in Python using the sklearn module.

