How to Retrieve a Subset of a Pandas DataFrame Object in Python

In this article, we show how to retrieve a subset of a pandas DataFrame object in Python.

A DataFrame object is made up of a number of pandas Series objects.

A pandas Series is a labeled, one-dimensional array of data.

A dataframe object is most similar to a table. It is composed of rows and columns.

A subset is the specific row and column, or the specific rows and columns, of a pandas DataFrame object that you want returned.

In other words, you don't want the whole DataFrame object returned, only a specific data point or certain specific data points.

Below we go through a few examples in the code.

>>> import pandas as pd
>>> from numpy.random import randn
>>> dataframe1= pd.DataFrame(randn(4,3),['A','B','C','D',],['X','Y','Z'])
>>> dataframe1
          X         Y         Z
A -1.229967 -1.807953 -0.023637
B -1.167999  0.830829  0.051181
C -1.404166  0.421598  1.555297
D -0.081069 -1.172590 -0.146508
>>> #returns rows 'A' and 'B', columns 'Y' and 'Z'
>>> dataframe1.loc[['A','B'],['Y','Z']]
          Y         Z
A -1.807953 -0.023637
B  0.830829  0.051181
>>> #returns row 'A', column 'X'
>>> dataframe1.loc[['A'],['X']]
          X
A -1.229967
>>> #returns rows 'B' and 'C', column 'Y'
>>> dataframe1.loc[['B','C'], ['Y']]
          Y
B  0.830829
C  0.421598

Let's now go over the code.

We first have to import the pandas module. We do this with the line, import pandas as pd.

The as pd part means that we can reference the pandas module as pd instead of writing out the full name pandas each time.

We import randn from numpy.random, so that we can populate the DataFrame with random values. In other words, we won't need to manually create the values in the table. The randn function fills it with random values drawn from the standard normal distribution, so the numbers you see will differ on every run.

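We create a variable, dataframe1, which we set equal to, pd.DataFrame(randn(4,3),['A','B','C','D',],['X','Y','Z']). The first argument is the data, the second is the list of row labels (the index), and the third is the list of column labels.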

This creates a DataFrame object with 4 rows and 3 columns.

The rows are 'A', 'B', 'C', and 'D'.

The columns are 'X', 'Y', and 'Z'.

After we output the dataframe1 object, we get the DataFrame object with all the rows and columns, which you can see above.
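The construction step can also be written with keyword arguments, which makes the role of each list explicit. The following is a minimal sketch, not the code used above: it assumes NumPy's default_rng generator with an arbitrary seed of 0 in place of randn, so that the values repeat from run to run.

```python
import numpy as np
import pandas as pd

# Seeded generator so the "random" values are the same on every run.
# The seed 0 is an arbitrary choice made for this sketch.
rng = np.random.default_rng(0)

# index= names the rows, columns= names the columns.
dataframe1 = pd.DataFrame(
    rng.standard_normal((4, 3)),
    index=['A', 'B', 'C', 'D'],
    columns=['X', 'Y', 'Z'],
)

print(dataframe1.shape)          # (4, 3)
print(list(dataframe1.index))    # ['A', 'B', 'C', 'D']
print(list(dataframe1.columns))  # ['X', 'Y', 'Z']
```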

We then obtain subsets from the pandas dataframe object.

We obtain subsets of a dataframe object through the loc indexer.

Although it is often called the loc function, loc is not called with parentheses. It is an attribute of the dataframe that is indexed with square brackets, and it retrieves the contents of a dataframe according to label-based locations.

So using loc, we reference the labels of rows and columns to retrieve the contents of certain rows and columns.

So, in the first example, we retrieve rows 'A' and 'B' and columns 'Y' and 'Z'.

We do this through the statement, dataframe1.loc[['A','B'],['Y','Z']]

This obtains that subset.

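In the next example, we obtain one specific data point, the one located in row A, column X. Because the row and column labels are passed as lists, the result is a DataFrame with one row and one column rather than a bare number.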

We obtain this data point through the statement, dataframe1.loc[['A'],['X']]
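The difference between passing lists and passing bare labels to loc can be seen in a short sketch. It assumes a stand-in frame named df filled with the integers 0 through 11 (not the random frame from the example), so the result is easy to check.

```python
import numpy as np
import pandas as pd

# Stand-in data for this sketch: 0..11 laid out in 4 rows and 3 columns.
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=['A', 'B', 'C', 'D'],
    columns=['X', 'Y', 'Z'],
)

as_frame = df.loc[['A'], ['X']]  # lists in, 1x1 DataFrame out
as_value = df.loc['A', 'X']      # bare labels in, single value out

print(type(as_frame).__name__)  # DataFrame
print(as_frame.shape)           # (1, 1)
print(as_value)                 # 0
```

Use the list form when later code expects a DataFrame, and the bare-label form when you only need the number itself.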

In the third example, we obtain the data points of rows B and C, column Y.

This subset is obtained using the following statement, dataframe1.loc[['B','C'], ['Y']]
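Since randn produces different values on every run, the three selections are easier to verify on fixed data. Here is a minimal sketch, assuming a stand-in frame named df that holds the integers 0 through 11 in place of the random values.

```python
import numpy as np
import pandas as pd

# Stand-in data: row A is 0,1,2; row B is 3,4,5; row C is 6,7,8; row D is 9,10,11.
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=['A', 'B', 'C', 'D'],
    columns=['X', 'Y', 'Z'],
)

first = df.loc[['A', 'B'], ['Y', 'Z']]  # rows A and B, columns Y and Z
second = df.loc[['A'], ['X']]           # row A, column X
third = df.loc[['B', 'C'], ['Y']]       # rows B and C, column Y

print(first.values.tolist())   # [[1, 2], [4, 5]]
print(second.values.tolist())  # [[0]]
print(third.values.tolist())   # [[4], [7]]
```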

We can also obtain subsets from a pandas dataframe object in Python using index-based locations with the iloc indexer. Like loc, it is indexed with square brackets rather than called with parentheses.

Instead of using labels to reference rows and columns, we use their integer positions, which start at 0.

In our dataframe, row A is at an index of 0, row B is at an index of 1, row C is at an index of 2, and row D is at an index of 3.

Column X is at an index of 0, column Y is at an index of 1, and column Z is at an index of 2.
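The positions listed above can also be looked up instead of counted by hand. This is a small sketch, assuming a placeholder frame filled with zeros that has the same labels. It relies on the get_loc method of a pandas Index, which returns the integer position of a label when the labels are unique.

```python
import pandas as pd

# Placeholder frame for this sketch; only the labels matter here.
df = pd.DataFrame(0, index=['A', 'B', 'C', 'D'], columns=['X', 'Y', 'Z'])

# Map every row label and column label to its integer position.
row_positions = {label: df.index.get_loc(label) for label in df.index}
col_positions = {label: df.columns.get_loc(label) for label in df.columns}

print(row_positions)  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(col_positions)  # {'X': 0, 'Y': 1, 'Z': 2}
```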

So let's get the same subsets as the code above, now using only index-based locations with the iloc indexer.

>>> import pandas as pd
>>> from numpy.random import randn
>>> dataframe1= pd.DataFrame(randn(4,3),['A','B','C','D',],['X','Y','Z'])
>>> dataframe1
          X         Y         Z
A -1.229967 -1.807953 -0.023637
B -1.167999  0.830829  0.051181
C -1.404166  0.421598  1.555297
D -0.081069 -1.172590 -0.146508
>>> #returns rows 'A' and 'B', columns 'Y' and 'Z'
>>> dataframe1.iloc[[0,1],[1,2]]
          Y         Z
A -1.807953 -0.023637
B  0.830829  0.051181
>>> #returns row 'A', column 'X'
>>> dataframe1.iloc[[0],[0]]
          X
A -1.229967
>>> #returns rows 'B' and 'C', column 'Y'
>>> dataframe1.iloc[[1,2], [1]]
          Y
B  0.830829
C  0.421598

So instead of using labels to reference the rows and columns of a dataframe object, we now use the index locations.

Thus, if a dataframe was created without explicit labels, so that pandas assigned the default integer labels 0, 1, 2, and so on, you can still reference subsets using the index locations of data points.
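As an illustration of that case, the sketch below builds a frame without passing any labels. The name unlabeled and the integer data are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# No row or column labels are given, so pandas numbers both axes 0, 1, 2, ...
unlabeled = pd.DataFrame(np.arange(12).reshape(4, 3))

print(list(unlabeled.index))    # [0, 1, 2, 3]
print(list(unlabeled.columns))  # [0, 1, 2]

# Rows at positions 1 and 2, column at position 1.
subset = unlabeled.iloc[[1, 2], [1]]
print(subset.values.tolist())   # [[4], [7]]
```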

And even if a dataframe object has labels, you can still reference subsets in an alternative way, using index-based locations.
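To confirm that the two approaches return the same subset, the results can be compared directly. This is a minimal sketch, assuming a stand-in frame named df with fixed integer data. DataFrame.equals returns True when two frames have the same shape, labels, and values.

```python
import numpy as np
import pandas as pd

# Stand-in data for this sketch, with the same labels as the example above.
df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=['A', 'B', 'C', 'D'],
    columns=['X', 'Y', 'Z'],
)

by_label = df.loc[['A', 'B'], ['Y', 'Z']]  # label-based selection
by_position = df.iloc[[0, 1], [1, 2]]      # position-based selection

print(by_label.equals(by_position))  # True
```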

And this is how we can retrieve subsets of a pandas dataframe object in Python.
