Select subset of .csv data using pandas

Goal

This is a short post for people who want to quickly select data from a .csv data file: maybe just certain columns, maybe just certain rows. I assume you know how to run the Python interpreter, how the .csv format is defined, and how to install the pandas package.

Data

We have a .csv data file, data.csv, containing the following:
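As a stand-in for the file contents, here is a minimal sketch that writes a small data.csv to the current directory. The column names (id, name, age, city) and all values are hypothetical, invented only so that the later steps have something to work on.

```python
from pathlib import Path

# Hypothetical sample data (an assumption for illustration only).
SAMPLE_CSV = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""

# Write the sample to data.csv in the current working directory.
data_path = Path("data.csv")
data_path.write_text(SAMPLE_CSV, encoding="utf-8")
```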

Load it

To use pandas, we first need to import the package.

One of the basic data structures in the pandas library is the DataFrame. A DataFrame typically represents two-dimensional data like the table above.

To load the CSV file, we call the pandas function read_csv, which returns a DataFrame object that we store in the variable df.
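A minimal sketch of the import and the load step. To keep the example self-contained, it reads from an in-memory string holding hypothetical sample data (the column names and values are my assumption); with a real file you would pass the path instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""

# read_csv accepts a file path or any file-like object.
# With a real file on disk: df = pd.read_csv("data.csv")
df = pd.read_csv(io.StringIO(CSV_TEXT))
```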

View it
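A quick way to look at the loaded data is to print the dataframe, or only its first rows with head(). The sketch below uses hypothetical sample data (an assumption, not the original file).

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Print the first two rows only; print(df) would show the whole table.
first_rows = df.head(2)
print(first_rows)

# Number of rows and columns as a tuple.
print(df.shape)  # (4, 4)
```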

Outputs

Process it

Now there are many things you may want to do with your dataframe. In this post, I only want to show a very simple scenario: you select some of the data and generate a new .csv file containing that subset.

Selecting columns

Select columns by column name. This way you can also change the order of the columns in the dataframe.
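A small sketch of selecting by name, using hypothetical sample data. Passing a list of column names returns a new dataframe with exactly those columns, in the order given.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Keep only two columns; note that "city" now comes before "name".
by_name = df[["city", "name"]]
print(by_name)
```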

Select columns by integer index. A DataFrame object has two useful array-like properties, .columns and .index.

Therefore, we can use integer indexes to pick the column names, which brings us back to the previous case of selecting by column name.
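The idea can be sketched like this, again on hypothetical sample data: index into df.columns with integer positions to get the names, then select by those names.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Column labels and row labels of the dataframe.
print(df.columns)
print(df.index)

# Take the names of the 1st and 3rd column (positions 0 and 2) ...
selected_names = df.columns[[0, 2]]

# ... and select by name as before.
by_position = df[selected_names]
print(by_position)
```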

Reverse order of columns
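One way to do this, sketched on hypothetical sample data, is to reverse df.columns with a slice and select by the reversed names.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# [::-1] walks the column labels from last to first.
reversed_cols = df[df.columns[::-1]]
print(reversed_cols)
```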

Selecting rows

Select rows from the i-th (inclusive) to the j-th (exclusive) row.
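A sketch with i = 1 and j = 3 on hypothetical sample data. Slicing the dataframe directly and slicing through iloc both select rows by position here.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Rows at positions 1 and 2 (position 3 is excluded).
row_slice = df[1:3]

# The same selection written explicitly with iloc.
row_slice_iloc = df.iloc[1:3]
print(row_slice)
```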

Selecting specific rows using a list of row numbers
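A sketch using iloc with a list of integer positions, on hypothetical sample data.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Pick the 1st and 3rd row (positions 0 and 2).
picked_rows = df.iloc[[0, 2]]
print(picked_rows)
```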

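There are three indexers for row selection, iloc, loc and ix, and it's important to know which one to use when. Note that ix was deprecated and then removed in pandas 1.0, so current code should use iloc (integer positions) or loc (labels). You can read about it here[LINK].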

Selecting rows and cols at once

If you want to select certain columns and certain rows, you can either combine the operations above or use the loc and iloc indexers again.

The first argument to iloc selects the rows; here I want the 1st, the 2nd and the 1st row again. The second argument gives the positions of the columns I wish to select.
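A sketch of such a call on hypothetical sample data. The row list repeats position 0 on purpose; the choice of columns (positions 0 and 2) is my assumption for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Rows: 1st, 2nd and 1st again. Columns: 1st and 3rd.
rows_and_cols = df.iloc[[0, 1, 0], [0, 2]]
print(rows_and_cols)
```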

Beware that with iloc you always specify rows and columns by their integer position. If you need to select data by column name and by row index label (the identifier of the row), you need to use loc. Read more here[LINK].
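To illustrate label-based selection, the sketch below uses hypothetical sample data and turns the name column into the row index (my choice, purely for the example), so that rows can be addressed by label.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Use the "name" column as row labels.
people = df.set_index("name")

# loc takes row labels first, then column names.
by_label = people.loc[["Bob", "Dan"], ["city", "age"]]
print(by_label)
```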

Save your data

Once you have selected the data you are interested in, you can simply save your dataframe back to .csv or to one of the many other formats that pandas supports.
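A sketch of the save step with to_csv, on hypothetical sample data. The output file name subset.csv is an assumption. Passing index=False leaves out the row index, which is usually what you want when the index is just the default 0, 1, 2, ...

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of data.csv.
CSV_TEXT = """id,name,age,city
1,Alice,30,Prague
2,Bob,25,Brno
3,Carol,41,Ostrava
4,Dan,35,Plzen
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Select two columns and the first two rows.
subset = df[["name", "city"]].iloc[0:2]

# Without a path, to_csv returns the CSV text as a string.
csv_out = subset.to_csv(index=False)
print(csv_out)

# To write a file instead (hypothetical file name):
# subset.to_csv("subset.csv", index=False)
```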
