Select subset of .csv data using pandas

Goal

This is short post for people who wants to quickly select data from .csv datafile. Maybe just certain columns, maybe just certain rows. I assume you know how to run python interpreter, how .csv format is defined and how to install pandas library package.

Data

We have .csv datafile data.csv containing following :

acq_time,longitude,latitude,p10,p20
	0.01,     -12.00,     -33.00,    5023.26,    1004.65
	0.01,     -13.00,     -28.00,    5023.26,    2511.63
	0.01,     -14.00,     -24.00,    8539.54,    2009.30
	0.02,     -15.00,     -21.00,    2009.30,     502.33
	0.02,     -16.00,     -15.00,    2009.30,     502.33

acq_time,longitude,latitude,p10,p20

0.01, -12.00, -33.00, 5023.26, 1004.65

0.01, -13.00, -28.00, 5023.26, 2511.63

0.01, -14.00, -24.00, 8539.54, 2009.30

0.02, -15.00, -21.00, 2009.30, 502.33

0.02, -16.00, -15.00, 2009.30, 502.33

Load it

To use pandas, first we need to import the package

import pandas as pd

1	import pandas as pd

One of the basic datastructures in pandas library is dataframe. Dataframe typically represents 2 dimensional data like above.

To load CSV we call method read_csv of dataframe object which returns pandas dataframe object stored in variable df.

df = pd.read_csv("my_data.csv")

1	df = pd.read_csv("my_data.csv")

View it

print df

print df

Outputs

   acq_time    longitude     latitude          p10          p20
0   1.0e-02          -12          -33       5023.3       1004.6
1   1.0e-02          -13          -28       5023.3       2511.6
2   1.0e-02          -14          -24       8539.5       2009.3
3   2.0e-02          -15          -21       2009.3        502.3
4   2.0e-02          -16          -15       2009.3        502.3

acq_time longitude latitude p10 p20

0 1.0e-02 -12 -33 5023.3 1004.6

1 1.0e-02 -13 -28 5023.3 2511.6

2 1.0e-02 -14 -24 8539.5 2009.3

3 2.0e-02 -15 -21 2009.3 502.3

4 2.0e-02 -16 -15 2009.3 502.3

Process it

Now there is many things you may want to do with your dataframe. In this post, I only want to show very simple scenario that you want to select some of the data and generate new .csv file containing this data subset.

Selecting columns

Select columns by column name – This way you can also change order of the columns in dataframe

ds = df[['p10','p20','longitude','latitude']]
print ds

1 2	ds = df[['p10','p20','longitude','latitude']] print ds

      p10     p20  longitude  latitude
0  5023.3  1004.6        -12       -33
1  5023.3  2511.6        -13       -28
2  8539.5  2009.3        -14       -24
3  2009.3   502.3        -15       -21
4  2009.3   502.3        -16       -15

p10 p20 longitude latitude

0 5023.3 1004.6 -12 -33

1 5023.3 2511.6 -13 -28

2 8539.5 2009.3 -14 -24

3 2009.3 502.3 -15 -21

4 2009.3 502.3 -16 -15

Select column by integer indexes – DataFrame object has useful array-like properties .columns and .index

df.columns

1	df.columns

Index([u'acq_time', u'longitude', u'latitude', u'p10', u'p20'], dtype='object')

1	Index([u'acq_time', u'longitude', u'latitude', u'p10', u'p20'], dtype='object')

Therefore, we can use integer indexes to select the column names which brings us to the previous case using column names.

ds = d[d.columns[[1,2,1,4]]]

1	ds = d[d.columns[[1,2,1,4]]]

   longitude  latitude  longitude     p20
0        -12       -33        -12  1004.6
1        -13       -28        -13  2511.6
2        -14       -24        -14  2009.3
3        -15       -21        -15   502.3
4        -16       -15        -16   502.3

longitude latitude longitude p20

0 -12 -33 -12 1004.6

1 -13 -28 -13 2511.6

2 -14 -24 -14 2009.3

3 -15 -21 -15 502.3

4 -16 -15 -16 502.3

Reverse order of columns

ds = df[df.columns[range(4,-1,-1)]]
print ds

1 2	ds = df[df.columns[range(4,-1,-1)]] print ds

      p20     p10  latitude  longitude  acq_time
0  1004.6  5023.3       -33        -12   1.0e-02
1  2511.6  5023.3       -28        -13   1.0e-02
2  2009.3  8539.5       -24        -14   1.0e-02
3   502.3  2009.3       -21        -15   2.0e-02
4   502.3  2009.3       -15        -16   2.0e-02

p20 p10 latitude longitude acq_time

0 1004.6 5023.3 -33 -12 1.0e-02

1 2511.6 5023.3 -28 -13 1.0e-02

2 2009.3 8539.5 -24 -14 1.0e-02

3 502.3 2009.3 -21 -15 2.0e-02

4 502.3 2009.3 -15 -16 2.0e-02

Selecting rows

Select rows from i-th (inclusive) to j-th (exclusive) column.

ds = df[2:4]
print ds

1 2	ds = df[2:4] print ds

   acq_time  longitude  latitude     p10     p20
2   1.0e-02        -14       -24  8539.5  2009.3
3   2.0e-02        -15       -21  2009.3   502.3

acq_time longitude latitude p10 p20

2 1.0e-02 -14 -24 8539.5 2009.3

3 2.0e-02 -15 -21 2009.3 502.3

Selecting specific rows using list of row numbers

ds = d.iloc[[2,4,2,2]]
print ds

1 2	ds = d.iloc[[2,4,2,2]] print ds

   acq_time  longitude  latitude     p10     p20
2   1.0e-02        -14       -24  8539.5  2009.3
4   2.0e-02        -16       -15  2009.3   502.3
2   1.0e-02        -14       -24  8539.5  2009.3
2   1.0e-02        -14       -24  8539.5  2009.3

acq_time longitude latitude p10 p20

2 1.0e-02 -14 -24 8539.5 2009.3

4 2.0e-02 -16 -15 2009.3 502.3

2 1.0e-02 -14 -24 8539.5 2009.3

There are 3 methods – iloc, loc and ix for row selection and it’s important to know which one to use when. You can read about it here[LINK].

Selecting rows and cols at once

If you want to select certain columns and certain rows, you can either combine the operations above, or again use loc and iloc function.

ds = d.iloc[[1,2,1],[1,2]]
print ds

1 2	ds = d.iloc[[1,2,1],[1,2]] print ds

   longitude  latitude
1        -13       -28
2        -14       -24
1        -13       -28

longitude latitude

1 -13 -28

2 -14 -24

1 -13 -28

So the first argument in iloc function are selecting rows, so I want 1st, 2nd and 1st row again. Second argumnet is number of cols I wish to select.

Beware that with iloc function you always specify number of row / column. In case you need to specify data by column name and index of row ( identifier of row ), then you need to you loc. Read more here[LINK].

Save your data

Now after you selected the data you are interested in, you can simply save your dataframe back to .csv or many other formats that pandas is supporting.

df = pd.read_csv("my_data.csv")
print df

1 2	df = pd.read_csv("my_data.csv") print df

   acq_time  longitude  latitude     p10     p20
0   1.0e-02        -12       -33  5023.3  1004.6
1   1.0e-02        -13       -28  5023.3  2511.6
2   1.0e-02        -14       -24  8539.5  2009.3
3   2.0e-02        -15       -21  2009.3   502.3
4   2.0e-02        -16       -15  2009.3   502.3

acq_time longitude latitude p10 p20

0 1.0e-02 -12 -33 5023.3 1004.6

1 1.0e-02 -13 -28 5023.3 2511.6

2 1.0e-02 -14 -24 8539.5 2009.3

3 2.0e-02 -15 -21 2009.3 502.3

4 2.0e-02 -16 -15 2009.3 502.3

ds = d.iloc[[1,2,1],[1,2]]
ds.to_csv("output.csv", index=False)

1 2	ds = d.iloc[[1,2,1],[1,2]] ds.to_csv("output.csv", index=False)

> cat output.csv

1	> cat output.csv

longitude,latitude
-13.0,-28.0
-14.0,-24.0
-13.0,-28.0

longitude,latitude

-13.0,-28.0

-14.0,-24.0

-13.0,-28.0

Select subset of .csv data using pandas

Goal

Data

Load it

View it

Process it

Selecting columns

Selecting rows

Selecting rows and cols at once

Save your data

One thought on “Select subset of .csv data using pandas”

Leave a Reply Cancel reply