Goal
This is short post for people who wants to quickly select data from .csv datafile. Maybe just certain columns, maybe just certain rows. I assume you know how to run python interpreter, how .csv format is defined and how to install pandas library package.
Data
We have .csv datafile data.csv containing following :
1 2 3 4 5 6 |
acq_time,longitude,latitude,p10,p20 0.01, -12.00, -33.00, 5023.26, 1004.65 0.01, -13.00, -28.00, 5023.26, 2511.63 0.01, -14.00, -24.00, 8539.54, 2009.30 0.02, -15.00, -21.00, 2009.30, 502.33 0.02, -16.00, -15.00, 2009.30, 502.33 |
Load it
To use pandas, first we need to import the package
1 |
import pandas as pd |
One of the basic datastructures in pandas library is dataframe. Dataframe typically represents 2 dimensional data like above.
To load CSV we call method read_csv of dataframe object which returns pandas dataframe object stored in variable df.
1 |
df = pd.read_csv("my_data.csv") |
View it
1 |
print df |
Outputs
1 2 3 4 5 6 |
acq_time longitude latitude p10 p20 0 1.0e-02 -12 -33 5023.3 1004.6 1 1.0e-02 -13 -28 5023.3 2511.6 2 1.0e-02 -14 -24 8539.5 2009.3 3 2.0e-02 -15 -21 2009.3 502.3 4 2.0e-02 -16 -15 2009.3 502.3 |
Process it
Now there is many things you may want to do with your dataframe. In this post, I only want to show very simple scenario that you want to select some of the data and generate new .csv file containing this data subset.
Selecting columns
Select columns by column name – This way you can also change order of the columns in dataframe
1 2 |
ds = df[['p10','p20','longitude','latitude']] print ds |
1 2 3 4 5 6 |
p10 p20 longitude latitude 0 5023.3 1004.6 -12 -33 1 5023.3 2511.6 -13 -28 2 8539.5 2009.3 -14 -24 3 2009.3 502.3 -15 -21 4 2009.3 502.3 -16 -15 |
Select column by integer indexes – DataFrame object has useful array-like properties .columns and .index
1 |
df.columns |
1 |
Index([u'acq_time', u'longitude', u'latitude', u'p10', u'p20'], dtype='object') |
Therefore, we can use integer indexes to select the column names which brings us to the previous case using column names.
1 |
ds = d[d.columns[[1,2,1,4]]] |
1 2 3 4 5 6 |
longitude latitude longitude p20 0 -12 -33 -12 1004.6 1 -13 -28 -13 2511.6 2 -14 -24 -14 2009.3 3 -15 -21 -15 502.3 4 -16 -15 -16 502.3 |
Reverse order of columns
1 2 |
ds = df[df.columns[range(4,-1,-1)]] print ds |
1 2 3 4 5 6 |
p20 p10 latitude longitude acq_time 0 1004.6 5023.3 -33 -12 1.0e-02 1 2511.6 5023.3 -28 -13 1.0e-02 2 2009.3 8539.5 -24 -14 1.0e-02 3 502.3 2009.3 -21 -15 2.0e-02 4 502.3 2009.3 -15 -16 2.0e-02 |
Selecting rows
Select rows from i-th (inclusive) to j-th (exclusive) column.
1 2 |
ds = df[2:4] print ds |
1 2 3 |
acq_time longitude latitude p10 p20 2 1.0e-02 -14 -24 8539.5 2009.3 3 2.0e-02 -15 -21 2009.3 502.3 |
Selecting specific rows using list of row numbers
1 2 |
ds = d.iloc[[2,4,2,2]] print ds |
1 2 3 4 5 |
acq_time longitude latitude p10 p20 2 1.0e-02 -14 -24 8539.5 2009.3 4 2.0e-02 -16 -15 2009.3 502.3 2 1.0e-02 -14 -24 8539.5 2009.3 2 1.0e-02 -14 -24 8539.5 2009.3 |
There are 3 methods – iloc, loc and ix for row selection and it’s important to know which one to use when. You can read about it here[LINK].
Selecting rows and cols at once
If you want to select certain columns and certain rows, you can either combine the operations above, or again use loc and iloc function.
1 2 |
ds = d.iloc[[1,2,1],[1,2]] print ds |
1 2 3 4 |
longitude latitude 1 -13 -28 2 -14 -24 1 -13 -28 |
So the first argument in iloc function are selecting rows, so I want 1st, 2nd and 1st row again. Second argumnet is number of cols I wish to select.
Beware that with iloc function you always specify number of row / column. In case you need to specify data by column name and index of row ( identifier of row ), then you need to you loc. Read more here[LINK].
Save your data
Now after you selected the data you are interested in, you can simply save your dataframe back to .csv or many other formats that pandas is supporting.
1 2 |
df = pd.read_csv("my_data.csv") print df |
1 2 3 4 5 6 |
acq_time longitude latitude p10 p20 0 1.0e-02 -12 -33 5023.3 1004.6 1 1.0e-02 -13 -28 5023.3 2511.6 2 1.0e-02 -14 -24 8539.5 2009.3 3 2.0e-02 -15 -21 2009.3 502.3 4 2.0e-02 -16 -15 2009.3 502.3 |
1 2 |
ds = d.iloc[[1,2,1],[1,2]] ds.to_csv("output.csv", index=False) |
1 |
> cat output.csv |
1 2 3 4 |
longitude,latitude -13.0,-28.0 -14.0,-24.0 -13.0,-28.0 |
Hi, this is a comment.
To delete a comment, just log in and view the post's comments. There you will have the option to edit or delete them.