Playing With Pandas: DataFrustration

After putting up the post a few weeks ago where I began to analyze the /r/washingtondc community, I received some feedback regarding my methodology. The comment that most resonated with me was “Why are you trying to do your analysis in R if you’ve already got the data in python? Why not just do the analysis and develop your visualizations there?” I had good reasons to use R, mainly: I saw an opportunity to learn plyr, and I don’t really know my way around the numerical programming and plotting tools in python (primarily numpy, scipy, pylab, matplotlib, scikit-learn, and pandas).

As I’ve mentioned previously, I recently started grad school, and I’m going to need R specifically in my program, so my need to learn R is much more urgent than my need to develop expertise in python. But frankly, I just enjoy using python more, and I’d rather do everything in one environment if I can, so learning the python data analytics libraries is a very attractive prospect to me. Wes McKinney, the author of the pandas library, is publishing a book through O’Reilly called Python for Data Analysis that I’m super interested in. The problem is that right before I started my graduate program, I splurged a little and bought some textbooks I had been drooling over but have barely had the opportunity to crack into (Drew Conway and John Myles White’s Machine Learning for Hackers and Christopher M. Bishop’s Pattern Recognition and Machine Learning), so I’m hesitant to buy more books until I can bite into the ones I’ve already bought (while I’m dreaming of a world where I have time, I also really want to pick up Nate Silver’s new book, The Signal and the Noise).

Instead of buying a new book (I expect I’ll buy it later), for the moment let’s do this the ol’-fashioned way and hit the documentation. The documentation for pandas looks to be pretty thorough: I recommend starting with the Intro to Data Structures.

Now, it might be because I’m coming into this with very little NumPy experience, but right off the bat the DataFrame class, the heart of pandas, seems awkward to me. To illustrate my problem, let me compare how to take a particular slice in R vs. pandas.

What I want to do is isolate those columns of the dataframe whose first row is below some threshold value:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)

df <- data.frame(A,B,C,D,E)
sliced_df <- df[ , df[1,]<.5 ]

This is pretty straightforward. In the first 5 lines I create labeled vectors of 10 random numbers drawn from the uniform distribution on (0,1). Then I form a dataframe from these vectors. Finally, from the dataframe I select all rows, but only those columns where the value of the first row is less than 0.5.

> df
            A           B         C           D          E
1  0.45274205 0.543755858 0.2225730 0.643710467 0.44527644
2  0.55692168 0.687034039 0.8480953 0.494917616 0.98080695
3  0.19127556 0.419290813 0.2744206 0.005422064 0.58559636
4  0.58410947 0.094669003 0.3284746 0.891122109 0.05962251
5  0.94561895 0.022608545 0.9431832 0.951050056 0.38312492
6  0.72316858 0.003073411 0.3336150 0.201627465 0.89433597
7  0.02145899 0.685167549 0.5754166 0.371717998 0.06746820
8  0.47334489 0.143967454 0.4463423 0.959687645 0.64947595
9  0.75215197 0.068791088 0.0343898 0.117595073 0.28861395
10 0.78567118 0.398529395 0.6467450 0.883467028 0.86369047

> sliced_df
            A         C          E
1  0.45274205 0.2225730 0.44527644
2  0.55692168 0.8480953 0.98080695
3  0.19127556 0.2744206 0.58559636
4  0.58410947 0.3284746 0.05962251
5  0.94561895 0.9431832 0.38312492
6  0.72316858 0.3336150 0.89433597
7  0.02145899 0.5754166 0.06746820
8  0.47334489 0.4463423 0.64947595
9  0.75215197 0.0343898 0.28861395
10 0.78567118 0.6467450 0.86369047

Here’s how we do the same thing in python, using the pandas.DataFrame datatype (this might not look so bad, but DataFrame.T is the transpose method): <<Edit: Actually, since finishing this post I figured out a better way, which I discuss towards the end of this post.>>

import pandas as pd
import random

def runif(n):
    # mimic R's runif(): n uniform draws from (0, 1)
    res = []
    for i in range(n):
        res.append(random.random())
    return res

A = runif(10)
B = runif(10)
C = runif(10)
D = runif(10)
E = runif(10)

df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E})
sliced_df = df.T[ df.T[0]<0.5 ].T

To start with, I have to create a function to build sequences of random numbers. I strongly suspect that some such function exists in NumPy or somewhere, but I don’t know what it is, so I made my own (which has the added benefit of making my code align better with the R example). <<EDIT: Since typing this blogpost, I’ve learned that the function I needed was numpy.random.rand, which draws from the uniform distribution on [0, 1) (numpy.random.randn is its normal-distribution cousin).>>
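For the record, here’s a minimal sketch of what that buys you (the column names are just my choice, to mirror the example above):

import numpy as np
import pandas as pd

# np.random.rand(10, 5) is a 10x5 array of uniform draws on [0, 1),
# so it replaces the five runif(10) calls and the dict in one shot
df = pd.DataFrame(np.random.rand(10, 5), columns=list('ABCDE'))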

Next, populating the dataframe is already extremely awkward. From the documentation, it seems like the pandas.DataFrame is primarily designed to be generated from dictionaries. What particularly annoys me about this is that, considering the DataFrame is supposed to be a container for the pandas.Series class, I can’t just feed pandas.DataFrame() a list of pandas.Series objects; they have to be in a dictionary. I can create a DataFrame from a single Series like so:

>>> df = pd.DataFrame(A)
>>> df

0
0  0.446686
1  0.696637
2  0.265383
3  0.165647
4  0.861040
5  0.347737
6  0.280177
7  0.980586
8  0.334544
9  0.645143

Note how the column name was changed to “0”. I would have expected the variable name to become the column name like in R, but nope, no such luck. Something similar happens if we feed DataFrame() two series, which makes it hard to distinguish them for slicing.

>>> df = pd.DataFrame(A,B)
>>> df

0
0.689893  0.446686
0.373250  0.696637
0.619527  0.265383
0.775759  0.165647
0.819883  0.861040
0.099763  0.347737
0.143239  0.280177
0.303460  0.980586
0.975462  0.334544
0.672082  0.645143

>>> df[0]

0.689893    0.446686
0.373250    0.696637
0.619527    0.265383
0.775759    0.165647
0.819883    0.861040
0.099763    0.347737
0.143239    0.280177
0.303460    0.980586
0.975462    0.334544
0.672082    0.645143
Name: 0

If I try to create a dataframe from three or more series, I just get an error.

>>> df = pd.DataFrame(A,B,C)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 412, in __init__
copy=copy)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 506, in _init_ndarray
block = make_block(values.T, columns, columns)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 461, in make_block
do_integrity_check=do_integrity_check)
File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 26, in __init__
assert(len(items) == len(values))
AssertionError

Frankly, I wasn’t sure why two series even worked until I looked at the constructor’s signature: the second positional argument is interpreted as the index, which is why B showed up above as the row labels rather than as a second column (and why passing C as a third positional argument, which pandas tries to use as the column labels, fails the integrity check). Anyway, the DataFrame object wants a dictionary so it knows explicitly how to name each data set. You’d think the variable names would be sufficient, and that’s the intuitive solution we get in R, but no such luck in pandas. For the time being, this is a minor problem, but we’ll see what happens when I start working with larger datasets.
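For what it’s worth, here’s a sketch of one way around the dictionary requirement (using pandas.concat; I make no claim this is the canonical approach): if you give each Series an explicit name, concatenating them column-wise uses those names as the column labels.

import pandas as pd

# a named Series carries its own label, so concat can use it
a = pd.Series(A, name='A')
b = pd.Series(B, name='B')
c = pd.Series(C, name='C')

named_df = pd.concat([a, b, c], axis=1)  # columns come out as A, B, C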

What really irks me is that the slicing operations are a little funny, and things that should be able to handle booleans can’t. Basically, slicing dataframes is much less intuitive in pandas than in R or Octave (MATLAB), and this concerns me, since this sort of slicing is what enables vectorized operations (i.e. instead of using slower for loops).

The main problem is that the df[x] operator doesn’t take rows and columns at the same time like in R (e.g. df[rows, columns]). But weirdly, it can take one or the other separately. In pandas, df[x] takes a few different kinds of ‘x’: a column name, a list of column names, slice indices, or a conditional statement. If I give df[x] a conditional statement or slice indices (e.g. df[0:3]), they apply to the rows. Did you catch that? If I feed it labels that match column names, I get columns back. If I feed it slice indices, I get rows back. What the fuck, pandas. I can’t be the only person who thinks this is spectacularly confusing. A particular slice method (e.g. [ ]) should give me back either columns or rows, not both.

>>> df['A']

0    0.446686
1    0.696637
2    0.265383
3    0.165647
4    0.861040
5    0.347737
6    0.280177
7    0.980586
8    0.334544
9    0.645143
Name: A

>>> df[['A','B']]

A         B
0  0.446686  0.689893
1  0.696637  0.373250
2  0.265383  0.619527
3  0.165647  0.775759
4  0.861040  0.819883
5  0.347737  0.099763
6  0.280177  0.143239
7  0.980586  0.303460
8  0.334544  0.975462
9  0.645143  0.672082

>>> df[ df['A']>0.5 ]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
4  0.861040  0.819883  0.397820  0.735031  0.110483
7  0.980586  0.303460  0.117398  0.033969  0.731914
9  0.645143  0.672082  0.138268  0.730780  0.969545

>>> df[1:3]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
2  0.265383  0.619527  0.633576  0.553009  0.599043

This means that if I want just those columns that meet a particular row condition, I have to take the transpose of the dataframe to make the rows into columns, apply my conditional statement to the appropriate column of the transpose (which means writing the transpose a second time, inside the condition), and take the transpose one last time to flip the returned data set back into the original orientation. Hence:

sliced_df = df.T[ df.T[0]<0.5 ].T

Having written out this entire rant, I finally figured out the better way of doing this. I’m still annoyed by all the stuff I described above, but here’s how you accomplish this slice without using the transpose at all: the DataFrame.ix method. This allows for more intuitive R-style indexing of the form df.ix[rows, columns]. It accepts indices, labels, and booleans. Do I sound dejected? This took me a while to figure out.

>>> df.ix[:,:]

A         B         C         D         E
0  0.242808  0.829024  0.027734  0.510985  0.430466
1  0.301553  0.834208  0.600806  0.773148  0.119008
2  0.968252  0.098827  0.290203  0.555629  0.652359
3  0.351365  0.391068  0.352370  0.531282  0.478862
4  0.513526  0.138082  0.538826  0.252554  0.486603
5  0.705628  0.362105  0.800225  0.977828  0.454140
6  0.097671  0.613972  0.712334  0.473130  0.886449
7  0.386206  0.520115  0.589156  0.722709  0.293428
8  0.337381  0.102242  0.296870  0.725426  0.475001
9  0.076314  0.894782  0.115159  0.592838  0.402849

>>> df.ix[2:3,'B':'E']

B         C         D         E
2  0.098827  0.290203  0.555629  0.652359
3  0.391068  0.352370  0.531282  0.478862

>>> df.ix[0,:] < 0.5

A     True
B    False
C     True
D    False
E     True
Name: 0

>>> df.ix[ :, df.ix[0,:] < 0.5 ]

A         C         E
0  0.242808  0.027734  0.430466
1  0.301553  0.600806  0.119008
2  0.968252  0.290203  0.652359
3  0.351365  0.352370  0.478862
4  0.513526  0.538826  0.486603
5  0.705628  0.800225  0.454140
6  0.097671  0.712334  0.886449
7  0.386206  0.589156  0.293428
8  0.337381  0.296870  0.475001
9  0.076314  0.115159  0.402849

One thing that is much less confusing about these dataframes is that if I want to plot something, all I have to do is:

from pylab import plot, show

plot(df)
show()

And I get a single plot where each line is a separate column of data. I would otherwise have had to use a for loop, so that at least is nice and intuitive.
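Relatedly, DataFrames also have a plot() method that, as far as I can tell, does the same thing through matplotlib and adds a legend keyed to the column names; a minimal sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# DataFrame.plot() draws one line per column and labels them in a legend
df.plot()
plt.show()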

These are normal growing pains, as with learning any new tool. If you’ve got a few hours on your hands and don’t feel like mucking through the documentation, here’s a pycon tutorial Wes gave earlier this year that’s pretty interesting (I’ve only made my way through about an hour’s worth myself, but it looks like a thorough introduction). NB: For the first 15 minutes or so he’s just setting up IPython, so I recommend you skip ahead a bit.
