Idea: Correlate Reddit Usage With Weather

Clearly, this blog has largely become a project blog for my analyses of reddit (particularly /r/washingtondc), so I’m going to stop starting posts by going into the background of the project.

Normally I’ve just been posting results, but I want to get into the habit of commenting on my whole process. This is the only “brainstorming” post so far, but it will probably be the first of many.

Hypothesis

Do people use reddit more when it’s shitty out? Can we show this analytically? Are there some very active users who are likely to stop redditing when it’s really nice out? Can we identify homebodies (people whose redditing behavior is affected very little by how nice the weather is)?

Potential Methodology

1: Aggregate reddit usage BY DAY for r/dc. We might not have enough data to do something meaningful with just r/dc, but we can try. Consider determining average usage for each user by day of week and normalizing relative to that mean (see the sketch after this list). Then derive an average based on the users available each day; otherwise we’ll always see more usage later in the dataset simply because there are more users. Consider limiting the dataset to users who have commented within some threshold time frame near the end of the dataset, to make sure we don’t grab any “dead” users.

2: Download Wunderground data for the appropriate time series. Try to draw predictions based on precipitation, temperature, etc.

One option: work backwards. Identify days during which it rained all day, during the afternoon only, etc. (not just any day it rained). See how these buckets affect the analysis.
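Roughly what I have in mind for the normalization in step 1, sketched in pandas (this assumes a hypothetical comments DataFrame with one row per comment, an 'author' column, and a DatetimeIndex of comment timestamps; none of this is written yet):

import pandas as pd

def relative_daily_usage(comments):
    # One row per comment -> one row per (user, calendar day) with a count.
    daily = pd.DataFrame({'author': comments['author'].values,
                          'day': comments.index.normalize(),
                          'dow': comments.index.dayofweek})
    daily = (daily.groupby(['author', 'day', 'dow']).size()
                  .reset_index(name='n_comments'))

    # Each user's average for that day of the week, and each day's count
    # expressed relative to that personal baseline.
    baseline = daily.groupby(['author', 'dow'])['n_comments'].transform('mean')
    daily['relative'] = daily['n_comments'] / baseline

    # Average over whichever users were active that day, so the curve isn't
    # inflated later in the dataset just because there are more users by then.
    return daily.groupby('day')['relative'].mean()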

Getting Pandas

No, I didn’t go to the store and buy a panda. In fact, the National Zoo here in DC lost a panda fairly recently 😦

The title of this post is meant to contrast with my last post’s frustration: I’m starting to get the whole pandas thing.

When I last played with my reddit data, I produced an interesting graphic showing the reddit activity of DC-area redditors. In my analysis, I treated the data as though it were descriptive of the redditors as individuals, as though it were an average of each individual’s usage pattern, but really it was aggregate data. I’m interested in trying to characterize different “kinds” of redditors from their usage patterns, and I also hope to build a classifier that can predict a user’s timezone. The obvious next step in my analysis was to get more granular and work with the individual users. As I mentioned in my last post, I’m challenging myself to do this analysis in python.

My dataset resides in a SQLite database. I have extensive experience with SQL at work, where I basically live in an Oracle database, but I’m lucky in that I can do most of my analysis directly in the database; I’m still getting the hang of how applications I program should interface with one. So, like any noob, I started out by making a pretty stupid, obvious mistake. Let me show you what my code looked like when I started this analysis:

"""
NB: This code is for sample purposes only and represents
what NOT to do. The better solution is posted directly below.

Many of my posts describe good code in the context of bad code:
if you're going to cut and paste from anything I post here, please
make sure to read the whole post to make sure you aren't using the
bad solution to your problem.
"""
import time

import database as db
from datamodel import *
import sqlalchemy as sa

s = db.Session()

def get_users(min_comments=900):
    usertuples = s.query(Comment.author) \
    .group_by(Comment.author) \
    .having(sa.func.count(Comment.comment_id)>min_comments) \
    .all()
    usernames = []
    for u in usertuples:
        usernames.append(u[0])
    return usernames

def get_comment_activity(username):
    """
    Gets comments from database, develops a "profile" for the user by
    summarizing their % activity for each hour on a 24 hour clock.
    Could make this more robust by summarizing each day separately
    """
    comments = s.query(Comment.created_utc)\
            .filter(Comment.author == username)\
            .all()
    timestamps = []
    for c in comments:
        timestamps.append(c[0])
    return timestamps

def profile_user_hourly_activity(data):
    """
    Returns profile as a total count of activity.
    Should ultimately normalize this by dividing by total user's activity
    to get percentages. Will need to do that with numpy or some other
    numeric library.
    """
    pivot = {}
    for d in data:
        timestamp = time.gmtime(d)
        t = timestamp.tm_hour
        pivot[t] = pivot.get(t,0) +1
    return pivot

users = get_users()
profiles = {}
for user in users:
    comments = get_comment_activity(user)
    profiles[user] = profile_user_hourly_activity(comments)

I’ve simplified my code slightly for posting it on this blog, but it used to contain some print statements inside the for loop at the bottom so I could see how long each iteration was taking. On average, it took about 2.2 seconds to process a single user. That means processing all 2651 users in my database would take roughly an hour and a half. HOLY CRAP! Instead of fixing my problem, I decided to trim the dataset down to only those users for whom I had been able to download more than 900 comments (hence the min_comments parameter to get_users), reducing the dataset to 564 users. I actually ran the code and it took 21 minutes. Absolutely unacceptable. What’s the problem?

I/O IS EXPENSIVE.

This is a mistake I’ve made before, and eliminating it always leads to significant performance gains. I’ve had code that once took 18 hours finish in about 4 minutes on the same data set because I minimized the I/O (in that instance it was file reads/writes).

Instead of a separate database call for each user, let’s hit the database once and slice the data as necessary from inside python.

import time
import datetime as dt

import database as db
from datamodel import *
import pandas as pd

s = db.Session()
data = s.query(Comment.author, Comment.comment_id, Comment.created_utc).all()

author        = [a for a,b,c in data]
comment_id    = [b for a,b,c in data]
timestamps    = [time.gmtime(c) for a,b,c in data]
created_dates = [dt.datetime(t.tm_year, t.tm_mon, t.tm_mday, t.tm_hour, t.tm_min, t.tm_sec) for t in timestamps]

df = pd.DataFrame({'author':author, 'comment_id':comment_id}, index=created_dates)

Much better. Also, putting the data in a DataFrame makes the rest of the analysis much, much easier. Well… easier in terms of how fast the code runs and how little I need to write. I’m still learning pandas, so doing it the “hard” way actually produces results faster for me right now, but I’m not on a deadline or anything, so let’s do this the right way.

grouped = df.groupby(['author', lambda x: x.hour]).agg(len)
grid    = grouped.unstack()

Holy shit, that was easy. Since that happened so fast you might have missed it: those two lines of code summarize my data set by user to give a profile of each user’s total activity by hour. Did I mention this code takes about a minute to run? Compare that with the code at the top of the article, which would have taken well over an HOUR.
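Incidentally, the docstring in the throwaway code above mentioned normalizing each user’s counts by their total activity to get percentages. With the data in a DataFrame, that’s a one-liner (a sketch; I haven’t folded it into the plots below):

# Each user's hourly counts as a fraction of that user's total activity.
grid_pct = grid.div(grid.sum(axis=1), axis=0)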

Here’s how to plot this and what we get:

import pylab as pl
pl.plot(grid.T)
pl.show()

NB: Times here are in UTC (EST = UTC-5). Also, although I didn’t go to the trouble of labeling the axes, you can tell pretty quickly: the x-axis is hour, the y-axis is comment count.

You can see that there’s a lot of diversity in these users, but there’s also a very clear pattern, one consistent with the graphic I produced in my first pass at this. Let’s take a look at that pattern in a broad way by taking the average of the elements composing this graph:

pl.plot(grid.mean())
pl.show()

Man…pandas makes everything so…simple! And it integrates crazy nice with matplotlib (imported here as part of pylab).

As I mentioned at the top, I suspect there are distinct user categories I can identify from how people use the site, and my immediate hypothesis is that I can find these categories based on when people use it (e.g. people who browse predominantly at work, after dinner, or on the weekends). I was having a ton of trouble installing scikit-learn, so I cheated and did some analysis in R. Here’s what a PCA of this data looks like (just to add more pretty pictures):

Obviously the clusters just aren’t there yet. Maybe there are two: the horizontal ellipse on the left and the vertical ellipse on the right (maybe), but more likely, instead of distinct user types there’s a predominant user type (the right blob) and then a fully populated spectrum of other usage styles. I can’t remember whether I generated this using the full data set or just the users for whom I had >900 data points, so I could be wrong. I guess I’m going to need some more features.
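If you’d rather stay in Python and can’t get scikit-learn to install either, the same flavor of PCA can be done with numpy alone. Here’s a sketch against the grid object from above (per-user normalization plus an SVD; this is not the exact R code that produced the plot):

import numpy as np
import pylab as pl

# Rows are users, columns are hours; fill gaps, normalize each user to sum
# to 1, and center the columns before decomposing.
X = grid.fillna(0).values.astype(float)
X = X / X.sum(axis=1)[:, None]
X = X - X.mean(axis=0)

# Principal components via SVD; the first two score columns are the
# coordinates shown in the scatterplot.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :2] * s[:2]

pl.scatter(scores[:, 0], scores[:, 1])
pl.show()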

Anyway, enough blogging. I’ve got a midterm this week and am taking a rare night off from my books to play around with this data, and instead of coding I’ve been writing up analyses I’ve already done. Time to play!

Playing With Pandas: DataFrustration

After putting up the post a few weeks ago where I began to analyze the /r/washingtondc community, I received some feedback regarding my methodology. The comment that most resonated with me was “Why are you trying to do your analysis in R if you’ve already got the data in python? Why not just do the analysis and develop your visualizations there?” I had good reasons to use R, mainly that I saw an opportunity to learn plyr, and that I don’t really know my way around the numerical programming and plotting tools in python (primarily numpy, scipy, pylab, matplotlib, scikit-learn, and pandas).

As I’ve mentioned previously, I recently started grad school and I’m going to need R specifically in my program, so my need to learn R is much more urgent than my need to develop expertise in python. But frankly, I just enjoy using python more, and I’d rather do everything in one environment if I can, so learning the python data analytics libraries is a very attractive prospect. Wes McKinney, the author of the pandas library, is publishing a book through O’Reilly called Python for Data Analysis that I’m super interested in. The problem is that right before I started my graduate program, I splurged a little and bought some textbooks I had been drooling over but have barely had a chance to crack open (Drew Conway and John Myles White’s Machine Learning for Hackers and Christopher M. Bishop’s Pattern Recognition and Machine Learning), so I’m hesitant to buy more books until I’ve bitten into the ones I already have (and while I’m dreaming of a world where I have time, I also really want to pick up Nate Silver’s new book The Signal and the Noise).

Instead of buying a new book (I expect I’ll buy it eventually), for the moment let’s do this the ol’ fashioned way and hit the documentation. The documentation for pandas looks to be pretty thorough: I recommend starting with the Intro to Data Structures.

Now, it might be because I’m coming into this with very little NumPy experience, but right off the bat the DataFrame class, the heart of pandas, seems awkward to me. To illustrate my problem, let me compare how to take a particular slice in R vs. pandas.

What I want to do is isolate those columns of the dataframe whose first row is below some threshold value:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)

df <- data.frame(A,B,C,D,E)
sliced_df <- df[ , df[1,]<.5 ]

This is pretty straightforward. In the first 5 lines I create labeled vectors of 10 random numbers drawn from the uniform distribution (on the range (0,1)). Then I form a dataframe from these vectors. Finally, from the dataframe I select all rows, but only those columns where the value of the first row is less than 0.5.

> df
            A           B         C           D          E
1  0.45274205 0.543755858 0.2225730 0.643710467 0.44527644
2  0.55692168 0.687034039 0.8480953 0.494917616 0.98080695
3  0.19127556 0.419290813 0.2744206 0.005422064 0.58559636
4  0.58410947 0.094669003 0.3284746 0.891122109 0.05962251
5  0.94561895 0.022608545 0.9431832 0.951050056 0.38312492
6  0.72316858 0.003073411 0.3336150 0.201627465 0.89433597
7  0.02145899 0.685167549 0.5754166 0.371717998 0.06746820
8  0.47334489 0.143967454 0.4463423 0.959687645 0.64947595
9  0.75215197 0.068791088 0.0343898 0.117595073 0.28861395
10 0.78567118 0.398529395 0.6467450 0.883467028 0.86369047

> sliced_df
            A         C          E
1  0.45274205 0.2225730 0.44527644
2  0.55692168 0.8480953 0.98080695
3  0.19127556 0.2744206 0.58559636
4  0.58410947 0.3284746 0.05962251
5  0.94561895 0.9431832 0.38312492
6  0.72316858 0.3336150 0.89433597
7  0.02145899 0.5754166 0.06746820
8  0.47334489 0.4463423 0.64947595
9  0.75215197 0.0343898 0.28861395
10 0.78567118 0.6467450 0.86369047

Here’s how we do the same thing in python, using the pandas.DataFrame datatype (this might not look so bad, but keep in mind that DataFrame.T is the transpose method): <<Edit: Actually, since finishing this post I figured out a better way, which I discuss towards the end of this post.>>

import pandas as pd
import random

def runif(n):
    res = []
    for i in range(n):
        res.append(random.random())
    return res

A = runif(10)
B = runif(10)
C = runif(10)
D = runif(10)
E = runif(10)

df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E})
sliced_df = df.T[ df.T[0] < 0.5 ].T

To start with, I have to create a function to build sequences of random numbers. I strongly suspected that some such function exists in NumPy or somewhere, but I didn’t know what it was, so I made my own (which has the added benefit of making my code align better with the R example). <<EDIT: Since typing this blog post, I’ve learned that the function I needed was numpy.random.rand.>>
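In other words, the hand-rolled helper above collapses to a numpy one-liner:

import numpy as np

# numpy's uniform random generator replaces my homemade runif().
A = np.random.rand(10)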

Next, populating the dataframe is already extremely awkward. From the documentation, it seems the pandas.DataFrame is primarily designed to be built from dictionaries. What particularly annoys me about this is that, considering the DataFrame is supposed to be a container for the pandas.Series class, I can’t just feed pandas.DataFrame() a list of pandas.Series objects; they have to be in a dictionary. I can create a DataFrame from a single Series like so:

>>> df = pd.DataFrame(A)
>>> df

0
0  0.446686
1  0.696637
2  0.265383
3  0.165647
4  0.861040
5  0.347737
6  0.280177
7  0.980586
8  0.334544
9  0.645143

Note how the column name was changed to “0”. I would have expected the variable name to become the column name like in R, but nope, no such luck. The same thing happens if we feed DataFrame() two series, which makes it hard to distinguish them for slicing.

>>> df = pd.DataFrame(A,B)
>>> df

0
0.689893  0.446686
0.373250  0.696637
0.619527  0.265383
0.775759  0.165647
0.819883  0.861040
0.099763  0.347737
0.143239  0.280177
0.303460  0.980586
0.975462  0.334544
0.672082  0.645143

>>> df[0]

0.689893    0.446686
0.373250    0.696637
0.619527    0.265383
0.775759    0.165647
0.819883    0.861040
0.099763    0.347737
0.143239    0.280177
0.303460    0.980586
0.975462    0.334544
0.672082    0.645143
Name: 0

If I try to create a dataframe from three or more series, I just get an error.

>>> df = pd.DataFrame(A,B,C)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 412, in __init__
    copy=copy)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 506, in _init_ndarray
    block = make_block(values.T, columns, columns)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 461, in make_block
    do_integrity_check=do_integrity_check)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 26, in __init__
    assert(len(items) == len(values))
AssertionError

Frankly, I wasn’t sure at first why two series even worked: it turns out the second positional argument gets interpreted as the index (and a third as the column labels, which is why three series blows up). Anyway, the DataFrame object wants a dictionary so it knows explicitly how to name each data set. You’d think the variable names would be sufficient, and that’s the intuitive solution we get in R, but no such luck in pandas. For the time being this is a minor problem, but we’ll see what happens when I start working with larger datasets.
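For the record, there is a dict-free construction that keeps the names, if you’re willing to name each Series yourself and glue them together with concat (a small sketch):

# Name each Series explicitly and let concat stitch them into columns
# (axis=1 puts each Series in its own column, using its name as the label).
named = [pd.Series(vals, name=name)
         for name, vals in zip(['A', 'B', 'C', 'D', 'E'], [A, B, C, D, E])]
df2 = pd.concat(named, axis=1)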

What really irks me is that the slicing operations are a little funny, and things that should be able to handle booleans can’t. Basically, slicing dataframes is much less intuitive in pandas than in R or Octave (Matlab), and this concerns me, since this sort of thing is what enables vectorized operations (i.e. instead of slower for loops).

The main problem is that the df[x] operator doesn’t take both rows and columns at the same time like in R (e.g. df[rows, columns]). But weirdly, it can take one or the other separately. In pandas, df[x] takes a few different kinds of ‘x’: a column name, a list of column names, slice indices, or a conditional statement. If I give df[x] a conditional statement or slice indices (e.g. df[0:3]), they apply to the rows. Did you catch that? If I feed it labels that match column names, I get columns back. If I feed it slice indices, I get rows back. What the fuck, pandas. I can’t be the only person who thinks this is spectacularly confusing. A particular slice method (e.g. [ ]) should give me back either columns or rows, not both.

>>> df['A']

0    0.446686
1    0.696637
2    0.265383
3    0.165647
4    0.861040
5    0.347737
6    0.280177
7    0.980586
8    0.334544
9    0.645143
Name: A

>>> df[['A','B']]

A         B
0  0.446686  0.689893
1  0.696637  0.373250
2  0.265383  0.619527
3  0.165647  0.775759
4  0.861040  0.819883
5  0.347737  0.099763
6  0.280177  0.143239
7  0.980586  0.303460
8  0.334544  0.975462
9  0.645143  0.672082

>>> df[ df['A']>0.5 ]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
4  0.861040  0.819883  0.397820  0.735031  0.110483
7  0.980586  0.303460  0.117398  0.033969  0.731914
9  0.645143  0.672082  0.138268  0.730780  0.969545

>>> df[1:3]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
2  0.265383  0.619527  0.633576  0.553009  0.599043

This means if I want just those columns that meet a particular row condition, I have to first take the transpose of the dataframe to make the rows into columns, apply my conditional statement to the appropriate column (which requires taking the transpose again), and take the transpose one last time to flip the returned data set back into the original orientation. Hence:

sliced_df = df.T[ df.T[0] < 0.5 ].T

Having written out this entire rant, I finally figured out the better way of doing this. I’m still annoyed by all the stuff I described above, but here’s how you accomplish this slice without using the transpose at all: the DataFrame.ix method. It allows for more intuitive R-style indexing of the form df.ix[rows, columns], and it accepts indices and booleans. Do I sound dejected? This took me a while to figure out.

>>> df.ix[:,:]

A         B         C         D         E
0  0.242808  0.829024  0.027734  0.510985  0.430466
1  0.301553  0.834208  0.600806  0.773148  0.119008
2  0.968252  0.098827  0.290203  0.555629  0.652359
3  0.351365  0.391068  0.352370  0.531282  0.478862
4  0.513526  0.138082  0.538826  0.252554  0.486603
5  0.705628  0.362105  0.800225  0.977828  0.454140
6  0.097671  0.613972  0.712334  0.473130  0.886449
7  0.386206  0.520115  0.589156  0.722709  0.293428
8  0.337381  0.102242  0.296870  0.725426  0.475001
9  0.076314  0.894782  0.115159  0.592838  0.402849

>>> df.ix[2:3,'B':'E']

B         C         D         E
2  0.098827  0.290203  0.555629  0.652359
3  0.391068  0.352370  0.531282  0.478862

>>> df.ix[0,:] < 0.5

A     True
B    False
C     True
D    False
E     True
Name: 0

>>> df.ix[ :, df.ix[0,:] < 0.5 ]

A         C         E
0  0.242808  0.027734  0.430466
1  0.301553  0.600806  0.119008
2  0.968252  0.290203  0.652359
3  0.351365  0.352370  0.478862
4  0.513526  0.538826  0.486603
5  0.705628  0.800225  0.454140
6  0.097671  0.712334  0.886449
7  0.386206  0.589156  0.293428
8  0.337381  0.296870  0.475001
9  0.076314  0.115159  0.402849

One thing that is much less confusing about these dataframes: if I want to plot something, all I have to do is:

from pylab import plot, show

plot(df)
show()

And I get a single plot where each line is a separate column of data. I would otherwise have had to use a for loop, so that at least is nice and intuitive.

These are normal growing pains, as with learning any new tool. If you’ve got a few hours on your hands and don’t feel like mucking through the documentation, here’s a PyCon tutorial Wes gave earlier this year that’s pretty interesting (I’ve only made my way through about an hour’s worth myself, but it looks like a thorough introduction). NB: He spends the first 15 minutes or so just setting up IPython, so I recommend you skip ahead a bit.

A Beautiful Python Commandline Interface

The most elegant solutions are often to problems you didn’t even realize you had.

Just a few weeks ago I commented on how easy it is to create a good commandline interface with the argparse library. The code I used to demonstrate this looked like this:

import argparse

parser = argparse.ArgumentParser(description="Scrapes comments for a reddit user. Currently limited to most recent 999 comments (limit imposed by reddit).")
parser.add_argument('-u','--user', type=str, help="Reddit username to grab comments from.", required=True)
parser.add_argument('-l','--limit', type=int, help="Maximum number of comments to download.", default=0)
parser.add_argument('-d','--dbname', type=str, help="Database name for storage.", default='RedditComments.DB')
parser.add_argument('-w','--wait', type=int, help="Wait time between GET requests. Reddit documentation requests a limit of 1 of every 2 seconds not to exceed 30 per min.", default=2)

args = parser.parse_args()
_user   = args.user
_limit  = args.limit
_dbname = args.dbname
_wait   = args.wait

Which, in all fairness, is not bad, and produces the following output (which could be a little prettier):

$ python test_args.py -h

usage: test_args.py [-h] -u USER [-l LIMIT] [-d DBNAME] [-w WAIT]

Scrapes comments for a reddit user. Currently limited to most recent 999
comments (limit imposed by reddit).

optional arguments:
  -h, --help                  show this help message and exit
  -u USER, --user USER        Reddit username to grab comments from.
  -l LIMIT, --limit LIMIT     Maximum number of comments to download.
  -d DBNAME, --dbname DBNAME  Database name for storage.
  -w WAIT, --wait WAIT        Wait time between GET requests. Reddit documentation
                              requests a limit of 1 of every 2 seconds not to
                              exceed 30 per min.

Part of what makes this easy to read is that it’s familiar: this isn’t some commandline vocabulary I just invented; it conforms to the POSIX standard.

Yesterday I saw a PyCon UK presentation by Vladimir Keleshev describing a new library he has written called docopt. The genius of this library is that it speaks POSIX: instead of feeding in a whole list of complex parameters to define a single argument, or worse, a relationship between arguments (such as mutual exclusivity), all you have to do is pass docopt the help text you want it to produce, and it handles the rest.

If your commandline arguments have complex relationships, or there are several exclusive use cases for your program that need to be defined, this is an incredibly elegant and convenient solution.
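To give a flavor of it, here’s roughly what the scraper interface from above might look like under docopt (a sketch only; I haven’t actually ported the scraper, and the script name is made up):

"""Scrapes comments for a reddit user. Currently limited to most recent 999
comments (limit imposed by reddit).

Usage:
  scrape_comments.py --user=<name> [--limit=<n>] [--dbname=<file>] [--wait=<sec>]
  scrape_comments.py (-h | --help)

Options:
  -h --help           Show this help message and exit.
  -u --user=<name>    Reddit username to grab comments from.
  -l --limit=<n>      Maximum number of comments to download [default: 0].
  -d --dbname=<file>  Database name for storage [default: RedditComments.DB].
  -w --wait=<sec>     Wait time between GET requests [default: 2].
"""
from docopt import docopt

if __name__ == '__main__':
    # docopt parses sys.argv against the docstring above and hands back a
    # plain dict, e.g. args['--user'], args['--limit'].
    args = docopt(__doc__)
    print(args)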

One-to-one mapping of random table entries (Oracle SQL): generating random records for unit tests

A large amount of the work in my department involves associating music repertoire in our system with “claiming rights owners,” usually labels, using some tools I maintain. When I make an update or add a feature, I like to generate dummy repertoire associations for unit testing. Instead of manually finding individual records and targets, I find it useful to do it programmatically, which allows me to scale the test to however many instances I want.

Obviously, the first step is to write a query that finds repertoire in our system that meets the criteria of my test, and another query that finds rights owners that meet the criteria of my test. The question is: how best to combine the two?

To avoid talking too much about the specifics of my work problem, I’ll generalize this by framing it in terms of a purchases database. Let’s pretend we want to generate random product purchases for tests. We want the random purchases to include customers from various cities and also products connected to several brands. Depending on how our database was populated, just taking the first few entries from the CUSTOMERS table might result in them all being from the same region, and similarly taking the first few entries in the PRODUCTS table could result in them all belonging to the same brand. Not really a problem, but it would be better to have diverse test cases.

The first step is randomly shuffling our datasets. Here’s how you accomplish this in Oracle:

select *
from customers
order by dbms_random.value

If you are using a different RDBMS, dbms_random.value probably doesn’t exist. Pete Freitag wrote a blog post describing the appropriate shuffling methods for different RDBMSs, which you will probably find useful.

Now we need to map our shuffled data. Why not just put the shuffled datasets side by side? To do this, we join on the row numbers of our shuffled datasets:

select cust.id customer_id, prod.id product_id from
    (select id, rownum joinval from (
        select id
        from customers
        order by dbms_random.value)) cust,
    (select id, rownum joinval from (
        select id
        from products
        order by dbms_random.value)) prod
where cust.joinval = prod.joinval

Again, this is Oracle style. rownum will probably have a different name or may not even be supported, depending on your RDBMS of choice.

This solution works for me because my analogs of the customers and products tables are large and we have many, many customers and products, so I can feel safe that I’ll get a good mix. If you’re not so lucky, or you want to be really damn sure that no two customers are from the same city (or that no two products are from the same brand), you can do that, but things will start to get messy fast. Intuitively, we want something like this:

select b.brand_id, min(p.id)
from products p, xref_products_brands b
where p.id = b.prod_id
group by b.brand_id

But this will always give us the same product ids! We don’t actually want the min(p.id) for each brand, we want a random product id for each brand, so this deduplication method probably doesn’t cut it for us unless we have lots of brands in our system (which we very well may). Since shuffling the data suits my needs I don’t really feel the need to delve into this problem further, but be cognizant of your own needs and data if you decide to use these methods.