I before E except after C

A recent post on Patrick Vennebush’s blog Math Jokes 4 Mathy Folks asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, just based on the subjective experience of educators. It’s not a horrible rule, but surely we can construct better rules of this kind more intelligently (for lack of a better term, I’ll be referring to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off 🙂

You can find my full methodology and analysis in the following IPython notebook:

http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3

The short version

I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:

ie: 3950 	cie: 256
ei: 2607 	cei: 156

I thought that although the IEC rule doesn’t hold up when we look at the unique words in our vocabulary, perhaps it would hold true if we look at trigram and bigram frequencies across word usage in written text. Here are the frequencies for the IEC rule in the Brown corpus:

ie: 13275 	cie: 1310
ei: 5677 	cei: 485

Nope, still no good.
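For reference, token-level counts like these can be tallied from NLTK’s copy of the Brown corpus in a few lines (a sketch, not the exact code from my notebook):

from nltk.corpus import brown  # assumes the Brown corpus has been downloaded via nltk.download()

counts = {'ie': 0, 'ei': 0, 'cie': 0, 'cei': 0}
for word in brown.words():
    w = word.lower()
    for gram in counts:
        counts[gram] += w.count(gram)

print counts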

Instead of the IEC rule, here are some alternatives (taken from my vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram frequency ratio count(AB)/(count(AB)+count(BA)) is at least 1/2, and the Laplace-smoothed trigram frequency ratio (1+count(CBA))/(1+count(CAB)) is maximized (a sketch of this selection procedure follows the list):

  • P before E except after C
pe: 8052 	cpe: 0
ep: 5053 	cep: 955
  • E before U except after Q
eu: 2620 qeu: 0
ue: 1981 que: 949
  • I before C except after I
ic: 26140 iic: 1
ci: 6561 ici: 1830
  • T before E except after M
te: 27265 mte: 2
et: 11743 met: 2684
  • R before D except after N
rd: 3641 nrd: 0
dr: 2738 ndr: 808
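For the curious, here’s roughly what that selection procedure looks like in code (a sketch assuming a lowercase word list; the notebook linked above has the full analysis):

from collections import Counter
from itertools import product
import string

def count_ngrams(words, n):
    c = Counter()
    for w in words:
        for i in range(len(w) - n + 1):
            c[w[i:i+n]] += 1
    return c

def best_trigram_rules(words, top=5):
    bi, tri = count_ngrams(words, 2), count_ngrams(words, 3)
    rules = []
    for a, b in product(string.ascii_lowercase, repeat=2):
        if a == b:
            continue
        ab, ba = bi[a + b], bi[b + a]
        if ab + ba == 0 or ab / float(ab + ba) < 0.5:
            continue  # "A before B" has to hold in general first
        # pick the C that maximizes the smoothed count(CBA)/count(CAB) ratio
        best_c = max(string.ascii_lowercase,
                     key=lambda c: (1.0 + tri[c + b + a]) / (1.0 + tri[c + a + b]))
        score = (1.0 + tri[best_c + b + a]) / (1.0 + tri[best_c + a + b])
        rules.append((score, "%s before %s except after %s" % (a, b, best_c)))
    return sorted(rules, reverse=True)[:top]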

Update

After posting this article, there was some discussion that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these conditions:

  • O before U except after Q
ou 12144    qou 0
uo 671      quo 122
  • I before O except after J
io 15247    jio 0
oi 4040     joi 95

Topological Anomaly Detection


(tl;dr: https://github.com/dmarx/Topological-Anomaly-Detection)

Recently I had the pleasure of attending a presentation by Dr. Bill Basener, one of the authors of this paper which describes an outlier analysis technique called Topological Anomaly Detection (TAD). Despite the intimidating name, the algorithm is extremely simple, both to understand and to implement.

The technique is essentially a density-based outlier detection algorithm that, instead of calculating local densities, constructs a graph of the data using nearest-neighbors. The algorithm differs from other kNN outlier detection algorithms in that instead of setting ‘k’ as a parameter, you set a maximal inter-observation distance (called the graph “resolution” by Gartley and Basener). If the distance between two points is less than the graph resolution, add an edge between those two observations to the graph. Once the full graph is constructed, determine which connected components comprise the “background” of the data by setting some threshold percentage of observations ‘p’: any component containing fewer than ‘p’ percent of the observations is considered anomalous, and all the observations (nodes) in that component are outliers.

As you probably gathered from my description, it’s a pretty simple algorithm to implement. Let’s walk through it step by step.

First, we need to construct a distance matrix. SciPy has a great utility for this called pdist:

from scipy.spatial.distance import pdist

distance_matrix = pdist(X)

One thing that’s nice about this is that it will work for either numpy arrays or pandas dataframes, as long as the observations are on the rows and the features are on the columns. The output of pdist is, by default, a “condensed” distance matrix, which is just the upper triangular matrix collapsed into a flat vector. This is nice because it takes up (just) less than half the memory, but it’s slightly more cumbersome to work with than a straight up distance matrix.

We’re ultimately going to treat this distance matrix as an adjacency matrix, but before we do, we need to replace all entries in the distance matrix (ok, distance vector) that are larger than the graph resolution “r” with zeroes to remove those edges from the graph. I think it’s a little silly to expect the user to come up with a discrete distance to use for the resolution, so we’re going to give the user the option to provide a quantile instead as the ‘rq’ parameter below.

It’s tempting to just modify the distance matrix in place, but we’re going to need it later to score outliers, so we’re going to work with a copy instead.

import numpy as np

def trim_adjacency_matrix(adj, r=None, rq=.1):
    if r is None:
        # This is really just a lazy quantile function.
        q = int(np.floor(len(adj)*rq))
        print "q:", q
        r = np.sort(adj)[q]
    print "r:", r
    adj2 = adj.copy()
    adj2[adj>r] = 0 
    return adj2, r

Tada! We’ve got our graph. If you don’t work with graph data a lot you might not see it, but an adjacency matrix is just a way of encoding a graph. But, the adjacency matrix isn’t enough: we want to mine the connected components from the graph. Thankfully, this is trivial with networkx.

Now, networkx expects a square matrix if we’re going to build a graph using an adjacency matrix, but we have a vector. We could convert this to a full matrix by calling scipy.spatial.distance.squareform, but this will take up double the space in memory and it’s possible that a user is working with a large enough dataset that this will be a problem, so let’s work around the condensed distance matrix.

There’s probably a better way to go about this (like a formula that converts an (i, j) pair into the corresponding index into this condensed matrix/vector), but I was lazy when I wrote this up. Instead of hunting down the right formula, I took some inspiration from a stackoverflow post discussing how to work with condensed distance matrices, which led me to iterate through each vector index paired with its corresponding (i, j) index in the square matrix using enumerate(itertools.combinations). This essentially gives me an edgelist, permitting us to build up the graph one edge at a time:

from itertools import combinations
import networkx as nx
def construct_graph(edges, n):
    g = nx.Graph()
    for z, ij in enumerate(combinations(range(n),2)):
        d = edges[z]
        if d:
            i,j = ij
            g.add_edge(i,j, weight=d)
    return g
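(As an aside, the closed-form conversion I was too lazy to look up does exist; if I have scipy’s ordering right, for n observations and i < j the condensed index is n*i - i*(i+1)/2 + (j - i - 1). A quick sketch:)

def condensed_index(i, j, n):
    """Index into pdist's condensed distance vector for the pair (i, j), i != j."""
    if i > j:
        i, j = j, i
    return n * i - i * (i + 1) // 2 + (j - i - 1)

# e.g. distance between observations 3 and 7: distance_matrix[condensed_index(3, 7, X.shape[0])]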

Having constructed this graph structure, we can extract the connected components and score them as “background” or “anomaly” by counting the number of nodes in each component and comparing against the ‘p’ threshold. As a convenience for later, I’m also going to add “class” and “color” attributes to all the nodes in the graph.

def flag_anomalies(g, min_pts_bgnd, node_colors={'anomalies':'r', 'background':'b'}):
    res = {'anomalies':[],'background':[]}
    for c in nx.connected_components(g):
        if len(c) < min_pts_bgnd:
            res['anomalies'].extend(c)
        else:
            res['background'].extend(c)
    for type, array in res.iteritems():
        for node_id in array:
            g.node[node_id]['class'] = type
            g.node[node_id]['color'] = node_colors[type]
    return res, g

Last but not least, let’s score those anomalies. For convenience, I’m going to wrap these scores in a pandas.Series, but this is the only place in our code we’re using pandas so it would actually speed us up a bit not to do it this way (because then we can completely eliminate the pandas import):

import pandas as pd
def calculate_anomaly_scores(classed, adj, n):
    scores = {}
    for a in classed['anomalies']:
        scores[a] = 0
        for z, ij in enumerate(combinations(range(n),2)):
            i,j = ij
            if (i == a or j == a) and (
                i in classed['background'] or
                j in classed['background']):
                d = adj[z]
                if scores[a]:
                    scores[a] = np.min([scores[a], d])
                else:
                    scores[a] = d
    return pd.Series(scores)

Great! Now let’s put all the pieces together to construct our outlier classification tool.

def tad_classify(X, method='euclidean', r=None, rq=.1, p=.1, distances=None):
    if distances is None:
        adj = pdist(X, method)
    else:
        adj = distances
    edges, r = trim_adjacency_matrix(adj, r, rq)
    n = X.shape[0]
    g = construct_graph(edges, n)
    classed, g =  flag_anomalies(g, n*p)
    scores = calculate_anomaly_scores(classed, adj, n)
    return {'classed':classed, 'g':g, 'scores':scores, 'r':r, 'min_pts_bgnd':n*p, 'distances':adj}

Now that we’ve built this thing, let’s try it out on the iris data. I’m going to visualize the result using a pairs plot (a “scatter_matrix” in pandas) which will allow us to see how the outliers relate to the rest of the data across all pairs of dimensions along which we can slice the data.

import matplotlib.pyplot as plt
from pandas.tools.plotting import scatter_matrix
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data)
res = tad_classify(df)

df['anomaly']=0
df.anomaly.ix[res['classed']['anomalies']] = 1
scatter_matrix(df.ix[:,:4], c=df.anomaly, s=(25 + 50*df.anomaly), alpha=.8)
plt.show()

Iris_pairs_plot

The pairs plot is a great way to visualize the data, but if we had more than 4 dimensions this wouldn’t be a viable option. Instead, let’s reduce the dimensionality of the data using PCA, just for visualization purposes. While we’re at it, let’s actually visualize how the observations are connected in the graph components to get a better idea of what the algorithm is doing.

from sklearn.decomposition import PCA

g = res['g']
X_pca = PCA().fit_transform(df)
pos = dict((i,(X_pca[i,0], X_pca[i,1])) for i in range(X_pca.shape[0]))
colors = [node[1]['color'] for node in g.nodes(data=True)]
labels = {}
for node in g.nodes():
    if node in res['classed']['anomalies']:
        labels[node] = node
    else:
        labels[node] = ''
nx.draw(g, pos=pos, node_color = colors, labels=labels)
plt.show()

Iris_outlier_graph

If we reduce the graph resolution, we get more observations classed as outliers. Here’s what it looks like with rq=0.05 (the default–above–is 0.10):

Iris_outlier_graph_rq05

And here’s rq=0.03:

Iris_outlier_graph_rq03

In case these images don’t give you the intuition, reducing the graph resolution results in breaking up the graph into more and more components. Changing the ‘p’ parameter has a much less granular effect (as long as most of the observations are in a small number of components): changing ‘p’ will have basically no effect for small increments, but above a threshold we end up classifying large swathes of the graph as outliers.
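If you’d rather see that effect numerically than eyeball the plots, a quick sweep over the resolution quantile (reusing the iris DataFrame df from above) does the trick:

# count how many observations get flagged at each graph resolution quantile
for rq in (0.10, 0.05, 0.03):
    res_rq = tad_classify(df.ix[:,:4], rq=rq)
    print rq, len(res_rq['classed']['anomalies'])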

The images above have slightly different layouts because I ran PCA fresh each time. In retrospect, it would have looked better if I hadn’t done that, but you get the idea. Maybe I’ll fix that later.

Scraping your twitter home timeline with python and mongodb

Background

About a year and a half ago I was hanging out with two colleagues, John and Jane. John and I were discussing various new happenings we’d heard about recently. Jane was very impressed with how current we were and wondered how we did it. I described how I subscribe to several blogs and that suits me fine, but John insisted that we both needed to try Twitter.

I buckled down and finally created a twitter account. I didn’t really know who to follow, so I picked a prominent local data scientist and let him “vet” users for me: I skimmed his “following” list and decided to also follow anyone whose description made them sound reasonably interesting (from a data science standpoint). The problem with this method is that I ended up following a bunch of his random friends who don’t actually talk about data science. Right now, there’s just too much happening on my twitter feed to keep up. If I don’t check it every now and then, I’ll quickly amass hundreds of backlogged tweets, so I have strong motivation to “trim the fat” from my following list.

Setup

To get started, I’m going to explain how to scrape your twitter homepage. But first things first, we’re going to need a few things:

Twitter API wrapper

There are several python twitter API wrappers available right now. I did some research back when I first started tinkering with twitter and landed on the Twython package. I don’t remember what led me to it, but I think the main thing is that it has a strong community and so there’s a lot of good documentation and tutorials describing how to use it.

To install Twython, just use pip like you would for most anything else:

pip install twython

No surprises here.

Twitter API authentication

We’re going to need to do two things to get our scraper working with twitter. First, we need to register a new app at http://apps.twitter.com. If your desired app name is taken, just add your username to make it unique. It’s not mentioned anywhere on the page, but you can’t have the ‘@’ symbol in your app name (or at least, it can’t be preceded by the ‘@’ symbol).

Next, register an access token for your account. It only needs to have read-only permissions, and keeping it this way ensures we won’t do any real damage with our experiment.

Finally, store the authentication information in a config file (I called mine “scraper.cfg”) like so:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

MongoDB

Finally, we’re going to need to set up a repository to persist the content we’re scraping. My MO is usually to just use SQLite and to maybe define the data model using SQLAlchemy’s ORM (which is totally overkill but I still do it anyway for some reason). The thing here though is:

1. There’s a lot of information on tweets

2. I’m not entirely sure which information I’m going to find important just yet

3. The data model for a particular tweet is very flexible, and certain fields may appear on one tweet but not another.

I figured for this project it would be unnecessarily complicated to do it the old-fashioned way and, more importantly, I’d probably be constantly adding new fields to my data model as the project developed, rendering my older scrapes less valuable because they’d be missing data. So to capture all the data we might want, we’re going to drop the tweets in toto into a NoSQL document store. I chose mongo because I’d heard a lot about it; it suits my needs perfectly and is very easy to use, although querying it uses a paradigm that I’m still getting used to (we’ll get to that later).

Download and install MongoDB from http://docs.mongodb.org/manual/installation/. I set the data directory to be on a different (larger) disk than my C drive, so I start mongo like this:

C:\mongodb\bin\mongod --dbpath E:\mongodata\db

We will need to run this command to start a mongo listener before running our scraper. Alternatively, you could just drop a system call in the scraper to startup mongo, but you should check to make sure it’s not running first. I found just spinning up mongo separately to be simple enough for my purposes.

Since we’ve already got a config file started, let’s add our database name and collection (NoSQL analog for a relational table) to the config file, so our full config file will look like:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

[database]
name:twitter_test
collection:home_timeline

Take note: all we have to do to define the collection is give it a name. We don’t need to describe the schema at all (which, as described earlier, is part of the reason I’m using mongo for this project).

Getting Started

So we’re all set up with twython and mongo: time to start talking to twitter.

We start by calling in the relevant configuration values and spinning up a Twython instance:

import ConfigParser
from twython import Twython

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

To get the most recent tweets from our timeline, we hit the “/statuses/home_timeline” API endpoint. We can get a maximum of 200 tweets per call to the endpoint, so let’s do that. Also, I’m a little data greedy, so I’m also going to ask for “contributor details:”

params = {'count':200, 'contributor_details':True}
home = twitter.get_home_timeline(**params)

Now, if we want to do persistent scraping of our home feed, obviously we can’t just wrap this call in a while loop: we need to make sure twitter knows what we’ve already seen so we only get the newest tweets. To do this, we will use the “since_id” parameter to set a limit on how far back in the timeline the tweets in our response will go.

Paging and Cursoring

This is going to be a very brief overview of the motivation behind cursoring and how it works. For a more in depth explanation, check the twitter docs here: https://dev.twitter.com/docs/working-with-timelines

Consider a situation in which, since the last call to the timeline, so many new tweets have been written that we can’t get them all in a single call. Twitter has a “paging” option, but if we use this, it’s possible that the tweets on the bottom of one page will overlap with the tweets on the top of the next page (if new tweets are still coming into the timeline). So instead of “paging” we’ll use “cursoring:” in addition to giving twitter a limit for how far back we can go, we’ll also give a limit for the most recent tweet in any particular call. We’ll do this using a “max_id” parameter. The API will still return the tweet with this ID though, so we want to set the max_id value just lower than the last tweet we saw. If you’re in a 64bit environment, you can do this by subtracting ‘1’ from the id.

Putting this all together, here’s what our persistent scraper looks like so far:

latest = None   # most recent id we've seen
while True:
    newest = None # this is just a flag to let us know if we should update the value of "latest"
    params = {'count':200, 'contributor_details':True, 'since_id':latest}
    home = twitter.get_home_timeline(**params)
    if home:
        while home:
            store_tweets(home) # I'll define this function in a bit

            # Only update "latest" if we're inside the first pass through the inner while loop
            if newest is None:
                newest = True
                latest = home[0]['id']

            params['max_id'] = home[-1]['id'] - 1
            home = twitter.get_home_timeline(**params)

Rate limiting

As with pretty much any web API, twitter doesn’t take too kindly to people slamming their servers. You can read more about the rate limits for different API endpoints here. Here’s what concerns us:

  • The rate limiting windows are 15 minutes long. Every 15 minutes, the window resets.
  • We can make 15 calls to the statuses/home_timeline endpoint within a given window.
  • If we exceed this threshold, our GET request to the API will return a 429 (“Too many requests”) code that Twython will feed to us as a twython.TwythonRateLimitError exception
  • Twitter provides an API endpoint to query the rate limiting status of your application at application/rate_limit_status.
  • The application/rate_limit_status endpoint is itself rate limited to 180 requests per window.

If we don’t pass in any parameters, the application/rate_limit_status endpoint will return the rate limit statuses for every single API endpoint which is much more data than we need, so we’ll limit the data we get back by constraining the response to “statuses” endpoints:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])

This returns a JSON response which we only want a particular set of values from, so let’s select that bit out:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])
home_status = status['resources']['statuses']['/statuses/home_timeline']        

Finally, we’ll test how many API calls are remaining in the current window; if we’ve run out, we’ll set the application to sleep until the window resets, double-check that we’re ok, and then resume scraping. I’ve wrapped this procedure in a function to make it simple to perform this test:

def handle_rate_limiting():
    while True:
        status = twitter.get_application_rate_limit_status(resources = ['statuses'])
        home_status = status['resources']['statuses']['/statuses/home_timeline']        
        if home_status['remaining'] == 0:                
            wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
            time.sleep(wait)
        else:
            return

We’re only testing one of the API endpoints we’re hitting, though: we hit the application/rate_limit_status endpoint as well, so we should include that in our test just to be safe, although realistically there’s no reason to believe we’ll ever hit the limit for that endpoint.

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']        
            home_status = status['resources']['statuses']['/statuses/home_timeline']        
            if home_status['remaining'] == 0:                
                wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
                time.sleep(wait)
            else:
                return
        else:
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

Now that we have this, we can insert it into the while loop that performs the home timeline scraping function. While we’re at it, we’ll throw in some exception handling just in case this rate limiting function doesn’t work the way it’s supposed to.

while True:
    try:
        newest = None
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)        
        if home:
            while home:
                store_tweets(home)
                
                if newest is None:
                    newest = True
                    latest = home[0]['id']
                    
                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:            
            time.sleep(60)
    
    except TwythonRateLimitError, e:
        print "[Exception Raised] Rate limit exceeded"
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # adding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)

Storing Tweets in Mongo

First, we need to spin up the database/collection we defined in the config file.

from pymongo import Connection

DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

I’ve been calling a placeholder function “store_tweets()” above, let’s actually define it:

def store_tweets(tweets_to_save, collection=tweets):
    collection.insert(tweets_to_save)

Told you using mongo was easy! In fact, we could actually just replace every single call to “store_tweets(home)” with “tweets.insert(home)”. It’s really that simple to use mongo.

The reason I wrapped this in a separate function is because I actually want to process the tweets I’m downloading a little bit for my own purposes. A component of my project is going to involve calculating some simple statistics on tweets based on when they were authored, so before storing them I’m going to convert the time stamp on each tweet to a python datetime object. Mongo plays miraculously well with python, so we can actually store that datetime object without serializing it.

import datetime

def store_tweets(tweets_to_save, collection=tweets):
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

Picking up where we left off

The first time we run this script, it will scrape from the newest tweet back as far in our timeline as it can (approximately 800 tweets back). Then it will monitor new tweets and drop them in the database. But this behavior is completely contingent on the persistence of the “latest” variable. If the script dies for any reason, we’re in trouble: restarting the script will do a complete scrape on our timeline from scratch, going back as far as it can through historical tweets again. To manage this, we can query the “latest” variable from the database instead of just blindly setting it to “None” when we call the script:

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

And we’re done! The finished script looks like this:

import ConfigParser
import datetime
from pymongo import Connection
import time
from twython import Twython, TwythonRateLimitError

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

# spin up database
DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

def store_tweets(tweets_to_save, collection=tweets):
    """
    Simple wrapper to facilitate persisting tweets. Right now, the only
    pre-processing accomplished is coercing 'created_at' attribute to datetime.
    """
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        wait = 0
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']
            home_status = status['resources']['statuses']['/statuses/home_timeline']
            if home_status['remaining'] == 0:
                wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
                time.sleep(wait)
            else:
                return
        else:
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

no_tweets_sleep = 1
while True:
    try:
        newest = None # this is just a flag to let us know if we should update the "latest" value
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)
        if home:
            while home:
                store_tweets(home)

                # Only update "latest" if we're inside the first pass through the inner while loop
                if newest is None:
                    newest = True
                    latest = home[0]['id']

                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:
            time.sleep(60*no_tweets_sleep)

    except TwythonRateLimitError, e:
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # adding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)

Installing scientific python libraries in windows

I often have issues installing certain python libraries (such as scipy or opencv) in windows. One solution to this problem is to do it the hard way and just work your way through the errors as they come, winding your way through a maze of StackOverflow posts and mucking around in the registry. An alternative is to install a scientific python distribution like the Enthought Python Distribution, Continuum Analytics’ Anaconda, or Python(x,y). I’ve played around with Anaconda and Python(x,y) and they’re ok. Anaconda at least is supposed to be a little bit faster than the vanilla python installation, but I haven’t done any speed tests or anything.

Tonight I was wrestling with opencv. Anaconda didn’t have it in its package manager and I kept encountering roadblocks trying to install it in my generic windows installation (CPython). Luckily, I discovered this website which lists unofficial binaries for installing a lot of troublesome libraries in CPython for windows, both x64 and x32. The page is hosted by Christoph Gohlke at UC Irvine. If you see this Chris, you’re awesome.

Lifesaver.

Downloading images from saved reddit links

Reddit user aesptux posted to /r/learnprogramming requesting a code review of their script to download images from saved reddit links. This user made the same mistake I originally made and tried to work with the raw reddit JSON data. I hacked together a little script to show them how much more quickly they could accomplish their task using praw. It took almost no time at all, and the meat of the code is only about 20 lines. What a great API.

Here’s my code: https://gist.github.com/4225456

Weekend Project: Word Ladder Solver

<< I plan to add a few graphics, code snippets, and trim the code posted at the bottom, but I just haven’t gotten around to it. I wrote the bulk of this about two weeks ago now and just haven’t gotten around to finishing it. I’ll update this post when I can, but for now, I’m just going to publish since I have no time to clean it up. I’ll remove this message when I’ve made the post all nice and pretty (probably never).>>

The other day I learned about a game Lewis Carroll invented where you make “word ladders” or “word bridges.” The idea is you pick two words and try to get from one to the other via a chain of valid words, where each word in the chain differs from the previous word by a single letter. Here’s an example:

WORD
CORD
CORED
CORDED
CODDED
RODDED
RIDDED
RIDGED
RIDGE
BRIDGE

I immediately thought making a word bridge generator would be a fun little challenge to program, and it turned out to be a slightly more difficult problem than I’d anticipated. Let’s start by formalizing this game and we’ll work our way up to what specifically makes this challenging.

Word Similarity

For our purposes, two words are similar if you can get from one word to the other by “changing a letter.” This change can be an insertion (CORD->CORED), a deletion (RIDGED->RIDGE), or a substitution (WORD->CORD). This is a pretty useful similarity metric and is something of a go-to tool for a lot of natural language processing tasks: it’s called “edit distance” or “Levenshtein distance,” where the “distance” is the minimum number of edits to transform one word into the other, so if edit_distance(w1, w2) = 0, then w1 and w2 are identical. Note, it’s the MINIMUM number of edits. Edit distance calculation does not need to be edits-via-valid-words like I demonstrated above, so although I was able to transform WORD into BRIDGE with 9 edits, the edit distance between these two words is actually 5:

WORD
WRD — del
BRD — sub
BRID — ins
BRIDG — ins
BRIDGE — ins

The edit distance algorithm is a pretty neat little implementation of dynamic programming, but luckily I don’t need to build it up from scratch: python’s natural language toolkit has it built-in (be careful, it’s case sensitive):

from nltk import edit_distance
edit_distance('WORD','BRIDGE') # 5
edit_distance('WORD','bridge') # 6

GRAPH TRAVERSAL

In the context of this problem, we can think of our word list (i.e. the English dictionary) as an undirected graph where the nodes are words and similar words are connected. Using this formulation of the problem, a word bridge is a path between two nodes. There are several path finding algorithms available to us, but I’m only going to talk about two of the more basic ones: depth-first search (DFS) and breadth-first search (BFS).

DFS essentially goes as far along a particular branch as it can from a start node until it reaches a “terminal node” (can’t go any further) or a “cycle” (found itself back somewhere it had already explored), and then it backtracks, doing the same thing along all the unexplored paths. If you think of a tree climber trying to map out a tree, it would be like climbing up the trunk until reaching a fork, following one branch in the same direction all the way out to a leaf, working backwards to the last fork passed, going forward again until hitting another leaf, and so on.

Depth First Search

On the other hand, BFS looks at the graph in terms of layers. BFS first gets all the child nodes of the starting node, so now the furthest nodes are 1 edge from the start. Then it does the same thing for those children, so the furthest nodes in our searched subgraph are 2 edges away, and so on. A good visualization for this process is a really contagious virus spreading through a population. First, people in immediate contact with patient zero get sick. Then everyone those people are in immediate contact with gets sick, and so on. Replace “person” with “node” and “sick” with “searched,” and that’s BFS.

Breadth First Search

If we use DFS to traverse this graph, we will eventually find a word bridge (assuming one exists, since a path does not exist between every pair of nodes in our graph), but a) there’s no guarantee it will be the shortest path, and b) how long it takes really depends on the order of nodes in our graph, which is sort of silly. BFS will necessarily find the shortest path for us: given that we’ve searched to a depth of N edges and not found a path, we know there cannot be a word bridge of length N from our start word to our target word. Therefore, if we reach our target word on the N+1th iteration of BFS, we know that the shortest path is N+1 edits. It’s possible that there are multiple “shortest paths,” but we’ll stop when we reach the first one we see.
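A bare-bones BFS shortest-path over a dict-of-neighbors graph looks roughly like this (just a sketch for intuition; we’ll lean on networkx for the real path search later):

from collections import deque

def bfs_shortest_path(graph, start, goal):
    """graph is a dict mapping each word to an iterable of its neighbors."""
    predecessors = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:   # walk the predecessor chain back to the start
                path.append(node)
                node = predecessors[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in predecessors:
                predecessors[neighbor] = node
                queue.append(neighbor)
    return None   # no path exists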

THE BOTTLENECK

Here’s a not-so-obvious question: what part of this problem is going to cause us the biggest problems? Edit distance? The graph search? The real problem here is scale, the sheer size of the word list. Once we have everything set up, traversing the graph will be fairly easy, but building up the network will take some time.

To find a path between two arbitrary words, we need a reasonably robust dictionary. The one I had on hand for this project was 83,667 words. We need to convert this dictionary into a network. Conceptually, we do this with a similarity matrix.  We take all the words in the dictionary and put all of them on each axis. For each cell in the matrix, we populate the value with the edit distance between the two words that correspond with that cell. Since we’re only concerned with entries that had edit distance “1,” we can set every other value in the matrix to zero. This reduces our similarity matrix to an adjacency matrix, describing which nodes in the graph share an edge.

Adjacency Matrix of an undirected graph. (Image via bytehood.com)

Obviously, implementing this in an array structure would be wildly inefficient. Such a matrix would need to be 83,667 x 83,667, or 7,000,166,889 cells, and presumably most of them will be 0’s. If we allocate a byte to each cell in the array, this would take up 7 GB in memory. In actuality, elements of a python list are allocated 4 bytes each, with an additional 36 bytes of overhead for the list itself.

>>>import sys
>>>sys.getsizeof([])
36
>>>sys.getsizeof([1])
40
>>>sys.getsizeof([1,2])
44

Therefore, if our array is implemented as a list of lists (there are other options, such as a pandas dataframe or a numpy array, but this is the naive solution), the array would take up:

36 * (83,667+1) + 4 * 83,667^2 = 28,003,679,604 bytes = 28 GB

I certainly don’t have 28 GB of RAM. We’re going to need a “sparse” data structure instead, one where we get to keep the ones but can ignore the zeros. The simple solution here in python is a dictionary (which is basically just a hash table). For each word in our word list, find all similar words. In our python dictionary, the key is the searched word and the collection of matches will be our value (as a list, set, tuple…whatever).

# The adjacency matrix for the graph above converted into a sparser
# dictionary representation
## MATRIX FORM: 6 X 6 = 36 Elements
## SPARSE FORM: 18 elements
{'A':['B','C','D','E']
,'B':['A','D','E']
,'C':['A','F']
,'D':['A','B','F']
,'E':['A','B','F']
,'F':['C','D','E']
}

HEURISTICS

We’re not quite there yet. For each word, we need to look over the entire dictionary. This means our code scales with n^2 (where n is the number of words in our word list), which sucks. Even without that knowledge, you can try the naive search out and you’ll see it only gets through something like 20-100 words a minute. We have 83,667 words of varying size to get through before we’ve processed the graph we need for this problem, so we’ve got to come up with some tricks to make this faster.

One thing we can do is try to limit the number of words we run edit_distance() on. This function is reasonably fast, but it’s not as fast as a letter-by-letter comparison because it needs to build a matrix and do a bunch of calculations. For two words to have edit distance = 1, they can only differ by one letter. So we can compare the two-letter prefix of our search word against every word in the dictionary and skip all the words whose prefixes differ by two letters. BAM! Way faster. But on my computer, this code will still take way too long to run. So let’s do the same thing looking at suffixes. Even faster! I can’t remember, but I think at this point the estimated total run time for my code was 4 hours. Tolerable, but I’m impatient.
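To make that filter concrete, here’s a standalone sketch of the idea (my paraphrase as a hypothetical helper; the indexed version I actually used shows up further down):

from nltk import edit_distance

def plausible_neighbor(w1, w2, max_dist=1, affix=2):
    """Cheap screen before the full edit_distance call: word lengths and both
    short affixes have to be close. This is a lossy heuristic -- an edit at the
    very edge of a word can occasionally be filtered out."""
    if abs(len(w1) - len(w2)) > max_dist:
        return False
    if edit_distance(w1[:affix], w2[:affix]) > max_dist:
        return False
    return edit_distance(w1[-affix:], w2[-affix:]) <= max_dist

# neighbors = [k for k in words if k != j and plausible_neighbor(j, k) and edit_distance(j, k) == 1]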

The problem is that although we no longer need to look at every letter of every word during a single pass, we still need to look at every word in the dictionary. But we can use the prefix/suffix heuristic we just came up with to significantly reduce the size of the search space during any given pass, though.

DIVIDE AND CONQUER

Bubble sort of an unordered list

The classic divide and conquer algorithm is binary search. Consider a sorted list of numbers and some random number within the range of this list. You want to figure out where this number goes. The naive solution is a linear scan: start at the top of the list, compare your number against that number, and if yours is bigger, move to the next item. Do this until you find your number’s spot in the order. At worst, you have to look through every number in the list. For a sorted list, a linear scan is almost always the wrong algorithm.

Binary Search

The idea behind binary search is that because we know the list is already ordered, we can throw away half the list each time. Let’s say the list you’re given is 100 numbers. Jump to the middle and compare. Say your number is smaller: we don’t need to consider the top 50 numbers at all now. Now jump to the middle of the remaining 50. Bigger or smaller? Throw away another 25. And so on. In a list of 100 numbers, binary search will find the position of your random number in at most 7 steps (log_2(100)). That’s quite an improvement over the linear scan’s 100 steps.
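For reference, here’s a minimal binary search (essentially what Python’s bisect.bisect_left does):

def binary_search(sorted_nums, x):
    """Return the index where x belongs in the sorted list sorted_nums."""
    lo, hi = 0, len(sorted_nums)
    while lo < hi:
        mid = (lo + hi) // 2
        if sorted_nums[mid] < x:
            lo = mid + 1      # x belongs somewhere in the upper half
        else:
            hi = mid          # x belongs in the lower half (or at mid)
    return lo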

OVERLAPPING SUBPROBLEMS

Binary search isn’t really applicable to our problem, but we can still discount a significant amount of the search space if we appropriately pre-process our word list. As we discovered above, we can divide the word list into 3 groups relative to our target word: both two-letter affixes match, the prefix and/or suffix matches, or neither affix matches. There are a limited number of prefixes and suffixes in our word list, so if we’re searching a word that has the same affix as one we’ve already searched, we’re unnecessarily repeating work. This is what’s known as an overlapping subproblem, and wherever you can identify one you can make your code faster. Oftentimes a lot faster.

We’re going to handle this affix problem by grouping the words by prefix and suffix. In other words, we’re going to build two search indexes: an index on prefix and an index on suffix. I did this using a dictionary where the affix is the key and the matching words are the values.

def build_index(words, n_letters=2, left=True, verbose=True):
    n=n_letters
    res={}
    for w in words:
        if left:
            ix = w[:n]
        else:
            ix = w[(-1)*n:]
        d = res.get(ix, [])
        d.append(w)
        res[ix] = d
    return res

>>> build_index(words)

{'AA':
['AA',
 'AAH',
 'AAHED',
 'AAHING',
 'AAHS',
 'AAL',
 'AALII',
 'AALIIS',
 'AALS',
 'AARDVARK',
 'AARDWOLF',
 'AARGH',
 'AARRGH',
 'AARRGHH',
 'AAS',
 'AASVOGEL']
,'AB':
['AB',
 'ABA',
 'ABACA',
 'ABACAS',
 'ABACI',
 'ABACK',
 'ABACUS',
 'ABACUSES',
...
 'ZYMOSES',
 'ZYMOSIS',
 'ZYMOTIC',
 'ZYMURGY',
 'ZYZZYVA',
 'ZYZZYVAS']
,'ZZ':
['ZZZ']}

This significantly speeds up our code, but if you watch the code run, it seems to speed up and slow down in bursts. This isn’t because your RAM is wonky. It’s because the elements in our indexes aren’t evenly distributed. Specifically, two-letter prefixes are pretty evenly distributed, but two-letter suffixes are not.

To remedy this we’re going to use a three-letter index. An unfortunate result of my index implementation is that words that are smaller than the index size (here, two-letter words) don’t get considered when looking up words of size greater than or equal to the index. So although “AH” and “AAH” are similar, they’re contained in separate indexes with no overlap, so when I search for words similar to “AA” I’ll get “AT” but not “AAH,” and similarly when I look for words similar to “AAH” I’ll get “AAHS” but not “AA” or “AH.” This isn’t perfect, but I strongly suspect the effect on the final product is negligible. I could just use a separate index size for the two affixes, but this was simpler to code and it only just occurred to me that I could use a two-letter index on the left and a three-letter index on the right. So sue me. I regret nothing. The three-letter index is faster anyway.
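A quick way to tally distributions like the ones below (just a sketch; `words` is the word list from earlier):

from collections import Counter

suffix2 = Counter(w[-2:] for w in words if len(w) >= 2)
suffix3 = Counter(w[-3:] for w in words if len(w) >= 3)
print suffix2.most_common(6)
print suffix3.most_common(6)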

## Most Common suffixes: two letters
# ES     6810
# ED     6792
# ER     5296
# NG     4222
# RS     3480
# TS     2332
## Most Common suffixes: three letters
# ING    4062
# ERS    2743
# IER    1142
# EST    1035
# TED    1014
# IES     985
#   NB: The top two three-letter suffixes combined comprise about
#   the same amount of the wordlist as the top two most popular
#   two-letter suffixes separately, and the rest of the suffixes
#   are pretty uniformly distributed.

If you look at the X axis of the histograms below for the three letter indices you’ll notice that two letter words are segregated into their own buckets. If we made similar histograms for indices of four or five letters, these buckets would manifest as an unusual density on the far left of the graph.

With our indexes in place, we can now generate our similarity network in a reasonable amount of time: about 90 minutes on my computer.

import time
from nltk import edit_distance

def build_wordnet(wordlist, max_dist=1, index_size=None, verbose=True):
    """
    Returns a word similarity network (as a dict) where similarity is
    measured as edit_distance(word1,word2) <= max_dist.
    This implementation doesn't properly account for words smaller than
    the index size.

    index_size defaults to max_dist+1, but you should check the
    distribution  of prefixes/affixes in your wordlist to judge.
    In an 80K dictionary, I found a 3 letter index to be suitable,
    although this did result in the isolation of two letter words from
    the major component(s) of the network.
    """
    if index_size is None:
        index_size = max_dist + 1
    if verbose:
         print "Building right index..."
    R_ix = build_index(wordlist, n_letters=index_size, left=False)
    if verbose:
         print "Building left index..."
    L_ix = build_index(wordlist, n_letters=index_size, left=True)

    def check_index(j, affix, index, left=True):
        right = not left
        for k in index[affix]:
            if abs(len(j)-len(k))>max_dist:
                continue
            if right and edit_distance(j[-1*index_size:], k[-1*index_size:]) > max_dist:
                continue
            if left and edit_distance(j[:index_size], k[:index_size]) > max_dist:
                continue
            if j == k:
                continue
            if edit_distance(j,k) <= max_dist:
                similarity_network[j].update([k])

    similarity_network = {}
    n=0
    start = time.time()
    now = 0
    last = 0
    for j in wordlist:
        similarity_network[j]=set()
        prefix = j[:index_size]
        suffix = j[-1*index_size:]
        check_index(j,suffix, R_ix, left=False)
        check_index(j,prefix, L_ix, left=True)

        if verbose:
            n+=1
            now = int(time.time() - start)/60
            if now%5 == 0 and now == last+1:
                print n, int(time.time() - start)/60
            last = now
    return similarity_network

At this point you should really think about saving your result so you don’t have to wait 90 minutes every time you want to play with this tool we’re building (i.e. we should only ever have to run these calculations once). I recommend pickling the dictionary to a file.

import cPickle

datafile = 'similarity_matrix.dat'

f=open(datafile,'wb')
p = cPickle.Pickler(f)
g = build_wordnet(words)
p.dump(g)
f.close()

NETWORKS

Now that we’ve got the structure in place, we’re ready to find some word bridges! When I first started this project, I tweaked a recipe I found online for my BFS function. It gave me results, but it was messy so I’ll spare you my code. Because we’re working with a graph, we can take advantage of special graph libraries. There are several options in python, one of which is igraph which I’ve played with a little in R and it’s fine. After some googling around, I got the impression that the preference for python is generally the networkx library, so I decided to go with that.

All we have left to do is convert our graph into a networkx.Graph object, and call the networkx path search algorithm.

import networkx as nx
net = nx.Graph()
net.add_nodes_from(words)
for k,v in g.iteritems():
    for v_i in v:
        net.add_edge(k,v_i)

w1 = raw_input("Start Word: ")
w2 = raw_input("End Word:   ")
nx.shortest_path(net, w1, w2)   # BAM! Don't need to reinvent the motherfucking wheel.

Really, we could have just used this data structure from the start when we were building the network up, but the dictionary worked just fine, the conversion step is still fast, I strongly suspect the native dict pickles to a smaller file than an nx.Graph would (feel free to prove it to yourself), and it was also a better pedagogical tool. Moreover, networkx graphs are dictionary-like, so the idea is still the same.
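(If you do want to prove it to yourself, something like this would do the trick; a throwaway sketch with hypothetical filenames:)

import cPickle, os

for name, obj in [('dict', g), ('nxgraph', net)]:
    fname = 'wordnet_%s.pkl' % name    # hypothetical scratch files
    with open(fname, 'wb') as f:
        cPickle.dump(obj, f, protocol=2)
    print name, os.path.getsize(fname), 'bytes'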

MAKING IT BETTER: OOP

A feature we’re lacking here is the ability to add missing words. After playing with this a little I’ve found that the word list I started with is short a few words, and doesn’t contain any proper nouns (preventing me from using my own name in word bridges, for example). In the current implementation, to add a single word I would need to rebuild the whole network. I could build in a new function that adds a word to the graph, but I’d want to add this word into indexes too. This is getting complicated. But if I rebuild my tools using OOP principles, this feature will be pretty simple to add and I can reuse most of my existing code (by wrapping several things that currently appear in loops as methods). There are two kinds of entities we’re going to want to represent: indexes, and word networks.


class Index(object):
    '''
    Container class for a wordbridge affix index, which gets stored as a
    dict in the index attribute.
    '''
    def __init__(self, index_size=2, left=True):
        self.index_size = index_size
        self.left = left
        self.index = {}
        self._check_if_word_in_index = False
    def in_index(self, word):
        try:
            for w in self.index[self.get_affix(word)]:
                if w==word:
                    return True
        except:
            pass
        return False
    def get_affix(self, word):
        n = self.index_size
        if self.left:
            ix = word[:n]
        else:
            ix = word[(-1)*n:]
        return ix
    def add_word(self, word):
        if self._check_if_word_in_index:
            if self.in_index(word):
                return
        ix = self.get_affix(word)
        d = self.index.get(ix, [])
        d.append(word)
        self.index[ix] = d
        #self.words.append(word)
    def add_words(self, words):
        if type(words) in [type(()),type([]), type(set())]:
            for w in words:
                self.add_word(w)
        elif type(words) == type(''):
            self.add_word(words)
        else:
            # This should really be an error
            return "Invalid input type"

class Wordbridge(object):
    def __init__(self, words=None, wordslist_filepath=None):
        self.g = nx.Graph()
        self.indexes = []
        self.words = set()
        if type(wordslist_filepath) == type(''):
            self.seed_words_from_file(wordslist_filepath)
        if words:
            self.seed_words(words)
    def graph_from_dict(self, net_dict):
        for k,v in net_dict.iteritems():
            for v_i in v:
                self.g.add_edge(k,v_i)
        self.words.update(net_dict.keys())
    def update_indexes(self, word):
        for ix in self.indexes:
            ix.add_word(word)
    def add_word(self, word):
        w = word.upper()
        if w not in self.g: # only add if the node isn't already in the graph
            self.update_indexes(w)
            self.g.add_node(w)
            self.check_indexes(w)
    def add_index(self, index_size, left=True):
        '''
        Creates a new index, stored in the Wordbridge.indexes list.
        '''
        ix = Index(index_size, left)
        ix.add_words(self.words)
        self.indexes.append(ix)
    def seed_words(self, words):
        self.words.update(words)
        self.g.add_nodes_from(self.words)
    def seed_words_from_file(self,path):
        with open(path) as f:
            word_list = f.read().split()
        self.seed_words(word_list)
    def check_indexes(self, j, max_dist=1):
        '''
        Finds all words similar to j via the indexes and adds the
        corresponding edges to the graph.
        '''
        candidates = set()
        n = max_dist+2
        for ix in self.indexes:
            right = not ix.left
            left = ix.left
            affix = ix.get_affix(j)
            candidates.update(ix.index[affix])
        for k in candidates:
            if abs(len(j)-len(k))>max_dist:
                continue
            # this is sort of redundant because of the index design, but probably helps a tick.
            if edit_distance(j[-1*n:], k[-1*n:]) + edit_distance(j[:n], k[:n]) > max_dist:
                continue
            if j == k:
                continue
            if edit_distance(j,k) <= max_dist:
                #similarity_network[j].update([k])
                self.g.add_edge(j,k)
    def build_network(self, verbose=False):
        n=0
        start = time.time()
        now = 0
        last = 0
        for j in self.words:
            self.check_indexes(j)
            if verbose:
                n+=1
                now = int(time.time() - start)/60
                if now%5 == 0 and now == last+1:
                    print n, int(time.time() - start)/60
                last = now
    def wordbridge(self, w1, w2):
        try:
            return nx.shortest_path(self.g, w1.upper(), w2.upper())
        except: # NetworkXNoPath error
            return "No path between %s and %s" % (w1, w2)

Playing With Pandas: DataFrustration

After putting up the post a few weeks ago where I began to analyze the /r/washingtondc community, I received some feedback regarding my methodology. The comment that most resonated with me was “Why are you trying to do your analysis in R if you’ve already got the data in python? Why not just do the analysis and develop your visualizations there?” I had good reasons to use R, mainly: I saw an opportunity to learn plyr, and I don’t really know my way around the numerical programming and plotting tools in python (primarily numpy, scipy, pylab, matplotlib, scikit-learn, and pandas).

As I’ve mentioned previously, I recently started grad school and I’m going to need R specifically in my program, so my need to learn R is much more urgent than my need to develop expertise in python. But frankly, I just enjoy using python more, and I’d rather do everything in one environment if I can so learning the python data analytics libraries is a very attractive prospect to me. Wes McKinney, the author of the pandas library, is publishing a book through O’Reilly called Python for Data Analysis that I’m super interested in. The problem is right before I started my graduate program, I splurged a little and bought some textbooks I had been drooling over but have barely had opportunity to crack into (Drew Conway and John Myles White’s Machine Learning for Hackers and Christopher M. Bishop’s Pattern Recognition and Machine Learning) so I’m apprehensive to buy more books until I can bite into the books I’ve already bought (while I’m dreaming of a world where I have time, I also really want to pick up Nate Silver’s new book The Signal And The Noise).

Instead of buying a new book (I expect I’ll buy it later), for the moment let’s do this the ol’ fashioned way and hit the documentation. The documentation for pandas looks to be pretty thorough: I recommend starting with the Intro to Data Structures.

Now, it might be because I’m coming into this with very little NumPy experience, but right off the bat the DataFrame class, the heart of pandas, seems awkward to me. To illustrate my problem, let me compare how to take a particular slice in R vs. pandas.

What I want to do is isolate those columns of the dataframe whose first row is below some threshold value:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)

df <- data.frame(A,B,C,D,E)
sliced_df <- df[ , df[1,]<.5 ]

This is pretty straightforward. In the first 5 lines I create labeled vectors of 10 random numbers drawn from the uniform distribution (in the range (0,1)). Then I form a dataframe from these vectors. Finally, from the dataframe I select all rows, but only those columns where the value of the first row is less than 0.5.

> df
            A           B         C           D          E
1  0.45274205 0.543755858 0.2225730 0.643710467 0.44527644
2  0.55692168 0.687034039 0.8480953 0.494917616 0.98080695
3  0.19127556 0.419290813 0.2744206 0.005422064 0.58559636
4  0.58410947 0.094669003 0.3284746 0.891122109 0.05962251
5  0.94561895 0.022608545 0.9431832 0.951050056 0.38312492
6  0.72316858 0.003073411 0.3336150 0.201627465 0.89433597
7  0.02145899 0.685167549 0.5754166 0.371717998 0.06746820
8  0.47334489 0.143967454 0.4463423 0.959687645 0.64947595
9  0.75215197 0.068791088 0.0343898 0.117595073 0.28861395
10 0.78567118 0.398529395 0.6467450 0.883467028 0.86369047

> sliced_df
            A         C          E
1  0.45274205 0.2225730 0.44527644
2  0.55692168 0.8480953 0.98080695
3  0.19127556 0.2744206 0.58559636
4  0.58410947 0.3284746 0.05962251
5  0.94561895 0.9431832 0.38312492
6  0.72316858 0.3336150 0.89433597
7  0.02145899 0.5754166 0.06746820
8  0.47334489 0.4463423 0.64947595
9  0.75215197 0.0343898 0.28861395
10 0.78567118 0.6467450 0.86369047

Here’s how we do the same thing in python, using the pandas.DataFrame datatype (this might not look so bad, but DataFrame.T is the transpose method): <<Edit: Actually, since finishing this post I figured out a better way, which I discuss towards the end of this post>>

import pandas as pd
import random

def runif(n):
    # n draws from the uniform distribution on (0,1), mirroring R's runif()
    res = []
    for i in range(n):
        res.append(random.random())
    return res

A = runif(10)
B = runif(10)
C = runif(10)
D = runif(10)
E = runif(10)

df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E})
sliced_df = df.T[ df.T[0]<0.5 ].T

To start with, I have to create a function to build sequences of random numbers. I strongly suspect that some such function exists in NumPy or somewhere, but I don’t know what it is, so I made my own (which has the added benefit of making my code align better with the R example). <<EDIT: Since typing this blogpost, I’ve learned that the function I needed was numpy.random.rand (numpy.random.randn is its normal-distribution cousin).>>
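
For reference, here's a minimal sketch of what that numpy shortcut looks like (nothing beyond numpy and pandas is assumed):

import numpy as np
import pandas as pd

# numpy.random.rand draws from the uniform distribution on [0, 1),
# so no hand-rolled runif() is needed
df = pd.DataFrame({col: np.random.rand(10) for col in 'ABCDE'})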

Next, populating the dataframe is already extremely awkward. From the documentation, it seems like the pandas.DataFrame is primarily designed to be generated from dictionaries. What particularly annoys me about this is that, considering the DataFrame is supposed to be a container for the pandas.Series class, I can’t just feed pandas.DataFrame() a list of pandas.Series objects; they have to be in a dictionary. I can create a DataFrame from a single Series like so:

>>> df = pd.DataFrame(A)
>>> df

0
0  0.446686
1  0.696637
2  0.265383
3  0.165647
4  0.861040
5  0.347737
6  0.280177
7  0.980586
8  0.334544
9  0.645143

Note how the column name was changed to “0”. I would have expected the variable name to become the column name like in R, but nope, no such luck. The same thing happens if we feed DataFrame() two series, which makes it hard to distinguish them for slicing.

>>> df = pd.DataFrame(A,B)
>>> df

0
0.689893  0.446686
0.373250  0.696637
0.619527  0.265383
0.775759  0.165647
0.819883  0.861040
0.099763  0.347737
0.143239  0.280177
0.303460  0.980586
0.975462  0.334544
0.672082  0.645143

>>> df[0]

0.689893    0.446686
0.373250    0.696637
0.619527    0.265383
0.775759    0.165647
0.819883    0.861040
0.099763    0.347737
0.143239    0.280177
0.303460    0.980586
0.975462    0.334544
0.672082    0.645143
Name: 0

If I try to create a dataframe from three or more series, I just get an error.

>>> df = pd.DataFrame(A,B,C)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 412, in __init__
    copy=copy)
  File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 506, in _init_ndarray
    block = make_block(values.T, columns, columns)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 461, in make_block
    do_integrity_check=do_integrity_check)
  File "C:\Python27\lib\site-packages\pandas\core\internals.py", line 26, in __init__
    assert(len(items) == len(values))
AssertionError

Frankly, I’m not sure why two series even worked. (It turns out the DataFrame constructor’s second and third positional arguments are the index and the column labels, so pd.DataFrame(A,B) quietly used B as the row index, and pd.DataFrame(A,B,C) blew up the length check when it tried to use the 10 values of C as column labels for a single column of data.) Anyway, the DataFrame object wants a dictionary so it knows explicitly how to name each data set. You’d think the variable names would be sufficient, and that’s the intuitive solution we get in R, but no such luck in pandas. For the time being this is a minor problem, but we’ll see what happens when I start working with larger datasets.
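
For what it’s worth, there does seem to be a way to skip the dictionary: if each Series is given an explicit name, pd.concat can stitch them together column-wise (this sketch reuses the A, B, and C lists from earlier):

import pandas as pd

# name each Series explicitly, then stack them side by side as columns
a = pd.Series(A, name='A')
b = pd.Series(B, name='B')
c = pd.Series(C, name='C')
named_df = pd.concat([a, b, c], axis=1)  # column names are taken from the Series names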

What really irks me is that the slicing operations are a little funny, and things that should be able to handle booleans can’t. Basically, slicing dataframes is much less intuitive in pandas than in R or Octave (MATLAB), and this concerns me, since this sort of thing is what enables vectorized operations (i.e. replacing slower for loops).

The main problem is that the df[x] operator doesn’t take both rows and columns at the same time like in R (e.g. df[rows, columns]). But weirdly, it can take one or the other separately. In pandas, df[x] accepts several different kinds of ‘x’: a column name, a list of column names, a slice of indices, or a conditional statement. If I give df[x] a conditional statement or a slice of indices (e.g. df[0:3]), they apply to the rows. Did you catch that? If I feed it labels that match column names, I get columns back. If I feed it slice indices, I get rows back. What the fuck, pandas. I can’t be the only person who thinks this is spectacularly confusing. A particular slice method (e.g. [ ]) should give me back either columns or rows, not both.

>>> df['A']

0    0.446686
1    0.696637
2    0.265383
3    0.165647
4    0.861040
5    0.347737
6    0.280177
7    0.980586
8    0.334544
9    0.645143
Name: A

>>> df[['A','B']]

A         B
0  0.446686  0.689893
1  0.696637  0.373250
2  0.265383  0.619527
3  0.165647  0.775759
4  0.861040  0.819883
5  0.347737  0.099763
6  0.280177  0.143239
7  0.980586  0.303460
8  0.334544  0.975462
9  0.645143  0.672082

>>> df[ df['A']>0.5 ]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
4  0.861040  0.819883  0.397820  0.735031  0.110483
7  0.980586  0.303460  0.117398  0.033969  0.731914
9  0.645143  0.672082  0.138268  0.730780  0.969545

>>> df[1:3]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
2  0.265383  0.619527  0.633576  0.553009  0.599043

This means if I want just those columns that meet a particular row condition, I have to first take the transpose of the dataframe to make the rows into columns, apply my conditional statement to the appropriate column (which requires taking the transpose again), and take the transpose one last time to flip the returned data set back into the original orientation. Hence:

sliced_df = df.T[ df.T[0]<0.5 ].T

Having written out this entire rant, I finally figured out the better way of doing this. I’m still annoyed by all the stuff I described above, but here’s how you accomplish this slice without using the transpose at all: you use the DataFrame.ix method. This allows for more intuitive, R-style indexing of the form df.ix[rows, columns]. It accepts labels, integer indices, and booleans. Do I sound dejected? This took me a while to figure out.

>>> df.ix[:,:]

A         B         C         D         E
0  0.242808  0.829024  0.027734  0.510985  0.430466
1  0.301553  0.834208  0.600806  0.773148  0.119008
2  0.968252  0.098827  0.290203  0.555629  0.652359
3  0.351365  0.391068  0.352370  0.531282  0.478862
4  0.513526  0.138082  0.538826  0.252554  0.486603
5  0.705628  0.362105  0.800225  0.977828  0.454140
6  0.097671  0.613972  0.712334  0.473130  0.886449
7  0.386206  0.520115  0.589156  0.722709  0.293428
8  0.337381  0.102242  0.296870  0.725426  0.475001
9  0.076314  0.894782  0.115159  0.592838  0.402849

>>> df.ix[2:3,'B':'E']

B         C         D         E
2  0.098827  0.290203  0.555629  0.652359
3  0.391068  0.352370  0.531282  0.478862

>>> df.ix[0,:] < 0.5

A     True
B    False
C     True
D    False
E     True
Name: 0

>>> df.ix[ :, df.ix[0,:] < 0.5 ]

A         C         E
0  0.242808  0.027734  0.430466
1  0.301553  0.600806  0.119008
2  0.968252  0.290203  0.652359
3  0.351365  0.352370  0.478862
4  0.513526  0.538826  0.486603
5  0.705628  0.800225  0.454140
6  0.097671  0.712334  0.886449
7  0.386206  0.589156  0.293428
8  0.337381  0.296870  0.475001
9  0.076314  0.115159  0.402849
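
Worth noting: later versions of pandas added the label-based .loc and position-based .iloc indexers and eventually retired .ix, so the same slice would look something like this with .loc:

# equivalent slice with the label-based indexer; the boolean mask picks columns
sliced_df = df.loc[:, df.loc[0, :] < 0.5]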

One thing that is much less confusing about these dataframes is that if I want to plot something, all I have to do is:

from pylab import plot, show

plot(df)
show()

And I get a single plot where each line is a separate column of data. I would have had to use a for loop otherwise, so that at least is nice and intuitive.
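
For contrast, here’s roughly what that for loop would have looked like (just a sketch; the label and legend calls are only there to tell the lines apart):

from pylab import plot, legend, show

# plot each column as its own line, the long way around
for col in df.columns:
    plot(df[col], label=col)
legend()
show()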

These are normal growing pains, as with learning any new tool. If you’ve got a few hours on your hands and don’t feel like mucking through the documentation, here’s a pycon tutorial Wes gave earlier this year that’s pretty interesting (I’ve only made my way through about an hour’s worth myself, but it looks like a thorough introduction). NB: The first 15 minutes or so he’s just setting up IPython, so I recommend you skip ahead a bit.