I before E except after C

A recent post on Patrick Vennebush blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, just based on the subjective experience of educators. It’s not a horrible rule, but certainly we can more intelligently construct better rules of this kind (for lack of a better term, I’ll be refering to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off ūüôā

You can find my full methodology and analysis in the following IPython notebook:

http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3

The short version

I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:

ie: 3950 	cie: 256
ei: 2607 	cei: 156

I thought that perhaps although the IEC rule doesn’t work when we look at the unique words in our vocabulary, perhaps it might hold true if we look at trigram and bigram frequencies across word usage in written text. Here are the frequencies for the IEC rule in the Brown corpus:

ie: 13275 	cie: 1310
ei: 5677 	cei: 485

Nope, still no good.

Instead of the IEC rule, here are some alternatives (taken from my¬†vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram frequency percentage count(AB)/(count(AB)+count(BA)) is at least 1/2, and the laplace smoothed trigram frequency ratio (1+count(cba))/(1+count(cab)) is maximized:

  • P before E except after C
pe: 8052 	cpe: 0
ep: 5053 	cep: 955
  • E before U except after Q
eu: 2620 qeu: 0
ue: 1981 que: 949
  • I before C except after I
ic: 26140 iic: 1
ci: 6561 ici: 1830
  • T before E except after M
te: 27265 mte: 2
et: 11743 met: 2684
  • R before D except after N
rd: 3641 nrd: 0
dr: 2738 ndr: 808

 Update

After posting this article, there was some discussion that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these condiitons:

  • O before U except after Q
ou 12144    qou 0
uo 671      quo 122
  • I before O except after J
io 15247    jio 0
oi 4040     joi 95
Advertisements

Scraping your twitter home timeline with python and mongodb

Background

About a year and a half ago I was hanging out with two colleagues, John and Jane. John and I were discussing various new happenings we’d heard about recently. Jane was very impressed with how current we were and wondered how we did it. I described how I subscribe to several blogs and that suits me fine, but John insisted that we both needed to try Twitter.

I buckled down and finally created a twitter account. I didn’t really know who to follow, so picked a prominent local data scientist and let him “vet” users for me: I skimmed his “following” list and decided to also follow anyone who’s description made them sound reasonably interesting (from a data science stand point). The problem with this method is that I ended up following a bunch of his random friends who don’t actually talk about data science. Right now, there’s just too much happening on me twitter feed to keep up. If I don’t check it every now and then, I’ll quickly amass hundreds of backlogged tweets, so I have strong motivation to “trim the fat” of my following list.

Setup

To get started, I’m going to explain how to scrape your twitter homepage. But first things first, we’re going to need a few things:

Twitter API wrapper

There are several python twitter API wrappers available right now. I did some research back when I first started tinkering with twitter and landed on the Twython package. I don’t remember what led me to it, but I think the main thing is that it has a strong community and so there’s a lot of good documentation and tutorials describing how to use it.

To install Twython, just use pip like you would for most anything else:

pip install twython

No surprises here.

Twitter API authentication

We’re going to need to do two things to get our scraper working with twitter. First, we need to register a new app at http://apps.twitter.com. If your desired app name is taken, just add your username to make it unique. It’s not mentioned anywhere on the page, but you can’t have the ‘@’ symbol in your app name (or at least, it can’t be preceded by the ‘@’ symbol).

Next, register an access token for your account. It only needs to have read-only permissions, and keeping it this way ensures we won’t do any real damage with our experiment.

Finally, store the authentication information in a config file (I called mine “scraper.cfg”) like so:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

MongoDB

Finally, we’re going to need to set up a repository to persist the content we’re scraping. My MO is usually to just use SQLite and to maybe define the data model using SQLAlchemy’s ORM (which is totally overkill but I still do it anyway for some reason). The thing here though is:

1. There’s a lot of information on tweets

2. I’m not entirely sure which information I’m going to find important just yet

3. The datamodel for a particular tweet is very flexible and certain fields may appear on one tweet but not another.

I figured for this project, it would be unnecessarily complicated to do it the old fashioned way and, more importantly, I’d probably be constantly adding new fields to my datamodel as the project developed, rendering my older scrapes less valuable because they’d be missing data. So to capture all the data we might want, we’re going to just drop the tweets in toto in a NoSQL document store. I chose mongo because I’d heard a lot about it and found it suited my needs perfectly and is very easy to use, although querying it uses a paradigm that I’m still getting used to (we’ll get to that later).

Download and install MongoDB from http://docs.mongodb.org/manual/installation/.
I set the data directory to be on a different (larger) disk than my C drive, so I start mongo
like this:

C:\mongodb\bin\mongod --dbpath E:\mongodata\db

We will need to run this command to start a mongo listener before running our scraper. Alternatively, you could just drop a system call in the scraper to startup mongo, but you should check to make sure it’s not running first. I found just spinning up mongo separately to be simple enough for my purposes.

Since we’ve already got a config file started, let’s add our database name and collection (NoSQL analog for a relational table) to the config file, so our full config file will look like:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

[database]
name:twitter_test
collection:home_timeline

Take note: all we have to do to define the collection is give it a name. We don’t need to describe the schema at all (which, as described earlier, is part of the reason I’m using mongo for this project).

Getting Started

So we’re all set up with twython and mongo: time to start talking to twitter.

We start by calling in the relevant configuration values and spinning up a Twython instance:

import ConfigParser
from twython import Twython

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

To get the most recent tweets from our timeline, we hit the “/statuses/home_timeline” API endpoint. We can get a maximum of 200 tweets per call to the endpoint, so let’s do that. Also, I’m a little data greedy, so I’m also going to ask for “contributor details:”

params = {'count':200, 'contributor_details':True}
home = twitter.get_home_timeline(**params)

Now, if we want to do persistent scraping of our home feed, obviously we can’t just wrap this call in a while loop: we need to make sure twitter knows what we’ve already seen so we only get the newest tweets. To do this, we will use the “since_id” parameter to set a limit on how far back in the timeline the tweets in our response will go.

Paging and Cursoring

This is going to be a very brief overview of the motivation behind cursoring and how it works. For a more in depth explanation, check the twitter docs here: https://dev.twitter.com/docs/working-with-timelines

Consider a situation in which, since the last call to the timeline, so many new tweets have been written that we can’t get them all in a single call. Twitter has a “paging” option, but if we use this, it’s possible that the tweets on the bottom of one page will overlap with the tweets on the top of the next page (if new tweets are still coming into the timeline). So instead of “paging” we’ll use “cursoring:” in addition to giving twitter a limit for how far back we can go, we’ll also give a limit for the most recent tweet in any particular call. We’ll do this using a “max_id” parameter. The API will still return the tweet with this ID though, so we want to set the max_id value just lower than the last tweet we saw. If you’re in a 64bit environment, you can do this by subtracting ‘1’ from the id.

Putting this all together, here’s what our persistent scraper looks like so far:

latest = None   # most recent id we've seen
while True:
    try:
        newest = None # this is just a flag to let us know if we should update the value of "latest"
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        home = twitter.get_home_timeline(**params)        
        if home:
            while home:
                store_tweets(home) # I'll define this function in a bit
                
                # Only update "latest" if we're inside the first pass through the inner while loop
                if newest is None:
                    newest = True
                    latest = home[0]['id']
                    
                params['max_id'] = home[-1]['id'] - 1
                home = twitter.get_home_timeline(**params)

Rate limiting

As with pretty much any web API, twitter doesn’t take too kindly to people slamming their servers. You can read more about the¬†rate limits for different API endpoints here. Here’s what concerns us:

  • The rate limiting windows are 15 minutes long. Every 15 minutes, the window resets.
  • We can make 15 calls to the statuses/home_timeline endpoint within a given window.
  • If we exceed this threshold, our GET request to the API will return a 429 (“Too many requests”) code that Twython will feed to us as a twython.TwythonRateLimitError exception
  • Twitter provides an API endpoint to query the rate limiting status of your application at application/rate_limit_status.
  • The application/rate_limit_status endpoint is itself rate limited to 180 requests per window.

If we don’t pass in any parameters, the application/rate_limit_status endpoint will return the rate limit statuses for every single API endpoint which is much more data than we need, so we’ll limit the data we get back by constraining the response to “statuses” endpoints:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])

This returns a JSON response wihch we only want a particular set of values from, so let’s select that bit out:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])
home_status = status['resources']['statuses']['/statuses/home_timeline']        

Finally, we’ll test how many API calls are remaining in the current window, and if we’ve run out set the application to sleep until the window resets, double check that we’re ok, and then resume scraping. I’ve wrapped this procedure in a function to make it simple to perform this test:

def handle_rate_limiting():
    while True:
        status = twitter.get_application_rate_limit_status(resources = ['statuses'])
        home_status = status['resources']['statuses']['/statuses/home_timeline']        
        if home_status['remaining'] == 0:                
            wait = max(home_status['reset'] - time.time(), 0) + 1 # addding 1 second pad
            time.sleep(wait)
        else:
            return

We’re only testing one of the API endpoints we’re hitting though: we’re hitting the application/rate_limit_status endpoint as well, so we should include that in our test just to be safe although realistically, there’s no reason to believe we’ll ever hit the limitation for that endpoint.

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']        
            home_status = status['resources']['statuses']['/statuses/home_timeline']        
            if home_status['remaining'] == 0:                
                wait = max(home_status['reset'] - time.time(), 0) + 1 # addding 1 second pad
                time.sleep(wait)
            else:
                return
        else :
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

Now that we have this, we can insert it into the while loop that performs the home timeline scraping function. While we’re at it, we’ll throw in some exception handling just in case this rate limiting function doesn’t work the way it’s supposed to.

while True:
    try:
        newest = None
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)        
        if home:
            while home:
                store_tweets(home)
                
                if newest is None:
                    newest = True
                    latest = home[0]['id']
                    
                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:            
            time.sleep(60)
    
    except TwythonRateLimitError, e:
        print "[Exception Raised] Rate limit exceeded"
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # addding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)

Storing Tweets in Mongo

First, we need to spin up the database/collection we defined in the config file.

from pymongo import Connection

DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

I’ve been calling a placeholder function “store_tweets()” above, let’s actually define it:

def store_tweets(tweets_to_save, collection=tweets):
    collection.insert(tweets_to_save)

Told you using mongo was easy! In fact, we could actually just replace every single call to “store_tweets(home)” with “tweets.insert(home)”. It’s really that simple to use mongo.

The reason I wrapped this in a separate function is because I actually want to process the tweets I’m downloading a little bit for my own purposes. A component of my project is going to involve calculating some simple statistics on tweets based on when they were authored, so before storing them I’m going to convert the time stamp on each tweet to a python datetime object. Mongo plays miraculously well with python, so we can actually store that datetime object without serializing it.

import datetime

def store_tweets(tweets_to_save, collection=tweets):
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

Picking up where we left off

The first time we run this script, it will scrape from the newest tweet back as far in our timeline as it can (approximately 800 tweets back). Then it will monitor new tweets and drop them in the database. But this behavior is completely contingent on the persistence of the “latest” variable. If the script dies for any reason, we’re in trouble: restarting the script will do a complete scrape on our timeline from scratch, going back as far as it can through historical tweets again. To manage this, we can query the “latest” variable from the database instead of just blindly setting it to “None” when we call the script:

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

And we’re done! The finished script looks like this:

import ConfigParser
import datetime
from pymongo import Connection
import time
from twython import Twython, TwythonRateLimitError

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

# spin up database
DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

def store_tweets(tweets_to_save, collection=tweets):
    """
    Simple wrapper to facilitate persisting tweets. Right now, the only
    pre-processing accomplished is coercing 'created_at' attribute to datetime.
    """
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        wait = 0
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']
            home_status = status['resources']['statuses']['/statuses/home_timeline']
            if home_status['remaining'] == 0:
                wait = max(home_status['reset'] - time.time(), 0) + 1 # addding 1 second pad
                time.sleep(wait)
            else:
                return
        else :
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

no_tweets_sleep = 1
while True:
    try:
        newest = None # this is just a flag to let us know if we should update the "latest" value
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)
        if home:
            while home:
                store_tweets(home)

                # Only update "latest" if we're inside the first pass through the inner while loop
                if newest is None:
                    newest = True
                    latest = home[0]['id']

                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:
            time.sleep(60*no_tweets_sleep)

    except TwythonRateLimitError, e:
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # addding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)

Installing scientific python libraries in windows

I often have issues installing certain python libraries (such as scipy or opencv) in windows. One solution to this problem is to do it the hard ware and just work your way through the errors as they come, winding your way through a maze of StackOverflow posts and mucking around in the registry. An alternative is to install a scientific python distribution like the Enthought Python Distribution, Continuum Analytics’ Anaconda, or Python(x,y). I’ve played around with Anaconda and Python(x,y) and they’re ok. Anaconda at least is supposed to be a little bit faster than the vanilla python installation, but I haven’t done any speed tests or anything.

Tonight I was wrestling with opencv. Anaconda didn’t have it in its package manager and I kept encountering roadblocks trying to install it in my generic windows installation (CPython). Luckily, I discovered this website which lists unofficial binaries for installing a lot of troublesome libraries in CPython for windows, both x64 and x32. The page is hosted by UCI professor Chris Gholke. If you see this Chris, you’re awesome.

Lifesaver.

What’s Really Going On In Principal Component Analysis (PCA)?

PCA seems to have gained sufficient popularity that many “I know just enough statistics to be really dangerous” types are using it. I have no real problem with this: it’s a great technique that has a lot of applications, but on a fairly regular basis I stumble across questions about PCA on CrossValidated that would only be asked by someone who really doesn’t understand at a fundamental level what PCA is doing. Often the situation is appropriate for PCA to be applied (usually dimensionality reduction), but questions are asked about how to use or interpret the output the indicate the person really doesn’t know what’s going on.

This article provides the best explanation of PCA I’ve come across, illustrating the linear algebra in a very intuitive geometrical context. I’ll provide a condensed version of the geometric explanation below without getting into the Linear Algebra

Consider a sample of 50 points generated from y=x + noise. The first principal component will lie along the line y=x and the second component will lie along the line y=-x, as shown below.

enter image description here

The aspect ratio messes it up a little, but take my word for it that the components are orthogonal. Applying PCA will rotate our data so the components become the x and y axes:

enter image description here

The data before the transformation are circles, the data after are crosses. In this particular example, the data wasn’t rotated so much as it was flipped across the line y=-2x, but we could have just as easily inverted the y-axis to make this truly a rotation without loss of generality as described here.

The bulk of the variance, i.e. the information in the data, is spread along the first principal component (which is represented by the x-axis after we have transformed the data). There’s a little variance along the second component (now the y-axis), but we can drop this component entirely without significant loss of information. So to collapse this from two dimensions into 1, we let the projection of the data onto the first principal component completely describe our data.

enter image description here

We can partially recover our original data by rotating (ok, projecting) it back onto the original axes.

enter image description here

The dark blue points are the “recovered” data, whereas the empty points are the original data. As you can see, we have lost some of the information from the original data, specifically the variance in the direction of the second principal component. But for many purposes, this compressed description (using the projection along the first principal component) may suit our needs.

Here’s the code I used to generate this example in case you want to replicate it yourself. If you reduce the variance of the noise component on the second line, the amount of data lost by the PCA transformation will decrease as well because the data will converge onto the first principal component:

set.seed(123)
y2 = x + rnorm(n,0,.2)
mydata = cbind(x,y2)
m2 = colMeans(mydata)

p2 = prcomp(mydata, center=F, scale=F)
reduced2= cbind(p2$x[,1], rep(0, nrow(p2$x)))
recovered = reduced2 %*% p2$rotation

plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Data with principal component vectors')
arrows(x0=m2[1], y0=m2[2]
       ,x1=m2[1]+abs(p2$rotation[1,1])
           ,y1=m2[2]+abs(p2$rotation[2,1])
       , col='red')
arrows(x0=m2[1], y0=m2[2]
       ,x1=m2[1]+p2$rotation[1,2]
           ,y1=m2[2]+p2$rotation[2,2]
       , col='blue')

plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Data after PCA transformation')
points(p2$x, col='black', pch=3)
    arrows(x0=m2[1], y0=m2[2]
           ,x1=m2[1]+abs(p2$rotation[1,1])
       ,y1=m2[2]+abs(p2$rotation[2,1])
           , col='red')
    arrows(x0=m2[1], y0=m2[2]
           ,x1=m2[1]+p2$rotation[1,2]
       ,y1=m2[2]+p2$rotation[2,2]
           , col='blue')
    arrows(x0=mean(p2$x[,1])
      ,y0=0
      ,x1=mean(p2$x[,1])
          ,y1=1
          ,col='blue'
           )
    arrows(x0=mean(p2$x[,1])
       ,y0=0
       ,x1=-1.5
       ,y1=0
       ,col='red'
)
lines(x=c(-1,1), y=c(2,-2), lty=2)

plot(p2$x, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='PCA dimensionality reduction')
    points(reduced2, pch=20, col="blue")
    for(i in 1:n){
      lines(rbind(reduced2[i,], p2$x[i,]), col='blue')
}

plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Lossy data recovery after PCA transformation')
arrows(x0=m2[1], y0=m2[2]
       ,x1=m2[1]+abs(p2$rotation[1,1])
           ,y1=m2[2]+abs(p2$rotation[2,1])
       , col='red')
arrows(x0=m2[1], y0=m2[2]
       ,x1=m2[1]+p2$rotation[1,2]
           ,y1=m2[2]+p2$rotation[2,2]
       , col='blue')
for(i in 1:n){
  lines(rbind(recovered[i,], mydata[i,]), col='blue')
}
points(recovered, col='blue', pch=20)

The bulk of this post was scavenged from a response I provided on CrossValidated.

Programming A Simple Reddit Bot (in Python)

Almost a year ago I built a built a reddit bot called VideoLinkBot¬†that has been running on an old half-broken laptop in my room ever since. VLB was a great exercise in agile development that I started to write a post about awhile ago but abandoned. The gist of it is that the bot only satisfied a small portion of my vision when it first went live, but I didn’t want to delay “bringing my product to market.” This was a good decision because due mainly to time constraints, it took quite some time for me to incorporate most of the features I originally planned for the bot, many of which were deprioritized in my development path in favor of things my “end users” (redditors) specifically requested.

Courtesy of the praw library, building a simple bot is really very easy. As an exercise, I wrote a clone of the now disabled “linkfixerbot” to illustrate a simple bot template to another redditor, since VLB was fairly complex. Instead of discussing VLB at length (I may do so at a later date) I’ll describe a general strategy for programming a reddit bot via the LinkFixerBot example.

LinkFixerClone

(Here’s the complete bot code)

On reddit, it’s not uncommon for a comment to include internal links to either a subreddit or a user’s overview page. To simplify building these links because they’re so common, reddit markdown interprets the syntax “/u/username” as a shortlink to a user’s history page, and “/r/subredditname” as a shortlink to a subreddit. Sometimes users either don’t realize that this markdown feature exists (less an issue now than when the feature was first rolled out) or they simply neglect to add the first forward slash and instead of proper markdown, they write something like “r/subredditname.” LinkFixerBot was a bot that scanned reddit for these mistakes and responded with repaired corrections.

We can break down the bot into three activities:

1. A component that scans the reddit comments feed.

2. A component that determines if a particular comment is a candidate for bot action.

3. A component that performs the bot’s signature activity in the form of a comment response.

There are many types of reddit bots, but I think what I’ve outlined above describes the most common. Some other kinds of bots may be moderator bots, or bots that mirror the contents of a specific subreddit.

The LinkFixerBot code should largely speak for itself: the while loop scans the reddit comment stream for comments, passing each comment to the “check_condition” function that determines if the bot should do anything. If the bot’s services are required, the results of pre-processing the comment accomplished by the check_condition function are passed into the bot_action function, which causes the bot to respond with a comment. Rinse and repeat.

Some items worth note:

  • It’s a good idea to maintain a short cache of comments the bot has already looked at to prevent duplicated work. LinkFixerClone uses a fixed-length deque so when new comments come in, they push the old comments out. The deque length is set to twice the number of comments the bot can call down from reddit at a time.
  • You’re going to need to include exception handling. Some special cases to consider (that are probably less of a problem for most bots than for VLB):
    • A comment or submission of interest has been deleted
    • A comment or submission is sufficiently old that it has been archived
    • Reddit is down (consider adding a progressively increasing “back-off” time until your next attempt).
    • Your bot’s username doesn’t have enough karma to post comments as frequently as it wants to.

This last case can actually be remedied, at least to some extent, by just using the account and accruing karma.

The reddit API rules discuss the appropriate request rate, but this is actually not a concern for us because praw handles that.

The Best of The Best on CrossValidated

CrossValidated is the StackExchange Q&A site focused on statistics. I mainly participate on CV answering questions (http://stats.stackexchange.com/users/8451/david-marx). One somewhat alternative use of SE sites that I like is to “stalk” interesting users. I started this practice on StackOverflow (the very popular programming QA site), where I’d occasionally stumble across very high reputation users answering topics with tags I was interested in, so I’d check their answer history to see what other interesting things they’d posted. This was interesting, but because users on SO have such varied skills, you never know what you’re going to get: maybe I visited a users page because they were answering a Python question, but it turns out most of their activity is talking about some other technology I’m not interested in.

On the other hand, because the scope of CV is so much narrower, an analogous investigation on CV almost always produces very interesting results (i.e. in general everyone on CV’s background and interests intersect with my own at least a little). SE has a very interesting tool available called the data explorer, a T-SQL interface to their database. As I happen to be a database jockey for my day job, I decided to make my CV user investigation more efficient by leveraging this tool.

Here’s a query I built that returns a list of CV users who have authored answers that scored >=25 points. The approach I chose satisfied me for several reasons:

  1. I’m not just interested in reputation. Repuatation is largely a function of activity on the site: if you answer a lot of question but never achieve a very high score, the shear volume of your answers will give you high reputation. Similarly, asking a lot of questions will boost your reputation, and I’m specifically interestd in people with good answers.
  2. I’m not just interested in the highest scoring responses on the site. I want reading material: theoretically, I would like to read the highest scoring answers on the site, but I’m lazy:: I’d like to be able to visit a users profile and have several interesting answers to read, not just one.

In particular, to satisfy the second criterion, my query sorts users in a fashion that is not entirely obvious when glancing at the results: instead of sorting by the user’s repuation or their answer reputation or their max answer score or the number of high scoring answers, I used a combiend index: the results are sorted by the highest answer score they achieved multiplied by the number of answers above the score threshhold (24). I like this approach for the following reasons:

  • People with lots of good answers should bubble up to the top.
  • As the number of good answers goes down, the max score that user has achieved becomes increasingly relevant.
  • Users with just one good answer should sink to the bottom unless their answer was really good.
SELECT MAX(score) max_score, count(p.id) good_answers,
(select count(*) from posts p2
where p2.posttypeid=2 and p2.owneruserid = u.id) total_answers,
u.reputation,
--u.displayname,
'site://users/' + CAST(u.Id AS nvarchar)
+ '?tab=answers&sort=votes|'
+ u.displayname  cv_user,
u.age, u.location, u.websiteurl,
lastaccessdate
FROM users u, posts p
WHERE u.id = p.owneruserid
AND p.posttypeid = 2  -- answers only
AND p.score >= 25
AND p.parentid <> 1337 -- ignore jokes
AND p.parentid <> 726  -- ignore quotes
GROUP BY reputation, u.displayname,
u.age, u.location, u.websiteurl, u.lastaccessdate, u.id
ORDER BY max(score) * count(p.id) desc
 

Winter Break Project: Building Minesweeper from Scratch

I don’t take a ton of vacation from my work, but luckily I don’t really need to over the winter: my employer gives us christmas through new years off, which comes out to around 10 days vacation. I had planned to do some travelling, but instead tagged along with my girlfriend to do some family stuff for new years. Since I was in town for most of the break, I spent most of my time just chilling (which, frankly, was awesome) and coding.

I’ve been programming for a few years now, both for fun and for work, but my programs usually are either scripts or backend (database) programs. I never got around to learning GUI programming, so I decided to use this opportunity to build a GUI program. I had recommended some minesweeper related challenges to a friend who’s learning programming, so i had minesweeper on the brain. I ended up coding the main game out of boredom one day, and decided to slap a GUI on top of it.¬†

Whenever I’ve seen GUI tutorials online, for simplicity sake they combine the main program code with the GUI code. I guess this makes sense for a quick tutorial, but my understanding is that this is pretty weak GUI programming. Generally, you want to separate the “model”, “controller” and “view” components.

For my application, my model was comprised of a board class which is a container for tile class objects I also defined. The controller was a “game” class which included functions like “click” and “flip_flag.” I wanted the game component (the model and controller) to operate just fine in isolation from the GUI, so my Game class also has a play() method for playing minesweeper in the terminal. Having built the working game components, I could now slap a GUI on top in a separate “view” component.

The most common GUI toolkits for python are wx, qt and tk. Tk is built into python as Tkinter, has a repuation for being simple to use, has plenty of documentation, and has the added benefit of taking advantage of the native OS’s aesthetics, so I decided to go with Tkinter.

Now, although there are lots of Tkinter tutorials, there doesn’t seem to be a ton of consistency in how people use Tkinter. In general, the main GUI is defined as a class called “App.” But sometimes the App class inherits from the Tk class, and sometimes it inherits from the Frame class. I started using an example that had my App inheriting from Frame, and this worked fine. I later wanted to implement a feature where the buttons would be hidden when the game was paused, and I found that this required a lot of changes if I kept it inheriting from the Frame class. I changed my app to inherit from the Tk class and it was a lot easier to add that feature. I think that inheritance makes more sense intuitively as well.

Anyway, as an exercise, I strongly recommend programming minesweeper. Next time I need to learn a new language, I plan to use programming minesweeper as a comprehensive exercise. To accomplish the task takes using several primitive types and functions, defining classes, using threads (for the timer), and of course the GUI.