A Flood Of Data

Hurricane Isaac Wind Map

Hurricane Isaac makes landfall in Louisiana on 8/29/2012

Some of the most amazing and elegant visualizations to appear this year have been Particle Flow visualizations. The two stars in this field are NASA’s Perpetual Ocean project, and the US Wind Map by Martin Wattenberg and Fernanda Viégas of Google’s Data Visualization Group.

The Perpetual Ocean project is stunningly beautiful: like Van Gogh paintings in motion. Unfortunately, the Perpetual Ocean is a static project, it represents over two years worth of data, but just the period from June 2005 – December 2007. On the other hand, the Wind Map amazingly is based on live weather data updated hourly from the National Digital Forecast Database. On any particular day the map is cool, but today Hurricane Isaac hit ground in Louisiana, and the hurricane is evident on the wind map. It’s just really, really cool to play with.

Zoom in on the map by double clicking to see just how small and calm the eye of the storm is. Some noteworthy features:

  • Between Lafayette and Baton Rouge, the south blowing wind seems to be unusually accelerated. That’s because it’s flowing south unimpeded over the Mississippi River (in Natchez) and onto the delta flats of the Sherburne Wildlife Management Area. Wind generally blows faster over water than on land because it’s unimpeded by trees, hill, buildings, etc.
  • At the eye of the storm the wind is at or below 1.0 mph. Using Google Maps to approximate scale, about 5 miles away the wind speed is already at 30 mph, and just 20 miles away we’re already at the peak windspeed of 50 mph.
  • The East-West blowing winds hitting Baton Rouge are about 35 mph immediately prior to the city, then 30 mph immediately after. The city is built along the mississippi, so its building are organized perpendicular to the wind and seem to be acting as a physical wall, slowing the wind down. You can actually see the wind in the map parting to go around the city.

Very interesting graphic. Hope everyone’s safe down there.


Supervised Learning

I’m a mostly self-taught programmer/developer. I work as a data analyst and try to leverage my work for opportunities to learn new tools and tricks, but for the most part my learning has come by way of pet projects. I thought about starting this blog for some time before I actually started posting. Originally, I wanted to use it to log useful code snippets or elegant solutions I developed (primarily to problems at work). I generally don’t find I have the time to document that sort of thing while I’m working so although I’m only a few posts deep, I feel like this blog is going to be more of a project log.

Anyway, I didn’t start typing to produce a manifesto for my blog. I’m here to brag. Tonight was orientation for my grad program! That’s right, after years of committing most of my free time to teaching myself shit, I have finally re-entered academia. I am now a graduate candidate in the Georgetown University Mathematics and Statistics Program.

At the moment, I’m sufficiently interested in my work to to the  part time thing, which means one class this semester to get my feet wet, then two classes every other semester for a total of ten classes for the program. My first class is Probability with the book A Course in Probability. I think the program has a primarily frequentist slant, but I hope to supplement this with a Bayesian Statistics class next semester and I suspect at least one class from the comp sci department, either Artificial Intelligence or Machine Learning.

As a 27 year old I don’t think I’m old from the program, but now that I think about it I don’t remember seeing any faces in the room tonight that looked older than me, so maybe I am going to be the old man of the bunch.

In any event, I’m excited and it’s going to be fun. I’ve got my text book, but I’m also simultaneously enrolled in SEVEN coursera courses (I’m mainly signed up for access to the content, I suspect I won’t have the time to finish any of the within the due date. Maybe one. I’ll finish them on my own time, anyway), so I think I’ll use my free time for other learning fun while I can. During my mini freak-out about whether or not I’d learn some Bayesian analysis in this program I picked up the textbook Pattern Recognition and Machine Learning (yes, I read textbooks for fun. I’m that guy), so I’ll try to dig into that some more tonight.
Oh yeah: go Hoya’s.

Weekend project: Reddit Comment Scraper in Python

(The methodology described below works, but is not as easy as the preferred alternative method using the praw library. If you’re here because you want to scape reddit, please reference this post which describes a simpler, faster method)

This weekend, I built a webscraper that downloads comments from reddit. It was a good exercise and I’m quite pleased with my work, but I hesitate to refer to my tool as a “scraper” because reddit makes it easy to access JSON output which is very, very easy to work with, and when I think of webscraping I think of more involved screen capture and parsing such as with a package like Beautiful Soup. Maybe it would be more appropriate to call my tool a “miner,” but it doesn’t do any analysis, just grabs data. Anyway, enough semantics.

Here’s the code on GIthub. Please bear in mind, this is still in development, but it works well enough for now. Although I haven’t added a formal license file in the repo, all project code is licensed CC BY-SA.

Generalization of the steps in my development process:

  1. Determine how to open and download web pages
  2. Determine how to get JSON out of reddit
  3. Determine how to generate appropriate URLs for pagination
  4. Determine how to extract pertinent data from the downloaded pages
  5. Determine appropriate halt points for the download process
  6. Add command line parameters
  7. Implement a data storage solution

1. Downloading Webpages

This was the easy part. There are a lot of fairly generic solutions to this in python. mechanize is a nice library for web browsing with python, but it’s a little more robust than what I need. urllib2 is closer to the metal, but still very simple and lightweight. urllib is even simpler, but the reddit API documentation specifies that they want to see a header. I’m not really using the API, but I decided to include a header in my request anyway to play nice with those reddit kids.

from urllib2 import Request, urlopen

_URL = 'http://www.reddit.com/'
_headers = {'User-agent':'myscript'}

request = Request(_URL, _headers)
response = urlopen(request)
data = response.read()

Honesty time: I don’t know a ton about the HTTP protocol. I know there are POST requests and GET requests. A GET is a simple “Hey, can I see that webpage?” whereas a POST is more like “I want to interact with that webpage” (like completing a form.” The above represents a GET request which is all I needed for my tool; if you want to use urllib2 to make a POST, just add a ‘data’ parameterin the call to urlopen or when you’re generating the ‘Request’ object. I wish I understood better what was going on here, but my understanding is that urlopen() creates a connection, and the read() method actually pulls data across the connection. urlopen() will throw an error if there’s an issue with the request or forming the connection, but the read() method is where I think you’re most likely to see an error.

Reddit doesn’t like people to make requests to frequently, so I when I wrapped the above code in a function I included a default 2 second delay before issuing the GET request. This request is wrapped in a ‘try’ block that increases the delay if an exception is reached. I think I’ve seen instances where the exception wasn’t seen until hitting my code’s JSON parser, but you get the idea. I can modify later as needed. Here’s my function:

def get_comments(URL,head,delay=2):
    '''Pretty generic call to urllib2.'''
    sleep(delay) # ensure we don't GET too frequently or the API will block us
    request = Request(URL, headers=head)
        response = urlopen(request)
        data = response.read()
        response = urlopen(request)
        data = response.read()
    return data


Doug Hellman has a great tutorial on using JSON with python, so I won’t go into too much depth here. I will say this much though: reddit is very developer friendly and makes their website very easy to scrape. By adding “.json” or “.xml” to the end of any reddit URL (at least to the best of my knowledge), you get structured data in your format of choice. I’ve had some experience with XML and I’m not a huge fan. This project was my first time doing anything with JSON and I knew nothing about JSON going into this project, but even so I found JSON to be extremely intuitive. Basically each reddit “thing” (object, like a comment or link) gets converted into what are effectively nested python dictionaries. Awesome. Here’s a sample of the JSON we get from the reddit comments page. This contains one single comment, but the output is from a set of 25 comments, hence the “before” and “after” tags.

{"kind": "Listing"
{"modhash": "", "children":
{"kind": "t1", "data":
{"subreddit_id": "t5_2qh1e"
, "link_title": "This is How Hot it is in Iraq"
, "banned_by": null
, "link_id": "t3_ymy8w"
, "likes": null
, "replies": null
, "id": "c5x77kx"
, "author": "shaggorama"
, "parent_id": "t1_c5x54ci"
, "approved_by": null
, "body": "I don't know where this soldier is, but t[he high in Baghdad today was 106^oF](http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008)\n\nEDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually [105^oF](http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA)"
, "edited": 1345668492.0
, "author_flair_css_class": null
, "downs": 0
, "body_html": "<div class=\"md\"><p>I don't know where this soldier is, but t<a href=\"http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008\">he high in Baghdad today was 106<sup>oF</sup></a></p>\n\n<p>EDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually <a href=\"http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA\">105<sup>oF</sup></a></p>\n</div>"
, "subreddit": "videos"
, "name": "t1_c5x77kx"
, "created": 1345667922.0
, "author_flair_text": null
, "created_utc": 1345667922.0
, "num_reports": null, "ups": 3}
, "after": "t1_c5wcp43"
, "before": "t1_c5x77kx"

3.  URL Hacking

Originally I had expected this foray to familiarize me with the reddit API. After digging through it a bit though I got the impression that the API is really for app developers and not webscrapers. I just assumed there would be some API call I could use to get comments for a user, but my research into the reddit API documentation revealed no such functionality. This meant I would have to pull the data down directly from the website. The first 25 comments are easy enough to grab:


This is converted to easy to navigate JSON by adding “.JSON” to the end of the URL as such:


The URL for the next 25 comments is a little more complicated:


The number after “count” is the number of comments we’ve already seen, so we’ll have to keep track. Th “after” parameter needs the reddit “thing id” for the last comment we saw. I had expected to just be able to tack “.json” onto the above URL, but when I did that reddit interpreted that as part of the thing id and just sent me back to the top 25 comments in HTML since it didn’t understand my request. After experimenting for a few minutes, I figured it out and here’s what the JSON request needs to look like:


4. Parsing JSON

Again, I wish I understood what I was doing here better and direct you to Doug Hellman’s JSON tutorial. I may not understand exactly what I’m doing, but I’ve got my functional code so I’m happy.

The usage of the python JSON library usually seems to start with a call to either json.loads() or json.dumps() to encode/decode JSON. I don’t remember where I found this, but apparently the solution I needed was actually the decode() method of the JSONDecoder() class. Apparently you don’t even need to generate an instance of this class to use it, so here’s what I ended up with:

ecoded = json.JSONDecoder().decode(json_data)
comments = [x['data'] for x in decoded['data']['children']]

For convenience, I wrapped my parser in a function that returns the ‘comments’ list as well as a list of comment ids.

5. Halting the Download

At this point I basically had a working prototype. I built my tool to output some information to the screen as the download progressed, and I noticed that after 999 comments, reddit stopped giving me data. I had built my tool to halt when it started seeing comments it had already downloaded (since that’s what the HTML website does) but apparently the JSON output just gives you null data. I decided to also add an optional download limit to my scraper tool, so I wrote in three conditions to stop downloading: receipt of data we had already downloaded, receipt of null data, or reaching a defined limit on downloads.

The above observation might seems sufficiently trivial that I shouldn’t have mentioned it, but I think it’s important to think about these kind of stop points and test and keep an eye on them in live code when you’re scraping. Otherwise you might just eat up unnecessary bandwidth and spam your target website with requests.

6. Command Line Parameters

For this I used the argparse library. Unlike urllib2 and json, this bit wasn’t a learning experience for me. I’ve used this library in the past and frankly, I fucking love it. It’s easy to use, generates a very easy ‘help’ page… The scripts I’ve used it in before were all smaller and for fairly narrow use at work. For this project I used the “if __name__ == ‘__main__’: “ test to simplify debugging, and I originally dropped my command line arguments right under my library importation calls. The result was that my code would error out because I hadn’t provided parameters that the code saw as required. Lesson learned: command line argument handling goes inside the “main” block, not up with function definitions. Here’s the code, it’s pretty self explanatory:

if __name__ == '__main__':
    ### Commandline argument handling ###

    parser = argparse.ArgumentParser(description="Scrapes comments for a reddit user. Currently limited to most recent 999 comments (limit imposed by reddit).")
    parser.add_argument('-u','--user', type=str, help="Reddit username to grab comments from.", required=True)
    parser.add_argument('-l','--limit', type=int, help="Maximum number of comments to download.", default=0)
    parser.add_argument('-d','--dbname', type=str, help="Database name for storage.", default='RedditComments.DB')
    parser.add_argument('-w','--wait', type=str,help="Wait time between GET requests. Reddit documentation requests a limit of 1 of every 2 seconds not to exceed 30 per min.", default=2)

    args = parser.parse_args()
    _user   = args.user
    _limit  = args.limit
    _dbname = args.dbname
    _wait   = args.wait

    comments = scrape(_user, _limit, _wait)

7. Saving The Results

I had built a successful comment scraper! The one problem was my work was evaporating into the ether. The comments would download, and be lost as soon as the program exited. I needed to save them somehow. One very simple solution would be to pickle the comments. This would make it easy to get the data back into python, but I won’t be doing my analyses in the python environment. I know my way around excel, R and octave, but I haven’t gotten around to learning numpy/scipy/scikit-learn yet. I may make that part of this project in which case I could have my tool automatically spit out some graphs after the download, which would be pretty neat.

My very, very strong preference over pickling would be to store the data in a database. This is really the obvious solution, but unfortunately, although I have been neck deep in SQL for the past two years for work, I don’t really have any experience working with databases from inside python. I’ve used python to play with CSVs, so for the time being I’m dumping the data into a CSV using the csv DictWriter() class. This unfortunately wasn’t as straightforward as it should have been because csv writes to ASCII and my decoded JSON was in unicode. I won’t break it down for you, but here’s how I dealt with the unicode (I totally found this solution somewhere and modified it to suit my needs, but I can’t remember where I found it. Probably StackOverflow):

writer.writerow(dict((k, v.encode(‘utf-8’) if isinstance(v, unicode) else v) for k, v in comment.iteritems()))

I’m not a big fan of this solution for a myriad of reasons. My next step will be implementing database storage. Right now I’m using this as an excuse to learn the sqlalchemy library, but considering how familiar SQL is to me I may go lower level and just do it with the much lighterwieght, albeit RDBMS implementation specific sqlite3.

Tinkering with OCR

A year ago (wow..was it really that long ago?) I took the initial offering of Andrew Ng’s now very famous Coursera (then ml-class.org) machine learning class. One of the class projects was to build a neural network to classify handwritten digits. Kaggle recently created a new “tutorial” contest attacking the same problem, and from the look of it, using the same data.

Well, sort of. I’m fairly certain the Coursera dataset was derived from the same MNIST dataset used in the Kaggle competition. The dataset available from MNIST has 70,000 28×28 images and is apparently just a subset of a larger dataset available from NIST. The kaggle contest provides a training set of 42,000 28×28 images. The ML class homework assignment — ex4 — provided a training set of 5000 20×20 images. I had a nice vectorized solution in octave I was fairly proud of for the ML class, and I seem to recall requiring some time for that NN to train on my dinky, 4 year old laptop.

I unfortunately don’t find much excuse to use machine learning at work, so I’ve been tinkering a bit in my spare time (hence this blog). Although I attacked pretty much this exact problem last year, I felt the Coursera course was largely focused teaching the theory behind the neural network algorithm and also did a fair amount of hand-holding. I’ve been teaching myself R and decided to give this project a shot. Also, this dataset is massive relative to other data I’ve hacked at for extracurricular KDD (don’t get me wrong, my dataset at work is plenty big, just not really applicable to this sort of fun), so it would be nice ot push my hardware a little. I’m starting a Masters in Stats program at Georgetown this fall, and I’m probably going to buy/build a nice number cruncher: might as well set some benchmarks to see how much of an improvement I get for my investment.

OK, enough background. If I’m going to stand a chance working with the provided data, I’m going to need to do some serious dimensionality reduction. The kaggle dataset is 42000×785. I could reduce the number of rows, but the more data I have to learn on the better. Really, if I wanted to be a dick, I could just pull down the whole MNIST dataset and use it for KNN prediction, which should in theory give me 100% accuracy since I know the test set is from that data. But that’s no fun at all. One option for reducing the number of rows would be bagging, but I’ve never tried that before. If I use a randomForest classifier (which I intend to, just to see how it performs against their benchmark) my understanding is that this should already be happening.

Let’s not even worry about the rows for now. Let’s reduce those columns. The first column is labels, the next 784 represent the unrolled 28×28 image matrix. There are a few different methods I’m considering:

  • Lossless data compression
  • Lossy data correction (via clustering)
  • Principal Component Analysis dimensionality reduction
  • Outright elimination of invaluable columns

Let’s discuss each of these in sequence.

Lossless Data Compression

Honesty time: I don’t know shit about image processing, so I’m figuring this out as I go. The “lossless” compression I came up with isn’t really lossless, just mostly. The first step I took was applying contrast to each image. The defining features of a digit aren’t the border shading, so let’s drop that.

The values in our image data are 0-255 (grayscale). After some analysis, I decided to set everything below 120 to 0, and everything above 120 to 1. Tada! Binary data. Much easier to work with, and really just lovely for classification algorithms.

Now for the compression. Sometime ago I learned about this trick JPEG uses called the Discrete Cosine Transform. This converts a block of pixels into one of a limited number of base cells (64 possibilities for each color channel), which allows the JPEG algorithm to represent an image in terms of these base cells. This compression algorithm is reminiscent of another assignment from the ML class where we used clustering to compress an image. My favorite use of this kind of compression is this guy who wanted to emulate the display limitations of old school NES consoles.

Anyway, on to my solution. Having converted our matrix from values ranging 0-255 to just discrete 0 and 1, I realized if I treated the image as composed of 4×4 cells, I could treat each cell as 4-bit binary and converting these to decimal I effectively retain all the information of the original information while reducing the dimensionality of the actual image matrix from 28×28 to 14×14. This is actually a quadratic compression if you think about it: I go from 784 bits to 196. Not bad at all!

Lossy data correction

As I confessed above, my “lossless” algorithm was only lossless after already performing some processing on the image. After performing the above compression, I decided to view my results and was somewhat surprised to find that I had compressed the image in a human readable fashion, even though I had intended to encode detailed position information into each bit. Tempted to apply my “high contrast” algorithm to the compressed output, I realized it would probably be better easier to apply contrast and compression at the same time.

I haven’t tried this yet, but I have three ideas for performing this kind of compression on the high contrast 28×28 data:

  1. represent each 2×2 block that contains a 1 as 1, the rest as 0
  2. represent each 2×2 cell as its sum
  3. represent each 2×2 cell as its average

Although I would get different actual values, options 2 and 3 would actually give me the same output. Here are the following possible outcomes of each algorithm: (2)-{0,1, 2, 3, 4}; (3)-{0,0.25, 0.5, 0.75, 1}. Taking the average really just scales the sums down by 4, so it would really give me the same output but just take a little longer. Forget I even suggested it.

The benefit of taking the sum is that I encode the number of 1’s that appeared in the cell. I could just encode the position of those ones as I suggested above, but I believe that will take significantly longer than otherwise (converting from binary in R is messy). The main benefit of this would basically be a time increase: I suspect that the range in each variable won’t have much if ay impact on learning, but the number of variables will have a significant impact on time.

Principal Component Analysis

PCA is the traditional go-to for dimensionality reduction. PCA is really good when you don’t know what’s going to be useful. I understand a fair amount about the structure of this data and am actually not particularly inclined to use PCA to slim it down. I haven’t tried it yet, so maybe I’ll have a change of heart. Actually, I was thinking of using PCA to develop some additional features. I’ve dedicated a lot of attention to reducing the size of my dataset to make it viable for my hardware, but the fact is unless I learn a few additional features, I shouldn’t expect to perform better than the available baselines. In fact, considering those were calculated on the full dataset, I should really expect to do worse.

My thought was the following: the traditional PCA approach would be to treat my data as though it were in 784 degree space, but it’s not. It is so, so very not. My data is really two dimensional. It’s a scatterplot represented by a matrix. Running PCA might give me some funky results, but I’m really just interested in the first principal component, which tells me about the general “direciton” of my data. If I collapse my dataset down into one dimension along this component (an eigenvector), my data will be…hmm, can’t remember, but my suspicion is that the one dimensional representation should be no longer than the diagonal from one corner of my image to another. For my uncompressed image (28×28), that’s 40 bits, for my compressed image (14×14), 20 bits. Add two bits to each to represent the eigenvetor.

A couple of people have taken averages over each label in the dataset. This is to me a very intuitive visualization to attempt with this data (there’s a data visualizaiton “contest” running in parallel with the main contest). Corey Chivers at R-Bloggers points out that some of the blur in the images is due to different people drawing them at different angles. My first thoughts about preprocessing, after dimensionality reduction, were to normalize the configuration of the images as much as possible. I had originally thought of rotating each image relative to its first principal component, but why go through the extra matrix multiplication? If I use PCA to collapse the image, I’ve effectively normalized the direction of the images (at least, relative to their label) and potentially learned some new features. I suspect that certain digits are very distinctive in this collapsed state, in particular 8.

I might try learning a classifier just on PCA reduced data, but I’m thinking I will probably at least calculate the direction of the eigenvector as a learned, aggregate parameter and add that to my other models. Not sure how useful it will be without the collapsed data though.

Outright elimination of invaluable columns

Some fields in the dataset carry no information at all. Zero. Zilch. Nada. I made this realization almost immediately, but there was also some discussion in the contest forum about it as well. In particular, user AlKhwarizmi recommended using the nearZeroVariance function in the caret package. This is probably the correct way to handle these columns, but I like my way, which was probably faster and might even have eliminated more columns.

Should I have at least calculated variance? Probably. Instead though, after applying contrast to the images, I just summed over each column with colSums. That’s it. I took the output, threw it into a few exploratory plots and decided that columns that had a value less than 1200 could be eliminated. I think I came up with this figure by plotting the eCDF of my data, but it actually makes a lot of sense, since these columns are only populated in 5% of the dataset (1200/24000). Using this threshhold trimmed my data from 784 variables to 392. Not bad, right?

Here’s the thing: elimination of these columns should probably be the last step in my preprocessing since I am guaranteed to lose a lot of positional information this way. Maybe it doesn’t actually make a different, but my intuition is the later I do this, the better.

Current Status

I wrote the “high contrast”, “binary->decimal lossless compression” and “elimination of useless columns” code already, and my intention is to run the processing in that order: contrast, compression, columns. I started to tun the compression code yesterday and…well, it never finished. I got impatient after about 4 hours and killed the process. My feeling is that if it takes that long, I’m probably not gaining any benefit from the preprocessing. I think the best way to do this will be to perform my preprocessing in python and then train the model in R on the processed data.

A big part of this is that I want to learn R. For the Coursera class, we used Octave (essentially Matlab).