A year ago (wow..was it really that long ago?) I took the initial offering of Andrew Ng’s now very famous Coursera (then ml-class.org) machine learning class. One of the class projects was to build a neural network to classify handwritten digits. Kaggle recently created a new “tutorial” contest attacking the same problem, and from the look of it, using the same data.

Well, sort of. I’m fairly certain the Coursera dataset was derived from the same MNIST dataset used in the Kaggle competition. The dataset available from MNIST has 70,000 28×28 images and is apparently just a subset of a larger dataset available from NIST. The kaggle contest provides a training set of 42,000 28×28 images. The ML class homework assignment — ex4 — provided a training set of 5000 20×20 images. I had a nice vectorized solution in octave I was fairly proud of for the ML class, and I seem to recall requiring some time for that NN to train on my dinky, 4 year old laptop.

I unfortunately don’t find much excuse to use machine learning at work, so I’ve been tinkering a bit in my spare time (hence this blog). Although I attacked pretty much this exact problem last year, I felt the Coursera course was largely focused teaching the theory behind the neural network algorithm and also did a fair amount of hand-holding. I’ve been teaching myself R and decided to give this project a shot. Also, this dataset is massive relative to other data I’ve hacked at for extracurricular KDD (don’t get me wrong, my dataset at work is plenty big, just not really applicable to this sort of fun), so it would be nice ot push my hardware a little. I’m starting a Masters in Stats program at Georgetown this fall, and I’m probably going to buy/build a nice number cruncher: might as well set some benchmarks to see how much of an improvement I get for my investment.

OK, enough background. If I’m going to stand a chance working with the provided data, I’m going to need to do some serious dimensionality reduction. The kaggle dataset is 42000×785. I could reduce the number of rows, but the more data I have to learn on the better. Really, if I wanted to be a dick, I could just pull down the whole MNIST dataset and use it for KNN prediction, which should in theory give me 100% accuracy since I know the test set is from that data. But that’s no fun at all. One option for reducing the number of rows would be bagging, but I’ve never tried that before. If I use a randomForest classifier (which I intend to, just to see how it performs against their benchmark) my understanding is that this should already be happening.

Let’s not even worry about the rows for now. Let’s reduce those columns. The first column is labels, the next 784 represent the unrolled 28×28 image matrix. There are a few different methods I’m considering:

- Lossless data compression
- Lossy data correction (via clustering)
- Principal Component Analysis dimensionality reduction
- Outright elimination of invaluable columns

Let’s discuss each of these in sequence.

# Lossless Data Compression

Honesty time: I don’t know shit about image processing, so I’m figuring this out as I go. The “lossless” compression I came up with isn’t *really* lossless, just mostly. The first step I took was applying contrast to each image. The defining features of a digit aren’t the border shading, so let’s drop that.

The values in our image data are 0-255 (grayscale). After some analysis, I decided to set everything below 120 to 0, and everything above 120 to 1. Tada! Binary data. Much easier to work with, and really just lovely for classification algorithms.

Now for the compression. Sometime ago I learned about this trick JPEG uses called the Discrete Cosine Transform. This converts a block of pixels into one of a limited number of base cells (64 possibilities for each color channel), which allows the JPEG algorithm to represent an image in terms of these base cells. This compression algorithm is reminiscent of another assignment from the ML class where we used clustering to compress an image. My favorite use of this kind of compression is this guy who wanted to emulate the display limitations of old school NES consoles.

Anyway, on to my solution. Having converted our matrix from values ranging 0-255 to just discrete 0 and 1, I realized if I treated the image as composed of 4×4 cells, I could treat each cell as 4-bit binary and converting these to decimal I effectively retain all the information of the original information while reducing the dimensionality of the actual image matrix from 28×28 to 14×14. This is actually a quadratic compression if you think about it: I go from 784 bits to 196. Not bad at all!

# Lossy data correction

As I confessed above, my “lossless” algorithm was only lossless after already performing some processing on the image. After performing the above compression, I decided to view my results and was somewhat surprised to find that I had compressed the image in a human readable fashion, even though I had intended to encode detailed position information into each bit. Tempted to apply my “high contrast” algorithm to the compressed output, I realized it would probably be better easier to apply contrast and compression at the same time.

I haven’t tried this yet, but I have three ideas for performing this kind of compression on the high contrast 28×28 data:

- represent each 2×2 block that contains a 1 as 1, the rest as 0
- represent each 2×2 cell as its sum
- represent each 2×2 cell as its average

Although I would get different actual values, options 2 and 3 would actually give me the same output. Here are the following possible outcomes of each algorithm: (2)-{0,1, 2, 3, 4}; (3)-{0,0.25, 0.5, 0.75, 1}. Taking the average really just scales the sums down by 4, so it would really give me the same output but just take a little longer. Forget I even suggested it.

The benefit of taking the sum is that I encode the number of 1’s that appeared in the cell. I could just encode the position of those ones as I suggested above, but I believe that will take significantly longer than otherwise (converting from binary in R is messy). The main benefit of this would basically be a time increase: I suspect that the range in each variable won’t have much if ay impact on learning, but the number of variables will have a significant impact on time.

# Principal Component Analysis

PCA is the traditional go-to for dimensionality reduction. PCA is really good when you don’t know what’s going to be useful. I understand a fair amount about the structure of this data and am actually not particularly inclined to use PCA to slim it down. I haven’t tried it yet, so maybe I’ll have a change of heart. Actually, I was thinking of using PCA to develop some additional features. I’ve dedicated a lot of attention to reducing the size of my dataset to make it viable for my hardware, but the fact is unless I learn a few additional features, I shouldn’t expect to perform better than the available baselines. In fact, considering those were calculated on the full dataset, I should really expect to do worse.

My thought was the following: the traditional PCA approach would be to treat my data as though it were in 784 degree space, but it’s not. It is so, so very not. My data is really two dimensional. It’s a scatterplot represented by a matrix. Running PCA might give me some funky results, but I’m really just interested in the first principal component, which tells me about the general “direciton” of my data. If I collapse my dataset down into one dimension along this component (an eigenvector), my data will be…hmm, can’t remember, but my suspicion is that the one dimensional representation should be no longer than the diagonal from one corner of my image to another. For my uncompressed image (28×28), that’s 40 bits, for my compressed image (14×14), 20 bits. Add two bits to each to represent the eigenvetor.

A couple of people have taken averages over each label in the dataset. This is to me a very intuitive visualization to attempt with this data (there’s a data visualizaiton “contest” running in parallel with the main contest). Corey Chivers at R-Bloggers points out that some of the blur in the images is due to different people drawing them at different angles. My first thoughts about preprocessing, after dimensionality reduction, were to normalize the configuration of the images as much as possible. I had originally thought of rotating each image relative to its first principal component, but why go through the extra matrix multiplication? If I use PCA to collapse the image, I’ve effectively normalized the direction of the images (at least, relative to their label) and potentially learned some new features. I suspect that certain digits are very distinctive in this collapsed state, in particular 8.

I might try learning a classifier just on PCA reduced data, but I’m thinking I will probably at least calculate the direction of the eigenvector as a learned, aggregate parameter and add that to my other models. Not sure how useful it will be without the collapsed data though.

# Outright elimination of invaluable columns

Some fields in the dataset carry no information at all. Zero. Zilch. Nada. I made this realization almost immediately, but there was also some discussion in the contest forum about it as well. In particular, user AlKhwarizmi recommended using the nearZeroVariance function in the caret package. This is probably the correct way to handle these columns, but I like my way, which was probably faster and might even have eliminated more columns.

Should I have at least calculated variance? Probably. Instead though, after applying contrast to the images, I just summed over each column with colSums. That’s it. I took the output, threw it into a few exploratory plots and decided that columns that had a value less than 1200 could be eliminated. I think I came up with this figure by plotting the eCDF of my data, but it actually makes a lot of sense, since these columns are only populated in 5% of the dataset (1200/24000). Using this threshhold trimmed my data from 784 variables to 392. Not bad, right?

Here’s the thing: elimination of these columns should probably be the last step in my preprocessing since I am guaranteed to lose a lot of positional information this way. Maybe it doesn’t actually make a different, but my intuition is the later I do this, the better.

# Current Status

I wrote the “high contrast”, “binary->decimal lossless compression” and “elimination of useless columns” code already, and my intention is to run the processing in that order: contrast, compression, columns. I started to tun the compression code yesterday and…well, it never finished. I got impatient after about 4 hours and killed the process. My feeling is that if it takes that long, I’m probably not gaining any benefit from the preprocessing. I think the best way to do this will be to perform my preprocessing in python and then train the model in R on the processed data.

A big part of this is that I want to learn R. For the Coursera class, we used Octave (essentially Matlab).