Installing scientific Python libraries in Windows

I often have issues installing certain python libraries (such as scipy or opencv) in Windows. One solution to this problem is to do it the hard way and just work your way through the errors as they come, winding your way through a maze of StackOverflow posts and mucking around in the registry. An alternative is to install a scientific python distribution like the Enthought Python Distribution, Continuum Analytics’ Anaconda, or Python(x,y). I’ve played around with Anaconda and Python(x,y) and they’re OK. Anaconda at least is supposed to be a little bit faster than a vanilla Python installation, but I haven’t done any speed tests or anything.

Tonight I was wrestling with opencv. Anaconda didn’t have it in its package manager, and I kept encountering roadblocks trying to install it in my generic Windows installation (CPython). Luckily, I discovered this website, which lists unofficial binaries for installing a lot of troublesome libraries in CPython for Windows, in both 64-bit and 32-bit builds. The page is maintained by Christoph Gohlke at UC Irvine. If you see this, Christoph: you’re awesome.

Lifesaver.
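One practical note if you go this route: the binaries come in separate 32-bit and 64-bit builds, and you need the one that matches your Python interpreter (which isn’t necessarily the same as your OS). A quick way to check, not from the original post:

import platform
import struct

print(platform.python_version())    # e.g. 2.7.3
print(struct.calcsize("P") * 8)     # 64 for a 64-bit interpreter, 32 for 32-bit
print(platform.architecture()[0])   # '64bit' or '32bit'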

What’s Really Going On In Principal Component Analysis (PCA)?

PCA seems to have gained sufficient popularity that many “I know just enough statistics to be really dangerous” types are using it. I have no real problem with this: it’s a great technique that has a lot of applications, but on a fairly regular basis I stumble across questions about PCA on CrossValidated that would only be asked by someone who really doesn’t understand at a fundamental level what PCA is doing. Often the situation is appropriate for PCA to be applied (usually dimensionality reduction), but questions are asked about how to use or interpret the output that indicate the person really doesn’t know what’s going on.

This article provides the best explanation of PCA I’ve come across, illustrating the linear algebra in a very intuitive geometrical context. I’ll provide a condensed version of the geometric explanation below without getting into the linear algebra.

Consider a sample of 50 points generated from y=x + noise. The first principal component will lie along the line y=x and the second component will lie along the line y=-x, as shown below.

[Figure: data with principal component vectors]

The aspect ratio messes it up a little, but take my word for it that the components are orthogonal. Applying PCA will rotate our data so the components become the x and y axes:

[Figure: data after PCA transformation]

The data before the transformation are circles, the data after are crosses. In this particular example, the data weren’t rotated so much as flipped across the line y=-2x, but we could have just as easily inverted the y-axis to make this truly a rotation without loss of generality, as described here.

The bulk of the variance, i.e. the information in the data, is spread along the first principal component (which is represented by the x-axis after we have transformed the data). There’s a little variance along the second component (now the y-axis), but we can drop this component entirely without significant loss of information. So to collapse this from two dimensions into one, we let the projection onto the first principal component completely describe our data.

[Figure: PCA dimensionality reduction]

We can partially recover our original data by rotating (ok, projecting) it back onto the original axes.

[Figure: lossy data recovery after PCA transformation]

The dark blue points are the “recovered” data, whereas the empty points are the original data. As you can see, we have lost some of the information from the original data, specifically the variance in the direction of the second principal component. But for many purposes, this compressed description (using the projection along the first principal component) may suit our needs.

Here’s the code I used to generate this example in case you want to replicate it yourself. If you reduce the variance of the noise term in the rnorm() call, the amount of information lost in the PCA transformation will decrease as well, because the data will converge onto the first principal component:

set.seed(123)
n = 50               # n and x were not defined in the original snippet;
x = runif(n, -1, 1)  # assuming 50 points spread over [-1, 1], so plots may differ slightly from the figures
y2 = x + rnorm(n, 0, .2)   # y = x + noise
mydata = cbind(x, y2)
m2 = colMeans(mydata)

p2 = prcomp(mydata, center=F, scale=F)          # principal components
reduced2 = cbind(p2$x[,1], rep(0, nrow(p2$x)))  # keep only the first component's scores
recovered = reduced2 %*% p2$rotation            # map back onto the original axes

# Plot 1: original data with the principal component directions overlaid
plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Data with principal component vectors')
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+abs(p2$rotation[1,1]),
       y1=m2[2]+abs(p2$rotation[2,1]),
       col='red')
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+p2$rotation[1,2],
       y1=m2[2]+p2$rotation[2,2],
       col='blue')

# Plot 2: original data (circles) and transformed data (crosses), with component vectors
plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Data after PCA transformation')
points(p2$x, col='black', pch=3)
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+abs(p2$rotation[1,1]),
       y1=m2[2]+abs(p2$rotation[2,1]),
       col='red')
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+p2$rotation[1,2],
       y1=m2[2]+p2$rotation[2,2],
       col='blue')
arrows(x0=mean(p2$x[,1]), y0=0,
       x1=mean(p2$x[,1]), y1=1,
       col='blue')
arrows(x0=mean(p2$x[,1]), y0=0,
       x1=-1.5, y1=0,
       col='red')
lines(x=c(-1,1), y=c(2,-2), lty=2)

# Plot 3: transformed data collapsed onto the first principal component
plot(p2$x, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='PCA dimensionality reduction')
points(reduced2, pch=20, col="blue")
for(i in 1:n){
  lines(rbind(reduced2[i,], p2$x[i,]), col='blue')
}

# Plot 4: "recovered" data (solid points) vs. original data (circles)
plot(mydata, xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), main='Lossy data recovery after PCA transformation')
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+abs(p2$rotation[1,1]),
       y1=m2[2]+abs(p2$rotation[2,1]),
       col='red')
arrows(x0=m2[1], y0=m2[2],
       x1=m2[1]+p2$rotation[1,2],
       y1=m2[2]+p2$rotation[2,2],
       col='blue')
for(i in 1:n){
  lines(rbind(recovered[i,], mydata[i,]), col='blue')
}
points(recovered, col='blue', pch=20)

The bulk of this post was scavenged from a response I provided on CrossValidated.

Programming A Simple Reddit Bot (in Python)

Almost a year ago I built a reddit bot called VideoLinkBot that has been running on an old half-broken laptop in my room ever since. VLB was a great exercise in agile development that I started to write a post about a while ago but abandoned. The gist of it is that the bot only satisfied a small portion of my vision when it first went live, but I didn’t want to delay “bringing my product to market.” This turned out to be a good decision: due mainly to time constraints, it took quite some time for me to incorporate most of the features I originally planned for the bot, many of which were deprioritized in my development path in favor of things my “end users” (redditors) specifically requested.

Courtesy of the praw library, building a simple bot is really easy. As an exercise, I wrote a clone of the now-disabled “linkfixerbot” to illustrate a simple bot template for another redditor, since VLB was fairly complex. Instead of discussing VLB at length (I may do so at a later date), I’ll describe a general strategy for programming a reddit bot via the LinkFixerBot example.

LinkFixerClone

(Here’s the complete bot code)

On reddit, it’s not uncommon for a comment to include internal links to either a subreddit or a user’s overview page. Because these links are so common, reddit markdown interprets the syntax “/u/username” as a shortlink to a user’s history page, and “/r/subredditname” as a shortlink to a subreddit. Sometimes users either don’t realize that this markdown feature exists (less of an issue now than when the feature was first rolled out) or simply neglect the leading forward slash, writing something like “r/subredditname” instead of proper markdown. LinkFixerBot was a bot that scanned reddit for these mistakes and responded with the repaired links.

We can break down the bot into three activities:

1. A component that scans the reddit comments feed.

2. A component that determines if a particular comment is a candidate for bot action.

3. A component that performs the bot’s signature activity in the form of a comment response.

There are many types of reddit bots, but I think what I’ve outlined above describes the most common. Some other kinds of bots may be moderator bots, or bots that mirror the contents of a specific subreddit.

The LinkFixerBot code should largely speak for itself: the while loop scans the reddit comment stream, passing each comment to the “check_condition” function, which determines whether the bot should do anything. If the bot’s services are required, the pre-processing results from check_condition are passed into the bot_action function, which posts the bot’s comment response. Rinse and repeat.
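For reference, here’s a minimal sketch of that template. This isn’t LinkFixerClone’s actual code (that’s linked above); it assumes a recent version of praw, whose authentication API has changed since this post was written, and the credentials, regex, and reply text are placeholders.

import re
from collections import deque

import praw

# Placeholder credentials; current praw versions authenticate via OAuth.
reddit = praw.Reddit(client_id="...", client_secret="...",
                     username="LinkFixerClone", password="...",
                     user_agent="LinkFixerClone sketch")

# Matches "r/foo" or "u/foo" not already preceded by a slash or word character.
BROKEN_LINK = re.compile(r'(?<![\w/])([ru]/\w+)')

seen = deque(maxlen=200)   # short cache so we never process the same comment twice

def check_condition(comment):
    # Return the broken shortlinks found in a comment, if any.
    return BROKEN_LINK.findall(comment.body)

def bot_action(comment, matches):
    # Reply with the repaired links.
    fixed = ' '.join('/' + m for m in matches)
    comment.reply("Did you mean: " + fixed)

for comment in reddit.subreddit("all").stream.comments():
    if comment.id in seen:
        continue
    seen.append(comment.id)
    matches = check_condition(comment)
    if matches:
        bot_action(comment, matches)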

Some items worth noting:

  • It’s a good idea to maintain a short cache of comments the bot has already looked at to prevent duplicated work. LinkFixerClone uses a fixed-length deque so when new comments come in, they push the old comments out. The deque length is set to twice the number of comments the bot can call down from reddit at a time.
  • You’re going to need to include exception handling. Some special cases to consider (that are probably less of a problem for most bots than for VLB):
    • A comment or submission of interest has been deleted
    • A comment or submission is sufficiently old that it has been archived
    • Reddit is down (consider adding a progressively increasing “back-off” time until your next attempt; see the back-off sketch below).
    • Your bot’s username doesn’t have enough karma to post comments as frequently as it wants to.

This last case can actually be remedied, at least to some extent, by just using the account and accruing karma.
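For the “reddit is down” case, the progressively increasing back-off might look something like the following sketch. The names run_with_backoff and stream_fn are made up for illustration; stream_fn would be the bot’s main loop from the sketch above.

import time

def run_with_backoff(stream_fn, max_wait=600):
    # Wrap the bot's main loop; double the wait after each failure, reset after success.
    wait = 5
    while True:
        try:
            stream_fn()
            wait = 5
        except Exception:  # in practice, catch praw's server-error exceptions specifically
            print("reddit appears to be down; retrying in %d seconds" % wait)
            time.sleep(wait)
            wait = min(wait * 2, max_wait)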

The reddit API rules discuss the appropriate request rate, but this is actually not a concern for us because praw handles that.

The Best of The Best on CrossValidated

CrossValidated is the StackExchange Q&A site focused on statistics. I mainly participate on CV answering questions (http://stats.stackexchange.com/users/8451/david-marx). One somewhat alternative use of SE sites that I like is to “stalk” interesting users. I started this practice on StackOverflow (the very popular programming Q&A site), where I’d occasionally stumble across very high-reputation users answering topics with tags I was interested in, so I’d check their answer history to see what other interesting things they’d posted. This was interesting, but because users on SO have such varied skills, you never know what you’re going to get: maybe I visited a user’s page because they were answering a Python question, but it turns out most of their activity is about some other technology I’m not interested in.

On the other hand, because the scope of CV is so much narrower, an analogous investigation on CV almost always produces very interesting results (in general, everyone on CV has a background and interests that intersect with my own at least a little). SE has a very interesting tool available called the Data Explorer, a T-SQL interface to their database. As I happen to be a database jockey for my day job, I decided to make my CV user investigation more efficient by leveraging this tool.

Here’s a query I built that returns a list of CV users who have authored answers scoring >= 25 points. The approach I chose satisfied me for a couple of reasons:

  1. I’m not just interested in reputation. Reputation is largely a function of activity on the site: if you answer a lot of questions but never achieve a very high score, the sheer volume of your answers will give you high reputation. Similarly, asking a lot of questions will boost your reputation, and I’m specifically interested in people with good answers.
  2. I’m not just interested in the highest-scoring responses on the site. I want reading material: theoretically, I would like to read the highest-scoring answers on the site, but I’m lazy: I’d like to be able to visit a user’s profile and have several interesting answers to read, not just one.

In particular, to satisfy the second criterion, my query sorts users in a fashion that is not entirely obvious when glancing at the results: instead of sorting by the user’s reputation, their answer reputation, their max answer score, or the number of high-scoring answers, I used a combined index: the results are sorted by the highest answer score the user achieved multiplied by their number of answers above the score threshold (i.e. scoring at least 25). I like this approach for the following reasons:

  • People with lots of good answers should bubble up to the top.
  • As the number of good answers goes down, the max score that user has achieved becomes increasingly relevant.
  • Users with just one good answer should sink to the bottom unless their answer was really good.

SELECT MAX(p.score) AS max_score,
       COUNT(p.id) AS good_answers,
       (SELECT COUNT(*) FROM posts p2
        WHERE p2.posttypeid = 2 AND p2.owneruserid = u.id) AS total_answers,
       u.reputation,
       --u.displayname,
       'site://users/' + CAST(u.id AS nvarchar)
           + '?tab=answers&sort=votes|'
           + u.displayname AS cv_user,
       u.age, u.location, u.websiteurl,
       u.lastaccessdate
FROM users u, posts p
WHERE u.id = p.owneruserid
  AND p.posttypeid = 2   -- answers only
  AND p.score >= 25
  AND p.parentid <> 1337 -- ignore jokes
  AND p.parentid <> 726  -- ignore quotes
GROUP BY u.reputation, u.displayname,
         u.age, u.location, u.websiteurl, u.lastaccessdate, u.id
ORDER BY MAX(p.score) * COUNT(p.id) DESC

Winter Break Project: Building Minesweeper from Scratch

I don’t take a ton of vacation from my work, but luckily I don’t really need to over the winter: my employer gives us Christmas through New Year’s off, which comes out to around 10 days of vacation. I had planned to do some travelling, but instead tagged along with my girlfriend to do some family stuff for New Year’s. Since I was in town for most of the break, I spent my time just chilling (which, frankly, was awesome) and coding.

I’ve been programming for a few years now, both for fun and for work, but my programs are usually either scripts or backend (database) programs. I never got around to learning GUI programming, so I decided to use this opportunity to build a GUI program. I had recommended some minesweeper-related challenges to a friend who’s learning programming, so I had minesweeper on the brain. I ended up coding the main game out of boredom one day and decided to slap a GUI on top of it.

Whenever I’ve seen GUI tutorials online, for simplicity’s sake they combine the main program code with the GUI code. I guess this makes sense for a quick tutorial, but my understanding is that this is pretty weak GUI programming. Generally, you want to separate the “model”, “controller”, and “view” components.

For my application, my model was composed of a Board class, which is a container for Tile objects (another class I defined). The controller was a Game class with methods like “click” and “flip_flag.” I wanted the game component (the model and controller) to operate just fine in isolation from the GUI, so my Game class also has a play() method for playing minesweeper in the terminal. Having built the working game components, I could then slap a GUI on top in a separate “view” component.
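A skeleton of how those pieces might fit together is below. The class and method names come from the description above, but the bodies are simplified placeholders rather than my actual implementation, and the sketch is written in Python 3 for brevity.

class Tile(object):
    def __init__(self, is_mine=False):
        self.is_mine = is_mine
        self.revealed = False
        self.flagged = False

class Board(object):
    # Model: just a grid of Tile objects.
    def __init__(self, rows, cols, n_mines):
        self.rows, self.cols = rows, cols
        self.grid = [[Tile() for _ in range(cols)] for _ in range(rows)]
        # ... randomly place n_mines mines and compute neighbor counts ...

class Game(object):
    # Controller: all of the game logic, with no knowledge of the GUI.
    def __init__(self, rows=9, cols=9, n_mines=10):
        self.board = Board(rows, cols, n_mines)

    def click(self, row, col):
        # Reveal a tile; return False if the player hit a mine.
        tile = self.board.grid[row][col]
        tile.revealed = True
        return not tile.is_mine

    def flip_flag(self, row, col):
        tile = self.board.grid[row][col]
        tile.flagged = not tile.flagged

    def play(self):
        # Bare-bones terminal loop, so the game runs with no GUI at all.
        while True:
            row, col = (int(v) for v in input("row col> ").split())
            if not self.click(row, col):
                print("Boom.")
                break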

The most common GUI toolkits for Python are wx, Qt, and Tk. Tk is built into Python as Tkinter, has a reputation for being simple to use, has plenty of documentation, and has the added benefit of taking advantage of the native OS’s aesthetics, so I decided to go with Tkinter.

Now, although there are lots of Tkinter tutorials, there doesn’t seem to be a ton of consistency in how people use Tkinter. In general, the main GUI is defined as a class called “App,” but sometimes the App class inherits from the Tk class and sometimes it inherits from the Frame class. I started from an example that had my App inheriting from Frame, and this worked fine. Later I wanted to implement a feature where the buttons would be hidden when the game was paused, and I found that this required a lot of changes as long as the app inherited from Frame. I changed my app to inherit from the Tk class and it was a lot easier to add that feature. I think inheriting from Tk makes more sense intuitively as well.
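For reference, the “App inherits from Tk” pattern looks roughly like this. It’s a generic skeleton that reuses the hypothetical Game class from the sketch above, not my actual view code.

import tkinter as tk   # the module is named Tkinter in Python 2

class App(tk.Tk):
    def __init__(self, game):
        tk.Tk.__init__(self)
        self.title("Minesweeper")
        self.game = game
        self.buttons = {}
        for r in range(game.board.rows):
            for c in range(game.board.cols):
                btn = tk.Button(self, width=2,
                                command=lambda r=r, c=c: self.on_click(r, c))
                btn.grid(row=r, column=c)
                self.buttons[(r, c)] = btn

    def on_click(self, r, c):
        # Delegate to the controller; the view only updates widgets.
        if not self.game.click(r, c):
            self.buttons[(r, c)].config(text="*")

App(Game()).mainloop()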

Anyway, as an exercise, I strongly recommend programming minesweeper. Next time I need to learn a new language, I plan to use minesweeper as a comprehensive exercise. Completing it requires working with several primitive types and functions, defining classes, using threads (for the timer), and of course building the GUI.

Downloading images from saved reddit links

Reddit user aesptux posted to /r/learnprogramming requesting a code review of their script to download images from saved reddit links. This user made the same mistake I originally made and tried to work with the raw reddit JSON data. I hacked together a little script to show them how much more quickly they could accomplish their task using praw. It took almost no time at all, and the meat of the code is only about 20 lines. What a great API.

Here’s my code: https://gist.github.com/4225456
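The gist above is the real script; the core idea looks something like the following sketch, written against a current version of praw (the API has changed since the gist), with placeholder credentials.

import os
import urllib.request

import praw

reddit = praw.Reddit(client_id="...", client_secret="...",
                     username="...", password="...",
                     user_agent="saved-image-downloader sketch")

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")
os.makedirs("saved_images", exist_ok=True)

for item in reddit.user.me().saved(limit=None):
    url = getattr(item, "url", "")               # saved comments have no url attribute
    if url.lower().endswith(IMAGE_EXTENSIONS):
        filename = os.path.join("saved_images", url.split("/")[-1])
        urllib.request.urlretrieve(url, filename)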

Idea: graphing Wikipedia

I’m almost certain something like this has been done before. Anyway, here’s the idea:

  1. Download the Wikipedia database dump.
  2. Ingest article texts into a database.
  3. Scrape Wikipedia links out of the first paragraph of each article.
  4. Create a directed graph of articles where two articles share an edge if they are linked as described in (3). Treat article categories as node attributes.
  5. Investigate the community structure of Wikipedia articles, particularly which categories cluster together.
  6. Extra challenge: try to find articles that won’t “get you to philosophy.”

There are currently over 4M articles in the English Wikipedia, so for this to be feasible I will probably need to invent some criterion for including articles in the project: probably minimum length, minimum age, or minimum number of edits. Alternatively, I might just focus on certain categories/subcategories.
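A rough sketch of steps (3) and (4), assuming the article wikitext is already sitting in a database and using networkx for the graph. The link-extraction regex here is naive and would need refinement for a real run.

import re
import networkx as nx

# Naive wikitext link pattern: matches [[Target]] and [[Target|display text]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def first_paragraph_links(wikitext):
    first_para = wikitext.strip().split("\n\n")[0]
    return [m.strip() for m in WIKILINK.findall(first_para)]

def build_graph(articles):
    # articles: iterable of (title, wikitext, categories) rows pulled from the database.
    G = nx.DiGraph()
    for title, wikitext, categories in articles:
        G.add_node(title, categories=categories)
        for target in first_paragraph_links(wikitext):
            G.add_edge(title, target)
    return G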