Reddit user aesptux posted to /r/learnprogramming requesting a code review of their script to download images from saved reddit links. They made the same mistake I originally made: trying to work with the raw reddit JSON data directly. I hacked together a little script to show them how much more quickly they could accomplish their task using praw. It took almost no time at all, and the meat of the code is only about 20 lines. What a great API.
Here’s my code: https://gist.github.com/4225456
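The core of the approach looks roughly like the sketch below. Note this assumes a recent version of praw — the `praw.Reddit` OAuth constructor and `reddit.user.me().saved()` are assumptions about the current API, not necessarily what the gist used, and the credentials are placeholders:

```python
import os
import urllib.request


def is_image_url(url):
    """Crude extension-based check for direct image links (an assumption)."""
    return url.lower().endswith((".jpg", ".jpeg", ".png", ".gif"))


def download_saved_images(reddit, dest="images"):
    """Fetch the logged-in user's saved items and download any direct images."""
    os.makedirs(dest, exist_ok=True)
    for item in reddit.user.me().saved(limit=None):
        url = getattr(item, "url", None)  # saved comments have no url attribute
        if url and is_image_url(url):
            filename = os.path.join(dest, url.rsplit("/", 1)[-1])
            urllib.request.urlretrieve(url, filename)


if __name__ == "__main__":
    import praw  # pip install praw

    reddit = praw.Reddit(
        client_id="...",  # placeholder credentials
        client_secret="...",
        username="...",
        password="...",
        user_agent="saved-image-downloader",
    )
    download_saved_images(reddit)
```

The filtering and downloading are split out of the praw loop so the interesting part — walking the saved listing — stays one readable block.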
I’m almost certain something like this has been done before. Anyway, here’s the idea:
1. Download the Wikipedia database dump.
2. Ingest article texts into a database.
3. Scrape Wikipedia links out of the first paragraph of each article.
4. Create a directed graph of articles where two articles share an edge if they are linked as described in (3). Treat article categories as node attributes.
5. Investigate community structure of Wikipedia articles, particularly which categories cluster together.
6. Extra challenge: Try to find articles that won’t “get you to philosophy”.
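A first cut at steps 3–4 needs nothing beyond the stdlib. The wikitext link pattern and the dict-of-sets graph below are simplifying assumptions — real dumps have templates, redirects, and namespace links that need more care:

```python
import re
from collections import defaultdict

# Matches [[Target]] or [[Target|display]]; stops at '|', ']' or a '#' section anchor.
LINK_RE = re.compile(r"\[\[([^|\]#]+)")


def first_paragraph_links(wikitext):
    """Extract link targets from the first paragraph of an article's wikitext."""
    first_para = wikitext.strip().split("\n\n", 1)[0]
    return [m.strip() for m in LINK_RE.findall(first_para)]


def build_graph(articles):
    """articles: dict of title -> wikitext. Returns a directed adjacency dict."""
    graph = defaultdict(set)
    for title, text in articles.items():
        for target in first_paragraph_links(text):
            if target in articles:  # only keep edges inside our corpus
                graph[title].add(target)
    return graph
```

From there, a library like networkx can wrap the adjacency dict in a `DiGraph` and supply the community-detection algorithms for step 5.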
There are currently over 4M articles in the English Wikipedia, so for this to be feasible I will probably need some criterion for including articles in the project — minimum length, minimum age, or minimum edit count. Alternatively, I might just focus on certain categories/subcategories.
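Whichever criterion wins out, it is easy to keep it pluggable. A hedged sketch — the `Article` fields and the threshold values are made up for illustration; a real dump or API exposes different names:

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical metadata fields, for illustration only.
    title: str
    text: str
    age_days: int
    edit_count: int


def include(article, min_length=1000, min_age_days=365, min_edits=50):
    """Inclusion criterion: keep articles that are long, old, and well-edited."""
    return (
        len(article.text) >= min_length
        and article.age_days >= min_age_days
        and article.edit_count >= min_edits
    )
```

Swapping in a category-based filter would just mean passing a different predicate to the ingestion step.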