Idea: graphing Wikipedia

I’m almost certain something like this has been done before. Anyway, here’s the idea:

  1. Download the Wikipedia database dump.
  2. Ingest the article texts into a database.
  3. Scrape the Wikipedia links out of the first paragraph of each article.
  4. Create a directed graph of articles, with an edge from article A to article B whenever A’s first paragraph links to B, as described in (3). Treat article categories as node attributes.
  5. Investigate the community structure of Wikipedia articles, particularly which categories cluster together.
  6. Extra challenge: try to find articles that won’t “get you to philosophy”.
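
Steps 3, 4 and 6 could be sketched roughly like this in Python. This is only an illustration under some loud assumptions: the article texts are already available as a dict of title to raw wikitext, the "first paragraph" is simply the first non-empty block before a blank line (in real dumps that block is often an infobox template), and link titles are matched exactly, ignoring redirects and first-letter case. The function names are made up for this sketch.

```python
import re
from collections import defaultdict, deque

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 is the target title only.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Assumption: links into these namespaces are not article links.
SKIPPED_PREFIXES = ("File:", "Image:", "Category:")


def first_paragraph_links(wikitext):
    """Return link targets from the first non-empty paragraph (step 3)."""
    for paragraph in wikitext.split("\n\n"):
        if paragraph.strip():
            targets = (m.group(1).strip() for m in WIKILINK.finditer(paragraph))
            return [t for t in targets if t and not t.startswith(SKIPPED_PREFIXES)]
    return []


def build_graph(articles):
    """Map each title to the known titles its first paragraph links to (step 4)."""
    return {
        title: {t for t in first_paragraph_links(text) if t in articles}
        for title, text in articles.items()
    }


def cannot_reach(graph, target):
    """Return the titles with no directed path to `target` (step 6)."""
    # Reverse every edge, then walk backwards from the target.
    reverse = defaultdict(set)
    for source, targets in graph.items():
        for t in targets:
            reverse[t].add(source)
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for predecessor in reverse[node]:
            if predecessor not in seen:
                seen.add(predecessor)
                queue.append(predecessor)
    return sorted(set(graph) - seen)
```

Calling `cannot_reach(build_graph(articles), "Philosophy")` would then list the articles that never lead to Philosophy through any first-paragraph link. Note that the usual version of the game follows only the first link of each article, which is a stricter rule than the one sketched here.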

There are currently over 4M articles in the English Wikipedia, so for this to be feasible I will probably need to invent some criterion for including articles in the project, such as a minimum length, minimum age, or minimum number of edits. Alternatively, I might just focus on certain categories/subcategories.
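
An inclusion criterion of that kind might look something like the following Python sketch. Everything here is hypothetical: the `ArticleMeta` fields, the `include` function, and any thresholds passed to it are placeholders for illustration, not values I have settled on. Each threshold defaults to 0, so a criterion only applies once it is set.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArticleMeta:
    """Hypothetical per-article metadata pulled from the dump."""
    title: str
    length_chars: int
    created: date
    edit_count: int


def include(meta, today, min_length=0, min_age_days=0, min_edits=0):
    """Return True if the article passes every threshold that has been set."""
    age_days = (today - meta.created).days
    return (
        meta.length_chars >= min_length
        and age_days >= min_age_days
        and meta.edit_count >= min_edits
    )
```

For example, `include(meta, date.today(), min_length=2000)` would keep only articles of at least 2000 characters, whatever their age or edit count.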

2 thoughts on “Idea: graphing Wikipedia”

    • No, I never got around to it. Thanks for the link, that article was interesting! I’m glad the person who tackled this had better distributed computing resources than me 🙂
