Weekend project: Reddit Comment Scraper in Python

(The methodology described below works, but is not as easy as the preferred alternative method using the praw library. If you’re here because you want to scrape reddit, please reference this post, which describes a simpler, faster method)

This weekend, I built a webscraper that downloads comments from reddit. It was a good exercise and I’m quite pleased with my work, but I hesitate to refer to my tool as a “scraper”: reddit makes it easy to access JSON output, which is very, very easy to work with, and when I think of webscraping I think of more involved screen capture and parsing, such as with a package like Beautiful Soup. Maybe it would be more appropriate to call my tool a “miner,” but it doesn’t do any analysis, it just grabs data. Anyway, enough semantics.

Here’s the code on GitHub. Please bear in mind that this is still in development, but it works well enough for now. Although I haven’t added a formal license file in the repo, all project code is licensed CC BY-SA.

Generalization of the steps in my development process:

  1. Determine how to open and download web pages
  2. Determine how to get JSON out of reddit
  3. Determine how to generate appropriate URLs for pagination
  4. Determine how to extract pertinent data from the downloaded pages
  5. Determine appropriate halt points for the download process
  6. Add command line parameters
  7. Implement a data storage solution

1. Downloading Webpages

This was the easy part. There are a lot of fairly generic solutions to this in python. mechanize is a nice library for web browsing with python, but it’s more full-featured than what I need. urllib2 is closer to the metal, but still very simple and lightweight. urllib is even simpler, but it doesn’t make it easy to set custom headers, and the reddit API documentation specifies that they want to see a descriptive User-agent header. I’m not really using the API, but I decided to include a header in my request anyway to play nice with those reddit kids.

from urllib2 import Request, urlopen

_URL = 'http://www.reddit.com/'
_headers = {'User-agent':'myscript'}

request = Request(_URL, headers=_headers)  # pass headers by keyword; the second positional argument of Request is 'data'
response = urlopen(request)
data = response.read()

Honesty time: I don’t know a ton about the HTTP protocol. I know there are POST requests and GET requests. A GET is a simple “Hey, can I see that webpage?” whereas a POST is more like “I want to interact with that webpage” (like completing a form). The above represents a GET request, which is all I needed for my tool; if you want to use urllib2 to make a POST, just add a ‘data’ parameter in the call to urlopen or when you’re generating the ‘Request’ object. I wish I understood better what was going on here, but my understanding is that urlopen() creates a connection, and the read() method actually pulls data across the connection. urlopen() will throw an error if there’s an issue with the request or forming the connection, but the read() method is where I think you’re most likely to see an error.
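To illustrate the difference, here’s a rough sketch of what a POST with urllib2 might look like. The URL and form fields are made up for illustration; the point is that supplying ‘data’ is what turns the request into a POST:

from urllib import urlencode
from urllib2 import Request, urlopen

# hypothetical form fields -- supplying 'data' makes this a POST instead of a GET
post_data = urlencode({'user': 'myname', 'passwd': 'secret'})
request = Request('http://www.example.com/login', data=post_data,
                  headers={'User-agent': 'myscript'})
response = urlopen(request)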

Reddit doesn’t like people to make requests too frequently, so when I wrapped the above code in a function I included a default 2 second delay before issuing the GET request. The request is wrapped in a ‘try’ block that increases the delay and retries once if an exception is raised. I think I’ve seen instances where the exception wasn’t raised until the bad response hit my code’s JSON parser, but you get the idea; I can modify it later as needed. Here’s my function:

from time import sleep  # sleep() provides the polite delay between requests

def get_comments(URL, head, delay=2):
    '''Pretty generic call to urllib2.'''
    sleep(delay) # ensure we don't GET too frequently or the API will block us
    request = Request(URL, headers=head)
    try:
        response = urlopen(request)
        data = response.read()
    except:
        # back off and retry once if the request or the read fails
        sleep(delay + 5)
        response = urlopen(request)
        data = response.read()
    return data
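A quick usage sketch, jumping ahead a bit to the user-comments URL covered below:

# grab the first page of a user's comments as raw JSON text
json_data = get_comments('http://www.reddit.com/user/UserName/comments/.json',
                         {'User-agent': 'myscript'})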

2. JSON

Doug Hellman has a great tutorial on using JSON with python, so I won’t go into too much depth here. I will say this much though: reddit is very developer friendly and makes their website very easy to scrape. By adding “.json” or “.xml” to the end of any reddit URL (at least to the best of my knowledge), you get structured data in your format of choice. I’ve had some experience with XML and I’m not a huge fan. This project was my first time doing anything with JSON and I knew nothing about JSON going into this project, but even so I found JSON to be extremely intuitive. Basically each reddit “thing” (object, like a comment or link) gets converted into what are effectively nested python dictionaries. Awesome. Here’s a sample of the JSON we get from the reddit comments page. This contains one single comment, but the output is from a set of 25 comments, hence the “before” and “after” tags.

{"kind": "Listing"
,"data":
{"modhash": "", "children":
[
{"kind": "t1", "data":
{"subreddit_id": "t5_2qh1e"
, "link_title": "This is How Hot it is in Iraq"
, "banned_by": null
, "link_id": "t3_ymy8w"
, "likes": null
, "replies": null
, "id": "c5x77kx"
, "author": "shaggorama"
, "parent_id": "t1_c5x54ci"
, "approved_by": null
, "body": "I don't know where this soldier is, but t[he high in Baghdad today was 106^oF](http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008)\n\nEDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually [105^oF](http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA)"
, "edited": 1345668492.0
, "author_flair_css_class": null
, "downs": 0
, "body_html": "<div class=\"md\"><p>I don't know where this soldier is, but t<a href=\"http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008\">he high in Baghdad today was 106<sup>oF</sup></a></p>\n\n<p>EDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually <a href=\"http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA\">105<sup>oF</sup></a></p>\n</div>"
, "subreddit": "videos"
, "name": "t1_c5x77kx"
, "created": 1345667922.0
, "author_flair_text": null
, "created_utc": 1345667922.0
, "num_reports": null, "ups": 3}
}]
, "after": "t1_c5wcp43"
, "before": "t1_c5x77kx"
}
}
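Once that JSON is decoded (more on that in section 4), pulling fields out is just dictionary indexing. A rough illustration of the nesting, assuming the decoded object is named ‘decoded’:

# each child in the listing is a "thing"; the useful fields live under its 'data' key
first_comment = decoded['data']['children'][0]['data']
print first_comment['author']   # "shaggorama"
print first_comment['body']     # the comment text itself
print decoded['data']['after']  # thing id used to request the next page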

3. URL Hacking

Originally I had expected this foray to familiarize me with the reddit API. After digging through it a bit though I got the impression that the API is really for app developers and not webscrapers. I just assumed there would be some API call I could use to get comments for a user, but my research into the reddit API documentation revealed no such functionality. This meant I would have to pull the data down directly from the website. The first 25 comments are easy enough to grab:

http://www.reddit.com/user/UserName/comments/

This is converted to easy-to-navigate JSON by adding “.json” to the end of the URL, like so:

http://www.reddit.com/user/UserName/comments/.json

The URL for the next 25 comments is a little more complicated:

http://www.reddit.com/user/UserName/comments/?count=25&after=XX_XXXXXXX

The number after “count” is the number of comments we’ve already seen, so we’ll have to keep track. The “after” parameter needs the reddit “thing id” of the last comment we saw. I had expected to just be able to tack “.json” onto the end of the above URL, but when I did that, reddit interpreted it as part of the thing id and just sent me back the top 25 comments in HTML since it didn’t understand my request. After experimenting for a few minutes, I figured it out; here’s what the JSON request needs to look like:

http://www.reddit.com/user/UserName/comments/.json?count=25&after=XX_XXXXXXX
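Here’s a rough sketch of how that URL construction might look in code (the function name is just for illustration, not necessarily what’s in the repo):

def build_comments_url(username, count=0, after=None):
    '''Build the .json URL for a user's comment listing, optionally paginated.'''
    url = 'http://www.reddit.com/user/%s/comments/.json' % username
    if after:
        url += '?count=%d&after=%s' % (count, after)
    return url

# first page, then the next 25 comments
build_comments_url('UserName')
build_comments_url('UserName', count=25, after='t1_c5x77kx')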

4. Parsing JSON

Again, I wish I understood what I was doing here better and direct you to Doug Hellman’s JSON tutorial. I may not understand exactly what I’m doing, but I’ve got my functional code so I’m happy.

Usage of the python json library usually seems to start with a call to json.loads() to decode JSON or json.dumps() to encode it. I don’t remember where I found this, but the solution I landed on was the decode() method of the JSONDecoder class. You don’t even need to hold onto an instance of the class; you can construct one and call decode() in a single expression, so here’s what I ended up with:

import json

decoded = json.JSONDecoder().decode(json_data)
comments = [x['data'] for x in decoded['data']['children']]

For convenience, I wrapped my parser in a function that returns the ‘comments’ list as well as a list of comment ids.
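Something along these lines (a sketch; the function name is illustrative and may not match the repo exactly):

import json

def parse_comments(json_data):
    '''Decode one page of JSON and return the comment dicts plus their ids.'''
    decoded = json.JSONDecoder().decode(json_data)
    comments = [x['data'] for x in decoded['data']['children']]
    ids = [c['id'] for c in comments]
    return comments, ids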

5. Halting the Download

At this point I basically had a working prototype. I built my tool to output some information to the screen as the download progressed, and I noticed that after 999 comments, reddit stopped giving me data. I had built my tool to halt when it started seeing comments it had already downloaded (since that’s what the HTML website does) but apparently the JSON output just gives you null data. I decided to also add an optional download limit to my scraper tool, so I wrote in three conditions to stop downloading: receipt of data we had already downloaded, receipt of null data, or reaching a defined limit on downloads.

The above observation might seem sufficiently trivial that I shouldn’t have mentioned it, but I think it’s important to think about these kinds of stop points, test them, and keep an eye on them in live code when you’re scraping. Otherwise you might just eat up unnecessary bandwidth and spam your target website with requests.
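Sketched out, the download loop with those three stop conditions might look something like this. It leans on the hypothetical helpers sketched earlier (build_comments_url and parse_comments), so it’s an illustration rather than the exact code in the repo:

def scrape(user, limit=0, wait=2):
    '''Download a user's comments until a stop condition is hit (sketch).'''
    headers = {'User-agent': 'myscript'}
    seen_ids = set()
    all_comments = []
    after = None
    while True:
        url = build_comments_url(user, count=len(all_comments), after=after)
        comments, ids = parse_comments(get_comments(url, headers, wait))
        if not comments:                         # null/empty data: reddit has nothing more to give
            break
        if any(i in seen_ids for i in ids):      # we've started seeing comments we already downloaded
            break
        all_comments.extend(comments)
        seen_ids.update(ids)
        if limit and len(all_comments) >= limit: # user-defined download limit reached
            break
        after = comments[-1]['name']             # thing id of the last comment on this page
    return all_comments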

6. Command Line Parameters

For this I used the argparse library. Unlike urllib2 and json, this bit wasn’t a learning experience for me. I’ve used this library in the past and frankly, I fucking love it. It’s easy to use and generates a very nice ‘help’ page for free. The scripts I’ve used it in before were all smaller and for fairly narrow use at work. For this project I used the “if __name__ == ‘__main__’:” test to simplify debugging, and I originally dropped my command line argument handling right under my library imports. The result was that the argument parsing ran as soon as the module was loaded, so my code would error out because I hadn’t provided parameters that argparse saw as required. Lesson learned: command line argument handling goes inside the “main” block, not up with the function definitions. Here’s the code; it’s pretty self-explanatory:

import argparse

if __name__ == '__main__':
    ### Commandline argument handling ###

    parser = argparse.ArgumentParser(description="Scrapes comments for a reddit user. Currently limited to most recent 999 comments (limit imposed by reddit).")
    parser.add_argument('-u','--user', type=str, help="Reddit username to grab comments from.", required=True)
    parser.add_argument('-l','--limit', type=int, help="Maximum number of comments to download.", default=0)
    parser.add_argument('-d','--dbname', type=str, help="Database name for storage.", default='RedditComments.DB')
    parser.add_argument('-w','--wait', type=int, help="Wait time in seconds between GET requests. Reddit documentation requests a limit of 1 every 2 seconds, not to exceed 30 per minute.", default=2)

    args = parser.parse_args()
    _user   = args.user
    _limit  = args.limit
    _dbname = args.dbname
    _wait   = args.wait

    comments = scrape(_user, _limit, _wait)
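Assuming the script is saved as something like redditscrape.py (the filename here is just for illustration), invoking it looks like:

python redditscrape.py -u UserName -l 100 -w 2
python redditscrape.py --help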

7. Saving The Results

I had built a successful comment scraper! The one problem was that my work was evaporating into the ether: the comments would download, then be lost as soon as the program exited. I needed to save them somehow. One very simple solution would be to pickle the comments. This would make it easy to get the data back into python, but I won’t be doing my analyses in the python environment. I know my way around excel, R and octave, but I haven’t gotten around to learning numpy/scipy/scikit-learn yet. I may make that part of this project, in which case I could have my tool automatically spit out some graphs after the download, which would be pretty neat.

My very, very strong preference over pickling would be to store the data in a database. This is really the obvious solution, but unfortunately, although I have been neck deep in SQL for the past two years at work, I don’t really have any experience working with databases from inside python. I have used python to play with CSVs, so for the time being I’m dumping the data into a CSV using the csv module’s DictWriter() class. This unfortunately wasn’t as straightforward as it should have been because the csv module writes ASCII and my decoded JSON was in unicode. I won’t break it down for you, but here’s how I dealt with the unicode (I totally found this solution somewhere and modified it to suit my needs, but I can’t remember where I found it. Probably StackOverflow):

writer.writerow(dict((k, v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in comment.iteritems()))
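In context, the CSV dump looks roughly like this (a sketch; the function name and field handling are illustrative and may differ from the repo):

import csv

def write_comments_csv(comments, filename='comments.csv'):
    '''Dump a list of comment dicts to a CSV, encoding unicode values as UTF-8.'''
    with open(filename, 'wb') as f:
        writer = csv.DictWriter(f, fieldnames=sorted(comments[0].keys()), extrasaction='ignore')
        writer.writeheader()
        for comment in comments:
            writer.writerow(dict((k, v.encode('utf-8') if isinstance(v, unicode) else v)
                                 for k, v in comment.iteritems()))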

I’m not a big fan of this solution for a myriad of reasons. My next step will be implementing database storage. Right now I’m using this as an excuse to learn the sqlalchemy library, but considering how familiar SQL is to me, I may go lower level and just do it with the much lighter-weight, albeit RDBMS-specific, sqlite3 module.
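For the curious, the sqlite3 route might look something like this (a minimal sketch; the table schema and column choices are just illustrative, not a design decision I’ve committed to):

import sqlite3

def save_to_db(comments, dbname='RedditComments.DB'):
    '''Store comment dicts in a local SQLite database (sketch).'''
    conn = sqlite3.connect(dbname)
    conn.execute('''CREATE TABLE IF NOT EXISTS comments
                    (id TEXT PRIMARY KEY, author TEXT, subreddit TEXT,
                     body TEXT, created_utc REAL, ups INTEGER, downs INTEGER)''')
    conn.executemany('INSERT OR IGNORE INTO comments VALUES (?,?,?,?,?,?,?)',
                     [(c['id'], c['author'], c['subreddit'], c['body'],
                       c['created_utc'], c['ups'], c['downs']) for c in comments])
    conn.commit()
    conn.close()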