Scraping your twitter home timeline with python and mongodb

Background

About a year and a half ago I was hanging out with two colleagues, John and Jane. John and I were discussing various new happenings we’d heard about recently. Jane was very impressed with how current we were and wondered how we did it. I described how I subscribe to several blogs and that suits me fine, but John insisted that we both needed to try Twitter.

I buckled down and finally created a twitter account. I didn’t really know who to follow, so I picked a prominent local data scientist and let him “vet” users for me: I skimmed his “following” list and decided to also follow anyone whose description made them sound reasonably interesting (from a data science standpoint). The problem with this method is that I ended up following a bunch of his random friends who don’t actually talk about data science. Right now, there’s just too much happening on my twitter feed to keep up. If I don’t check it every now and then, I’ll quickly amass hundreds of backlogged tweets, so I have strong motivation to “trim the fat” of my following list.

Setup

To get started, I’m going to explain how to scrape your twitter homepage. But first things first, we’re going to need a few things:

Twitter API wrapper

There are several python twitter API wrappers available right now. I did some research back when I first started tinkering with twitter and landed on the Twython package. I don’t remember what led me to it, but I think the main thing is that it has a strong community and so there’s a lot of good documentation and tutorials describing how to use it.

To install Twython, just use pip like you would for most anything else:

pip install twython

No surprises here.

Twitter API authentication

We’re going to need to do two things to get our scraper working with twitter. First, we need to register a new app at http://apps.twitter.com. If your desired app name is taken, just add your username to make it unique. It’s not mentioned anywhere on the page, but you can’t have the ‘@’ symbol in your app name (or at least, it can’t be preceded by the ‘@’ symbol).

Next, register an access token for your account. It only needs to have read-only permissions, and keeping it this way ensures we won’t do any real damage with our experiment.

Finally, store the authentication information in a config file (I called mine “scraper.cfg”) like so:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

MongoDB

Finally, we’re going to need to set up a repository to persist the content we’re scraping. My MO is usually to just use SQLite and to maybe define the data model using SQLAlchemy’s ORM (which is totally overkill but I still do it anyway for some reason). The thing here though is:

1. There’s a lot of information on tweets

2. I’m not entirely sure which information I’m going to find important just yet

3. The data model for a particular tweet is very flexible and certain fields may appear on one tweet but not another.

I figured for this project, it would be unnecessarily complicated to do it the old fashioned way and, more importantly, I’d probably be constantly adding new fields to my data model as the project developed, rendering my older scrapes less valuable because they’d be missing data. So to capture all the data we might want, we’re going to just drop the tweets in toto into a NoSQL document store. I chose mongo because I’d heard a lot about it; it suits my needs perfectly and is very easy to use, although querying it uses a paradigm that I’m still getting used to (we’ll get to that later).

Download and install MongoDB from http://docs.mongodb.org/manual/installation/.
I set the data directory to be on a different (larger) disk than my C drive, so I start mongo
like this:

C:\mongodb\bin\mongod --dbpath E:\mongodata\db

We will need to run this command to start a mongo listener before running our scraper. Alternatively, you could just drop a system call in the scraper to start up mongo, but you should check to make sure it’s not already running first. I found just spinning up mongo separately to be simple enough for my purposes.
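
If you would rather have the scraper spin mongo up itself, a minimal sketch of that “check first, then start” idea might look like the following (the ensure_mongo_running() helper and the port check are my own assumptions, not part of the scraper we build below):

import socket
import subprocess

def ensure_mongo_running(host='localhost', port=27017):
    """Launch mongod only if nothing is already listening on its default port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.connect((host, port))
        s.close()  # something is already listening; assume it's mongod
    except socket.error:
        # paths match the command above; adjust them for your own install
        subprocess.Popen([r'C:\mongodb\bin\mongod', '--dbpath', r'E:\mongodata\db'])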

Since we’ve already got a config file started, let’s add our database name and collection (NoSQL analog for a relational table) to the config file, so our full config file will look like:

[credentials]
app_key:XXXXXXXXXXXXXX
app_secret:XXXXXXXXXXXXXX
oath_token:XXXXXXXXXXXXXX
oath_token_secret:XXXXXXXXXXXXXX

[database]
name:twitter_test
collection:home_timeline

Take note: all we have to do to define the collection is give it a name. We don’t need to describe the schema at all (which, as described earlier, is part of the reason I’m using mongo for this project).
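
To see what that buys us, here’s a quick illustration (the “schema_demo” collection is just a throwaway name for demonstration purposes): two documents with completely different fields can go into the same collection with zero setup.

from pymongo import Connection

demo = Connection()['twitter_test']['schema_demo']  # hypothetical throwaway collection
demo.insert({'id': 1, 'text': 'a plain tweet'})
demo.insert({'id': 2, 'text': 'a retweet', 'retweeted_status': {'id': 1}})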

Getting Started

So we’re all set up with twython and mongo: time to start talking to twitter.

We start by calling in the relevant configuration values and spinning up a Twython instance:

import ConfigParser
from twython import Twython

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

To get the most recent tweets from our timeline, we hit the “/statuses/home_timeline” API endpoint. We can get a maximum of 200 tweets per call to the endpoint, so let’s do that. I’m also a little data greedy, so I’m going to ask for “contributor details” while we’re at it:

params = {'count':200, 'contributor_details':True}
home = twitter.get_home_timeline(**params)

Now, if we want to do persistent scraping of our home feed, obviously we can’t just wrap this call in a while loop: we need to make sure twitter knows what we’ve already seen so we only get the newest tweets. To do this, we will use the “since_id” parameter to set a limit on how far back in the timeline the tweets in our response will go.
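
As a sketch, assuming “latest” holds the highest tweet id we’ve already stored, that call looks like:

# only return tweets newer than the last id we've already seen
home = twitter.get_home_timeline(count=200, contributor_details=True, since_id=latest)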

Paging and Cursoring

This is going to be a very brief overview of the motivation behind cursoring and how it works. For a more in depth explanation, check the twitter docs here: https://dev.twitter.com/docs/working-with-timelines

Consider a situation in which, since the last call to the timeline, so many new tweets have been written that we can’t get them all in a single call. Twitter has a “paging” option, but if we use this, it’s possible that the tweets on the bottom of one page will overlap with the tweets on the top of the next page (if new tweets are still coming into the timeline). So instead of “paging” we’ll use “cursoring:” in addition to giving twitter a limit for how far back we can go, we’ll also give a limit for the most recent tweet in any particular call. We’ll do this using a “max_id” parameter. The API will still return the tweet with this ID though, so we want to set the max_id value just lower than the last tweet we saw. If you’re in a 64bit environment, you can do this by subtracting ‘1’ from the id.

Putting this all together, here’s what our persistent scraper looks like so far:

latest = None   # most recent id we've seen
while True:
    try:
        newest = None # this is just a flag to let us know if we should update the value of "latest"
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        home = twitter.get_home_timeline(**params)        
        if home:
            while home:
                store_tweets(home) # I'll define this function in a bit
                
                # Only update "latest" if we're inside the first pass through the inner while loop
                if newest is None:
                    newest = True
                    latest = home[0]['id']
                    
                params['max_id'] = home[-1]['id'] - 1
                home = twitter.get_home_timeline(**params)
    except Exception:
        raise   # exception handling (rate limiting, etc.) gets added below

Rate limiting

As with pretty much any web API, twitter doesn’t take too kindly to people slamming their servers. You can read more about the rate limits for the various API endpoints in the twitter developer documentation. Here’s what concerns us:

  • The rate limiting windows are 15 minutes long. Every 15 minutes, the window resets.
  • We can make 15 calls to the statuses/home_timeline endpoint within a given window.
  • If we exceed this threshold, our GET request to the API will return a 429 (“Too many requests”) code that Twython will feed to us as a twython.TwythonRateLimitError exception
  • Twitter provides an API endpoint to query the rate limiting status of your application at application/rate_limit_status.
  • The application/rate_limit_status endpoint is itself rate limited to 180 requests per window.

If we don’t pass in any parameters, the application/rate_limit_status endpoint will return the rate limit statuses for every single API endpoint which is much more data than we need, so we’ll limit the data we get back by constraining the response to “statuses” endpoints:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])

This returns a JSON response from which we only want a particular set of values, so let’s select that bit out:

status = twitter.get_application_rate_limit_status(resources = ['statuses'])
home_status = status['resources']['statuses']['/statuses/home_timeline']        

Finally, we’ll test how many API calls are remaining in the current window; if we’ve run out, we’ll set the application to sleep until the window resets, double-check that we’re ok, and then resume scraping. I’ve wrapped this procedure in a function to make it simple to perform this test:

import time

def handle_rate_limiting():
    while True:
        status = twitter.get_application_rate_limit_status(resources = ['statuses'])
        home_status = status['resources']['statuses']['/statuses/home_timeline']        
        if home_status['remaining'] == 0:                
            wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
            time.sleep(wait)
        else:
            return

We’re only testing one of the API endpoints we’re hitting, though: we also hit the application/rate_limit_status endpoint itself, so to be safe we should include that in our test as well, although realistically there’s no reason to believe we’ll ever hit the limit for that endpoint.

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']        
            home_status = status['resources']['statuses']['/statuses/home_timeline']        
            if home_status['remaining'] == 0:                
                wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
                time.sleep(wait)
            else:
                return
        else:
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

Now that we have this, we can insert it into the while loop that performs the home timeline scraping function. While we’re at it, we’ll throw in some exception handling just in case this rate limiting function doesn’t work the way it’s supposed to.

while True:
    try:
        newest = None
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)        
        if home:
            while home:
                store_tweets(home)
                
                if newest is None:
                    newest = True
                    latest = home[0]['id']
                    
                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:            
            time.sleep(60)
    
    except TwythonRateLimitError, e:
        print "[Exception Raised] Rate limit exceeded"
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # adding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)

Storing Tweets in Mongo

First, we need to spin up the database/collection we defined in the config file.

from pymongo import Connection

DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

I’ve been calling a placeholder function “store_tweets()” above; let’s actually define it:

def store_tweets(tweets_to_save, collection=tweets):
    collection.insert(tweets_to_save)

Told you using mongo was easy! In fact, we could actually just replace every single call to “store_tweets(home)” with “tweets.insert(home)”. It’s really that simple to use mongo.

The reason I wrapped this in a separate function is because I actually want to process the tweets I’m downloading a little bit for my own purposes. A component of my project is going to involve calculating some simple statistics on tweets based on when they were authored, so before storing them I’m going to convert the time stamp on each tweet to a python datetime object. Mongo plays miraculously well with python, so we can actually store that datetime object without serializing it.

import datetime

def store_tweets(tweets_to_save, collection=tweets):
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

Picking up where we left off

The first time we run this script, it will scrape from the newest tweet back as far in our timeline as it can (approximately 800 tweets back). Then it will monitor new tweets and drop them in the database. But this behavior is completely contingent on the persistence of the “latest” variable. If the script dies for any reason, we’re in trouble: restarting the script will do a complete scrape on our timeline from scratch, going back as far as it can through historical tweets again. To manage this, we can query the “latest” variable from the database instead of just blindly setting it to “None” when we call the script:

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

And we’re done! The finished script looks like this:

import ConfigParser
import datetime
from pymongo import Connection
import time
from twython import Twython, TwythonRateLimitError

config = ConfigParser.ConfigParser()
config.read('scraper.cfg')

# spin up twitter api
APP_KEY    = config.get('credentials','app_key')
APP_SECRET = config.get('credentials','app_secret')
OAUTH_TOKEN        = config.get('credentials','oath_token')
OAUTH_TOKEN_SECRET = config.get('credentials','oath_token_secret')

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
twitter.verify_credentials()

# spin up database
DBNAME = config.get('database', 'name')
COLLECTION = config.get('database', 'collection')
conn = Connection()
db = conn[DBNAME]
tweets = db[COLLECTION]

def store_tweets(tweets_to_save, collection=tweets):
    """
    Simple wrapper to facilitate persisting tweets. Right now, the only
    pre-processing accomplished is coercing 'created_at' attribute to datetime.
    """
    for tw in tweets_to_save:
        tw['created_at'] = datetime.datetime.strptime(tw['created_at'], '%a %b %d %H:%M:%S +0000 %Y')
    collection.insert(tweets_to_save)

def handle_rate_limiting():
    app_status = {'remaining':1} # prepopulating this to make the first 'if' check fail
    while True:
        wait = 0
        if app_status['remaining'] > 0:
            status = twitter.get_application_rate_limit_status(resources = ['statuses', 'application'])
            app_status = status['resources']['application']['/application/rate_limit_status']
            home_status = status['resources']['statuses']['/statuses/home_timeline']
            if home_status['remaining'] == 0:
                wait = max(home_status['reset'] - time.time(), 0) + 1 # adding 1 second pad
                time.sleep(wait)
            else:
                return
        else:
            wait = max(app_status['reset'] - time.time(), 0) + 10
            time.sleep(wait)

latest = None   # most recent id scraped
try:
    last_tweet = tweets.find(limit=1, sort=[('id',-1)])[0] # sort: 1 = ascending, -1 = descending
    if last_tweet:
        latest = last_tweet['id']
except:
    print "Error retrieving tweets. Database probably needs to be populated before it can be queried."

no_tweets_sleep = 1
while True:
    try:
        newest = None # this is just a flag to let us know if we should update the "latest" value
        params = {'count':200, 'contributor_details':True, 'since_id':latest}
        handle_rate_limiting()
        home = twitter.get_home_timeline(**params)
        if home:
            while home:
                store_tweets(home)

                # Only update "latest" if we're inside the first pass through the inner while loop
                if newest is None:
                    newest = True
                    latest = home[0]['id']

                params['max_id'] = home[-1]['id'] - 1
                handle_rate_limiting()
                home = twitter.get_home_timeline(**params)
        else:
            time.sleep(60*no_tweets_sleep)

    except TwythonRateLimitError, e:
        reset = int(twitter.get_lastfunction_header('x-rate-limit-reset'))
        wait = max(reset - time.time(), 0) + 10 # adding 10 second pad
        time.sleep(wait)
    except Exception, e:
        print e
        print "Non rate-limit exception encountered. Sleeping for 15 min before retrying"
        time.sleep(60*15)
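
Once the scraper has been running for a while, a quick sanity check from a python shell might look like this (a sketch that reuses the “tweets” collection object from the script; user.screen_name and created_at are standard fields on a tweet):

print tweets.count()                                 # total tweets stored so far
last_tw = tweets.find(limit=1, sort=[('id', -1)])[0] # same query the script uses to resume
print last_tw['user']['screen_name'], last_tw['created_at']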

Weekend project: Reddit Comment Scraper in Python

(The methodology described below works, but is not as easy as the preferred alternative method using the praw library. If you’re here because you want to scrape reddit, please reference this post, which describes a simpler, faster method)

This weekend, I built a webscraper that downloads comments from reddit. It was a good exercise and I’m quite pleased with my work, but I hesitate to refer to my tool as a “scraper” because reddit makes it easy to access JSON output which is very, very easy to work with, and when I think of webscraping I think of more involved screen capture and parsing such as with a package like Beautiful Soup. Maybe it would be more appropriate to call my tool a “miner,” but it doesn’t do any analysis, just grabs data. Anyway, enough semantics.

Here’s the code on GitHub. Please bear in mind, this is still in development, but it works well enough for now. Although I haven’t added a formal license file in the repo, all project code is licensed CC BY-SA.

Generalization of the steps in my development process:

  1. Determine how to open and download web pages
  2. Determine how to get JSON out of reddit
  3. Determine how to generate appropriate URLs for pagination
  4. Determine how to extract pertinent data from the downloaded pages
  5. Determine appropriate halt points for the download process
  6. Add command line parameters
  7. Implement a data storage solution

1. Downloading Webpages

This was the easy part. There are a lot of fairly generic solutions to this in python. mechanize is a nice library for web browsing with python, but it’s a little more robust than what I need. urllib2 is closer to the metal, but still very simple and lightweight. urllib is even simpler, but the reddit API documentation specifies that they want to see a header. I’m not really using the API, but I decided to include a header in my request anyway to play nice with those reddit kids.

from urllib2 import Request, urlopen

_URL = 'http://www.reddit.com/'
_headers = {'User-agent':'myscript'}

request = Request(_URL, headers=_headers)
response = urlopen(request)
data = response.read()

Honesty time: I don’t know a ton about the HTTP protocol. I know there are POST requests and GET requests. A GET is a simple “Hey, can I see that webpage?” whereas a POST is more like “I want to interact with that webpage” (like completing a form). The above represents a GET request, which is all I needed for my tool; if you want to use urllib2 to make a POST, just add a ‘data’ parameter in the call to urlopen or when you’re generating the ‘Request’ object. I wish I understood better what was going on here, but my understanding is that urlopen() creates a connection, and the read() method actually pulls data across the connection. urlopen() will throw an error if there’s an issue with the request or forming the connection, but the read() method is where I think you’re most likely to see an error.
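
For completeness, here’s roughly what a POST looks like with urllib2. The URL and form fields below are made up, since my tool never actually needs to POST:

from urllib import urlencode
from urllib2 import Request, urlopen

post_data = urlencode({'some_field': 'some_value'})  # hypothetical form fields
request = Request('http://www.example.com/form', data=post_data,
                  headers={'User-agent': 'myscript'})
response = urlopen(request)  # supplying 'data' is what makes this a POST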

Reddit doesn’t like people to make requests too frequently, so when I wrapped the above code in a function I included a default 2 second delay before issuing the GET request. This request is wrapped in a ‘try’ block that increases the delay if an exception is raised. I think I’ve seen instances where the exception wasn’t seen until hitting my code’s JSON parser, but you get the idea. I can modify later as needed. Here’s my function:

from time import sleep  # needed for the throttling delay below

def get_comments(URL, head, delay=2):
    '''Pretty generic call to urllib2.'''
    sleep(delay) # ensure we don't GET too frequently or the API will block us
    request = Request(URL, headers=head)
    try:
        response = urlopen(request)
        data = response.read()
    except:
        sleep(delay+5)
        response = urlopen(request)
        data = response.read()
    return data

2. JSON

Doug Hellman has a great tutorial on using JSON with python, so I won’t go into too much depth here. I will say this much though: reddit is very developer friendly and makes their website very easy to scrape. By adding “.json” or “.xml” to the end of any reddit URL (at least to the best of my knowledge), you get structured data in your format of choice. I’ve had some experience with XML and I’m not a huge fan. This project was my first time doing anything with JSON and I knew nothing about JSON going into this project, but even so I found JSON to be extremely intuitive. Basically each reddit “thing” (object, like a comment or link) gets converted into what are effectively nested python dictionaries. Awesome. Here’s a sample of the JSON we get from the reddit comments page. This contains one single comment, but the output is from a set of 25 comments, hence the “before” and “after” tags.

{"kind": "Listing"
,"data":
{"modhash": "", "children":
[
{"kind": "t1", "data":
{"subreddit_id": "t5_2qh1e"
, "link_title": "This is How Hot it is in Iraq"
, "banned_by": null
, "link_id": "t3_ymy8w"
, "likes": null
, "replies": null
, "id": "c5x77kx"
, "author": "shaggorama"
, "parent_id": "t1_c5x54ci"
, "approved_by": null
, "body": "I don't know where this soldier is, but t[he high in Baghdad today was 106^oF](http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008)\n\nEDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually [105^oF](http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA)"
, "edited": 1345668492.0
, "author_flair_css_class": null
, "downs": 0
, "body_html": "<div class=\"md\"><p>I don't know where this soldier is, but t<a href=\"http://www.weather.com/weather/today/Baghdad+Iraq+IZXX0008\">he high in Baghdad today was 106<sup>oF</sup></a></p>\n\n<p>EDIT: My bad, this video was posted 9/8/2009. The high in Baghdad on that date was actually <a href=\"http://www.wunderground.com/history/airport/KQTZ/2009/9/8/DailyHistory.html?req_city=NA&req_state=NA&req_statename=NA\">105<sup>oF</sup></a></p>\n</div>"
, "subreddit": "videos"
, "name": "t1_c5x77kx"
, "created": 1345667922.0
, "author_flair_text": null
, "created_utc": 1345667922.0
, "num_reports": null, "ups": 3}
}]
, "after": "t1_c5wcp43"
, "before": "t1_c5x77kx"
}
}

3. URL Hacking

Originally I had expected this foray to familiarize me with the reddit API. After digging through it a bit though I got the impression that the API is really for app developers and not webscrapers. I just assumed there would be some API call I could use to get comments for a user, but my research into the reddit API documentation revealed no such functionality. This meant I would have to pull the data down directly from the website. The first 25 comments are easy enough to grab:

http://www.reddit.com/user/UserName/comments/

This is converted to easy-to-navigate JSON by adding “.json” to the end of the URL as such:

http://www.reddit.com/user/UserName/comments/.json

The URL for the next 25 comments is a little more complicated:

http://www.reddit.com/user/UserName/comments/?count=25&after=XX_XXXXXXX

The number after “count” is the number of comments we’ve already seen, so we’ll have to keep track. The “after” parameter needs the reddit “thing id” for the last comment we saw. I had expected to just be able to tack “.json” onto the above URL, but when I did that, reddit interpreted it as part of the thing id and just sent me back to the top 25 comments in HTML since it didn’t understand my request. After experimenting for a few minutes, I figured it out; here’s what the JSON request needs to look like:

http://www.reddit.com/user/UserName/comments/.json?count=25&after=XX_XXXXXXX
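
To keep that logic in one place, it helps to wrap the URL construction in a little helper. This is just a sketch of the pattern (the function name is mine, not something from the repo):

def build_comments_url(username, count=0, after=None):
    """Build the .json URL for a user's comment listing, optionally paginated."""
    url = 'http://www.reddit.com/user/%s/comments/.json' % username
    if after is not None:
        url += '?count=%d&after=%s' % (count, after)
    return url

# first page, then the page after the comment whose thing id is 't1_c5x77kx'
first_page = build_comments_url('UserName')
next_page = build_comments_url('UserName', count=25, after='t1_c5x77kx')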

4. Parsing JSON

Again, I wish I understood what I was doing here better and direct you to Doug Hellman’s JSON tutorial. I may not understand exactly what I’m doing, but I’ve got my functional code so I’m happy.

The usage of the python JSON library usually seems to start with a call to either json.loads() or json.dumps() to encode/decode JSON. I don’t remember where I found this, but apparently the solution I needed was actually the decode() method of the JSONDecoder() class. Apparently you don’t even need to generate an instance of this class to use it, so here’s what I ended up with:

import json

decoded = json.JSONDecoder().decode(json_data)
comments = [x['data'] for x in decoded['data']['children']]

For convenience, I wrapped my parser in a function that returns the ‘comments’ list as well as a list of comment ids.
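
That wrapper isn’t shown here, but a minimal version of what I mean might look like this (a sketch, not necessarily how the repo implements it):

import json

def parse_comments(json_data):
    """Decode one page of reddit JSON, returning (comments, comment_ids)."""
    decoded = json.JSONDecoder().decode(json_data)
    comments = [x['data'] for x in decoded['data']['children']]
    ids = [c['id'] for c in comments]
    return comments, ids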

5. Halting the Download

At this point I basically had a working prototype. I built my tool to output some information to the screen as the download progressed, and I noticed that after 999 comments, reddit stopped giving me data. I had built my tool to halt when it started seeing comments it had already downloaded (since that’s what the HTML website does) but apparently the JSON output just gives you null data. I decided to also add an optional download limit to my scraper tool, so I wrote in three conditions to stop downloading: receipt of data we had already downloaded, receipt of null data, or reaching a defined limit on downloads.

The above observation might seem sufficiently trivial that I shouldn’t have mentioned it, but I think it’s important to think about these kinds of stop points and to test and keep an eye on them in live code when you’re scraping. Otherwise you might just eat up unnecessary bandwidth and spam your target website with requests.
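
Putting the pieces together, a download loop with those three halting conditions looks roughly like this. It’s a sketch that leans on the get_comments(), build_comments_url() and parse_comments() helpers above and assumes the _headers dict from earlier; the script in the repo differs in the details:

def scrape(username, limit=0, wait=2):
    """Download a user's comments until reddit runs dry, repeats itself, or we hit 'limit'."""
    all_comments = []
    seen_ids = set()
    after = None
    while True:
        url = build_comments_url(username, count=len(all_comments), after=after)
        comments, ids = parse_comments(get_comments(url, _headers, delay=wait))
        if not comments:                          # null data: reddit has nothing more to give
            break
        if any(i in seen_ids for i in ids):       # we've started seeing comments we already have
            break
        all_comments.extend(comments)
        seen_ids.update(ids)
        if limit and len(all_comments) >= limit:  # optional user-imposed cap
            break
        after = comments[-1]['name']              # thing id of the last comment on this page
    return all_comments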

6. Command Line Parameters

For this I used the argparse library. Unlike urllib2 and json, this bit wasn’t a learning experience for me. I’ve used this library in the past and frankly, I fucking love it. It’s easy to use and generates a very nice ‘help’ page… The scripts I’ve used it in before were all smaller and for fairly narrow use at work. For this project I used the “if __name__ == ‘__main__’:” test to simplify debugging, and I originally dropped my command line argument handling right under my library imports. The result was that my code would error out whenever it ran without the parameters argparse saw as required (for instance, when importing the module to debug it). Lesson learned: command line argument handling goes inside the “main” block, not up with the function definitions. Here’s the code; it’s pretty self explanatory:

if __name__ == '__main__':
    ### Commandline argument handling ###

    parser = argparse.ArgumentParser(description="Scrapes comments for a reddit user. Currently limited to most recent 999 comments (limit imposed by reddit).")
    parser.add_argument('-u','--user', type=str, help="Reddit username to grab comments from.", required=True)
    parser.add_argument('-l','--limit', type=int, help="Maximum number of comments to download.", default=0)
    parser.add_argument('-d','--dbname', type=str, help="Database name for storage.", default='RedditComments.DB')
    parser.add_argument('-w','--wait', type=int, help="Wait time between GET requests. Reddit documentation requests a limit of 1 request every 2 seconds, not to exceed 30 per minute.", default=2)

    args = parser.parse_args()
    _user   = args.user
    _limit  = args.limit
    _dbname = args.dbname
    _wait   = args.wait

    comments = scrape(_user, _limit, _wait)
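
Assuming the script is saved as something like redditscraper.py (a placeholder name, not necessarily what it’s called in the repo), running it from the command line looks like:

python redditscraper.py --user UserName --limit 500 --wait 2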

7. Saving The Results

I had built a successful comment scraper! The one problem was my work was evaporating into the ether. The comments would download, and be lost as soon as the program exited. I needed to save them somehow. One very simple solution would be to pickle the comments. This would make it easy to get the data back into python, but I won’t be doing my analyses in the python environment. I know my way around excel, R and octave, but I haven’t gotten around to learning numpy/scipy/scikit-learn yet. I may make that part of this project in which case I could have my tool automatically spit out some graphs after the download, which would be pretty neat.

My very, very strong preference over pickling would be to store the data in a database. This is really the obvious solution, but unfortunately, although I have been neck deep in SQL for the past two years for work, I don’t really have any experience working with databases from inside python. I’ve used python to play with CSVs, so for the time being I’m dumping the data into a CSV using the csv DictWriter() class. This unfortunately wasn’t as straightforward as it should have been because csv writes to ASCII and my decoded JSON was in unicode. I won’t break it down for you, but here’s how I dealt with the unicode (I totally found this solution somewhere and modified it to suit my needs, but I can’t remember where I found it. Probably StackOverflow):

writer.writerow(dict((k, v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in comment.iteritems()))
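
For context, that writerow call sits inside a fairly ordinary DictWriter setup, roughly like this (a sketch; the filename and the choice of fieldnames are illustrative rather than lifted from the repo):

import csv

def save_to_csv(comments, fname='comments.csv'):
    """Dump a list of comment dicts to CSV, utf-8 encoding any unicode values."""
    fieldnames = sorted(comments[0].keys())  # illustrative; a fixed column list would be safer
    with open(fname, 'wb') as f:             # 'wb' because this is python 2's csv module
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction='ignore')
        writer.writeheader()
        for comment in comments:
            writer.writerow(dict((k, v.encode('utf-8') if isinstance(v, unicode) else v)
                                 for k, v in comment.iteritems()))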

I’m not a big fan of this solution for a myriad of reasons. My next step will be implementing database storage. Right now I’m using this as an excuse to learn the sqlalchemy library, but considering how familiar SQL is to me, I may go lower level and just do it with the much lighter-weight, albeit RDBMS-implementation-specific, sqlite3.