Who is /r/WashingtonDC – Part 1: Daily activity usage

(The code associated with this project can be found at https://github.com/dmarx/Scrape-Subreddit-Users)

A few weeks ago I started playing with the Reddit API. My initial goal was to characterize my own usage and, well… I’d rather not publish the evidence of my internet addiction. After building my comment scraper, I learned that the Python library PRAW (Python Reddit API Wrapper), which I had previously thought wasn’t useful for my project, would actually have made the scrape trivial. Like, 3 lines of code trivial. It’s a great library and makes scraping Reddit a breeze. Really incredible tool.

My next goal was to use PRAW to try to characterize redditors. How do people use reddit differently? Are there different user types we can meaningfully classify? I decided a good starting point would be to analyze user activity and see if I could characterize users based on their frequency and preferred posting hours. To simplify my project, I wanted to build a dataset of users I could say with high confidence were all in the same time zone, and preferably in the same geographic community. The obvious solution: scrape a geographic subreddit!

Thus was born the “Who is /r/WashingtonDC” project (let’s call the subreddit r/DC for short). I chose DC because I live here, so characterizing the community is of personal interest to me. I plan to scrape a few other US cities, which I think will produce interesting results by contrasting one location’s “profile” with another’s.

Methodology

At the time I started my project, the subreddit had 11,332 subscribers. Of course, I had no idea who those subscribers were, even though they were ultimately the members of the dataset I wanted to build. To resolve this, I scraped the last 1000 posts from the subreddit and mined usernames from the comments of those posts. I probably grabbed a few users who were commenting in r/DC because they planned on visiting, but I’m not too concerned, because ultimately I was able to grab 2606 unique usernames. I don’t have my code handy, but this was so easy I can fake it for you:

import praw

target_subreddit = 'washingtonDC'

# Reddit asks for a descriptive, non-empty user agent string.
r = praw.Reddit(user_agent='r/WashingtonDC activity scraper')
sub = r.get_subreddit(target_subreddit)
post_generator = sub.get_top(limit=None)

users = set()
for post in post_generator:
    comments = post.all_comments_flat()
    for comment in comments:
        # If the comment or the user account has been deleted, the comment
        # has no usable 'author' attribute, so we need to be ready for that.
        try:
            users.add(comment.author.name)
        except AttributeError:
            pass

Bam. That’s it! Way simple and straightforward, definitely more so than what I built for that weekend project. The Reddit API documentation requests that you don’t hit the site more frequently than once every two seconds; you don’t see anything in my code to handle this because PRAW takes care of it with an elegant use of Python decorators.
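To give a sense of the pattern (this is just an illustration of the idea, not PRAW’s actual implementation), a rate-limiting decorator can look something like this:

import functools
import time

def rate_limited(min_interval):
    # Enforce a minimum delay between calls to the wrapped function.
    def decorator(func):
        last_call = [0.0]
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = min_interval - (time.time() - last_call[0])
            if wait > 0:
                time.sleep(wait)
            last_call[0] = time.time()
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rate_limited(2.0)
def make_request(url):
    pass  # imagine an actual API call here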

Scraping 1000 posts for comments took about 33 minutes, which is exactly what we should expect: 2 seconds per post * 1000 posts = 2000 seconds, and 2000/60 = 33.33 minutes. Scraping the individual users’ comment histories will necessarily take longer. If we assume the worst-case scenario that every user has a full comment history (i.e. 1000 comments available through the API), it should take about two months to scrape everything. Yikes! Luckily for us, internet activity of this kind is subject to a power-law distribution that generally follows an 80-20 or 90-10 rule. The average r/DC user has 394 comments in their comment history, which brings us down to about a week’s worth of scraping. In reality, it took me closer to… I wanna say 5 days of scraping, a lot of which was time wasted by not properly handling errors from the API (for instance, if a username I scraped in the first phase was deleted by the time I got around to grabbing their comments).

Again, scraping user comment histories is trivially easy with PRAW. The following is rough code that effectively replaces the entire project I posted earlier. You should store the usernames from phase 1 in a database and similarly save the comments as you get them. That way you can easily pick up where you left off if something breaks without losing all of your work. The following code doesn’t include persisting the data, but you’ll get the gist of the process.

comments = []
for u in users:
    redditor = r.get_redditor(u)
    # limit=None pages through everything the API exposes for this user
    # (up to roughly 1000 comments).
    for comment in redditor.get_comments(limit=None):
        comments.append(comment)

So easy. Love it. Each “comment” is a praw Comment object whose attributes we would otherwise have had to parse out of a JSON response. Again, I recommend storing these as you receive them, maybe committing after scraping all of a user’s comments or at regular time intervals.
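For what it’s worth, here’s a minimal sketch of the kind of persistence I mean, using sqlite3 (the database file, table name, and columns are just my own choices for illustration):

import sqlite3

conn = sqlite3.connect('rdc_comments.db')
conn.execute("""CREATE TABLE IF NOT EXISTS comments
                (id TEXT PRIMARY KEY, author TEXT, created_utc REAL, body TEXT)""")

for u in users:
    redditor = r.get_redditor(u)
    for comment in redditor.get_comments(limit=None):
        conn.execute("INSERT OR IGNORE INTO comments VALUES (?, ?, ?, ?)",
                     (comment.id, u, comment.created_utc, comment.body))
    # Commit once per user: if something breaks, you lose at most one
    # user's worth of work and can pick up where you left off.
    conn.commit()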

Results

In total, I scraped 1,022,887 comments across 2591 users associated with the r/DC subreddit. The next step will be scraping users’ posts, which I strongly suspect will show behavior similar to their commenting activity. What does their commenting activity look like? I thought you’d never ask!

Each data point is the sum of all activity during the time period from X to X:59, so really this graph is a histogram of commenting activity binned by hour/day.
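If you’re curious, the binning itself is just a big counter over (day-of-week, hour) pairs. A rough sketch, assuming the list of scraped Comment objects from above (timestamps come back in UTC, so I shift to Eastern time first, ignoring DST for simplicity):

from collections import Counter
from datetime import datetime, timedelta

EASTERN_OFFSET = timedelta(hours=-5)  # crude UTC -> Eastern shift, ignoring DST

bins = Counter()
for comment in comments:
    local = datetime.utcfromtimestamp(comment.created_utc) + EASTERN_OFFSET
    bins[(local.weekday(), local.hour)] += 1  # weekday: 0=Monday .. 6=Sunday

# e.g. total comments posted on Mondays between 13:00 and 13:59:
print(bins[(0, 13)])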

This graph tells an interesting story, most of which is pretty expected, but I still think it’s very interesting. The vast bulk of activity occurs between 9am and 4pm, Monday to Friday. This suggests to me that either the community is dominated by college kids who check reddit in between classes (or during class?) or the vast bulk of the community is browsing during work (more likely). Considering the rapid decrease in activity from 3-5pm, maybe it’s neither of these: maybe it’s actually high schoolers who dominate the subreddit, which would explain the rapid drop-off from 3 to 4. The local minimum at 6pm is clearly from people commuting.

Let’s assume most redditors are working adults. This means that DC redditors basically browse reddit all day at work, starting as soon as they get there (except on Mondays), and then make a few comments after dinner. Because this graph represents the sum activity for all users, the dip on Mondays is probably due to Monday holidays, but I haven’t made any attempt to account for that, so let’s just say we’re sluggish on Mondays in DC.

Redditors like to browse after they eat. The global maximum for all weekdays (except Friday) is at 1-1:59pm. I personally have a late lunch, but from looking around my office I suspect most people actually eat around 12-12:30ish. This is corroborated by the second local maximum at 9-9:59pm, which would be just after dinner. Of course, this is sort of an unfair generalization, because we can also expect that a significant contribution to this second local maximum is from people who are just lazily browsing reddit while watching TV before bed.

But let’s run with the assumption that spikes follow meals for fun. Keeping this in mind, it appears likely that people in DC like Sunday brunch. Ceteris paribus, we’d expect people to have lunch at the same time on Saturday and Sunday, but there’s a very clear spike in the Sunday graph at 11-11:59am which could be people hitting Reddit after their 10am bottomless mimosas. Saturday’s bump is at 12-12:59, but this bump is in line with a relative plateau of activity that lasts basically all day. Of course, if people are out at brunch, then we should expect them to continue to be with their friends after brunch, but let’s not let facts interfere with my more fun brunch narrative here.

Speaking of Saturday, about as many people stay in Friday night as stay in Saturday night. This isn’t surprising, I just think it’s funny that you can see it in the graph. Monday through Thursday, the dip at 18-18:59 is followed by a steady increase in activity until the after-dinner bump. Friday, on the other hand, just keeps going down until 19-19:59, where it picks back up again. What I love about this is the way that the Friday night activity almost perfectly matches the Saturday night activity. This could indicate that people in DC generally go out either Friday or Saturday but not both, or perhaps that a large contingent of DC redditors just don’t really go out on the weekend and prefer to spend their Friday/Saturday nights at home on reddit.

We don’t go out Sunday night. Or at least, we go out Sunday night about as much as we go out Thursday night. This is unsurprising, but I point it out because I like how the Sunday night activity matches the weeknight activity. This goes nicely with how Friday and Saturday nights aligned. So basically, Sunday is a “school night.”

Finally, from 4am-6am, DC sleeps. In fact, activity from 12am-4:59am is very constant without regard to day of the week, so unfortunately it’s clear: we just don’t party that late in DC. This is generally explained as a consequence of the metro schedule, but if that were the case, we would expect to see some difference between the weekend and weeknight activity after midnight: Sun-Thu the metro stops at midnight, and Fri-Sat it stops at 3am. But it’s clear that regardless of day of the week, reddit activity from midnight to 5am follows the same trend.

After I’m satisfied that I’ve gotten all I can from the DC data, I’ll probably scrape NYC next, and I strongly suspect their late-night activity will look different, especially on weekends.

Next Steps

I hope to further characterize the subreddit by looking at what other subreddits the users of r/DC share in common. I’m still determining how best to do this. First I tried counting the times one member of r/DC happened to talk to another member, and I counted these for each subreddit. I think this was too narrow, so I then counted the number of times users appeared in the same link as each other. The question is, do I count links? Do I count unique r/DC users per link? The really hard part that I’m trying to figure out is how to normalize this data. Considering the size of these different subreddits, I feel like I need to be scaling the “counts” I come up with somehow. I’m thinking something similar to TF-IDF that will cause the more esoteric subreddits (i.e. more characteristic of a user’s interests) to rise to the top.
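To make that concrete, here’s one rough sketch of the kind of scaling I’m imagining. The inputs (a per-user map of subreddit comment counts and a table of subreddit subscriber totals) are hypothetical, and the formula is just a first guess, not a settled method:

import math
from collections import Counter

def subreddit_scores(user_subreddits, subscriber_counts):
    # user_subreddits: {username: Counter({subreddit: comment_count})}
    # subscriber_counts: {subreddit: total_subscribers}
    # "Document frequency": how many r/DC users are active in each subreddit.
    df = Counter()
    for sub_counts in user_subreddits.values():
        df.update(sub_counts.keys())

    scores = {}
    for sub, n_users in df.items():
        # Downweight enormous default subreddits and boost esoteric ones,
        # roughly the way IDF downweights common terms.
        scores[sub] = n_users / math.log(2 + subscriber_counts.get(sub, 0))
    return scores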

I’d also like to dig through the comments I’ve collected. I bet I could do some named-entity recognition to determine some popular locations in the city. What would be really interesting would be if I could pull out autobiographical statements of fact and aggregate those somehow, but I’m not sure that’s within the scope of the NLTK default tools, and it’s definitely over my head to develop in my free time.
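As a starting point, NLTK’s out-of-the-box chunker can already pull location-ish entities out of a comment body. A rough sketch (this assumes NLTK 3 with the punkt, tagger, chunker, and words data packages downloaded, and the results are only as good as the default models):

import nltk

def extract_locations(text):
    # Return GPE/LOCATION named entities found in a comment body.
    locations = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for chunk in nltk.ne_chunk(tagged):
            if hasattr(chunk, 'label') and chunk.label() in ('GPE', 'LOCATION'):
                locations.append(' '.join(word for word, tag in chunk))
    return locations

extract_locations("Anyone know a good happy hour near Dupont Circle?")
# might return something like ['Dupont Circle']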

Stay tuned for more analysis of this subreddit and other subreddits coming soon (NYC, I’m coming for you!)
