Topological Anomaly Detection



Recently I had the pleasure of attending a presentation by Dr. Bill Basener, one of the authors of this paper which describes an outlier analysis technique called Topological Anomaly Detection (TAD). Despite the intimidating name, the algorithm is extremely simple, both to understand and to implement.

The technique is essentially a density based outlier detection algorithm that, instead of calculating local densities, constructs a graph of the data using nearest-neighbors. The algorithm is different from other kNN outlier detection algorithms in that instead of setting ‘k’ as a parameter, you instead set a maximal inter-observation distance (called the graph “resolution” by Gartley and Basener). If the distance between two points is less than the graph resolution, add an edge between those two observations to the graph. Once the full graph is constructed, determine which connected components comprise the “background” of the data by setting some threshold percentage of observations ‘p’: any components with fewer than ‘p’ observations is considered an anomalous component, and all the observations (nodes) in this component are outliers.

As you probably gathered from my description, it’s a pretty simple algorithm to implement. Let’s walk through it step by step.

First, we need to construct a distance matrix. SciPy has a great utility for this called pdist:

from scipy.spatial.distance import pdist

distance_matrix = pdist(X)

One thing that’s nice about this is that it will work for either numpy arrays or pandas dataframes, as long as the observations are on the rows and the features are on the columns. The output of pdist is, by default, a “condensed” distance matrix, which is just the upper triangular matrix collapsed into a flat vector. This is nice because it takes up (just) less than half the memory, but it’s slightly more cumbersome to work with than a straight up distance matrix.

We’re ultimately going to treat this distance matrix as an adjacency matrix, but before we do, we need to replace all entries in the distance matrix (ok, distance vector) that are larger than the graph resolution “r” with zeroes to remove those edges from the graph. I think it’s a little silly to expect the user to come up with a discrete distance to use for the resolution, so we’re going to give the user the option to provide a quantile instead as the ‘rp’ parameter below.

It’s tempting to just modify the distance matrix in place, but we’re going to need it later to score outliers, so we’re going to work with a copy instead.

def trim_adjacency_matrix(adj, r=None, rq=.1):
    if r is None:
        # This is really just a lazy quantile function.
        q = int(np.floor(len(adj)*rq))
        print "q:", q
        r = np.sort(adj)[q]
    print "r:", r
    adj2 = adj.copy()
    adj2[adj>r] = 0 
    return adj2, r

Tada! We’ve got our graph. If you don’t work with graph data a lot you might not see it, but an adjacency matrix is just a way of encoding a graph. But, the adjacency matrix isn’t enough: we want to mine the connected components from the graph. Thankfully, this is trivial with networkx.

Now, networkx expects a square matrix if we’re going to build a graph using an adjacency matrix, but we have a vector. We could convert this to a full matrix by calling scipy.spatial.distance.squareform, but this will take up double the space in memory and it’s possible that a user is working with a large enough dataset that this will be a problem, so let’s work around the condensed distance matrix.

There’s probably a better way to go about this–like a formula that converts (i, j) pairs into the appropriate indexes into this condensed matrix (vector)–but I was lazy when I wrote this up so instead of trying to find the right formula, I instead took some inspiration from a stackoverflow post discussing how to work with condensed distance matrices which led me to iterate through each vector index paired with its corresponding (i, j) index in the square matrix using enumerate(itertools.combinations). This essentially gives me an edgelist, permitting to build up the graph one edge at a time:

from itertools import combinations
import networkx as nx
def construct_graph(edges, n):
    g = nx.Graph()
    for z, ij in enumerate(combinations(range(n),2)):
        d = edges[z]
        if d:
            i,j = ij
            g.add_edge(i,j, weight=d)
    return g

Having constructed this graph structure, we can extract the connected components and score them as “background” or “anomaly” by counting the number of nodes in each component and comparing against the ‘p’ threshold. As a convenience for later, I’m also going to add “class” and “color” attributes to all the nodes in the graph.

def flag_anomalies(g, min_pts_bgnd, node_colors={'anomalies':'r', 'background':'b'}):
    res = {'anomalies':[],'background':[]}
    for c in nx.connected_components(g):
        if len(c) < min_pts_bgnd:
    for type, array in res.iteritems():
        for node_id in array:
            g.node[node_id]['class'] = type
            g.node[node_id]['color'] = node_colors[type]
    return res, g

Last but not least, let’s score those anomalies. For convenience, I’m going to wrap these scores in a pandas.Series, but this is the only place in our code we’re using pandas so it would actually speed us up a bit not to do it this way (because then we can completely eliminate the pandas import):

import pandas as pd
def calculate_anomaly_scores(classed, adj, n):
    scores = {}
    for a in classed['anomalies']:
        scores[a] = 0
        for z, ij in enumerate(combinations(range(n),2)):
            i,j = ij
            if (i == a or j == a) and (
                i in classed['background'] or
                j in classed['background']):
                d = adj[z]
                if scores[a]:
                    scores[a] = np.min([scores[a], d])
                    scores[a] = d
    return pd.Series(scores)

Great! Now let’s put all the pieces together to construct out outlier classification tool.

def tad_classify(X, method='euclidean', r=None, rq=.1, p=.1, distances=None):
    if not distances:
        adj = pdist(X, method)
    edges, r = trim_adjacency_matrix(adj, r, rq)
    n = X.shape[0]
    g = construct_graph(edges, n)
    classed, g =  flag_anomalies(g, n*p)
    scores = calculate_anomaly_scores(classed, adj, n)
    return {'classed':classed, 'g':g, 'scores':scores, 'r':r, 'min_pts_bgnd':n*p, 'distances':adj}

Now that we’ve built this thing, let’s try it out on the iris data. I’m going to visualize the result using a pairs plot (a “scatter_matrix” in pandas) which will allow us to see how the outliers relate to the rest of the data across all pairs of dimensions along which we can slice the data.

import matplotlib.pyplot as plt
from import scatter_matrix
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(
res = tad_classify(df)

df.anomaly.ix[res['classed']['anomalies']] = 1
scatter_matrix(df.ix[:,:4], c=df.anomaly, s=(25 + 50*df.anomaly), alpha=.8)


The pairs plot it a great way to visualize the data, but if we had more than 4 dimensions this wouldn’t be a viable option. Instead, let’s reduce the dimensionality of the data using PCA just for visualization purposes. While we’re at it, let’s actually visualize how the observations are connected in the graph components to get a better idea of what the algorithm is doing.

from sklearn.decomposition import PCA

g = res['g']
X_pca = PCA().fit_transform(df)
pos = dict((i,(X_pca[i,0], X_pca[i,1])) for i in range(X_pca.shape[0]))
colors = [node[1]['color'] for node in g.nodes(data=True)]
labels = {}
for node in g.nodes():
    if node in res['classed']['anomalies']:
        labels[node] = node
        labels[node] = ''
nx.draw(g, pos=pos, node_color = colors, labels=labels)


If we reduce the graph resolution, we get more observations classed as outliers. Here’s what it looks like with rq=0.05 (the default–above–is 0.10):


And here’s rq=0.03:


In case these images don’t give you the intuition, reducing the graph resolution results in breaking up the graph into more and more components. Changing the ‘p’ parameter has a much less granular effect (as long as most of the observations are in a small number of components): changing ‘p’ will have basically no effect for small increments, but above a threshold we end up classifying large swathes of the graph as outliers.

The images above have slightly different layouts because I ran PCA fresh each time. In retrospect, it would have looked better if I hadn’t done that, but you get the idea. Maybe I’ll fix that later.


Playing With Pandas: DataFrustration

After putting up the post a few weeks ago where I began to analyze the /r/washingtondc community , I received some feedback regarding my methodology. The comment that most resonated with me was “Why are you trying to do your analysis in R if you’ve already got the data in python? Why not just do the analysis and develop your visualizations there?” I had good reasons to use R, mainly: I saw an opportunity to learn plyr, and I don’t really know my way around the numerical programming and plotting tools in python (primarily numpy, scipy, pylab,matplotlib, scikit-learn, and pandas).

As I’ve mentioned previously, I recently started grad school and I’m going to need R specifically in my program, so my need to learn R is much more urgent than my need to develop expertise in python. But frankly, I just enjoy using python more, and I’d rather do everything in one environment if I can so learning the python data analytics libraries is a very attractive prospect to me. Wes McKinney, the author of the pandas library, is publishing a book through O’Reilly called Python for Data Analysis that I’m super interested in. The problem is right before I started my graduate program, I splurged a little and bought some textbooks I had been drooling over but have barely had opportunity to crack into (Drew Conway and John Myles White‘s Machine Learning for Hackers and Christopher M. Bishop’s Pattern Recognition and Machine Learning) so I’m apprehensive to buy more books until I can bite into the books I’ve already bought (while I’m dreaming of a world where I have time, I also really want to pick up Nate Silver’s new book The Signal And The Noise).

Instead of buying a new book (I expect I”ll buy it later) for the moment let’s do this the ol’ fashioned way and hit the documentation. The documentation for pandas looks to be pretty thorough: I recommend starting with the Intro to Data Structures.

Now, it might be because I’m coming into this with very little NumPy experience, but right off the bat the DataFrame class, the heart of pandas, seems awkward to me. To illustrate my problem, let me compare how to take a particular slice in R vs. pandas.

What I want to do is isolate those columns of the dataframe whose first row is below some threshold value:

A <- runif(10)
B <- runif(10)
C <- runif(10)
D <- runif(10)
E <- runif(10)

df <- data.frame(A,B,C,D,E)
sliced_df <- df[ , df[1,]<.5 ]

This is pretty straightforward. The first 5 lines I create labeled vectors of 10 random numbers selected from the uniform distribution (in the range (0,1)). Then I form a dataframe from these vectors. Finally, from the dataframe I select all rows, but only those columns from the where the value of the first row is less than 0.5.

> df
            A           B         C           D          E
1  0.45274205 0.543755858 0.2225730 0.643710467 0.44527644
2  0.55692168 0.687034039 0.8480953 0.494917616 0.98080695
3  0.19127556 0.419290813 0.2744206 0.005422064 0.58559636
4  0.58410947 0.094669003 0.3284746 0.891122109 0.05962251
5  0.94561895 0.022608545 0.9431832 0.951050056 0.38312492
6  0.72316858 0.003073411 0.3336150 0.201627465 0.89433597
7  0.02145899 0.685167549 0.5754166 0.371717998 0.06746820
8  0.47334489 0.143967454 0.4463423 0.959687645 0.64947595
9  0.75215197 0.068791088 0.0343898 0.117595073 0.28861395
10 0.78567118 0.398529395 0.6467450 0.883467028 0.86369047

> sliced_df
            A         C          E
1  0.45274205 0.2225730 0.44527644
2  0.55692168 0.8480953 0.98080695
3  0.19127556 0.2744206 0.58559636
4  0.58410947 0.3284746 0.05962251
5  0.94561895 0.9431832 0.38312492
6  0.72316858 0.3336150 0.89433597
7  0.02145899 0.5754166 0.06746820
8  0.47334489 0.4463423 0.64947595
9  0.75215197 0.0343898 0.28861395
10 0.78567118 0.6467450 0.86369047

Here’s how we do the same thing in python, using the pandas.DataFrame datatype (this might not look so bad, but DataFrame.T is the transpose method): <<Edit: Actually, since finishing this post I figured out a better way which I discuss towards the end o fthis post>>

import pandas as pd
import random

def runif(n):
    res = []
    for i in range(n):
    return res

A = runif(10)
B = runif(10)
C = runif(10)
D = runif(10)
E = runif(10)

df = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E})
sliced_df = df.T[ df.T[0]>0.5 ].T

To start with, I have to create a function to build sequences of random numbers. I strongly suspect that some such function exists in NumPy or somewhere, but I don’t know what it is so I made my own (which has the added benefit of making my code align better with the R example). <<EDIT: Since typing this blogpost, I’ve learned that the function I needed was numpy.random.randn.>>

Next, populating the dataframe is already extremely awkward. From the documentation, it seems like the pandas.DataFrame is primarily designed to be generated by dictionaries. What particularly annoys me about this is that, considering the DataFrame is supposed to be a container for the pandas.Series class, I can’t just feed pandas.DataFrame() a list of pandas.Series objects, they have to be in a dictionary. I can create a DataFrame from a single Series like so:

>>> df = pd.DataFrame(A)
>>> df

0  0.446686
1  0.696637
2  0.265383
3  0.165647
4  0.861040
5  0.347737
6  0.280177
7  0.980586
8  0.334544
9  0.645143

Note how the column name was changed to “0”. I would have expected that the variable name becomes the column name like in R, but nope, no such luck. The same thing happens if we feed DataFrame() two series, which makes it hard to distinguish them for slicing.

>>> df = pd.DataFrame(A,B)
>>> df

0.689893  0.446686
0.373250  0.696637
0.619527  0.265383
0.775759  0.165647
0.819883  0.861040
0.099763  0.347737
0.143239  0.280177
0.303460  0.980586
0.975462  0.334544
0.672082  0.645143

>>> df[0]

0.689893    0.446686
0.373250    0.696637
0.619527    0.265383
0.775759    0.165647
0.819883    0.861040
0.099763    0.347737
0.143239    0.280177
0.303460    0.980586
0.975462    0.334544
0.672082    0.645143
Name: 0

If I try to create a dataframe from three or more series, I just get an error.

>>> df = pd.DataFrame(A,B,C)

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\", line 412, in __init__
File "C:\Python27\lib\site-packages\pandas\core\", line 506, in _init_ndarray
block = make_block(values.T, columns, columns)
File "C:\Python27\lib\site-packages\pandas\core\", line 461, in make_block
File "C:\Python27\lib\site-packages\pandas\core\", line 26, in __init__
assert(len(items) == len(values))

Frankly, I’m not sure why two series even worked. Anyway, the DataFrame object wants a dictionary so it knows explicitly how to name each data set. You’d think the variable names would be sufficient, and that’s the intuitive solution we get in R, but no such luck in pandas. For the time being, this is a minor problem but we’ll see what happens when I start working with larger datasets.

What really irks me is the slicing operations are a little funny, and things that should be able to handle booleans can’t. Basically, slicing dataframes is much less intuitive in pandas than in R or octave (matlab) and this concerns me, since this sort of thing is what enables vectorizing operations (i.e. instead of using slower for loops).

The main problem is that the df[x] operator doesn’t take both rows and columns at the same time like in R (e.g. df[rows, columns] ). But weirdly, it can take one or the other separately. In pandas, df[x] takes a couple of different kinds of ‘x’: a column name, a list of column names, indices, or a conditional statement. If I give df[x] a conditional statement or indices (e.g. df[0:3]), they apply to the rows. Did you catch that? If I feed it labels that match column names, I get columns back. If I feed it slice indices I get rows back. What the fuck, pandas. I can’t be the only person who thinks this is spectacularly confusing. A particular slice method (e.g. [ ] ) should give me back either columns or rows, not both.

>>> df['A']

0    0.446686
1    0.696637
2    0.265383
3    0.165647
4    0.861040
5    0.347737
6    0.280177
7    0.980586
8    0.334544
9    0.645143
Name: A

>>> df[['A','B']]

A         B
0  0.446686  0.689893
1  0.696637  0.373250
2  0.265383  0.619527
3  0.165647  0.775759
4  0.861040  0.819883
5  0.347737  0.099763
6  0.280177  0.143239
7  0.980586  0.303460
8  0.334544  0.975462
9  0.645143  0.672082

>>> df[ df['A']>0.5 ]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
4  0.861040  0.819883  0.397820  0.735031  0.110483
7  0.980586  0.303460  0.117398  0.033969  0.731914
9  0.645143  0.672082  0.138268  0.730780  0.969545

>>> df[1:3]

A         B         C         D         E
1  0.696637  0.373250  0.778385  0.918453  0.580366
2  0.265383  0.619527  0.633576  0.553009  0.599043

This means if I want just those columns that meet a particular row condition, I have to first take the transpose of the dataframe to make the rows into columns, apply my conditional statement to the appropriate column (which requires taking the transpose again), and take the transpose one last time to flip the returned data set back into the original orientation. Hence:

sliced_df = df.T[ df.T[0]>0.5 ].T

Having written out this entire rant, I finally figured out the better way of doing this. I’m still annoyed by all the stuff I described above, but here’s how you accomplish this slice without using transpose at all: you use the DataFrame.ix method. This allows for more intuitive R-style indexing of the form df.ix[rows, columns]. It accepts indices and booleans. Do I sound dejected? This took me a while to figure out.

>>> df.ix[:,:]

A         B         C         D         E
0  0.242808  0.829024  0.027734  0.510985  0.430466
1  0.301553  0.834208  0.600806  0.773148  0.119008
2  0.968252  0.098827  0.290203  0.555629  0.652359
3  0.351365  0.391068  0.352370  0.531282  0.478862
4  0.513526  0.138082  0.538826  0.252554  0.486603
5  0.705628  0.362105  0.800225  0.977828  0.454140
6  0.097671  0.613972  0.712334  0.473130  0.886449
7  0.386206  0.520115  0.589156  0.722709  0.293428
8  0.337381  0.102242  0.296870  0.725426  0.475001
9  0.076314  0.894782  0.115159  0.592838  0.402849

>>> df.ix[2:3,'B':'E']

B         C         D         E
2  0.098827  0.290203  0.555629  0.652359
3  0.391068  0.352370  0.531282  0.478862

>>> df.ix[0,:] < 0.5

A     True
B    False
C     True
D    False
E     True
Name: 0

>>> df.ix[ :, df.ix[0,:] < 0.5 ]

A         C         E
0  0.242808  0.027734  0.430466
1  0.301553  0.600806  0.119008
2  0.968252  0.290203  0.652359
3  0.351365  0.352370  0.478862
4  0.513526  0.538826  0.486603
5  0.705628  0.800225  0.454140
6  0.097671  0.712334  0.886449
7  0.386206  0.589156  0.293428
8  0.337381  0.296870  0.475001
9  0.076314  0.115159  0.402849

One thing that is much less confusing about these dataframes is if I want to plot something, all I have to do is:

from pylab import plot, show


And I get a single plot where each line is a separate column of data. I would have had to have used a for loop otherwise, so that at least is nice and intuitive.

These are normal growing pains, as with learning any new tool. If you’ve got a few hours on your hands and don’t feel like mucking through the documentation, here’s a pycon tutorial Wes gave earilier this year that’s pretty interesting (I’ve only made my way through about an hour’s worth myself, but it looks like a thorough introduction). NB: The first 15 minutes or so he’s just setting up IPython, so I recommend you skip ahead a bit.