From an idea to implementation
You can find all of the code for this project on Github.
I had a hypothesis:
Social media signals can be used to give a rough picture of foot traffic around places of business.
This leads us to a pretty obvious proposed implementation: “Query for business data from Google, social media data from Instagram/Twitter. Correlate social media data with business data if the social post originated within some radius from a business.” Surprising nobody, we’re going to have to do something a little more complicated to get the data we want.
Google, Twitter, and Instagram APIs allow for fairly large search radii, but return few results. For instance, querying Google for ‘nearby businesses’ in a 3KM radius would only return 20 results (due to result limiting). If we reduce the radius to something much smaller, like 50 meters, we end up with a more believable output (say ~12 results). Most APIs have some sort of hard result limit which tends to be problematic when trying to reason about data at scale. We get better resolution on queries with smaller radii (since they are unlikely to exceed the result limit), so our implementation will probably need to incorporate this somehow.
So let’s revise our proposal: We will need to figure out some way to make a BUNCH of queries with smaller radii and stitch them together to build the “full result”. We probably want to minimize the number of queries in order to get our final results faster/with less overhead.
Visualize the Earth. Pick any point on its surface, and draw a big circle with that point at the center. We can envision the area we want to analyze bounded by this (probably large) circle. Draw a much smaller circle at the same center-point (this represents the results for 1 query). Layer those small circles in some configuration to where they never overlap but still cover (most of) the larger-circle area. This is starting to look like a shape packing problem. Our goal is to figure out how to pack a bunch of small circles into a large circle while optimizing for as few small circles as possible.
We should also realize that we’re ACTUALLY packing circles into a circle on the surface of a sphere, so surface curvature will need to be handled somehow.
Let’s restate what we know:
Social media signals can be used to give a rough picture of foot traffic around places of business.
- We know that we need to generate the circle packing for the search-area in 3D.
- We’ll then have the coordinates of each ‘next request’, so we’ll need to iterate over that collection and get data from all of the services we need to query.
- Finally, we’ll need to correlate the data and do something interesting with it.
So our ‘big problem’ now became 3 distinct, stateless components: Generate circle packings -> Make requests -> Transform data for the application.
A quick google search tells us that finding the absolute minimum of n circles packed into a larger circle is an unsolved problem for the general case. Because this is a currently unsolved math problem, it’ll probably be easier to try to optimize for a specific case rather than attempt a general mathematical solution.
We need to deal with the surface curvature thing — which ultimately translates to: “How do I take the surface of the earth and make it 2D?” Well, maps are 2D representations of the earth so that’s probably what we’re looking for. An unimaginative google search of “mapping earth in 2d” returns a link on map projection, where we can learn all about using mercator projection to correlate latitude/longitude points with x/y coordinates in two dimensions.
This now changes our visualization - and I think it’s important to keep our conceptual understanding of the problem domain up to date. Instead of drawing circles directly onto the earth, we’re actually “unwrapping” the surface of the planet and laying it out flat. We then draw circles upon that, and lay it back onto the sphere once we’re done. We’re still technically only doing 2D circle packing but we’re dealing with the 3D nature of the problem in a clever way.
At this point, I sort of had an idea for how I was going to deal with the circle packing — a carrom board came to mind.
It turns out that hexagonal packing is the trivially optimal solution to circle-in-circle packing, so go figure. As an aside - someone of non-indian descent could come to this conclusion organically by reading the circle packing wiki entry.
4. Draw it out
I grabbed a piece of paper and a pen, and started drawing circles in the same hexagonal configuration in an effort to learn something about the pattern and how I went about drawing it.
Intuitively, by definition of hex-packing, each circle drawn is surrounded by 6 other circles. Each circle’s center-point is exactly 2*r (where r is the radius of the circle) from each of it’s neighbors’ center points.
I realized that when I drew the pattern by hand, I tended to draw it by layers: starting with one circle at the center, drawing 6 circles around that, 12 circles around those, 18 circles around those, and so on. Each layer has 6 more circles than the layer preceding it, so we can express the number of circles in the pattern as:
Where “layers” is the number of layers needed to fill the pattern optimally.
However, this raises the question: “How do you know how many layers are needed for an optimal fill?” Well, let’s define it simply: we know that we have reached an ‘optimal fill’ if adding another layer to the pattern means having to draw (parts of) circles outside of the bounded area. By this definition, the number of layers is just the number of circles from the center to the edge (or slightly less) of the outer circle.
I drew a circle around the pattern I had drawn previously, so that the edge of the outer circle met the outermost edges of the last layer I drew. I labeled that radius r1, and had drawn 3 layers from the center.
The first r2 corresponds with the center circle, but we only care about the number of layers - so by eliminating the first term we know that:
To be safe, we’ll have to round down (to guarantee that we never have circles exceeding our outer boundaries).
So now we know how to calculate the minimum number of layers (and therefore circles) necessary to complete a hexagonal packing, our next step is figuring out how to actually plot it.
“The wall” is that unexpected complexity that crops up at the worst moment. In this case: I know how many circles I will need to fill a given area and I know how to draw the packing manually, but I don’t actually know how to go about generating the positions of each circle with precision.
Scaling the wall:
I ended up walking away from the project for a few days in order to clear my head and approach it from a different angle. During my ‘hiatus’, I randomly met a buddy for lunch and we chatted a bit about the latest in our lives. I ended up mentioning this circle packing problem to him, and we talked about it for the next three hours (I have geeky friends). Over shawarma wraps, we considered two potential approaches: “The dumb way” - plotting 6 circles around the center circle, 6 circles around each of those circles, 6 around each of those, etc.; “The (less?) dumb way” - determining the distance between each circle and the center based on some pattern and plotting accordingly. The goal was to use the ‘minimum number of circles’ equation to turn this into an efficient iterative solution.
We figured we’d take a stab at the second method - so we began drawing circles and triangles. We were using trigonometry to figure out the relationship between the center point and circles on varying layers. After brute forcing relationships layer by layer for a while, we came to realize that distances (with respect to the center point) vary in a familiar pattern.
We realized that you could take the hexagonal circle packing pattern and divide it into 6 ‘triangles’. Each ‘triangle’ resembles a ‘pascals triangle’ structure, where each number represents a distance from that circle to the center point.
Imagine that ‘1’ represents the value 2r2. We know that down the edges of the triangle, the distance from any of those circles to the center will be 2r2some constant (pertaining to the layer the circle lives in). However, the distance from the circles in the middle (like the ‘2’ in the middle of the second layer) to the center circle are various values other than 2r2. Each ‘new’ number introduced in the pascal’s triangle configuration (i.e: anything that isn’t 1) is a new distance that needs to be accounted for - which means that for the general solution - there is an indefinite number of distances that need to be calculated before the circles can even be plotted. Suddenly “the (less?) dumb way” is out.
However, I realized something that seemed way too obvious to not have occurred to us already: The distance from any one circle to its immediate neighbors is 2*r2.
Therefore if you know where your ‘current’ circle is, you can draw its next two closest neighbors by calculating 2*r2 away from the circle at 0° and 60°.
We’ll have duplicate coordinates with this approach since neighbors overlap, but de-duping is cheap and easy. We keep generating layer after layer and ultimately we end up with a triangular configuration of circles - which we can then rotate 6 times about the center point.
We now theoretically know how to generate our circle packing — and gaining a new perspective and some fresh input helped us in “scaling the wall”.
After a considerable amount of trial/error, I finally got a working implementation of hexagonal circle packing in 2D. In this case, the algorithm was able to generate the coordinates for 3000 circles packing an area of radius 3KM in 74.43838 msecs.
When attempting to map the x/y coordinate circles generated by the algorithm onto latitude/longitude projections, I ran into the issue that ‘mercator projection’ is not just one equation - it really depends on whether you assume the Earth is a Sphere or a Spheroid. There are tons of mercator projection libraries out there (including one for Clojure) but I wasn’t able to find one that was consistent with the Spherical Earth model and had reliable reverse operations. I ended up building a simple (but reliable) mercator projection library which lives in the file “mercator.clj” in the project, and is used to map lots of lat/long coordinates to x/y, and then back to lat/long.
I used the throttler library to make sure the application was playing nice with service APIs, and that all of my requests were below the rate limits for the APIs I was using.
After a lot of tweaking, the final implementation worked. I queried google for every restaurant within 3KM of the Able office, ranked by the number of social media posts within 5 meters of their location.
The results were pretty compelling - lots of people like to take pictures of fancy looking food, so there were a significant number of small restaurants that bubbled up to the top which otherwise would’ve flown under the radar. Down the road, we could even build out some Natural Language Processing functionalities to draw conclusions from the social media content itself (i.e: What are people saying about [business]? Who is saying it?)
The circle packing implementation itself is general enough to be used anywhere that requires the canvasing/sampling of large geographical areas (census projections, regional market analysis, resource management, etc.) - lead generation was only an example for this specific implementation. It would be awesome to see someone use it to do large scale analysis on the data of entire countries (technically possible with <10 lines of additional code given the current state of the project, but it would take forever unless you were querying a non-rate-limited endpoint).
Having a process for approaching and distilling problems has been an invaluable part of my workflow. I’ve found that by spending most of my time constructively reflecting on the implementation before writing a single line of code, I understand the problem domain better and can therefore reason about the underlying implementation more naturally.
I wanted to approach this blog post in a way that was accessible to everyone, and really spoke to the journey rather than just the end product.