Sunday, February 8, 2009

Duplicates

One of the most frequent support requests we handle comes from kind people around the world reporting duplicate entries in our database. Many include detailed reports of which entry is wrong, along with comprehensive, correct data for us to use in its stead.

I wanted to write a bit about what a duplicate is, where it comes from, and some new strategies we're trying to solve the problem.

So, what's a duplicate? The case we're talking about is where a single physical station is recorded more than once in our database, each time with a very slightly different address. For example, one of our users might enter 12001 S Douglas Blvd, Guthrie, OK, whilst another might enter it as S Douglas Blvd & E Charter Oak Rd, OK. If you click both of those links, you'll see that they're for almost the same location -- Google's geocoder places them only 1 metre apart -- and in practice, any human would recognise both those addresses as being for the same place.

How does it happen? A lot of these duplicate entries are due to a couple of poor design decisions we made at our end. One of the issues we have is that there's an expectation that stations will appear on the map as soon as they've been entered. Unfortunately, for purely technical reasons, this is not the case: it can take up to 15 minutes for a station submission to be approved and placed on the map. When this happens, a lot of people think that something's gone wrong, so they'll try adding the station again, under some variation of the address. Other times it's where we've sourced a listing of stations from somewhere (a handful of chains have complete listings available on their websites), and find that these listings have stations listed that have already been added by our users under slightly different formulations of the address. The other source of confusion here is that GasBag 1.x will actually "hide" some stations, but I'll write more about that in a moment.

The problem with all this is that since we're a map-driven application, we need to find a way to cram all these stations onto a map. This is often hard enough when we only have one entry per station (it's common to have stations clustered around an intersection, for example), but the problem is just compounded when we have three or four listings for each of those stations. As I mentioned above, when faced with this situation, GasBag 1 will actually just stop putting stations on the map once a certain density of stations has been reached.

The big question is: what are we going to do about it?! Well, there are two main strategies we're planning to use to sort this problem out. The first is that starting with GasBag 2 (which is making its way through the approval process now), we will no longer omit any stations from the display. If it's in our database, past a fairly modest zoom level, it'll be on your screen. To avoid having so many bubbles that you can't actually see the map, we've introduced a new "bubble stacks" concept. The idea is that when we have an area with lots of stations, those stations will be represented by a single icon resembling a set of bubbles stacked one on top of the other. When you tap on that icon, GasBag will zoom in and unstack the stack, revealing each of the stations it represents. We don't expect this interface to solve every problem, but we're hoping that it will at least resolve some of the confusion people are having with GasBag 1.
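
For the technically curious, here's a rough sketch of how stacking of this kind can work. It is not the GasBag 2 code: it assumes a simple grid approach, where the map is divided into square cells and every station that falls in the same cell joins the same stack. The function name `stack_bubbles` and the `cell_deg` parameter are made up for illustration.

```python
from collections import defaultdict
from math import floor


def stack_bubbles(points, cell_deg):
    """Group stations into stacks by snapping them to a lat/lon grid.

    points   -- iterable of (station_id, lat, lon) tuples (hypothetical shape)
    cell_deg -- grid cell size in degrees; zooming in means a smaller value

    Returns a list of lists of station ids. A list holding more than one id
    would be drawn as a stack, a list holding one id as an ordinary bubble.
    """
    cells = defaultdict(list)
    for station_id, lat, lon in points:
        # Stations whose coordinates fall in the same grid cell share a key.
        key = (floor(lat / cell_deg), floor(lon / cell_deg))
        cells[key].append(station_id)
    return list(cells.values())


stations = [
    (1, 35.0012, -97.0012),
    (2, 35.0027, -97.0027),
    (3, 35.5000, -97.5000),
]
zoomed_out = stack_bubbles(stations, cell_deg=0.01)   # 1 and 2 share a stack
zoomed_in = stack_bubbles(stations, cell_deg=0.001)   # every station on its own
```

Shrinking the cell size as the user zooms in is what "unstacks" a stack in this sketch. A grid is the simplest option; two stations sitting either side of a cell boundary won't be stacked together, which a smarter distance-based grouping would handle.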

The second thing we're doing is to run a batch job each night to go through our database looking for duplicate stations, and we're just going to have this program pick one, scrap the other, and get on with life. Our reasoning is that if two stations are so close together that their location on the map is indistinguishable, then it probably doesn't matter that much which of them we choose to display. This script will ensure that you'll never see two stations of the same brand within a quarter-mile of each other (that's 400 metres for those of us Down Under). Initial testing of this script has shown very encouraging results. This one will be rolled out later today to our live servers, so if you've been frustrated by duplicates in the past, keep an eye out for improvements over the next few days.
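
Here's a minimal sketch of that rule in Python, to make it concrete. It is not the actual nightly script: the `Station` record, the `dedupe` function, the case-insensitive brand comparison and the "keep whichever comes first" choice are all assumptions for illustration. Only the same-brand test and the 400 metre radius come from the description above.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, in metres


@dataclass(frozen=True)
class Station:
    # Hypothetical record shape; the real database schema is not shown here.
    station_id: int
    brand: str
    lat: float
    lon: float


def distance_m(a, b):
    """Great-circle (haversine) distance between two stations, in metres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))


def dedupe(stations, radius_m=400.0):
    """Keep a station only if no already-kept station of the same brand
    lies within radius_m of it. Earlier entries win (an assumption)."""
    kept = []
    for s in stations:
        is_duplicate = any(
            k.brand.lower() == s.brand.lower() and distance_m(k, s) < radius_m
            for k in kept
        )
        if not is_duplicate:
            kept.append(s)
    return kept


stations = [
    Station(1, "Arco", 35.0000, -97.0),
    Station(2, "arco", 35.0005, -97.0),    # same brand, roughly 56 m from 1
    Station(3, "Arco", 35.0100, -97.0),    # same brand, over 1 km away
    Station(4, "Conoco", 35.0000, -97.0),  # different brand, same spot as 1
]
survivors = dedupe(stations)  # stations 1, 3 and 4 remain
```

This compares every station against every kept station, which is fine for a sketch but slow for a large database; a real job would more likely bucket stations by area first. The `radius_m` parameter is the knob that trades fewer duplicates against the risk of dropping a genuinely separate station.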

We're pretty excited about this, because we know it's been a big problem since day 1, and it's always satisfying to cross one of those babies off your list. So thanks for putting up with us while we work on this; we hope it will have been worth the wait.

James -- Founder and CTO.

4 comments:

RAmeeti said...

While you are probably correct in that multiple entries of very close proximity of identical brands will probably not be significant, it is interesting that in my area, we have 2 Arco stations diagonal from each other on the same corner. But yes, I do think they most always have the same price.

James said...

Hi RAmeeti,

you're quite right, there are indeed a handful of cases where there are two stations with the same brand in very close proximity. In Australia, where I'm from, I've found this often happens on expressways, where there's no easy way to get to the other side of the road.

But we agree with your surmise: that in almost all cases, those stations will have the same price. So whilst it's not a perfect solution, we hope it will have minimal negative impact, and we also think it will help clean up the map so that people can make more sense of it, more of the time.

-- James.

Rob said...

I just found your blog after using GasBag for a few months, and this actually answers my biggest question -- what happened to my neighborhood gas station?

I'm in a similar situation to RAmeeti, in that there are two Conoco stations on diagonally-opposite corners of an intersection. They're independent franchises though, and frequently have different prices. At the moment, the oft-dropped station is in with a slightly-inaccurate address, but could you investigate lowering your distance parameter from 400m to maybe 100m?

pdxjim said...

Thanks for the great work. Here's a different address problem:

I show two Costco stations in Portland, Oregon with addresses 4849 NE 138th Ave. They show states of OR and CA and I'm pretty sure there are no California stations in Portland.

What's a good standard channel to let you folks know about this type of duplicate?

Cheers