As big data analytics takes to the roads, the need for accurate distance calculations has blossomed too. However, traditional approaches to this problem face considerable challenges with accuracy and scalability. Now a group of researchers out of the University of Maryland at College Park think they have found an accurate and scalable solution with their Distance Oracle.
Roads may seem like a pedestrian topic for big data analytics. After all, roads are just strips of concrete or asphalt that we use to drive from one place to another. What’s so complicated about that?
But roads are surprisingly complex when considered in aggregate. When you consider that this country has more than one million named roads, constituting more than 4 million miles of paved byways, you begin to understand the massive scale and complexity of our road network.
Anybody who traverses the roads for business or pleasure has a need to calculate the distances they’ll travel. Whether it’s a Blue Apron driver delivering a meal in a big city or a family headed to the beach for Memorial Day, the number of miles on the route between points A and B is a fundamental number that will impact many aspects of the journey.
The simplest way to calculate distance is to put a ruler down against a paper map. This Euclidean distance will be useful when we all have self-flying cars, but until then its usefulness will be limited mainly to crows. It may surprise you, however, to know that many online services, including Google Maps, often utilize the Euclidean distance when a user submits a common query like “Where are the 10 closest Chinese restaurants?”
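To make the idea concrete, here is a minimal sketch of what that "ruler on a map" calculation looks like in code. The points and units here are made up for illustration; real services would work with latitude/longitude and a spherical distance formula, but the straight-line principle is the same.

```python
import math

def euclidean_distance(p, q):
    """Straight-line ("as the crow flies") distance between two map points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

# Two hypothetical points on a planar map grid (arbitrary map units).
a = (0.0, 0.0)
b = (3.0, 4.0)
print(euclidean_distance(a, b))  # → 5.0
```

The catch, as the article notes, is that no car travels in a straight line: this number is a lower bound on the real road distance, not the distance itself.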
The more useful number is the distance we’ll travel across the existing roads. A graph database can be a useful tool for calculating this distance. If you transposed the roads into a graph database and treated every intersection of two (sometimes three) roads as a vertex (or node), you could then use a shortest-path algorithm, such as Dijkstra’s algorithm, to find the shortest distance between two vertices.
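The graph approach described above can be sketched in a few lines. This is a generic textbook implementation of Dijkstra's algorithm over a toy road network, not the researchers' code; the intersection names and mileages are invented for illustration.

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest road distance from source to target.

    graph: dict mapping each vertex (a road intersection) to a list of
    (neighbor, segment_length) pairs.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]  # priority queue of (distance-so-far, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # target unreachable

# Toy road network: intersections A–D, segment lengths in miles.
roads = {
    "A": [("B", 2.0), ("C", 5.0)],
    "B": [("C", 1.0), ("D", 4.0)],
    "C": [("D", 1.5)],
    "D": [],
}
print(dijkstra(roads, "A", "D"))  # → 4.5 (via A→B→C→D, not A→B→D)
```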
While this approach works and is accurate, it’s computationally expensive and therefore not scalable. If you wanted to calculate the distances between 1,000 different places (or vertices in graph speak), then you’d have 1,000 times 1,000, or potentially up to 1 million, paths to check. And the amount of computation you’d need for all those shortest paths grows as N cubed — a billion operations for just those 1,000 places.
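The arithmetic behind that scaling argument is easy to check. The snippet below just evaluates the counts the article quotes, including what happens when N jumps from 1,000 places to the roughly 24 million vertices Professor Samet cites for the full U.S. road network.

```python
def pairs(n):
    """Number of source/destination pairs among n places."""
    return n * n

def cubic_work(n):
    """Rough all-pairs computation cost, growing as N cubed."""
    return n ** 3

print(f"{pairs(1_000):,}")            # → 1,000,000 pairs
print(f"{cubic_work(1_000):,}")       # → 1,000,000,000 (a billion)
print(f"{pairs(24_000_000):,}")       # → 576,000,000,000,000 pairs for the US network
```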
“This is for a small problem,” says University of Maryland Computer Science Professor Hanan Samet. “If you look at a road network of the United States, it has about 24 million vertices. These numbers get way beyond anything you can conceive of.”
Professor Samet and two of his students at the University of Maryland’s Institute for Advanced Computer Studies, Jagan Sankaranarayanan and Shangfu Peng, worked on this problem and came up with what they believe is a groundbreaking solution that could find application in any number of real-world big data analytic problems companies are facing today.
The solution boils down to precomputing distances between two points in an intelligent manner, storing the results in an SQL table, and then figuring out how to serve the distance values very quickly.
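The article doesn't detail the Distance Oracle's internals, but the general pattern it describes — precompute once, store in SQL, serve lookups fast — can be sketched as follows. The table layout, names, and distance values here are hypothetical, chosen only to illustrate the pattern.

```python
import sqlite3

# Precompute-and-serve sketch (not the researchers' actual scheme):
# shortest road distances are computed once offline, stored in a SQL
# table, and then served with a cheap indexed lookup instead of a
# graph search at query time.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE distances (
        src   TEXT,
        dst   TEXT,
        miles REAL,
        PRIMARY KEY (src, dst)
    )
""")

# Hypothetical precomputed values; in a real system these would come
# from running a shortest-path algorithm over the road network.
precomputed = [("A", "D", 4.5), ("A", "C", 3.0), ("B", "D", 2.5)]
conn.executemany("INSERT INTO distances VALUES (?, ?, ?)", precomputed)

# Query time: an index lookup on the primary key, not a graph traversal.
row = conn.execute(
    "SELECT miles FROM distances WHERE src = ? AND dst = ?", ("A", "D")
).fetchone()
print(row[0])  # → 4.5
```

The hard part, which the naive version of this sketch ignores, is that storing every pair explicitly is exactly the N-squared blowup described above — the researchers' contribution lies in doing the precomputation "in an intelligent manner" so the table stays tractable.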
Read the entire article at Datanami.