Map-Matching for Speed Prediction | by João Paulo Figueira | Jan, 2024

How fast will you drive?

Photo by Julian Hochgesang on Unsplash

When planning future vehicle routes, you trust the digital map providers for accurate speed predictions. You do this when picking your phone up to prepare for a car trip or, in a professional setting, when planning routes for your vehicle fleet. The forecasted speeds are an essential component of the trip’s cost, especially as they are one of the fundamental drivers of energy consumption in electric and combustion vehicles.

The digital mapping service providers collect information from live traffic and historical records to estimate how fast you can drive along a specific road at any given time. With this data and intelligent algorithms, they estimate how quickly an average vehicle will travel through the planned route. Some services will accept each vehicle’s features to tune the route path and the expected trip times.

But what about specific vehicles and drivers like yours? Do these predictions apply? Your cars and drivers might have particular requirements or habits that do not fit the standardized forecasts. Can we do better than the digital map service providers? We have a chance if you keep a good record of your historical telematics data.

In this article, we will improve the quality of speed predictions from digital map providers by leveraging the historical speed profiles from a telematics database. This database contains records of past trips that we use to modulate the standard speed inferences from a digital map provider.

Central to this is map-matching, the process by which we “snap” the observed GPS locations to the underlying digital map. This correcting step allows us to bring the GPS measurements in line with the map’s road network representation, thus making all location sources comparable.

The Road Network

A road network is a mathematical concept that supports most digital mapping applications. Usually implemented as a directed multi-graph, each node represents a known geospatial location, usually some noteworthy landmark such as an intersection or a defining point on a road bend, and the connecting directed edges represent straight-line paths along the road. Figure 1 below illustrates the concept.

Figure 1 — The diagram above shows the simplified representation of a road segment as represented by a digital map. Each node, represented by a red dot, corresponds to a known geospatial location defined by latitude, longitude, and altitude coordinates. Each connecting line is an edge, and digital maps use these to represent the roads (the travel direction is omitted here). When we ask a digital map provider for directions, we get a long sequence of these nodes and edges and an estimate of the time it will take to travel such roads. (Image source: Author)

When we request a route from a digital map provider, we get a sequence of road network nodes and their connecting edges. The service also provides the estimated travel times and corresponding speeds between all pairs of nodes (in some cases, the speed estimates cover a range of nodes). We get the total trip duration by adding all the partial times together.

If we get better estimates for these times, we will also have better speed estimates and a better route prediction overall. The source for these better estimates is your historical telematics data. But knowing the historic speeds is just a part of the process. Before we can use these speeds, we must make sure that we can project them onto the digital map, and for this, we use map-matching.


Map-matching projects sequences of GPS coordinates sampled from a moving object’s path to an existing road graph. The matching process uses a Hidden Markov Model to map the sampled locations to the most likely graph edge sequence. As a result, this process produces both the edge projections and the implicit node sequence. You can read a more detailed explanation in my previous article on map matching:

After reading the above article, you will understand that the Valhalla map-matching algorithm projects the sampled GPS locations into road network edges, not to the nodes. The service may also return the matched poly-line defined in terms of the road network nodes. So, we can get both edge projections and the implicit node sequence.

On the other hand, when retrieving a route plan from the same provider, we also get a sequence of road network nodes. By matching these nodes to the previously map-matched ones, we can overlay the known telematics information to the newly generated route and thus improve the time and speed estimates with actual data.

Before using the map-matched locations to infer actual speeds, we must project them to the nodes and adjust the known travel times, as illustrated in Figure 2 below.

Figure 2 — The diagram above illustrates the challenge of mapping the implicit speeds of sampled GPS locations, displayed as green dots, to the speeds between the map nodes, represented as red dots. The orange diamonds represent the map-matched sampled GPS locations. In this problem, we know the average speeds between the green dots and want to infer the average speeds between the red dots. (Image source: Author)

As a prerequisite, we must correctly sequence both sets of locations, the nodes, and the map matches. This process is depicted in Figure 2 above, where the map matches, represented by the orange diamonds, are sequenced along with the road network nodes, represented as red dots. The travel sequence is obviously from left to right.

We assume the time differences between the GPS locations are the same as the map-matched ones. This assumption, illustrated by Figure 3 below, is essential because there is no way to infer what effect in time the map matching has. This assumption simplifies the calculation while keeping a good approximation.

Figure 3 — We assume that the time differences between consecutive samples are the same as the corresponding map-matched ones. (Image source: Author)

Now that we know the time differences between consecutive orange diamonds, our challenge is to use this information to infer the time differences between consecutive red dots (nodes). Figure 4 below shows the relationship between the two sequences of time differences.

Figure 4 — The diagram above helps us understand how we interpolate the time differences between road network nodes (red dots) using the time differences from the sequence of map-matched locations (orange diamonds). (Image source: Author)

We can safely assume that the average speeds between consecutive orange diamonds are constant. This assumption is essential for what comes next. But before we proceed, let’s define some terminology. We will use some simplifications due to Medium’s typesetting limitations.

We need to deal with two fundamental quantities: distances and timespans. Using Figure 4 above as a reference, we define the distance between the orange diamond one and red dot one as d(n1, m1). Here, the letter “m” stands for “map-matched,” and the letter “n” stands for node. Similarly, the corresponding timespan is t(n1, m1), and the average speed is v(n1, m1).

Let us focus on the first two nodes and see how we can derive the average speed (and the corresponding timespan) using the known timespans from orange diamonds one to four. The average speed of travel between the first two map-matched locations is thus:

Because the average speed is constant, we can now compute the first timespan.

The second timespan is just t(m2, m3). For the final period, we can repeat the process above. The total time is thus:

We must repeat this process, adapting it to the sequence of nodes and map matches to calculate the projected trip times between all nodes.

Now that we have seen how to project measured speeds onto a digital map let’s see where to get the data.

Telematics Database

This article uses a telematics database to infer unknown road segment average speeds. All the geospatial data in the database is already map-matched to the underlying digital map. This characteristic helps us match future service-provided routes to the known or projected road segment speeds using the abovementioned process.

Here, we will use a tried-and-true open-source telematics database I have been exploring lately and presented in a previously published article, the Extended Vehicle Energy Dataset (EVED), licensed under Apache 2.0.

We develop the solution in two steps: data preparation and prediction. In the data preparation step, we traverse all known trips in the telematics database and project the measured trip times to the corresponding road network edges. These computed edge traversal times are then stored in another database using maximum resolution H3 indices for faster searches during exploration. At the end of the process, we have collected traversal time distributions for the known edges, information that will allow us to estimate travel speeds in the prediction phase.

The prediction phase requires a source route expressed as a sequence of road network nodes, such as what we get from the Valhalla route planner. We query each pair of consecutive nodes’ corresponding traversal time distribution (if any) from the speed database and use its mean (or median) to estimate the local average speed. By adding all edge estimates, we get the intended result, the expected total trip time.

Data Preparation

To prepare the data and generate the reference time distribution database, we must iterate through all the trips in the source data. Fortunately, the source database makes this easy by readily identifying all the trips (see the article above).

Let us look at the code that prepares the edge traversal times.

Figure 5— The code above shows the main loop for the data preparation script. (Image source: Author)

The code in Figure 5 above shows the main loop of the data preparation code. We use the previously created EVED database and save the output data to a new speed database. Each record is a time traversal sample for a single road network edge. For the same edge, a set of these samples makes up for a statistical time distribution, for which we calculate the average, median, and other statistics.

The call on line 5 retrieves a list of all the known trips in the source database as triplets containing the trajectory identifier (the table sequential identifier), the vehicle identifier, and the trip identifier. We need these last two items to retrieve the trip’s signals, as shown in line 10.

Lines 10 to 16 contain the code that retrieves the trip’s trajectory as a sequence of latitude, longitude, and timestamps. These locations do not necessarily correspond to road network nodes; they will mostly be projections into the edges (the orange diamonds in Figure 2).

Now, we can ask the Valhalla map-matching engine to take these points and return a poly-line with the corresponding road network node sequence, as shown in lines 18 to 25. These are the nodes that we store in the database, along with the projected traversal times, which we derive in the final lines of the code.

The traversal time projection from the map-matched locations to the node locations occurs in two steps. First, line 27 creates a “compound trajectory” object that merges the map-matched locations and the corresponding nodes in the trip sequence. The object stores each map-matched segment separately for later joining. Figure 6 below shows the object constructor (source file).

Figure 6 — The compound trajectory constructor receives the map-matched trajectory and the map nodes as separate arrays of latitudes and longitudes. The code merges both trajectories into a list of segments and later converts these to a trajectory list. (Image source: Author)

The compound trajectory constructor starts by merging the sequence of map-matched points to the corresponding road network nodes. Referring to the symbols in Figure 2 above, the code combines the orange diamond sequence with the red dot sequence so they keep the trip order. In the first step, listed in Figure 7 below, we create a list of sequences of orange diamond pairs with any red dots in between.

Figure 7 — The above function merges the map-matched trajectory points with the corresponding road network node sequence. The function tries to assign sequential nodes, if any, between each pair of trajectory points. A list collects all merged sequences for a final consolidation step. (Image source: Author)

Once merged, we convert the trajectory segments to node-based trajectories, removing the map-matched endpoints and recomputing the traversal times. Figure 8 below shows the function that computes the equivalent between-node traversal times.

Figure 8 — The conversion from a segment to a trajectory requires calculating the equivalent times between road network nodes. (Image source: Author)

Using the symbology of Figure 2, the code above uses the traversal times between two orange diamonds and calculates the times for all sub-segment traversals, namely between node-delimited ones. This way, we can later reconstruct all between-node traversal times through simple addition.

The final conversion step occurs on line 28 of Figure 5 when we convert the compound trajectory to a simple trajectory using the functions listed in Figure 9 below.

Figure 9 — The final reconstruction step concatenates the projected trajectory segments. Note how arrays with more than two elements are clipped. The clipped elements correspond to the map-matched positions and are thus removed. Arrays with only two components have map-matched positions that coincide with road network nodes. (Image source: Author)

The final step of the code in Figure 5 (lines 30–32) is to save the computed edge traversal times to the database for posterior use.

Data Quality

How good is the data that we just prepared? Does the EVED allow for good speed predictions? Unfortunately, this database was not designed for this purpose so that we will see some issues.

The first issue is the number of single-edge records in the final database, in this case, a little over two million. The total number of rows is over 5.6 million, so the unusable single-edge records represent a sizable proportion of the database. Almost half the rows are from edges with ten or fewer records.

The second issue is trips with very low frequencies. When querying an ad-hoc trip, we may fall into areas of very low density, where edge time records are scarce or nonexistent. In such conditions, the prediction code tries to compensate for the data loss using a simple heuristic: assume the same average speed as in the last edge. For larger road sections, and as we see below, we may even copy the data from the Valhalla route predictor.

The bottom line is that some of these predictions will be bad. A better use case for this algorithm would be to use a telematics database from fleets that frequently travel through the same routes. It would be even better to get more data for the same routes.


To explore this time-prediction enhancement algorithm, we will use two different scripts: one interactive Streamlit application that allows you to freely use the map and an analytics script that tries to assess the quality of the predicted times by comparing them to known trip times in a LOOCV type of approach.

Interactive Map

You run the interactive application by executing the following command line at the project root:

streamlit run

The interactive application allows you to specify endpoints of a route for Valhalla to predict. Figure 10 below shows what the user interface looks like.

Source link

Be the first to comment

Leave a Reply

Your email address will not be published.