Filling Data Gaps In OpenStreetMap
Welcome to my third post in a series exploring transit-oriented data quality: the extent to which digital information on stations and their surrounding areas represents reality. Prior posts discussed ways to ensure complete and accurate facility infrastructure and location information. This post delves into the data on the land uses surrounding stations. It provides an overview of OpenStreetMap (OSM), identifies transit station areas that are “under-mapped”, and presents a case study on adding buildings near a station to the OpenStreetMap dataset. Along the way, I’ll explain how to extract land use data from the OSM database and discuss how my results can promote Transit-Oriented Development.
OpenStreetMap
Transit-Oriented Discoveries leans heavily on OpenStreetMap (OSM), a collaborative mapping platform that allows people to contribute and edit geographic data, creating a detailed and up-to-date map of the world. Unlike proprietary mapping services, OSM’s data is freely available for anyone to use, making it a powerful tool for a variety of applications. OSM provides insights into the built environment, land use patterns, and infrastructure networks.
OSM’s crowd-sourcing approach creates data quality strengths and challenges. On the one hand, the open platform means OSM is highly adaptable, constantly updated, and inclusive of local knowledge that might be overlooked by commercial tools. On the other hand, since contributions come from volunteers with varying levels of expertise, there can be inconsistencies in data quality, particularly in less populated or less frequently mapped areas. Urban areas with active mapping communities tend to have more accurate and detailed data than exurban areas and small towns. When I’ve described my project to OSM veterans, they’ve warned me to be on the lookout for transit station areas where OSM data is missing or incomplete.
Mind The Gap
This advice raises questions: What does it mean for an OSM map of an area to be “complete”? How does one know if an area around a station is “incomplete?” How many details are enough for a map to be considered an accurate representation of the natural or built environment? How do we know if these details represent the current environment as opposed to what existed a few weeks, months, or years ago?
I am a novice OSM user and, at this point, am not prepared to tackle these questions head on and in full. Instead, I will start with one component of missing OSM data: neighborhoods near transit stations where buildings have not been added to the map.
The image below from my prior post shows the location of the Long Island Rail Road St. James station with a circle representing a 1/2 mile radius. Notice the area in the top right corner of the image east of Moriches Road includes residential streets but no houses or other buildings. This is not because a residential subdivision is under construction (in fact, OSM has a separate tag for construction sites) but because none of OSM’s volunteer mappers have added buildings to this area.
The number of buildings near transit stations along with other features such as square footage and height will be important inputs to future Transit-Oriented Discoveries density analysis. How many other station areas have missing information, and how much land near the stations is under-mapped?
Buildings Near BART Stations
As part of my work to answer these questions for all transit systems in my data set, I recently reviewed the 50 stations in the Bay Area Rapid Transit (BART) system and visually identified 12 station areas (24% of the system) with similar gaps to the LIRR example: areas with a street grid but no buildings included. Consistent with the feedback I received from the OSM community, these stations are primarily in the more outlying and less densely populated areas of the rail network as opposed to in downtown Oakland or San Francisco. The BART system map below highlights stations with under-mapped areas.
Let’s take a closer look at one of these stations, the Lafayette Station located in Lafayette, CA on the BART yellow line serving the eastern side of the Bay Area. The station is located in the median of State Route 24. Per the image below, the station parking lots have been mapped (with purple images signifying a solar powered canopy on the lots), as have some of the buildings south of Route 24. Other features such as a stream, forest, and recreational areas (shown in green) are also included on the map.
What’s missing? Most likely the houses and other buildings that line Orchard Road, Oak Hill Road, and the other residential areas north of Route 24. (OSM mappers have colored this area gray to signify that a residential area exists.) It’s also very likely that houses or other buildings are missing from the areas to the north and south of the creek near Brook Street.
Let’s use the Lafayette station as a case study for updating OpenStreetMap to add buildings to the area surrounding the station.
Establishing a Baseline
Ideally, there would be a way to estimate the number of missing buildings in the transit-shed. There probably is a way to do this with a model that incorporates the street network geometry, the number of existing buildings, and data from similarly situated transit station areas. However, this will be a project for another time. For now, I’ll develop a script to identify the number of buildings that are currently mapped and the total amount of land these buildings occupy, then re-run the query after I’ve added buildings to OSM. As in prior posts, I’ll be using Python code for this analysis.
I started by importing several libraries. The “requests” library retrieves the building data from OpenStreetMap through the Overpass API. The “shapely” and “geopandas” libraries are used to convert the raw data into geometric shapes (polygons) and perform spatial operations such as calculating the area. “Geopandas” also allows the spatial data to be handled in a structured way, making it easier to perform operations like area calculation and coordinate system transformation. I also supplied the latitude and longitude of the Lafayette Station from the National Transit Atlas, and specified a radius around the station in both miles and meters.
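A minimal sketch of that setup step might look like the following. The coordinates are rounded, approximate values for the Lafayette Station used only for illustration (they are not the exact atlas figures), and the constant names are hypothetical.

```python
# Approximate station location (illustrative values, rounded to three decimals).
STATION_LAT = 37.893
STATION_LON = -122.124

# Search radius around the station, expressed in both units.
RADIUS_MILES = 0.5
METERS_PER_MILE = 1609.344  # exact by definition of the international mile
RADIUS_METERS = RADIUS_MILES * METERS_PER_MILE

print(f"Searching within {RADIUS_MILES} miles ({RADIUS_METERS:.1f} m) of the station")
```

Keeping the radius in meters matters because the Overpass "around" filter expects its distance in meters.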
I developed a function to query the Overpass API to retrieve data about buildings within 1/2 mile of the station. The first part of the function provides the URL for the Overpass API's interpreter endpoint, which processes queries written in Overpass Query Language (Overpass QL). The interpreter endpoint acts as an interface between my script and the OpenStreetMap (OSM) database, allowing me to send queries and retrieve data from OSM.
The next part of the code constructs the query. It searches for all OSM elements tagged as "building" within my 1/2 mile radius. It outputs the results and then recursively fetches the points (nodes) that define the building outlines. For example, a simple rectangular building might have four points, one at each corner of the building, that form the building’s skeleton. The “out skel qt;” line of code outputs this skeleton data (IDs and coordinates only, without tags) in a quick and terse format. Finally, the script sends the request to the API and stores the response, which arrives in JavaScript Object Notation (JSON) format and is converted into a Python dictionary that my code can easily work with.
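As a rough sketch of what such a function could look like (not necessarily the exact code in the full script), the example below separates building the query text from sending it. The function names are hypothetical, the URL is the public Overpass interpreter endpoint, and for simplicity the query is restricted to ways, since those are what the later steps use.

```python
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_building_query(lat, lon, radius_m):
    """Return an Overpass QL query for building ways within radius_m meters of a point."""
    return f"""
    [out:json];
    way["building"](around:{radius_m},{lat},{lon});
    out body;
    >;
    out skel qt;
    """


def fetch_buildings(lat, lon, radius_m):
    """Send the query to the Overpass interpreter and return the parsed JSON as a dict."""
    # Third-party library; imported here so the query builder works without it installed.
    import requests

    query = build_building_query(lat, lon, radius_m)
    response = requests.post(OVERPASS_URL, data={"data": query}, timeout=60)
    response.raise_for_status()  # stop early on HTTP errors
    return response.json()


# Inspect the query text without making a network call.
print(build_building_query(37.893, -122.124, 804.672))
```

In the query, "out body;" prints the matching ways, ">;" recurses down to the nodes those ways reference, and "out skel qt;" prints those nodes.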
The next portion of the code takes the data from the Overpass API and calculates the total area occupied by the buildings. It also counts the number of buildings found in the data. It starts by defining “elements” as the information queried from the API and then keeps only the elements that are "ways" (which represent the outlines of buildings). It also creates a dictionary mapping building point IDs to their longitude and latitude coordinates.
The code loops through each way and extracts the coordinates of the points for each way. It ensures the first and last coordinates are the same, confirming it's a closed shape (a valid building outline). If it is a closed polygon, the code creates a polygon object from the coordinates and adds it to the polygons list. Each polygon represents one building.
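The parsing steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name and the sample data are made up, and the closed outlines are kept as lists of (longitude, latitude) tuples rather than shapely objects. In a shapely-based script, each closed list could be passed to shapely.geometry.Polygon instead.

```python
def extract_building_rings(data):
    """Return one closed list of (lon, lat) tuples per building way in an Overpass response."""
    elements = data.get("elements", [])

    # Map each node ID to its (longitude, latitude) pair.
    nodes = {el["id"]: (el["lon"], el["lat"]) for el in elements if el["type"] == "node"}

    rings = []
    for el in elements:
        if el["type"] != "way":
            continue
        coords = [nodes[node_id] for node_id in el.get("nodes", []) if node_id in nodes]
        # A valid outline needs at least three corners plus a repeat of the first point.
        if len(coords) >= 4 and coords[0] == coords[-1]:
            rings.append(coords)
    return rings


# Made-up sample: one closed square and one open line that should be skipped.
sample = {
    "elements": [
        {"type": "way", "id": 1, "nodes": [10, 11, 12, 13, 10]},
        {"type": "way", "id": 2, "nodes": [10, 11, 12]},
        {"type": "node", "id": 10, "lat": 0.0, "lon": 0.0},
        {"type": "node", "id": 11, "lat": 0.0, "lon": 0.001},
        {"type": "node", "id": 12, "lat": 0.001, "lon": 0.001},
        {"type": "node", "id": 13, "lat": 0.001, "lon": 0.0},
    ]
}
rings = extract_building_rings(sample)
print(f"Buildings found: {len(rings)}")
```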
The code converts the list of polygons into a GeoDataFrame (a geospatial data structure) and assigns its initial coordinate reference system (CRS). It then calculates the total area of all polygons in the GeoDataFrame in square meters and converts the result to square feet.
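One detail worth spelling out: OSM coordinates are in degrees, so an area in square meters only makes sense after the shapes are projected onto a flat, meter-based surface. With geopandas, that generally means reprojecting the GeoDataFrame to a projected CRS before reading its area. The dependency-free sketch below is a stand-in for that step, not the method used in the full script: it applies a simple local equirectangular approximation followed by the shoelace formula, which should be reasonable for building-sized shapes. All names and the sample square are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371008.8      # mean Earth radius in meters
SQFT_PER_SQM = 10.7639104167    # square feet in one square meter


def ring_area_sq_m(ring):
    """Approximate area in square meters of a closed ring of (lon, lat) tuples."""
    lon0, lat0 = ring[0]
    mean_lat = math.radians(sum(lat for _, lat in ring) / len(ring))

    # Project to local planar coordinates in meters, measured from the first corner.
    points = [
        (
            EARTH_RADIUS_M * math.radians(lon - lon0) * math.cos(mean_lat),
            EARTH_RADIUS_M * math.radians(lat - lat0),
        )
        for lon, lat in ring
    ]

    # Shoelace formula over consecutive corner pairs.
    twice_area = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return abs(twice_area) / 2


def total_area_sq_ft(rings):
    """Sum the approximate footprint of every ring and convert to square feet."""
    return sum(ring_area_sq_m(ring) for ring in rings) * SQFT_PER_SQM


# Made-up square, 0.001 degrees on a side, near the equator.
square = [(0.0, 0.0), (0.001, 0.0), (0.001, 0.001), (0.0, 0.001), (0.0, 0.0)]
print(f"{ring_area_sq_m(square):,.0f} square meters")
print(f"{total_area_sq_ft([square]):,.0f} square feet")
```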
The last portion of the code executes the query to the Overpass API, retrieving data about buildings within a 1/2 mile radius of the Lafayette Station. It processes this data to determine the number of buildings and calculate the total area occupied by these buildings. Finally, it prints out the results, showing how many buildings were found and how much land they occupy in square feet.
You can find the complete code on my GitHub page. When I ran this code in early August prior to updating the area, it returned 143 buildings for a total of 951,267 square feet within 1/2 mile of the Lafayette Station.
Inside the Matrix
After I identified a baseline of development near the Lafayette station, I started adding additional buildings using the OpenStreetMap editor. Updating OpenStreetMap does not require any computer programming or data analytics expertise. OSM provides a mapping tutorial, and adding information to the map is relatively intuitive. Patience and attention to detail may be the most important qualities of a good mapper.
The image below depicts the Lafayette station on the front end of OSM and in edit mode. The edit mode provides a more detailed view of street segments, building outlines, and various markers. These visuals represent OSM’s basic building blocks: lines, points, and areas. Lines are a series of connected points that are used to depict roads, railways, rivers, paths, and other linear elements. Points (or nodes) mark specific locations on the map. They represent discrete features like individual landmarks or points of interest (POIs). Areas are closed polygons that represent a defined space such as buildings, land use zones, and administrative boundaries. Areas are created by connecting a series of points (nodes) to form a closed loop, with the first and last points being the same. (This is why a portion of my code checked to make sure that the points being extracted from OSM formed a closed polygon).
OSM areas will play an important role in the Transit-Oriented Discoveries database because they provide useful information on land uses around transit stations. For example, the image below of the strip mall to the south of Route 24 identifies a retail area shown by the orange border surrounding the buildings.
OSM offers information on the origin and history of data, including any changes that have occurred over time. For example, the screenshots below from the public-facing version of OSM identify the building in the retail area tagged as a Panda Express restaurant. The top left-hand corner provides details on when the building was tagged and a comment by the person who made the change: “adding and fixing ATP info to fast food and cafes.” (“ATP” refers to the All the Places dataset, which is a large-scale compilation of location data scraped from various public directories of businesses. This data includes detailed information on Points of Interest such as grocery stores, supermarkets, and other commercial entities.) OSM also provides a link to the mapper, whammo, as shown in the screenshot below.
Side note: the orange chicken at Panda Express is my favorite takeout guilty pleasure.
Building Out the Map
After watching the OSM tutorial, I started adding buildings to the map. The process involves zooming in, selecting a building that was not yet shaded, and tracing a perimeter around it with my mouse and keyboard. Per the screen shots below, each point (node) represents a corner of the building and the lines between the points form the outline. I tried to be true to the shape of the building but sometimes drew a simple rectangle around buildings with very detailed rooflines.
Once I outlined a building, I associated it with an area type (usually “house”, “building”, or “apartment”) and did a bulk upload to OSM after I’d identified a handful of buildings. It was a long and tedious process, and it was easy to get sloppy. OSM provides some helpful built-in edit checks, such as making sure multiple buildings don’t overlap with one another.
Ultimately, I identified and added almost all of the buildings within the 1/2 mile radius of the Lafayette station, and re-ran my query to tabulate the revised building count and area occupied by buildings. I was able to significantly increase the number of buildings and land occupied by buildings. I also added two construction sites shown in dark green south of CA-24.
Additional data presents a more complete picture of the station area’s land uses and urban form. The area’s south side is a mixture of commercial buildings, single family houses, and apartments that are clustered relatively close to one another, while the north side consists of larger detached single family houses on a suburban street grid. Additional housing on the south side of CA-24 would make more sense given the existing amenities and somewhat more walkable street grid (and likely opposition from neighbors on the north side to new apartments or other higher density housing). BART is planning additional development on the Lafayette station parking lots, and these projects as well as similar ones may help address the region’s affordable housing crisis. (Last month the median home price in Lafayette was $2 million and the median monthly rent for all property types was $3,900.)
Overall, the under-mapped areas are a relatively small proportion of the stations that I’ve reviewed to date. Of the 2,200 station areas I’ve visually inspected, 300 (or 14%) appear to have gaps similar to the Lafayette station. I will not be updating all of these areas myself since I learned that completing a single area is a time-consuming process. That said, OSM has plugins that can speed up the process of drawing buildings. This could also be a good volunteer project for high school students.
Once I’ve documented under-mapped station areas, I’d like to delve further into possible causes for the missing information. Are there simply more OSM mappers in some areas than others, or are OSM volunteers more interested in some areas than others, or do they consider some places more important or worthy of mapping? What are the demographics of the under-mapped communities compared to well-mapped areas and compared to the OSM mapper community? I’d be happy to have further conversations on these and other lines of inquiry. Let’s take BART to the Lafayette station and hash it out over orange chicken at the Panda Express.