“Is it One or is it Many?” Reasoning with Transit Station Data (Part 1)
In my last update, I described TOD data quality as “the extent to which digital information on stations and their surrounding areas represent reality.” I examined satellite images of houses, shops, and other structures around the Lafayette BART station and noticed that some of these buildings (the reality) were not included in the OpenStreetMap database (the digital information). As I used OSM tools to add buildings to the database, the digital information got closer to reality.
This week’s post describes a new data quality challenge: how to tally and organize stations when some stations could either be counted individually or grouped together. I provide examples from larger and smaller transit systems along with a decision-making framework. This problem lacks a technical solution, but it does require some consistent reasoning and a willingness to make choices. In data science, as in life, there can be multiple ways to understand and categorize our reality.
Times Square
What constitutes a "station"? Is a station defined by its physical infrastructure, its functional role, its data attributes, or its perception by the public?
Consider the Times Square Station, located at the intersection of Broadway and 42nd Street in Manhattan. I imagine most people consider Times Square to be a singular place (it’s not called Times Squares after all).
But there’s more beneath the surface. The Times Square station is one of 36 New York City Transit (NYCT) subway station complexes, which are subway stations that are physically connected. Station complexes allow passengers to transfer between different subway lines without exiting the system. Times Square passengers can access the 1, 2, 3, 7, N, Q, R, and W trains, as well as the A, C, and E lines and the Port Authority Bus Terminal.
Station complexes are rooted in the system’s history and evolution. The New York City subway began as three separate systems: the Interborough Rapid Transit (IRT, lines 1, 2, 3, and 7), the Brooklyn-Manhattan Transit (BMT, lines N, Q, R, and W), and the Independent Subway System (IND, lines A, C, and E). Each of these systems was built with its own lines and stations, often operating independently of the others. These independent stations were stitched together over the course of the 20th century to reduce the street congestion caused by passengers exiting one station and entering another. The Times Square station complex includes four stations. (Although travelers can also connect to the A, C, and E lines, that station is listed separately as the 42nd St-Port Authority Bus Terminal station.)
In addition, the Times Square General Transit Feed Specification (GTFS) Stops data, one of the basic building blocks of Transit-Oriented Discoveries, includes eight different rows. I’m not sure why different identification numbers are assigned to rows with the same longitudes, latitudes, and route_ids. Perhaps the data seeks to identify express and local platforms or sides of platforms that serve trains running in opposite directions.
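One way to inspect rows like these is to group them by coordinates and see which IDs share a point. The sketch below is a minimal illustration, not the project's actual pipeline. The sample rows, IDs, names, and coordinates are invented for the example. The GTFS specification does define optional location_type and parent_station columns in stops.txt (location_type 1 marks a station, 0 or blank marks a stop or platform), but whether a given agency feed populates them is an assumption here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stops.txt excerpt: IDs, names, and coordinates are made up
# for illustration and are not taken from any agency feed.
SAMPLE_STOPS = """stop_id,stop_name,stop_lat,stop_lon,location_type,parent_station
X10,Example Sq,40.7500,-73.9900,1,
X10N,Example Sq,40.7500,-73.9900,0,X10
X10S,Example Sq,40.7500,-73.9900,0,X10
Y20,Example Sq,40.7510,-73.9880,1,
Y20N,Example Sq,40.7510,-73.9880,0,Y20
Y20S,Example Sq,40.7510,-73.9880,0,Y20
"""


def group_by_coordinates(stops_csv):
    """Map each (lat, lon) pair to the list of stop_ids that share it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(stops_csv)):
        key = (row["stop_lat"], row["stop_lon"])
        groups[key].append(row["stop_id"])
    return dict(groups)


def station_rows(stops_csv):
    """Return stop_ids of rows flagged as stations (location_type == "1")."""
    reader = csv.DictReader(io.StringIO(stops_csv))
    return [row["stop_id"] for row in reader if row["location_type"] == "1"]


groups = group_by_coordinates(SAMPLE_STOPS)
stations = station_rows(SAMPLE_STOPS)

for coords, ids in groups.items():
    print(coords, ids)
print("station-level rows:", stations)
```

In this toy feed, six rows collapse to two coordinate pairs, and the location_type column separates the station rows from the directional platform rows beneath them.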
Should Transit-Oriented Discoveries list Times Square as a single record, consistent with how most people perceive it? Should it include four records, one for each station in the complex? Five records, to include the 42nd St-Port Authority station (because it is connected to the Times Square stations underground)? Or eight records, one for each GTFS entry?
It’s tempting to aggregate all of the Times Square stations into a single entity. Because each of the stations in the complex is located within a block or two of the others, any station area demographic or land use analysis will result in double counting (or triple or quadruple counting) and could skew the dataset towards features more common to Times Square than elsewhere, such as tall buildings, population density, and theaters. But how to go about this? Would I choose one of the stations in the complex to represent the whole and, if so, which one? Would I create a bespoke Times Square station based on the center point of the lon/lat coordinates? Neither option seems appealing.
Likewise, I could incorporate the GTFS Stops data, duplicates and all, because it’s an official agency record. But including duplicate coordinates exacerbates the double-counting problem, and Transit-Oriented Discoveries cares more about surrounding areas than the precise location of platforms within a station.
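To make the double-counting concern concrete, here is a rough sketch that measures how far apart a handful of station points are and checks whether circular station areas drawn around them would overlap. It also computes the simple center point mentioned above. The coordinates are invented placeholders, and the 800-meter radius (roughly half a mile) is an assumed station-area size, not a figure from this project.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

# Hypothetical (lat, lon) points standing in for stations in one complex.
STATION_POINTS = [
    (40.7550, -73.9870),
    (40.7560, -73.9860),
    (40.7555, -73.9880),
    (40.7545, -73.9865),
]

ASSUMED_RADIUS_M = 800  # assumed station-area radius, about half a mile


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def overlapping_pairs(points, radius_m):
    """Return index pairs whose circular station areas intersect."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if haversine_m(*p, *q) < 2 * radius_m
    ]


def center_point(points):
    """Arithmetic mean of the coordinates (fine at this small scale)."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


pairs = overlapping_pairs(STATION_POINTS, ASSUMED_RADIUS_M)
center = center_point(STATION_POINTS)
print(f"{len(pairs)} overlapping pairs; center point {center}")
```

When every pair overlaps, as it does for points this close together, each building or resident near the complex would be counted once per station record. The center point is easy to compute, but it describes a place that no agency lists as a station.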
Ultimately, I included four different entries for the Times Square station, one for each entity in the station complex, and treated the other NYCT station complexes the same way. When in doubt, I defer to how the transit agency presents its information to the public. NYCT lists the stations in its complexes separately. Portions of the NYCT subway map attempt to reconcile the part-to-whole relationship through black lines that connect stations and, in some cases, with circles surrounding station complexes, including Fulton Street and Canal Street.
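A table can keep one record per station and still preserve the part-to-whole relationship by carrying the complex name as an extra field. The sketch below shows that idea under assumptions: the StationRecord class, its field names, and the sample rows are hypothetical and do not describe the actual Transit-Oriented Discoveries schema.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StationRecord:
    """One row per station as the agency lists it (hypothetical schema)."""
    station_id: str
    name: str
    complex_name: Optional[str] = None  # None for standalone stations


# Invented sample rows: four stations in one complex plus one standalone.
RECORDS = [
    StationRecord("s1", "Example Sq (Line A)", "Example Sq"),
    StationRecord("s2", "Example Sq (Line B)", "Example Sq"),
    StationRecord("s3", "Example Sq (Line C)", "Example Sq"),
    StationRecord("s4", "Example Sq (Shuttle)", "Example Sq"),
    StationRecord("s5", "Elm St", None),
]


def stations_per_complex(records):
    """Count how many station records belong to each named complex."""
    return Counter(r.complex_name for r in records if r.complex_name)


def count_places(records):
    """Count complexes once each, plus every standalone station."""
    complexes = {r.complex_name for r in records if r.complex_name}
    standalone = sum(1 for r in records if r.complex_name is None)
    return len(complexes) + standalone


print(len(RECORDS), "station records")
print(count_places(RECORDS), "places as a rider might count them")
print(stations_per_complex(RECORDS))
```

With this layout, the dataset follows the agency's station list while a reader can still roll the records up to the "one place" view when double counting matters.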
Grand Central Station
With its imposing great hall and gilded interior, Grand Central is a station archetype. If people are asked to name a bird, many would mention a robin. If they are asked to think of a train station, I suspect something like Grand Central would come to mind, and it would be a single entity, not a collection of stations.
However, as is the case with Times Square, Grand Central is a subway station complex that serves three subway lines (the IRT Lexington Avenue Line, the IRT Flushing Line, and the 42nd Street Shuttle). It also serves multiple branches of the Long Island Rail Road and Metro-North commuter rail systems. Transit-Oriented Discoveries subdivides Grand Central into five separate records: one for each NYCT station in the complex and one for each commuter rail system.
The Hop Streetcar
Reasoning about station data can be complicated in large metropolises and smaller cities alike. Consider the Milwaukee, WI streetcar, also known as the “Hop”, a 2.1-mile route that connects downtown to the city’s Lower East Side and Historic Third Ward neighborhoods. Streetcar systems often include station platforms located on opposite sides of a street in order to serve vehicles running in different directions. These platforms can sit directly across from one another, similar to two platforms in a subway station that are separated by a rail right-of-way. The image below shows the Hop route map and National Transit Database (NTD) data identifying two separate stations (GTFS data includes two entries as well).
As was the case with Times Square, it’s tempting to consolidate the two streetcar station platforms into a single station in order to avoid double-counting information around the station areas. And since we ruled out counting each platform previously, shouldn’t we do the same here? But there is still no way to resolve problems around which platform to choose or how to create a new entity. Also, the Hop website lists these stations as “Ogden at Astor Eastbound” and “Ogden at Astor Westbound.” Since we’re treating agency station descriptions as the “ground truth”, the Transit-Oriented Discoveries database contains two records.
The HealthLine Bus Rapid Transit
Bus Rapid Transit (BRT) is designed to provide faster and more reliable service than a conventional bus route by including features such as dedicated bus lanes, off-board fare collection, level boarding, and bus priority at intersections. BRT stations can be configured similarly to streetcar stations, with platforms located on opposite sides of a street.
In Cleveland, Ohio, the HealthLine BRT system, operated by the Greater Cleveland Regional Transit Authority (GCRTA), travels 6.8 miles down Euclid Avenue, providing service from downtown through East Cleveland. Some of the HealthLine stops are median stations, located within the Euclid Avenue busway, that use left- or right-side boarding, as buses have doors on both sides. Others are curb stations, more traditional bus stops where buses open their doors to the right curb of the street (see examples below).
When it comes to tallying the number of HealthLine stations, sources conflict. GTFS data includes two distinct records for each curb station and one record for each median station, for a total of 59 stops, similar to what appears on the HealthLine’s Wikipedia page. I could not find an official count of BRT stops on the GCRTA website, aside from the map of 39 stations presented here.
Ultimately, where a curb station has two stops, I included only the eastbound one in the Transit-Oriented Discoveries dataset, for a total of 39 stations. Since I had to choose between two different agency sources, I went with the source that would likely be more familiar to the public: the route and station map.
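The two tallies can be reconciled with a little arithmetic. If every station is either a curb station with two stop records or a median station with one, then records = 2 × curb + median and stations = curb + median. The sketch below applies that assumption and also shows the filtering step on a toy list. The field names, the "EB"/"WB" codes, and the sample stops are hypothetical, and the split it prints for 59 and 39 is only what those two totals imply under the assumption, not an agency count.

```python
# Toy stop records: two curb stations (two records each) and one median
# station (one record). Names and field values are invented.
TOY_STOPS = [
    {"name": "1st St", "kind": "curb", "direction": "EB"},
    {"name": "1st St", "kind": "curb", "direction": "WB"},
    {"name": "5th St", "kind": "median", "direction": None},
    {"name": "9th St", "kind": "curb", "direction": "EB"},
    {"name": "9th St", "kind": "curb", "direction": "WB"},
]


def one_record_per_station(stops, keep_direction="EB"):
    """Keep every median stop and only one direction of each curb pair."""
    return [
        stop for stop in stops
        if stop["kind"] == "median" or stop["direction"] == keep_direction
    ]


def implied_split(total_records, total_stations):
    """Solve records = 2*curb + median and stations = curb + median."""
    curb = total_records - total_stations
    median = total_stations - curb
    return curb, median


kept = one_record_per_station(TOY_STOPS)
print([stop["name"] for stop in kept])

# Totals quoted in the text above: 59 stop records and 39 mapped stations.
print(implied_split(59, 39))
```

Under that assumption the two sources are consistent with each other: 20 curb stations and 19 median stations would yield both 59 records and 39 stations.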
A high-quality data set builds trust between the data owner and its users. Transit-Oriented Discoveries is intended for people who are familiar with the transit systems and built environment in their communities and who want to create more sustainable development around stations. Information that is consistent with people’s prior knowledge and expectations, while accounting for and clearly explaining nuances (such as New York City station complexes and some streetcar and BRT platforms), will inspire greater confidence and be a more usable and useful public resource.