On Using Data to Make More Data
Welcome to my final 2024 blog post on developing the Transit-Oriented Developments dataset. Each prior post has identified a data analytics concept and discussed how I’ve applied it to my work of creating a dataset of transit station areas in the United States. My prior posts have addressed data accuracy, completeness, tabulation methods, and obsolescence. Read them, and you’ll notice that for much of the summer and fall I’ve been in a defensive crouch, presuming that problems exist in my data and working through ways to address them. As the year draws to a close, I’ve begun to go on offense: generating new data instead of mending existing information.
If you work with data, you’ve probably heard the terms “feature derivation” and “data enrichment.” Maybe you have even derived features or enriched some data yourself! Feature derivation transforms existing data into more meaningful representations, while data enrichment incorporates additional information from external sources. Both techniques can create new knowledge and insights.
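To make the distinction concrete, here is a minimal sketch using pandas. The table, the column names (`year_built`, `county`, `county_seat`), and the sample rows are hypothetical stand-ins, not the actual station dataset.

```python
import pandas as pd

# Hypothetical two-row station table (illustrative values only).
stations = pd.DataFrame({
    "stop_name": ["Station A", "Station B"],
    "year_built": [1974, 2020],
    "county": ["Alameda County", "Sacramento County"],
})

# Feature derivation: compute a new column from one the table already has.
stations["station_age"] = 2024 - stations["year_built"]

# Data enrichment: join in information that comes from an outside source.
county_info = pd.DataFrame({
    "county": ["Alameda County", "Sacramento County"],
    "county_seat": ["Oakland", "Sacramento"],
})
enriched = stations.merge(county_info, on="county", how="left")

print(enriched[["stop_name", "station_age", "county_seat"]])
```

The derived column needs nothing beyond the original table, while the enriched column exists only because a second table was joined in.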
Steadying the Stool
As discussed here, my national transit station dataset stands on a three-legged stool: General Transit Feed Specification (GTFS) data provides station locations and associated transit routes (our first leg), the National Transit Database (NTD) Facility Inventory supplies station age, size, and configurations (our second leg), and when those sources wobble, websites like Wikipedia and transit agency pages step in as our third leg to steady the structure.
Approximately 87% of the 4,729 stations in my dataset contained data in both the GTFS and NTD. That left 630 stations with GTFS location, mode, and route data but without NTD data on the station type, year built, or square footage. Of these, 223 stations are associated with fixed guideway systems that are not required to report to the NTD because they do not receive Federal funds. The other 407 stations were not included in the facility inventory along with the other stations submitted by NTD reporters. Some of these stations are missing because they opened after the most recent reporting deadline. Other gaps remain a mystery. Why did San Francisco Muni exclude 153 light rail stations from its report? Why did Lane Transit District in Eugene, Oregon not report any of its 45 bus rapid transit stations? If anyone can shed light on the answers to these and others of life’s most important questions, please give me a call.
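The arithmetic behind these counts can be checked in a few lines. This sketch uses only the figures quoted above; the variable names are my own.

```python
total_stations = 4729   # stations in the dataset
missing_ntd = 630       # stations with GTFS data but no NTD data
not_required = 223      # systems not required to report to the NTD
unreported = 407        # stations left out of the facility inventory

# The two groups of missing stations should add up to the total gap.
assert not_required + unreported == missing_ntd

with_both = total_stations - missing_ntd
share_with_both = with_both / total_stations
print(f"{with_both} of {total_stations} stations ({share_with_both:.1%}) have both GTFS and NTD data")
# prints: 4099 of 4729 stations (86.7%) have both GTFS and NTD data
```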
In the meantime, I worked to fill gaps in the facility type and year built through some old-fashioned detective work, looking up station configurations and the year that a service opened (a proxy for the year the station was built) online. I gave up on trying to find the square footage of the missing stations, as I don’t know of any source other than the NTD for this information.
The chart below summarizes my national transit station dataset features:
The World’s Address
Information about the mode a station serves, how big it is, and whether it is underground, at-grade, or elevated is a useful starting point for further analysis, but a station’s accurate longitude and latitude coordinates are foundational. Like the yeast in a slurry, they can be combined with other ingredients to produce new information, such as a station’s address. Having an address helps situate a station in our built environment (most of us are much more familiar with street names and intersections than with coordinate pairs) and can include additional useful features. This information can be derived by using reverse geocoding.
To undertake reverse geocoding, I used the Google Maps Client API, a tool provided by Google that allows developers to access Google Maps services directly from their applications or code. Google Maps searches its massive location database, which includes streets, businesses, landmarks, and administrative regions, and identifies the closest matching location to the provided lon/lat coordinates.
My first steps involved preparing my dataset for reverse geocoding. I loaded the pandas library, which is used for handling tabular data, along with a library that connects to Google Maps services and one that pauses the program briefly to avoid sending too many requests at once.
I then used my API key to initialize a connection to Google Maps, allowing the program to make API requests. My code next creates a column labeled “full_address” used to store the addresses generated by my reverse geocoding.
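A minimal sketch of this kind of setup might look like the following. It is not the notebook's actual code: it assumes the `googlemaps` Python package for the connection, and the helper names (`make_client`, `add_address_column`) and the GTFS-style column names (`stop_name`, `stop_lat`, `stop_lon`) are hypothetical.

```python
import pandas as pd


def make_client(api_key):
    """Create a Google Maps client (assumes the `googlemaps` package is installed)."""
    import googlemaps  # imported lazily so the rest of the sketch runs without it
    return googlemaps.Client(key=api_key)


def add_address_column(stations):
    """Return a copy of the station table with an empty 'full_address' column."""
    stations = stations.copy()
    stations["full_address"] = pd.NA
    return stations


# Hypothetical two-row sample standing in for the real station table.
sample = pd.DataFrame({
    "stop_name": ["Station A", "Station B"],
    "stop_lat": [37.80, 37.87],
    "stop_lon": [-122.27, -122.26],
})
sample = add_address_column(sample)
print(sample.columns.tolist())
```

With a real key, `client = make_client("YOUR_API_KEY")` would open the connection. Keeping the key out of the notebook itself (for example, in an environment variable) is a common precaution.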
The next section of my code addresses data formatting issues that could prevent reverse geocoding. This includes a function to ensure that the latitude and longitude values that will be used for the geocoding are numeric (i.e., a float data type), code to report which rows have invalid data, which makes them easier to spot, and code that converts latitude and longitude data to numeric values if necessary. The screenshot below shows the code.
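In text form, validation along these lines could be sketched as follows. This is an illustration under assumptions, not the code from the screenshot: the function names and the `stop_lat`/`stop_lon` column names are hypothetical, and `pd.to_numeric(..., errors="coerce")` is one common way to do the conversion.

```python
import pandas as pd


def is_numeric(value):
    """Return True if the value can be read as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def coerce_coordinates(stations, lat_col="stop_lat", lon_col="stop_lon"):
    """Report rows with non-numeric coordinates, then convert both columns to floats."""
    stations = stations.copy()
    valid = (
        stations[lat_col].map(is_numeric).astype(bool)
        & stations[lon_col].map(is_numeric).astype(bool)
    )
    bad_rows = stations[~valid]
    for col in (lat_col, lon_col):
        # Values that cannot be converted become NaN instead of raising an error.
        stations[col] = pd.to_numeric(stations[col], errors="coerce")
    return stations, bad_rows


# Hypothetical sample with one unusable latitude and some numbers stored as text.
sample = pd.DataFrame({
    "stop_name": ["Station A", "Station B", "Station C"],
    "stop_lat": ["37.80", "not available", 37.87],
    "stop_lon": [-122.27, -122.30, "-122.26"],
})
cleaned, bad_rows = coerce_coordinates(sample)
print("Rows with invalid coordinates:", bad_rows["stop_name"].tolist())
```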
Once I completed my data pre-processing, I ran scripts to generate street address data from each station’s longitude and latitude coordinate pair. The function shown below sends the latitude and longitude coordinates from my data to the Google Maps API. It also ensures that I am only populating addresses for rows with valid lon/lat coordinates. The code loops through each row of my dataset and stores the reverse-geocoded address in the “full_address” column. I also included code to print out the address in real time (or “address not found”).
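The loop described here can be sketched roughly as below. It assumes a client object with a `reverse_geocode((lat, lon))` method that returns a list of results carrying a `formatted_address` field, which is how the `googlemaps` Python client behaves. The function names, column names, pause length, and the stand-in `StubClient` with its made-up address are all hypothetical.

```python
import math
import time


def has_valid_coordinates(lat, lon):
    """True when both values are real numbers inside valid lat/lon ranges."""
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False
    if math.isnan(lat) or math.isnan(lon):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180


def reverse_geocode_address(client, lat, lon):
    """Return the first formatted address for (lat, lon), or None if nothing matched."""
    results = client.reverse_geocode((lat, lon))
    if results:
        return results[0].get("formatted_address")
    return None


def fill_addresses(stations, client, pause_seconds=0.1):
    """Loop over a pandas DataFrame and store each address in 'full_address'."""
    if "full_address" not in stations.columns:
        stations["full_address"] = None
    for idx, row in stations.iterrows():
        if not has_valid_coordinates(row["stop_lat"], row["stop_lon"]):
            continue  # skip rows that cannot be geocoded
        address = reverse_geocode_address(client, row["stop_lat"], row["stop_lon"])
        stations.at[idx, "full_address"] = address
        print(address or "address not found")
        time.sleep(pause_seconds)  # brief pause to avoid sending requests too quickly
    return stations


class StubClient:
    """Offline stand-in for a Google Maps client; the address is made up."""

    def reverse_geocode(self, latlng):
        return [{"formatted_address": "123 Example St, Oakland, CA 94607, USA"}]


print(reverse_geocode_address(StubClient(), 37.80, -122.27))
```

Passing the client in as an argument makes it easy to try the loop with a stub before spending API quota on the full dataset.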
The code produced a steady stream of addresses populated for each row of my dataset. The first twenty addresses (which happen to coincide with the Alameda/Oakland Ferry and the AC Transit Tempo Bus Rapid Transit Line) are shown below.
My reverse geocoding generated address information for all 4,728 stations in my dataset. However, Google Maps isn’t perfect, and many transit stations, which are not situated on city blocks like homes or businesses, are hard to associate with a street address. For example, the image below shows three AC Transit Tempo BRT stops with a red dot for the station lon/lat coordinates and the reverse geocoded address underneath. In the image on the left, Google Maps assigned an exact street address. In the center image, it assigned a pair of cross streets. For the station on the right, Google Maps assigned just a street name. (If you are planning on sending a holiday card to the Durant Avenue station, it may not get there in time.)
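One way to flag these three cases automatically is to look at the `types` list that the Geocoding API attaches to each result (values such as `street_address`, `intersection`, and `route`). The sketch below is an assumption about how that could be done, not something from the notebook. The function name, the labels, and the sample results are hypothetical.

```python
def classify_address_precision(result):
    """Label one geocoding result by how specific its address is."""
    types = set(result.get("types", []))
    if "street_address" in types or "premise" in types:
        return "exact address"
    if "intersection" in types:
        return "cross streets"
    if "route" in types:
        return "street name only"
    return "other"


# Hand-written stand-ins for API results (not real responses).
examples = [
    {"formatted_address": "100 Example St, Oakland, CA", "types": ["street_address"]},
    {"formatted_address": "Example St & Sample Ave, Oakland, CA", "types": ["intersection"]},
    {"formatted_address": "Example Ave, Berkeley, CA", "types": ["route"]},
]
labels = [classify_address_precision(result) for result in examples]
print(labels)
```

Storing a label like this next to each address would make it easy to count how many stations received only a street name.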
Oh County, My County!
For reasons that I’ll explain later, I am also interested in the counties in which the transit stations are located, so I developed a script to extract county information from the station’s street address generated previously. After calling the Google API again, my code includes a function to read the information in my “full_address” column and check to see if Google Maps has data for the “administrative area level 2” (which is Google Maps speak for “county”) for that address. The code then loops through each row of data, reads the address from the full_address column, calls the get_county() function to get the county name, and writes the result back into the County column of the dataset. The complete code block is below. You can find my notebook with this code here.
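A rough sketch of the county lookup follows. The `get_county()` name, the `full_address` and `County` columns, and the `administrative_area_level_2` type come from the description above. Everything else is an assumption: the signature (taking the client as an argument), the use of the client's `geocode(address)` method, and the `StubClient` with its hand-written response.

```python
def get_county(client, address):
    """Return the county name for an address, or None if none is found."""
    if not isinstance(address, str) or not address.strip():
        return None  # nothing to look up for missing or empty addresses
    results = client.geocode(address)
    for result in results:
        for component in result.get("address_components", []):
            if "administrative_area_level_2" in component.get("types", []):
                return component.get("long_name")
    return None


def add_county_column(stations, client):
    """Look up the county for every 'full_address' and store it in 'County'."""
    stations["County"] = [get_county(client, address) for address in stations["full_address"]]
    return stations


class StubClient:
    """Offline stand-in for a Google Maps client; the response is hand-written."""

    def geocode(self, address):
        return [{
            "address_components": [
                {"long_name": "Oakland", "types": ["locality", "political"]},
                {"long_name": "Alameda County",
                 "types": ["administrative_area_level_2", "political"]},
            ]
        }]


county = get_county(StubClient(), "123 Example St, Oakland, CA")
print(county)  # Alameda County
```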
Why Cities and Counties Matter
Generating data on cities and counties is worthwhile because these jurisdictions play a crucial role in supporting or impeding transit-oriented development (TOD) around stations. While transit lines are the arteries of our transportation system, cities and counties are the vital organs that regulate development around them.
Cities typically govern land use and zoning rules, height requirements, transportation policy, and infrastructure investment decisions within their boundaries. Counties do the same for unincorporated areas and may also take the lead on regional policies such as growth management plans that can impact where development takes place, if it takes place at all.
It is not uncommon for a transit system to serve multiple cities. For example, both the Sacramento Regional Transit District (SacRT) and the Los Angeles County Metropolitan Transportation Authority (LA Metro) provide service within a single county (Sacramento County and Los Angeles County, respectively). However, the SacRT system includes 53 stations across 6 cities, and the LA Metro system includes 202 stations located in 32 cities. (A future post will explore how different jurisdictions impact neighborhoods around stations operated by the same agency.)
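Once each station carries a city, counts like these fall out of a simple group-by. The sketch below uses a made-up five-row table with hypothetical column names (`agency`, `stop_name`, `city`); it does not reproduce the SacRT or LA Metro figures.

```python
import pandas as pd

# Made-up stations for two fictional agencies.
stations = pd.DataFrame({
    "agency": ["Agency X", "Agency X", "Agency X", "Agency Y", "Agency Y"],
    "stop_name": ["A", "B", "C", "D", "E"],
    "city": ["City 1", "City 1", "City 2", "City 3", "City 3"],
})

# Count stations and distinct cities served by each agency.
summary = stations.groupby("agency").agg(
    stations=("stop_name", "count"),
    cities=("city", "nunique"),
)
print(summary)
```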
Putting it All Together
Ultimately, my feature derivation and data enrichment work expanded my national transit station dataset to include additional features, as shown below:
And this is just the beginning. I plan to use feature derivation and data enrichment tools repeatedly in 2025 as I pair my station data with OpenStreetMap’s rich source of information on land uses, transportation features, and points of interest and as I join station coordinates with Census geographies and their demographic data. I’m also planning to roll out data visualization and lookup tools and access to detailed data. All of this in the service of helping people create better places to live and work near transit stations. I hope you are excited! I know I am.