Let’s Dig Old Data

Source: Axel Antas-Bergkvist, Unsplash

Fact: We would all like to make better, faster, data-driven decisions. However, with limited resources and time, that can be a tall order.

So–let’s keep things simple. Let’s address how you can use data to analyze socioeconomic and demographic trends in a community over the long term, and, specifically, how mySidewalk can help.

Real-Time Data vs. Historical Data

If you’re plugged in to any of the chatter going on in the civic and gov tech world, you’ve probably heard a lot of discussion about the “Internet of Things” and sensor-generated data: information that is transmitted and analyzed in real time to surface patterns quickly.

However, I’d like to make a case for the exact opposite of real-time data — historical data.

I’m not trying to discount the value of real-time sensor data in certain applications. I do believe that microchips are going to drive the next big “megatrend” and will substantially impact the shape of our cities. At the same time, I think there’s a lot to be learned about the communities we live in, through both a macro and a micro lens, by analyzing historical data and trying to decipher the underlying demographic trends that the data may reflect or appear to be driven by.

An Ingredient Your Analysis Might Be Missing

Temporal data analysis involves very powerful techniques that can reveal a lot more insight than just examining the most recent observations.

Spatio-temporal data analysis is a technique that involves producing insights from both the spatial and temporal components of your data.

It is valuable because it lets you dig deep into trends. It will help you understand how your community has changed over time, and even forecast what the next few years or decades might look like.

Yes, that’s right–you could use spatio-temporal data analysis to predict the future.

But, how do I use it?

One of the easiest ways to give your existing data a temporal component is to keep metadata about its currency: the time period in which it was measured and for which it is valid. This will enable you to apply temporal data analysis techniques later to detect and study trends in your data over a period of time.
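As a rough illustration of what keeping that kind of metadata could look like, here is a minimal Python sketch. The `Observation` class, its field names, and the sample values are all hypothetical and made up for this example; they are not mySidewalk's data model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Observation:
    """One measured value plus the metadata that says when it applies."""
    geography_id: str   # hypothetical identifier for the area measured
    variable: str       # what was measured
    value: float
    period_start: date  # first day the measurement covers
    period_end: date    # last day the measurement covers
    retrieved_on: date  # when this copy of the data was obtained

    def covers(self, day: date) -> bool:
        """Return True if the observation is valid for the given day."""
        return self.period_start <= day <= self.period_end


# Illustrative, made-up record (not real data).
obs = Observation(
    geography_id="example-block-group-1",
    variable="renter_occupied_housing_units",
    value=120.0,
    period_start=date(2010, 1, 1),
    period_end=date(2014, 12, 31),
    retrieved_on=date(2016, 1, 15),
)

print(obs.covers(date(2012, 6, 1)))  # is the record valid for mid-2012?
```

Storing the validity period next to each value is what later lets you line observations up in time, instead of guessing how current a number is.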

If your data already has temporal information, another simple method for gaining useful insight is to look at the change over time for those variables. This will highlight trends among your variables. The temporal component also makes it possible to join those datasets with other datasets from the same time period, allowing data analysts to perform correlation and perhaps even causal analyses on those datasets. This can yield even more valuable insights about how a change in one variable may be correlated with, or perhaps even the cause of, change in another variable.
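To make the idea concrete, here is a small sketch in plain Python of the three steps described above: compute change over time, join two datasets on the areas they share for the same period, and measure how the changes move together. The area labels, the two variables, and every number are made up for illustration, and the helper names are my own.

```python
from math import sqrt

# Made-up illustrative values keyed by (area, year); not real data.
renter_units = {("A", 2000): 100, ("A", 2010): 120,
                ("B", 2000): 200, ("B", 2010): 180,
                ("C", 2000): 150, ("C", 2010): 165}
population = {("A", 2000): 300, ("A", 2010): 330,
              ("B", 2000): 500, ("B", 2010): 480,
              ("C", 2000): 400, ("C", 2010): 420}


def change_over_time(series, start, end):
    """Return {area: end value - start value} for areas observed in both years."""
    areas = {area for area, _ in series}
    return {a: series[(a, end)] - series[(a, start)]
            for a in sorted(areas)
            if (a, start) in series and (a, end) in series}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


renter_change = change_over_time(renter_units, 2000, 2010)
population_change = change_over_time(population, 2000, 2010)

# Join on the areas present in both datasets for the same period.
shared = sorted(renter_change.keys() & population_change.keys())
r = pearson([renter_change[a] for a in shared],
            [population_change[a] for a in shared])
print(renter_change, population_change, round(r, 3))
```

A coefficient near 1 or -1 only says the two changes move together in this sample; it says nothing by itself about which one, if either, causes the other.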

Yes, I agree–seeing trends evolve over time is a pretty big deal.

It’s even more important to include temporal information when collecting data containing observations from experiments that are not repeatable under the same circumstances. Note that I use the term “experiment” purely in the statistical sense — merely a geek’s term for a random or nondeterministic phenomenon. An example of such observations might be income levels of members in your community. While recent censuses have included median earnings by gender in community data, it’s difficult to study how the income discrepancy by gender has evolved over a longer period of time since we lack data from earlier censuses.

Running into Roadblocks

While computer- and sensor-generated datasets usually come with very precise timestamps, a lot of human-generated data lacks temporal information. For example, we’ve observed that registered voter data from most states does not contain much temporal metadata. Most states don’t document when the observations were measured, or even report how current a dataset of their registered voters is. That makes it impossible to analyze whether certain events or treatments in a particular community can be correlated with growth or decline in the number or demographics of registered voters in that area.

Stop Digging–Here Is an Applicable Solution

At mySidewalk, our goal is to make these spatio-temporal analyses on data achievable without needing to do the math or crack open your favorite statistics book. It’s even okay if you don’t have a favorite statistics text.

In fact, mySidewalk already has some datasets that help you see change over time of a particular variable for a particular area.

As you can see below in the mySidewalk screenshot, it appears from a quick visual scan that the number of renter occupied housing units in the Kansas City MSA decreased from 2000 to 2010.

Above: Change in renter occupied housing units in the Kansas City MSA. Source: mySidewalk platform, U.S. Census Bureau, 2010–2014 American Community Survey (ACS) 5-Year Estimates

Let’s dig a bit deeper, shall we?

From this screenshot (see below), I’ve filtered the map to only show block groups where the number of renter occupied housing units grew.

Above: Areas where the number of renter occupied housing units has grown. Source: mySidewalk platform, U.S. Census Bureau, 2010–2014 American Community Survey (ACS) 5-Year Estimates

Below is a screenshot of the same map extent, but showing block groups where the number of renter occupied housing units shrank.

Above: Areas where the number of renter occupied housing units shrank. Source: mySidewalk platform, U.S. Census Bureau, 2010–2014 American Community Survey (ACS) 5-Year Estimates

Obviously, this “temporal analysis” was possible because we knew the temporal component of both sets of observations, and were able to look at a change over time. Unfortunately, because of the lack of temporal information in voter registration data, we’re not able (at this time) to determine how growth or shrinkage in the number of renter occupied housing units correlates with the number of registered voters in that geography.
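If you wanted to reproduce the grew/shrank filter outside the platform, the logic could be sketched roughly as follows. This is not how mySidewalk implements it; the block group IDs and counts are invented for the example, and `split_by_change` is a hypothetical helper.

```python
# Made-up counts of renter occupied housing units per block group (not real data).
units_2000 = {"bg-1": 100, "bg-2": 250, "bg-3": 80, "bg-4": 60}
units_2010 = {"bg-1": 130, "bg-2": 210, "bg-3": 80, "bg-5": 40}


def split_by_change(earlier, later):
    """Split areas measured in both periods into those that grew and those that shrank."""
    grew, shrank = {}, {}
    # Only areas observed in both periods can be compared.
    for area in sorted(earlier.keys() & later.keys()):
        delta = later[area] - earlier[area]
        if delta > 0:
            grew[area] = delta
        elif delta < 0:
            shrank[area] = delta
        # Areas with no change fall into neither group.
    return grew, shrank


grew, shrank = split_by_change(units_2000, units_2010)
print("grew:", grew)
print("shrank:", shrank)
```

Note that areas missing from either period are dropped rather than treated as zero, which is exactly the comparison that becomes impossible when a dataset carries no temporal information.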

Fear not–we’ve got quite a few things in the pipeline being released in the next couple of months at mySidewalk that we’re really excited about. The development of these features means that soon you’ll be able to perform complex temporal analyses using lots of historical data in just a few clicks.

Interested in seeing how your community (or project area) has changed over the past decade? Request a free, interactive report here.

Correlation ≠ Causation

This is a disclaimer the statistician in me wants to make, so that I’m not implying otherwise: the existence of a correlation between two variables should not be considered evidence of a cause-and-effect relationship between them. You should, however, check out one of my favorite websites on spurious correlations: http://tylervigen.com/spurious-correlations

About the Author: Riddhiman Das is a proud data scientist, Chief Data Officer at mySidewalk, and NOT a cat person. With a background in Computer Science, his general interests lie in Machine Learning, Data Science, Applied Statistics, Computer Vision/Image Processing, high-frequency/low-latency Distributed Systems, and functional programming.