Correlate… ALL THE THINGS II: NYC


This is the second post in a four-part series examining interesting relationships that can be found when you correlate every characteristic/attribute of fine-grained (Census Tract) geographies in the mySidewalk database for a metro and then analyze the interesting ones. Also in the series: definitions, part I, part III
Scope
If this is your first taste of “Correlate… ALL THE THINGS”, I recommend taking a look at this handy/hilarious/handsomely written reference and definitions post.
In this post, we’ll examine the New York City / Five Boroughs Area. The “boroughs” make up the incorporated elements of New York City. The largest city in America, NYC has roughly 8.5 million residents. As with Part I, Wikipedia can provide the nitty gritty. Once again, for the purposes of this publication we’re using the five boroughs of this city as a mask to intersect census tracts (click the county and census tracts links for the precise geographies as geojson).
“But I live in Yonkers/Mt. Vernon/Long Island–it’s NYC’s 6th borough!”
*Hehe* Sure it is, buddy…
Correlating ALL the Things
Reiterating a point from last time: correlating “the everything” to “the everything” avoids having to carefully categorize the data and perform feature selection. Because, in the famous words of Sweet Brown, “Ain’t nobody got time for that.”


This being part II, we have the benefit of experience on our side. Instead of having our eyes overwhelmed by the blues and reds of the spreadsheet when we look at correlation to fundamental demographic units like “Total Population” and “Total Households”, we’ve normalized wherever possible and the strong correlations are better (and fewer) for it.
Total population, for example, has been normalized by total area and therefore represents population density. I’ve normalized items to their most specific universe wherever possible. When we examine unemployment, for example, it is normalized by the labor force in that tract, so the natural correlation unemployment has with population, age, and factors relating to choosing not to pursue employment falls out, leaving only the most specific correlations to attract our attention. Looking at the definitions sheet will allow you to see what every dataset was normalized by. The conditional formatting runs from brightest red at -1, through white at 0, to brightest blue at 1. Not only does this show patriotism, it allows us to quickly spot strong positive or negative correlations. On that note…


What’s the first thing you notice about that screenshot? If it’s that the bright blue and red cells have been kept to a dull roar, congratulations! Not only did you pay attention to part I, but you have good color and tone recognition.
You’re probably also likable and successful.
The next thing you’ll notice is that I have blanked the lower left triangle of the spreadsheet. It reduces the amount of noise at the cost of not being able to see every correlation on whatever row/column you’re on (i.e. you may have to follow the row or follow the column to find a specific correlation, depending on your position in the spreadsheet). I welcome comments on this change because I myself am a little torn…
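For the curious, blanking one triangle is easy to reproduce outside a spreadsheet. Here is a minimal Python sketch, assuming the coefficients live in a plain list of lists. The names `blank_lower_triangle` and `lookup` and the 3x3 values are made up for illustration and are not taken from the actual scripts behind this post.

```python
def blank_lower_triangle(matrix, blank=None):
    """Return a copy of a square matrix with every cell below the diagonal blanked."""
    return [
        [value if col >= row else blank for col, value in enumerate(values)]
        for row, values in enumerate(matrix)
    ]


def lookup(matrix, i, j):
    """Find the coefficient for the pair (i, j) in an upper-triangle matrix.

    Correlation is symmetric, so the pair is always stored at
    (smaller index, larger index).
    """
    row, col = min(i, j), max(i, j)
    return matrix[row][col]


# Hypothetical 3x3 correlation matrix (illustrative values only).
corr = [
    [1.0, 0.4, -0.6],
    [0.4, 1.0, 0.1],
    [-0.6, 0.1, 1.0],
]

upper = blank_lower_triangle(corr)
for row in upper:
    print(row)

# The cell at row 2, column 0 is blank, but the same pair is still
# available by following the column instead of the row.
print(lookup(upper, 2, 0))
```

The `lookup` helper is the code version of "follow the row or follow the column": nothing is lost because a correlation matrix is symmetric, you just have to look in the other half.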
Changes aside, here are the full results:
This spreadsheet and all data within is provided under a Creative Commons Attribution-NonCommercial Share-alike 4.0…docs.google.com
Some Interesting Highlights
Disclaimer: this analysis should be considered an “entertaining and informational exploration of the intersectional topics of geography, statistics, and demography”. No warranty is implied, no formal QA process checked my results, and no ideology is meant to be antagonized.
Highlight #1: Large swaths of little to no correlation
As discussed previously, normalizing all possible datasets by their most specific available universe had a fantastic filtering effect on the correlations. Population alone contributed in broad fashion to the moderate correlation seen between many datasets in CATT: Part I (as predicted by Parr’s 2nd law of geostatistics).
“2- Mo’ people, mo’ problems (of every kind).” — Brian Parr; Parr’s Laws of Geostatistics








The lesson (for geostatistical and demographic analyses): default to normalizing everything you can as specifically as you can. This is why we allow you to normalize datasets quickly, easily, and sensibly plus (sneak preview of upcoming features) will soon suggest a most specific normalization you can apply with the click of a button (or is it a link? a slider? knob? our design team loves change…).
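To make the lesson concrete, here is a small Python sketch of what normalizing by the most specific universe does to a correlation. Everything in it is an assumption for illustration: the six tracts, their counts, and the helper names `pearson_r` and `normalize` are invented, and this is not the code that produced the spreadsheet.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def normalize(counts, universe):
    """Divide each tract's count by that tract's universe to get a ratio."""
    return [count / base for count, base in zip(counts, universe)]


# Hypothetical tracts, ordered from least to most populous.
labor_force = [500, 1000, 1500, 2000, 2500, 3000]
unemployed = [50, 40, 120, 100, 225, 180]
households = [400, 800, 1200, 1600, 2000, 2400]
no_vehicle_households = [120, 480, 480, 960, 600, 1200]

# Raw counts: both columns grow with tract size, so they move together.
raw_r = pearson_r(unemployed, no_vehicle_households)

# Ratios: each count divided by its own, most specific universe.
unemployment_rate = normalize(unemployed, labor_force)
no_vehicle_rate = normalize(no_vehicle_households, households)
rate_r = pearson_r(unemployment_rate, no_vehicle_rate)

print(f"raw counts r = {raw_r:.2f}")
print(f"normalized r = {rate_r:.2f}")
```

On this made-up data the raw counts correlate positively simply because bigger tracts have more of everything, while the normalized ratios correlate negatively. Note that each count gets its own denominator (labor force for unemployment, households for vehicle ownership) rather than a blanket division by total population.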
Further, last time the data was sorted alphabetically by the name of the dataset; this time I opted to group the data roughly by what basic underlying factor it described (population totals, voting, foreign population, language, education, housing, etc.). This resulted in the removal of the small, random pillars of red and blue and allowed the trends to group naturally.
Highlight #2: On Density
NYC presents a fantastic chance to examine density. People, households, housing units, structures — you name it, Gotham’s got it, and it’s packed in tight, thank you muchly.
- Income inequality has a mild correlation to density in NYC. You could handwave this away by saying, “denser census tracts have larger samples, larger variability, more income inequality”, to which I’d say, “That’s really fair and sounds like every article critique I phoned in during my (required) sociology and psychology classes”. On the other hand, I think there are probably some uniquely New York (say that 3 times fast) variables at play here. The most appealing possibility to look at further is the incentives for developers to create and maintain so-called “mixed-income housing”, in which a large portion of the building is rented at market rate and a share of “set-asides” is offered at a substantial discount to low- and middle-income New Yorkers. Mixed-income housing comes with controversy, but many experts agree there are benefits.
- Quite by necessity, NYC’s density correlates positively with high capacity structures (5 to 50 housing units, peaking at 20 to 49 housing units) and negatively with low capacity structures (4 or fewer housing units). Single unit detached housing correlates negatively with density, and most planners and developers targeting urban cores have little interest in this kind of housing based on economic (both organic and artificial) and building code lines of reasoning. Still, there is something to be said for the fact that NYC is one of the few lucky cities in America where you need merely build taller buildings with more unit capacity and additional population density will materialize. Similarly, housing costs correlate ever so slightly with density: as demand goes up, so does price. Many population analysts attempt to push an agenda supporting building height limits and other byzantine building codes by showing that as development in an area increases, the rents actually increase, too. In doing so they attempt three small deceptions by simple omission: that new developments cost more because they are new and the developer is looking to profit; that the variability in income and living costs afforded by more density will provide some lower cost housing (even if the average housing cost continues to increase); and a simple truth every social network that has experienced hyper-growth knows well, namely that people want to be where lots of other people are.
- Vehicle ownership of any form, especially extravagant fleets of 3 or more per household, is exceedingly rare in dense areas. Not surprisingly, walking and public transit dominate where density exists (many planners speak about density and public transit using causal language, “a density [X] is needed to support transit mode [Y]” — for good reason).
- Dense areas are significantly more often rented than owned. This is most likely because the values of property in dense areas are too high to support density in the first place (paradoxically so…) and must be made available in an affordably structured manner (pay as you go).
- Housing unit/household density correlates better with degree holders in Literature & Language, Liberal Arts, Visual and Performing Arts, and Communication. Coming from an engineering school myself, I have no explanation to offer for this.
- Employment and participation in the labor force tilt just a bit higher in denser areas. Favored industries in dense areas seem to be service, professional, or entertainment in nature. Transportation, construction, and agricultural industry are rarer in dense conditions (a mercy to livestock and residents’ noses, I wager).
- Single women are relatively more abundant in dense areas of NYC than single men or married people of either gender.
- Dense populations are slightly more likely to be uninsured.
- “Millennials” are particularly fond of dense tracts while other generations seem more shy. “Baby Boomers” and “Matures” are actually less likely to live in a tract the denser its population.
- Population density is a moderate predictor of the likelihood the primary language spoken in a household in NYC is Español. Alegría!
- Educational achievement correlation skews towards the extremes (Bachelor’s/Graduate degree or Less than High School Graduate) when compared with density (but not to an extreme magnitude).
That last point makes an excellent transition to…
Highlight #3: Educational Attainment and The Goldilocks Rule
17- When extremes of a corpus of possible living choices do not present an agent with obvious advantages, those agents with more mobility and freedom to choose will tend toward those choices that are “just right”— Brian Parr; Parr’s Laws of Geostatistics; AKA “The Goldilocks Rule”
Skipping the obvious, well known correlations to vaguely income linked characteristics, educational attainment correlates interestingly to several groups of characteristics in NYC. I strongly suggest you examine the number of units in a structure that correlate well with different educational attainments, the age of structures that each population may prefer to live in, and also the vehicle ownership for those households.
The best educated New Yorkers seem to be willing and able to settle in the tracts that have a preponderance of housing that is “just right” for their tastes.
The best educated New Yorkers also live in areas where people spend the smallest fractions of their incomes on housing costs. A final note on educational attainment and choosing a home: judging from the data, very differently educated populations tend to avoid each other, with the notable exception of those who hold an associate’s degree (they’re more content to live in the areas with the least educated populations at a rate higher than even high school graduates!).
A possibly heartbreaking set of correlations (only possibly because I don’t know how easily moved the audience is by Pearson’s r coefficients) is discovered when comparing Women in the Workforce with a Birth in the Past Year (which I abbreviate “WIW Birth In Past Year” as a column header, apparently) with educational attainment. It seems that a population’s educational attainment is the characteristic most predictive of the employment and marital status of its recent mothers who are employed or would like to be employed. The highest educational attainments correlate positively with married, employed mothers in the workforce and the lowest educational attainments correlate with single, unemployed mothers in the workforce. The free advice this spreadsheet has for us is clear: get those degrees before you start your family and employment/family outcomes improve (the exception being bachelor’s degrees in communications, which actually correlate negatively with the ratio of married, employed new mothers).
Highlight #4: Racial integration


As mentioned last time on “Correlate… ALL THE THINGS”, the correlations between economic and social outcomes and race in the United States of America are pretty well understood already (and, frankly, they are as much a shameful indictment of historical and modern policies, institutions, and attitudes as they are a social studies lesson). I’ve got a limited amount of your attention left to work with, so I won’t rehash.
What is interesting to look at in a given city, and an analysis technique I hope to make popular, is the convergence of the race total ratios, which, in correlation coefficient spreadsheet form, can be referred to as a “Racial Integration Triangle” for a study area.
So, how’s New York doing at racial integration? It could stand to improve. Whites are integrated with the smaller populations of minority races, but Blacks and Whites are the least integrated populations in NYC (theirs are the only population ratios with a coefficient less than -0.5; one could infer almost active avoidance from that kind of correlation…). Whites and Hispanics (who make up the largest minority in NYC) are not much better. On the brighter side, Biracial, Triracial, and Asian populations are living in a state of Pearson’s r Coefficient harmony (Pearson’s r-mony!). The coefficients between most other races indicate they co-inhabit tracts with a frequency at least consistent with a random “melting pot”.
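If you would like to build a “Racial Integration Triangle” for your own study area, the idea can be sketched in a few lines of Python: take each group’s share of the population in every tract, correlate every pair of groups once, and flag the pairs that fall below -0.5 (the cutoff mentioned above). The group labels, the five tracts of shares, and the function names here are hypothetical placeholders, not NYC data and not the code behind the spreadsheet.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def integration_triangle(shares):
    """Correlate every unordered pair of groups exactly once.

    `shares` maps a group label to that group's share of the population
    in each tract. The result holds one entry per pair, which is the
    upper triangle of the full correlation matrix without the diagonal.
    """
    return {
        (a, b): pearson_r(shares[a], shares[b])
        for a, b in combinations(shares, 2)
    }


# Hypothetical population shares across five tracts (illustrative only).
shares = {
    "group_a": [0.8, 0.6, 0.3, 0.1, 0.2],
    "group_b": [0.1, 0.2, 0.5, 0.8, 0.6],
    "group_c": [0.1, 0.2, 0.2, 0.1, 0.2],
}

triangle = integration_triangle(shares)

# Pairs whose shares move strongly in opposite directions across tracts.
flagged = [pair for pair, r in triangle.items() if r < -0.5]

for pair, r in triangle.items():
    print(pair, round(r, 2))
print("least integrated pairs:", flagged)
```

In this toy data only `group_a` and `group_b` get flagged, because tracts with a high share of one have a low share of the other. Keep in mind that shares of a whole are mechanically a little negatively correlated with each other, which is one reason a cutoff well below zero is a more useful flag than zero itself.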
Highlight #5: Other quick hits
- Democratic party affiliation correlates positively with household size while GOP and “Other” affiliations correlate negatively. I predict continued growth for Dems in NYC just based on lineage.
- Retail access and other forms of service provision (number of broadband providers, for example) correlate more strongly with income than I’ve seen in other cities. Takeaway: NYC’s vendors are better positioned to sell to the customers they want most.
- Two age bands favor rented over owned accommodations at a rate inconsistent with their positioning in the age spectrum: children under 5 and people 20 to 34. Besides these groups, people in NYC favor renting most while they are young (or have young members of the household) and transition more toward ownership as they age.
- A very limited set of items correlates positively with the population of people who are not in the labor force: the unemployed ratio (because after certain criteria are fulfilled people eventually age out of the unemployed population to the not in the labor force population), the ratio of civilian labor to military labor (possibly because all unemployed members of the labor force are classified as civilian, which causes issues with some analyses), the ratio of the population that is in the education and healthcare industry, and the ratio of the foreign born population lacking a high school education.
Thank you for coming back for more in Part II of “Correlate… ALL THE THINGS”. I hope you enjoyed this publication enough to read parts III and IV. A user friendly and feature selection capable version of this analysis is coming soon to Sidewalk Insights. Special acknowledgements to the timeless Brian Parr, supportive Nicholas Tomasino, knowledgeable Brian Wize, hilarious Bob Gurnett, and implacable Lauren Nguyen for their contributions in compiling this post.
Resources to learn more
- Full correlation coefficient dataset as a Google Sheet
- GitHub repo containing the ongoing and up to date scripts, geographic information, and raw CSV version of the correlation data
About the Author: Matt Barr works at mySidewalk helping to invent the technology that betters our understanding of communities. Civics, democracy, computing, and the great outdoors are his passions.

