K-Means Clustering of homes and neighborhoods in Minneapolis (2024)

Tenzin Kunsang

Sep 9, 2020

This is (hopefully) the first of many blog posts I will be writing on Medium!

For the last few weeks, I have been working on a machine learning project as part of IBM’s Professional Data Science Certificate on Coursera. In this post, I will summarize my report on how I used the K-Means clustering method to understand the city of Minneapolis in terms of its neighborhoods, the venues in each neighborhood, Zillow’s Home Value Index (HVI), data on individual homes (number of bedrooms, bathrooms, year built, sale price, etc.), and the venues near each home.

This post is for:

Buyers looking for a new home in Minneapolis
Property Investors, realtors and agents
Data Scientists looking for ideas to use/get/scrape real estate data
The curious ones

I used a Python library called BeautifulSoup to scrape the most up-to-date list of neighborhoods in Minneapolis.
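
As a rough illustration of this kind of scraping (not the project's actual code), the sketch below parses a small inline HTML fragment with BeautifulSoup. The HTML structure, the `neighborhoods` id, and the `extract_neighborhoods` helper are all assumptions made for the example; a real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the scraped page.
# The real page layout is not shown in this post.
html = """
<ul id="neighborhoods">
  <li><a href="/wiki/Kenwood">Kenwood</a></li>
  <li><a href="/wiki/East_Isles">East Isles</a></li>
  <li><a href="/wiki/Lowry_Hill">Lowry Hill</a></li>
</ul>
"""


def extract_neighborhoods(page_html):
    """Return the link text of every entry in the (assumed) neighborhoods list."""
    soup = BeautifulSoup(page_html, "html.parser")
    listing = soup.find("ul", id="neighborhoods")
    return [link.get_text(strip=True) for link in listing.find_all("a")]


neighborhoods = extract_neighborhoods(html)
print(neighborhoods)
```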

To cluster the neighborhoods based on real estate information such as each neighborhood’s median home price, the estimated change in home price in five and ten years, and more, I retrieved Minneapolis’s home prices and values by neighborhood from Zillow.

To look further into particular homes in Minneapolis, I used a web scraping platform called Apify to get data on each home within a neighborhood, such as the number of bedrooms, number of bathrooms, home size, year built, and more.

I also used the Foursquare API to retrieve the venues in each neighborhood. I used this API again to get the nearby venues for each home, to get a better idea of how homes within a neighborhood differ from each other.

To create choropleth maps, I found a GeoJSON file of the neighborhoods of Minneapolis on GitHub, which I later edited to match the neighborhood names on Wikipedia.
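
Matching names between a GeoJSON file and another list is easy to get wrong by eye, so here is a minimal sketch of how the mismatches could be found with the standard library. The GeoJSON snippet, the `name` property key, and the list of names are made up for the example; the real file may use a different key.

```python
import json

# Hypothetical GeoJSON snippet; the "name" property key is an assumption.
geojson_text = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"name": "Kenwood"}, "geometry": null},
  {"type": "Feature", "properties": {"name": "Lowry Hill East"}, "geometry": null},
  {"type": "Feature", "properties": {"name": "E. Isles"}, "geometry": null}
]}
"""

# Made-up reference list standing in for the scraped neighborhood names.
reference_names = ["Kenwood", "Lowry Hill East", "East Isles"]

geo = json.loads(geojson_text)
geo_names = {feature["properties"]["name"] for feature in geo["features"]}

# Names in the reference list that have no polygon in the GeoJSON file.
missing_from_geojson = sorted(set(reference_names) - geo_names)
# Names in the GeoJSON file that need renaming to match the reference list.
unmatched_in_geojson = sorted(geo_names - set(reference_names))

print(missing_from_geojson, unmatched_in_geojson)
```

Any name that shows up in either list would leave a blank polygon or a dropped row on the choropleth, so both lists should be empty before mapping.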

Part A

Here are the first five rows of the neighborhood data for Minneapolis, which give a general sense of the homes in each neighborhood and of the estimated home values in the future.

[Image: table of the first five rows of the neighborhood data]

Using the Foursquare API, I got a total of 220 unique venue categories across 63 neighborhoods, including school, museum, bar, restaurant, shopping mall, park, and many more.
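
Venue categories are text, so they have to be turned into numbers before clustering. One common way, sketched below with pandas, is to one-hot encode the categories and average them per neighborhood. This is an assumption about the preprocessing, not a copy of the project's code, and the rows and column names are invented for the example.

```python
import pandas as pd

# Made-up venue rows; the column names are assumptions for this sketch.
venues = pd.DataFrame({
    "Neighborhood": ["Kenwood", "Kenwood", "East Isles", "East Isles", "East Isles"],
    "Venue Category": ["Park", "School", "Bar", "Park", "Restaurant"],
})

# Number of distinct venue categories in the data.
n_categories = venues["Venue Category"].nunique()

# One column per category, 1.0 where the venue belongs to that category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Averaging per neighborhood gives the share of each category among its venues.
venue_profile = onehot.groupby("Neighborhood").mean()
print(venue_profile)
```

Each row of `venue_profile` is then a numeric description of a neighborhood that can sit alongside the real estate columns.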

I merged the venue data for each neighborhood with the real estate data to cluster the neighborhoods. Using the elbow method, I found that k = 5 is the optimal value of k for the K-Means clustering algorithm.
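
For readers unfamiliar with the elbow method, here is a minimal sketch using scikit-learn. It runs on synthetic data (three artificial blobs), not on the Minneapolis data, so the elbow appears at k = 3 here rather than k = 5. The idea is to fit K-Means for a range of k values and look for the point where the inertia (within-cluster sum of squared distances) stops dropping sharply.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for several values of k and record the inertia of each fit.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting `inertias` against k would show the bend of the "elbow"; the choice of k is still a judgment call when the bend is not sharp.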

The table below shows the average values for each cluster group, ordered by the HVI column.

[Image: table of average values for each neighborhood cluster]

The final clusters of the neighborhoods are shown in the picture below. Due to the unfortunate default colors, here are the clusters with their colors (order matched with the table above):

Cluster 4: Light green (upper left and center)
Cluster 1: Purple
Cluster 3: Turquoise (middle left)
Cluster 0: Red
Cluster 2: Blue (only one neighborhood: Kenwood, Minneapolis)

[Image: final clusters of the neighborhoods]

Part B

From Apify, I retrieved data on a total of over 800 homes across all the neighborhoods in Minneapolis.

I repeated the same steps for the homes: I found the nearby venues for each home, found the optimal k value (k = 6), and clustered the homes based on bedrooms, bathrooms, sqft (living area), price (asking price of the real estate), year built, and the nearby venues of each home. The table below shows how the 6 clusters differ by the number of bathrooms, bedrooms, living space, house sale price, and year built. Note that living area (sqft) increases with price. Generally, newer homes (by year built) seem to be more expensive than older homes. Then again, we will have to run more analysis to decide if these observations are noteworthy.

[Image: table of average values for each of the 6 home clusters]

The seemingly positive correlation between the sale price and living area is confirmed in the picture below (correlation: 0.81).

[Image: sale price versus living area]
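
A correlation figure like the 0.81 above is presumably a Pearson coefficient (an assumption, since the post does not say which measure was used). The sketch below computes it from scratch on made-up numbers, so the value it produces has nothing to do with the real result.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up living areas (sqft) and prices (USD), for illustration only.
sqft = [900, 1200, 1500, 2000, 2600]
price = [180_000, 260_000, 250_000, 410_000, 520_000]

r = pearson(sqft, price)
print(round(r, 2))
```

A value near 1 means price tends to rise with living area; it says nothing about whether one causes the other.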

The map below displays the clusters of homes against a choropleth visualization, where the darker the shade of red of a neighborhood, the higher the number of venues in it.

Again, for clarification, the clusters are colored as (order matched with the table above):

Cluster 0: Red
Cluster 4: Light green
Cluster 2: Blue
Cluster 1: Purple
Cluster 5: Orange
Cluster 3: Turquoise (only two homes)

[Image: map of home clusters over a choropleth of venue counts by neighborhood]

It is important to note that the data used for clustering the homes is not standardized. As seen on the map, many homes belong to Cluster 0 (red, 501 homes). Cluster 4 has 181 homes, Cluster 2 has 81 homes, Cluster 1 has 24 homes, Cluster 5 has 14 homes, and Cluster 3 has 2 homes. Coincidentally, the order of the number of homes in each cluster matches the order of the clusters’ average home sale price. We definitely want more data (and standardized features) to have a better understanding of whether the homes are grouped optimally.
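
To see why standardization matters for K-Means, the sketch below compares how much of the squared distance between two homes comes from price, before and after z-scoring each column. The three homes are invented, and the helper names are hypothetical. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
from statistics import mean, pstdev

# Made-up homes: (bedrooms, sqft, price in USD).
homes = [
    (2, 900, 180_000),
    (3, 1500, 250_000),
    (5, 2600, 520_000),
]


def standardize(rows):
    """Z-score every column: subtract its mean, divide by its population std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds)) for row in rows]


def price_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the price column."""
    squared_diffs = [(x - y) ** 2 for x, y in zip(a, b)]
    return squared_diffs[2] / sum(squared_diffs)


scaled = standardize(homes)
share_raw = price_share(homes[0], homes[1])
share_scaled = price_share(scaled[0], scaled[1])
print(round(share_raw, 4), round(share_scaled, 4))
```

On the raw values, price accounts for almost the entire distance simply because its numbers are the largest, which would explain clusters that line up with price.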

Since Cluster 3 is hard to notice because of the (again) unfortunate colors, here’s a table showing the two homes in the neighborhoods East Isles and Lowry Hill.

[Image: table of the two homes in Cluster 3]

This project can be taken even further by finding more data on homes. Some parameters that many people consider when buying a home (that are found to affect the property value significantly) are usable space, upgrades, and local market among others. The program I wrote to scrape data such as commute and walk scores from the Zillow website had some web crawling issues. There might be APIs that also provide the year that a home was renovated, condition of the home, view/commute/walk scores — all of which are important factors to consider for buyers and agents alike.

It would also be a convenient next step to predict home prices using regression analysis. I would try to find more data for each cluster. Currently, the number of homes differs widely between clusters. It might also help to standardize the dataset used for clustering.

It might have been easier to do this analysis on bigger cities with more available data. However, Minnesota has now become a second home for me, and Minneapolis is the biggest city I could get. Moreover, Minneapolis is the second most densely populated city in the Midwest region behind Chicago. The city, along with St. Paul, makes up the ‘Twin Cities.’

It would have been interesting to look into the Twin Cities in general as well :)

Another thought I had for this project was to use Natural Language Processing tools to look at the descriptions of each home. I like word clouds for readability. Since this step is only a few lines of code, stay tuned to see it eventually on my GitHub.
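
A word cloud is driven by word frequencies, so a first step could look like the sketch below. The descriptions and the tiny stop-word list are made up, and `word_frequencies` is a hypothetical helper; a plotting library would then size each word by its count.

```python
import re
from collections import Counter

# Made-up listing descriptions, for illustration only.
descriptions = [
    "Charming bungalow with updated kitchen and hardwood floors.",
    "Updated condo near the lake with hardwood floors and a new kitchen.",
]

# A tiny, hand-picked stop-word list; a real one would be much longer.
STOPWORDS = {"a", "and", "the", "with", "near"}


def word_frequencies(texts):
    """Count lowercase words across all texts, skipping stop words."""
    words = []
    for text in texts:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(word for word in words if word not in STOPWORDS)


freqs = word_frequencies(descriptions)
print(freqs.most_common(4))
```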

FAQs

What is clustering in real estate?

In real estate terms, a cluster typically refers to a group of homes or businesses that are close together, often with shared amenities such as green space or parking. Cluster developments are often created through zoning ordinances promoting higher-density and mixed-use development.

What is the difference between K-means and K nearest neighbors?

KNN is a supervised learning algorithm, so you need labelled data. K-means is an unsupervised learning algorithm: it discovers structure in unlabelled data by dividing it into k groups, where k is chosen by the user.

What is the property of K-Means clustering?

K-means is a centroid-based clustering algorithm, where we calculate the distance between each data point and each centroid and assign the point to the cluster of the nearest centroid. The goal is to identify K groups in the dataset.
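
The centroid-based idea can be sketched in plain Python as two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is a bare-bones illustration on made-up points, not a replacement for a library implementation, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Made-up points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```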

What does K stand for in k-means clustering?

In K-means, k signifies the number of clusters (groups) that we want to form.

What type is k-means clustering?

K-Means clustering is an unsupervised learning algorithm. Unlike in supervised learning, there is no labeled data. K-Means divides objects into clusters whose members are similar to each other and dissimilar to the objects belonging to other clusters. 'K' is the number of clusters.

When should you not use k-means clustering?

K-means clustering is not well-suited for data sets with uneven cluster sizes or non-linearly separable data, as it may be unable to identify the underlying structure of the data in these cases.

What is the weakness of K-means clustering?

The weakness of k-means clustering is that we don't know how many clusters we need by just running the model. We need to test ranges of values and make a decision on the best value of k.

Where is k-means used in real life?

K-means is used across many fields in a wide variety of use cases; some examples of clustering use cases include customer segmentation, fraud detection, predicting account attrition, targeting client incentives, cybercrime identification, and delivery route optimization.

What is an example of clustering?

For example, if you're clustering based on movie genres, a specialized measure might decide that “action” and “adventure” are closer to each other than “action” and “romance”.

What does clustering mean in scatter plots?

Scatter Plot: A scatter plot is a graph displaying data points relating two variables. Each piece of data is its own point on the scatter plot and the points are not connected. Cluster: A cluster in a scatter plot is a group of points that follow the same general pattern.

What does clustering refer to?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some specific sense defined by the analyst) to each other than to those in other groups (clusters).