
DATA CLEANING PROJECT

Nashville Housing Data Cleaning

A technical SQL project focused on transforming raw housing data into a usable format through standardization, parsing, and deduplication.

Data Cleaning · Advanced String Functions · CTEs · Self Joins · CASE Statements

Project Overview

This project addresses common data quality issues found in the Nashville Housing dataset. Real-world data is often messy, containing inconsistencies in date formats, missing values, and compound fields that make analysis difficult.

By using advanced SQL techniques including PARSENAME, SUBSTRING, and ROW_NUMBER(), this project transforms the raw data into a clean, normalized schema ready for analytical processing.

Usage Instructions

  1. Clone the repository via Git.
  2. Import the Nashville Housing dataset into your SQL Server environment (a quick sanity check is sketched after this list).
  3. Execute the cleaning scripts sequentially to transform the dataset.
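
After importing, it can help to confirm the table is populated before running the cleaning scripts. This is a minimal sketch that assumes the database and table names used throughout the scripts below (NashvilleDataCleaning and NashvilleHousing):

SQL
-- Sanity check: confirm the import succeeded before running the cleaning scripts
SELECT COUNT(*) AS ImportedRows
FROM NashvilleDataCleaning.dbo.NashvilleHousing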

Key Cleaning Steps & SQL Queries

1. Standardize Date Format

Adds a 'SaleDateConverted' column and populates it with 'SaleDate' cast to the Date type, dropping the time component so all sale dates share a consistent format.

SQL
ALTER TABLE NashvilleHousing
ADD SaleDateConverted Date;

UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(Date, SaleDate)

SELECT SaleDateConverted FROM NashvilleHousing -- verify the new column is populated

2. Populate Missing Property Addresses

Uses a self join on ParcelID to fill in null 'PropertyAddress' values: when another row with the same ParcelID has an address, that address is copied into the missing field.

SQL
-- Copy the address from the matching row (b) into the row that is missing one (a)
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleDataCleaning..NashvilleHousing a 
JOIN NashvilleDataCleaning..NashvilleHousing b 
    ON a.ParcelID = b.ParcelID
    AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL
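
Before running the update, the same join can be used as a read-only preview of which null addresses have a match on ParcelID. A minimal sketch derived from the query above, not the repository's exact script:

SQL
-- Preview: rows missing an address alongside the address found on a matching ParcelID
SELECT a.ParcelID, a.PropertyAddress, b.PropertyAddress AS MatchedAddress
FROM NashvilleDataCleaning..NashvilleHousing a
JOIN NashvilleDataCleaning..NashvilleHousing b
    ON a.ParcelID = b.ParcelID
    AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL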

3. Breaking Out Property Address (Substring)

Splits the single 'PropertyAddress' column into separate 'Address' and 'City' columns using SUBSTRING and CHARINDEX.

SQL
SELECT 
SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1) AS Address,
SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) +1, LEN(PropertyAddress)) AS City
FROM NashvilleDataCleaning..NashvilleHousing

-- (Update statements omitted for brevity, see repo for full code)
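
For reference, here is a minimal sketch of what those update statements might look like; the split column names (PropertySplitAddress, PropertySplitCity) are illustrative and may differ from the repo:

SQL
-- Illustrative sketch: add the split columns (run the ALTER as its own batch), then populate them
ALTER TABLE NashvilleHousing
ADD PropertySplitAddress nvarchar(255), PropertySplitCity nvarchar(255);

UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1),
    PropertySplitCity = SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) +1, LEN(PropertyAddress))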

4. Breaking Out Owner Address (Parsename)

A more concise way to split delimited strings than nested SUBSTRING calls. PARSENAME splits on periods, so the commas in 'OwnerAddress' are first replaced with periods before parsing out the Address, City, and State parts.

SQL
SELECT
-- PARSENAME counts parts from the right: 3 = street address, 2 = city, 1 = state
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3) AS Address,
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2) AS City,
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1) AS State
FROM NashvilleDataCleaning..NashvilleHousing
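
As with the property address, the parsed pieces can be written back into new columns. A minimal sketch using illustrative column names (OwnerSplitAddress, OwnerSplitCity, OwnerSplitState), which may not match the repo exactly:

SQL
-- Illustrative sketch: add the owner split columns (run the ALTER as its own batch), then populate them
ALTER TABLE NashvilleHousing
ADD OwnerSplitAddress nvarchar(255), OwnerSplitCity nvarchar(255), OwnerSplitState nvarchar(255);

UPDATE NashvilleHousing
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
    OwnerSplitCity = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
    OwnerSplitState = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)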

5. Standardize 'SoldAsVacant' Field

Harmonizes the 'SoldAsVacant' field by changing 'Y' and 'N' values to 'Yes' and 'No' for better readability and consistency.

SQL
UPDATE NashvilleHousing
SET SoldAsVacant = CASE 
    WHEN SoldAsVacant = 'Y' THEN 'Yes'
    WHEN SoldAsVacant = 'N' THEN 'No'
    ELSE SoldAsVacant
END
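
A quick way to confirm the update took effect is to check the remaining distinct values and their counts. A simple verification query, not part of the repo listing above:

SQL
-- Verify: after the update, only 'Yes' and 'No' should remain
SELECT SoldAsVacant, COUNT(*) AS ValueCount
FROM NashvilleDataCleaning..NashvilleHousing
GROUP BY SoldAsVacant
ORDER BY ValueCount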

6. Remove Duplicates

Identifies and removes duplicate rows using a CTE and the ROW_NUMBER() window function, partitioning on the fields that should uniquely identify a sale (ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference); any row numbered above 1 within a partition is a duplicate and is deleted.

SQL
WITH RowNumCTE AS(
    SELECT *,
    ROW_NUMBER() OVER (
        PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
        ORDER BY UniqueID
    ) row_num
    FROM NashvilleDataCleaning.dbo.NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1
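
Because the DELETE is destructive, it can help to run the same CTE with a SELECT first to inspect the rows that would be removed. A minimal preview sketch based on the query above:

SQL
-- Preview the duplicates (row_num > 1) before deleting them
WITH RowNumCTE AS(
    SELECT *,
    ROW_NUMBER() OVER (
        PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
        ORDER BY UniqueID
    ) row_num
    FROM NashvilleDataCleaning.dbo.NashvilleHousing
)
SELECT * FROM RowNumCTE
WHERE row_num > 1
ORDER BY PropertyAddress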

7. Delete Unused Columns

Removes raw columns that have been split or converted to keep the dataset clean and lightweight.

SQL
ALTER TABLE NashvilleDataCleaning.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate

Outcomes & Final Thoughts

This project successfully transformed a raw, messy dataset into a clean, structured resource suitable for analysis. Key improvements include:

  - Data Integrity: Populated missing property address data using reference points within the dataset itself.
  - Usability: Split complex address strings into atomic columns (Address, City, State) for more granular geographic analysis.
  - Consistency: Standardized boolean fields ('Yes'/'No') and date formats across the entire table.
  - Optimization: Removed duplicate records and unnecessary columns to improve query performance and data relevance.

The techniques demonstrated here (Self Joins, CTEs, String Parsing) are essential for any data analyst working with real-world data pipelines.

© 2026 Sai Suraj Matta. All rights reserved.