DATA CLEANING PROJECT
A technical SQL project focused on transforming raw housing data into a usable format through standardization, parsing, and deduplication.
This project addresses common data quality issues found in the Nashville Housing dataset. Real-world data is often messy, containing inconsistencies in date formats, missing values, and compound fields that make analysis difficult.
By using advanced SQL techniques including PARSENAME, SUBSTRING, and
ROW_NUMBER(), this project transforms the raw data into a clean, normalized schema
ready for analytical processing.
Converts the 'SaleDate' column from a generic timestamp to a standardized Date format to ensure consistency.
ALTER TABLE NashvilleHousing
ADD SaleDateConverted Date;
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(Date, SaleDate)
SELECT SaleDateConverted FROM NashvilleHousing
Uses a Self Join to populate null Property Addresses. If a ParcelID exists with an address elsewhere, it populates the missing value.
UPDATE a
SET PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleDataCleaning..NashvilleHousing a
JOIN NashvilleDataCleaning..NashvilleHousing b
ON a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
WHERE a.PropertyAddress IS NULL
Splits the single 'PropertyAddress' column into individual 'Address' and 'City' columns using Substrings and Character Indexes.
SELECT
SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) -1) AS Address,
SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) +1, LEN(PropertyAddress)) AS City
FROM NashvilleDataCleaning..NashvilleHousing
-- (Update statements omitted for brevity, see repo for full code)
A more efficient way to split delimited
strings. Uses PARSENAME to split 'OwnerAddress' into Address, City, and State.
SELECT
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1)
FROM NashvilleDataCleaning..NashvilleHousing
Harmonizes the 'SoldAsVacant' field by changing 'Y' and 'N' values to 'Yes' and 'No' for better readability and consistency.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
WHEN SoldAsVacant = 'Y' THEN 'Yes'
WHEN SoldAsVacant = 'N' THEN 'No'
ELSE SoldAsVacant
END
Identifies and removes duplicate rows based on unique identifiers (ParcelID, Address, SalePrice, etc.) using a CTE and Window Functions.
WITH RowNumCTE AS(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY ParcelID, PropertyAddress, SalePrice, SaleDate, LegalReference
ORDER BY UniqueID
) row_num
FROM NashvilleDataCleaning.dbo.NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1
Removes raw columns that have been split or converted to keep the dataset clean and lightweight.
ALTER TABLE NashvilleDataCleaning.dbo.NashvilleHousing
DROP COLUMN OwnerAddress, TaxDistrict, PropertyAddress, SaleDate
This project successfully transformed a raw, messy dataset into a clean, structured resource suitable for analysis. Key improvements include:
Populated missing property address data using reference points within the dataset itself.
Split complex address strings into atomic columns (Address, City, State) for more granular geographic analysis.
Standardized boolean fields ('Yes'/'No') and date formats across the entire table.
Removed duplicate records and unnecessary columns to improve query performance and data relevance.
The techniques demonstrated here (Self Joins, CTEs, String Parsing) are essential for any data analyst working with real-world data pipelines.