NYC Housing Violation and Complaints Analysis
An exploratory data analysis project in R that compares Manhattan Housing Maintenance Code violations from HPD with housing-related 311 service requests (2022–2025) to understand where and when housing problems occur for tenants and how serious they are.
Abstract
This project uses NYC Open Data on Housing Maintenance Code violations (HPD) and housing-related 311 requests to understand where and when housing problems occurred for Manhattan tenants between 2022 and 2025, and how serious they were. The analysis compares complaint patterns with official violation records, looking at seasonal trends, spatial hotspots, severity codes, and building-level behavior to see how well 311 calls translate into enforcement.
What I did (methods)
- R + Quarto website: Organized the project as a Quarto site with index.qmd (intro/questions), data.qmd (data + cleaning), project_code_final.qmd (analysis & results), and conclusion.qmd, rendered to docs/ for GitHub Pages so the entire workflow can be re‑run.
- Data setup and cleaning: Loaded HPD violations and 311 CSVs for 2022–2025, filtered to Manhattan, parsed mixed‑format dates with lubridate::parse_date_time, created clean inspection/NOV/status and created/closed timestamps, and saved cleaned CSV/RDS copies as reproducible inputs for all plots.
- Category engineering: Used regular expressions on HPD NOVDescription and 311 Complaint.Type/Descriptor to map both datasets into common issue buckets (HEAT/HOT WATER, PLUMBING, PAINT/PLASTER, WATER LEAK, ELEVATOR, MOLD, PEST/SANITATION) so category‑level comparisons are on the same scale.
- Exploratory plots: Built bar charts of violations vs 311 counts by category, three‑year per‑quarter facets for both datasets, a 2024 violation alluvial (inspection → NOV → status month by class), time‑series views (daily series, combined monthly 311+HPD line plot, quarterly smoothing), top‑10 panels for blocks/streets/ZIPs/categories, and spatial sample maps showing complaint and violation hotspots across Manhattan.
- Building‑level severity view: Constructed a parallel‑coordinates plot for high‑activity buildings comparing total violations, 311 complaints, share of Class C violations, and share of rent‑impairing violations to see how “worst” buildings differ from others.
- Missingness and severity summaries: Summarised missing‑value patterns, counted violation classes (A/B/C/I), and quantified rent‑impairing vs non‑rent‑impairing violations to interpret how serious typical issues are across the dataset.
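The date parsing and category mapping steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the toy rows, the column names (created_raw, complaint_type, descriptor), the helper to_bucket, the regex patterns, the two date formats, and the fallback "OTHER" bucket are all assumptions made for the example. Only the bucket names and the use of lubridate::parse_date_time come from the project description.

```r
library(dplyr)
library(lubridate)
library(stringr)

# Toy rows standing in for a 311 extract; all values are invented.
requests <- tibble(
  created_raw    = c("01/15/2023 08:30:00 AM",
                     "2023-07-04 14:05:00",
                     "12/01/2024 11:45:10 PM"),
  complaint_type = c("HEAT/HOT WATER", "PLUMBING", "UNSANITARY CONDITION"),
  descriptor     = c("ENTIRE BUILDING", "WATER SUPPLY", "MOLD")
)

# Hypothetical helper: map free text to one of the shared issue buckets.
# The patterns are illustrative guesses; the first matching rule wins.
to_bucket <- function(text) {
  text <- str_to_upper(text)
  case_when(
    str_detect(text, "HEAT|HOT WATER")                  ~ "HEAT/HOT WATER",
    str_detect(text, "MOLD")                            ~ "MOLD",
    str_detect(text, "LEAK")                            ~ "WATER LEAK",
    str_detect(text, "PLUMB")                           ~ "PLUMBING",
    str_detect(text, "PAINT|PLASTER")                   ~ "PAINT/PLASTER",
    str_detect(text, "ELEVATOR")                        ~ "ELEVATOR",
    str_detect(text, "ROACH|MICE|RODENT|PEST|UNSANITARY") ~ "PEST/SANITATION",
    TRUE                                                ~ "OTHER"  # assumed fallback
  )
}

cleaned <- requests %>%
  mutate(
    # Try each order in turn; rows that match none of them become NA.
    created = parse_date_time(created_raw,
                              orders = c("mdY IMS p", "Ymd HMS")),
    bucket  = to_bucket(paste(complaint_type, descriptor))
  )

print(cleaned)
```

Because the rules are checked in order, a row whose text mentions both an unsanitary condition and mold lands in MOLD here; the real mapping may prioritise differently. Applying the same helper to the HPD description column would put both datasets on the same category scale.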
Key findings
- Complaints vs violations: HEAT/HOT WATER is the top issue in both datasets, but 311 receives far more heating complaints than HPD records heating violations, so many calls never turn into official enforcement actions; in contrast, categories like PAINT/PLASTER and PEST/SANITATION can have more violations than complaints, likely reflecting proactive inspections.
- Where and when problems concentrate: The same blocks, streets, ZIPs, and lat–long bands (for example Broadway, St Nicholas Avenue, Amsterdam Avenue, ZIPs 10031–10033 and 10027) show up as hotspots in both 311 and violations, and quarterly plots show strong winter peaks for HEAT/HOT WATER, while other categories have weaker or less consistent seasonality.
- Noise, batch events, and trends: Daily series are noisy; violation counts include a few one‑day spikes above 5,000 inspections that behave like batch events, and 311 daily counts form an almost solid band, so aggregating to months and quarters is needed to uncover a stable winter–summer pattern without any clear long‑run upward or downward trend in overall violations.
- Severity and rent‑impairing issues: Most violations are serious: hazardous Class B (about 238k) and immediately hazardous Class C (about 185k) together make up well over half of the records, and although only around 8 percent are marked rent‑impairing, buildings with a high Class C share also tend to have higher rent‑impairing shares, indicating a subset of buildings with highly concentrated severe problems.
- Building‑level behavior: The 2024 alluvial slice shows most violations progressing from inspection to NOV to status within nearby months and a pipeline dominated by Class B and C, while parallel coordinates plots reveal that the main signal distinguishing “worst” buildings is severity (Class C and rent‑impairing share) rather than just total counts of complaints or violations.
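Two of the computations behind these findings, rolling noisy daily records up to months and summarising severity per building, can be sketched as below. This is only an illustration under stated assumptions: the toy data and the column names (building_id, inspection_date, class, rent_impairing) are invented and do not reflect the real HPD schema or the project's code.

```r
library(dplyr)
library(lubridate)

# Toy violation records; all values are invented for illustration.
violations <- tibble(
  building_id     = c("A", "A", "A", "B", "B", "C"),
  inspection_date = as.Date(c("2024-01-03", "2024-01-20", "2024-02-11",
                              "2024-01-05", "2024-07-09", "2024-07-30")),
  class           = c("C", "B", "C", "A", "B", "C"),
  rent_impairing  = c(TRUE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Monthly counts: floor each date to the first of its month, then count rows.
monthly <- violations %>%
  count(month = floor_date(inspection_date, "month"), name = "n_violations")

# Per-building severity: total violations plus the share that are Class C
# and the share flagged rent-impairing.
by_building <- violations %>%
  group_by(building_id) %>%
  summarise(
    total                = n(),
    share_class_c        = mean(class == "C"),
    share_rent_impairing = mean(rent_impairing),
    .groups = "drop"
  )

print(monthly)
print(by_building)
```

In this toy data, building A has more violations than C, but C has the higher Class C share, which is the kind of distinction between volume and severity that the parallel coordinates view is meant to surface.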
What this shows about me
- Comfortable doing end‑to‑end exploratory data analysis in R (dplyr, lubridate, ggplot2, ggalluvial, GGally, etc.), including date parsing, category mapping, missingness analysis, and multi‑view visualizations.
- Able to design reproducible analytical projects with Quarto websites, clear file structure, and documented steps so others can rebuild the site from raw NYC Open Data.
- Can turn a long technical notebook into a clear narrative about housing quality and enforcement that is useful for tenants, advocates, or policy stakeholders.