
Josh Hoblitt engineered and modernized cloud infrastructure and storage systems across the lsst-it/k8s-cookbook and lsst-it/lsst-control repositories, focusing on reliability, security, and maintainability. He migrated clusters to RKE2, unified NFS and Ceph storage with encryption, and automated S3 credential rotation using Kubernetes, Helm, and Puppet. Josh implemented observability improvements with Grafana dashboards and Prometheus metrics, streamlined data pipelines by standardizing S3ND deployments, and enhanced network configuration using YAML and Infrastructure as Code practices. His work demonstrated deep expertise in DevOps and configuration management, delivering robust, scalable solutions that improved operational efficiency and enabled secure, auditable data access across environments.

October 2025 monthly summary covering two repositories: lsst-it/k8s-cookbook and lsst-it/lsst-control. Highlights include decommissioning Velero configurations, automating PR promotion and backport workflows with Mergify, updating branching strategies for cluster configurations, implementing Alloy IP address management, and several stability and security fixes (YAML indentation, Ceph OSD path prefixes, Keycloak image repo, and dependency maintenance).
2025-09 Monthly Summary: Delivered major reliability, observability, and modernization improvements across the k8s-cookbook and lsst-control repositories. The work enhanced incident diagnosis, availability, and operational efficiency through dashboard enhancements, platform upgrades, and modernization efforts (including ANTU).
2025-08 monthly summary for lsst-control and k8s-cookbook focusing on delivering business value through feature upgrades, improved observability, and network/infrastructure reliability across sites.
July 2025 performance summary: Delivered a set of reliability, security, and data-access improvements across two repositories (lsst-control and k8s-cookbook) with a focus on simplifying maintenance and accelerating deployment of robust data pipelines.
Key features delivered:
- S3ND service optimization and standardization in lsst-control: tuned bandwidth limits and timeouts, standardized on s3nd across configurations/tests, upgraded to latest image versions, and aligned endpoint mappings; tests stabilized as configurations moved from legacy daemon implementations to s3nd. Commit series include upgrades to v1.6.x–v1.7.x and endpoint/name refinements (examples: 4f5feb26..., 0c367ebe..., cbe09720..., 15d67537..., 029dbd15..., d2d4a6ed...).
- NFS data path migration to /data: migrated NFS exports/mountpoints from /ccs-data to /data across all nodes, with test configurations adjusted to reflect new paths and host export targets (commits: 75c182a6..., d43e52dc..., 420ca7af..., 453d20a3..., 56db350a...).
Key bugs fixed:
- RGW health and routing stability in k8s-cookbook: reduced RGW pool pg_num to address too many PGs per OSD, fixed ingress service naming for RGW routing, and cleaned up CephBucketTopic defaults to align with CRD behavior (commits: a3afc0bc..., cffd81d0..., 3a6d115e...).
- RGW erasure coding tweaks for small clusters: adjusted data/coding chunks to support ~5 OSD clusters (a67cf784...).
Major additional improvements:
- LSST-Cam S3 credential rotation across all deployments: introduced CephObjectStoreUser and ExternalSecret resources to rotate AWS keys for lsstcam in Ruka, Konkong, and Elqui, followed by completion of key rotation and cleanup of old credentials (commits: c3b39d0c..., 5b2b3d77..., 0277202b..., f6d969b1..., 9c53041e...).
- CephBucketTopic and Kafka integration: CephBucketTopic and ExternalSecret resources configure Kafka endpoints across components, enabling bucket notification delivery (commits: d12b79fc..., aa938aa9...).
- Mimir deployment migration to OBCs and Kustomize: provisioning migrated to Object Bucket Claims and replaced mimir-pre bundle with Kustomize (e76761c8...).
- O11y RGW cross-namespace watch (Loki): RGW instance allowed to watch Loki namespace to improve cross-component observability (2f6030de...).
- Additional LFA-related RGW work included new RGW users calib, rubintv, and saluser, and ongoing Kubernetes/OCS improvements.
Overall impact and accomplishments:
- Improved data access reliability and performance, aligning storage and compute configurations with current S3ND and NFS best practices.
- Strengthened security posture via automated rotation of credentials and tighter access controls (ExternalSecret + CRD-driven workflows).
- Increased observability and resilience with cross-namespace Loki integration and CRD-driven event notifications to Kafka.
- Reduced operational risk by tuning RGW health parameters and fixing routing across the cluster, enabling smoother customer data flows.
Technologies/skills demonstrated:
- Kubernetes, CRDs, ExternalSecrets, Kustomize, Object Bucket Claims (OBCs), Loki, Ceph RGW, S3ND, NFS, and CI/test infrastructure.
- End-to-end configuration management, migration planning, and cross-team coordination across multiple clusters and deployments.
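The two RGW pool fixes in the July summary (lowering pg_num to clear a "too many PGs per OSD" condition, and resizing erasure-coding chunks for clusters of about 5 OSDs) come down to simple arithmetic. The sketch below illustrates that arithmetic only; it is not taken from the repositories. The pool names, pg_num values, and the k/m choices are hypothetical, and the limit of 250 is an assumption standing in for whatever `mon_max_pg_per_osd` is set to on the cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str    # hypothetical pool name, for illustration only
    pg_num: int  # number of placement groups in the pool
    shards: int  # replica count, or k + m for an erasure-coded pool


def pgs_per_osd(pools, num_osds):
    """Average number of PG shards each OSD has to carry."""
    if num_osds <= 0:
        raise ValueError("num_osds must be positive")
    return sum(p.pg_num * p.shards for p in pools) / num_osds


def ec_profile_fits(k, m, failure_domains):
    """An erasure-coded PG needs k + m distinct failure domains for its chunks."""
    return k + m <= failure_domains


# Assumed per-OSD limit; check the cluster's mon_max_pg_per_osd setting.
ASSUMED_MAX_PG_PER_OSD = 250
NUM_OSDS = 5

# Hypothetical layout before and after lowering pg_num on the data pool.
before = [Pool("rgw.data", 512, 5), Pool("rgw.index", 64, 3)]
after = [Pool("rgw.data", 128, 5), Pool("rgw.index", 64, 3)]

for label, pools in (("before", before), ("after", after)):
    load = pgs_per_osd(pools, NUM_OSDS)
    status = "ok" if load <= ASSUMED_MAX_PG_PER_OSD else "too many PGs per OSD"
    print(f"{label}: {load:.1f} PG shards per OSD ({status})")

# With 5 OSDs as failure domains, k=3/m=2 fits but k=4/m=2 does not.
print(ec_profile_fits(3, 2, NUM_OSDS), ec_profile_fits(4, 2, NUM_OSDS))
```

In this model, reducing pg_num on the largest pool is the most direct way to bring the per-OSD load under the limit, and an erasure-coding profile is only placeable when k + m does not exceed the number of failure domains (OSDs or hosts, depending on the CRUSH rule).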
June 2025 monthly summary for lsst-control (lsst-it/lsst-control). Focused on upgrading and hardening the S3ND image, performance improvements for uploads, and enhancements to the test gateway to expand testing capabilities and reliability. The work involved coordinated image version bumps, environment hardening, and test gateway integration across cluster components to improve data ingest reliability and test throughput.
May 2025 monthly summary: Achievements across the k8s-cookbook and lsst-control repositories include secure CephObjectStore access via 1Password integration, secrets-driven Kafka authentication for CephObjectStore, multi-cluster S3-compatible daemon deployment, governance enhancements with a block-merge-commits workflow, and storage/testing infrastructure improvements. These initiatives reduced risk, improved operational reliability, and standardized testing and bucket management across clusters.
April 2025 performance summary for lsst-it/k8s-cookbook and lsst-it/lsst-control focused on secure, scalable cluster operations, storage modernization, and CI improvements. Key storage/cluster work delivered across the two repositories includes:
(1) Rook Ceph upgrade and security hardening (k8s-cookbook): upgraded image tags to ghcr.io/lsst-it/rook:v1.17.0-lsst2, bumped rook-ceph to v1.17.0 and later v1.17.1, enabled OSD encryption, aligned authentication mechanisms, and migrated CephBucketTopic credentials to Kubernetes secrets;
(2) Rook Ceph demo configurations for elqui and konkong clusters (k8s-cookbook): added rook-ceph-demo with all elqui/konkong NFS exports to enable cross-project storage access via a shared library;
(3) Ayekan cluster modernization (k8s-cookbook): migrated from RKE1 to RKE2 and decommissioned monitoring, with a corresponding increase in pod density (to 250) and test updates;
(4) Fleet deployment stability and CI (k8s-cookbook): fixed fleet.yaml misconfigurations and cleaned duplicates; introduced a fleet bundles CI workflow and refined chart lint/bundle validation naming;
(5) RKE2 upgrade and capacity optimization (lsst-control): migrated ayekan to RKE2 and increased pod density on ayekan/manke clusters, plus network configuration data format modernization to YAML.
March 2025 monthly summary: Delivered storage modernization, security hardening, and cluster stability improvements across k8s-cookbook and lsst-control. Key changes: migrated storage paths to the newer nfs1, optimized Grafana resource usage for reliable observability, enabled Ceph OSD encryption with RGW tuning for improved data security and performance, introduced a new Ceph Object Store config 'lfa' with OBCs to streamline multi-service data provisioning, and upgraded the RKE2 cluster in the ruka environment to pick up the latest features and fixes. These changes reduce operational risk, improve security posture, and unlock more scalable storage and monitoring capabilities.
February 2025 monthly summary for infrastructure work across lsst-it/k8s-cookbook and lsst-it/lsst-control. Focused on storage unification, cluster modernization, security hardening, and networking/ingress enhancements. Key initiatives include migrating from RKE1 to RKE2, relocating NFS exports under Elqui for unified management, Ceph tuning with OSD encryption, and upgrading Rook Ceph. Implemented modern ingress and authentication (cert-manager, Traefik, Keycloak) with IPAddressPool improvements. Expanded shared storage across roles (NFS from Elqui) and enhanced IP space management (IPAddressPool relocation). Completed network and role refinements in lsst-control, including bonding, DHCP pool hardening, and retirement of older EL7 support. Added RubinObs components and notifications to improve data access and observability. These changes deliver tangible business value: more reliable deployments, tighter security, scalable storage, and faster secure access to applications.
December 2024: Delivered major infrastructure modernization across Kubernetes ingress, storage, and cluster tooling to improve reliability, security, and scalability. Implemented ingress modernization with ingressClassName, Traefik as the ingress provider, and IPAddressPool support; consolidated object storage and access controls by decommissioning deprecated RGW instances, migrating to LFA RGW, and replacing pool quotas with bucket quotas while tuning pool allocation. Enhanced Ceph reliability and observability with an extended exporter, global tuning, and OSD encryption, plus storage tuning (single MDS and PG sizing) and disabling Ceph rook orchestration. Implemented TLS automation via cert-manager and adjusted data governance by reducing retention to 180 days and cleaning up legacy constraints and net-attach definitions. Completed Kubernetes cluster modernization by migrating from RKE1 to RKE2, and advanced Pillan network/config improvements with an RKE2 deployment upgrade. These changes delivered improved traffic routing, data governance, security, and operational stability for production workloads and positioned the platform for future scale.
November 2024 monthly summary for development work across lsst-it repositories. Delivered cross-cluster S3 daemon management, enhanced data-transfer integration, infrastructure reliability improvements, and standardized configuration naming. Expanded Ceph Object Store user provisioning in Elqui, and improved secure ingress exposure for S3 services (Chonchon/Elqui) with embargo support. Completed fleet/vault alignment and cleanup to reduce operational risk.