Senior Data Engineer - EDA Datacenter Analytics and Observability
Job Description
This onsite Senior Data Engineer role in Austin, TX focuses on building analytics-ready data platforms to power observability, reliability analysis, and capacity forecasting for NVIDIA's EDA datacenters. The engineer in this position transforms large-scale telemetry and observability data into trusted datasets that enable data scientists, analysts, and engineers to derive insights across global CPU and GPU compute clusters.
Responsibilities
- Architect, implement, and sustain analytics-focused data pipelines that ingest, transform, and curate observability data from EDA datacenters.
- Develop reliable ingestion pipelines for metrics, logs, traces, and hardware health telemetry produced by large-scale CPU and GPU clusters.
- Collaborate with observability engineers to merge data from Prometheus, Grafana, Elastic/OpenSearch, and Spark-based platforms into unified analytical datasets.
- Model and organize data to support exploratory analysis, reliability modeling, forecasting, and long-term trend analysis.
- Build and optimize batch and streaming workflows enabling near real-time analytics and historical analysis.
- Implement data quality checks, validation frameworks, and monitoring to ensure analytical accuracy and consistency.
- Define data retention, aggregation, and enrichment strategies balancing analytical needs, system performance, and storage costs.
- Enable self-service analytics by improving data discoverability, documentation, and usability.
- Collaborate with data scientists and analysts to understand analytical requirements and evolve datasets to support new models and insights.
- Continuously improve pipeline scalability, reliability, and performance as datacenter footprint and workload complexity grow.
Requirements
- BS in Computer Science or a related field (MS preferred), or equivalent experience, with at least 5 years designing, building, and operating large-scale data pipelines and data platforms for distributed systems or infrastructure data.
- Proficiency in Python and SQL, with a track record of supporting analytical and exploratory workloads.
- Hands-on experience with distributed data processing frameworks such as Spark or similar technologies.
- Familiarity with observability and telemetry data, including metrics, logs, traces, and time-series data.
- Experience designing data models and schemas that support flexible analysis and forecasting.
- Ability to take ownership of data engineering initiatives and drive them end-to-end in collaboration with multi-functional partners.
- Experience implementing data quality, validation, and monitoring for analytics pipelines.
- Strong communication and collaboration skills, particularly when working with engineering and infrastructure teams.
- Adaptability in fast-paced environments with evolving analytical and operational needs.
Technologies
- Python
- SQL
- Spark
- Prometheus
- Grafana
- Elastic/OpenSearch
- Kafka
Benefits
- Equity and benefits
Ways to Stand Out
- Experience supporting datacenter infrastructure analytics, hardware reliability programs, or workload performance analysis.
- Familiarity with EDA workflows, HPC environments, or GPU-accelerated compute platforms.
- Experience integrating or operating observability stacks such as Prometheus, Grafana, Elastic/OpenSearch, Kafka, Spark, or similar tools.