DataJobs.io

Job Description

Bitus Labs is seeking a Senior Data Engineer, fluent in Mandarin Chinese, to design and scale an AWS-based data lakehouse, develop production-grade data pipelines in Java and Python, and guide data quality, governance, and platform decisions while mentoring engineers.

Responsibilities

  • Architect and implement scalable data lakehouses on AWS S3 using Apache Iceberg, following the medallion architecture (Bronze, Silver, and Gold layers).
  • Build and maintain high-throughput ETL and ELT pipelines with AWS Glue, EMR (Spark), and Lambda.
  • Apply schema evolution, partitioning strategies, and Iceberg table compaction to optimize storage and query performance.
  • Deliver production-grade pipeline code in Java and Python, selecting the appropriate language for performance and maintainability.
  • Design and operate event-driven data pipelines with Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
  • Implement streaming semantics including exactly-once and at-least-once processing using Apache Flink or Spark Structured Streaming on EMR.
  • Manage infrastructure as code using AWS CDK or Terraform to enable repeatable, auditable deployments.
  • Optimize cost and performance across AWS services such as S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
  • Enforce data security best practices including IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation access controls.
  • Establish and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or similar tools.
  • Implement data quality frameworks (for example Great Expectations, Deequ) and integrate validation steps into pipeline orchestration.
  • Define and enforce data contracts between producers and consumers to ensure data reliability.
  • Contribute to data cataloging and lineage tracking via AWS Glue Data Catalog or Apache Atlas.
  • Collaborate with data scientists, ML engineers, and analysts to deliver performant, well-documented datasets.
  • Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
  • Document architecture decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

  • Minimum 5 years of professional data engineering experience, including at least 3 years on AWS cloud platforms.
  • Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
  • Experience with data lakehouse architectures using the medallion pattern and open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
  • Java proficiency (8+) for Spark jobs, Iceberg connectors, and performance-critical components; familiarity with Maven or Gradle.
  • Python proficiency (Python 3) for AWS Glue scripts, orchestration, data quality checks, and automation; experience with pandas, PySpark, boto3, and packaging best practices.
  • Storage and compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
  • Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Kafka).
  • Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
  • Querying: Athena, Redshift, or Redshift Spectrum.
  • Security and governance: IAM, KMS, Lake Formation, Secrets Manager, and VPC.
  • DevOps: AWS CDK or CloudFormation; CodePipeline or comparable CI/CD tools.
  • Apache Spark (PySpark and Spark Java API) including distributed transformations, performance tuning, and memory management.
  • Apache Iceberg capabilities such as table maintenance, time travel, snapshot management, and partition evolution.
  • SQL expertise with advanced transformations, window functions, CTEs, and query optimization.

Technical Stack

  • Languages: Java 8+ and Python 3
  • Cloud: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
  • Processing engines: Apache Spark, Apache Flink, Spark Structured Streaming
  • Table formats: Apache Iceberg (primary); familiarity with Delta Lake or Hudi
  • Streaming: Kinesis Data Streams, Kinesis Firehose, MSK
  • Orchestration: MWAA, AWS Step Functions, or EventBridge
  • IaC and CI/CD: AWS CDK, Terraform; GitHub Actions, CodePipeline
  • Build tools: Maven, Gradle
  • Data quality and lineage: Great Expectations, Deequ; AWS Glue Data Catalog, Apache Atlas
  • Security and governance: IAM, KMS, Lake Formation, Secrets Manager, VPC
  • Additional: SQL, pandas, boto3

Compensation and Benefits

Compensation: USD 130,000 per year.

  • 401(k) and matching
  • Dental insurance
  • Health insurance
  • Life insurance
  • Paid time off
  • Parental leave
  • Retirement plan
  • Vision insurance

Location, Language and Work Arrangement

  • Location: Irvine, California 92618, onsite
  • Language requirement: Mandarin Chinese is required
  • Work arrangement: In person
