Bitus Labs is seeking a Senior Data Engineer who communicates in Chinese Mandarin to design and scale an AWS-based data lakehouse, develop production-grade data pipelines in Java and Python, and guide data quality, governance, and platform decisions while mentoring engineers.

Responsibilities

Architect and implement scalable medallion architecture data lakehouses on AWS S3 using Apache Iceberg, covering Bronze, Silver, and Gold layers.
Build and maintain high-throughput ETL and ELT pipelines with AWS Glue, EMR (Spark), and Lambda.
Apply schema evolution, partitioning strategies, and Iceberg table compaction to optimize storage and query performance.
Deliver production-grade pipeline code in Java and Python, selecting the appropriate language for performance and maintainability.
Design and operate event-driven data pipelines with Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
Implement streaming semantics including exactly-once and at-least-once processing using Apache Flink or Spark Structured Streaming on EMR.
Manage infrastructure as code using AWS CDK or Terraform to enable repeatable, auditable deployments.
Optimize cost and performance across AWS services such as S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
Enforce data security best practices including IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation access controls.
Establish and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or similar tools.
Implement data quality frameworks (for example, Great Expectations or Deequ) and integrate validation steps into pipeline orchestration.
Define and enforce data contracts between producers and consumers to ensure data reliability.
Contribute to data cataloging and lineage tracking via AWS Glue Data Catalog or Apache Atlas.
Collaborate with data scientists, ML engineers, and analysts to deliver performant, well-documented datasets.
Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
Document architecture decisions (ADRs) and contribute to the internal engineering knowledge base.

Requirements

Minimum 5 years of professional data engineering experience, including at least 3 years on AWS cloud platforms.
Proven track record delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
Experience with data lakehouse architectures using the medallion pattern and open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
Java proficiency (8+) for Spark jobs, Iceberg connectors, and performance-critical components; familiarity with Maven or Gradle.
Python proficiency (Python 3) for AWS Glue scripts, orchestration, data quality checks, and automation; experience with pandas, PySpark, boto3, and packaging best practices.
Storage and compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Kafka).
Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
Querying: Athena, Redshift, or Redshift Spectrum.
Security and governance: IAM, KMS, Lake Formation, Secrets Manager, and VPC.
DevOps: AWS CDK or CloudFormation; CodePipeline or comparable CI/CD tools.
Apache Spark (PySpark and Spark Java API) including distributed transformations, performance tuning, and memory management.
Apache Iceberg capabilities such as table maintenance, time travel, snapshot management, and partition evolution.
SQL expertise with advanced transformations, window functions, CTEs, and query optimization.

Technical Stack

Languages: Java 8+ and Python 3
Cloud: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
Processing engines: Apache Spark, Apache Flink, Spark Structured Streaming
Table formats: Apache Iceberg (primary); familiarity with Delta Lake or Hudi
Streaming: Kinesis Data Streams, Kinesis Firehose, MSK
Orchestration: MWAA, AWS Step Functions, or EventBridge
IaC and CI/CD: AWS CDK, Terraform; GitHub Actions, CodePipeline
Build tools: Maven, Gradle
Data quality and lineage: Great Expectations, Deequ; AWS Glue Data Catalog, Apache Atlas
Security and governance: IAM, KMS, Lake Formation, Secrets Manager, VPC
Additional: SQL, pandas, boto3

Compensation and Benefits

Compensation: USD 130,000 per year.

401(k) with employer matching
Dental insurance
Health insurance
Life insurance
Paid time off
Parental leave
Retirement plan
Vision insurance

Location, Language and Work Arrangement

Location: Irvine, California 92618, onsite
Language requirement: Chinese Mandarin
Work arrangement: In person

Senior Data Engineer (Chinese Mandarin Speaker)
