Senior Data Engineer (Mandarin Chinese Speaker)
Job Description
Bitus Labs is seeking a Senior Data Engineer who can communicate in Mandarin Chinese to design and scale an AWS-based data lakehouse and develop production-grade data pipelines in Java and Python. The role also guides data quality, governance, and platform decisions and includes mentoring other engineers.
Responsibilities
- Architect and implement a scalable medallion-architecture data lakehouse on Amazon S3 using Apache Iceberg, covering the Bronze, Silver, and Gold layers.
- Build and maintain high-throughput ETL and ELT pipelines with AWS Glue, EMR (Spark), and Lambda.
- Apply schema evolution, partitioning strategies, and Iceberg table compaction to optimize storage and query performance.
- Deliver production-grade pipeline code in Java and Python, selecting the appropriate language for performance and maintainability.
- Design and operate event-driven data pipelines with Amazon Kinesis Data Streams, Kinesis Firehose, or Apache Kafka (MSK).
- Implement streaming semantics including exactly-once and at-least-once processing using Apache Flink or Spark Structured Streaming on EMR.
- Manage infrastructure as code using AWS CDK or Terraform to enable repeatable, auditable deployments.
- Optimize cost and performance across AWS services such as S3, Glue, Athena, Redshift Spectrum, EMR, Lambda, Step Functions, and EventBridge.
- Enforce data security best practices including IAM least-privilege policies, KMS encryption, VPC networking, and Lake Formation access controls.
- Establish and maintain CI/CD pipelines for data workloads using AWS CodePipeline, GitHub Actions, or similar tools.
- Implement data quality frameworks (for example, Great Expectations or Deequ) and integrate validation steps into pipeline orchestration.
- Define and enforce data contracts between producers and consumers to ensure data reliability.
- Contribute to data cataloging and lineage tracking via AWS Glue Data Catalog or Apache Atlas.
- Collaborate with data scientists, ML engineers, and analysts to deliver performant, well-documented datasets.
- Mentor mid-level and junior engineers through code reviews, design discussions, and pair programming.
- Document architecture decisions in ADRs and contribute to the internal engineering knowledge base.
Requirements
- Minimum 5 years of professional data engineering experience, including at least 3 years on AWS cloud platforms.
- Proven track record of delivering production data pipelines at scale (TB+ datasets, high-throughput SLAs).
- Experience with data lakehouse architectures using the medallion pattern and open table formats (Iceberg preferred; Delta Lake or Hudi acceptable).
- Proficiency in Java (8+) for Spark jobs, Iceberg connectors, and performance-critical components; familiarity with Maven or Gradle.
- Proficiency in Python 3 for AWS Glue scripts, orchestration, data quality checks, and automation; experience with pandas, PySpark, boto3, and packaging best practices.
- Hands-on experience with AWS services in the following areas:
  - Storage and compute: S3, Glue (jobs, crawlers, Data Catalog), EMR (Spark/Flink), Lambda, EC2.
  - Streaming: Kinesis Data Streams, Kinesis Firehose, or MSK (Kafka).
  - Orchestration: Step Functions, MWAA (Managed Airflow), or EventBridge Scheduler.
  - Querying: Athena, Redshift, or Redshift Spectrum.
  - Security and governance: IAM, KMS, Lake Formation, Secrets Manager, and VPC.
  - DevOps: AWS CDK, Terraform, or CloudFormation; CodePipeline, GitHub Actions, or comparable CI/CD tools.
- Experience with Apache Spark (PySpark and the Spark Java API), including distributed transformations, performance tuning, and memory management.
- Working knowledge of Apache Iceberg features such as table maintenance, time travel, snapshot management, and partition evolution.
- SQL expertise, including advanced transformations, window functions, CTEs, and query optimization.
Technical Stack
- Languages: Java 8+ and Python 3
- Cloud: AWS (S3, Glue, EMR, Kinesis, Athena, Lambda, Step Functions, Lake Formation, CDK)
- Processing engines: Apache Spark, Apache Flink, Spark Structured Streaming
- Table formats: Apache Iceberg (primary); familiarity with Delta Lake or Hudi
- Streaming: Kinesis Data Streams, Kinesis Firehose, MSK
- Orchestration: MWAA, AWS Step Functions, or EventBridge
- IaC and CI/CD: AWS CDK, Terraform; GitHub Actions, CodePipeline
- Build tools: Maven, Gradle
- Data quality and lineage: Great Expectations, Deequ; AWS Glue Data Catalog, Apache Atlas
- Security and governance: IAM, KMS, Lake Formation, Secrets Manager, VPC
- Additional: SQL, pandas, boto3
Compensation and Benefits
Compensation: USD 130,000 per year.
- 401(k) and matching
- Dental insurance
- Health insurance
- Life insurance
- Paid time off
- Parental leave
- Retirement plan
- Vision insurance
Location, Language and Work Arrangement
- Location: Irvine, California 92618
- Language requirement: Mandarin Chinese (required)
- Work arrangement: On-site, in person