AI/ML Lead Data Engineer - Automation/Image Processing
Job Description
The AI/ML Lead Data Engineer, Automation and Image Processing, is a senior technical leadership role based onsite in Tampa, FL, within JPMorganChase's Commercial & Investment Bank. The position centers on designing, building, and maintaining scalable data pipelines and architectures for ingesting, processing, and storing scanned document images and OCR outputs. The role works closely with data scientists and ML engineers to support analytics, model training, and production-grade workflows.
Responsibilities
- Architect and maintain scalable, high‑performance data pipelines and infrastructure to ingest, process, and store large volumes of scanned document images across enterprise workflows.
- Design end‑to‑end data solutions on AWS to enable seamless flow from source systems through OCR processing, model inference, and downstream data extraction and categorization pipelines.
- Develop robust image preprocessing and OCR integration pipelines handling TIF and PNG conversions, normalization, resolution enhancement, noise reduction, and batching for downstream computer vision and OCR models.
- Build and optimize pipelines that integrate OCR outputs, extracting structured text and metadata and routing them into databases and analytics platforms for further processing.
- Design and manage data storage architectures and containerized deployments, using Oracle databases and AWS storage services (S3, EFS) to catalog, index, and retrieve extracted text, classifications, and metadata from processed images.
- Drive containerized deployment practices with AWS EKS to deploy and scale image processing microservices, OCR engines, and data pipeline components with high availability and fault tolerance.
- Collaborate with data scientists and ML engineers to ensure training datasets for various models are curated, versioned, labeled, and accessible through well-structured data pipelines.
- Evaluate and integrate emerging data technologies to improve pipeline throughput, reduce processing latency for high‑volume document workloads, and optimize cost efficiency.
- Establish data quality, lineage, governance, and security frameworks to ensure traceability and integrity of extracted data throughout the processing lifecycle.
- Partner with security and compliance teams to ensure scanned document data, including sensitive content, is handled per regulatory requirements, encryption standards, and access controls.
- Lead and mentor a team of data engineers, establishing coding standards, peer review processes, CI/CD workflows, and best practices for production‑grade image and document processing pipelines.
Requirements
- Formal training or certification in data engineering with 5+ years of applied experience.
- Strong proficiency in Java, Groovy, and Python for building data pipelines, image preprocessing workflows, automation scripts, and backend services.
- Hands‑on experience with image file handling, especially TIF/PNG processing, multi‑page document splitting, format conversion, and integration with OCR and computer vision pipelines.
- Deep experience with AWS cloud services including S3, Lambda, Step Functions, and CloudWatch for scalable data workflows.
- Expertise in AWS EKS for deploying and managing containerized image processing, OCR, and data pipeline services using Docker and Kubernetes.
- Advanced knowledge of Oracle databases, including PL/SQL, performance tuning, partitioning strategies, and data modeling for large volumes of extracted document data and classifications.
- Familiarity with OCR technologies and the ability to structure OCR output for downstream analytics and model training.
- Understanding of data requirements for training deep learning models, including dataset preparation, annotation management, and feature store integration.
- Experience with CI/CD pipelines (Jenkins) and infrastructure‑as‑code tools (Terraform, CloudFormation) for automated deployment and environment management.
- Strong understanding of data governance, data quality, metadata management, and data cataloging, especially in document‑centric and image‑heavy data ecosystems.
- Excellent leadership, communication, and stakeholder management skills for guiding technical decisions across cross‑functional teams.
Technologies
- Java, Groovy, Python
- AWS S3, Lambda, Step Functions, CloudWatch
- AWS EKS, Docker, Kubernetes
- Oracle Database, PL/SQL, EFS
- Jenkins, Terraform, CloudFormation
- OCR technologies
Benefits
- Comprehensive health care coverage
- On-site health and wellness centers
- Retirement savings plan
- Backup childcare
- Tuition reimbursement
- Mental health support
- Financial coaching
- Commission-based pay
- Discretionary incentive compensation (cash and/or forfeitable equity)
Preferred Qualifications, Capabilities, and Skills
- Domain expertise in the healthcare industry