Job Description
Software Data Engineer
Expected start date: Dec 1st
Duration of engagement: 3 months, with potential extension
Work location: Alpharetta, GA, or Plano, TX
Work location model: Hybrid, 3 days in office
About the Role:
The ideal candidate will be responsible for designing and maintaining modern, scalable data solutions on Azure using Databricks. This includes building data pipelines, ETL/ELT workflows, and architectures such as Data Lakes, Warehouses, and Lakehouses for both real-time and batch processing. The role involves integrating large datasets from diverse sources, implementing Delta Lake, and preparing data for machine learning through feature stores.
Key Responsibilities:
- Design, develop, and optimize scalable data pipelines and ETL/ELT workflows using Databricks on Azure
- Build and maintain modern data architectures (Data Lake, Data Warehouse, Lakehouse) for real-time streaming and batch processing on Azure
- Implement data integration solutions for large-scale datasets across diverse data sources using Delta Lake and other data formats
- Create feature stores and data preparation workflows for machine learning applications on Azure
- Develop and maintain data quality frameworks and implement data validation checks
- Collaborate with data scientists, ML engineers, analysts, and business stakeholders to deliver high-quality, production-ready data solutions
- Monitor, troubleshoot, and optimize data workflows for performance, cost efficiency, and reliability
- Implement data governance, security, and compliance standards across all data processes
- Create and maintain comprehensive technical documentation for data pipelines and architectures
Required Qualifications:
- Data Architecture: Deep understanding of Data Lake, Data Warehouse, and Lakehouse concepts with hands-on implementation experience
- Databricks & Spark: 3+ years of hands-on experience with Databricks on Azure, Apache Spark (PySpark/Spark SQL), Delta Lake optimization
- Azure Platform: 3+ years working with Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics, Azure ML Studio, Azure Databricks
- Programming: Strong proficiency in Python (including pandas, NumPy), SQL, and Unix/Linux shell scripting; experience with Java or Scala is a plus
- Streaming: 3+ years of experience with Apache Kafka or Azure Event Hubs, and with Azure Stream Analytics
- DevOps: Hands-on experience with Git, CI/CD pipelines (Azure DevOps, GitHub Actions), and build tools (Maven, Gradle)
- Orchestration: Working knowledge of workflow schedulers (Apache Airflow, Azure Data Factory, Databricks Workflows, TWS)
- Problem-solving: Strong analytical and debugging skills, with the ability to work in agile/scrum environments
Preferred Qualifications:
- Experience with ML frameworks and libraries (scikit-learn, TensorFlow, PyTorch) for data preparation and feature engineering on Azure
- Experience with vector databases (Azure AI Search, Pinecone, Weaviate, Milvus) and RAG (Retrieval Augmented Generation) architectures
- Experience with modern data transformation tools (DBT, Spark Structured Streaming on Databricks)
- Understanding of LLM applications, prompt engineering, and AI agent frameworks (Azure OpenAI Service, Semantic Kernel)
- Familiarity with containerization (Docker, Azure Kubernetes Service)
- Experience with monitoring and observability tools (Azure Monitor, Application Insights, Datadog, Grafana)
- Certifications in Databricks, Azure Data Engineer Associate, Azure AI Engineer, or Azure Solutions Architect
Educational Background:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.