Position Overview
Work on the data pipeline infrastructure that is the backbone of our business.
Write elegant, functional Scala code to crunch terabytes of data on Hadoop clusters, mostly using Spark.
Own data pipeline deployments to clusters, whether on-prem or in the cloud (AWS, GCP, or others).
Manage Hadoop clusters end to end, from security to reliability to high availability (HA).
Build a pluggable, unified data lake from scratch.
Automate and scale tasks for the Data Science team.
Constantly look for ways to improve our frameworks and pipelines; learning on the job is a given.
Our expertise and requirements include, but are not limited to, Spark, Scala, HDFS, YARN, Hive, Kafka, distributed systems, Python, datastores (relational and NoSQL), and Airflow.