Best Practices for PySpark ETL Projects

Posted on Sun 28 July 2019 in data-engineering • Tagged with data-engineering, data-processing, apache-spark, python

png

I have often lent heavily on Apache Spark and the SparkSQL APIs for operationalising any type of batch data-processing ‘job’, within a production environment where handling fluctuating volumes of data reliably and consistently are on-going business concerns. These batch data-processing jobs may involve nothing more than joining data sources and …


Continue reading

Building a Data Science Platform for R&D, Part 3 - R, R Studio Server, SparkR & Sparklyr

Posted on Mon 22 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark

Alt

Part 1 and Part 2 of this series dealt with setting up AWS, loading data into S3, deploying a Spark cluster and using it to access our data. In this part we will deploy R and R Studio Server to our Spark cluster’s master node and use it to …


Continue reading

Building a Data Science Platform for R&D, Part 2 - Deploying Spark on AWS using Flintrock

Posted on Thu 18 August 2016 in data-science • Tagged with AWS, data-processing, apache-spark

Alt

Part 1 in this series of blog posts describes how to setup AWS with some basic security and then load data into S3. This post walks-through the process of setting up a Spark cluster on AWS and accessing our S3 data from within Spark.

A key part of my vision …


Continue reading