Deploying a scalable data science environment using Docker
Date
2019Abstract
Within the Data Science stack, the infrastructure layer supporting the
distributed computing engine is a key part that plays an important role in order to
obtain timely and accurate insights in a digital business. However, sometimes the
expense of using such Data Science facilities in a commercial cloud infrastructure
is not affordable to everyone. In this sense, we present a computing environment
based on free software tools over commodity computers. Thus, we show how to
deploy an easily scalable Spark cluster using Docker including both Jupyter and
RStudio that support Python and R programming languages. Moreover, we present
a successful case study where this computing framework has been used to analyze
statistical results using data collected from meteorological stations located in the
Canary Islands (Spain)