The amount of data collected throughout the world continues to rise exponentially, doubling every two to three years. As this data grows, the field of Big Data offers major opportunities for companies to analyze consumer preferences, purchases, and trends. The ability to capture, store, and process large amounts of data is increasingly important for the modern corporation, enabling a wide variety of applications and benefits. But if the data cannot be stored and processed effectively, those opportunities are lost. Along with the rise of Big Data, the tools and frameworks for processing that data have also risen in importance.

The Basics of Distributed Computing

When working with Big Data, conventional computing quickly becomes insufficient: a single machine cannot store or process data at this scale, and conventional applications and frameworks do not scale effectively, either for storing the data or for processing it. This is what Hadoop was created for – to reliably scale the storage and processing of this data in a distributed model. Hadoop splits both the storage and the processing of the data across a distributed array of machines, and that parallelization provides vastly improved performance.

Apache Hadoop consists of three parts:

HDFS, or the Hadoop Distributed File System, handles the storage of the data, providing distributed storage across hundreds or thousands of nodes.

MapReduce handles distributed data processing, splitting the work into two phases – Map tasks (which can be performed in parallel directly on the nodes that hold the data) and Reduce tasks (which merge the intermediate results to produce the final output). A minimal sketch of this model follows the list below.

YARN, or Yet Another Resource Negotiator, manages the resources, scheduling, and monitoring of jobs.
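
To make the Map and Reduce phases concrete, here is a minimal word-count sketch written in plain Python. It is purely illustrative – Hadoop's real MapReduce API is Java and runs across many machines – but the shape of the computation is the same: map emits key/value pairs, a shuffle groups them by key, and reduce merges each group into a result.

    from collections import defaultdict

    # Input records, one line of text each (in Hadoop these would be blocks of a file stored in HDFS).
    lines = ["big data needs big tools", "big data keeps growing"]

    # Map phase: each line is processed independently, so this work can run in parallel
    # on the nodes that already store the data.
    mapped = [(word, 1) for line in lines for word in line.split()]

    # Shuffle phase: group all values belonging to the same key.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: merge each group into the final result.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)  # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'keeps': 1, 'growing': 1}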

Hadoop was the first Big Data framework and paved the way for the Big Data revolution. But as with many technologies, opportunities for improvement were soon found, and other frameworks have been created, often building on top of Hadoop or parts of it. One of the most popular of these has been Apache Spark, which improves on Hadoop's performance, running comparable workloads around three times faster. Spark also provides other improvements, such as the ability to keep datasets directly in memory, which removes disk latency and improves performance even further.
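
To illustrate that in-memory model, the short PySpark sketch below (the input path is hypothetical) caches a dataset so that repeated queries are served from memory rather than re-read from disk.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cache-example").getOrCreate()

    # Hypothetical input path; any sizeable dataset behaves the same way.
    events = spark.read.parquet("/data/events.parquet")

    # Keep the dataset in executor memory so later actions avoid re-reading it from disk.
    events.cache()

    # The first action materializes the cache; the second reuses the in-memory copy.
    print(events.count())
    print(events.filter(events["status"] == "error").count())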

Big Data for Everyone

Over the last several years, Hadoop, and later Spark, have made it possible to process Big Data at much higher speeds using distributed computing. However, distributed computing often requires large server farms to process the data efficiently, and many companies do not want to invest in that infrastructure, which carries a large up-front cost as well as ongoing management costs. For these companies, running Big Data jobs in the cloud is a great alternative. Cloud storage lets the cloud provider handle the distributed storage requirements, and the distributed processing can also be performed directly in the cloud, again moving the management of the cluster to the cloud provider.
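
As a sketch of that division of labor, the PySpark example below reads directly from cloud object storage (the s3a:// bucket is hypothetical, and the appropriate storage connector and credentials are assumed to be configured). The provider handles the distributed storage; Spark's executors handle the distributed processing.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloud-storage-example").getOrCreate()

    # Hypothetical bucket: the provider's object store takes the place of a self-managed HDFS cluster.
    orders = spark.read.json("s3a://example-company-data/orders/2023/*.json")

    # The processing itself is unchanged – Spark still distributes the work across its executors.
    daily_totals = orders.groupBy("order_date").sum("amount")
    daily_totals.write.parquet("s3a://example-company-data/reports/daily_totals")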

The cloud makes Big Data available to anyone, providing the management and infrastructure needed at a reduced price. Instead of running and maintaining its own servers, a company can pay for only the resources it needs. But this can still be inefficient: without autoscaling, a company running Spark in the cloud is not using all of its resources all of the time, which means the worker nodes sit idle between jobs. Being able to scale resources up to match the workload, and then scale them back down when the workload is finished, provides valuable cost savings.

Autoscaling and Provisioning

The most cost-efficient way to run an application in the cloud is to use autoscaling to provision additional resources dynamically within minutes, minimizing unused compute time. This kind of scaling works well with Big Data, whose workloads are often uneven and bursty. A great tool for provisioning and scaling in the cloud is Kubernetes, which is supported by all of the major cloud providers (and can be run locally as well). Kubernetes is a container orchestration platform for deploying, scaling, and monitoring containers, and it is also a great match for Big Data: the latest versions of Spark support direct integration with Kubernetes as a scheduler, replacing YARN.
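
As a rough sketch of what that integration looks like, the snippet below points a PySpark application at a Kubernetes cluster instead of YARN. The API server URL, namespace, and container image are placeholders, and this shows client mode for brevity; production jobs are more commonly launched in cluster mode via spark-submit.

    from pyspark.sql import SparkSession

    # Placeholder values: substitute your cluster's API server, namespace, and Spark image.
    spark = (
        SparkSession.builder
        .appName("spark-on-k8s-example")
        .master("k8s://https://kubernetes.example.com:6443")  # Kubernetes replaces YARN as the scheduler
        .config("spark.kubernetes.namespace", "spark-jobs")
        .config("spark.kubernetes.container.image", "example/spark:3.5.0")
        .config("spark.executor.instances", "4")  # executors run as pods in the cluster
        .getOrCreate()
    )

    df = spark.range(1000000)
    print(df.selectExpr("sum(id)").collect())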

Now, instead of just assigning jobs to a fixed array of machines, Kubernetes, together with the autoscaling groups available from cloud providers, can request additional resources and allocate them intelligently to Spark to process the job. This integration provides many benefits; in addition to autoscaling, it makes it easy to run multiple Spark applications in full isolation from each other. A single Kubernetes cluster can serve a variety of Spark applications (even different versions of Spark) without conflicts. It also makes Spark cloud-agnostic, allowing Spark to run on a variety of cloud providers.
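
One concrete way this plays out is Spark's dynamic allocation. With settings along the lines of the sketch below (the values are illustrative), Spark asks Kubernetes for more executor pods when work is queued and releases them when they go idle, which in turn lets the cloud provider's autoscaling add or remove underlying nodes.

    from pyspark.sql import SparkSession

    # Illustrative settings; shuffle tracking lets dynamic allocation work on Kubernetes
    # without an external shuffle service.
    spark = (
        SparkSession.builder
        .appName("autoscaling-example")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "1")    # scale down to this between jobs
        .config("spark.dynamicAllocation.maxExecutors", "20")   # scale up to this under heavy load
        .getOrCreate()
    )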

At Synchronoss, our Synchronoss Insights Platform (SIP) uses Spark to process Big Data. SIP ingests data, performs ETL (Extract, Transform, Load) on it, and then provides an intuitive UI for analyzing the data and building custom charts and graphs as needed. We are currently updating SIP to run on Kubernetes, which brings us many of the benefits listed above.