Learn Apache Spark - Beginner Friendly
Learn Apache Spark from the basics - Big Data processing and analysis
Apache Spark is an open-source distributed computing system designed for big data processing and analytics. It provides a unified, high-level API that allows developers to write parallel applications that process large volumes of data efficiently.
“We use Spark to tame our Big Data” - Frank Kane
Before I dive into the installation details of Apache Spark, let’s understand why it is so powerful and so well suited for scalable data processing and analysis.
Here are some key features and characteristics of Apache Spark:
- Speed: Spark is known for its exceptional speed and performance. It achieves this through in-memory computing, which allows it to cache data in memory and perform operations much faster than disk-based systems. Spark also optimizes data processing through advanced execution plans and query optimization techniques (see the short PySpark sketch after this list).
- Distributed Computing: Spark is built to run on clusters of computers, enabling distributed data processing. It provides a distributed data structure called Resilient Distributed Datasets (RDDs), which can be processed efficiently in parallel across multiple nodes. Spark automatically handles the distribution and fault tolerance of data across the cluster.
- Fault Tolerance: Spark provides built-in fault tolerance mechanisms. It automatically recovers from failures by rerunning failed tasks on other nodes in the cluster. Spark also supports data replication, ensuring data availability even in the event of node failures.
- Flexible Data Processing: Spark offers a wide range of APIs for data processing, including batch processing, streaming, SQL queries, machine learning, and graph processing. This versatility allows developers to perform various data operations within a single framework, reducing the need for separate tools or systems.
- Rich Ecosystem: Spark has a rich ecosystem of libraries and extensions that enhance its capabilities, including libraries for machine learning (MLlib), graph processing (GraphX), stream processing (Spark Streaming), and structured data processing (Spark SQL). These libraries make it easier to perform complex data operations with Spark.
- Integration with Other Technologies: Spark integrates well with other popular big data technologies, such as Apache Hadoop, Apache Hive, and Apache Kafka. It can read data from various data sources like the Hadoop Distributed File System (HDFS), Apache Cassandra, Apache HBase, and more. Spark also provides connectors for integrating with external data stores and databases.
- Developer-Friendly APIs: Spark offers APIs in multiple programming languages, including Scala, Java, Python, and R. This allows developers to work with Spark in their preferred language and leverage existing libraries and tools.
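To make a couple of these points concrete, here is a minimal PySpark sketch. It is only a sketch: it assumes PySpark is installed as described later in this post, and the file people.json and the app name "demo" are made-up examples.
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all CPU cores on this machine
spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
sc = spark.sparkContext

# RDD example: distribute a Python range across partitions and cache it in memory
numbers = sc.parallelize(range(1_000_000)).cache()
print(numbers.filter(lambda n: n % 2 == 0).count())  # counted in parallel; served from the cache on re-use

# DataFrame / SQL example (people.json is a hypothetical input file with an "age" field)
df = spark.read.json("people.json")
df.createOrReplaceTempView("people")
spark.sql("SELECT COUNT(*) FROM people WHERE age > 30").show()

spark.stop()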
Spark’s versatility, scalability, and performance are what make it a popular choice for processing and analyzing large-scale data. It is widely used in industries such as finance, e-commerce, healthcare, and telecommunications for tasks like data analytics, machine learning, real-time processing, and more.
To install Apache Spark on our system, we need to install the following:
- Homebrew: A package manager for macOS and Linux. It allows users to easily install, update, and manage software packages and libraries on their systems.
- Java: A high-level, general-purpose programming language that is widely used for developing a variety of applications.
- Scala: A programming language that combines object-oriented and functional programming paradigms, designed to address some of the limitations of Java while leveraging the existing Java ecosystem.
- Apache Spark: An open-source distributed computing system designed for big data processing and analytics.
First, install Homebrew by running the following command in your terminal:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
Running this command will prompt for a password; enter your macOS user (administrator) password. If you do not have admin access, contact your system admin.
Next, add Homebrew to your PATH by running the following two commands (the installer prints the exact lines for your machine; /opt/homebrew is the default prefix on Apple Silicon Macs):
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> ~/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
Next, install Java, since Apache Spark requires it. You can install it using Homebrew by typing the following into Terminal:
brew install openjdk@11
After the installation, you should also set the JAVA_HOME environment variable by adding it to your shell profile file (for example, ~/.zshrc or ~/.bashrc). Here is an example for ~/.zshrc, pointing at the openjdk@11 keg installed above (use /usr/local/opt/openjdk@11 instead on Intel Macs):
echo 'export JAVA_HOME=/opt/homebrew/opt/openjdk@11' >> ~/.zshrc
source ~/.zshrc
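As an optional sanity check, you can confirm that JAVA_HOME now points at a working JDK:
$JAVA_HOME/bin/java -version   # should print an OpenJDK 11 version string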
The next step is to install Scala, since Apache Spark is written in Scala. Run the following command:
brew install scala
The next step is to install Apache Spark. Run the following command:
brew install apache-spark
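Optionally, assuming Homebrew has linked the Spark launcher scripts into your PATH, you can confirm the installation by printing the Spark version:
spark-submit --version   # prints the installed Spark version banner and exits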
The next step is to install PySpark, the Python API for Apache Spark. Run the following command if you are using Python 3:
pip3 install pyspark
Alternate method: if you are working in a Jupyter Notebook, run the following command instead:
!pip install pyspark
Now, Apache Spark and PySpark should be installed correctly on your system. Check this by starting the PySpark shell with the command pyspark in your terminal. If everything is set up correctly, you will see the SparkContext defined as sc.
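For example, once the shell is up you can run a tiny job to confirm that everything works end to end (a minimal sketch using only the predefined sc):
sc.parallelize(range(100)).sum()   # distributes 0..99 across local cores; should print 4950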
There are multiple ways to install Spark and these tools. You can use other tools like findspark or use Spark in a Docker container as well. This blog illustrates one of the easiest ways to get started on a local machine.
Also, these instructions can change based on the software versions you are using or macOS updates; always refer to the latest official documentation for each tool.
If you have any suggestions or questions, email me at [email protected]
Until next time!
image credit: https://medium.com/@le.oasis/getting-started-with-apache-spark-sparksql-scala-with-mac-terminal-b9c9513c51f1