Amazon EC2

Reference: https://medium.com/@josemarcialportilla/getting-spark-python-and-jupyter-notebook-running-on-amazon-ec2-dec599e1c297

Amazon Elastic Compute Cloud (Amazon EC2)

Web service providing resizable compute capacity in the cloud; each instance behaves much like a virtual machine (VM)

Quick Checks

Verify that instances are turned off to limit usage
Verify that the security group only opens the ports that are needed (e.g. 22 for SSH, 8888 for Jupyter)

Workflow

Create EC2 Ubuntu instance on AWS
Connect to EC2 instance via PuTTY SSH client on Windows
Set up instance with applicable Python libraries, including Jupyter access, Spark & Hadoop
Access Jupyter Notebook for data operations
Terminate EC2 instance when complete

EC2 Setup Guide

❗

Many guides available online document similar processes, and they differ in configuration and in how reliably they deploy. The following is the reference I have been able to use to set up an EC2 Ubuntu instance for use with Spark.

Create EC2 Instance

Amazon Machine Image (AMI)

Preference is an Ubuntu Server

Instance Type

CPU/Memory: Specify as applicable to project requirements

Instance Configuration

Number of Instances: 1, unless intent is to deploy to cluster of instances
Storage: 8 GB General Purpose SSD (Default)
Tag Instance

Key: name (ex. myinstance)

Value: webserver (ex. mymachine)

Note that these values are case-sensitive.

Security Group Configuration

Create a new security group

Type: Set to the specified security profile. Leaving it at All traffic also works, but is not recommended since it opens every port.

Review Instance

Confirm, and Launch

Key Pair

Create new, or select existing, as applicable
Key pair name: Specify unique name
Download Key Pair .pem file

IMPORTANT: Verify this .pem file is downloaded before closing this dialog box.

Launch Instances

SSH Setup

The intent of this step is to connect remotely to the command line of the EC2 instance from Windows using an SSH (Secure Shell) client.

Step-by-Step Reference Guide

Connect to your Linux instance from Windows using PuTTY

After you launch your instance, you can connect to it and use it the way that you'd use a computer sitting in front of you. The following instructions explain how to connect to your instance using PuTTY, a free SSH client for Windows.

docs.aws.amazon.com


Collect Support Files

Download PuTTY

PuTTY opens a secure shell (SSH) session to the EC2 instance.

Download PuTTY: latest release (0.74)

This page contains download links for the latest released version of PuTTY. Currently this is 0.74, released on 2020-06-27. When new releases come out, this page will update to contain the latest, so this is a good page to bookmark or link to. Alternatively, here is a permanent link to the 0.74 release.

www.chiark.greenend.org.uk

Binaries to download:

putty.exe
puttygen.exe

Collect EC2 information from AWS Dashboard

Instance ID
Public DNS
Private Key (.pem file)
Enable inbound SSH traffic

This operation is dependent on the security profile specified, but if All traffic was selected in instance creation, then this has been enabled.
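One way to confirm that inbound SSH traffic is actually enabled before opening PuTTY is to test whether port 22 accepts a TCP connection. The following is a minimal sketch using only the Python standard library, intended to be run from the local machine. The function name port_is_open is illustrative, and the host in the commented usage line is a placeholder for the Public DNS collected above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Placeholder host; substitute the Public DNS from the EC2 console.
# print(port_is_open("<public_dns_name>", 22))
```

A False result usually points at the security group rules or at an instance that is not running; it does not say which of the two is the cause.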

Convert Key Via PuTTYgen

PuTTY does not natively support the .pem format generated by EC2, so the key needs to be converted to PuTTY's .ppk format using PuTTYgen.

Run PuTTYgen.exe
Select Type of key to generate: RSA

RSA is a public-key cryptosystem; here the key pair is used to authenticate the SSH session.

Load .pem file
Save private key

Name: Specify a unique name
Passphrase: Specify as needed for security profile
Save in .ppk file format

Close PuTTYgen.exe

Configure SSH Client Via PuTTY

With the converted key, use PuTTY to configure access to the EC2 instance.

Run PuTTY.exe

Session Configuration:

Host Name

Format: user_name@public_dns_name

user_name: ubuntu, for Ubuntu Server (check docs for other server usernames)

public_dns_name: Public DNS from EC2 console

Connection type: SSH
Port: 22
Category > Connection > SSH > Auth

Browse and load .ppk file

Open to start SSH Client.
Accept the Security Alert dialog box that appears on first connection, provided the host being connected to is the intended server.

Access EC2 Instance With SSH Client

Once PuTTY connects, the shell session opens as follows:

Using username "ubuntu".
Authenticating with public key "imported-openssh-key"
Welcome to Ubuntu 20.04.1 LTS (GNU/Linux 5.4.0-1029-aws x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

  System information as of Sun Feb  7 17:01:21 UTC 2021

  System load:  0.0               Processes:             99
  Usage of /:   16.8% of 7.69GB   Users logged in:       0
  Memory usage: 19%               IPv4 address for eth0: ???.??.??.???
  Swap usage:   0%

1 update can be installed immediately.
0 of these updates are security updates.
To see these additional updates run: apt list --upgradable


The list of available updates is more than a week old.
To check for new updates run: sudo apt update


The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

ubuntu@ip-???-??-??-???:~$ python3
Python 3.8.5 (default, Jul 28 2020, 12:59:40)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> quit()

When finished with the EC2 instance, close the shell and terminate the instance in the AWS Console.

Actions > Instance State > Terminate

Terminating the instance once it is no longer needed is critical, so that it does not keep accruing charges.

Jupyter & Spark Setup Guide

Overview

The intent of this process is to set up Anaconda Python, Jupyter, Spark, and Hadoop within the EC2 Instance.

Reference Step-by-Step Guide

Getting Spark, Python, and Jupyter Notebook running on Amazon EC2

You should now have successfully connected to the command line of your virtual Ubuntu instance running on EC2. The rest of the guide will tell you commands to put into this terminal. Next we will download and install Anaconda for our Python. You can replace the version numbers here with whatever version you prefer (2 or 3).

medium.com


Anaconda & Python

Install Anaconda

This process installs Anaconda and the Jupyter notebook with its own Python and all accompanying libraries to the specified directory.

Access the EC2 instance with PuTTY.
Choose an Anaconda version from the archive of available versions and copy the link to its Linux .sh installer: https://repo.anaconda.com/archive/

ubuntu@ip:~$ wget <weblink_to_anaconda.sh_installer>
ubuntu@ip:~$ bash <anaconda.sh_installer>

Confirm install to /home/ubuntu/anaconda3
Press Enter key to navigate through agreement and enter yes for agreement acceptance.
Enter yes for installer to prepend the Anaconda3 install location to PATH in /home/ubuntu/.bashrc.

Python Versions

While the newly installed Anaconda has its own version of Python, Ubuntu already comes packaged with Python, so the following will confirm which Python is being actively utilized.

ubuntu@ip:~$ which python

The default Ubuntu Python is located at /usr/bin/python, while Anaconda's version is located at /home/ubuntu/anaconda3/bin/python. The following reloads .bashrc so that the updated PATH takes effect and the Anaconda version becomes the active Python.

ubuntu@ip:~$ source .bashrc
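The same check can be made from inside Python, which is useful later for confirming that a notebook is running on the Anaconda interpreter. This is a minimal sketch; the helper name uses_anaconda is illustrative, and the prefix assumes the install location confirmed during the Anaconda install.

```python
import sys

# Assumed install location, taken from the Anaconda install step.
ANACONDA_PREFIX = "/home/ubuntu/anaconda3"


def uses_anaconda(executable=None, prefix=ANACONDA_PREFIX):
    """Return True if the interpreter path lives under the Anaconda install."""
    # Fall back to the running interpreter when no path is given.
    executable = executable or sys.executable or ""
    return executable.startswith(prefix.rstrip("/") + "/")


print("Interpreter:", sys.executable)
print("Anaconda Python active:", uses_anaconda())
```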

Jupyter Notebook

Configuration

While Jupyter comes with Anaconda, it needs to be configured for use within this EC2 Ubuntu environment.

Create the Jupyter Notebook configuration file.

ubuntu@ip:~$ jupyter notebook --generate-config

Note that the configuration file is created and stored at /home/ubuntu/.jupyter/jupyter_notebook_config.py

Create a directory for storing certificates, and change into it.

ubuntu@ip:~$ mkdir certs
ubuntu@ip:~$ cd certs

Generate the applicable certificate file.

ubuntu@ip:~$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

Populate the fields for the newly created certificate file, as applicable.

Country Name (2 letter code) [AU]: 
State or Province Name (full name) [Some-State]: 
Locality Name (e.g., city) []:
Organization Name (e.g., company) [Internet Widgits Pty Ltd]:
Organizational Unit Name (e.g., section) []: 
Common Name (e.g., server FQDN or YOUR name) []: 
Email Address []:

Document the location where this .pem file is located on the EC2 instance; it will be needed in the next step.
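The step that points Jupyter at the certificate amounts to a few lines added to jupyter_notebook_config.py. The sketch below only prints those lines so they can be reviewed and pasted into the config file. It assumes the classic NotebookApp option names (newer Jupyter releases may expect ServerApp instead), and the certificate path in the usage line is an assumed location that should be replaced with the one documented above. The helper name jupyter_https_config is illustrative.

```python
def jupyter_https_config(certfile, port=8888):
    """Render config lines that serve the notebook over HTTPS on all interfaces."""
    lines = [
        "c = get_config()",
        # mycert.pem holds both the key and the certificate, so one setting suffices.
        f"c.NotebookApp.certfile = {certfile!r}",
        # Listen on all interfaces so the notebook is reachable from outside the instance.
        "c.NotebookApp.ip = '0.0.0.0'",
        # There is no desktop browser on the server to open.
        "c.NotebookApp.open_browser = False",
        f"c.NotebookApp.port = {port}",
    ]
    return "\n".join(lines) + "\n"


# Assumed certificate location; adjust to where mycert.pem was created.
print(jupyter_https_config("/home/ubuntu/certs/mycert.pem"))
```

Listening on all interfaces makes the security group the only barrier in front of port 8888, so the inbound rules are worth rechecking.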

Permissions

Before attempting to access the Jupyter Notebook, the ownership of the mycert.pem file needs to be changed, because the file was created with sudo and is therefore owned by root. Change directory to the certs directory and input the following to set the owner of the certificate file to the 'ubuntu' user (the group remains root).

ubuntu@ip:~$ sudo chown ubuntu:root mycert.pem

Without this change, the Jupyter Notebook cannot be accessed via the browser and reports the following error. The step above is noted on several troubleshooting sites and has been shown to resolve the permission issue.

Error:

Permission Error: [Errno 13] Permission denied
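A quick way to confirm the ownership change worked is to check whether the current user can read the certificate file. This is a minimal sketch using the standard library; the helper name can_read is illustrative and the path is an assumed location for mycert.pem.

```python
import os


def can_read(path):
    """Return True if path is an existing file that the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


# Assumed certificate location; adjust to where mycert.pem was created.
CERT_PATH = "/home/ubuntu/certs/mycert.pem"
print(CERT_PATH, "readable:", can_read(CERT_PATH))
```

If this prints False while logged in as ubuntu, the notebook server is likely to hit the same permission error when it loads the certificate.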

Resources:

Supporting Frameworks & Libraries

In support of the Conda environment, the following need to be installed on the EC2 instance, if not already accessible:

Java
Scala
pip for Conda
py4j
findspark
Spark
Hadoop

pip for Conda

Install Anaconda's version of pip

ubuntu@ip:~$ conda install pip

Confirm pip installation

ubuntu@ip:~$ which pip

It should resolve to /home/ubuntu/anaconda3/bin/pip.

Confirm which Python libraries are installed to Conda

ubuntu@ip:~$ conda list

py4j Library

This library allows Python programs to access Java objects, which PySpark relies on.

Install py4j library

ubuntu@ip:~$ pip install py4j

findspark Library

This library makes pyspark importable as a regular library.

Install findspark Library

ubuntu@ip:~$ pip install -q findspark

Jupyter Notebook Utilization

Once the applicable libraries and configurations are set up, start the Jupyter Notebook with

ubuntu@ip:~$ jupyter notebook

and then open the following web address in a browser:

https://<EC2_instance_public_ip>:8888

Creating a new notebook should provide the opportunity to test the configuration and installation for using Spark.
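As a first test cell, something like the following can report which of the supporting libraries the notebook can locate. It is a minimal sketch using only the standard library; the helper name check_modules is illustrative. If Spark was unpacked manually rather than installed with pip, pyspark may show as missing until findspark.init() has been called.

```python
import importlib.util


def check_modules(names=("py4j", "findspark", "pyspark")):
    """Map each module name to True if it can be located for import."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules().items():
    print(f"{name}: {'found' if found else 'missing'}")
```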

Close Jupyter Notebook

Back in the shell, press CTRL + C to exit Jupyter Notebook.

Enter y to confirm notebook shutdown

Terminate EC2 Instance

Right-click on instance in AWS Console and Terminate.
