In this blog post I will walk you though the exact steps needed to set up Jupyter Notebook to connect to your private data warehouse in AWS Redshift.
At RJMetrics, I use the Jupyter + Redshift analytics stack every single day because there are truly very few limits to what one can do with this powerful combination. From querying your data and visualizing it all in one place, to documenting your work and building interactive charts and dashboards, to running machine learning algorithms on top of your data and sharing the results with your team, the possibilities are limitless.
This entire post was written in Jupyter itself so that you can:
- see how easy it is to get started,
- download the notebook to follow along on your computer, and
- modify the code so that you can start analyzing your own data.
Let’s get started!
What is Jupyter Notebook?
Jupyter Notebook is an open-source data science tool used by many data scientists and data analysts at some of the most data-driven organizations in the world, including Google, Microsoft, IBM, Bloomberg, O’Reilly and NASA.
An extension of the IPython project, Jupyter Notebook is an application that runs directly in your browser and allows you to create and share documents with live code from over 40 different languages. You can also supplement these documents with explanatory text, equations, and rich visualizations.
Working with Jupyter is literally like writing a notebook. You can document your work, make it accessible to other people and easily reproduce it in the future. Once the notebook is set up, you can readily share it with others. For example, GitHub natively renders notebook files as soon as they are committed into a repository. Alternatively, you can export your notebook into a number of different formats, including PDF, HTML, Markdown, and reST.
Combined with its open-source nature, big data integration capabilities (e.g., Apache Spark via Python, R, or Scala), and the ability to execute code in virtually any language of your choice, Jupyter Notebook is particularly well-suited for exploring data, experimenting with it, and capturing entire data-driven workflows (see also our Data Science Roundup #11 discussing the recent rise of notebooks).
Installing Jupyter and Python
If you have never worked with Jupyter before or if you are not sure whether it is installed on your machine, please take a look at this brief installation article from the developers of Jupyter. Several Python distributions (e.g., Anaconda) conveniently ship with both Python and Jupyter.
What is AWS Redshift?
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse. With Redshift, it has never been easier or more cost-effective to securely store your data in the cloud. On-demand pricing means that you can get started for as little as 25 cents an hour and you don’t have to worry about backup, durability, availability, security, monitoring, or maintenance: AWS takes care of all of that for you.
Because of its ease of use, Redshift is experiencing rapid adoption among light quants, growth hackers, seasoned analysts and data scientists. It integrates seamlessly with the languages you may be already familiar with, including SQL, R, and Python. For the purpose of this post, I will be focusing on the latter.
Setting Up Your AWS Redshift Instance
Setting up your Redshift instance is easy. AWS provides an excellent Getting Started with Amazon Redshift guide and the whole process takes only a few minutes. Once you have your Redshift instance up and running, I would encourage you to check out RJMetrics Pipeline. Pipeline will connect to all of the disparate data sources you might have, including databases and SaaS products, and stream your data into one single place in Redshift. With Pipeline, you can start analyzing all of your data literally in minutes and not pay a penny if you stream less than 5 million rows/events. Here is a four-minute tutorial that will walk you through setting up your data warehouse to work with Pipeline.
1. Install Requirements
Please make sure Jupyter and Python are setup on your computer (see Installing Jupyter and Python above).
We will need to install a few packages in order to connect to and work with AWS Redshift. Note that depending on your Python distribution, you may need to install additional packages. I recommend that you use the completely free Anaconda distribution as it already comes with many of the required packages installed and the packages are tested and guaranteed to work together. If working with Anaconda, you can use the built-in package management system conda. Alternatively, you can use Python’s built-in pip package management system or Homebrew (OS X only).
I will use a combination of Homebrew and pip in the following. Note that you may need to use pip3 if working with Python versions 3.3 or later.
First, since Amazon Redshift is based on PostgreSQL 8.0.2, we will need a PostgreSQL client library.
If using Mac OS X, simply simply open up your terminal and type
brew install postgresql
For other operating systems, please see the installation instructions here.
Install Python Requirements
Second, we will need a few python packages to connect to and get data from Redshift. You can simply download the
requirements.txtfile, navigate to the folder containing it and type
pip install -r redshift_requirements.txt.
(Pro tip: you may want to create a virtual python environment to keep the package dependencies required by different projects in separate locations. See Python and Conda documentation for more details.)
Here is a brief description of the four packages included in
- Psycopg2: Python adapter needed to connect to Redshift.
- SQLAlchemy: Python SQL toolkit and Object Relational Mapper that provides an efficient way to access a database.
- iPython-SQL: provides a straightforward way to write SQL and get data back. Introduces magic commands
%%sqlso that you can write plain SQL and get back your query results in the form of a dataframe.
- SimpleJSON: simple and fast JSON encoder and decoder. We will store Redshift credentials in a JSON-formatted file and will make use of the package to read that file.
2. Save your Redshift Credentials
To access you Redshift instance, you will need to know your (i) user name, (ii) password, (iii) host name, (iv) port and (v) database name. If you personally set up your Redshift instance, you will already know what these are. If not, please ask someone who did or take a look at the AWS documentation on how to obtain these credentials. Note that the port number for AWS Redshift is 5439.
So that you don’t expose any of the sensitive information above in your Notebook (e.g, when sharing your notebook with other people or when storing your notebook in a GitHub repository), please save your Redshift credentials inside a JSON-formatted file called redshift_creds.json.nogit. You can download the template for that file here. The contents of the file is pretty self-explanatory:
I would suggest saving the redshift_creds.json.nogit file in a folder that is separate from the one with your Jupyter file in it so that you don’t accidentally store the credentials on GitHub. For extra security, you can specify the *.nogit pattern in the .gitingore file of your repository.
3. Whitelist your IP address in the AWS Panel
By default, your Redshift cluster is locked down so that nobody has access to it. You can grant people inbound access to your cluster via the AWS panel by creating a security group and specifying which IP addresses are allowed to connect to Redshift. Makes sure that your IP address is included in one of these rules. Please see the Amazon Redshift Cluster Security Groups documentation from AWS for more details.
4. Connect to Redshift!
You are now all set to connect to your Redshift instance! To do so from Jupyter, open up your terminal and type ipython notebook
This command will start up a notebook server in your console and open up a web browser to show you the Jupyter Notebook dashboard:
You can then navigate to the folder containing this notebook (get it from GitHub here) and click on the “jupyter_redshift_analytics_stack.ipynb” file to start up the exact notebook you are looking at right now.
Let’s import the packages we just installed and load SQL magic. You can press (shift + enter) to execute the cell below.
Note: if you are using Anaconda and encounter the “Library not loaded: libssl.1.0.0.dylib” error when importing the psycopg2 package, run the following command in your terminal to add a fallback library path to your .bash_profile:
sudo echo 'export DYLD_FALLBACK_LIBRARY_PATH=$HOME/anaconda/lib/:$DYLD_FALLBACK_LIBRARY_PATH' >>~/.bash_profile
%config SqlMagic.displaylimit = 5
Next, let’s read the credentials from file. Note that you will need to specify the path to the redshift_creds.json.nogit file on your machine.
with open("/PATH/TO/YOUR/redshift_creds.json.nogit") as fh:
creds = simplejson.loads(fh.read())
Finally, let’s connect to Redshift!
connect_to_db = 'postgresql+psycopg2://' + \
creds['user_name'] + ':' + creds['password'] + '@' + \
creds['host_name'] + ':' + creds['port_num'] + '/' + creds['db_name'];
As a simple test, let’s get the current date and time from our Redshift instance:
select getdate() as current_date
1 rows affected.
We can now store the result in a dataframe by simply typing
df = _.DataFrame()
With only a few commands you loaded your Redshift credentials, connected to your data warehouse, executed a query and converted the query result into a pandas dataframe! Now the world is your oyster: you can use any package in the Python ecosystem to slice and dice a dataframe, visualize your results and share them with others.
More on this in the blog posts to follow.