Available for
- Roles: Super Admin
- Permissions: Manage Data Warehouse Sync integration
- Packages: Advanced Analytics add-on
Figuring out how to extract and load data can be a complex process depending on the tools and services your company is using. This technical guide walks through the entire process of setting up a local Postgres database and loading all of the information available via Data Warehouse Sync.
Requirements
To get started, ensure that the following are installed and up-to-date:
- Python 3
- PostgreSQL
  - You can download it directly from https://www.postgresql.org/ or install it using your favorite package manager
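To confirm that both are installed and available on your PATH, you can check their versions from the command line or terminal (the exact commands may vary slightly depending on your platform and how they were installed):
python3 --version
psql --version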
Getting started
First, set up a database using Postgres. Refer to the Postgres documentation on how to start your server. Once the Postgres server is started, create a database that will store the information. Refer to the Postgres documentation on how to create a database. Keep track of the name of the database because it will be used later.
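For example, with a default local installation, the database can typically be created from the command line using the bundled createdb utility or psql; the database name "lever_dws" below is just a placeholder, so use whatever name you prefer:
createdb lever_dws
psql -U postgres -c "CREATE DATABASE lever_dws;"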
Next, download the Python script at the bottom of this article and open it with a code editor (e.g. Atom, Sublime Text, etc.). Ensure that the Python packages "psycopg2" and "pyathena" are installed, as the script depends on them. Save the file somewhere that is easy to find.
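If either package is missing, it can be installed with pip (use "pip3" if multiple versions of Python are present):
pip install psycopg2 pyathena
If psycopg2 fails to build on your machine, the prebuilt psycopg2-binary package is a common drop-in alternative.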
Setting up environment variables
It is a poor security practice to hard-code sensitive credentials in code. Instead, set up environment variables. This can be done via the command line (Windows) or terminal (Mac). Open the application that applies to your machine and create variables for the Athena schema, work group, access key, secret key, and database name (the name used earlier when creating the database). Refer to the connection sheet (provided upon purchase of Data Warehouse Sync) for the specific values.
Command line:
set ATHENA_SCHEMA=lever_leverdemo_parquet
Note: This only sets the variable for the current session. Use "setx" to make the variable persistent.
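For example, to persist the variable across sessions (note that setx uses a space rather than an equals sign, and only takes effect in newly opened command line windows):
setx ATHENA_SCHEMA lever_leverdemo_parquet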
Terminal:
export ATHENA_SCHEMA=lever_leverdemo_parquet
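The remaining variables are set the same way. The exact names the script expects are defined near the top of the downloaded file, so match them there; the names below are only illustrative placeholders, with the values taken from your connection sheet:
export ATHENA_WORK_GROUP=<work group>
export AWS_ACCESS_KEY_ID=<access key>
export AWS_SECRET_ACCESS_KEY=<secret key>
export DATABASE_NAME=<database name>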
Configuring the connection
On line 52 of the script, there is the following code:
conn = pg_connect(
    host='localhost',
    user='postgres',  # This is a default value. It may be your username.
    database=database_name,
    # password='<password>',  # By default, the password is empty. You can choose to enforce a password.
    # port='5432',  # This is a default value and can be omitted.
)
Depending on your environment, these values may need to be updated; the ones shown above are defaults. If the connection fails, try changing the value of "user" to the name of the user that is logged in on the machine running the script.
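If you would rather not edit the connection values by hand, one option is to read them from environment variables as well. The sketch below assumes pg_connect is psycopg2's connect function (the downloaded script may alias it differently) and uses illustrative variable names:
import os
from psycopg2 import connect as pg_connect  # assumption: the script's pg_connect wraps psycopg2.connect

database_name = os.environ['DATABASE_NAME']  # illustrative name; match whatever your shell exports

conn = pg_connect(
    host=os.environ.get('POSTGRES_HOST', 'localhost'),
    user=os.environ.get('POSTGRES_USER', 'postgres'),
    database=database_name,
    password=os.environ.get('POSTGRES_PASSWORD'),  # stays empty if no password is enforced
    port=os.environ.get('POSTGRES_PORT', '5432'),
)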
Running the script
Running the script is as easy as locating the .py file downloaded earlier and calling it from the command line (Windows) or terminal (Mac).
On Windows, simply enter the filepath to the .py file. For example, if the file is on your desktop:
c:\Users\(username)\Desktop\athena_to_postgre_etl.py
On Mac, preface the command with "python" (or "python3" if multiple versions of Python are present). For example, if the file is on your desktop:
python3 /Users/(username)/Desktop/athena_to_postgre_etl.py
After a short delay, the terminal should start printing messages such as:
Dropping public.account_stages...
Creating public.account_stages...
Dumping data into public.account_stages...
Once the script has finished running, that's it! The Postgres database is now successfully synced with Data Warehouse Sync.
Automating the script
It may become tedious to manually run this script every time the database needs to be updated. There are many ways to automate a script; AWS Lambda provides a lightweight, serverless option. This guide does not delve into the specifics of automating the script due to the wide variety of options. When using a cloud-based solution like Lambda, be sure to protect the sensitive credentials and never hard-code them into the script. To learn more, refer to this AWS article on securely providing database credentials to Lambda functions.
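As a rough illustration only, a Lambda handler for this kind of job might pull credentials from AWS Secrets Manager at runtime, expose them as environment variables, and then invoke the ETL logic. The secret name and the run_etl entry point below are purely hypothetical, and the downloaded script may need to be refactored to expose such a function:
import json
import os

import boto3

def handler(event, context):
    # Fetch credentials from Secrets Manager instead of hard-coding them in the script.
    secrets = boto3.client('secretsmanager')
    secret = json.loads(
        secrets.get_secret_value(SecretId='data-warehouse-sync/etl')['SecretString']  # hypothetical secret name
    )

    # Expose the values the ETL script reads from the environment (illustrative keys).
    os.environ['ATHENA_SCHEMA'] = secret['athena_schema']
    os.environ['AWS_ACCESS_KEY_ID'] = secret['access_key']
    os.environ['AWS_SECRET_ACCESS_KEY'] = secret['secret_key']

    # run_etl is a hypothetical entry point; the downloaded script would need to expose one.
    from athena_to_postgre_etl import run_etl
    run_etl()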