How do I extract and load data into a Postgres database using a Python script?

Available for:

  • User roles: All users with access to the Data Warehouse
  • Packages: Select Lever packages

Figuring out how to extract and load data can be a complex process, depending on the tools and services your company uses. This technical guide walks through the entire process of setting up a local Postgres database and loading into it all of the information available via Data Warehouse Sync.

Requirements

To get started, ensure that the following are installed and up-to-date:

  • Python 3
  • PostgreSQL

Getting started

First, set up a database using Postgres. Refer to the Postgres documentation on how to start your server. Once the Postgres server is started, create a database that will store the information. Refer to the Postgres documentation on how to create a database. Keep track of the name of the database because it will be used later.
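The Postgres documentation covers creating a database with the `createdb` command-line tool or a `CREATE DATABASE` statement. As a sketch, the statement can also be built and issued from Python; the helper below is pure, and the commented lines show how it might be run with psycopg2 (an assumption — the database name "lever_dw" is a placeholder, not a value from this article).

```python
import re

def create_db_statement(name):
    # Allow only simple identifiers so the name can be quoted safely.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsafe database name: {name!r}")
    return f'CREATE DATABASE "{name}"'

# import psycopg2
# conn = psycopg2.connect(host="localhost", user="postgres", dbname="postgres")
# conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
# with conn.cursor() as cur:
#     cur.execute(create_db_statement("lever_dw"))
# conn.close()
```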

Next, download the Python script found at the bottom of this article and open it with a code editor (e.g. Atom, Sublime Text, etc.). The script depends on the Python packages "psycopg2" and "pyathena"; install them with pip if they are not already present (e.g. `pip install psycopg2 pyathena`). Save the file somewhere that is easy to find.

Setting up environment variables

It is poor security practice to hard-code sensitive credentials in code. Instead, set them as environment variables via the command line (Windows) or terminal (Mac). Open the application that applies to your machine and create variables for the Athena schema, work group, access key, secret key, and database name (the name used earlier when creating the database). Refer to the connection sheet (provided upon purchase of Data Warehouse Sync) for the specific values.

Command line:

set ATHENA_SCHEMA=lever_leverdemo_parquet

Note: This only sets the variable for the current session. Use "setx" to make the variable persistent.

Terminal:

export ATHENA_SCHEMA=lever_leverdemo_parquet
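Inside the script, these values can be picked up with `os.environ`. The sketch below shows one way to do this; apart from ATHENA_SCHEMA, the variable names are assumptions, so match them to your connection sheet and to the names used in the downloaded script.

```python
import os

def load_settings():
    # Variable names other than ATHENA_SCHEMA are assumptions for this sketch.
    required = [
        "ATHENA_SCHEMA",          # e.g. lever_leverdemo_parquet
        "ATHENA_WORK_GROUP",
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "DATABASE_NAME",          # the Postgres database created earlier
    ]
    missing = [name for name in required if name not in os.environ]
    if missing:
        # Fail fast with a clear message instead of a confusing KeyError later.
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in required}
```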

Configuring the connection

On line 52 of the script, there is the following code:

conn = pg_connect(
    host='localhost',
    user='postgres',  # This is a default value. It may be your username.
    database=database_name,
    # password='<password>',  # By default, the password is empty. You can choose to enforce a password.
    # port='5432',  # This is a default value and can be omitted.
)

Depending on the specific environment, these values may need to be updated; the ones shown in the example are defaults. If the connection fails, try changing the value of "user" to the name of the user that is logged in on the machine being used.
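One way to express that fallback is to build the connection arguments from the environment, defaulting to the logged-in OS user when a "postgres" role does not exist. This is a sketch, not the downloaded script's exact code; the PG_USER variable name is an assumption.

```python
import getpass
import os

def connection_params(database_name):
    return {
        "host": "localhost",
        # Use PG_USER if set, otherwise fall back to the OS username.
        "user": os.environ.get("PG_USER", getpass.getuser()),
        "database": database_name,
        # "password" and "port" (default 5432) can be added here if needed.
    }

# With psycopg2 this would be used as:
# conn = psycopg2.connect(**connection_params(os.environ["DATABASE_NAME"]))
```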

Running the script

Running the script is as easy as locating the .py file downloaded earlier and invoking it from the command line.

On Windows, simply enter the file path to the .py file. For example, if the file is on your desktop:

c:\Users\(username)\Desktop\athena_to_postgre_etl.py

On Mac, preface the file path with "python" (or "python3" if multiple versions of Python are installed). For example, if the file is on your desktop:

python3 /Users/(username)/Desktop/athena_to_postgre_etl.py

After a short delay, the terminal should start printing messages like the following:

Dropping public.account_stages...
Creating public.account_stages...
Dumping data into public.account_stages...
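The messages above correspond to a drop, recreate, and repopulate cycle for each table. A simplified sketch of that cycle is shown below; the real script's logic will differ (in particular, real column types are used rather than `text`, and pyathena's parameter style varies), so treat this only as an illustration of the pattern.

```python
def sync_table(athena_cur, pg_cur, table, schema="public"):
    """Drop, recreate, and repopulate one Postgres table from Athena."""
    print(f"Dropping {schema}.{table}...")
    pg_cur.execute(f'DROP TABLE IF EXISTS {schema}."{table}"')

    print(f"Creating {schema}.{table}...")
    # Read the column names from Athena; column types are simplified to text.
    athena_cur.execute(f'SELECT * FROM "{table}" LIMIT 0')
    columns = [desc[0] for desc in athena_cur.description]
    cols_sql = ", ".join(f'"{c}" text' for c in columns)
    pg_cur.execute(f'CREATE TABLE {schema}."{table}" ({cols_sql})')

    print(f"Dumping data into {schema}.{table}...")
    athena_cur.execute(f'SELECT * FROM "{table}"')
    placeholders = ", ".join(["%s"] * len(columns))
    for row in athena_cur.fetchall():
        pg_cur.execute(
            f'INSERT INTO {schema}."{table}" VALUES ({placeholders})', row
        )
```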

Once the script has finished running, that's it! The Postgres database is now successfully synced with Data Warehouse Sync.

Automating the script

It may become tedious to run this script manually every time the database needs to be updated. There are myriad ways to automate a script; AWS Lambda provides a lightweight, serverless option. This guide does not delve into the specifics of automating the script because the options vary widely. When using a cloud-based solution like Lambda, be sure to protect sensitive credentials and never hardcode them into the script; see the AWS documentation on securely providing database credentials to Lambda functions.
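As a sketch of that principle, a Lambda handler can fetch its database credentials at runtime from a secrets store instead of embedding them. The secret name and field names below are assumptions, and the client is injected so the function can be exercised without AWS; with boto3 the client would be `boto3.client("secretsmanager")`.

```python
import json

def get_db_credentials(secrets_client, secret_id="lever/dw-sync"):
    # secret_id is a placeholder name, not a value from this article.
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

def lambda_handler(event, context, secrets_client=None):
    creds = get_db_credentials(secrets_client)
    # ...connect to Postgres with creds["user"] / creds["password"] and run
    # the same extract-and-load steps as the downloaded script...
    return {"statusCode": 200}
```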
