Multi-faceted ETL Tool
AWS Glue has four major components: the Data Catalog, Crawlers, Classifiers, and Jobs.
As we mentioned, AWS Glue provides a managed service that lets you store and share metadata about your data between different AWS services. If you are familiar with the Apache Hive metastore, this shouldn’t be new to you.
You can store metadata about different data sources and their attributes in the Data Catalog. The cool thing is that the catalog integrates seamlessly with IAM, so you can control access to each individual data source separately using IAM policies.
A very nice feature of AWS Glue is that you can have it crawl your data source and discover the schema for you. Of course, you can define the schema manually; however, using a crawler is much more efficient.
At the heart of the Crawler is the Classifier, which reads the data, determines the format, and generates the schema. AWS Glue has built-in classifiers for several standard formats such as JSON, CSV, Avro, and ORC. (You can find the complete list here.)
You also have the ability to write your own classifier in case you are dealing with proprietary formats.
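If a proprietary format comes up, a custom classifier can be registered programmatically. Here is a minimal sketch using boto3; the classifier name, classification label, and grok pattern are illustrative assumptions, not something defined elsewhere in this series.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical grok-based classifier for a pipe-delimited results file.
# Name, Classification, and GrokPattern below are placeholders.
glue.create_classifier(
    GrokClassifier={
        "Name": "school-results-classifier",      # assumed classifier name
        "Classification": "school_results",       # label stamped onto tables it matches
        "GrokPattern": "%{WORD:school}\\|%{WORD:area}\\|%{NUMBER:score}",
    }
)
```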
Please note that you do not need a separate crawler for each data source if they have the same structure. If you have a hundred files that all share the same format and structure, you can generate the schema with only one crawler.
Once the crawler has done its job, we can change the generated schema manually if necessary.
Right away the power of AWS Glue should be obvious. Instead of going through a data-profiling exercise for each data source, working out the schema, and then using DDL to create the table, we just run the crawler and it creates the table with the correct schema for us.
In addition, you can ask the crawler to automatically detect changes in the structure of your data and react to them by applying the changes to your metadata and target tables, or by simply ignoring them.
With this powerful tool you can basically avoid broken ETL pipelines and run-time errors caused by schema changes in the source.
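How the crawler reacts to schema drift is controlled by its schema change policy. Below is a minimal boto3 sketch of a crawler definition; the crawler name, IAM role ARN, catalog database, and S3 path are placeholders I am assuming for illustration.

```python
import boto3

glue = boto3.client("glue")

# Sketch of a crawler over an S3 prefix; all names and ARNs are placeholders.
glue.create_crawler(
    Name="school-results-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="school_results_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/school-results/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # apply detected changes to the catalog table
        "DeleteBehavior": "LOG",                 # only log removed columns instead of deleting them
    },
)
```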
The job is where you write your ETL logic and code, and execute it either based on an event or on a schedule.
You can write your jobs in either Python or Scala. Remember that AWS Glue is built on the Apache Spark framework; underneath there is a cluster of Spark nodes to which the job is submitted and executed.
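We will build an actual job in part three, but to give a feel for it, here is a minimal sketch of what a Python (PySpark) Glue job script looks like. The catalog database and table names are assumptions for illustration.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve the job name and set up contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler created into a DynamicFrame
# (database and table names below are placeholders).
results = glue_context.create_dynamic_frame.from_catalog(
    database="school_results_db", table_name="results"
)

# ... transformation logic goes here ...

job.commit()
```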
To better demonstrate the features of AWS Glue, let’s walk through an example. Let’s say we are tasked with reading a bunch of files containing test results from a few schools in different areas and drawing some insights from them. The files have the following fields:
The first step is to make the files accessible to AWS Glue, and what better place to store them than S3, the perfect location for your Data Lake files and objects.
Now, instead of creating a table manually to hold the data for our ETL, we will set up a crawler to do that for us.
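Once the crawler exists, running it and inspecting what it produced can be done from the console or from code. A small boto3 sketch is below, reusing the placeholder crawler and database names from earlier.

```python
import boto3

glue = boto3.client("glue")

# Kick off the crawler (placeholder name). A real script would poll
# get_crawler until the run finishes before reading the results.
glue.start_crawler(Name="school-results-crawler")

# After the run completes, list the tables it created and their columns.
tables = glue.get_tables(DatabaseName="school_results_db")
for table in tables["TableList"]:
    columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    print(table["Name"], columns)
```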
Once the table is in the Data Catalog, we can query the data using standard SQL without the need to load it into a relational database first.
The AWS Glue database and tables create a layer of abstraction over your data files and make it possible to write SQL queries in Athena even though the actual data is still on S3 and the format is CSV.
Using Amazon Athena, you can query the source files as if they were database tables. This is very useful if you want to do data exploration and run ad-hoc queries before spending time and money bringing the data into your database.
Users and data scientists can access the source data in its original format and use SQL to explore it and come up with new business logic and applications for the data.
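For example, an ad-hoc Athena query can be issued straight from Python. This is a minimal sketch; the database, table, column names, and the S3 output location for query results are all assumptions for illustration.

```python
import time

import boto3

athena = boto3.client("athena")

# Submit an ad-hoc query against the crawled table (placeholder names).
response = athena.start_query_execution(
    QueryString="SELECT school, AVG(score) AS avg_score FROM results GROUP BY school",
    QueryExecutionContext={"Database": "school_results_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
query_id = response["QueryExecutionId"]

# Athena runs asynchronously, so poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```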
That’s it, we are all done!
In part three, we are going to write a simple job and see how we can add a calculated column to the data and load the curated results into a destination.