In the second part of Exploring AWS Glue, I am going to give you a brief introduction about different components of Glue and then we will see an example of AWS Glue in action.
AWS Glue Components
AWS Glue has four major components. Metadata Catalog, Crawlers, Classifiers, and Jobs.
As we mentioned AWS Glue has a managed services that lets you store, and share metadata about your data between different AWS services. If you are familiar with Apache Hive catalog this shouldn’t be new to you.
You can store metadata about different data sources, and their attributes in the data catalog. The cool thing is the catalog integrates with IAM seamlessly so you can control access to each individual data source separately using IAM policies.
Crawlers and Classifiers
A very nice feature of AWS Glue is that you can have it crawl your data source and discover the schema for you. Of course you have the ability to define the schema manually, however using a crawler is much more efficient.
At the heart of the Crawler is the Classifier which reads the data, determines the format and generates the schema. AWS Glue has built-in classifiers for several standard formats like JSON, CSV, Avro, ORC, etc. (You can find the complete list here)
You also have the ability to write your own classifier in case you are dealing with proprietary formats.
Please note that you do not need a separate crawler for each data source if they have the same structure. If you have a hundred files and all have the same format and structure you can generate the schema with only one crawler.
Once the crawler is done its job we can go and change the generated schema manually if necessary.
Right away the power of AWS Glue should be obvious to you. Instead of Going through data profiling practice for each data source, finding out the schema, and then using DDL to create the table we just run the crawler and it will create the table with the correct schema for us.
In addition you have the ability to ask the crawler to automatically detect the changes in the structure of your data, and react to them by applying the changes to your Metadata, your target tables, or simply ignore them.
So basically you can avoid broken ETL pipelines and run-time errors due to schema changes in the source with this powerful tool.
AWS Glue Jobs
The job is where you write your ETL logic and code, and execute it either based on an event or on a schedule.
You can write your jobs in either Python or Scala. Remember that AWS Glue is based on Apache Spark framework. Underneath there is a cluster of Spark nodes where the job gets submitted and executed.
AWS Glue in Practice
In order to better demonstrate the features of AWS Glue let’s walk through an example. Let’s say we are tasked with reading a bunch of files about the test results from a few schools in different areas and make some insights. The files have the following fields:
- SCH-DIST – The area where the school resides
- SCH-ID – The ID of the school
- STDTS – Number of students in the school
- SUBJT – The subject of the test
- AVG-SCR – The Average score of the test in that school
Upload the files to S3 bucket
The first step is to make the files accessible to AWS Glue, and what better place to store the files than S3, perfect location for your Data Lake files and objects.
Now instead of creating a table manually to hold the data for our ETL we will setup a crawler to that for us.
Setup a crawler
- Login to the management console and from the Services pick AWS Glue.
- Once in AWS Glue console click on Crawlers and then click on Add Crawler.
- Provide a name and optionally a description for the Crawler and click next.
- Specify the data store. You can select between S3, JDBC, and DynamoDB. We will use S3 for this example. In the “Include Path” box, we will put the path to the S3 bucket where we uploaded the source files.
- All the files have the same schema so we don’t need to add another data store.
- In the next page we need to specify an IAM role that has access to the data. Or we can let the wizard to create the role for us by providing a name for the role.
- Next we specify the frequency we want the crawler to be triggered. You can select any of the intervals provided, or if you select Custom then you can use CRON to specify the schedule. I will use “Run on demand”.
- Next we specify the database for the table(s) that will be created by the crawler. You can either select an existing database or create a new one. Please pay close attention to the Configuration Options section. Here you can specify how you want AWS Glue to handle changes in your schema.
- Click the Finish button, select the newly created crawler and click on “Run Crawler”. AWS Glue will automatically crawl the data files and create the database and table for you.
Please note that AWS Glue integrates very nicely with Amazon Athena. What it means to you is that you can start exploring the data right away using SQL language without the need to load the data into a relational database first.
The AWS Glue database and tables create a layer of abstraction over your data files and make it possible to write SQL queries in Athena even though the actual data is still on S3 and the format is CSV.
Using Amazon Athena you can query the source files as if they were database tables. This is very useful if you want to do data exploration, and ad-hoc queries before spending time and money to bring the data into your database.
Users and data scientists can have access to the source data in the original format and use SQL to explore the data and come up with new business logic and applications for the data.
That's it we are all done!
In part three, we are going to write a simple job and see how we can add a calculated column to the data and load the curated results into a destination.