In today’s corporate environment, many organizations maintain user identities within identity providers (IdPs) such as Active Directory (AD) or OpenLDAP. Previously, Amazon EMR users could connect their clusters to Active Directory by setting up a one-way realm trust between their AD domain and the Kerberos realm of the EMR cluster. For more detailed instructions, refer to the Tutorial: Configure a cross-realm trust with an Active Directory domain.
This configuration has allowed businesses to use corporate users and groups within EMR clusters and to define access control policies that govern data access (for instance, via Amazon EMR’s native integration with Apache Ranger). While this method remains an option, Amazon EMR has introduced native LDAP authentication, a new feature designed to simplify integration with both OpenLDAP and Active Directory.
This new capability offers several advantages:
- Automatic Security Configuration: Supported applications (including HiveServer2, Trino, Presto, and Livy) are configured to use Kerberos together with LDAP authentication. External tools no longer need to be set up for Kerberos authentication; users simply provide an LDAP username and password to gain access.
- Fine-Grained Access Control (FGAC): Users can manage who accesses EMR clusters via SSH with greater precision.
- Granular Authorization Policies: When used alongside the native Amazon EMR Apache Ranger integration, users can impose detailed authorization policies on the Hive Metastore database and its tables.
In this article, we will explore the workings of Amazon EMR’s LDAP authentication, detailing the authentication flow, how to gather and test necessary LDAP configurations, and guidance on verifying that an EMR cluster is correctly integrated with LDAP.
This post is intended to help:
- Teams managing EMR clusters to collaborate more effectively with LDAP IdP administrators to request necessary information and conduct pre-configuration tests properly.
- EMR cluster end-users to appreciate the ease of connecting from external tools to LDAP-enabled EMR clusters, in comparison to the previous Kerberos-based authentication.
How Amazon EMR LDAP Integration Functions
In the context of EMR frameworks, authentication can be categorized into two levels:
- External Authentication: This is used by users and external components to interact with the installed frameworks.
- Internal Authentication: This is utilized within frameworks to authenticate internal component communications.
With the new feature, internal framework authentication continues to be managed through Kerberos, but this is transparent to end-users or external services, which now use a username and password for authentication. The supported EMR frameworks implement an LDAP-based authentication method, validating credentials against the LDAP endpoint. If successful, users gain access to the framework.
The authentication process involves several key steps:
- A user connects to one of the supported endpoints (like HiveServer2 or Trino/Presto Coordinator) and enters their corporate credentials (username and password).
- The framework employs a custom authenticator that interacts with the EMR Secret Agent service within the cluster instances.
- The EMR Secret Agent validates the credentials against the LDAP endpoint.
- Upon successful validation, a Kerberos principal is created for the user in the MIT key distribution center (KDC) on the primary node, and the corresponding keytab is generated in the user’s home directory.
Once authentication is complete, users can begin utilizing the framework. On all cluster instances, the SSSD service is configured to fetch users and groups from the LDAP endpoint, making them accessible as system users.
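The framework authentication steps above can be pictured with a small in-memory model. This is only an illustrative sketch: the class, its method names, and the keytab path are hypothetical stand-ins, and the real EMR Secret Agent, LDAP endpoint, and KDC are separate services that are not implemented this way.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical in-memory stand-in for the components in the flow above."""

    realm: str
    # Stands in for the LDAP endpoint: username -> password.
    ldap_passwords: dict
    # Stands in for the principals held by the MIT KDC on the primary node.
    kdc_principals: set = field(default_factory=set)
    # Stands in for keytabs written to home directories: username -> path.
    keytabs: dict = field(default_factory=dict)

    def secret_agent_validate(self, username, password):
        # The credentials are checked against the LDAP endpoint.
        expected = self.ldap_passwords.get(username)
        return expected is not None and expected == password

    def authenticate(self, username, password):
        # The framework's authenticator delegates validation to the agent.
        if not self.secret_agent_validate(username, password):
            return False
        # On success, a Kerberos principal and a keytab are provisioned.
        # The keytab file name below is an assumption for illustration.
        self.kdc_principals.add(f"{username}@{self.realm}")
        self.keytabs[username] = f"/home/{username}/{username}.keytab"
        return True


cluster = ToyCluster(realm="EXAMPLE.COM", ldap_passwords={"john": "s3cret"})
print(cluster.authenticate("john", "s3cret"))
print(sorted(cluster.kdc_principals))
```

The point of the model is the ordering: nothing Kerberos-related is created until the LDAP check has succeeded, and the user only ever supplies a username and password.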
The authentication process for SSH connections differs slightly:
- A user connects via SSH to the EMR primary instance and inputs their corporate credentials (username and password).
- The SSHD service employs the SSSD service to validate these credentials.
- The SSSD checks the credentials against the LDAP endpoint, and if successful, the user accesses their home directory.
- Users can then use various CLIs (like beeline or trino-cli) to interact with Hive, Trino/Presto, or Livy. Before using the Spark CLIs (like spark-submit or pyspark), users must first run the ldap-kinit script and supply their username and password.
- The EMR Secret Agent service again validates the credentials against the LDAP endpoint. Upon success, a Kerberos principal is created, and a ticket is obtained and stored in the user’s Kerberos ticket cache.
After the ldap-kinit script completes, users can start using the Spark CLIs.
Gathering Required LDAP Settings
To configure LDAP authentication for Amazon EMR, the initial step is to obtain the necessary LDAP properties for setting up your cluster. You will require:
- The DNS name of the LDAP server
- A PEM format certificate for Secure LDAP (LDAPS) communication
- The LDAP user search base, which is a path on the LDAP tree for user searches
- The LDAP groups search base, which is a similar path for group searches
- The LDAP server bind user credentials, typically a service user (bind user) name and password for executing LDAP queries.
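Once these values have been collected, it can help to test them with a directory query before configuring the cluster. The sketch below assembles an OpenLDAP `ldapsearch` invocation from the settings listed above. The host name, bind DN, certificate path, and filter are hypothetical examples, and the helper function name is made up for illustration; only the `ldapsearch` options themselves (`-x`, `-H`, `-D`, `-W`, `-b`) and the `LDAPTLS_CACERT` environment variable come from the OpenLDAP client tools.

```python
import shlex


def build_ldapsearch_command(server_dns, bind_dn, search_base, ldap_filter, port=636):
    """Return an ldapsearch argument list for a Secure LDAP (LDAPS) test query."""
    return [
        "ldapsearch",
        "-x",                                   # simple (username/password) bind
        "-H", f"ldaps://{server_dns}:{port}",   # LDAPS endpoint of the server
        "-D", bind_dn,                          # bind user that runs the query
        "-W",                                   # prompt for the bind password
        "-b", search_base,                      # user or group search base
        ldap_filter,                            # entry to look up
    ]


# All values below are placeholders, not settings from a real directory.
command = build_ldapsearch_command(
    server_dns="ldap.awsemr.com",
    bind_dn="CN=binduser,OU=users,OU=emr,DC=awsemr,DC=com",
    search_base="OU=users,OU=italy,OU=emr,DC=awsemr,DC=com",
    ldap_filter="(sAMAccountName=john)",
)

# OpenLDAP clients read the CA certificate (PEM) from this variable.
environment = {"LDAPTLS_CACERT": "/path/to/ldap-ca.pem"}

print(shlex.join(command))
```

Running the printed command with the certificate variable set, from a host that can reach the LDAP server, should return the user's entry if the DNS name, certificate, search base, and bind credentials are all correct. For OpenLDAP directories, a filter such as `(uid=john)` is the more likely choice than `sAMAccountName`, which is an Active Directory attribute.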
When using Active Directory, an AD administrator can retrieve this data directly from the Active Directory Users and Computers tool. Selecting a user shows attributes such as distinguishedName. For instance, the distinguishedName for user john might appear as CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com. Any of the containers that enclose john can therefore serve as a user search base, from the most specific (OU=users,OU=italy,OU=emr,DC=awsemr,DC=com) to the broadest (DC=awsemr,DC=com).
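The candidate search bases can be derived mechanically from a distinguishedName. The following is a minimal sketch with a hypothetical helper name; it assumes the DN contains no escaped commas, which holds for the example above but not for every directory.

```python
from typing import List


def candidate_search_bases(distinguished_name: str) -> List[str]:
    """List enclosing containers of an entry, most specific first."""
    parts = [part.strip() for part in distinguished_name.split(",")]
    parents = parts[1:]  # drop the entry's own component (e.g. CN=john)
    bases = []
    for index, component in enumerate(parents):
        bases.append(",".join(parents[index:]))
        # Stop at the first domain component: that is the domain root.
        if component.upper().startswith("DC="):
            break
    return bases


for base in candidate_search_bases(
    "CN=john,OU=users,OU=italy,OU=emr,DC=awsemr,DC=com"
):
    print(base)
# OU=users,OU=italy,OU=emr,DC=awsemr,DC=com
# OU=italy,OU=emr,DC=awsemr,DC=com
# OU=emr,DC=awsemr,DC=com
# DC=awsemr,DC=com
```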
Depending on the number of entries in a company’s LDAP directory, a wider search base may not be the most efficient choice.