Leveraging BlueTalon with Amazon EMR | AWS Big Data Insights


This post is a guest contribution by Alex Thompson, Chief Technology Officer at BlueTalon, with insights from Sarah Johnson, Senior Solutions Engineer at BlueTalon.

Amazon Elastic MapReduce (Amazon EMR) makes it simple to process massive datasets in the cloud efficiently and affordably. EMR is used for log analysis, financial analysis, fraud detection, bioinformatics, and many other big data scenarios. The data used in these analyses, such as customer details and transaction records, often carries significant business value and may also be subject to regulatory compliance requirements.

BlueTalon offers cutting-edge data-centric security solutions tailored for Hadoop, SQL, and Big Data environments, both on-premises and in the cloud. By leveraging BlueTalon, organizations can maintain control over their data, providing users access to only what they need—no more, no less. The BlueTalon solution integrates seamlessly across AWS data services, including EMR, Redshift, and RDS.

In this article, we will illustrate how organizations can implement BlueTalon to reduce the risks associated with sensitive data while fully utilizing EMR’s capabilities.

Data-Centric Security Features of BlueTalon:

  • Comprehensive audits of user activities, providing detailed insights into queries that access sensitive data.
  • Data access controls specific to each user identity or business role and to the data resource, at the file, folder, table, column, row, cell, or even partial-cell level.
  • Secure data utilization in policy-driven decisions, accommodating complex access requirements and user-data relationships.

Implementing BlueTalon for Data Security

BlueTalon’s data-centric security framework consists of three primary components: a user interface for rule creation and real-time audit visualization, a Policy Engine for swift run-time authorization decisions, and a suite of Enforcement Points that ensure compliance with the established policies.

In a standard Hadoop cluster, users execute computations via SQL queries in Hive, scripts in Pig, or MapReduce applications. For applications interfacing with data through Hive, BlueTalon’s Hive enforcement point acts as a proxy for HiveServer2, delivering policy-compliant data. The Policy Engine executes intricate, fine-grained policy decisions based on user and content criteria in real-time by modifying SQL requests for Hive. This ensures that users receive consistent data, regardless of whether it originates from local HDFS or Amazon S3, while only compliant data is retrieved by Hive.

For direct access to HDFS, users connect and obtain policy-compliant data through the BlueTalon HDFS enforcement point, which proxies the HDFS NameNode, while the Policy Engine governs access based on user and content criteria at runtime, ensuring folder and file-level control. This setup prevents users from bypassing security measures by accessing HDFS directly.

With the use of enforcement points, BlueTalon offers various access controls:

  • Field Protection: Individual fields can be masked without disrupting the application. For instance, a blank placeholder may be presented instead of revealing actual ID values stored on disk.
  • Record Protection: The result set can be filtered to present only a subset of data, even if the filtering criteria field isn’t included in the result set. For example, a user might see only the two records associated with East Coast zip codes, compared to ten records stored on disk.
  • Cell Protection: Specific field values for distinct records may be masked with values in a compatible data format. For example, a user might be able to view the birthdate of ‘Joyce McDonald’ but not that of ‘Kelly Adams’.
  • Partial Cell Protection: Portions of individual cell data may also be protected. For example, a user could see the last four digits of a Social Security number, rather than having the number entirely hidden.
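
To make the four protection types concrete, here is a minimal sketch that applies each one to a small sample CSV using plain awk. It is an illustration only and not BlueTalon code: the column layout, the sample values (including the third record), the zip-prefix rule standing in for "East Coast", and the function names are all assumptions made up for this example.

```shell
#!/usr/bin/env bash
# Illustration only: plain awk over sample data, not how BlueTalon is implemented.
# Assumed columns: id,name,zip,birthdate,ssn (all values below are made up).
sample_rows() {
  cat <<'EOF'
101,Joyce McDonald,10001,1975-03-14,123-45-6789
102,Kelly Adams,94105,1982-11-02,987-65-4321
103,Sam Lee,02139,1990-07-21,555-12-3456
EOF
}

# Field protection: blank the id column for every record.
field_protect() { awk -F, -v OFS=, '{ $1 = ""; print }'; }

# Record protection: keep only rows whose zip starts with 0 or 1
# (a stand-in for "East Coast"), and leave the zip column out of the result.
record_protect() { awk -F, -v OFS=, '$3 ~ /^[01]/ { print $1, $2, $4, $5 }'; }

# Cell protection: mask the birthdate of one named record only,
# using a date-shaped value so the column format stays compatible.
cell_protect() {
  awk -F, -v OFS=, '$2 == "Kelly Adams" { $4 = "0000-00-00" } { print }'
}

# Partial cell protection: reveal only the last four digits of the SSN.
partial_cell_protect() {
  awk -F, -v OFS=, '{ $5 = "XXX-XX-" substr($5, 8); print }'
}

# Example: show the sample data with SSNs partially masked.
sample_rows | partial_cell_protect
```

Each function reads rows on standard input, so the filters can be chained (for example, `sample_rows | record_protect | field_protect`) to see how several protections combine in one result set.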

The BlueTalon Policy Engine integrates with Active Directory to authenticate user credentials and map identities to business roles. It enforces authorization, ensuring that Hive delivers only data compliant with the policies.

Deploying BlueTalon with Amazon EMR

In the following sections, we detail the deployment of BlueTalon with EMR and the configuration of its policies.

Prerequisites

To begin, contact sales@bluetalon.com for an evaluation copy. You also need an Amazon EC2 Linux instance on which to install BlueTalon and an Amazon EMR cluster within the same VPC. An m3.large instance running CentOS is recommended. To integrate BlueTalon with a directory, you can use an existing directory in your VPC or set up a new Simple AD using AWS Directory Service. More information can be found in this tutorial.

Installing the Packages

On the EC2 instance, install the necessary BlueTalon Policy Engine and Audit packages using the following yum commands:

yum search bluetalon

bluetalon-audit.x86_64 : BlueTalon data security for Hadoop.
bluetalon-enforcementpoint.x86_64 : BlueTalon data security for Hadoop.
bluetalon-policy.x86_64 : BlueTalon data security for Hadoop.

yum install bluetalon-audit -y 

yum install bluetalon-policy -y 
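
Before moving on, it can help to confirm that both packages are actually installed. The following is a small sketch using the standard `rpm -q` query; the helper name `missing_packages` is hypothetical and not part of BlueTalon. Note that `yum install` typically needs root privileges, so the commands above may have to be run as root or with sudo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each named package that rpm does not report
# as installed. Prints nothing when every package is present.
missing_packages() {
  for pkg in "$@"; do
    rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
  done
}

# Usage:
#   missing_packages bluetalon-audit bluetalon-policy
```

An empty result means both packages are present and the setup scripts can be run.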

Running the Setup Script

Once the BlueTalon packages are installed, execute the setup scripts to configure and activate the runtime services and UI associated with each package.

bluetalon-audit-setup

This will initiate various services, including the audit server and activity monitor, allowing you to access the BlueTalon Audit UI at ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8112/BlueTalonAudit with default credentials.

Next, run:

bluetalon-policy-setup

This will start the policy engine and web server services, enabling you to create rules via the BlueTalon Policy UI at ec2-0-0-0-0.us-west-2.compute.amazonaws.com:8111/BlueTalonConfig.

Connecting to the BlueTalon UI

After starting the runtime services, you can connect to the BlueTalon Policy Management and User Audit interfaces.
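
If a UI does not load in the browser, a quick reachability check from the command line can narrow down the cause. The sketch below builds the two UI addresses from the host, ports, and paths given above and asks curl for the HTTP status code. The `http://` scheme, the `BT_HOST` variable, and the helper names are assumptions for illustration; the host name is the placeholder used in this post and should be replaced with your instance's public DNS name.

```shell
#!/usr/bin/env bash
# Assumption: BT_HOST is a placeholder for the EC2 instance's public DNS name.
BT_HOST="${BT_HOST:-ec2-0-0-0-0.us-west-2.compute.amazonaws.com}"

# Hypothetical helper: build a UI URL from host, port, and path.
# The http:// scheme is an assumption; the post does not state it.
bt_ui_url() { printf 'http://%s:%s/%s\n' "$1" "$2" "$3"; }

# Hypothetical helper: print the HTTP status code for a URL.
# curl prints 000 when no HTTP response was received at all.
bt_ui_status() { curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1"; }

# Usage:
#   bt_ui_status "$(bt_ui_url "$BT_HOST" 8112 BlueTalonAudit)"
#   bt_ui_status "$(bt_ui_url "$BT_HOST" 8111 BlueTalonConfig)"
```

A status of 000 usually means the port is not reachable, in which case it is worth checking that the instance's security group allows inbound traffic on ports 8111 and 8112.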

Installing Enforcement Points

To finalize the setup, install and configure the BlueTalon enforcement point packages for Hive and HDFS NameNode on the master node of the EMR cluster using the following commands:

yum install bluetalon-enforcementpoint -y
bluetalon-enforcementpoint-setup Hive 10011 HiveDS

The first argument is the type of enforcement point to configure (options include Hive, HDFS, and PostgreSQL), and the second is the port number used for communication.
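
Because the Hive enforcement point acts as a proxy for HiveServer2, a client would presumably connect to the enforcement point's port rather than to HiveServer2 directly. The sketch below assumes the enforcement point accepts standard HiveServer2 JDBC connections on port 10011 (the port passed to the setup command above); that behavior, the helper name, and the `EMR_MASTER`, `BT_USER`, and `BT_PASSWORD` variables are assumptions for illustration, so check them against the BlueTalon documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a HiveServer2-style JDBC URL for a host and port.
hive_ep_jdbc_url() { printf 'jdbc:hive2://%s:%s\n' "$1" "$2"; }

# Usage (EMR_MASTER, BT_USER, and BT_PASSWORD are placeholders you supply):
#   beeline -u "$(hive_ep_jdbc_url "$EMR_MASTER" 10011)" \
#           -n "$BT_USER" -p "$BT_PASSWORD" -e 'SHOW TABLES;'
```

Running the same query as two users with different roles is a simple way to see whether the policies created in the Policy UI are being applied.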
