Thursday, July 10, 2014

@MapR – Securing Hadoop – Great tech session from Keys Botzum.

I am at the Hadoop-DC Meetup group to learn how MapR has secured Hadoop.

Keys Botzum, a Senior Principal Technologist with MapR, is leading the session.

MapR is a complete distribution of Hadoop.

MapR’s focus with Hadoop is:
- Performance
- Multi-Tenant
- Security

According to MapR, 80% of accounts triple their installation size within 12 months.

Why Security for Hadoop?

Historically, Hadoop processed public internet data. That is now shifting to enterprises that process sensitive data, e.g. financial and health records.

Traditional firms want to create a data lake of confidential operational data.

Typical weaknesses for Hadoop

  • Client Operating system is trusted to identify user (Weak Authentication)
  • Anyone that can reach a node is trusted.
  • Hive ran as a system user.
  • Traffic is not encrypted

MapR 3.1 Securing Hadoop

  • Leverage the work done in the Open Source community
  • Encrypt network traffic
  • Authorization
  • Support but DO NOT REQUIRE Kerberos

Customers made it clear that Kerberos was too hard to deploy.
Authorization is based on AUTHENTICATED Entities.

Design decisions

MapR native security is modeled on Kerberos but it is not Kerberos
1. Password based authentication
2. Can integrate with Kerberos if already implemented

Shared Secrets managed at the cluster level
Two shared keys: the Container Location Database (CLDB) key and the server key

CLDB tickets are permanent
Server keys are ephemeral and issued for users.
Clients authenticate to trusted servers using the ticket.

maprlogin uses SSL to connect to the CLDB.
After a login with userid/password, it drops a ticket as a file in /tmp, and all utilities look for it there. The file is set so that only the user can use the ticket.
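
To make the "only the user can use the ticket" point concrete, here is a minimal Python sketch of writing a file with owner-only permissions on Linux. It is not how maprlogin is implemented; the function names and the demo_ticket_ file name are hypothetical and only illustrate the file-permission idea.

```python
import os
import stat
import tempfile


def write_private_ticket(directory, uid, ticket):
    """Write ticket bytes to a per-user file that only its owner can read."""
    # Hypothetical file name; the real ticket file name is not given in these notes.
    path = os.path.join(directory, f"demo_ticket_{uid}")
    # Create (or truncate) the file with owner-only read/write permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # Force 0600 even if the file already existed with looser permissions.
        os.fchmod(fd, 0o600)
        os.write(fd, ticket)
    finally:
        os.close(fd)
    return path


def is_private(path):
    """Return True if group and other have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        ticket_path = write_private_ticket(scratch, os.getuid(), b"opaque-ticket-bytes")
        print(ticket_path, is_private(ticket_path))
```

A utility that reads such a file could call is_private first and refuse a ticket that other users are able to read.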

maprlogin can also renew a ticket, which is great for keeping a script alive.

User information comes from the operating system. MapR uses PAM and the Linux password APIs.
If your Linux authentication works, then MapR works.

The client then uses encrypted RPC with the user authentication ticket/key.

The ticket has data encrypted with a secret key obtained during authentication.
The server decrypts the data to prove it is a valid ticket.
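
The idea that a server can prove a ticket is genuine using a key only it holds can be sketched in a few lines of Python. This is a simplified stand-in, not MapR's ticket format: Python's standard library has no AES-GCM, so the sketch swaps encryption for an HMAC tag (integrity only, no confidentiality), and every name and field here is an assumption made for illustration.

```python
import base64
import hashlib
import hmac
import json


def issue_ticket(server_key, user, lifetime_s, now):
    """Build a ticket: a JSON payload plus an HMAC tag computed with the server key."""
    payload = json.dumps({"user": user, "expires": now + lifetime_s}, sort_keys=True).encode()
    tag = hmac.new(server_key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(tag).decode())


def validate_ticket(server_key, ticket, now):
    """Return the user name if the ticket is authentic and unexpired, else None."""
    try:
        payload_b64, tag_b64 = ticket.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:
        return None  # malformed ticket
    expected = hmac.new(server_key, payload, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(tag, expected):
        return None
    claims = json.loads(payload)
    if claims["expires"] <= now:
        return None  # ticket has expired
    return claims["user"]


if __name__ == "__main__":
    key = b"0" * 32  # demo key only; a real key would be random and secret
    ticket = issue_ticket(key, "alice", lifetime_s=60, now=1000.0)
    print(validate_ticket(key, ticket, now=1010.0))
```

The server never needs to look the user up again: holding the key is enough to check that the ticket was issued by a trusted party and has not expired.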

MapR user identity is independent of the host or operating system.
The login process verifies the user and generates the ticket which is then what is used to validate access to a cluster.

Servers have to authenticate to the CLDB using a secret key you as a sysadmin placed there.

[Ed: what stops someone snooping the secret server key?]

Apache Hadoop (Java) uses SASL for pluggable authentication.
MapR created a pluggable MaprSASL to perform this authentication.

Any user can submit jobs but they can only administer their own submitted jobs.

The Job Tracker creates a user ticket when a job is run. This prevents a ticket expiring between submission and running.

MapR also supports exposing the file system via NFS.

MapR can’t rewrite the NFS protocol, so the best practice is to create a MapR NFS server outside the cluster and compress and loop back the data.

maprlogin doesn’t support two-factor authentication, but it could be upgraded to provide it.


Because of performance concerns, bulk fileserver data transfers are not encrypted by default; encryption is an optional setting.

All MapR Servers authenticate to each other.
Most communication paths are encrypted.
Self-signed wildcard certificates are created for HTTPS traffic. You can replace them with your own certificate if desired.


Cryptography uses current NIST standards – AES-256 in GCM Mode.
- Utilizes hardware encryption when it is available (auto-detection)


Hard-to-follow security is inherently insecure.

Beyond the MapR core

  • Hive
  • Pig
  • Mahout
  • Sqoop

Most are libraries and just work with MapR secured servers.

Hive Server 2 supports password authentication (but doesn’t do SSL without extra configuration).

MapR Tables is their rewritten HBase. It works natively with MapR security. To use HBase instead, you have to secure it with Kerberos.

MapR Tables Authorization.

Sqrrl/Accumulo provides boolean logic constraints.

MapR uses the same logic. They apply it at the table, column-family and column levels, and also support the NOT (!) operator.
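
A boolean access constraint of this kind can be evaluated with a small recursive-descent parser. The Python sketch below is an illustration under assumptions: the u:/g: identity prefixes, the operator precedence (! before & before |) and the function names are hypothetical choices, not MapR's or Accumulo's documented syntax.

```python
import re

# A token is a single operator/parenthesis or an identity such as "u:alice".
TOKEN = re.compile(r"\s*([()&|!]|[A-Za-z0-9_:.-]+)")
OPERATORS = {"(", ")", "&", "|", "!"}


def tokenize(expr):
    """Split an expression string into operator and identity tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def allowed(expr, identities):
    """Evaluate expr against the set of identities the caller holds."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        pos += 1
        return tokens[pos - 1]

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok in OPERATORS:
            raise ValueError(f"unexpected token {tok!r}")
        return tok in identities

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


if __name__ == "__main__":
    alice = {"u:alice", "g:analysts"}
    print(allowed("g:analysts & !u:bob", alice))
```

In a setup like the one described, an expression of this sort would be attached to a table, column family or column, and checked against the identities of the authenticated caller on each access.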

Accumulo goes to cell-level security.

MapR comes in different versions: M3, M5, M7.

MapR tables come in M7.

Security is in all versions.

Encryption is supported for data at rest.
MapR is a block-addressable device, so you can add encryption using open source or commercial tools, like Gazzang.
Alternatively, you can use encrypted drives, e.g. FIPS 140-2 encrypted drives.

MapR will incorporate encryption into the core product in a future release.
