vStorm Enterprise integrated with Hortonworks Data Platform (HDP) running on IBM Power Systems

Steps for moving multi-sourced data using vStorm to HDP on IBM Power
Systems


Veristorm provides a solution called vStorm Enterprise that makes data migration to
Hadoop environments flexible, secure, and easy. vStorm already supports data
movement to Hadoop solutions running on Linux on IBM® Power Systems™.
Validation testing was performed to verify vStorm’s ability to integrate with
and move data specifically to Hortonworks Data Platform (HDP) 2.6 on IBM
POWER8® processor-based servers. Here is a brief introduction to the basic
capabilities of vStorm Enterprise that are important for IBM Power Systems, followed
by information about the validation test that was completed.

Veristorm vStorm Enterprise

Veristorm offers a solution called vStorm Enterprise for Hadoop that
provides fast and seamless point-and-click data migration from many types of data
sources to Hadoop clusters and other destinations. The vStorm Connect
connector component securely transfers data at high speed and offers batch
scheduling options. vStorm can move data to the following Hadoop targets:

  • Hadoop Distributed File System (HDFS): The connector interfaces
    to the HDFS name node and moves data as comma-separated values (CSVs) or Avro
    files directly into HDFS. Avro files are compressed to reduce storage
    requirements.
  • Hive: Metadata is written to the Hive server, reflecting the
    source data schema, and file data is moved to HDFS to make the data available on
    Hadoop for HiveQL queries. Hive is not a data format; rather, it enables
    a schema to be imposed on existing data in HDFS so that it can be queried by
    using HiveQL, an SQL-like language (see the sketch after this list). Data that
    exists in Hive is already stored in HDFS.
  • Linux file system: The connector can transfer data directly
    into a Linux file system. Data written to a local file system can be used by
    downstream ETL and analytics tools and applications. This flexibility is
    important when clients want to move data not only to Hadoop but to other
    environments as well.
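To make the Hive case concrete, the following is a minimal sketch (not part of the
validation test; the table name, column list, and HDFS path are hypothetical
placeholders) of how a schema can be laid over CSV files that already sit in HDFS so
they become queryable with HiveQL:

      # Hypothetical example: expose CSV files already stored in HDFS as a Hive table.
      # The table name, columns, and HDFS location are placeholders.
      hive -e "
        CREATE EXTERNAL TABLE IF NOT EXISTS sample_orders (
          order_id INT,
          customer STRING,
          amount   DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/user/vstorm/sample_orders';
        SELECT customer, SUM(amount) FROM sample_orders GROUP BY customer;"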

For more information about vStorm Enterprise, refer to the Veristorm
website.

Objectives

The key objectives for the validation testing of vStorm Enterprise were to verify
whether it can successfully connect to an HDP Hadoop cluster running on an IBM
POWER8 processor-based server and successfully move data to and from HDP. More
specifically, the following key tests were run:

  1. Configuring vStorm to connect to HDP 2.6 running on an IBM POWER8
    processor-based server
  2. Moving sample data from a data source on an x86-based server, where vStorm is
    hosted, to the HDFS of HDP running on a POWER8 processor-based server, and
    moving data from HDP back to the original data source
  3. Moving sample data from a Linux file system on the HDP cluster to the HDFS of
    HDP running on a POWER8 processor-based server, and moving data from HDP back
    to a Linux file system on HDP

Test environment

This section lists the high-level components of Veristorm and HDP used in the test
environment.

Veristorm

  • vStorm Enterprise 3.0 including the vStorm Connect connector component for
    HDP
  • Red Hat Enterprise Linux (RHEL) 7.2
  • Virtual machine on x86 processor-based server

Hortonworks Data Platform

  • HDP version 2.6
  • RHEL 7.2
  • Minimum resources: Eight virtual processors, 24 GB memory, 50 GB disk space
  • IBM PowerKVM™
  • IBM POWER8 processor-based server

Deployment architecture

Figure 1 describes the high-level architecture for Veristorm vStorm Enterprise.
vStorm supports moving data from many types of data sources (shown on the left side)
to many data targets (shown on the right side). The tests performed with IBM Power
Systems used data in an x86 processor-based Linux file system and a POWER8
processor-based HDP Linux file system as data sources and a POWER8 processor-based
HDP HDFS and a POWER8 processor-based HDP Linux file system as the data targets.

Figure 1. High-level architecture for Veristorm vStorm Enterprise

Veristorm vStorm Enterprise has three major components: Management console, vStorm
Data Hub, and vStorm Connect.

  • Management console is the graphical user interface (GUI) that
    manages the user interactions with the data source and target systems.
  • vStorm Data Hub is the main processing engine of vStorm. It
    runs on Linux and uses vStorm Connect to access the source data through vStorm
    Connect agents deployed in the database environment. As data is streamed to
    vStorm Data Hub, it performs any specified data conversion and then transfers
    the data to the target, for example, a big data platform such as Hadoop.
  • vStorm Connect is the connector that establishes communication
    between data sources and data targets and communicates with vStorm Data Hub as
    data is transferred.

Installation and configuration

This section covers the installation and configuration of an HDP cluster and the
vStorm software.

Installing the HDP cluster

Refer to the following high-level steps to install and configure the HDP cluster:

  1. Follow the installation guide for HDP on Power Systems (see Hortonworks Data
    Platform: Apache Ambari Installation for IBM Power Systems) to install and
    configure the HDP cluster.
  2. Log in to the Ambari server and ensure that all the services are running (a
    sample command-line check is shown after this list).
  3. Monitor and manage the HDP cluster, Hadoop, and related services through
    Ambari.
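As a quick scripted alternative to checking services in the Ambari web UI, the Ambari
REST API can be queried from the command line. The following is a minimal sketch; the
host name, cluster name, and admin credentials are placeholders for whatever was
configured during the Ambari installation:

      # Hypothetical Ambari host, cluster name, and credentials; adjust to your installation.
      curl -s -u admin:admin \
        "http://ambari-host.example.com:8080/api/v1/clusters/hdp_cluster/services" \
        | grep '"service_name"'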

Installing prerequisites for vStorm Enterprise

Follow the instructions in the installation guide provided by Veristorm, and ensure
that the necessary hardware and software requirements are met. For this test, vStorm
was installed on RHEL 7.2 on an x86 processor-based server. The following list shows
the requirements at the time the test was run (a quick verification sketch follows
the list).

  • RHEL 6.x, CentOS 6.x, or SLES 11.x
  • Java 1.7 or later
  • Tomcat 6.0.18 or later (installed as the tomcat user)
  • PostgreSQL 8.4 or later with postgresql-contrib and postgresql-jdbc
  • Root access to the Linux file system (all vStorm Enterprise installation and setup
    operations are performed as the root user)
  • SELinux disabled (not enforcing)
  • Mozilla Firefox 44.x or Microsoft Internet Explorer 11.x as the client browser on
    Microsoft Windows, with JRE 1.8 or later
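The following is a minimal sketch of commands that can be used on the vStorm host to
spot-check several of these prerequisites before installation; exact package names may
vary by distribution and release:

      # Check Java, Tomcat, and PostgreSQL packages, and confirm SELinux is not enforcing.
      java -version
      rpm -q tomcat6 postgresql postgresql-contrib postgresql-jdbc
      getenforce        # should report Permissive or Disabled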

Configuring and initializing the database

After the vStorm software prerequisites were installed, the database was
configured. The following commands were used to configure and start the
database:

  1. Switch to the postgres user, which is used to configure the PostgreSQL database:

      # su - postgres

  2. Check the PostgreSQL configuration files to ensure that three specific settings
    have the correct values. Use the following two commands to display the settings
    and ensure that they match the values shown below.

      $ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
         listen_addresses = '*'
         standard_conforming_strings = off
      $ sudo cat /var/lib/pgsql/data/pg_hba.conf | grep -e host
         host    all    all    0.0.0.0/0    trust

  3. Initialize and start the PostgreSQL database using commands similar to the
    following examples (a quick status check follows this list):

     

                # su - postgres
                $ sudo service postgresql initdb
                $ sudo service postgresql start
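As a hedged sanity check before continuing (the service command assumes the same
init-script style used above, and the trust entry in pg_hba.conf shown earlier),
confirm that PostgreSQL is running and accepting connections:

      # Confirm the PostgreSQL service is up and accepting local connections.
      sudo service postgresql status
      psql -U postgres -h 127.0.0.1 -c "SELECT version();"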

Installing vStorm Enterprise

Install vStorm using the following command:

           # yum install vse-2.4-0.x86_64.rpm
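A quick, optional check (assuming the installed package is named vse, matching the rpm
file above) confirms that the package registered correctly and shows where its files
were placed:

      # Verify the vStorm Enterprise package and list the first few installed files.
      rpm -qi vse
      rpm -ql vse | head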

Setting up vStorm Enterprise

You can set up vStorm Enterprise in two modes: traditional or standalone. The
traditional mode deploys the full UI-driven capabilities for data movement from
sources to targets; standalone mode provides offline job scheduling through a
scheduler. The traditional mode was used in the deployment described here. In
addition, vStorm Enterprise can be deployed as a primary node on a single server, or
it can be deployed on multiple servers as secondary nodes that all point to the same
primary node, which allows load balancing among multiple servers. For more details
about these options, read the vStorm Enterprise User Guide provided by Veristorm.

To set up vStorm in the traditional mode, change to the vStorm sbin directory and run
the following setup command:

            # cd /opt/vse/sbin/
            # ./setup_vhub.sh

The command prompts for the IP address of the source and destination
systems, the database credentials, the management console credentials, and other
values. Enter the required values when prompted.

To start vStorm, use the following command:

                # ./start_vhub.sh

Launching the management console

Once the installation and setup are complete, launch the management console for vStorm
by entering the IP address of the vStorm server along with port 8080 (the port on
which Tomcat runs):

http://172.21.4.191:8080/vse/
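Optionally, before opening a browser, you can verify from the command line that Tomcat
is serving the console at that address. This simple sketch uses the IP address from the
test environment; substitute your own vStorm server address:

      # A 200 or redirect response indicates that the management console is reachable.
      curl -s -I http://172.21.4.191:8080/vse/ | head -n 1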

The screen shown in Figure 2 will appear. Log in with the credentials of the admin
user that were created at the time of vStorm setup.

Figure 2. vStorm Enterprise Management Console login page

If a dialog box appears (as shown in Figure 3), click Run to start
the management console. After this completes, the management console appears as
shown in Figure 4.

Figure 3. Prompt before starting the vStorm Enterprise
Management Console

Figure 4. vStorm Enterprise Management Console

Test data and data transfer

Two different sample sets of data were used in the testing.

Download and load the data into vStorm using the following steps:

  1. Using the wget command with the sample data links above, download the data to
    the server where vStorm is installed.
  2. After loading the data within vStorm, select the required data in the data source
    view in the vStorm console.
  3. Right-click the selection and then click Copy to as shown in
    Figure 5.

    Figure 5. Options for copying the data in vStorm
    console

  4. A dialog box for selecting the target where the data needs to be transferred (as
    shown in Figure 6) is displayed. Click Next.

    Figure 6. Selecting the target

    A dialog box with the job ID and job description is displayed. The job can also be
    scheduled for a specified time by providing the start time, as shown in Figure 7.

    Figure 7. Job scheduler

  5. Click Finish to transfer the data to the data target.

    You can monitor the data transfer through the vStorm console. Refer to Figure 8 to
    see how the status of the jobs appears during the transfer process (a sketch for
    verifying the transferred files follows this list).

    Figure 8. Reviewing job status in the vStorm management
    console
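After a job completes, you can also confirm on the HDP cluster that the files arrived
in HDFS. This is a minimal sketch; the target directory below is a hypothetical
placeholder for whatever path was selected in the Copy to dialog:

      # Hypothetical HDFS target path; replace with the directory chosen during the copy job.
      hdfs dfs -ls /user/vstorm/sample_data
      hdfs dfs -cat /user/vstorm/sample_data/*.csv | head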

Test scenarios

Four test scenarios were considered in this process. Each test was initiated from the
vStorm Enterprise Management Console by selecting the data source on the left side
of the console and selecting the target on the right side. Each test transferred
data from the source to the target system in one direction only.

Refer to Table 1 for the test scenarios.

Table 1. Tests initiated from the vStorm Enterprise
Management Console
  • Test 1: Source - Linux file system on an x86 processor-based server; Target - HDFS
    on the HDP cluster on a POWER8 processor-based server; Reference - management
    console shown in Figure 9
  • Test 2: Source - HDFS on the HDP cluster on a POWER8 processor-based server;
    Target - Linux file system on an x86 processor-based server; Reference - management
    console shown in Figure 10
  • Test 3: Source - Linux file system on a POWER8 processor-based server; Target -
    HDFS on the HDP cluster on the same POWER8 processor-based server; Reference -
    management console shown in Figure 11
  • Test 4: Source - HDFS on the HDP cluster on a POWER8 processor-based server;
    Target - Linux file system on the same POWER8 processor-based server; Reference -
    management console shown in Figure 12

Figure 9. vStorm Enterprise Management Console for Test 1

Figure 10. vStorm Enterprise Management Console for Test 2

Figure 11. vStorm Enterprise Management Console for Test 3

Figure 12. vStorm Enterprise Management Console for Test 4

