Steps for moving multi-sourced data using vStorm to HDP on IBM Power
Veristorm provides a solution called vStorm Enterprise that makes data migration to
Hadoop environments flexible, secure, and easy. vStorm already supports data
movement to Hadoop solutions running on Linux on IBM® Power Systems™.
Validation testing was performed to verify vStorm’s ability to integrate with
and move data specifically to Hortonworks Data Platform (HDP) 2.6 on IBM
POWER8® processor-based servers. Here is a brief introduction to the basic
capabilities of vStorm Enterprise that are important for IBM Power Systems, followed
by information about the validation test that was completed.
Veristorm offers a solution called vStorm Enterprise for Hadoop that
provides fast and seamless point-and-click data migration from many types of data
sources to Hadoop clusters and other destinations. The vStorm Connect
connector component securely transfers data at high speed and offers batch
scheduling options. vStorm can move data to the following Hadoop targets:
- Hadoop Distributed File System (HDFS): The connector interfaces
to the HDFS name node and moves data as comma-separated values (CSV) or Avro
files directly into HDFS. Avro files are compressed to reduce storage requirements.
- Hive: Metadata is written to the Hive server, reflecting the
source data schema, and file data is moved to HDFS to make the data available on
Hadoop for HiveQL queries. Hive is not a data format; rather, it enables
a schema to be imposed on existing data in HDFS so that the data can be queried by
using HiveQL, an SQL-like language. Data that exists in Hive is already stored in HDFS.
- Linux file system: The connector can transfer data directly
into a Linux file system. Data written to a local file system can be used by
downstream ETL and analytics tools and applications. This flexibility is
important when clients want to move data not only to Hadoop but to other
environments as well.
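The Hive target above can be illustrated with a short sketch. Hive does not store rows itself; an external table definition simply overlays a schema on files already sitting in HDFS. The table name, columns, and HDFS path below are illustrative assumptions, not taken from the vStorm documentation:

```shell
# Sketch: how a schema is imposed on existing HDFS data for HiveQL queries.
# Table name, columns, and location are hypothetical examples.
cat > create_sales.hql <<'EOF'
CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/vstorm/sales/';
EOF
# On the cluster, this DDL would be run with: hive -f create_sales.hql
first_line=$(head -1 create_sales.hql)
echo "$first_line"
rm -f create_sales.hql
```

Because the table is EXTERNAL, dropping it later removes only the metadata; the CSV or Avro files that the connector placed in HDFS remain untouched.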
For more information about vStorm Enterprise, refer to the Veristorm website.
The key objectives for the validation testing of vStorm Enterprise were to verify
whether it can successfully connect to an HDP Hadoop cluster running on an IBM
POWER8 processor-based server and successfully move data to and from HDP. More
specifically, the following key tests were run:
- Configuring vStorm to connect to HDP 2.6 running on an IBM POWER8 processor-based server
- Moving sample data from a data source on an x86-based server, where vStorm is
hosted, to the HDFS of HDP running on a POWER8 processor-based server, and
moving data from HDP back to the original data source
- Moving sample data from a Linux file system on the HDP cluster to the HDFS of
HDP running on a POWER8 processor-based server, and moving data from HDP back to a
Linux file system on HDP
This section lists the high-level components of Veristorm and HDP used in the test:
Veristorm vStorm Enterprise
- vStorm Enterprise 3.0, including the vStorm Connect connector component
- Red Hat Enterprise Linux (RHEL) 7.2
- Virtual machine on x86 processor-based server
Hortonworks Data Platform
- HDP version 2.6
- RHEL 7.2
- Minimum resources: Eight virtual processors, 24 GB of memory, 50 GB of disk space
- IBM PowerKVM™
- IBM POWER8 processor-based server
Figure 1 describes the high-level architecture for Veristorm vStorm Enterprise.
vStorm supports moving data from many types of data sources (shown on the left side)
to many data targets (shown on the right side). The tests performed with IBM Power
Systems used data in an x86 processor-based Linux file system and a POWER8
processor-based HDP Linux file system as data sources and a POWER8 processor-based
HDP HDFS and a POWER8 processor-based HDP Linux file system as the data targets.
Figure 1. High-level architecture for Veristorm vStorm
Veristorm vStorm Enterprise has three major components: Management console, vStorm
Data Hub, and vStorm Connect.
- Management console is the graphical user interface (GUI) that
manages the user interactions with the data source and target systems.
- vStorm Data Hub is the main processing engine of vStorm. It
runs on Linux and uses vStorm Connect to access the source data through vStorm
Connect agents deployed on the database environment. As the data is streamed to
vStorm Data Hub, the hub first performs any specified data conversion and then
transfers the data to the target, for example, a big data platform such as Hadoop.
- vStorm Connect is the connector that establishes communication
between data sources and data targets, and communicates with vStorm Data Hub as
data is moved.
Installation and configuration
This section covers the installation and configuration of an HDP cluster and vStorm Enterprise.
Installing the HDP cluster
Refer to the following high-level steps to install and configure the HDP cluster:
- Follow the installation guide for HDP on Power Systems (see Hortonworks Data Platform: Apache Ambari Installation for
IBM Power Systems) to install and configure the HDP cluster.
- Log in to the Ambari server and ensure that all the services are running.
- Monitor and manage the HDP cluster, Hadoop, and related services through Ambari.
Installing prerequisites for vStorm Enterprise
Follow the instructions in the installation guide provided by Veristorm. Ensure that
the necessary hardware and software requirements are met. For this test, vStorm was
installed on RHEL 7.2 on an x86 processor-based server. The following list shows the requirements at the time the test was run.
- RHEL 6.x, CentOS 6.x, or SLES 11.x
- Java 1.7 or later
- Tomcat6 6.0.18 or later (Tomcat installed as a tomcat user)
- PostgreSQL 8.4 or later with postgresql-contrib and postgresql-jdbc
- Root access to Linux file system. All vStorm Enterprise installation and setup
operations are done as a root user.
- SELinux disabled (not enforcing).
- Mozilla Firefox 44.x or Microsoft Internet
Explorer 11.x as the client browser on Microsoft Windows, with JRE 1.8 or later.
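The prerequisite list above can be turned into a quick preflight check before running the installer. A minimal sketch; the binaries probed here (java, psql) are assumptions based on the package list, and actual binary names can differ by distribution:

```shell
# Preflight sketch for the prerequisites above. The probed command names
# (java, psql) are assumptions; extend the list for Tomcat, SELinux, etc.
report=""
for cmd in java psql; do
  if command -v "$cmd" >/dev/null 2>&1; then
    report="$report $cmd=found"
  else
    report="$report $cmd=missing"
  fi
done
echo "preflight:$report"
```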
Configuring and initializing the database
After the vStorm software prerequisites were installed, the database was
configured. The following commands were used to configure and start the database:
- Configure the PostgreSQL database using the following command:
# su - postgres
- Check the PostgreSQL configuration file to ensure that three specific settings
have the correct values. Use the following two commands to display the three
settings and ensure that they match the values shown below.
$ sudo cat /var/lib/pgsql/data/postgresql.conf | grep -e listen -e standard_conforming_strings
listen_addresses = '*'
standard_conforming_strings = off
$ sudo cat /var/lib/pgsql/data/pg_hba.conf | grep -e host
host all all 0.0.0.0/0 trust
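The three settings checks above can be scripted so the verification is repeatable. A minimal sketch, using temporary files in place of the real /var/lib/pgsql/data/postgresql.conf and pg_hba.conf so that the check itself can be demonstrated end to end:

```shell
# Sketch: verifying the three PostgreSQL settings vStorm expects.
# Temporary files stand in for the real postgresql.conf and pg_hba.conf.
conf=$(mktemp); hba=$(mktemp)
printf "listen_addresses = '*'\nstandard_conforming_strings = off\n" > "$conf"
printf "host all all 0.0.0.0/0 trust\n" > "$hba"
if grep -q "listen_addresses = '\\*'" "$conf" &&
   grep -q "standard_conforming_strings = off" "$conf" &&
   grep -q "^host " "$hba"; then
  result="settings ok"
else
  result="settings wrong"
fi
echo "$result"
rm -f "$conf" "$hba"
```

To run the same checks against the live configuration, point the grep commands at the real file paths instead of the temporary copies.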
- Initialize and start the PostgreSQL database using commands similar to the following:
# su - postgres
$ sudo service postgresql initdb
$ sudo service postgresql start
Installing vStorm Enterprise
Install vStorm using the following command:
# yum install vse-2.4-0.x86_64.rpm
Setting up vStorm Enterprise
You can set up vStorm Enterprise in two modes: traditional or standalone. The
traditional mode, which was used in our deployment, deploys the full UI-driven
capabilities for data movement from sources to targets. Standalone mode provides
offline job scheduling through a scheduler. In addition, vStorm Enterprise can be
deployed as a primary node on a single server, or deployed on multiple servers
as secondary nodes that all point to the same primary node, which allows load
balancing among multiple servers. For more details on these options, read
the vStorm Enterprise User Guide provided by Veristorm.
To set up vStorm in the traditional mode, change to the correct directory, and run
the following setup command:
# cd /opt/vse/sbin/
# ./setup_vhub.sh
The command prompts for the IP address of the source and destination
systems, the database credentials, the management console credentials, and other
values. Enter the required values when prompted.
To start vStorm, use the following command:
Launching the management console
Once the installation and setup are complete, launch the management console for vStorm
by entering the server's IP address along with port number 8080 in a browser (Tomcat
runs on port 8080 by default). The screen shown in Figure 2 appears. Log in with the
credentials of the admin user, which were created at the time of vStorm setup.
Figure 2. vStorm Enterprise Management Console log in
If a dialogue box appears (as shown in Figure 3), click Run to start
the management console. After this completes, the management console will appear as
shown in Figure 4.
Figure 3. Prompt before starting the vStorm Enterprise
Figure 4. vStorm Enterprise Management Console
Test data and test scenarios
Two different sample sets of data were used in the testing.
Download and load the data into vStorm using the following steps:
- Using the wget command with the sample data links above, download the data to
the server where vStorm is installed.
- After loading the data within vStorm, select the required data in the data source
view in the vStorm console.
- Right-click the selection and then click Copy to, as shown in Figure 5.
Figure 5. Options for copying the data in vStorm
- A dialogue box to select the target, where the data needs to be transferred (as
shown in Figure 6), is displayed. Click Next.
Figure 6. Selecting the target
A dialogue box with the job ID and job description is displayed. The job can also be
scheduled for a specified time by providing the start time, as shown in Figure 7.
Figure 7. Job scheduler
- Click Finish to transfer the data to the data target.
You can monitor the data transfer through the vStorm console. Refer to Figure 8 to
see how the status of the jobs appears during the transfer process.
Figure 8. Reviewing job status in the vStorm management console
Four test scenarios were considered in this process. Each test was initiated from the
vStorm Enterprise Management Console by selecting the data source on the left side
of the console and selecting the target on the right side. Each test transferred
data from the source to the target system in one direction only.
Refer to Table 1 for the test scenarios.
Table 1. Tests initiated from the vStorm Enterprise Management Console
| Test | Data source | Data target | Reference |
| --- | --- | --- | --- |
| Test 1 | Linux file system on an x86 processor-based server | HDFS on the HDP cluster on a POWER8 processor-based server | Management console in Figure 9 |
| Test 2 | HDFS on the HDP cluster on a POWER8 processor-based server | Linux file system on an x86 processor-based server | Management console in Figure 10 |
| Test 3 | Linux file system on a POWER8 processor-based server | HDFS on the HDP cluster on the same POWER8 processor-based server | Management console in Figure 11 |
| Test 4 | HDFS on the HDP cluster on a POWER8 processor-based server | Linux file system on the same POWER8 processor-based server | Management console in Figure 12 |
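Tests 1 and 2 (and likewise Tests 3 and 4) form a round trip, so the transfers can be verified by comparing checksums of the data before it leaves the source and after it returns. A minimal local sketch of that verification, with plain file copies standing in for the vStorm transfers:

```shell
# Round-trip verification sketch: cp stands in for the vStorm transfer to
# HDFS (Test 1) and back again (Test 2); md5sum confirms the data is unchanged.
src=$(mktemp); hdfs_copy=$(mktemp); returned=$(mktemp)
echo "id,amount" > "$src"
echo "1,42.50" >> "$src"
cp "$src" "$hdfs_copy"        # stands in for source -> HDFS
cp "$hdfs_copy" "$returned"   # stands in for HDFS -> source
a=$(md5sum "$src" | cut -d' ' -f1)
b=$(md5sum "$returned" | cut -d' ' -f1)
if [ "$a" = "$b" ]; then verdict="round trip intact"; else verdict="mismatch"; fi
echo "$verdict"
rm -f "$src" "$hdfs_copy" "$returned"
```

Against a real cluster, the checksum on the HDFS side would come from a Hadoop tool rather than md5sum, but the before-and-after comparison is the same idea.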
Figure 9. vStorm Enterprise Management Console for Test 1
Figure 10. vStorm Enterprise Management Console for Test 2
Figure 11. vStorm Enterprise Management Console for Test 3
Figure 12. vStorm Enterprise Management Console for Test 4