Create your first dataset on Hadoop with SAP Lumira Discovery
- Published by Christelle Le Goff
In today’s environment of limited IT staff availability, it can be very useful to connect to your Hadoop cluster without modeling data in advance with tools from the Hadoop ecosystem such as Pig, Hive, and Spark. Even before defining views on your data in Hadoop, you can use SAP Lumira to quickly access your data, profile and transform it, and build visualizations! In this tutorial, we will show you how to leverage SAP Lumira Discovery to start digging into and harvesting your data in Hadoop the easy way. No need to be a superhero of code to start your journey in the Big Data world!
To make it easy for you, we will guide you step by step through connecting SAP Lumira Discovery to your Hadoop cluster, whatever distribution you use. In this example, we will use SAP Lumira Discovery 2.1 to connect to a Hive data warehouse.
For this tutorial you will need some basic SQL knowledge.
Step 1: Connect SAP Lumira Discovery to Hive
Open Lumira and select Query with SQL in your Data Source list.
At this point, you will see this window:
Depending on the technology you use, select the corresponding driver version. In my example, I select Apache Hadoop Hive 2.x HiveServer2 - Simba JDBC Drivers. Ask your administrator if you are not familiar with your Hadoop cluster version; you will probably need to ask for the connection information as well.
Then, enter your connection information:
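As a reference point, a HiveServer2 JDBC connection URL typically has the following shape (the host name and database here are placeholders; 10000 is the default HiveServer2 port, but your cluster may be configured differently):

```
jdbc:hive2://your-hadoop-host:10000/default
```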
You also have an “Advanced” button. You can leave the default values in this section, but if you want to know what those advanced options do, here is a quick explanation:
1. Connection Pool Mode: Connection pools enhance the performance of executing commands on a database. A connection pool is a cache of database connections maintained so that connections can be reused by future requests to the database. Use this option to choose whether the connection pool keeps the connection active.
2. Pool Timeout: If the connection pool mode is set to Keep the connection active for, indicate the time to keep the connection open.
3. Array Fetch Size: The maximum number of rows authorized with each fetch from the database. For example, if you enter 10, and your query returns 100 rows, the connection retrieves the data in ten fetches of 10 rows each. To deactivate array fetch, enter an array fetch size of 1. Data is retrieved row by row. Deactivating the array fetch size can increase the efficiency of retrieving your data, but it slows server performance. The greater the value in the array fetch size, the faster your rows are retrieved. However, ensure that the client system has adequate memory.
4. Array Bind Size: Size of the bind array before it is transmitted to the database. Generally, the larger the bind array, the more rows (n) can be loaded in one operation, and performance will be optimized.
5. Login Timeout: The number of minutes before a connection attempt times out and a message appears.
Once you have entered all your connection information, you can proceed to the next step.
Step 2: Select your data set
In the query panel you can type a SELECT statement. Only SELECT statements are allowed in the SQL editor to acquire data from database tables.
In my last screenshot, I retrieved the entire dataset by running a SELECT *.
But if you want to retrieve only a sample of your data or filter your dataset at the source (you can of course also filter your data later in your Lumira visualizations), you can limit the number of rows returned. Note that HiveQL does not support the TOP keyword; use a LIMIT clause instead:
SELECT * FROM <your_table> LIMIT 1000
Or add a WHERE clause to your SQL statement, like:
WHERE transaction_type = 'purchase'
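Putting these together, here is a sketch of a HiveQL query that filters and samples at the source; the table and column names are purely illustrative and should be replaced with those of your own Hive database:

```sql
-- Retrieve only purchase transactions, capped at 1000 rows.
-- HiveQL uses LIMIT rather than TOP; names below are illustrative.
SELECT *
FROM transactions
WHERE transaction_type = 'purchase'
LIMIT 1000;
```

Filtering in the query itself reduces the volume of data transferred from the cluster to your machine, which matters when the underlying table is large.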
You need to click the “Preview” button to get an overview of your data, then click the “Visualize” button. You will see your dimensions and measures automatically created in Lumira.
You can click on the “DataView” button to see your data in a table.
A quick digression about the number of rows Lumira can retrieve in a dataset:
With SAP Lumira there is no hard limit on the size of the dataset that can be pulled in; it depends on the RAM and system resources available on your machine. However, some visualizations are limited in the number of aggregated data points that can be displayed; the limit is 10,000 data points. If you run into memory issues with Lumira, they can be resolved by increasing the value of the -Xmx parameter in the SAPLumira.ini file. The default location of this file is C:\Program Files\SAP Lumira\Desktop\.
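As an illustration, the -Xmx line in SAPLumira.ini sets the maximum Java heap size and might be raised to something like the following (4096m is only an example; choose a value that fits the RAM available on your machine):

```
-Xmx4096m
```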
Step 3: Build your story
Now you are ready to build your own visualizations with Lumira! You can start playing with your dataset and, in a few minutes, create beautiful visualizations that make your data talk.
In conclusion, in three quick steps SAP Lumira Discovery gives you the ability to explore a large volume of data on a Hadoop cluster without having to worry about the actual size of the dataset. To go further with SAP Lumira Discovery, you can explore all the advanced analysis features available within the charts. For instance, use the Trendline feature to visualize a linear trend or to predict future data based on the linear trend in your data. Working with large amounts of data, such as from a Hadoop data source, is particularly useful for predicting trends.
About the author
Christelle Le Goff