By Jiasheng (Leo) Sheng

Splunk as a Data Lake

With ever-increasing internet usage, data is an exponentially expanding balloon full of scrambled information waiting to be explored. For both structured and unstructured data, the rate of data generation always outpaces the rate of data analysis. This observation raises a problem: how do we deal with the abundance of data? One big-brain idea is to simply give up on processing all the data in real time; it is okay to park the excess data in storage somewhere to be used in the future. This solution is so intuitive, so simple, that it needed a fancy name: the "data lake".


A data lake is a service that stores data in the hope that useful information can be extracted from it later. Just as its name suggests, it is a lake of data: everything scrambled together, with information yet to be discovered.



Splunk is a data lake service that lets enterprises store both structured and unstructured data, including database tables, log files, transaction records, and more. An enterprise can use a Splunk Cloud cluster to host virtual machines (VMs) that run its services, and the data generated by those VMs can be automatically ingested into the Splunk data lake; alternatively, you can upload your own data files. Splunk can also connect to other cloud services (such as AWS) and import the output of their VMs into the Splunk platform.


On the Splunk platform, data files are first indexed (a process that categorizes the data and stores metadata for later searching) and then stored in Splunk's database. Now, with a lake of data, what's next? A common pattern in a machine learning pipeline is ETL (extract, transform, load): turning raw data into a dataset. Splunk offers tools for carrying out ETL step by step, for example the search tool.


The search tool on the Splunk platform is used to extract potentially useful data and apply built-in visualization or aggregation tools to discover new insights in the selected data. An example query is shown below. Given a selection query, the data can then be piped into an aggregation command using "|". To pipe the data simply means to take the output of one operation as the input of the next.


Example search query: index=systems sourcetype=audit_logs user=svc_*
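The original post shows the full query only as a screenshot. Based on the breakdown below, a plausible reconstruction looks like this (the BY user grouping, the one-hour threshold, and the first_seen/last_seen names are illustrative guesses, not the original values):

index=systems sourcetype=audit_logs user=svc_*
| stats earliest(_time) AS first_seen latest(_time) AS last_seen BY user ``` collect the first/last timestamp per user ```
| where last_seen - first_seen > 3600 ``` hypothetical filter on the time range ```
| eval first_seen=strftime(first_seen, "%Y-%m-%d %H:%M:%S"), last_seen=strftime(last_seen, "%Y-%m-%d %H:%M:%S") ``` transform the timestamp format ```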

In the above example, a simple ETL is performed:

  1. Extract: select data from the systems index, where the source is the audit_logs file, keeping only the rows whose user field starts with svc_

  2. Transform: on the second line, stats collects statistics such as the earliest/latest timestamp. The result of the second line is then piped into the third line, which filters on the time range. The last line transforms the format of the timestamp.

  3. Load: the resulting data is loaded into a table for further use.

 

Movie Recommendation System with Splunk


In the context of how Splunk can be used in a movie recommendation system, tools other than search can also be utilized: for example, to analyze data quality and to extract rating data from the raw data.


The raw data comes from a Kafka stream. We can automatically pipe every row from the Kafka topic into Splunk to build a raw dataset. Then, using the search tool, we can keep the rows whose type is "rating" and filter out malformed data with regular expressions or simple if-else checks. Finally, the resulting data can be exported as a rating dataset containing the useful fields: user_id, movie_id, rating_score, and time. A sketch of such a search is shown below.
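As a sketch, assuming the Kafka events land in a hypothetical index named movielog (the index name is made up; the field names are the ones listed above), the extraction could look like:

index=movielog ``` hypothetical index holding the raw Kafka rows ```
| where type="rating" ``` keep only rating events ```
| regex rating_score="^[0-5]$" ``` drop malformed scores; the 0-5 scale is an assumption ```
| table user_id, movie_id, rating_score, time ``` load the useful columns into a table ```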

[Figure: the extracted rating data]

The pivot table is another tool for further analyzing the rating data. We can split the rows by rating score to narrow the scope of our analysis, roughly as in the sketch below.
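The pivot interface is point-and-click, but the row split it performs corresponds roughly to a grouped count in SPL (reusing the hypothetical movielog index):

index=movielog type="rating"
| stats count BY rating_score ``` one row per rating score, mirroring the pivot's row split ```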


[Figure: splitting rows by rating score]


One can even extract new fields from an existing column. Time can be broken down into year, month, day, hour, minute, and second. Or you can compute the mean and variance of the ratings.
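Both ideas can be sketched in SPL, assuming the event timestamp sits in Splunk's default _time field:

index=movielog type="rating"
| eval year=strftime(_time, "%Y"), month=strftime(_time, "%m"), day=strftime(_time, "%d"), hour=strftime(_time, "%H") ``` break the timestamp into new fields ```
| stats avg(rating_score) AS mean_rating var(rating_score) AS var_rating ``` mean and variance of the ratings ```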

[Figure: extracting the mean and variance]

In this way, data quality can be easily measured as the number of rows that survive the filters divided by the total number of rows; some feature engineering steps can also be computed on the fly.
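For example, the share of well-formed rows can be computed in a single search (a sketch, reusing the hypothetical fields above):

index=movielog type="rating"
| eval is_valid=if(match(rating_score, "^[0-5]$"), 1, 0) ``` 1 for well-formed rows, 0 for malformed ones ```
| stats sum(is_valid) AS valid_rows count AS total_rows
| eval quality=round(valid_rows/total_rows, 3) ``` fraction of rows that pass the filter ```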


Moreover, data visualization can also be performed easily by selecting the corresponding fields, which in turn helps with monitoring data quality.
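For instance, an hourly time chart of the average rating (again with the hypothetical fields above) doubles as a quality monitor, since a sudden shift in the line is easy to spot:

index=movielog type="rating"
| timechart span=1h avg(rating_score) AS mean_rating ``` hourly average rating, rendered as a line chart ```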



 

Strengths

  • Splunk provides a solution for storing the ever-growing data so it can be analyzed later

  • Search is very fast because indexes are used to categorize the data

  • Data manipulation, visualization, and monitoring are easy with just a few commands

Limitations


