Today’s blog post comes from a question sent in by a subscriber to my mailing list, Guruprasad B.R.:

What are the best ways to Ingest data in to Big Data (HBase/HDFS) from different sources like FTP, Web, Email, RDBMS,..etc

There are a couple of parts to this question, and they’re technical:

Sqoop

I’ll start off with the easy one. How do you get data from an RDBMS into HDFS and HBase? You’d use Apache Sqoop. It can take data from an RDBMS and put it into either HDFS or HBase.

It can go the other way around too. Sqoop can move data from HDFS or HBase and put it back into the RDBMS.
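Here’s a minimal sketch of what those Sqoop runs could look like, written as Python helpers that only build the command lines. The JDBC URL, table names, user, password file, and HDFS paths are hypothetical placeholders, and the export helper covers only the case of reading from an HDFS directory. The flags are standard Sqoop 1 arguments, but check them against the version you have installed.

```python
import shlex


def sqoop_import_cmd(jdbc_url, table, username, password_file,
                     target_dir=None, hbase_table=None, column_family=None):
    """Build the argument list for a `sqoop import` run.

    Pass target_dir to land the table in HDFS, or hbase_table plus
    column_family to load it into HBase instead.
    """
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
    ]
    if hbase_table is not None:
        if column_family is None:
            raise ValueError("column_family is required for an HBase import")
        cmd += ["--hbase-table", hbase_table, "--column-family", column_family]
    elif target_dir is not None:
        cmd += ["--target-dir", target_dir]
    else:
        raise ValueError("give either target_dir or hbase_table")
    return cmd


def sqoop_export_cmd(jdbc_url, table, username, password_file, export_dir):
    """Build the argument list for a `sqoop export` from an HDFS directory."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--password-file", password_file,
        "--table", table,
        "--export-dir", export_dir,
    ]


# Hypothetical connection details, for illustration only.
JDBC_URL = "jdbc:mysql://db.example.com/shop"
USER = "etl"
PASSWORD_FILE = "/user/etl/.db-password"

to_hdfs = sqoop_import_cmd(JDBC_URL, "orders", USER, PASSWORD_FILE,
                           target_dir="/data/raw/orders")
to_hbase = sqoop_import_cmd(JDBC_URL, "orders", USER, PASSWORD_FILE,
                            hbase_table="orders", column_family="d")
to_rdbms = sqoop_export_cmd(JDBC_URL, "order_totals", USER, PASSWORD_FILE,
                            export_dir="/data/out/order_totals")

# Print the commands so they can be reviewed before anything is executed.
for cmd in (to_hdfs, to_hbase, to_rdbms):
    print(shlex.join(cmd))
```

To actually run one of these on a machine with Sqoop installed, you could hand the list to `subprocess.run(cmd, check=True)`.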

Simple File Transfer

There are a few ways to do simple file transfers into HDFS. You could use:

The right tool for the job depends on your use case.
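As one illustration of a simple file transfer, here’s a sketch that copies a local file into HDFS with the stock `hdfs dfs` shell commands. This is just one possible option, assuming the `hdfs` client is installed and configured on the machine doing the upload. The file and directory names are hypothetical.

```python
import subprocess


def hdfs_put_cmds(local_path, hdfs_dir):
    """Return the shell commands that copy one local file into an HDFS directory."""
    return [
        # Create the destination directory (and any parents) if it is missing.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the file, overwriting an existing copy with the same name.
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],
    ]


def upload(local_path, hdfs_dir, dry_run=True):
    """Run the copy, or only return the commands when dry_run is True."""
    cmds = hdfs_put_cmds(local_path, hdfs_dir)
    if not dry_run:
        for cmd in cmds:
            # check=True raises CalledProcessError if a command fails.
            subprocess.run(cmd, check=True)
    return cmds


# Hypothetical paths; dry_run=True means nothing is executed here.
planned = upload("/tmp/export/events.csv", "/data/raw/events")
for cmd in planned:
    print(" ".join(cmd))
```

Calling `upload(..., dry_run=False)` would execute the same commands for real.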

Getting Data In

The far more difficult problem is how to use the data or get it into HBase. For that, you’ll need to write custom code. The suggestions above only get you to the point where you’re using HDFS as a backup. The real value is working with the data.
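To give a feel for that custom code, here’s a small sketch of one piece of it: turning CSV records into HBase-style rows, with a row key and a dictionary of `family:qualifier` cells. The column family `d`, the sample data, and the choice of the `id` field as the row key are all assumptions made for illustration. Real row key design depends heavily on how you’ll read the data.

```python
import csv
import io


def csv_to_hbase_rows(text, key_field, family="d"):
    """Yield (row_key, cells) pairs from CSV text that has a header line.

    cells maps b"family:qualifier" to the value as bytes, since HBase
    stores row keys, column names, and values as raw bytes.
    """
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        row_key = record[key_field].encode("utf-8")
        cells = {
            f"{family}:{name}".encode("utf-8"): value.encode("utf-8")
            for name, value in record.items()
            if name != key_field  # the key is stored as the row key, not a cell
        }
        yield row_key, cells


# Hypothetical sample data.
sample = "id,name,city\n42,Ada,London\n43,Grace,New York\n"
rows = list(csv_to_hbase_rows(sample, key_field="id"))

for row_key, cells in rows:
    print(row_key, cells)
```

With a Python client such as HappyBase, each pair could then be written with `table.put(row_key, cells)`. A production pipeline would more likely use the HBase Java API or a processing framework, but the shape of the problem is the same.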

The programs you need to write and the right tools for the job depend on your use case. This is where qualified Data Engineers are important. They’ll help the team understand the use case and how the data pipeline should be created.