Today’s blog post answers a question from a subscriber on my mailing list. The question comes from Guruprasad B.R.:
What are the best ways to Ingest data in to Big Data (HBase/HDFS) from different sources like FTP, Web, Email, RDBMS,..etc
There are a couple of parts to this question, and they’re all technical:
- How do I get data into HDFS?
- How do I get data into HBase?
- How does the source of data dictate how it’s ingested?
Sqoop
I’ll start off with the easy one. How do you get data from an RDBMS into HDFS and HBase? You’d use Apache Sqoop. It can take data from an RDBMS and put it into either HDFS or HBase.
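It can go the other way around too. Sqoop can export data from HDFS and put it back into the RDBMS. Data that lives in HBase generally has to be written out to HDFS first, because Sqoop’s export reads files from HDFS.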
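As a rough illustration of the import side, here is a minimal sketch that assembles `sqoop import` command lines in Python. The JDBC URL, user name, table, HDFS path, and column family are all made-up values, and the helper function `sqoop_import_args` is my own invention, not part of Sqoop. The flags themselves are standard `sqoop import` options.

```python
import shlex


def sqoop_import_args(jdbc_url, username, table, target_dir=None,
                      hbase_table=None, column_family=None, row_key=None):
    """Build the argument list for a `sqoop import` run (hypothetical helper).

    Pass target_dir for an HDFS import, or hbase_table and column_family
    for an HBase import.
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "-P",                     # prompt for the password at run time
        "--table", table,
    ]
    if hbase_table:
        args += ["--hbase-table", hbase_table,
                 "--column-family", column_family]
        if row_key:
            args += ["--hbase-row-key", row_key]
    else:
        args += ["--target-dir", target_dir]
    return args


# Hypothetical database, table, and paths, for illustration only.
to_hdfs = sqoop_import_args("jdbc:mysql://db.example.com/shop", "etl",
                            "orders", target_dir="/data/raw/orders")
to_hbase = sqoop_import_args("jdbc:mysql://db.example.com/shop", "etl",
                             "orders", hbase_table="orders",
                             column_family="d", row_key="order_id")

print(shlex.join(to_hdfs))
print(shlex.join(to_hbase))
```

On a machine with Sqoop installed, you could pass either list to `subprocess.run(args, check=True)` instead of printing it.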
Simple File Transfer
There are a few ways to do simple file transfers into HDFS. You could:
- Use Apache Oozie to move files as part of a workflow
- Use Hue’s REST interface
- Use Hadoop’s WebHDFS REST or FUSE interfaces
- Write a custom program that implements FTP, HTTP, etc. and puts the files into HDFS with the HDFS API
The right tool for the job depends on your use case.
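To make the WebHDFS option a little more concrete, here is a minimal sketch of a file upload using only Python's standard library. WebHDFS creates a file in two steps: the NameNode answers the first `PUT` with a redirect to a DataNode, and the file contents go to that second URL. The function names are my own, and the sketch assumes a cluster without Kerberos where the `user.name` query parameter is accepted.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def webhdfs_create_path(hdfs_path, user, overwrite=False):
    """Return the request path and query string for a WebHDFS CREATE call."""
    query = urlencode({
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "user.name": user,   # assumes simple (non-Kerberos) authentication
    })
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def upload(namenode_host, namenode_port, local_path, hdfs_path, user):
    """Upload one local file to HDFS with the two-step WebHDFS CREATE."""
    # Step 1: ask the NameNode where to write. It replies with a 307
    # redirect to a DataNode; no file data is sent in this request.
    nn = http.client.HTTPConnection(namenode_host, namenode_port)
    nn.request("PUT", webhdfs_create_path(hdfs_path, user))
    resp = nn.getresponse()
    resp.read()
    location = resp.getheader("Location")
    nn.close()
    if resp.status != 307 or not location:
        raise RuntimeError(f"expected a 307 redirect, got {resp.status}")

    # Step 2: send the file contents to the DataNode URL from the redirect.
    # The whole file is read into memory, so this suits small files only.
    with open(local_path, "rb") as f:
        data = f.read()
    target = urlsplit(location)
    dn = http.client.HTTPConnection(target.hostname, target.port)
    dn.request("PUT", target.path + "?" + target.query, body=data)
    resp = dn.getresponse()
    resp.read()
    dn.close()
    if resp.status != 201:
        raise RuntimeError(f"upload failed with HTTP {resp.status}")
```

A call might look like `upload("namenode.example.com", 9870, "orders.csv", "/data/raw/orders.csv", "etl")`. The host, user, and paths are placeholders, and the NameNode HTTP port depends on your Hadoop version and configuration.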
Getting Data In
The far more difficult problem is how to use the data or get it into HBase. For that, you’ll need to write custom code. The suggestions above only get you to the point where you’re using HDFS as a backup. The real value is working with the data.
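To give a feel for what that custom code can look like, here is a minimal sketch that writes a single row into HBase through the HBase REST server, using only Python's standard library. It is one option among several (the Java client API is the more common choice), and it assumes the REST server is running and reachable. The function names, table, and column names are all hypothetical.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def b64(text):
    """Base64-encode a string, as the HBase REST JSON format expects."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def row_payload(row_key, cells):
    """Build the JSON body for one row.

    cells maps 'family:qualifier' column names to string values.
    """
    return {
        "Row": [{
            "key": b64(row_key),
            "Cell": [{"column": b64(col), "$": b64(val)}
                     for col, val in cells.items()],
        }]
    }


def put_row(rest_url, table, row_key, cells):
    """Write one row through the HBase REST server and return the status."""
    req = urllib.request.Request(
        f"{rest_url}/{quote(table)}/{quote(row_key, safe='')}",
        data=json.dumps(row_payload(row_key, cells)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A call might look like `put_row("http://hbase-rest.example.com:8080", "orders", "order-1001", {"d:status": "shipped"})`, where the URL, table, and column are placeholders. The real work in a pipeline is deciding on the row key and column layout, which depends on how the data will be read.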
The programs you need to write and the right tools for the job depend on your use case. This is where qualified Data Engineers are important. They’ll help the team understand the use case and how the data pipeline should be created.