I spoke at Strata+Hadoop World two weeks ago on Kafka. There were three main themes I came away with from the conference: real-time Big Data is the (present) future, we should be programming against intermediary libraries instead of directly to an API, and applied AI is the (present) future.
Companies are realizing it’s possible to handle Big Data in real-time, also known as streaming. They’re finding that going real-time gives them an advantage and agility they didn’t have before.
I spoke at Galvanize in San Francisco on how real-time systems are going to change Data Science. The agility that real-time systems like Kafka give us allows Data Scientists to get from hypothesis to production more quickly. Instead of being limited to one model, a real-time system allows us to run and score with several models at the same time. Consuming systems can choose, in real-time, whichever model is performing best at any given moment.
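In a Kafka-backed system, each model’s scores would arrive on the stream alongside the outcomes; the selection logic itself is simple. Here is a minimal sketch in plain Python, with two hypothetical toy models and a rolling-accuracy selector (none of these names come from an actual library):

```python
from collections import deque

# Two hypothetical toy models; a real system would score events
# streamed through Kafka rather than a local list.
def model_a(x):
    return x > 0.5

def model_b(x):
    return x > 0.3

class ModelSelector:
    """Tracks rolling accuracy per model and picks the current best."""
    def __init__(self, window=100):
        self.history = {"a": deque(maxlen=window), "b": deque(maxlen=window)}

    def record(self, name, correct):
        self.history[name].append(1 if correct else 0)

    def best(self):
        def accuracy(name):
            h = self.history[name]
            return sum(h) / len(h) if h else 0.0
        return max(self.history, key=accuracy)

selector = ModelSelector()
# Simulated labeled stream events: (feature, true label)
stream = [(0.9, True), (0.4, True), (0.2, False), (0.6, True), (0.35, True)]
for x, label in stream:
    selector.record("a", model_a(x) == label)
    selector.record("b", model_b(x) == label)

best = selector.best()  # the consumer routes traffic to this model
```

The rolling window matters: a consuming system keyed on recent accuracy can switch models as conditions drift, without redeploying anything.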
We’re seeing second-, third-, and later-generation APIs that support real-time from the beginning. These are APIs like Spark Streaming, Apache Flink, and Apache Kafka. Kafka is making real-time processing easier with the new Kafka Streams library. We’ll see a proliferation of companies switching batch use cases to real-time with these technologies.
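To illustrate the streaming model these libraries offer, here is the canonical word-count example as a plain-Python sketch (not the Kafka Streams API itself): state is updated and a result emitted per record, rather than once at the end of a batch.

```python
from collections import defaultdict

def streaming_word_count(records):
    """Consume text records one at a time, emitting the updated count
    for each word as it arrives -- the kind of continuously updated
    aggregation Kafka Streams maintains for you."""
    counts = defaultdict(int)
    for record in records:
        for word in record.lower().split():
            counts[word] += 1
            yield word, counts[word]

# Each incoming record immediately produces updated counts downstream.
updates = list(streaming_word_count(["kafka streams", "kafka rocks"]))
```

Note that `("kafka", 1)` and `("kafka", 2)` are both emitted; downstream consumers always see the latest state without waiting for a batch job to finish.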
No More Direct APIs
I was eating lunch and talking with my random tablemates. Two of them were excited to go back to work and rewrite their systems to use the new frameworks they’d just learned about. The engineer in me thinks “awesome!” The business person in me thinks “I hope they run it by their manager first. That’s a massive time sink and probably a waste of time.”
That conversation should have gone much differently. Had those engineers gone through my class, they would have used Apache Crunch or another intermediary API. They would have said “I’m going to update a single line of code to use a different execution engine.” The two statements are vastly different. By changing only one line of code, they could test out MapReduce or Spark.
So far, I’ve been in the minority with my opinion that you shouldn’t program directly to an API like Spark’s or MapReduce’s. Don’t get me wrong, you should know how Spark and MapReduce work; I’m saying you shouldn’t program directly against their APIs. Instead, you should use one of the new intermediary APIs. As companies complete their rewrites from Hadoop MapReduce to Spark, they’re starting to understand this need. They shouldn’t have had to rewrite. It should have been a change of execution engines.
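The idea in miniature looks like this. The sketch below is hypothetical Python, not Crunch’s or Beam’s actual API: the pipeline logic is written once against an abstraction, and swapping execution engines is the promised one-line change.

```python
# Hypothetical intermediary-API sketch (not a real library).
class Engine:
    def run(self, data, fns):
        raise NotImplementedError

class LocalEngine(Engine):      # stand-in for a MapReduce backend
    def run(self, data, fns):
        for fn in fns:
            data = [fn(x) for x in data]
        return data

class ParallelEngine(Engine):   # stand-in for a Spark backend
    def run(self, data, fns):
        # A real engine would distribute the work; semantics are the same.
        for fn in fns:
            data = list(map(fn, data))
        return data

class Pipeline:
    """Pipeline logic is engine-agnostic; only construction names an engine."""
    def __init__(self, engine):
        self.engine = engine
        self.fns = []

    def map(self, fn):
        self.fns.append(fn)
        return self

    def run(self, data):
        return self.engine.run(data, self.fns)

# Switching engines is the one-line change:
pipeline = Pipeline(LocalEngine())   # -> Pipeline(ParallelEngine())
result = pipeline.map(lambda x: x * 2).map(lambda x: x + 1).run([1, 2, 3])
```

This is exactly what Crunch does with its `MRPipeline` and `SparkPipeline` classes: the transformations don’t change, only the pipeline’s constructor does.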
I talked to other companies whose code needs to run on both MapReduce and Spark. How do they handle this? Create two codebases or artifacts? No, they’re (now) using intermediary APIs. I’ve talked about Crunch, but I want to talk about the new intermediary APIs from Strata+Hadoop World.
Apache Beam looks very promising. It was originally called Dataflow and comes from Google. You can see some examples of their API here. One big feature is that it supports both batch and real-time from the same API. Other commercial companies are creating APIs that allow their customers to run on any framework. Arimo is one such company with its API. There are some notable downsides to intermediary APIs, though.
For Spark, you’re going to be missing Spark SQL. IMHO, that’s one of Spark’s biggest draws, and Beam/Crunch/etc. won’t have it. Personally, I used SQL most often for joining datasets, and both Crunch and Beam have vastly easier, built-in join functions. You might miss these features less than you thought.
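For a sense of what those built-in joins buy you, here is a simplified plain-Python sketch of a keyed inner join, roughly the shape of work that Crunch’s and Beam’s join helpers handle for you (the datasets here are made up for illustration):

```python
from collections import defaultdict

def join_by_key(left, right):
    """Inner join of two (key, value) datasets; a simplified sketch of
    what intermediary APIs' built-in join functions do."""
    right_index = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_index.get(key, [])]

users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined = join_by_key(users, orders)
```

With SQL you’d write the same thing as `SELECT ... JOIN ... ON`, but in an intermediary API the join is a single library call over keyed collections, so losing Spark SQL hurts less than it first appears.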
We’re starting to move from machine learning (ML) that is very specific to a single company toward more general-purpose uses. This will let companies use ML without a large investment in developing their own, and let developers apply AI in different or more imaginative ways. We’ll start to see some really interesting and valuable ML-powered products for end users.