Person who chases two rabbits catches neither. – Confucius
This really applies to learning. Try to learn two new and different technologies at the same time and you'll catch neither. I’ve seen so many students try to learn Big Data and a new programming language at the same time. A few succeed, but most fail.
Why Two?
I see this consistently with Java and Apache Spark. Java programmers arrive, see Spark's Java API, and decide to learn Scala instead, because the Scala code is so much more concise and readable. And who can blame them? Take a look at this code:
// Each input line is assumed to be "value<TAB>suit"
JavaPairRDD<String, Integer> suitsAndValues = input.mapToPair(
    new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String w) throws Exception {
        String[] split = w.split("\t");
        int cardValue = Integer.parseInt(split[0]);
        String cardSuit = split[1];
        return new Tuple2<String, Integer>(cardSuit, cardValue);
      }
    });
// Sum the card values for each suit
JavaPairRDD<String, Integer> counts = suitsAndValues.reduceByKey(
    new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer total, Integer amount) throws Exception {
        return total + amount;
      }
    });
It’s as if you took everything wrong with Java’s verbosity and multiplied it by two. When I first looked at Spark’s Java API, I wrote it off too. I saw code snippets like this one and knew it just wouldn’t fly.
Take Another Look
I looked at Spark’s Java API with Java 7 eyes and saw Java 7 code examples. Then, I learned about Java 8’s lambdas and I took another look at Spark’s Java 8 API. This is what I saw:
// Each input line is assumed to be "value<TAB>suit"
JavaPairRDD<String, Integer> suitsAndValues = input.mapToPair(w -> {
  String[] split = w.split("\t");
  int cardValue = Integer.parseInt(split[0]);
  String cardSuit = split[1];
  return new Tuple2<>(cardSuit, cardValue);
});
// Sum the card values for each suit
JavaPairRDD<String, Integer> counts = suitsAndValues.reduceByKey(
    (total, amount) -> total + amount);
I could actually read the code. I didn’t have to wade through anonymous class code. It was concise and worked well.
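If you want to see what the lambda actually replaces, here is a minimal sketch in plain Java with no Spark dependency. It assumes java.util.function.BinaryOperator as a stand-in for Spark's Function2, and the class and method names are made up for illustration. Both methods return the same summing function; only the syntax differs.

```java
import java.util.function.BinaryOperator;

// Stand-in sketch: BinaryOperator<Integer> plays the role that Spark's
// Function2<Integer, Integer, Integer> plays in reduceByKey.
class LambdaVersusAnonymousClass {

    // Java 7 style: an anonymous class with one overridden method.
    static BinaryOperator<Integer> anonymousSum() {
        return new BinaryOperator<Integer>() {
            @Override
            public Integer apply(Integer total, Integer amount) {
                return total + amount;
            }
        };
    }

    // Java 8 style: the lambda keeps only the parameters and the body.
    static BinaryOperator<Integer> lambdaSum() {
        return (total, amount) -> total + amount;
    }

    public static void main(String[] args) {
        System.out.println(anonymousSum().apply(2, 3)); // prints 5
        System.out.println(lambdaSum().apply(2, 3));    // prints 5
    }
}
```

The compiler can accept the lambda wherever a single-method interface is expected, which is why the same shorthand works when the target is one of Spark's function interfaces.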
Spark and Java 8
With code like this, you don’t have to learn Scala anymore. You can use your existing Java skills, IDE, and experience to write beautiful code.
More importantly, you’re focusing your energy on just learning Spark. I’ve seen so many people fail trying to learn Spark and Scala at the same time. Learning two difficult things at the same time is just a non-starter.
What Am I Missing Out On?
There are several common questions I get from those wanting to learn Spark with Java:
Are there any missing API features? No. I’ve gone through the API, writing code samples and testing them, and I haven’t found an issue. Sometimes, you’ll find that things are easier in Scala. This is often because the API's authors designed it with Scala as the main interface, rather than thinking about how Java code would use it. An example of this is unit testing Spark with Java.
What does Spark on Scala do that Java can’t? Java 8 doesn’t have an interactive shell (REPL) the way Scala does, so you won’t be doing any interactive coding. That means you can’t use the Notebooks (yet).
How hard are Java 8 lambdas to learn? Not hard once they’re explained well. Some of the programming articles I’ve read don’t explain lambdas well. My Professional Data Engineering course has an explanation that’s easy to understand and use.
Should you learn Scala? My advice is to learn Spark first and use your Java skills. If you find yourself needing more interactive, exploratory coding, try using more Spark SQL. If that still doesn’t help you, learn Scala.
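One way to soften the unit testing pain mentioned above is to pull the body of a lambda into a named static method and test that method with no Spark context at all. The sketch below is only an illustration under some assumptions: CardParser and parseCard are hypothetical names, the input is assumed to be "value<TAB>suit" lines, and Map.Entry stands in for Spark's Tuple2 so the example runs on the JDK alone.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper class; not part of Spark or of the original example.
class CardParser {

    // Parses one "value<TAB>suit" line into a (suit, value) pair.
    // Map.Entry is a stand-in for Spark's Tuple2.
    static Map.Entry<String, Integer> parseCard(String line) {
        String[] split = line.split("\t");
        int cardValue = Integer.parseInt(split[0]);
        String cardSuit = split[1];
        return new SimpleEntry<>(cardSuit, cardValue);
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> card = parseCard("7\tHearts");
        System.out.println(card.getKey() + " " + card.getValue()); // prints "Hearts 7"
    }
}
```

In a real job, the lambda passed to mapToPair would call a method like this and wrap the result in a Tuple2, so the parsing logic gets tested as ordinary Java and only the wiring needs Spark.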
What To Do Now?
There aren’t many Java and Spark resources out there.
Holden’s Learning Spark book has some Java coverage. However, its Java code examples use the verbose Java 7 syntax.
My Professional Data Engineering course covers Spark using Java. All code slides, example code, and sample solution code only use Java 8 lambdas. During the class, I teach how Java 8 lambdas work and how to use them effectively with Spark.