Search
Close this search box.

There’s this friendly game in Big Data frameworks. It’s what’s the fewest lines of code it takes to do WordCount.

I’m a committer on Apache Beam and most of my time is dedicated to making things easier for developers to use Beam. I also help explain Beam in articles and in conference sessions.

One of my recent commits was to improve the regular expression handling in Beam. I wanted to make commonly used regular expression activities easier and built-in.

Here is the smallest WordCount I could create using Beam.

public class PicoWordCount {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
                .apply(MapElements.via((KV<String, Long> count) ->
                            count.getKey() + ":" + count.getValue()
                        ).withOutputType(TypeDescriptors.strings()))
                .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
    }
}

The first step is read in the file. In this case, the file is playing_cards.tsv.

After that, we use the new Regex transform. This is the transform I recently committed to make it easier to do regular expressions. If you think along the lines of Java’s String.split() method, that’s exactly what I’m doing here. Except this code is running in a distributed system across many different processes.

Next, we count the words that we’ve split up.

The next section is something that I’m looking at now. There isn’t a built-in way to take a KV object and make it a String. The MapElements.via() method is manually taking the KV<String, Long> and making it a String. Another way of saying this is that I’m taking a PCollection<KV<String, Long>> and manually converting it to a PCollection<String>. I have to do this because the TextIO.Write.to() requires a PCollection<String>. As I said, look for some API improvements on this front.

Finally, we write out the words and counts. These will be placed in the output directory with files prefixed as stringcounts.

Once we’ve setup all of the transforms, we can run the Pipeline.

If you interested in a course on Apache Beam, please sign up to be notified once it’s out.