Beam’s Pico WordCount, Big Data Institute

There’s a friendly game among Big Data frameworks: what’s the fewest lines of code it takes to write WordCount?

I’m a committer on Apache Beam, and I spend most of my time making Beam easier for developers to use. I also help explain Beam in articles and in conference sessions.

One of my recent commits improved the regular expression handling in Beam. I wanted to make commonly used regular expression operations easier and built-in.

Here is the smallest WordCount I could create using Beam.

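import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Regex;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PicoWordCount {
    public static void main(String[] args) {

        PipelineOptions options = PipelineOptionsFactory.create();
        Pipeline p = Pipeline.create(options);

        p
                // Read each line of the input file.
                .apply(TextIO.Read.from("playing_cards.tsv"))
                // Split each line into words on runs of non-word characters.
                .apply(Regex.split("\\W+"))
                // Count how many times each word occurs.
                .apply(Count.perElement())
                // Format each KV<String, Long> as a "word:count" String.
                .apply(MapElements.via((KV<String, Long> count) ->
                            count.getKey() + ":" + count.getValue()
                        ).withOutputType(TypeDescriptors.strings()))
                // Write the formatted lines to files prefixed with stringcounts.
                .apply(TextIO.Write.to("output/stringcounts"));

        p.run();
    }
}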

The first step is to read in the file. In this case, the file is playing_cards.tsv.

After that, we use the new Regex transform. This is the transform I recently committed to make regular expressions easier to use. If you think along the lines of Java’s String.split() method, that’s exactly what I’m doing here, except that this code runs in a distributed system across many different processes.
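To make the String.split() comparison concrete, here is a minimal local sketch in plain Java. It is not Beam code and runs in a single process; the class name SplitSketch, the helper splitWords, and the sample line are made up for illustration. It only shows what splitting on the "\\W+" pattern does to one line of text.

```java
import java.util.ArrayList;
import java.util.List;

// Local, single-process analogy only; this is not Beam code.
class SplitSketch {
    // Splits one line on runs of non-word characters, using the same
    // pattern that the pipeline passes to Regex.split.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.split("\\W+")) {
            // String.split yields a leading empty string when the line
            // starts with a separator; this sketch chooses to skip it.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints [Ace, of, Spades, Queen, of, Hearts]
        System.out.println(splitWords("Ace\tof Spades, Queen of Hearts"));
    }
}
```

In the pipeline, the same splitting is applied to every line of the input, and the resulting words flow on to the next transform.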

Next, we count the words that we’ve split up.
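As a rough picture of what counting per element means, here is a small plain-Java sketch. Again, this is a local analogy and not how Beam implements Count.perElement(); the class CountSketch and the method countPerElement are hypothetical names I picked for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local, single-process analogy only; this is not Beam code.
class CountSketch {
    // Returns each distinct word paired with the number of times it occurs.
    static Map<String, Long> countPerElement(List<String> words) {
        // TreeMap keeps keys sorted so the printed output is stable.
        Map<String, Long> counts = new TreeMap<>();
        for (String word : words) {
            // Start at 1 for a new word, otherwise add 1 to the running total.
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {Ace=1, Spades=1, of=2}
        System.out.println(countPerElement(Arrays.asList("Ace", "of", "Spades", "of")));
    }
}
```

Each map entry here plays the role of one KV<String, Long> in the pipeline's output.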

The next section is something that I’m looking at now. There isn’t a built-in way to take a KV object and make it a String. The MapElements.via() call manually takes each KV<String, Long> and turns it into a String. Another way of saying this is that I’m taking a PCollection<KV<String, Long>> and manually converting it to a PCollection<String>. I have to do this because TextIO.Write.to() requires a PCollection<String>. Look for some API improvements on this front.
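The formatting step itself is tiny. Here is a plain-Java sketch of the same key-and-value concatenation, using java.util.Map.Entry as a stand-in for Beam's KV; the class FormatSketch and the method format are hypothetical names for illustration, not part of any Beam API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Local analogy only; Map.Entry stands in for Beam's KV<String, Long>.
class FormatSketch {
    // Joins the word and its count with a colon, as the lambda in the pipeline does.
    static String format(Map.Entry<String, Long> count) {
        return count.getKey() + ":" + count.getValue();
    }

    public static void main(String[] args) {
        // Prints of:2
        System.out.println(format(new SimpleEntry<>("of", 2L)));
    }
}
```

In the pipeline, the extra withOutputType(TypeDescriptors.strings()) call is there to tell Beam that the lambda produces Strings.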

Finally, we write out the words and counts. These will be placed in the output directory, in files whose names start with stringcounts.

Once we’ve set up all of the transforms, we can run the Pipeline.

If you are interested in a course on Apache Beam, please sign up to be notified once it’s out.


© 2024 Big Data Institute
