We helped Airbus create a real-time big data project streaming 2+ billion events per day
- The system processes 2,000+ messages per second from aircraft all over the world
- We used Microsoft Azure managed services such as Event Hubs and Azure Kubernetes Service (AKS)
- The system uses open source technologies such as Apache Kafka, Apache Flink, Apache Avro, and Apache Spark.
Heiko Udluft: 00:01 Hello and welcome to the session. Thanks all for joining. My name is Heiko Udluft, and together with Jesse Anderson I’m going to talk about Airbus AirSense and how we make more of the sky using Azure services. We’d like to start by talking briefly about the main takeaways, so if you only want to pay attention to one slide, these are the four things that I would like you to remember from this talk. The first is that we at Airbus AirSense tackle some really challenging data problems. From the data engineering point of view, the amount of data we’re dealing with, the size of the data, and the number of messages we receive per second are really high, and I’ll go into that in a couple of minutes. The analytics we want to extract from it are also really challenging, because you can do a lot of pattern recognition, machine learning, and many other advanced analytics applications that are really interesting.
Heiko Udluft: 00:52 The other thing I would like to highlight is that we are in the space of air traffic management. That is the question: how do you get from your origin airport to your destination airport as safely, efficiently, and on time as possible? If we do a good job there, we are impacting billions of passengers globally, so it’s a really interesting and relevant domain to work in. The next one is how we implemented real-time big data processing on Azure, and that’s a takeaway you will have at the end of the session. And the last one is that we got really good support from Azure for our development. The support in general has been really good, and the collaboration worked well. So let me set the stage a bit first and introduce ourselves. As I said before, my name is Heiko Udluft and I’m the Tech Team Lead for Airbus AirSense. My background is air traffic management. I’m an aerospace engineer by training, and then I moved more and more into digitalization and am now leading the tech team for Airbus AirSense.
Jesse Anderson: 01:51 My name is Jesse Anderson. I am the Managing Director at Big Data Institute. At my company, we help other companies gain the maximum value from their data, from their big data. In this case, we helped with creating the architecture as well as the implementation, and we helped them with figuring out the teams and some of the positioning for what the data product should look like.
Heiko Udluft: 02:13 Thanks, Jesse.
Heiko Udluft: 02:16 So some numbers about air transportation. As I said before, we are touching billions of people, and I’m always quite impressed by these. The air transportation domain as a whole is currently serving about 56,000 routes globally. Those are point-to-point connections all over the globe, on an annual basis. There are some shifts, but that’s the total number being served this year. We are moving about 3.6 billion passengers. That is not individual humans: we’re not touching 3.6 billion humans, but 3.6 billion passengers, so if the same person takes multiple trips, they all count here. How do we move all of them? We have to have our aircraft flying in the air for quite a while, so we’re flying a total of 69 million flight hours per year.
Heiko Udluft: 02:59 And that’s actually the data we are working with. I’ll show that later. From this come quite a few challenges. Has anybody in this room experienced a delay in the last year? Anybody? Same with me. Some of these are avoidable, and some come simply from an infrastructure system that is at capacity in some places. If you look at an airport like LaGuardia, it has an arrival every two minutes or so on average, 24 hours a day. So they really have high demand, and some of the delay effects you see right now are triggered by the system being at capacity. And we not only have to overcome the constraints we have right now. The fourth one I have here is that we need to somehow accommodate air traffic doubling every 15 years, and within 15 years we’re not going to build double the number of airports, hire double the number of air traffic controllers, or build aircraft twice as big. That’s not happening. So we need to come up with other ways to use the resources we have right now more efficiently. That’s exactly where our products want to help and contribute, so that all the stakeholders affected by these numbers actually get better.
Heiko Udluft: 04:19 So what are the services that we are building? There are two primary ones. One is our data services, where we are building real-time tracking and event detection that allows us to track the position of any aircraft anywhere, anytime around the globe. This is a great tool for the stakeholders I talked about before. It’s actually not known to everybody what it’s like when you’re an air traffic controller. That’s the person who sits, for instance, at Seattle airport and manages all the traffic around that airport. He actually only has a relatively local view of about 200 nautical miles around his airport, but he has flights coming in from around the globe. So he needs better visibility of these flights, and that’s exactly what we are building with these data services: increasing situational awareness for everybody who is in contact with a flight and is operating a flight.
Heiko Udluft: 05:13 The second one is our analytics services. We are building real-time actionable insights for decision support in aviation. Here we are talking about real-time alerting to synchronize the different stakeholders in the system, post-operations analytics to better understand the performance bottlenecks you have seen in the past, and predictive analytics to get a view on what the future will be like. Ultimately we want to build prescriptive analytics that suggest to the end user how they can operate the system better. To give an example, I mentioned that LaGuardia airport has high demand. If we predict right now that an aircraft will arrive early there, we can suggest how they should change operations to accommodate that and ultimately make the system run smoother: fewer delays, less fuel burn, a better system overall.
Heiko Udluft: 06:09 So what data are we using for this? As I mentioned before, we are using aircraft position reports: the latitude and longitude of an aircraft, the timestamp when the position report was generated, the speed of the aircraft at that time, the heading of the aircraft at that time, and then some meta information such as the intent. So where is the aircraft flying, what aircraft is it, who is operating it? There are a lot of features in the data that we can use for analytics. The standard we’re using is called ADS-B. I assume that nobody has heard of it, and I’m not going to go into too much detail, but it is a standard defined in the aviation industry. The aircraft volunteers this information: it broadcasts its signal, and I’ll go into how we pick that up in a second.
Heiko Udluft: 06:53 It does that at least twice a second, so the update rate is 2 Hz. The primary intention for ADS-B was air traffic control: if you get more frequent updates and better precision of where the aircraft is, you can handle them better. The second is aircraft-to-aircraft separation assurance, so basically conflict avoidance. As an aircraft, I pick up the other aircraft, I know where it is, and then I can avoid a conflict with it. In terms of data volume, we have between 10,000 and 15,000 flights operating at any given time. Taking a middle value, and with each one sending two messages a second, the total message volume available globally is about 24,000 messages per second. And as we operate more aircraft and add more information to these messages, that number goes up. Jesse will talk about that in a minute, but it was really a big challenge from a data engineering point of view: how to accept all of that data so that we can use it for analytics.
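The position report described above can be pictured as a small record type. The following is a minimal sketch, not the actual AirSense schema: the field names, the units, and the sample values are all assumptions made for illustration, and only the list of fields comes from the talk (the ICAO identifier is introduced later, in the ETL discussion).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PositionReport:
    """One ADS-B position report, limited to the fields named in the talk."""

    icao: str         # unique aircraft identifier
    timestamp: float  # when the report was generated (assumed: seconds since epoch)
    latitude: float   # degrees
    longitude: float  # degrees
    speed: float      # unit is an assumption (knots)
    heading: float    # degrees, 0 to 360
    # Optional meta information ("intent" in the talk); names are hypothetical.
    destination: Optional[str] = None    # where the aircraft is flying
    aircraft_type: Optional[str] = None  # what aircraft it is
    operator: Optional[str] = None       # who is operating it


# Hypothetical sample values, for illustration only.
report = PositionReport(
    icao="3C6444",
    timestamp=1_546_300_800.0,
    latitude=47.45,
    longitude=-122.31,
    speed=250.0,
    heading=180.0,
)
```

Making the record immutable (`frozen=True`) fits a streaming pipeline where each step produces new messages instead of modifying received ones.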
Heiko Udluft: 07:59 So how does the ADS-B data, the aircraft position data, actually make it from the aircraft to Azure? In the middle you see the aircraft itself. As I mentioned, the aircraft determines its own position, either using GPS or other means of knowing where in the world it is right now, and then it broadcasts that on a public frequency. If you go on Amazon, you can buy yourself a USB stick, plug it into your laptop, and actually start to pick up ADS-B signals. Luckily we don’t have to do that ourselves. For that, we work with partners. The partners we are primarily working with are Aireon and Flightradar24. Aireon is really cool because they have a space-based ADS-B constellation. So they launched a bunch of satellites.
Heiko Udluft: 08:43 They all carry what I mentioned before, the USB sticks, in a better version on the satellite. They receive the signals that way, then they send them to a ground station that relays them and basically puts them into our platform. Flightradar24, on the other side, operates the largest crowdsourced ADS-B receiver network currently on the globe. You can go and tell Flightradar24, “Hey, I live in a location where you have bad coverage right now and I would like to have a receiver.” They might ship one to you, and then you start contributing data to their platform. They collect the data from all their global receivers, and we consume that from them as one data stream that they prepare for us.
Heiko Udluft: 09:27 So what do we then do with the data? I’m going to focus primarily on the real-time components, less on the analytics. We combine the data. As I mentioned, we have multiple data providers. I mentioned two right now, but we’re onboarding more as we go, and we need to combine all that data into one position that makes sense. We then process it, filter out data errors, and use some internal models to improve the data quality. We then need to store the data and analyze it. Here again, we use some Airbus proprietary models, for instance to determine fuel burn and other things that are relevant for analysts. And in the end, we need to distribute it to the customer. For our data stream, we do that through a Kafka API, which Jesse is going to talk about in more detail. You can also consume it through an API that we defined, and you can get it as batch files. So that’s the way we distribute data. All of that, again, in real time and at the scale of thousands of messages a second.
Heiko Udluft: 10:30 So that’s actually a good thing to talk about: what scale are we talking about? I mentioned before that potentially we can get 24,000 messages a second, but the data providers throttle that down to something that can be consumed a bit better. So we get about 2,000 messages per second from a provider. We currently have two providers, and we do five processing steps. So overall we are building a data pipeline that has to process about 2 billion messages a day and a data volume of 100 gigabytes. That is a big data challenge, and we need to have the tools and the engineering in place to be able to handle it. To summarize what I just said, AirSense processes and analyzes aircraft position data at a rate of about 2 billion events per day in real time. Our mission for AirSense is to help aviation customers operate more efficiently, improve the travel experience for you as final customers, and better utilize the available resources globally. We really want to overcome the performance bottlenecks I mentioned before, keep aviation safe, reduce its environmental impact, and overall have better performance. That’s the chance we have with this project.
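The scale figures in the talk can be checked with a few lines of arithmetic. In this sketch the 12,000 flights are an inferred midpoint (it is the value that makes two messages per second come out at 24,000), and the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Global potential: flights in the air times the 2 Hz ADS-B update rate.
flights_in_air = 12_000  # assumed midpoint of the 10,000 to 15,000 range
messages_per_flight_per_second = 2
global_rate = flights_in_air * messages_per_flight_per_second

# Pipeline load: throttled provider feeds, counted once per processing step.
rate_per_provider = 2_000
providers = 2
processing_steps = 5
pipeline_rate = rate_per_provider * providers * processing_steps
pipeline_per_day = pipeline_rate * SECONDS_PER_DAY

print(global_rate)       # 24000
print(pipeline_rate)     # 20000
print(pipeline_per_day)  # 1728000000
```

The daily total of about 1.7 billion is what the talk rounds to 2 billion events per day.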
Jesse Anderson: 11:46 So thank you, Heiko, for giving us the overall domain. That domain really sets up what we’re doing on the data engineering side, so let’s talk about that, and specifically help you understand what Heiko is talking about. There’s the need for analytics, there’s the need for data science, but it’s really important for companies and organizations to understand that data engineering is what makes all of this possible. There was a post that I just wrote for O’Reilly. If you’re a reader of O’Reilly, you may have seen it. It’s called “Why a data scientist is not a data engineer.” There’s the minified URL up on the slide if you want to read it. In there I showed this image. It shows part of what makes companies fail with big data projects: they’re not realizing the importance of the data engineer holding up the world of the data scientist. You may have heard of the Greek myth of Atlas.
Jesse Anderson: 12:42 Well, there’s Atlas, the data engineer, holding up the celestial sphere, or the earth, or whatever you want to call it. Data science only exists when you have good, strong data engineering behind it, and that’s really what we saw from the Airbus team. The Airbus team realized that, we saw that need, and we created this system to enable them to process this amount of data. That really comes into focus when we talk about the project requirements. The AirSense team at Airbus is more business development focused and data science focused. You’ll notice that there are two things missing there: operations and data engineering. We’ll talk more about that, but the one I’m focused on right here is the data engineering side of things. As part of that, we tried to use the cloud to its maximum by understanding that we wanted to create value rather than do operations.
Jesse Anderson: 13:42 My strong suggestion to everybody in this talk is to focus on your value creation rather than operations. You’re not going to get kudos from your CEO for operationalizing a Kafka cluster. You’re going to get kudos from your CEO for creating a great data product that you can sell. That’s really one thing I want you to take away from this. We also had a key requirement of low latency. Low latency is important because we need to get that message all the way from the airplane back down and processed as fast as possible, since we have future aims of getting this to the air traffic controller. The longer that takes, the longer the delay, the latency lag as it were. We also need to scale, and that scale is going to manifest in several different ways. We needed a platform that could scale up.
Jesse Anderson: 14:34 You saw from Heiko’s slide before that traffic is doubling every 15 years. Well, that data volume is going to scale: we are going to have more flights and more providers, and every time we add a new provider, we’re not just going to increase a little bit. We’re going to increase by significant amounts, by that data provider plus all of those planes. And then we have increasing numbers of customers. The data product that we’re creating, and we’ll talk more about this in a second, isn’t just going to one provider or one person. We’re going to be exposing this data product in real time to places all over the world. This is going to be a very worldwide dissemination of this data. As a direct result of the sheer importance of this data product, we needed enterprise SLAs. It was really important for us to look at what sort of SLA, what sort of service level agreement, we could expect from these services.
Jesse Anderson: 15:37 Can we get five nines? Can we get four nines, for example? It was really important to us because if somebody, especially an air traffic controller, comes to expect or rely on this data, we can’t have downtime. We also had the requirement that we may need to deploy on-premises. That meant we had to look at not just our cloud services, but also figure out how each cloud service could then be deployed on-prem, and that’s a difficult decision. Finally, we were talking about data democratization. You may have heard that term. What data democratization means is: here we have this data product, and it shouldn’t just be within Heiko’s ability to work with that data. It would be really interesting if other people within Airbus, and perhaps even other companies, were able to take that data product and say, “I have this idea of what I’d like to do.”
Jesse Anderson: 16:36 By democratizing this data, we’re now able to give that data product out to other people within the organization. You might’ve heard the term citizen data scientists. This is something many companies are trying to do: they’re trying to train their business analysts and various other people on data science so that they can, in turn, do this processing. But all of this, once again, stands on the back of good data engineering, so that we can expose this data and have it be democratized. Now let’s talk about the implementation specifically. We use quite a few different managed services from Azure, and we chose them for very specific reasons. Let me talk about what we used and why. One of the key ones we started off with is the backbone of what we’re doing in real time.
Jesse Anderson: 17:27 And once again, we’re doing all of this in real time and at scale. We’re not doing any batch processing of this after the fact. This all has to be done as soon as possible to meet those latency requirements. To support that, we used Azure Event Hubs, and we didn’t just use Azure Event Hubs: we actually use the Kafka API for Azure Event Hubs. If you didn’t know this already, there is binary-level support in Event Hubs for Kafka. We’ll talk a little bit about how this manifests later on, but it was really important for us to be able to go on-prem, as I just mentioned. One of the ways we can go on-prem is by having a Kafka cluster on-prem, or perhaps by using Event Hubs with Azure Stack once that’s supported. That gave us flexibility on several levels.
Jesse Anderson: 18:19 It wasn’t important for us to operationalize a Kafka cluster, for example. What was important was to have that be managed for us, so it was key for us to use these managed services. Once again, Azure Event Hubs carries the vast majority of our pipeline: what Heiko just talked about and what I’m about to talk about, this is how all of that data was presented. Secondly, we used Azure Kubernetes Service and Container Registry. Everything we were doing was container-based. If you aren’t using containers yet, I would highly suggest you do, because in a system like this, we were trying to get away from doing operations as much as possible. We didn’t want to hire a person to go through and make sure that our processes are running.
Jesse Anderson: 19:10 Air traffic is running 24/7. When we’re asleep, the planes are up in Asia or, if you’re on the Pacific coast, planes are flying in Europe. So this is truly a 24/7 operation. But what we didn’t want was to have a person on call 24/7. What we did is make use of Kubernetes, and we really relied on Kubernetes to make sure that our system was running at all times. We used some features that are part of Kubernetes to make sure that if a process dies, it’s spun up again. This was really important to us: containerization improved the operations side, but also the development side. We didn’t have to worry about whether a process only ran on my laptop. It is containerized, and therefore we know that it’s deployed correctly and that the artifacts are running correctly there.
Jesse Anderson: 20:05 So we used Azure for the actual data movement, what some people call data fabric or messaging. Now for the final resting place of this data, we needed several different places. One of the first places we put the data is Postgres. And not just Postgres: we use the Azure managed Postgres, and that once again allowed us to push the operations onto Azure: here, you take care of the database for us. What we’re doing is taking all those aircraft positions that Heiko was just talking about and putting them right into Postgres. Postgres has some really good geospatial intelligence built into it, so we can use those features.
Jesse Anderson: 20:53 Finally, we’re also using Blob storage. If you didn’t know this already, Event Hubs has a button you can push that says Capture. What Capture mode does is, every x amount of time or every x megabytes, take the data that has been put into those Event Hubs topics and save it into Blob storage. This is important because we’re going to be using that for other purposes as well, which we’ll talk about in a second. But we didn’t want to have to operationalize that. With other technologies, that is sometimes an actual process that has to be running at all times, and I would much prefer a button that you push over some process that you have to worry about.
Jesse Anderson: 21:40 We also looked at Cosmos DB as a way to do that. This is a proof of concept that we worked on with the Cosmos DB team. We haven’t finalized this, but it is something we’re looking at: how do we put this much data into a database that scales? Perhaps this hasn’t come to mind, or maybe you haven’t dealt with scale as much, but when you deal with this many messages per second, or 2 billion events per day, you can’t just use any old thing. We found we couldn’t just use Postgres, for example. We had to really limit what we could do with Postgres. What we needed was a data store that could actually scale and handle the sheer amount of input as well as the sheer amount of queries. And that was a difficult problem unto itself.
Jesse Anderson: 22:38 You heard me talk about how the data was being put into files. Well, there are different ways of processing these files. We looked at three different technologies as part of a proof of concept as well, and we’ll talk about some of the results of that. Spoiler alert: we chose Flink for this, and we’ll talk about the reasons why. The technologies we evaluated were Apache Flink; Databricks, which is Apache Spark, in this case specifically Apache Spark Streaming; and Azure Durable Functions. Each one of these played a different role. What we’re doing is using the data in real time: data coming from Event Hubs is processed in real time with Flink. This sort of processing is important because this is how we’re really getting to those analytics. If any of you have ever tried to do analytics in real time at scale, you know how difficult that is, because now you get into statefulness, you get into real-time shuffle sorts.
Jesse Anderson: 23:38 Those are nontrivial problems to solve. What we found is that the other technologies, which we’ll talk more about later, were unable to do that at the levels, the scales, and the latencies we wanted. That said, we are using Databricks for processing those files that are laid down into Blob storage. For batch processing, Apache Spark is a great way to go. But once you do things in real time and you have that true real-time need, we think Flink is a significantly better way to go. Let’s talk about this data pipeline. Just to give you an overview, this data pipeline is written with custom consumers and producers using the Event Hubs API, or more correctly, using Event Hubs with the Kafka API.
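To make "Event Hubs with the Kafka API" concrete, here is a minimal sketch of the client settings a Kafka consumer or producer typically needs to talk to an Event Hubs namespace. The talk does not say which client library the team used. The property names below follow the librdkafka style used by Python Kafka clients, and the helper name, namespace, and connection string are placeholders invented for the example.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build Kafka client settings that point at an Event Hubs namespace.

    Event Hubs exposes its Kafka endpoint on port 9093 and authenticates with
    SASL PLAIN over TLS, using the literal user name "$ConnectionString" and
    the namespace connection string as the password.
    """
    return {
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values: substitute a real namespace and connection string.
config = event_hubs_kafka_config("example-namespace", "<connection string>")
```

Because only the connection settings differ, the same consumer and producer code could in principle be pointed at an on-prem Kafka cluster, which matches the on-premises requirement mentioned earlier. A Java client would express the same credentials through `sasl.jaas.config` instead of separate user name and password properties.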
Jesse Anderson: 24:30 So the very first step is ingest. I’m not sure how many data engineers or software engineers are in the room, but one of the key differentiators, in my opinion, between a software engineer and a data engineer is the data. This was made manifest by the data products output by some of our providers: the output was something a software engineer would produce, but not something a data engineer would produce, quite frankly. It was JSON output, and this JSON output was not very well done, to put it nicely. So one of the first things we had to do was ingest that craptacular JSON output and make it usable.
Jesse Anderson: 25:21 And that’s a whole data product unto itself. From there we ETL that. JSON is a great format for certain things. Nothing bad about JSON. But when you’re doing things at scale, with big data, string formats are not what you want. So the very first thing we do is get the data out of a string format and into Avro. We want it in a binary format for many different reasons, and that’s what the ETL process does: we take that initial ingestion of JSON and change it into Avro, and we also look for things that are unusable. For example, there is a unique identifier for each plane called an ICAO. If we get a message that lacks that ICAO, we can’t use it, so we throw it out. There are other checks that we do as part of our ETL. And then we have the issue that we have all these different data providers, and each data provider is going to have its own format.
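The validation half of the ETL step described above can be sketched as follows. This is an illustration under assumptions, not the AirSense code: the field name `icao` and the sample messages are invented, and the conversion to Avro is left out because it would need an Avro library and the real schema.

```python
import json
from typing import Optional


def parse_report(raw: str) -> Optional[dict]:
    """Parse one JSON message and return None if it is unusable."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed JSON
    if not isinstance(record, dict):
        return None  # valid JSON, but not an object
    icao = record.get("icao")
    if not isinstance(icao, str) or not icao.strip():
        return None  # no aircraft identifier, so the message cannot be used
    return record


# Hypothetical sample messages.
messages = [
    '{"icao": "3C6444", "lat": 47.45, "lon": -122.31}',
    '{"lat": 51.47, "lon": -0.45}',     # missing ICAO
    '{"icao": "4CA2D1", "lat": 53.42',  # truncated JSON
]
usable = [r for r in map(parse_report, messages) if r is not None]
print(len(usable))  # 1
```

Returning `None` instead of raising keeps one bad message from stopping a consumer that is processing thousands of messages per second.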
Jesse Anderson: 26:25 That’s just an inevitable part of each company looking at it in a different way. But in our system, we can’t deal with three, four, or ten different formats, so we have a step to normalize. We take the Avro for each specific data provider and normalize it into an Avro format that is the union across all those different data services. That normalization step allows us to take a different pipeline for each data provider, as you see there, and get that down into a single pipeline with a single Avro schema. From there, we run our fusion algorithm, and that fusion algorithm is interesting because it’s really the secret sauce. If you’ve ever done something in real time, you know that it’s difficult because you can’t go back and do a retraction.
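The normalization step can be illustrated with plain dictionaries standing in for Avro records. Everything specific here is hypothetical: the provider names, their field names, and the common field list are made up to show the idea of mapping each provider format onto one shared shape.

```python
# Hypothetical per-provider field names mapped onto one common schema.
FIELD_MAPS = {
    "provider_a": {"icao": "icao", "lat": "latitude", "lon": "longitude", "ts": "timestamp"},
    "provider_b": {"hex": "icao", "latitude": "latitude", "longitude": "longitude", "time": "timestamp"},
}
COMMON_FIELDS = ("icao", "latitude", "longitude", "timestamp", "source")


def normalize(provider: str, record: dict) -> dict:
    """Rename provider-specific fields so every record has the same shape."""
    out = {field: None for field in COMMON_FIELDS}  # missing fields stay None
    for src, dst in FIELD_MAPS[provider].items():
        if src in record:
            out[dst] = record[src]
    out["source"] = provider  # remember which provider sent the record
    return out


a = normalize("provider_a", {"icao": "3C6444", "lat": 47.45, "lon": -122.31, "ts": 1546300800})
b = normalize("provider_b", {"hex": "3C6444", "latitude": 47.46, "longitude": -122.30, "time": 1546300801})
print(sorted(a) == sorted(b))  # True
```

After this step, downstream consumers only need to understand one record shape, whichever provider the data came from.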
Jesse Anderson: 27:21 You can’t go back and say, “Oh, I messed up.” You have to give what you think is happening at that point in time as best you can. And this is what happens with our data: since it’s coming from two different providers, it could be slowed down or sped up. So if we get a message from a different provider that is from, let’s say, five minutes ago, a minute ago, or 30 seconds ago, and we’ve already sent out a message, we need to be able to say, yes, we’ve received this, but this is not the most current point. These are some of the things we need to do with that fusion algorithm. It’s going to compensate for these sorts of data errors and uncertainty. Finally, we clean that data. We make sure that the final output of that final data product is correct and what it should be.
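The actual fusion algorithm is described as the secret sauce and is not explained in the talk. The sketch below covers only the one behavior that is described: a report that arrives late, after a newer point for the same aircraft has already been sent out, is not emitted as the current position. The class name and record fields are assumptions.

```python
from typing import Dict, Optional


class LatestPositionFilter:
    """Forward a report only if it is newer than the last one emitted for that aircraft."""

    def __init__(self) -> None:
        self._last_emitted: Dict[str, float] = {}  # icao -> timestamp of last emitted report

    def accept(self, report: dict) -> Optional[dict]:
        icao, ts = report["icao"], report["timestamp"]
        last = self._last_emitted.get(icao)
        if last is not None and ts <= last:
            return None  # late or duplicate: received, but not the most current point
        self._last_emitted[icao] = ts
        return report


fusion = LatestPositionFilter()
stream = [
    {"icao": "3C6444", "timestamp": 100.0, "source": "provider_a"},
    {"icao": "3C6444", "timestamp": 130.0, "source": "provider_b"},
    {"icao": "3C6444", "timestamp": 100.5, "source": "provider_a"},  # arrives late
]
emitted = [r for r in map(fusion.accept, stream) if r is not None]
print([r["timestamp"] for r in emitted])  # [100.0, 130.0]
```

This keeps per-aircraft state in memory, which is the kind of statefulness that a stream processor such as Flink manages for you at scale.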
Jesse Anderson: 28:09 Here’s a different way of visualizing each of those steps. One thing I didn’t mention specifically: we decided that there was value in each one of those steps. For example, just the step of fixing the JSON was a valuable step unto itself, so we can expose that as a data product. The step of saying, here’s Avro for a particular data provider, is a whole different data product. The way we created this system is that each one of those steps you saw before is a completely separate consumer-producer and a completely separate topic in Event Hubs. So we see the data provider coming in, and we do the processing in that box you see, the data processing pipeline. Those are the steps from the previous slide, and the output of that flows into the rest of the system.
Jesse Anderson: 29:03 Once again, this is all in real time. We process each message and do analytics, and in this case we’re calling some of our analytics event tagging. When an airplane does specific things, the airplane doesn’t say, “Hey everybody, I’m doing this.” We have to take the message and try to infer what’s happening with that airplane, and that’s part of what event tagging is doing. That event tagging then flows into another step of the analytics. We mix the tags we’ve received and the tags we’ve decided upon with the actual individual steps, the individual waypoints of the airplane, so that we can create further analytics. But once again, each one of these data products is available to consumers. And you heard me talk about that data democratization.
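The talk does not list which events are tagged or how. As a purely illustrative example of inferring an event from consecutive position messages, the sketch below tags a report with `"turn"` when the heading has changed sharply since the previous report for the same aircraft. The tag name, the threshold, and the record fields are all assumptions.

```python
TURN_THRESHOLD_DEG = 15.0  # hypothetical threshold


def heading_change(previous: float, current: float) -> float:
    """Signed heading difference in degrees, wrapped into the range [-180, 180)."""
    return (current - previous + 180.0) % 360.0 - 180.0


def tag_events(reports):
    """Yield (report, tags) pairs, comparing each report with the previous one per aircraft."""
    previous = {}  # icao -> last seen heading
    for report in reports:
        tags = []
        last = previous.get(report["icao"])
        if last is not None and abs(heading_change(last, report["heading"])) >= TURN_THRESHOLD_DEG:
            tags.append("turn")
        previous[report["icao"]] = report["heading"]
        yield report, tags


reports = [
    {"icao": "3C6444", "heading": 350.0},
    {"icao": "3C6444", "heading": 355.0},
    {"icao": "3C6444", "heading": 20.0},  # 25 degrees past 355, across north
]
tagged = list(tag_events(reports))
print([tags for _, tags in tagged])  # [[], [], ['turn']]
```

The wrap-around in `heading_change` matters because a move from 355 to 20 degrees is a 25 degree turn, not a 335 degree one.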
Jesse Anderson: 29:57 This is available to the rest of the people at Airbus. This is available as a product to sell from Airbus, all in real time. This really shows how that data movement is allowing us to do really interesting things. Now that we’ve talked a little bit about the implementation, I’d like to share some of the things we learned. Hopefully this will give you not just a leg up but also someplace to start. To start off with, here are some general learnings, beginning with a few things that didn’t work very well. Micro batches: if you don’t know what a micro-batch is, or you don’t know what Apache Spark is, this is something you do need to know, especially if you’re doing something in real time.
Jesse Anderson: 30:45 The way Apache Spark Streaming works is through micro batches. It batches up several seconds’ worth of data and processes it all at once. If you are in the industry, you know that they’ve done some work on continuous streaming. The issue with continuous streaming is that, one, it’s brand new, and two, it’s rudimentary: it only supports map functions and does not support statefulness, and that was a whole different issue. But once again, those micro batches didn’t allow us to meet the latency constraints. If a technology was arbitrarily adding this micro-batch latency and we could use a different technology, that’s what we went with: a technology that didn’t add that extra latency. In our case, we were looking at somewhere around 1.5 to 2 seconds of latency introduced by Spark Streaming.
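Why micro batches add latency can be shown with a small model. The sketch assumes a 2 second batch interval and evenly spaced arrivals, both chosen for illustration. It counts only the time an event waits for its batch to close and ignores processing time, so it is not a measurement of Spark Streaming itself.

```python
def micro_batch_wait_ms(arrival_ms: int, interval_ms: int) -> int:
    """Milliseconds an event waits until the micro-batch it falls into closes."""
    batch_end = (arrival_ms // interval_ms + 1) * interval_ms
    return batch_end - arrival_ms


interval_ms = 2_000               # assumed batch interval
arrivals = range(0, 10_000, 100)  # one event every 100 ms for 10 seconds
waits = [micro_batch_wait_ms(t, interval_ms) for t in arrivals]

print(max(waits))               # 2000
print(sum(waits) / len(waits))  # 1050.0
```

An event that arrives just after a batch opens waits almost the full interval, and the average wait is about half of it. A record-at-a-time engine does not have this waiting term, which is the trade-off described above.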
Jesse Anderson: 31:42 Another thing that we found really difficult was cost estimation. I don’t know if you’ve hit this, but it was far too difficult for us to figure out how much certain things in Azure were going to cost, quite frankly. Perhaps one of the worst offenders was Cosmos DB. We spent an inordinate amount of time trying to figure out how much Cosmos DB was going to cost us. It would’ve been so much better and so much more valuable if we had just been able to say, here’s about what it’s going to cost, and let’s move on. We spent way too much time on that. Another thing is that Microsoft is really good about calling out what’s in preview, and the issue there is we needed things that were in preview, and so we had to go into production with things that were in preview.
Jesse Anderson: 32:29 We kind of knew that going in, but we will share that, hey, we did hit some problems with things that were in preview. We actually hit a few that actually bit us after they were turned off in into preview. There were definitely some issues that hit us there. But what we’d suggest is: don’t, unless you absolutely have to, or at least go in knowing that you’re using something in preview. A few things worked really well. The managed services really helped us out by eliminating operations. They didn’t eliminate operations completely, but they really did save us from having to have an operations person. We were able to focus on the data scientists. We were able to focus on that value creation rather than the operations side. And I would really strongly encourage you to think about that.
Jesse Anderson: 33:17 Sometimes people look at the extra cost of a managed service. Well, the extra cost of a managed service is not having half a person who is there taking care of that. And that’s really worth it, in my opinion. We also did a good job by offloading our operations as much as possible onto Kafka, or excuse me, onto Kubernetes. If you look at our Kubernetes cluster, you’ll see that some of our containers had been running for 90 days, a hundred days, and we’re really not going in and checking every single day whether they are running. We know that Kubernetes is going to do that, and we’re also using the managed Kubernetes from Azure, which is even better. Azure is going to handle the Kubernetes side of that. This really allowed us to focus, once again, our time on value creation, on data pipelines, not on the operational side of things.
Jesse Anderson: 34:12 Just remember that even in the meetings we had, when we went and talked to the management, we didn’t say we’re running Kubernetes and we’re doing this managed service. We talked about the value we created. So think about that as developers and as architects. We also had a paradigm of fail fast. And what fail fast allowed us to do is to use Kubernetes to re-spin something up. So if we hit an issue, if we hit a connection failure, if we hit an exception that was unrecoverable, we would fail fast: we would actually exit immediately, knowing that Kubernetes would realize that and spin us back up again. This allowed us to really get away from having to deal with a lot of the operations. We knew that operationally there was probably just something weird that happened, and Kubernetes would spin that up. And if there was something that was actually down, Kubernetes has a backoff, and it will start backing off and restarting until that process eventually does start working again.
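The fail-fast idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the team’s code: `UnrecoverableError`, `run_worker`, and `process_next` are hypothetical names, and the sketch relies on the orchestrator (here, Kubernetes with a restart policy that restarts failed containers) to bring the process back.

```python
import sys


class UnrecoverableError(Exception):
    """Hypothetical marker for failures the worker should not try to handle."""


def run_worker(process_next, max_messages=None):
    """Process messages until a fatal error occurs; return a process exit code.

    No retry logic lives here on purpose: on a fatal error the worker stops
    immediately and lets the orchestrator restart it (with back-off).
    """
    handled = 0
    while max_messages is None or handled < max_messages:
        try:
            process_next()
        except (ConnectionError, UnrecoverableError) as exc:
            print(f"fatal: {exc!r}; exiting so the orchestrator restarts us",
                  file=sys.stderr)
            return 1  # non-zero exit code signals failure
        handled += 1
    return 0


def broken_connection():
    # Stand-in for a consumer call that loses its connection.
    raise ConnectionError("broker unreachable")


exit_code = run_worker(broken_connection)
# In a container entrypoint this would end with: sys.exit(exit_code)
```

Keeping recovery out of the application is the design choice here: the process either works or exits, and restart behavior is configured in one place.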
Jesse Anderson: 35:13 That really allowed us to, once again, not have to deal with the operational side of: is our container running, is our consumer or producer running, that sort of thing. We did have a really good experience with the open APIs that are part of Azure. It was helpful for us because we were able to really focus on the requirement of on-prem. What I’m starting to wonder, since I deal a lot with open source personally, is whether we should be more focused on the open APIs rather than the open source, so that we can actually move between technologies and not necessarily have to worry about what is underneath that technology. And big kudos to the Azure support team: they have been very responsive, and the actual teams themselves have helped us. We’re going to talk about that in a second as well.
Jesse Anderson: 36:02 And a few learnings from the real-time tagging at scale. Apache Flink did significantly better with latency. As I mentioned, latency was really key and important to us. Micro batches just aren’t going to cut it for us. It has the best stateful support, in my opinion. If you’ve ever done real-time statefulness, you know how difficult that is. Apache Flink’s statefulness was incredibly good. It also has the ability to do queries. You can actually query your state and get a response. It’s pretty interesting. That said, Spark Streaming was interesting as well because it was a managed service. We were trying to keep away from operations, but that ability to start up Spark through the managed Databricks was quite nice, and as a direct result of that the data science team was really able to get started much faster.
Jesse Anderson: 37:00 They were able to get into their notebook, start querying, start doing things. And once again on the batch side, the data science team is still using Spark for batch. And that was because they had great language support. If you’re a data scientist, you know that most data scientists want things in Python. Well, you have PySpark and you can do a lot of your querying there in PySpark. Finally, we have Durable Functions. Durable Functions was low cost. There’s just no comparison for costs there. It was really, really cheap. So if you could get by with some of these Durable Functions, hey, that’s a great way to go. And it was pretty easy for us to deploy new code. They really nailed that code deployment, in my opinion. And overall, you’ve heard me talk about wanting to keep away from the operational side of things. What we’d like to do finally, the ultimate goal, is to be completely serverless. In reality, we shouldn’t even be talking or thinking about Kubernetes and that sort of thing. We should be talking about: I have this function running and I know this function is running. And that’s really what I’d like to see coming out of the industry.
Jesse Anderson: 38:11 Well, as you can see, we’ve had a lot of interactions with Azure. As I talked about, we were using things in preview, we were using things in preview at scale, and as a direct result we were on the cutting edge of what we were doing with Azure. And, as a result, we had some support from the teams around Azure.
Heiko Udluft: 38:34 Yeah. Thanks, Jesse.
Heiko Udluft: 38:36 That is really something I would like to highlight. We got great support from the teams. The people that we primarily interacted with were the customer success unit, the customer software engineering and accounting unit responsible for Airbus. And again, they gave us great support, I think for two reasons. One is we are spooling up quite large kinds of products. The second thing is, as Jesse just mentioned, because we were doing such cutting-edge things, we tested some of the technologies really to the limits of what the managed services can do right now. And these teams not only helped us with some of the deployment and development that we did, but we also had both internal and external hackathons where we again got great support from Azure, and not only us, but also the people that attended these hackathons. The other teams we worked with in Azure were the Event Hubs team, the Cosmos DB team and the Azure PostgreSQL team. And again, great support from their end, really fast response times. And whenever we hit a roadblock they helped us to overcome it very, very proactively.
Heiko Udluft: 39:36 So with that I’d like to wrap up the session. And I hope the gods of demo are nice to me today and I can show you a bit of what the data that we work with actually looks like. So what you see here is our global ADS-B dataset. And I’m gonna stay on this zoom level for a second because you see three really interesting things here. One is, if you compare the Chinese airspace here on the right side of the screen to the US airspace on the left side of the screen, you see that China has a lot more structure. So that is, for me as a domain enthusiast and expert, a really interesting thing to see, and that has two effects. One is, because it’s so structured, it’s easier to manage, and that’s the main reason why they have it.
Heiko Udluft: 40:20 The second thing, though, is that it’s penalized by having less capacity. So ideally you want to utilize as much of the airspace as you can, and not have these structures. Oh, I just realized I didn’t actually introduce what you’re looking at here. So these are aircraft positions. I’ll zoom in a bit in a second, but you will see each individual aircraft together with the last 20 minutes of trajectory that this aircraft had. Another thing that I found interesting is, when you look at India, you actually see the busiest routes on the planet. Not at this time of day, but between the three major airports here in India, those are the busiest domestic connections that we have on the planet. The third thing that I’d like to highlight is the transoceanic traffic.
Heiko Udluft: 41:01 So similar to what we saw in China, you actually don’t see aircraft flying direct, but they follow these routes here. That’s the so-called North Atlantic track system. And it could be a two-hour talk in itself to discuss that in more detail. But basically, because until space-based ADS-B became available you didn’t actually know where an aircraft was, the people that are responsible for safety in this space make sure that aircraft are spaced out quite a bit. In normal airspace, you talk about five nautical miles of separation between aircraft. Between the routes that are going from east to west here, you could have tens or hundreds of nautical miles of required separation. So a very, very conservative approach. That’s actually one of the things that we hope for: that with this global situational awareness, if we can have that, we can eventually convince regulators to break up this transatlantic track system a bit so that we can have more capacity there. Because again, less structured, easy to handle, safely separated, that’s all great. But really, it’s penalized by not having enough capacity. You can zoom in on this. And the one that I’m going to go for obviously is Seattle, so we not only know when the aircraft is in the air. And if my internet connection works… go to Seattle-Tacoma airport.
Heiko Udluft: 42:31 Now the gods of demo are not nice today. Okay. I’m sorry about that, but the internet connection here runs from my phone, so if you want to see it later, just stop by.
Jesse Anderson: 42:46 Did you try to refresh?
Heiko Udluft: 43:02 I can try to refresh
Heiko Udluft: 43:05 Now let’s see.
Heiko Udluft: 43:14 Worst case, third time’s the charm. I really just want to show you the kind of detail that we get out of this data. So when you zoom in far enough, you actually see the individual gate positions when an aircraft is parked. So again, here you see traffic around Seattle-Tacoma airport and you see how aircraft are landing on the runways, but we also get good coverage on the ground, again, up to the individual gate position. So we can track an aircraft not only when it’s in flight and build analytics based on that, but a lot of the delays that you experience you actually have on the ground. So, with the data that we have available, we can also provide analytics and decision support for these ground systems.
Heiko Udluft: 43:57 What you just saw is a 2D visualization. Something that we’re working on at the moment is a 3D solution for this.
Heiko Udluft: 44:09 This is just a recording of the prototype that we’re building here. We didn’t actually build the visualization ourselves; that is a product called Luciad. And Luciad does WebGL-based visualizations. What you see here is, in blue, the departing flights out of London Heathrow airport, and in red, the arriving flights. And it’s really interesting when you can interact with data in such a way, because some of the analytics we are interested in doing is, under which conditions do you have… Sorry, the four circles that you see here are the holding patterns that you typically have around London Heathrow airport. So I talked about capacity bottlenecks, and that’s basically how they buffer aircraft before they can land them on the runways. If we can analyze this in more detail and identify when these holding patterns occur, under which weather conditions, at which time of day, we can actually then learn from that and improve the operations overall.
Heiko Udluft: 45:04 So in summary, what we told you about in this presentation is that Airbus AirSense provides real-time big data services for the broader aviation industry. We are actively developing this and you can expect that we have more to show in the near future, but for now we have productized our data stream as well as our event tagging. Aviation can gain significantly from such digital services. I talked a lot about the bottlenecks, the performance, and the growth of aviation, and we believe that if we provide the right decision support, we can overall end up with a better system. And there’s opportunity for optimization algorithms in all sorts of places. And this is what Jesse presented: we implemented real-time big data processing on Azure, and we utilized the cutting edge of managed services that are available in Azure, and they really helped us, with a relatively small team, to have such a demanding platform up and running and operating. So with that, we are at the end of our presentation and we are happy to take questions. We have about another 10 minutes for that.
Jesse Anderson: 46:14 And if you could come up to the mic for your questions, we’d appreciate it.
Audience Quest: 46:17 I just had a quick question. Can you talk again about the statefulness that you needed from Flink? I kind of missed that.
Jesse Anderson: 46:24 Sure, I didn’t go deeply into that. I wasn’t sure how much people were interested in that. But statefulness: in order to do certain complex event processing, or CEP, if you’re familiar with it, the kind we needed for our analytics, we needed several different positions. We needed somewhere between 50 and 100 positions, sometimes more than that. In order to run the algorithm, we needed to store that in a stateful place. When you’re doing statefulness, you either have to have a separate database that you look that up in and process that,
Jesse Anderson: 47:00 or with some of the newer technologies like Kafka Streams, they’ll have a local database. Now the issue with those local databases is when there’s a failure. And when you have that failure, the Kafka Streams databases aren’t very well suited to that, I’ll leave it at that. But on the Flink side of things, they did an incredible job with their statefulness, with the checkpointing, as well as the ability to handle failures in a better way without having to replay the entire state. Hopefully, that answered your question.
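To make the state requirement concrete, here is a minimal in-memory Python sketch of keeping the last N positions per aircraft so an algorithm can run once enough history exists. It is plain Python, not the Flink API; `PositionState` and the field layout are hypothetical, and the window of 50 is taken from the low end of the range mentioned in the answer. What the sketch lacks, and what a stream processor’s keyed state with checkpointing adds, is surviving a failure without rebuilding this history.

```python
from collections import defaultdict, deque

WINDOW = 50  # assumed; the talk mentions roughly 50 to 100 positions


class PositionState:
    """Keeps a bounded history of positions per aircraft (in memory only)."""

    def __init__(self, window=WINDOW):
        self._window = window
        # One bounded deque per aircraft; old positions fall off automatically.
        self._by_aircraft = defaultdict(lambda: deque(maxlen=window))

    def update(self, aircraft_id, position):
        """Store a position; return the full window once it is complete, else None."""
        history = self._by_aircraft[aircraft_id]
        history.append(position)
        if len(history) == self._window:
            return list(history)
        return None


# Tiny window so the behavior is easy to follow.
state = PositionState(window=3)
results = [state.update("A1", p) for p in ["p1", "p2", "p3", "p4"]]
```

If this process crashes, every aircraft’s history is gone, which is exactly the failure case the answer above is concerned with.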
Audience Quest: 47:43 Oh, all right, there we go. I’m a little bit shorter than most people here. So in your general learnings, you said that cost estimation was more difficult than it should be with the Cosmos database. Do you have any suggestions on what might’ve made that easier or how you could have better made those cost estimations?
Jesse Anderson: 48:03 Yes, we shared them with some of the Cosmos DB team, and I would say that some of that cost estimation was difficult for Event Hubs too. So the easier one, for the Event Hubs team, is that I don’t believe we should be looking at throughput units. I think throughput units are much more of a relic of the database days, the relational database days. They work well for relational databases but they don’t work well for the sort of things that we’re doing. We’re thinking in terms of access to data, not necessarily the throughput of that data. So I personally think that Cosmos DB should be costed more on how you access data, how much data you’re storing, how much data you’re retrieving, and not necessarily on throughput units. Similar thing for Event Hubs: Event Hubs charges based on those throughput units as well. I think Event Hubs should be costed based on the number of messages, the amount stored, and the amount sent. I think that’s a significantly easier way.
Audience Quest: 48:59 Okay. So basically if you had had a way to see how much data you were going to be consuming and have those costs associated with that then you could do a better estimation?
Jesse Anderson: 49:13 Yeah. It’s significantly easier for us to estimate how much we’re going to store and how many messages we’re going to send, rather than throughput units.
Audience Quest: 49:24 Cool. Thank you.
Heiko Udluft: 49:28 I have one more coming up to the mic.
Audience Quest: 49:31 Two questions actually. So the Flink cluster that you mentioned, how are you running the Flink cluster? That’s one thing. The second question is related to…
Jesse Anderson: 49:39 Let me answer your first question. We’re running the Flink cluster using the Kubernetes service as well. That’s the recommended route from Ververica, which is the company that’s commercializing Apache Flink.
Audience Quest: 49:50 Okay. The second question is the JSON to Avro that you mentioned. So did you have numbers on how much improvement you were able to get by processing Avro in the rest of the pipeline as against just JSON?
Jesse Anderson: 50:02 Sure. So the improvements there were significant. If you know JSON, JSON has inherent overhead because of the schema. Every single message says, here is the field name, and here’s the field name. And in fact, some of the crappiness of the JSON that we’re receiving is that they’re sending that in as an array. So in order to not have to send the schema every time, they’re basically sending an array and you have to know which index in the array is which one, which kind of defeats the purpose of JSON, quite frankly. So the reduction there was somewhere around 90%, probably 95%, simply because a lot of the data that was being sent over JSON was numeric rather than string. If you’re sending heavy strings, you’re going to see some benefits, but you’re not going to see that kind of reduction.
Jesse Anderson: 50:51 When you send something across the wire with JSON, it’s a UTF-8 string. And so each character in one of those fields is a byte. And by reducing many bytes down to a single byte or four bytes, you get a significant reduction, like on the 90% side. And one thing I will say and share is that the reduction that you get isn’t just a point-to-point reduction. It’s a reduction that happens because we’re now consuming that many times. At the end of the topics we’re not just consuming that one time, we’re actually consuming that 10 times right now. And so if you have a significant reduction, that’s multiplied by 10 consumers there. I saw somebody else had a question.
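The size argument can be illustrated with Python’s standard library. This is a sketch under assumptions: the record, its field names, and its values are made up, and `struct` with a fixed layout stands in for a schema-based binary format. It is not Avro’s actual encoding (Avro uses variable-length integers and length-prefixed strings), and it does not reproduce the 90 to 95% figure from the talk, which depends on the real data.

```python
import json
import struct

# Hypothetical position record; field names and values are invented.
record = {"icao": "3c6444", "lat": 47.4502, "lon": -122.3088,
          "alt_ft": 35000, "speed_kt": 450}

# JSON object: field names are repeated in every message.
as_json_object = json.dumps(record).encode("utf-8")

# JSON array: no field names, but every number is still spelled out as text.
as_json_array = json.dumps(list(record.values())).encode("utf-8")

# Fixed binary layout known from a schema: 6-byte id, two float64, two int32
# (little-endian, no padding).
as_binary = struct.pack("<6sddii", record["icao"].encode("ascii"),
                        record["lat"], record["lon"],
                        record["alt_ft"], record["speed_kt"])

CONSUMERS = 10  # the talk mentions each topic being consumed about 10 times
bytes_saved_across_consumers = (len(as_json_object) - len(as_binary)) * CONSUMERS

print(len(as_json_object), len(as_json_array), len(as_binary))
```

The last calculation mirrors the fan-out point: whatever is saved per message is saved again for every consumer that reads it.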
Speaker 7: 51:41 I don’t want you to give away any trade secrets, but one of the requirements was the ability to have an on-prem deployment, but you didn’t really talk about the nature of that or the purpose of that whether it was for testing or a core part of the product.
Heiko Udluft: 51:55 Thanks, I love it. That’s a good question. That’s actually a customer requirement. Everything that we built is set up so that it can work with customer data. And when I talk about customers, let’s stick to the airport example. An airport has its own surveillance infrastructure. So they collect the same data. But that data is not always allowed to be exposed externally. So they have the requirement that they can do everything in their own data warehouse, and that’s where this on-prem requirement came from.
Audience Quest: 52:27 I’m curious, who do you see as your customers?
Jesse Anderson: 52:31 Could you repeat your question? I didn’t quite catch it.
Audience Quest: 52:34 You mentioned customers. Who Do you see as your customers? What type of industry, what airports, airlines, governmental agencies, what?
Heiko Udluft: 52:43 Okay, sorry, the audio here is a bit bad. If I understood your question correctly, you asked about which customer groups we are targeting. So for me, it’s any stakeholder that’s involved with a flight. That starts with the government agencies that have to certify all the flights, and we can basically provide them analytics to give them an overview of changes in the system. Then the air traffic controllers, or air navigation service providers, so that’s the people that are responsible for managing how aircraft flow through the system. The next big one would be all airports in the world, because they can benefit from having better surveillance data. But also again, we can provide them with performance analytics. We can target airlines, we can talk about finance services for aircraft, travel agencies, down to the individual traveler. Literally, the industry is really broad, and when we focus on customers, the two big groups that we target at the moment are the air navigation service providers and the airports. But the products that we’re building can really help literally everybody who is involved with a flight.
Jesse Anderson: 53:49 I think we have time for one or two questions. If anybody has a question, come up to the mic.
Audience Quest: 53:55 In your presentation, you mentioned the term data product. Could you please explain further what a data product is?
Jesse Anderson: 54:03 Sure. So then what is a data product? I have an entire post on my site. If you go to jesse-anderson.com, I got not only myself but a bunch of my friends to say what their definition of a data product is. So a data product is what comes out of a clean pipeline that’s repeatable. Here the data that came into us was, in my opinion, not really repeatable, not really well outlined. And so the data product for us is something that is consumable in a fashion that is repeatable and structured and with a good SLA behind it, and that’s in the case of real time. And then you have batch data products, where there is a data product that is on HDFS or blob storage, that has been cleaned, ETL’d, and is in a binary format that is extensible so that the rest of the organization can consume from there.
Jesse Anderson: 54:56 If we were to back up to this slide, for example, if we were to democratize our data, we would not democratize the ingest data. We would not democratize the pre-ETL data, in other words. That would just be a real non-starter: hey everybody, here’s our JSON. That causes a lot of problems. What we would democratize is the data after that normalization, perhaps, and that’s an example of a data product, where we have taken all this and said it’s good. We’ve put our blessing, our stamp, on that. Does that answer your question, Kevin? Good.
Jesse Anderson: 55:45 One last question.
Audience Quest: 55:48 You talked about passenger planes a lot. Do you also track like freight planes, cargo planes, or small private planes?
Heiko Udluft: 55:55 Yes. In short, yes, we track any plane that’s flying: civil aviation, general aviation planes, as well as cargo planes. So that’s all in there. Currently, ADS-B is not mandated for all of them, so when you look at what percentage of the fleet we’re currently receiving on the platform, we receive a higher percentage of commercial flights than we do for cargo, and than we do for general aviation. But in general, all of these are available. Okay. Thank you very much. Good.