Predicting Out Of Memory Kill events with Machine Learning (Ep. 203)

20-09-2022 • 19 mins

Sometimes applications crash. Some other times applications crash because memory is exhausted. Such issues exist because of bugs in the code, or heavy memory usage for reasons that were not expected during design and implementation. Can we use machine learning to predict and eventually detect out of memory kills from the operating system?

Apparently, the Netflix app many of us use on a daily basis leverages ML and time series analysis to prevent OOM-kills.

Enjoy the show!

Our Sponsors

Explore the Complex World of Regulations. Compliance can be overwhelming. Multiple frameworks. Overlapping requirements. Let Arctic Wolf be your guide. Check it out at https://arcticwolf.com/datascience

Amethix works to create and maximize the impact of the world’s leading corporations and startups, so they can create a better future for everyone they serve. We provide solutions in AI/ML, Fintech, Healthcare/RWE, and Predictive maintenance.

Transcript

1 00:00:04,150 --> 00:00:09,034 And here we are again with season four of the Data Science at Home podcast.

2 00:00:09,142 --> 00:00:19,170 This time we have something for you: if you want to help us shape the data science leaders of the future, we have created the Data Science at Home Ambassador program.

3 00:00:19,340 --> 00:00:28,378 Ambassadors are volunteers who are passionate about data science and want to give back to our growing community of data science professionals and enthusiasts.

4 00:00:28,534 --> 00:00:37,558 You will be instrumental in helping us achieve our goal of raising awareness about the critical role of data science in cutting edge technologies.

5 00:00:37,714 --> 00:00:45,740 If you want to learn more about this program, visit the Ambassadors page on our website at datascienceathome.com.

6 00:00:46,430 --> 00:00:49,234 Welcome back to another episode of Data Science at Home podcast.

7 00:00:49,282 --> 00:00:55,426 I'm Francesco, podcasting from the regular office of Amethix Technologies, based in Belgium.

8 00:00:55,618 --> 00:01:02,914 In this episode, I want to speak about a machine learning problem that has been formulated at Netflix.

9 00:01:03,022 --> 00:01:22,038 And for the record, Netflix is not sponsoring this episode, though I still believe that this is a very well known problem, a very common one across sectors, which is how to predict an out of memory kill in an application and formulate this problem as a machine learning problem.

10 00:01:22,184 --> 00:01:39,142 So this is something that, as I said, is very interesting, not just because of Netflix, but because it allows me to explain a few points that, as I said, are kind of invariant across sectors.

11 00:01:39,226 --> 00:01:56,218 Regardless of whether your application is a video streaming application or any other communication type of application, or a fintech application, or energy, or whatever, this out of memory kill still occurs.

12 00:01:56,314 --> 00:02:05,622 And what is an out of memory kill? Well, it's essentially the extreme event in which the machine doesn't have any more memory left.

13 00:02:05,756 --> 00:02:16,678 And so usually the operating system can start eventually swapping, which means using the SSD or the hard drive as a source of memory.

14 00:02:16,834 --> 00:02:19,100 But that, of course, will slow down a lot.

15 00:02:19,430 --> 00:02:45,210 And eventually when there is a bug or a memory leak, or if there are other applications running on the same machine, of course there is some kind of limiting factor that essentially kills the application, something that occurs from the operating system most of the time that kills the application in order to prevent the application from monopolizing the entire machine, the hardware of the machine.

16 00:02:45,710 --> 00:02:48,500 And so this is a very important problem.

17 00:02:49,070 --> 00:03:03,306 Also, it is important to have an episode about this because there are some strategies that they've used at Netflix that are pretty much in line with what I believe machine learning should be about.

18 00:03:03,368 --> 00:03:25,062 And usually people would go for the fancy solution there, like these extremely accurate predictors or machine learning models that have a massive number of parameters and that try to figure out whatever is happening on the machine that is running that application.

19 00:03:25,256 --> 00:03:29,466 While the solution at Netflix is pretty straightforward, it's pretty simple.

20 00:03:29,588 --> 00:03:33,654 And so one would say, then why make an episode about this? Well,

21 00:03:33,692 --> 00:03:45,730 because I think that we need more sobriety when it comes to machine learning and I believe we still need to spend a lot of time thinking about what data to collect.

22 00:03:45,910 --> 00:03:59,730 Reasoning about what is the problem at hand and what is the data that can actually tickle the particular machine learning model and then of course move to the actual prediction that is the actual model.

23 00:03:59,900 --> 00:04:15,910 And most of the time it doesn't need to be one of these super fancy things that you see on the news around chatbots or autonomous gaming agents or drivers and so on and so forth.

24 00:04:16,030 --> 00:04:28,518 So there are essentially two data sets that the people at Netflix focus on, which are considerably different, dramatically different in fact.

25 00:04:28,604 --> 00:04:45,570 These are data about device characteristics and capabilities and, of course, data that are collected at runtime and that give you a picture of what's going on in the memory of the device, right? So that's the so-called runtime memory data and out of memory kills.

26 00:04:45,950 --> 00:05:03,562 So the first type of data is I would consider it very static because it considers for example, the device type ID, the version of the software development kit that application is running, cache capacities, buffer capacities and so on and so forth.

27 00:05:03,646 --> 00:05:11,190 So it's something that most of the time doesn't change across sessions and so that's why it's considered static.

28 00:05:12,050 --> 00:05:18,430 In contrast, the other type of data, the runtime memory data, is, as the name says, runtime.

29 00:05:18,490 --> 00:05:24,190 So it varies across the life of the session; it's collected at runtime.

30 00:05:24,250 --> 00:05:25,938 So it's very dynamic data.

31 00:05:26,084 --> 00:05:36,298 And examples of these records are profile, movie details, playback information, current memory usage, et cetera, et cetera.

32 00:05:36,334 --> 00:05:56,086 So this is the data that actually moves and moves in the sense that it changes depending on how the user is actually using the Netflix application, what movie or what profile description, what movie detail has been loaded for that particular movie and so on and so forth.

33 00:05:56,218 --> 00:06:15,094 So of course the first difficulty, or the first challenge, that the people at Netflix had to deal with was how to combine these two things: very static and usually small tables versus very dynamic and usually large tables or views.

34 00:06:15,142 --> 00:06:36,702 Well, there is some sort of join on key that is performed by the people at Netflix in order to put together these different data resolutions, right, which is data about the same phenomenon but from different sources and carrying very different signals.

35 00:06:36,896 --> 00:06:48,620 So the device capabilities are usually captured by the static data, and then of course there is the other data, the runtime memory and out of memory kill data.

36 00:06:48,950 --> 00:07:04,162 These are, as I said, the data that describe pretty accurately how the user is using that particular application on that particular hardware.
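
To make the join on key concrete, here is a minimal sketch in plain Python. The field names (`device_type_id`, `sdk_version`, `cache_mb`, `memory_mb`, `oom_kill`) and all values are assumptions made up for illustration; the episode does not describe the actual Netflix schema or tooling.

```python
# Static table: one small record per device type (hypothetical fields).
DEVICES = {
    "tv-a": {"sdk_version": "4.1", "cache_mb": 64},
    "tv-b": {"sdk_version": "5.0", "cache_mb": 128},
}

# Dynamic table: many runtime readings per device (hypothetical fields).
RUNTIME = [
    {"device_type_id": "tv-a", "t": 0, "memory_mb": 210.0, "oom_kill": False},
    {"device_type_id": "tv-b", "t": 0, "memory_mb": 340.0, "oom_kill": True},
    {"device_type_id": "tv-z", "t": 0, "memory_mb": 95.0, "oom_kill": False},
]


def join_on_device(runtime_rows, devices):
    """Left join: keep every runtime row, add static fields when the key matches."""
    joined = []
    for row in runtime_rows:
        static = devices.get(row["device_type_id"], {})
        # Runtime fields win on a name clash because they are unpacked last.
        joined.append({**static, **row})
    return joined


rows = join_on_device(RUNTIME, DEVICES)
```

Rows for an unknown device type (here `tv-z`) keep their runtime fields but carry no static ones, which is one place where the missing-data handling mentioned in the episode would come in.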

37 00:07:04,306 --> 00:07:17,566 Now of course, when it comes to data engineering, there is nothing new that people at Netflix have introduced: dealing with missing data, for example, or incorporating knowledge of devices.

38 00:07:17,698 --> 00:07:26,062 It's all stuff that is part of the so-called data cleaning and data collection strategy, right? Or data preparation.

39 00:07:26,146 --> 00:07:40,782 That is, whatever you're going to do in order to make that data or a combination of these data sources, let's say, compatible with the way your machine learning model will understand or will read that data.

40 00:07:40,916 --> 00:07:58,638 So if you think of a big data platform, the first step, the first challenge you have to deal with is how can I, first of all, collect the right amount of information, the right data, but also how to transform this data for my particular big data platform.

41 00:07:58,784 --> 00:08:12,798 And that's something that, again, nothing new, nothing fancy, just basics, what we have been used to, what we are used to seeing now for the last decade or more, that's exactly what they do.

42 00:08:12,944 --> 00:08:15,222 And now let me tell you something important.

43 00:08:15,416 --> 00:08:17,278 Cybercriminals are evolving.

44 00:08:17,374 --> 00:08:22,446 Their techniques and tactics are more advanced, intricate and dangerous than ever before.

45 00:08:22,628 --> 00:08:30,630 Industries and governments around the world are fighting back, unveiling new regulations meant to better protect data against this rising threat.

46 00:08:30,950 --> 00:08:39,262 Today, the world of cybersecurity compliance is a complex one, and understanding the requirements your organization must adhere to can be a daunting task.

47 00:08:39,406 --> 00:08:42,178 But not when the pack has your back.

48 00:08:42,214 --> 00:08:53,840 Arctic Wolf, the leader in security operations, is on a mission to end cyber risk by giving organizations the protection, information and confidence they need to protect their people, technology and data.

49 00:08:54,170 --> 00:09:02,734 The new interactive compliance portal helps you discover the regulations in your region and industry and start the journey towards achieving and maintaining compliance.

50 00:09:02,902 --> 00:09:07,542 Visit arcticwolf.com/datascience to take your first step.

51 00:09:07,676 --> 00:09:11,490 That's arcticwolf.com/datascience.

52 00:09:12,050 --> 00:09:18,378 I think that this is the most important part, though I think both are actually equally important.

53 00:09:18,464 --> 00:09:26,854 The way they treat runtime memory data and out of memory kill data is by using sliding windows.

54 00:09:26,962 --> 00:09:38,718 So that's something that is really worth mentioning, because the way you would frame this problem is something is happening at some point in time and I have to kind of predict that event.

55 00:09:38,864 --> 00:09:49,326 That is usually an outlier in the sense that these events are quite rare, fortunately, because Netflix would not be as usable as we believe it is.

56 00:09:49,448 --> 00:10:04,110 So you would like to predict these weird events by looking at a historical view or an historical amount of records that you have before this particular event, which is the kill of the application.

57 00:10:04,220 --> 00:10:12,870 So the concept of the sliding window, the sliding window approach is something that comes as the most natural thing anyone would do.

58 00:10:13,040 --> 00:10:18,366 And that's exactly what the researchers at Netflix have done.
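
As a rough illustration of the sliding window idea, the sketch below cuts a series of memory readings into overlapping windows and summarises each one. The window size, the three features and the sample readings are assumptions chosen for the example, not values from the Netflix post.

```python
def sliding_windows(readings, size, step=1):
    """Yield consecutive windows of `size` readings, advancing by `step`."""
    for start in range(0, len(readings) - size + 1, step):
        yield readings[start:start + size]


def window_features(window):
    """Summarise one window (needs at least two readings)."""
    growth = (window[-1] - window[0]) / (len(window) - 1)
    return {
        "last": window[-1],                  # most recent reading
        "mean": sum(window) / len(window),   # average level in the window
        "growth": growth,                    # average change per step
    }


# Hypothetical memory readings in MB, one per sampling interval.
memory_mb = [100, 110, 125, 145, 170, 200]
features = [window_features(w) for w in sliding_windows(memory_mb, size=3)]
```

Each window becomes one row of features describing what the memory looked like over that stretch of time; the growth rate is the kind of trend the episode describes as changing when the kill gets closer.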

59 00:10:18,488 --> 00:10:25,494 So, not unexpectedly, in my opinion, they treated this problem as a time series, which is exactly what it is.

60 00:10:25,652 --> 00:10:26,190 Now,

61 00:10:26,300 --> 00:10:26,754 they,

62 00:10:26,852 --> 00:10:27,330 of course,

63 00:10:27,380 --> 00:10:31,426 use this sliding window with different horizons:

64 00:10:31,558 --> 00:10:32,190 five minutes,

65 00:10:32,240 --> 00:10:32,838 four minutes,

66 00:10:32,924 --> 00:10:33,702 two minutes,

67 00:10:33,836 --> 00:10:36,366 as close as possible to the event,

68 00:10:36,548 --> 00:10:38,886 because maybe there are some,

69 00:10:39,008 --> 00:10:39,762 let's say,

70 00:10:39,896 --> 00:10:45,678 other dynamics that can arise when you are very close to the event or when you are very far from it.

71 00:10:45,704 --> 00:10:50,166 Like, five minutes far from the out of memory kill, you

72 00:10:50,348 --> 00:10:51,858 might have some other,

73 00:10:51,944 --> 00:10:52,410 let's say,

74 00:10:52,460 --> 00:10:55,986 diagrams or shapes in the data.

75 00:10:56,168 --> 00:11:11,310 So for example, you might have a certain number of allocations that keep growing and growing, but eventually they grow with a certain curve or a certain rate that you can measure when you are five to ten minutes far from the out of memory kill.

76 00:11:11,420 --> 00:11:16,566 When you are two minutes far from the out of memory kill, probably this trend will change.

77 00:11:16,688 --> 00:11:30,800 And so probably what you would expect is that the memory is already half or more saturated and therefore, for example, the operating system starts swapping or other things are happening that you are going to measure in this.

78 00:11:31,550 --> 00:11:39,730 And that would give you a much better picture of what's going on in the, let's say, closest neighborhood of that event, the time window.

79 00:11:39,790 --> 00:11:51,042 The sliding window and time window approach is definitely worth mentioning because this is something that you can apply, if you think about it, pretty much anywhere, right?

80 00:11:51,116 --> 00:11:52,050 Now, what they did,

81 00:11:52,160 --> 00:12:04,146 in addition to having a time window, a sliding window, is also assign different levels to memory readings that are closer to the out of memory kill.

82 00:12:04,208 --> 00:12:10,062 And usually these levels are higher and higher as we get closer and closer to the out of memory kill.

83 00:12:10,136 --> 00:12:15,402 So this means that, for example, we would have, for a five minute window, we would have a level one.

84 00:12:15,596 --> 00:12:22,230 Five minute means five minutes far from the out of memory kill, four minutes would be a level two.

85 00:12:22,280 --> 00:12:37,234 Three minutes, which is much closer, would be a level three, and two minutes would be a level four, which is kind of the severity of the event as we get closer and closer to the actual event, when the application is actually killed.
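
The level scheme just described can be sketched as a small labelling function. The mapping for five, four, three and two minutes comes from the episode; treating readings more than five minutes before the kill as level 0, and readings less than two minutes before it as level 4, are assumptions added here so the function is total.

```python
import math

# Minutes before the kill -> level, as stated in the episode.
LEVELS = {5: 1, 4: 2, 3: 3, 2: 4}


def level_for(seconds_before_kill):
    """Map the time remaining before an out of memory kill to a level."""
    # Round up to whole minutes; clamp anything under two minutes to two
    # (assumption: the final stretch keeps the highest level).
    minutes = max(2, math.ceil(seconds_before_kill / 60))
    # Assumption: more than five minutes away counts as level 0.
    return LEVELS.get(minutes, 0)


def label_readings(readings, kill_time):
    """Attach a level to each (timestamp, memory_mb) reading before the kill."""
    return [
        (t, mem, level_for(kill_time - t))
        for t, mem in readings
        if t <= kill_time
    ]


# Hypothetical readings: (seconds since session start, memory in MB).
readings = [(0, 180.0), (300, 240.0), (420, 310.0), (480, 390.0)]
labelled = label_readings(readings, kill_time=600)
```

With labels like these, the prediction task becomes a multi-class problem in which higher classes mean the kill is closer.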

86 00:12:37,342 --> 00:12:51,474 So by looking at this approach, nothing new there; I would say even a not-so-seasoned data scientist would have understood that using a sliding window is the way to go.

87 00:12:51,632 --> 00:12:55,482 I'm not saying that Netflix engineers are not seasoned enough.

88 00:12:55,556 --> 00:13:04,350 Actually they do a great job every day to keep giving us video streaming platforms that actually never fail or almost never fail.

89 00:13:04,910 --> 00:13:07,460 So spot on there, guys, good job.

90 00:13:07,850 --> 00:13:27,738 But looking at this sliding window approach, the direct consequence of this is that they can plot, they can do some sort of graphical analysis of the out of memory kills versus the memory usage that can give the reader or the data scientist a very nice picture of what's going on there.

91 00:13:27,824 --> 00:13:39,330 And so you would have, for example, and I will definitely report some of the pictures, some of the diagrams and graphs, in the show notes of this episode on the official website datascienceathome.com.

92 00:13:39,500 --> 00:13:48,238 But essentially what you can see there is that there might be premature peaks at, let's say, a lower memory reading.

93 00:13:48,334 --> 00:14:08,958 And usually these are some kind of false positives or anomalies that should not be there. Then it's possible to set a threshold at which to start lowering the memory usage, because after that threshold something nasty can happen, and usually happens according to your data.

94 00:14:09,104 --> 00:14:18,740 And then of course there is another graph about the Gaussian distribution or in fact no sharp peak at all.

95 00:14:19,250 --> 00:14:21,898 That is, kills, or out of memory

96 00:14:21,934 --> 00:14:33,754 kills, are more or less normally distributed. And then of course there are the genuine peaks that indicate kills near, let's say, the threshold.

97 00:14:33,802 --> 00:14:38,758 And so usually you would see that after that particular threshold of memory usage

98 00:14:38,914 --> 00:14:42,142 you see most of the out of memory kills,

99 00:14:42,226 --> 00:14:45,570 which makes sense because, given a particular device,

100 00:14:45,890 --> 00:14:48,298 which means a certain amount of memory,

101 00:14:48,394 --> 00:14:50,338 certain memory characteristics,

102 00:14:50,494 --> 00:14:53,074 a certain version of the SDK and so on and so forth,

103 00:14:53,182 --> 00:14:53,814 you can say,

104 00:14:53,852 --> 00:14:54,090 okay,

105 00:14:54,140 --> 00:15:10,510 well, for this device type I have this memory usage threshold, and I see that I have a relatively high number of out of memory kills immediately after this threshold.

106 00:15:10,570 --> 00:15:18,150 And this means that probably that is the threshold you would like to consider as the critical threshold you should never or almost never cross.
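
One simple way to read such a threshold off the data is to bucket the memory readings at which kills happened and take the lower edge of the most populated bucket. This is only a sketch of the idea: the bucket width, the sample values and the peak-bucket rule are assumptions, not the method described by the Netflix engineers.

```python
from collections import Counter


def critical_threshold(kill_memory_mb, bucket_mb=50):
    """Return the lower edge (MB) of the bucket that holds the most kills.

    Expects a non-empty list of memory readings taken at kill time.
    """
    buckets = Counter(int(m // bucket_mb) for m in kill_memory_mb)
    # Most kills wins; on a tie, prefer the lower (more conservative) bucket.
    peak_bucket, _ = max(buckets.items(), key=lambda kv: (kv[1], -kv[0]))
    return peak_bucket * bucket_mb


# Hypothetical memory usage (MB) at the moment of each kill, for one device type.
# The lone 212 plays the role of a premature peak at a lower memory reading.
kills = [212, 405, 410, 418, 422, 431, 447, 460]
threshold_mb = critical_threshold(kills)
```

In practice this would be computed separately for each device type, since the threshold depends on the memory characteristics of the device.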

107 00:15:18,710 --> 00:15:38,758 So once you have this picture in front of you, you can start thinking of implementing some mechanisms that can monitor the memory usage and, of course, kind of preemptively deallocate things or keep the memory usage as low as possible with respect to the critical threshold.

108 00:15:38,794 --> 00:15:53,446 So you can start implementing some logic that prevents the application from being killed by the operating system so that you would in fact reduce the rate of out of memory kills overall.
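
A toy version of such a preventive mechanism might look like the following. Everything here is hypothetical: the `MemoryGuard` class, the 90% safety margin, the oldest-first eviction policy and the cache entries are illustrative choices, not how the Netflix app actually frees memory.

```python
class MemoryGuard:
    """Evict cached items before usage reaches a device's critical threshold."""

    def __init__(self, critical_mb, safety_margin=0.9):
        # Start relieving memory at a fraction of the critical threshold.
        self.soft_limit_mb = critical_mb * safety_margin
        self.cache = {}  # name -> size in MB, kept in insertion order

    def usage_mb(self, base_mb):
        """Total usage: non-evictable base memory plus everything cached."""
        return base_mb + sum(self.cache.values())

    def relieve(self, base_mb):
        """Drop the oldest cache entries until usage is at or under the soft limit."""
        evicted = []
        while self.cache and self.usage_mb(base_mb) > self.soft_limit_mb:
            name = next(iter(self.cache))  # oldest entry first
            del self.cache[name]
            evicted.append(name)
        return evicted


guard = MemoryGuard(critical_mb=400)
guard.cache.update({"artwork": 60, "previews": 40, "metadata": 10})
evicted = guard.relieve(base_mb=280)  # usage starts at 390 MB, soft limit is 360 MB
```

The point of the soft limit is to act before the critical threshold is crossed, trading a little cached data for a lower chance of being killed by the operating system.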

109 00:15:53,578 --> 00:16:11,410 Now, as always, and as the engineers also state in their blog post, in the technical post, they say, well, it's much more important for us to predict with a certain amount of false positives rather than false negatives.

110 00:16:11,590 --> 00:16:18,718 A false negative means missing an out of memory kill that actually occurred but was not predicted.

111 00:16:18,874 --> 00:16:40,462 If you are a regular listener of this podcast, that statement should resonate with you because this is exactly what happens, for example in healthcare applications, which means that doctors or algorithms that operate in healthcare would definitely prefer to have a bit more false positives rather than more false negatives.

112 00:16:40,486 --> 00:16:54,800 Because missing that someone is sick means that you are not providing a cure and you're just sending the patient home when he or she is sick, right?

113 00:16:55,130 --> 00:16:57,618 So that's a false negative, it's the miss.

114 00:16:57,764 --> 00:17:09,486 But having a false positive, what can go wrong with having a false positive? Well, probably you will undergo another test to make sure that the first test is confirmed or not.

115 00:17:09,608 --> 00:17:16,018 So having a false positive in this case is relatively okay with respect to having a false negative.
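
The preference for false positives over false negatives can be expressed as an asymmetric cost when choosing a decision threshold. In this sketch the scores, labels, candidate thresholds and the 5-to-1 cost ratio are all invented for illustration; the episode does not say how the trade-off is tuned at Netflix.

```python
def confusion(scores, labels, threshold):
    """Count (false positives, false negatives) when predicting score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return fp, fn


def pick_threshold(scores, labels, candidates, fn_cost=5.0, fp_cost=1.0):
    """Choose the candidate threshold with the lowest weighted error cost."""
    def cost(t):
        fp, fn = confusion(scores, labels, t)
        return fn_cost * fn + fp_cost * fp
    return min(candidates, key=cost)


# Hypothetical model scores and ground truth (True = a kill actually followed).
scores = [0.10, 0.35, 0.40, 0.55, 0.60, 0.80, 0.90]
labels = [False, False, True, False, True, True, True]
best = pick_threshold(scores, labels, candidates=[0.3, 0.5, 0.7])
```

Because a missed kill is weighted five times heavier than a false alarm, the lowest candidate threshold wins here: it raises two false alarms but misses no kills.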

116 00:17:16,054 --> 00:17:19,398 And that's exactly what happens to the Netflix application.

117 00:17:19,484 --> 00:17:32,094 Now, I don't want to say that the Netflix application is as critical as, for example, an application that predicts a cancer, or something on an X-ray, or a disorder or disease of some sort.

118 00:17:32,252 --> 00:17:48,090 But what I'm saying is that there are some analogies when it comes to machine learning and artificial intelligence and especially data science, the old school data science, there are several things that kind of are, let's say, invariant across sectors.

119 00:17:48,410 --> 00:17:56,826 And so, you know, two worlds like the media streaming or video streaming and healthcare are of course very different from each other.

120 00:17:56,888 --> 00:18:05,274 But when it comes to machine learning and data science applications, well, there are a lot of analogies there.

121 00:18:05,372 --> 00:18:06,202 And indeed,

122 00:18:06,286 --> 00:18:10,234 in terms of the models that they use at Netflix to predict,

123 00:18:10,342 --> 00:18:24,322 once they have the sliding window data, they essentially have the ground truth of where this out of memory kill happened and what happened before to the memory of the application or the machine.

124 00:18:24,466 --> 00:18:24,774 Well,

125 00:18:24,812 --> 00:18:30,514 then the models they use to predict these events are artificial neural networks,

126 00:18:30,622 --> 00:18:31,714 XGBoost,

127 00:18:31,822 --> 00:18:36,742 AdaBoost or adaptive boosting, elastic net with softmax, and so on and so forth.

128 00:18:36,766 --> 00:18:39,226 So nothing fancy.

129 00:18:39,418 --> 00:18:45,046 As you can see, XGBoost is probably one of the most used; I would have expected even random forest.

130 00:18:45,178 --> 00:18:47,120 Probably they did try that.

131 00:18:47,810 --> 00:18:58,842 But XGBoost is probably one of the most used models in Kaggle competitions for a reason: because it works and it leverages a lot

132 00:18:58,916 --> 00:19:04,880 the data preparation step, which already solves more than half of the problem.

133 00:19:05,810 --> 00:19:07,270 Thank you so much for listening.

134 00:19:07,330 --> 00:19:11,910 I also invite you, as always, to join the Discord Channel.

135 00:19:12,020 --> 00:19:15,966 You will find a link on the official website datascienceathome.com.

136 00:19:16,148 --> 00:19:17,600 Speak with you next time.

137 00:19:18,350 --> 00:19:21,382 You've been listening to the Data Science at Home podcast.

138 00:19:21,466 --> 00:19:26,050 Be sure to subscribe on iTunes, Stitcher, or Podbean to get new, fresh episodes.

139 00:19:26,110 --> 00:19:31,066 For more, please follow us on Instagram, Twitter and Facebook or visit our website at datascienceathome.com

References

https://netflixtechblog.com/formulating-out-of-memory-kill-prediction-on-the-netflix-app-as-a-machine-learning-problem-989599029109