# Machine Learning is a multi-year journey - Wiki essay

When I started in InfoSec around 2008 (in Germany), there were no entry-level opportunities. The industry wasn’t what it is today, but somehow I had decided that this was what I wanted to do for the upcoming decade(s). It took many years until I had “the skills”[tm]. Good times.

Machine Learning is a similar story today: it’s a multi-year journey until you become proficient at it. I don’t foresee that there will be (many) entry-level positions here. Germany isn’t good at strategic planning or foresight in its tech sector. Which makes sharing the vision and some level of unbiased insights even more important.

– InfoSec can help to treat risks. Doing that affects entire organizations, primarily from the inside out. You have these security controls, audits, corrective actions, and security tests like pentests and vulnerability assessments.
– With ML things are somewhat similar – and different – because ML affects the digital services and products themselves; and their early and continuous development process on a Story Level. The key similarity is: doing ML means to cross-functionally collaborate through an entire “value chain”. In good orgs InfoSec is part of that. In many orgs… not so much. Which is the problem with the vision and change again.

Here is my growing Machine Learning (ML) Wiki essay with simplifications and inaccuracies for the sake of pragmatism. The focus is to deliver (initial) understanding. This isn’t the machine learning literature you are looking for. It’s a loose collection to render the vision.

tl;dr:

• Getting into ML can take years. Getting into InfoSec can take years as well.
• Both fields are cross-functional and can affect multiple business and product domains
• Every (tech) job is (going to be) affected, and if we learn from the mistakes InfoSec has made over the years we can influence this journey positively

Status: 2021-03-31T22:00:00Z

# Machine Learning beyond BS

There are many definitions of ML, but this one is mine[tm]:

Machine Learning (ML) uses mathematical (Computer Science) methods with data to generate predictions.

So… predictions eh? Sounds harmless. – How good was your computer from the 1990s at estimating things? At predicting your next line in the text editor or word processor? At IntelliSense and word completion? At pre-loading the programs to save startup time? At translating texts? At grouping activity data and segmenting it based on automatically extracted patterns? Correct: it wasn’t doing that. Welcome to ML.

## Will Machine Learning (ML) / Deep Learning (DL) / Artificial Intelligence (AI) replace humans?

– It’s the next “revolution”. The more I learn about ML the more I believe that it’s about “trust and verify”. It won’t replace humans, but it will certainly be part of the most influential changes over the next decades. You got the choice: work all by yourself or get some assistance in the form of “predictions”.

### InfoSec + ML = future?

InfoSec for the most part is “trust and verify”. But this isn’t just an InfoSec mentality. One of the key issues with Machine- or Deep Learning (Neural Networks etc.) is that you cannot (always) explain how the prediction was generated.
– This influenced DARPA to define a project about explainable Artificial Intelligence[1].

DARPA’s XAI page delivers an understanding many books strive to deliver across 100s of pages:

– You can find ML where it’s about automation scenarios where we do not know “all” the rules. Some estimations / predictions may be needed.

In the physical world, someone will always have to look after autonomous cleaning robots or self-driving cars. They can be “dumb” and just try to drive up a wall.

In the media world, someone will have to correct auto-generated news journals. In retail, someone will have to adjust recommendation systems and logistic plans to stock up goods.

And in coding, someone will have to piece together the right code for the right app. All of that will create new InfoSec requirements as well. Change gets accelerated, faster-paced, and less distinguishable from auto-generation. And at some point: what isn’t accelerated with ML will die from slowness like a snail that was too slow to seek cover from the hot sun.

## More ML definitions - more understandings

• If you can build a simple rule-based system, you don’t need ML. That system doesn’t require estimations because you have a robust (statistical) model to get predictions / correct output.

In ML, you train a model with data to use the generated predictions with new data. Instead of rule-based / conditional decision procedures (if, then, else) the model is used and you rely on its predictions. That’s how it affects software development and the way software can describe business logic.

If you can clearly define the rules of a system, ML (DL and AI) are not needed.

• For simple step-by-step processes in industrial automation, ML is not required. I don’t foresee that we are going to develop PLCs which orchestrate simple assembly line machines with ML. At least as long as the logic resembles the simplicity of a cooking recipe.
• When it comes to predictive data analytics things start to become different. Or with self-driving cars. These use cases don’t allow a full definition of rule-sets. Estimations are required, and this is where you’ll begin to need ML.
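To make the contrast concrete, here is a minimal sketch. The toy numbers, the function name `rule_based_alert`, and the "failed logins" scenario are invented for illustration. The first half is a hand-written rule; the second half lets a tiny decision tree from sklearn estimate the threshold from labeled examples instead.

```python
from sklearn.tree import DecisionTreeClassifier

# Rule-based: the threshold is known up front, so no ML is needed.
def rule_based_alert(failed_logins: int) -> bool:
    return failed_logins > 5  # hypothetical, hand-written rule

# ML: nobody writes the threshold down; it is estimated from labeled examples.
# Toy data, invented for this sketch: failed logins per hour -> alert (1) or not (0).
X = [[0], [1], [2], [3], [8], [9], [12], [15]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = DecisionTreeClassifier(max_depth=1, random_state=0)
model.fit(X, y)

learned_threshold = model.tree_.threshold[0]  # split point the tree picked
print("learned threshold:", learned_threshold)
print("rule says:", rule_based_alert(10))
print("model says:", bool(model.predict([[10]])[0]))
```

With clean rules like this one the model adds nothing. It starts to pay off when nobody can write the rule down.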

## ML is such a broad topic

In the following, this Wiki post:

• transports a rudimentary understanding (basics)
• links to a checklist in Application Security for Machine Learning
• takes the route to introduce model verification and ML as part of ML software development
• remains focused on simplicity

## ML in short - 4 steps

1. You need data (structured, possibly labeled, …). I’d say usually more than 10k rows and a couple of columns
2. Most orgs have unorganized data that’s unfit for machine learning unless it’s prepared / enriched etc. This step is often called feature engineering, but I prefer to avoid this term.
3. Then we “train” & test models, meaning we define math functions
4. Select a model that scales

tl;dr: it’s math with data, and a step by step process to produce more or less accurate predictions
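The four steps can be sketched with sklearn roughly like this. It is a simplification under assumptions: the Iris data-set stands in for "your data", scaling stands in for the whole preparation step, and the two candidate models are arbitrary picks.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Data: Iris is tiny (150 rows), far below the 10k rows mentioned above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Preparation: here only scaling; real data usually needs far more work.
# 3. Train and test a few candidate models.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "k_nearest_neighbors": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on unseen rows

# 4. Select: here simply the best score; "scales" also means cost and latency.
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Picking the winner on the testing-set is a shortcut for the sake of brevity. With more data you would keep a separate validation set for that decision.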

## My Model

A model is a math function with vector input (let’s name it \vec v) and numerical output.

• A model is a mathematical function, like f(\vec v) or function(x, y, z)(...)
• It takes an input vector \vec v and then generates a numerical output (prediction).
• The Learning in “Machine Learning” refers to training the model: the adjustable values inside the function (not the input \vec v itself) are tuned so that the predictions (generally) improve
• In ML functions have free parameters[2] which influence the computation output.
• In my terminology this is a heuristic because you fit these values to the ML application and use your domain experience, eye measurement, etc.

Parts of this can be automated because models can get auto-generated and measured. It’s just estimations and math.

tl;dr: ML is related to heuristics if you consider how free parameters may get adapted automatically
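A minimal sketch of "a function with free parameters", assuming the simplest possible model, f(v) = w · v + b. The names `w`, `b`, and `learning_rate` and the toy data are mine, not from any library. Training means nudging `w` and `b` until the predictions fit the data.

```python
import numpy as np

def f(v, w, b):
    """Model: vector input v, free parameters w and b, numerical output."""
    return float(np.dot(w, v) + b)

# Toy training data generated from a known rule (y = 2*v1 - v2 + 0.5),
# so we can see whether training recovers the parameters.
rng = np.random.default_rng(0)
V = rng.normal(size=(200, 2))
y = V @ np.array([2.0, -1.0]) + 0.5

w = np.zeros(2)  # free parameters start as a blind guess
b = 0.0
learning_rate = 0.1
for _ in range(500):
    error = V @ w + b - y                        # prediction minus target
    w -= learning_rate * (V.T @ error) / len(y)  # gradient of 0.5 * mean squared error
    b -= learning_rate * error.mean()

prediction = f(np.array([1.0, 1.0]), w, b)
print("learned w:", w, "learned b:", b)
print("prediction for [1, 1]:", prediction)
```

The learning rate and the number of iterations are themselves guessed values, which is the heuristic part.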

### Isn't ML just statistics?

Yes: there are statistical components in most ML applications (to save computation time by reducing the solution space), but the overall difference is that more guesswork (heuristics) is involved. A more academic discussion of this is out of the scope of this website because this is not a university trying to resell an undergraduate Computer Science curriculum as “Machine Learning” (hello MITx et al.). YouTube university is full of that, and this caters to a different audience.

The academic consensus seems to be to count Statistical Regression (Modelling) into Machine Learning, but this is sometimes blurry when it’s called Data Science. I don’t count linear regression into ML, or even artificial intelligence. Fite Me

As mentioned in an earlier paragraph:

As a Computer Scientist, I associate this with Linear Algebra and Discrete Structures. As long as you understand enough[tm], anything goes. Getting towards an appropriate understanding may require skills like Bias-Variance Trade-Off analysis (estimation theory) or again statistical analysis.

tl;dr: ML has to do with math if you want to be able to improve the results.

#### Heuristics? Buzzword bingo?

You’ll most likely have to estimate some value, even for layered ML models. Refer to the note on terminology and jargon: I don’t think that we have to become terminology experts.

If we want to apply ML we need to know enough about the math and the algorithms to adjust the free parameters at least.

Not everyone has to implement the algorithms, but ML problems differ from sorting problems in the sense that assessing the results requires deeper / scientific insights given that the resulting quality is not as obvious.

tl;dr: you can use ML libs, but you need to be able to check them
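One way to check a library result instead of trusting a single guess is to try several values for a free parameter and compare them with cross-validation. A sketch using sklearn's `GridSearchCV`; the classifier and the candidate values are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# n_neighbors is a free parameter we would otherwise pick by eye.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,  # 5-fold cross-validation: every row is used for testing once
)
search.fit(X, y)

print("best n_neighbors:", search.best_params_["n_neighbors"])
print("mean cross-validated accuracy:", round(search.best_score_, 3))
```

The search automates the guessing, but judging whether the resulting accuracy is good enough for the use case is still on you.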

#### Deep Learning? Reinforcement Learning?

Connections between Deep Learning[3] and ML are tangential. To start with, we are concerned with Supervised and Unsupervised Learning.

Super-short overview:

• Supervised Learning has use cases in Spam Detection, Image Classification, …
• Unsupervised Learning in Clustering (usually of sequences), or Language Processing (sentiment analysis, topic mapping (e.g. automatic tagging, semantic analysis, …))

Generally, I would use Unsupervised Learning on categorical data if I am not interested in a prediction output and instead want to cluster / segment the data.
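The difference shows up directly in the code: a supervised model is fitted with labels, an unsupervised one without. A small sketch on the Iris data-set; both algorithms are arbitrary picks for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are given, the model learns to predict them.
classifier = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("predicted class of first row:", classifier.predict(X[:1])[0])

# Unsupervised: no labels, the algorithm only groups similar rows.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster of first row:", kmeans.labels_[0])
```

Note that the cluster numbers are arbitrary group ids. They do not correspond to the class labels unless you map them yourself.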

In contrast:

• (Deep) Reinforcement Learning has use cases in Chess / Go AI[4]s, or an AI playing video games. It’s about an agent making decisions within an entire (simulated) environment.

– In other words: ML is for static / unchanging / repetitive tasks. In practice, you’ll see that sometimes ML algorithms are layers before RL / DL / DRL (=AI).

– AI is out of scope in this Wiki essay. The solution space for AI tasks is larger than for (un)Supervised Learning applications. A commonality to mention is that it’s about predictions and making decisions.

– ML may be defined as the combination of Supervised and Unsupervised Learning including tabular integration of data. You can do ML in Excel if you want, speaking of tabular integration and the minimally required programming to call something ML[5]. Actual algorithmic ML is not well suited for Excel (or even SPSS, Tableau etc. if you ask me). But you must think about the deployment as well.

– Data Science for me may be defined as preparing data, data-sets, and the (pure) statistical data analysis (incl. statistical modeling). It’s possible that you don’t need ML to solve the problem at hand, given that probability and statistics may offer robust prediction methods.

tl;dr: Supervised and Unsupervised Learning belong in ML, but that’s not AI.

## Training the Model - Learning

In case we can already predict exact matches, we do not need ML. Unfortunately, we cannot always generate ideal predictions, even if we have mapped, prepared, and analyzed the data.

If we work on one data set we may fine-tune these free parameters too specifically for that set. This is called overfitting, because the model will then not work well enough on other data-sets with these parameters.
– If we select parameters that work well enough for new data, this is called generalizing.

Training-data is the interface for Supervised and Unsupervised Learning. It’s how we interact with the implementation (of a library) to determine / heuristically guess the free parameters of the ML app.

tl;dr: some level of methodical guesswork is involved
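Overfitting is easy to provoke on purpose. The sketch below uses invented toy data (a straight line plus noise) and fits two polynomials with numpy: degree 1 has few free parameters, degree 9 has many. The degrees and the sample sizes are arbitrary assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def noisy_line(n):
    """Toy data: a straight line plus noise (invented for this sketch)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 2.0 * x + rng.normal(scale=0.3, size=n)

x_train, y_train = noisy_line(12)
x_new, y_new = noisy_line(200)  # "other data" the model has not seen

def mse(model, x, y):
    """Mean squared error of the model's predictions."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_new, y_new))
    print(f"degree {degree}: train MSE {results[degree][0]:.3f}, "
          f"new-data MSE {results[degree][1]:.3f}")
```

The degree-9 fit can never be worse on the training data, because it contains the straight line as a special case. Whether it also holds up on the new data is the question that matters, and typically it does not.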

### Prevent overfitting

• Split data into a training- and a testing-set
• Train model on training-set and measure with testing-set

The rule of thumb for the split seems to be to put about 20% into the testing set (and the rest into the training set), pseudo-randomly selected if it makes sense.

tl;dr: Testing-set results indicate model quality.
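In sklearn the split is one call. A sketch assuming the common convention of holding back the smaller share (here 20%) for testing; `random_state` is only there to make the pseudo-random selection repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 rows

# Hold back 20% for testing; the rows are shuffled pseudo-randomly first.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

For time series or logs a random shuffle may not make sense. In that case a split by time is the more honest test.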

## Vectorizing

The data we use has to be relatively simple.

From Ari via Email.

– Pictures will be represented as a vector / matrix (usually there is a fair amount of image processing involved).

– Text can be (sequentially) encoded to get a vector representation.

All data is the same. – Finance, Information Security, … that’s also why some people conclude ML is “just statistics”.
But that would obviously exclude data-set preparation and vectorizing as well as training the model.

– Once you have a fit model, you can translate it into a statistical function, but that’s likely to result in overfitting again.

tl;dr: proper vectorizing allows for proper ML, independent of the domain. The domain matters as much as a scientific understanding to assess the resulting quality.
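A small sketch of what vectorizing can look like. The 8x8 "image" and the two log lines are invented. For text I use sklearn's `CountVectorizer`, which is a bag-of-words encoding: it counts words and ignores their order, so it is the simplest option rather than a sequential encoding.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# An "image": 8x8 grayscale pixels, flattened into one vector of 64 numbers.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0  # a bright square, invented for this sketch
image_vector = image.reshape(-1)
print("image vector length:", image_vector.shape[0])

# Text: count how often each known word occurs in each line.
log_lines = [
    "failed login from unknown host",
    "successful login from known host",
]
vectorizer = CountVectorizer()
text_vectors = vectorizer.fit_transform(log_lines).toarray()
print(sorted(vectorizer.vocabulary_))  # the words behind the columns
print(text_vectors)                    # one row of counts per log line
```

After this step the image and the log lines are both just arrays of numbers, which is why the same algorithms can work across domains.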

## Practical ML snippets -- with Python

To represent data-intensive programming, we need to share the data and the code. We can get around this for the Hello-World ML code, but for InfoSec it’s not that easy. – Where do you find good data (from cases, incidents, attacks, …) related to security events? No one shares this publicly, right?

### Hello World with the Iris data-set and sklearn

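    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    iris = load_iris()  # load the data-set the split below relies on
    X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.5)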


Why are we doing this? (refer to earlier paragraph, also concerning the test_size parameter)
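To round off the Hello World, here is one possible continuation: train on one half and measure on the other. The classifier (k-nearest neighbors) and `random_state=0` are my assumptions; any sklearn classifier would do here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0
)

model = KNeighborsClassifier(n_neighbors=3)  # classifier choice is an assumption
model.fit(X_train, Y_train)                  # learn from the training half only

accuracy = model.score(X_test, Y_test)       # measure on the unseen half
print(f"test accuracy: {accuracy:.2f}")
```

The model never sees the testing half during training, so the accuracy says something about new data and not just about memorized rows.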

## Practical ML -- seen in reality

### XGBoost - anomaly detection

XGBoost (eXtreme Gradient Boosting) is a relatively new Ensemble Algorithm[6] that uses regression trees (CART[7]) as weak learners in a sequential learning process. In other words, it creates an army of weak learners, each of which performs only slightly better than random, and adds them up one after another so that the ensemble performs much better than any single tree. – It’s a supervised learning algorithm.
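To show the "army of weak learners" idea without extra dependencies, the sketch below uses sklearn's `GradientBoostingClassifier` as a stand-in. This is not XGBoost itself, only the same family of algorithm, and the data-set and settings are arbitrary assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 50 depth-1 regression trees ("stumps"), each fitted to the errors left by
# the trees before it. A stand-in for XGBoost, not XGBoost itself.
model = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields the ensemble's prediction after 1, 2, ..., 50 trees.
staged_accuracy = [
    float((prediction == y_test).mean())
    for prediction in model.staged_predict(X_test)
]
print("after  1 tree :", round(staged_accuracy[0], 3))
print("after 50 trees:", round(staged_accuracy[-1], 3))
```

Comparing the first and the last value gives a feel for how much the sequence of weak learners adds over a single stump.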

3. DL - Deep Learning, RL - Reinforcement Learning, DRL - Deep Reinforcement Learning, … ↩︎

4. Artificial Intelligence ↩︎

5. Ref: Hands-On Machine Learning with Microsoft Excel 2019: Build complete data analysis flows, from data collection to visualization – Julio Cesar Rodriguez Martino ↩︎

6. Ensemble learning - Wikipedia / in short: “the art and science of combining machine learning models” ↩︎