A showcase of concurrent, swap-backed security event enrichment with Modin (Pandas-style), Python, and a mapping loaded from a CSV - Pivot tables
This refers to the code listing documented in a post I made earlier this month.
The purpose of this is primarily to maintain a knowledge base of security data-science topics.
Jupyter Notebook (full implementation)
Here is the documented code in the form of a Jupyter Notebook:
Note: a Cloudflare WAF event ID that has no match in the CSV map may be an anomaly-based detection (only).
Key factors of the implementation
This code is made to process millions of lines of JSON in a reasonably fast manner.
Besides static lookups within a loaded CSV, more sophisticated functions could be implemented for other use cases.
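The static CSV lookup can be sketched as a left merge. This is a minimal illustration, not the notebook's actual code: the file layout, column names, and rule IDs below are invented for the example. The merge indicator also shows how an unmatched Cloudflare WAF event ID (a possible anomaly-based detection) surfaces:

```python
import pandas as pd  # with Modin, this line becomes: import modin.pandas as pd

# Hypothetical parsed WAF events (in practice: millions of JSON records)
events = pd.DataFrame({
    "rule_id": ["100015", "100015", "981176", "odd123"],
    "action": ["block", "block", "challenge", "block"],
})

# Hypothetical CSV lookup map: rule_id -> human-readable description
# (in practice loaded with pd.read_csv)
rule_map = pd.DataFrame({
    "rule_id": ["100015", "981176"],
    "description": ["Header anomaly", "Inbound anomaly score exceeded"],
})

# Left merge keeps every event; indicator marks rows with no CSV match,
# which may be anomaly-based detections (only)
enriched = events.merge(rule_map, on="rule_id", how="left", indicator=True)
unmatched = enriched[enriched["_merge"] == "left_only"]
```

Because Modin mirrors the Pandas API, the same merge scales out once the import is swapped.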
Could command-line data-science outperform the implementation?
Yes, especially for smaller amounts of events / JSON records. The main problem is summing up the value counts. Implementing statistical processing in shell scripts would not be feasible, especially not on time series. Don't do that: time math can cause major frustration during development.
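To make the point concrete, here is a sketch of summing value counts over time windows. The timestamps and field names are made up; the point is that windowed counting is one expression in Pandas, while the equivalent time math in a shell pipeline is painful:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Hypothetical event stream with timestamps (illustrative field names)
df = pd.DataFrame({
    "ts": pd.to_datetime([
        "2022-01-01 00:00:30", "2022-01-01 00:01:10",
        "2022-01-01 00:04:00", "2022-01-01 00:05:20",
    ]),
    "action": ["block", "block", "challenge", "block"],
})

# Value counts per 3-minute window, in a single expression
counts = df.groupby([pd.Grouper(key="ts", freq="3min"), "action"]).size()
```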
Could using small portions of the logs at a time be faster with just Pandas and Swifter?
– Like using 3 minutes (or less) of data per iteration within the batch processing, and using swifter for the apply calls?
Yes, especially on a single small-scale system where the selected data (of the timeframe) fits in memory.
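A swifter-based variant for such a small batch could look like the sketch below. The enrichment function and data are placeholders; swifter registers a `.swifter` accessor on Pandas objects, and the fallback keeps the sketch runnable when swifter isn't installed:

```python
import pandas as pd

# Hypothetical per-record enrichment function (stand-in for real logic)
def is_blocked(action: str) -> int:
    return 1 if action == "block" else 0

df = pd.DataFrame({"action": ["block", "challenge", "block"]})

try:
    import swifter  # noqa: F401  # parallelizes .apply when available
    df["blocked"] = df["action"].swifter.apply(is_blocked)
except ImportError:
    # Fallback: plain Pandas apply, same result on a single core
    df["blocked"] = df["action"].apply(is_blocked)
```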
The implementation instantiates Ray in local mode only and doesn't distribute the load horizontally across multiple computing nodes. With larger amounts of data (that do not fit into memory), you will face increasing performance problems. Modin abstracts this away and lets you scale horizontally.
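The difference is a single line at initialization time. A minimal sketch, assuming Ray is installed and a head node exists (the addresses are examples, not values from the notebook):

```python
import ray

# Local mode: all workers run on this one machine (what the notebook does)
ray.init()

# Cluster mode: attach to an existing Ray head node instead, so that
# Modin partitions DataFrames across machines:
# ray.init(address="auto")                    # auto-discover a local cluster
# ray.init(address="ray://head-node:10001")   # connect via Ray client
```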
Wouldn't streaming the data be more suitable?
Assuming that other use cases will include security event correlation and statistical anomaly detection (for Threat Hunting), problems with Continuous Integration (or other forms of automation) would be difficult to handle: How do you continue an interrupted process? How do you buffer input (and what if the buffer grows larger than memory)? How do you distribute the load?
Many streaming approaches work on a single core / system only.
For more real-time focused detections, streaming is needed, and you'd apply workflows like that to a buffer. That would be a small portion of data, and probably a use case for a simpler swifter-based implementation.
Swifter can also work with Modin / Ray (or Dask) DataFrames, which means that memory spilling (swapping) is available. The data structure will reside primarily in memory, and a "plasma" file will be used much like a swap file. On Linux / Docker it resides in /tmp/, but this is configurable. You could handle large buffers on an extra SSD.
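Relocating the plasma store can be sketched as below. Note the hedge: the exact parameter has changed across Ray versions, and the path is an example:

```python
import ray

# Move Ray's session files (including the plasma object store file)
# from /tmp onto a larger SSD; the path is a made-up example
ray.init(_temp_dir="/mnt/ssd/ray")
```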
Is Ray / Modin mature?
Modin and Ray are relatively new. There are some issues with the logging (I needed to suppress the Ray agent logs). Besides that, it's fantastic that we can run Pandas analysis workflows in a scalable manner, on really big problems.
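For reference, quieting Ray's log output can look like this sketch (both parameters are `ray.init` options; whether they silence everything depends on the Ray version):

```python
import logging
import ray

# Raise Ray's log level and stop forwarding worker logs to the driver
ray.init(logging_level=logging.ERROR, log_to_driver=False)
```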
For security data analysis these frameworks are awesome. If you want to develop the next disruptive security technology, you could develop a SIEM based on these frameworks that works on top of open Log Management solutions like Elasticsearch. As demonstrated, coding isn't the issue here.
In the end, SIEMs are just parsers and Pivot tables.
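That claim can be made concrete: once the events are parsed, a basic SIEM-style dashboard view is literally a pivot table. The fields below are invented for illustration:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Hypothetical parsed events (field names invented for the example)
events = pd.DataFrame({
    "src_country": ["US", "US", "DE", "DE", "DE"],
    "action": ["block", "allow", "block", "block", "allow"],
})

# Events per country and action: the kind of view a SIEM dashboard shows
summary = pd.crosstab(events["src_country"], events["action"])
```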
Why develop it in Jupyter?
With Jupyter, code can be segmented into cells, which are short REPL[^10] sections. You can "printf-debug" anything, which is why the comments have been left in the notebook. The notebook can be saved as Python code, containerized, and deployed just like any other Python code. Generally, data-heavy workflows are much easier to debug this way.
In an advanced software architecture, I’d use this as a prototype only, and implement some cells in a Plugin-Oriented way. The REPL style of development creates a highly introspective development experience. But reusing cell code is not as easy as it could be, although the basis is there and you could refactor it.
I'd write the parser in Rust, REPL style, to keep it well-documented, and import it into a small "capsule" Python program that orchestrates a workflow and some table processing. For the use case here, given that it's supposed to introduce concurrency and memory-backed processing in general for the problem space of Information Security data analysis, this is too much for one post.
Why not use a database as soon as possible?
Some databases support horizontal scaling with approaches like sharding, but most often they aren't made for high ingest volumes. True, you could pass much of the scalability challenge to a backend database instead of handling it in a scaling middleware. But I doubt that this is the best approach for data as temporary as this.