Go and data-frames - JSON reading without mapping - get ready to machine-learn

json
cloudflare
data_science
docker
coding
superset
golang


Pandas >= Go's data-frame libs?

If you are into Data-Science, you probably know Pandas {1}. Among many other things it introduces R-like data-frames to Python {2}.

The concept of a data-frame is useful for many statistical tasks, including applied Machine Learning.
One advantage is that the columns of a data-frame can be used as (numerical) vectors. Tabular vector data-structures are what we need to get ready to machine-learn with data, provided that we have appropriate data-sets.

Go may well become a lingua franca for mathematical computing tasks within the domains of “Deep Learning”, “Neural Networks”, “Machine Learning” etc. Think of it as a potential Python replacement.

It is not mutually exclusive

You can call Go libraries from Python {3}. Generally speaking, side-by-side comparisons between languages matter less than interoperability.
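As a quick illustration (a minimal sketch, not from this post's actual tooling): Go functions can be exported via cgo and built as a C shared library, which Python then loads through ctypes.

// add.go, build with: go build -buildmode=c-shared -o libadd.so add.go
package main

import "C"

//export Add
func Add(a, b C.int) C.int {
	return a + b
}

// main is required for the c-shared build mode, but never runs.
func main() {}

From Python this is then a one-liner: ctypes.CDLL("./libadd.so").Add(2, 3).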

The objective is to read massive amounts of data with Go and to put the de-serialized (JSON) data into a structure. This structure can be a time-series, where the sequential order matters. Applying string manipulation within a time-series data-set that gets de-serialized from JSON is a key feature for security analytics. Doing this without an interpreter runtime has many advantages.

Python Pandas and JSON

Pandas at its core is a high-level library around NumPy-based arrays. Pandas provides a rich time-series analysis toolkit, which can help to cross-correlate log- and security-data from various domains with very little code.

For reference, Python in the following refers to:

➜  ~ source ~/miniconda3/bin/activate
(base) ➜  ~ python
Python 3.7.0 (default, Jun 28 2018, 07:39:16)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

And for Go:

➜  ~ go version
go version go1.10.3 darwin/amd64

Python 3 code

Let’s keep this short: first we define a dict for the authentication data we read from a file (we don’t hardcode credentials, do we?!)

# api_key and mail are loaded from a config file elsewhere, not hardcoded
data = {
    'X-Auth-Key': api_key,
    'X-Auth-Email': mail
}

Then we make a request with Python 3's requests library, wrapped in exception handlers that are checked in order, similar to a Chain of Responsibility.

import requests

r = None
try:
    r = requests.get(url + zone + path + start + end + timestamps, headers=data)
except requests.exceptions.HTTPError as e:
    print("ERROR: " + str(e))
except Exception as e:
    print("Unknown error: " + str(e))
finally:
    if r is not None:
        r.close()  # note the parentheses: r.close alone does nothing

Since we finally got to use finally, we pat ourselves on the back and call it a day. What a productive way to code. The declarations of the concatenated variables for the request URL string are not shown here.

If you want to read (de-serialize) record-based JSON, you need to set the lines parameter of Pandas’ read_json function to True.

Keep in mind that Pandas uses ujson to parse the data, which is written in C. Language interoperation is not uncommon in the Python ecosystem.

import pandas as pd
result = pd.read_json(r.content, lines=True) 

Now obviously we could have done all of this in a single line.

Pandas and time math

Let’s say for technical reasons we have to have a 5 (or 6) minute delay. Adding a couple of minutes to a timestamp sounds easy… but sometimes these easy things turn out to be difficult.

from datetime import datetime, timedelta

PAST_CF_CONSTRAINT = 5  # delay in minutes
end_time = datetime.now() - timedelta(minutes=PAST_CF_CONSTRAINT)
end_time = end_time.strftime('%s')  # '%s' (Epoch seconds) is platform-dependent; works on Linux/macOS
print("The End   is " + end_time + ", " + datetime.fromtimestamp(float(end_time)).strftime('%c'))

Not in this case.

This should wrap up Python. Easy…

Golang Clutter code

I am new to Go… I want to keep it simple, but I gradually want to move away from Python 3, at least to a certain degree.

Print coloured text on the terminal

Printing bold text. Bold and colorful. I’m skipping the imports here, because GoLand or VS Code will add them for you anyway (a plausible import block follows below).

func main() {

	// color.New is presumably from github.com/fatih/color (see the import block below)
	boldRedu := color.New(color.Bold)
	boldRedu.Println("Reading configuration file config.ini from working directory.")

Magic… and a main() function. Just keep in mind that the following snippets all appear within the main() function, but the tabs are removed to keep the post’s format.
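For reference, here is a plausible import block for the main() snippets below. The third-party paths are assumptions based on the API calls used (fatih/color for the bold output, go-ini for the config reader); the two data-frame libraries are introduced further down:

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"

	"github.com/fatih/color" // assumption: the color library in use
	"github.com/go-ini/ini"  // assumption: the ini reader in use

	"github.com/kniren/gota/dataframe"
	"github.com/tobgu/qframe"
)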

Read a configuration ini-file

Next we load some values (API Keys etc.) from an ini-file. Many people use JSON for everything, but it’s not a config standard. Don’t do that, folks.

cfg := loadConfig("config.ini")
cloudflareAPIKey := cfg.Section("CLOUDFLARE").Key("API_KEY").String()
cloudflareAccount := cfg.Section("CLOUDFLARE").Key("ACCOUNT").String()
cloudflareZone := cfg.Section("CLOUDFLARE").Key("ZONE").String()

In this particular case we want to use the Enterprise Log Share REST API to pull some logs. This needs to be enabled.

Currently there are some restrictions, in contrast to common Log Aggregation solutions (I encode them as constants in a sketch right after this list):

  • You can query for log data starting at 5 minutes in the past (relative to the actual time the request is being made) and going back up to 72 hours.
  • You can only request 1 GB at a time (compressed data).
  • You can only make 1 request every 5 seconds.
  • You can only request from a 1 minute bucket.
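One way to keep these constraints visible in the code is to encode them as constants (a sketch; the names are mine, the values come from the list above):

// ELS API constraints, for documentation purposes.
const (
	minLogDelayMinutes = 5  // logs are available starting 5 minutes in the past
	maxLookbackHours   = 72 // and going back up to 72 hours
	maxRequestSizeGB   = 1  // at most 1 GB (compressed) per request
	minRequestGapSecs  = 5  // at most 1 request every 5 seconds
)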

The respective config.ini may look like this:

[CLOUDFLARE]
API_KEY=123
[email protected]
ZONE=321

Easier than YAML or JSON.

Finally we need to look at a function that returns an object:

func loadConfig(cfgfile string) *ini.File {

	cfg, err := ini.Load(cfgfile)
	if err != nil {
		fmt.Printf("Failed to read file: %v\n", err)
		os.Exit(1)
	}

	return cfg
}

The string cfgfile points to a path, e.g. “config.ini” in the current working directory of the compiled executable. The function returns a pointer to the instantiated object, which gets assigned to a local variable in the main() function. Go uses escape analysis and moves the object to the heap, where it is garbage-collected.
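A minimal sketch of what that means in practice: returning a pointer to a local variable is safe in Go, because escape analysis moves the value to the heap (the type and values here are made up).

package main

import "fmt"

type config struct {
	path string
}

// newConfig returns a pointer to a local variable. In C or C++ this would
// be a dangling pointer; Go's escape analysis heap-allocates cfg, and the
// Garbage Collector frees it once nothing references it anymore.
func newConfig() *config {
	cfg := config{path: "config.ini"}
	return &cfg
}

func main() {
	fmt.Println(newConfig().path)
}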

Working with timestamps

Now… math with time. Everyone’s favorite. 0…59 or 1…60. Same thing, different day, minute or hour. Time-math can cause serious mental problems. Be careful.

const pastCloudFlareConst = 6
const timeFromNow = 8
end := time.Now().Add(time.Duration(-pastCloudFlareConst) * time.Minute)
start := end.Add(time.Duration(-timeFromNow) * time.Minute)

We calculate end based on the current time minus pastCloudFlareConst (in minutes) via time.Minute. For start we .Add a negative duration to end.
The result of this simple computation can be printed in a human-readable form:

fmt.Println("Start   :" + start.String())
fmt.Println("End     :" + end.String())

For the API call here (and it's a common standard) we need Unix (Epoch) timestamps. This is simple:

startString := strconv.FormatInt(start.Unix(), 10)
fmt.Println("Start epoch  :" + startString)
endString := strconv.FormatInt(end.Unix(), 10)
fmt.Println("End epoch    :" + endString)

The 10 here formats the integer for the decimal number system (base 10), which actually should be the default for .FormatInt as far as I am concerned.

Basic string operations

Next we do some string concatenation. The & characters are just parts of the strings, API-specific separators for the parameter handling on the server side.

cloudflareURL := "https://api.cloudflare.com/client/v4/zones/"
cloudflareZone = cloudflareZone + "/"
cloudflarePath := "logs/received?"

And:

cfURL := cloudflareURL +
	cloudflareZone +
	cloudflarePath +
	"start=" + startString +
	"&end=" + endString +
	"&timestamps=unixnano"
fmt.Println(cfURL)

These variables have been defined via the ini-file config reader, if you remember. The timestamps have been derived from the current time and the const values that are based on the API restrictions here.
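As a side note, the standard library's net/url can build such URLs without manual concatenation. A small sketch with hypothetical stand-in values for the variables above:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	// Stand-ins: the zone ID and epoch strings come from the config/time code above.
	base, err := url.Parse("https://api.cloudflare.com/client/v4/zones/321/logs/received")
	if err != nil {
		panic(err)
	}
	q := url.Values{}
	q.Set("start", "1538477972") // startString
	q.Set("end", "1538478452")   // endString
	q.Set("timestamps", "unixnano")
	base.RawQuery = q.Encode() // Encode sorts the keys alphabetically
	fmt.Println(base.String())
}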

Make HTTP POST requests

We need to make an HTTP POST request to the API endpoint to retrieve the data. In order to prepare this we create an http.Client object.

var postData []byte
client := &http.Client{}
req, err := http.NewRequest("POST", cfURL, bytes.NewReader(postData))
if err != nil {
	log.Fatal(err)
}

The POST headers get added after the instantiation. Then the request is sent. The response data will later be read into a []byte slice, which means the raw data ends up in a variable.

A []byte slice grows as needed while data is appended to it {6}. bytes.NewReader wraps such a slice in a type that satisfies the io.Reader interface. For our particular use case here this is fine, and keep in mind that Go is garbage-collected, unlike C++.
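To be precise about the types, a small sketch: bytes.NewReader wraps the slice in a *bytes.Reader, which satisfies the io.Reader interface that http.NewRequest expects. A nil slice simply reads as an empty body:

package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
)

func main() {
	var postData []byte                         // nil slice: an empty request body
	var r io.Reader = bytes.NewReader(postData) // *bytes.Reader satisfies io.Reader
	n, err := io.Copy(ioutil.Discard, r)
	fmt.Println(n, err) // 0 <nil>
}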

We set the POST request headers:

req.Header.Set("X-Auth-Key", cloudflareAPIKey)
req.Header.Set("X-Auth-Email", cloudfareAccount)
resp, err := client.Do(req)
if err != nil {
	log.Fatal(err)
}

The body gets closed via defer (once the surrounding function returns), and the data is read from resp.Body:

defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
	log.Fatal(err)
}

Finally the variable body holds the JSON records from the remote API endpoint.

Auto-parse and de-serialize JSON records into a tabular data-structure

I just want to sort the JSON records into columns, given that these are log fields. Same columns, every line… kind of. Hopefully at least. JSON without a schema is a recipe for surprises, you know.

Here is some magic code to parse the JSON records, and to put them in a data-frame.

sbody := "[" + strings.TrimRight(strings.Replace(string(body), "\n", ",", -1), ",") + "]"
gdf := dataframe.ReadJSON(strings.NewReader(sbody))
qdf := qframe.ReadJSON(strings.NewReader(sbody))

The first line wraps the record-oriented JSON into a JSON array. In order to do this it adds square brackets at the beginning and the end.
Then read it from the inside out: strings.Replace(string(body), "\n", ",", -1) replaces the line-breaks with commas. The last parameter (-1) is less than 0, which tells the function to replace all occurrences {7}. It might look confusing, but it's useful.

Let the result of the inner string operation be someString. strings.TrimRight(someString, ",") is the outer level of that line of code. It removes the final , to comply with the JSON standard. I found this to be a short and simple way to do it, for reference.
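A self-contained demonstration of the wrapping step, with two fake records:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Two newline-delimited JSON records, like the ELS API returns them.
	body := "{\"ClientIP\":\"1.2.3.4\"}\n{\"ClientIP\":\"5.6.7.8\"}\n"
	sbody := "[" + strings.TrimRight(strings.Replace(body, "\n", ",", -1), ",") + "]"
	fmt.Println(sbody)
	// [{"ClientIP":"1.2.3.4"},{"ClientIP":"5.6.7.8"}]
}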

I see two data-frame objects, Sir

Indeed. The “magic code” that reads the JSON records without having to create a map uses gota, a Go-based data-frame library, and qframe, another Go-based data-frame library that is a little more mature.

  1. gota - "github.com/kniren/gota/dataframe"
  2. qframe - "github.com/tobgu/qframe"

LOC and LLOCs

Note: the code that reads the JSON records into a data-frame could be a one-liner, but using sbody here makes the code more readable for the sake of a tutorial.

Besides, one-liners are hard to debug.
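For reference, the qframe variant as a one-liner (it reuses body and the imports from above):

qdf := qframe.ReadJSON(strings.NewReader(
	"[" + strings.TrimRight(strings.Replace(string(body), "\n", ",", -1), ",") + "]"))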

Summary

With Go we can:

  • read config files
  • perform string operations
  • make API calls
  • create data objects suitable for statistical analysis

Note that C++ Boost has got a data-frame proposal from GSoC {5}.

gota and qframe - prepare for Machine Learning

In the following:

  • gdf be a gota data-frame object, commit 737ddace4c9ee84d774f8747c855639a81b5f7c2
  • qdf be a qframe object, commit 9b66547f7f2785e9c8aa1bb4c74a441f68ba0f61

Given that gota sees less development I will focus more on qframe.

Imagine with data

Here is the de-serialized data within a tabular frame object:

> ClientIP(s)  ClientRequestHost(s)  ClientRequestMethod(s)  ClientRequestURI(s)  EdgeEndTimestamp(f)  EdgeResponseBytes(f)  EdgeResponseStatus(f)  EdgeStartTimestamp(f)  RayID(s)
> -----------  --------------------  ----------------------  -------------------  -------------------  --------------------  ---------------------  ---------------------  --------
> 212.202...   test.com              GET                     /test?token=eyJ...   123                  390                   101                    1538478392686000000    4636b...
> 109.1.3.4    test.com              GET                     /test?token=eyJ...   123                  380                   101                    1538478594536000000    4636b...
> 2a02:810...  test.com              GET                     /test?token=eyJ...   123                  374                   101                    1538478691331000000    4636b...
> 84.56.21...  test.com              GET                     /test?token=eyJ...   123                  380                   101                    1538478527358000000    4636b...

(s) - string, (f) - float, and so on. Note that this data-set is altered and does not display actual information from the respective Cloudflare tenant.

Select columns

gota
// this will select the 3 columns in gota
fmt.Println(gdf.Select([]string{"ClientIP",
	"EdgeStartTimestamp",
	"ClientRequestHost"}))

I cannot say that I like this syntax.

qframe
// this will select the 3 columns in qframe
fmt.Println(qdf.Select("ClientIP",
	"EdgeStartTimestamp",
	"ClientRequestHost"))

ClientIP(s)  EdgeStartTimestamp(i)  ClientRequestHost(s)
-----------  ---------------------  --------------------
212.202…     1538478392686000128    test.com
109.1.2.3    1538478594536000000    test.com
2a02:810…    1538478691331000064    test.com
84.56.21…    1538478527358000128    test.com

Easier.

  • An IP – (s)
  • a timestamp – (i), Epoch in nanoseconds (we requested timestamps=unixnano)
  • and a hostname – (s), FQDN.

But… where does the (i) come from? It was a Float!!11 – Well, who cares. It's Go.

Type conversions

A closer look at Go’s minimalistic type system is out of scope for this little post, given that it is not relevant at the scale of a typical data-analysis project (prediction, basic models, some statistics). No pun intended.

EdgeStartTimestamp originally is a Float and we want it to be represented as an Integer within our qframe object. We can apply this in-place:

qdf = qdf.Apply(qframe.Instruction{Fn: function.IntF, 
          DstCol: "EdgeStartTimestamp", 
          SrcCol1: "EdgeStartTimestamp" })

This uses a qframe-specific syntax. function. refers to qframe's collection of built-in conversion functions. In this case the IntF function converts a Float column to an Integer.

Time-math in Go and adding a column to a qframe

In order to apply our time-math (a simple Epoch conversion to a human-readable format), we define a Go function literal (or a lambda expression).

timestamp_converter := func(x int) *string {
	result := time.Unix(int64(x)/1000000000, 0).String() // nanoseconds -> seconds
	return &result
}

x here is just the function's parameter. The function literal itself is what Go calls a closure: it can capture variables from the enclosing scope. It looks intuitive enough here.

Applying the timestamp_converter closure over the time-series within qdf is straightforward. We return the result into qf2. We could also replace the object.

qf2 := qdf.Apply(qframe.Instruction{Fn: timestamp_converter, DstCol: "New_Column", SrcCol1: "EdgeStartTimestamp"})
fmt.Println(qf2.Select("New_Column"))

This enables us to perform data-enrichment tasks and to keep the columns and rows of the time-series. For tasks like log-analysis this is very useful.

Iterate over a column and convert Epoch timestamps to human-readable strings

Another option, and a chance to exemplify some data-science-specific Go code: we export a column into a slice. A slice in Go is a sequence of typed data.

	view := qdf.MustStringView("EdgeStartTimestamp")
	var j = 0                        // just to limit the output
	presult := make([]string, 0, 10) // empty slice with capacity 10
	for i := 0; i < view.Len(); i++ {
		j++

		item := view.ItemAt(i) // *string; may be nil for null values
		if item == nil {
			continue
		}
		if citem, err := strconv.Atoi(*item); err == nil {
			presult = append(presult,
				time.Unix(int64(citem)/1000000000, 0).String())
		}

		if j >= 10 {
			break
		}
	}

This code is not elegant, but it does the job.

Reading it from the inside out: presult is a slice of type string. We simply append the converted timestamps (from the Integer Epoch values) to it on each iteration. Due to the type conversion in Go we must do some error handling.

view is a typed view object from the qframe library that gives us individual row access for a selected column, and item points to the value at row i. j is just a counter variable without special meaning; it helps to trigger the break within the for loop.

Now let’s print the slice:

for i := range presult {
	fmt.Println(presult[i])
}

2018-10-02 13:06:32 +0200 CEST
2018-10-02 13:09:54 +0200 CEST
2018-10-02 13:11:31 +0200 CEST
2018-10-02 13:08:47 +0200 CEST
2018-10-02 13:09:25 +0200 CEST
2018-10-02 13:08:22 +0200 CEST
2018-10-02 12:47:21 +0200 CEST
2018-10-02 12:59:32 +0200 CEST
2018-10-02 12:32:22 +0200 CEST
2018-10-02 13:12:28 +0200 CEST

10 elements, counted by j, converted from the qdf column EdgeStartTimestamp via the view.

Package a Go binary into a Docker container

Assuming we build it from the $GOPATH and on a Linux system:

CGO_ENABLED=0 GOOS=linux \
go build -a -tags netgo -ldflags '-w' cloudflare_logs.go

And the Dockerfile for docker build .

FROM google/debian:stretch
ADD cloudflare_logs cloudflare_logs
ENV PORT 80
EXPOSE 80
ENTRYPOINT ["/cloudflare_logs"]

There are minimal examples with scratch, which may work if you add the proper CA certificates in a multi-stage build (a sketch follows below). This here favors simplicity over minimalism.
The point is that we do not need to add a couple of libraries via conda. The footprint (and the complexity) is already much easier to control.
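For completeness, a multi-stage sketch of the scratch variant mentioned above. The image tags and paths are assumptions, and dependency fetching is omitted (assume vendored packages):

# Stage 1: build a static binary.
FROM golang:1.10 AS build
WORKDIR /go/src/app
COPY cloudflare_logs.go .
RUN CGO_ENABLED=0 GOOS=linux go build -a -tags netgo -ldflags '-w' -o /cloudflare_logs cloudflare_logs.go

# Stage 2: minimal runtime image, plus CA certificates for TLS.
FROM scratch
COPY --from=build /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
COPY --from=build /cloudflare_logs /cloudflare_logs
ENTRYPOINT ["/cloudflare_logs"]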

Further reading

It is possible to serialize the dataframe object into an SQL database. This comes in handy for data-visualization, if we use Apache Superset {8}.

We prepare our local build environment to use SQLite 3, because we don't want to set up a DBMS for this ad-hoc task:

go get github.com/mattn/go-sqlite3

The qframe examples feature an in-memory SQLite DB.

	db, _ := sql.Open("sqlite3", ":memory:")
	tx, _ := db.Begin()
	// Write the QFrame to the database.
	qdf.ToSQL(tx,
		// Write only to the test table
		qsql.Table("test"),
		// Explicitly set SQLite compatibility.
		qsql.SQLite(),
	)

If we want a file, we simply replace :memory: with data.db. It's worth mentioning that in-memory DBs can be efficient if we don't need an export to a file.
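A file-backed sketch of the same flow. Note the tx.Commit(): without it the rows never reach the file (qsql is assumed to be the import alias for qframe's SQL config package, as in the snippet above):

// Write the QFrame to data.db on disk.
db, err := sql.Open("sqlite3", "data.db")
if err != nil {
	log.Fatal(err)
}
defer db.Close()

tx, err := db.Begin()
if err != nil {
	log.Fatal(err)
}
qdf.ToSQL(tx,
	qsql.Table("test"),
	qsql.SQLite(),
)
if err := tx.Commit(); err != nil { // persist the transaction to data.db
	log.Fatal(err)
}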

Next we start the Superset server locally within our Miniconda environment (here Python 2, don't mind this):

➜  awesomeProject source ~/miniconda2/bin/activate
(base) ➜  awesomeProject superset runserver -d -p 16666

The respective connection string is: sqlite:////Users/marius.ciepluch/go-workspace/src/awesomeProject/data.db. Yours might differ.


And the rest is point & click to some visualizations, like with Tableau or other “Business Intelligence” tools. I'll probably focus on using Superset with Druid {9} or ClickHouse {10} in another post.

References

{1} Pandas introduces dataframes to Python 2 and 3.

{2} R data-frame documentation - for reference, because R started this trend

{3} Calling Go functions from Other Languages (Medium blog)

{5} C++ Boost has a proposed data-frame implementation from GSoC 2017

{6} Go and streaming data from APIs is explained here, with some remarks to memory management

{7} Go’s string functions: replace

{8} Apache Superset is a modern web-app to visualize data

{9} Apache Druid is a free OpenSource analytics data-store for event-driven data

{10} Yandex ClickHouse is a free OpenSource DB for high-performance ingest scenarios