import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast
import shapeless.ops.hlist.Prepend
import shapeless.{::, HList, HNil}

object flow {
  type JoinList = HList

  case class AnnotatedDataFrame[D, J <: JoinList](toDF: DataFrame) extends Serializable
evergreen documentation

Hi again Jens.
I studied it some but have not yet used it for an application. One application I'd like to build is some form of dynamic resume.
https://planet42.github.io/Laika/03-preparing-content/03-theme-settings.html#the-helium-theme
Here you mention the possibility of using Bootstrap-based themes: do you have an example of this kind, please?
I tried to write my thoughts down in a "one page proposal" style.
I have succeeded only moderately.
Please advise whether this makes sense.
In my work designing big data management and analytics products I often make the case that "knowledge science" has to come before "data science".
Unless the meaning of the data is under governance, the numbers produced by the data/ML analyses will not be as useful.
Instead, semantic data governance enables:
* better use of the raw data from both business and engineering points of view
iptables -L -nv --line-numbers
```
Chain INPUT (policy DROP 0 packets, 0 bytes)
num   pkts bytes target         prot opt in  out source     destination
1       12   792 ICMP-flood     icmp --  *   *   0.0.0.0/0  0.0.0.0/0
2       10   400 DROP           all  --  *   *   0.0.0.0/0  0.0.0.0/0  ctstate INVALID
3      953  519K ACCEPT         all  --  *   *   0.0.0.0/0  0.0.0.0/0  ctstate RELATED,ESTABLISHED
4      204  9472 AUTO_WHITELIST tcp  --  *   *   0.0.0.0/0  0.0.0.0/0  tcp flags:0x17/0x02
5       13  1322 AUTO_WHITELIST udp  --  *   *   0.0.0.0/0  0.0.0.0/0
```
"Interpretable Machine Learning with XGBoost" https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27 | |
"Interpreting complex models with SHAP values" https://medium.com/@gabrieltseng/interpreting-complex-models-with-shap-values-1c187db6ec83 | |
"Interpreting your deep learning model by SHAP" https://towardsdatascience.com/interpreting-your-deep-learning-model-by-shap-e69be2b47893 | |
"SHAP for explainable machine learning" https://meichenlu.com/2018-11-10-SHAP-explainable-machine-learning/ | |
"Detecting Bias with SHAP - What do Developer Salaries Tell us about the Gender Pay Gap?" https://databricks.com/blog/2019/06/17/detecting-bias-with-shap.html | |
https://github.com/slundberg/shap |
# Cross language/framework/platform data fabric

## Requirements / Goals

1. #DataSchema abstracts over data types from simple tabular ("data frame") to multi-dimensional tensors/arrays, graphs, etc. (see HDF5)
2. #DataSchema specifiable through a functional / declarative language (like Kotlingrad + Petastorm/Unischema)
3. #DataSchema with bindings to languages (Scala, Python) and frameworks (Parquet, ApacheHudi, TensorFlow, ApacheSpark, PyTorch)
4. #DataSchema to define both the in-memory #DataFabric and the schema for data at rest (Parquet, ApacheHudi, Petastorm, etc.)
5. Runtime derived from the "shared runtime" paradigm of #ApacheArrow (no conversions, zero-copy, JVM off-heap)
6. Runtime treats IO/persistence as a separate effect (abstracted away from algo/application logic)
package io.yields.common.meta

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

import scala.annotation._
import scala.meta._

/**
# https://arrow.apache.org/docs/python/memory.html
# https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html
# https://arrow.apache.org/docs/python/ipc.html
# https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_io.py
# https://github.com/apache/arrow/blob/master/python/pyarrow/serialization.py
# https://jakevdp.github.io/PythonDataScienceHandbook/02.09-structured-data-numpy.html
# https://stackoverflow.com/questions/46837472/converting-pandas-dataframe-to-structured-arrays

import pyarrow as pa
import pandas as pd
# #resource
# https://docs.scipy.org/doc/numpy-1.14.0/user/basics.rec.html

conda install -c conda-forge traits=4.6.0
traits: 4.6.0-py36_1 conda-forge

import numpy as np
from traits.api import Array, Tuple, List, String
from traitschema import Schema
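The numpy reference linked above covers structured arrays, the record-style dtype that traitschema-like tools build on. A small sketch of the idea (using only numpy, since the traitschema import above is not exercised in this fragment):

```python
import numpy as np

# A structured dtype names and types each field, like a lightweight schema.
dt = np.dtype([("name", "U10"), ("weight", "f8"), ("position", "i4")])

# Records are tuples matching the field order of the dtype.
people = np.array(
    [("alice", 61.5, 1), ("bob", 72.0, 2)],
    dtype=dt,
)

# Fields are accessed by name and come back as ordinary ndarrays.
weights = people["weight"]
```

Because each field is a contiguous typed view, vectorized operations like `weights.mean()` work directly on it without unpacking the records.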
#! /bin/bash

# Root backup directories (sources, locals, destinations and mount points) for backups executed on this machine

# Root of backups executed on this machine (local copies of $BCKP_DIR for all the backups)
export BCKP_DIRS=/data/bckp_dirs

# Root of backup source directories for data from other machines (see $BCKP_SRC)
export BCKP_SRCS=/mnt/backups/bckp_srcs

# Root of backup remote destination directories (remote copies of $BCKP_DIR for all the backups)