BLOG

Get the Latest News and Press Releases

Quantitative Trading 101

Summary

  • Anyone can learn quantitative trading. You don’t need a PhD in Quantum Astrophysics to create quantitative trading systems or perform quantitative research.
  • The process of identifying a suitable trading strategy is identical to the scientific method: It requires creating hypotheses and making assumptions based on data to identify a statistical edge.
  • Quantitative research (data mining, hypothesis testing…) always precedes backtesting trading strategies.

As a trading enthusiast, I have always wondered whether the best quant traders possessed predetermined trading strategies that they could use to consistently generate superior returns. I thought trading was as straightforward as solving an equation and using the solution to generate market-beating returns. After doing some research and chatting with a few pro quant traders, I started familiarizing myself with quantitative analysis techniques to get a better understanding of the entire quantitative research process.

Let’s look at what the quantitative research process looks like.

We’ll be analyzing the stock of one of the most popular companies in the world: Apple (ticker: $AAPL).

STEP 1: Analyze the distribution of daily returns

Note: We will be using the research environment provided by QuantConnect to perform our research.

We start off our analysis by plotting the distribution of AAPL returns over the past 5000 days.

Figure 1–1: Histogram of the distribution of Apple’s daily returns

The next step would be to compare this distribution to a normal distribution. (A lot of models used in quantitative finance and statistics assume a normal or lognormal distribution)
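As a rough sketch of what sits behind Figure 1–1, daily returns are the percentage change between consecutive closing prices, and a histogram counts how many returns fall into each equal-width bin. The snippet below uses plain Python and a handful of made-up prices; the function names and the bin count are illustrative assumptions, not QuantConnect API calls.

```python
def daily_returns(closes):
    """Simple returns between consecutive closing prices."""
    return [curr / prev - 1.0 for prev, curr in zip(closes, closes[1:])]


def histogram(values, bins=10):
    """Count values in `bins` equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts


# Made-up closing prices, for illustration only.
closes = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]
returns = daily_returns(closes)
print(histogram(returns, bins=4))
```

With real data, `closes` would hold the roughly 5000 daily closing prices mentioned above.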

STEP 2: Compare figure 1–1 to the normal distribution

Let’s generate some random data to plot the normal distribution.

Figure 1–2: Histogram of the normal distribution (obtained by generating random data)

Now that we have plotted both distributions, let’s put them in one plot for comparison purposes.

Figure 1–3: AAPL distribution returns vs Random normal distribution

Comparison Summary

  • The distribution of Apple stock daily returns resembles the normal distribution.
  • The distribution of Apple stock has “heavier tails”. In layman’s terms, we can expect outsized moves to the upside and downside, more so than a normal distribution would suggest.
  • Statisticians often use the “kurtosis” of a distribution as a statistical measure to identify whether the tails of a given distribution contain extreme values.
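To make the kurtosis point concrete, here is a minimal sketch in plain Python. It computes the population excess kurtosis (the fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores about 0) for a simulated normal sample and for a crude heavy-tailed stand-in. Both samples are synthetic and the mixture parameters are assumptions chosen only for illustration; they are not AAPL data.

```python
import random


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (about 0 for normal data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


rng = random.Random(42)  # fixed seed so the run is repeatable
normal_sample = [rng.gauss(0.0, 0.01) for _ in range(5000)]
# Crude heavy-tailed stand-in: mostly calm days plus occasional large moves.
heavy_sample = [rng.gauss(0.0, 0.03 if rng.random() < 0.05 else 0.01)
                for _ in range(5000)]

print(round(excess_kurtosis(normal_sample), 2))  # close to 0
print(round(excess_kurtosis(heavy_sample), 2))   # clearly positive
```

A clearly positive value is the numerical counterpart of the “heavier tails” visible in the histogram comparison.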

STEP 3: Investigate patterns in our data (the fun part)

Jim Simons, arguably one of the most successful quant traders of all time, once said: “We search through historical data looking for anomalous patterns that we would not expect to occur at random.”

Let’s follow Jim’s advice and explore Apple’s historical data to see if we can uncover some interesting patterns. Let’s look at the hourly resolution data (typically hard to find for free but easily accessible through the QuantConnect platform).

Figure 1–4: AAPL hourly returns

Figure 1–4: Boxplot of AAPL hourly returns

It looks like the most substantial returns were made overnight. An interesting idea would be to “buy at market close and sell at market open” to capture overnight gains. You can investigate this phenomenon further by exploring this research paper which explains “the overnight drift” (Most gains are made in the after hours).

Let’s continue to explore the discrepancy that we previously discovered. The previous bar plot suggested that there were substantial gains made in the after hours. Let’s plot the cumulative performance of overnight returns vs intraday returns to better visualize and confirm this discrepancy.

Figure 1–5: Overnight Returns vs Intraday Returns (Apple Stock)

The plot supports our hypothesis: most of the returns over this period appear to have been realized in the after hours.
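The overnight versus intraday split can be sketched as follows, assuming each trading day is available as an (open, close) pair. The overnight return runs from the previous close to today’s open, and the intraday return runs from today’s open to today’s close. The bars below are made up, and the helper names are illustrative rather than part of any platform API.

```python
def split_returns(bars):
    """bars: list of (open, close) tuples, one per trading day, oldest first."""
    overnight, intraday = [], []
    for (_, prev_close), (open_, close) in zip(bars, bars[1:]):
        overnight.append(open_ / prev_close - 1.0)  # previous close -> today's open
        intraday.append(close / open_ - 1.0)        # today's open -> today's close
    return overnight, intraday


def cumulative(returns):
    """Compound a list of simple returns into one cumulative return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


# Made-up bars, for illustration only.
bars = [(100.0, 101.0), (102.0, 101.5), (103.0, 103.5)]
overnight, intraday = split_returns(bars)
print(cumulative(overnight), cumulative(intraday))
```

Plotting the running value of each compounded series over time gives the kind of comparison shown in Figure 1–5.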

STEP 4: Look for potential autocorrelations/trends

Autocorrelation is a mathematical representation of the degree of similarity between a time series and a lagged version of itself over successive time intervals. In simpler terms, it describes how the present value of a series is related to its past values. You can learn more about autocorrelations here.

The goal of the quantitative analyst is to look for possible trends within the dataset. This can be accomplished by analyzing the Autocorrelation function plot (ACF plot).

Figure 1–6: ACF plot

It looks like there are no significantly correlated lags (we are basically looking for autocorrelations that lie outside the red band).
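For readers who want to see what an ACF plot computes, here is a minimal sketch in plain Python. It assumes the red band is the usual approximate 95% confidence band of ±1.96/√N around zero; the article does not state how the band was drawn, so treat that as an assumption. The alternating series is synthetic and chosen so that the result is easy to check by hand.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


def significant_lags(series, max_lag):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    band = 1.96 / math.sqrt(len(series))
    return [lag for lag in range(1, max_lag + 1)
            if abs(autocorrelation(series, lag)) > band]


# An alternating series is strongly negatively autocorrelated at lag 1.
series = [1.0, -1.0] * 50
print(autocorrelation(series, 1))
print(significant_lags(series, 3))
```

For a series of real daily returns, an empty list from `significant_lags` corresponds to the “no significantly correlated lags” reading above.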

If the first lag on the graph lay outside the red band on the negative side, for instance, we would conclude that there is a significant negative autocorrelation at lag 1 (on the x-axis of the ACF plot). Once you have that information, you could potentially investigate the relationship between that lag and the stock’s annual volatility.

Fig 1–7: Rolling lag 1 vs Annual Volatility

There seems to be a negative correlation between the volatility of AAPL and its lag 1 autocorrelation. Furthermore, we can visualize how that relationship held up over the past 5 years.

Figure 1–8: Annual volatility vs Rolling lag 1 (from 2015 to now)

As expected, there is no clear consistent relationship between volatility and the rolling lag 1 correlation of Apple stock returns. This is how you would typically investigate time series data.
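One way to produce the two series compared in Figures 1–7 and 1–8 is to slide a window over the daily returns and compute, for each window, the lag-1 autocorrelation and the annualized volatility. The sketch below is plain Python; the 252-day window and the √252 annualization factor are common conventions that are assumed here, since the article does not state which settings were used.

```python
import math
import statistics


def lag1_autocorr(window):
    """Lag-1 sample autocorrelation; 0.0 if the window has no variance."""
    mean = statistics.fmean(window)
    denom = sum((x - mean) ** 2 for x in window)
    if denom == 0:
        return 0.0
    numer = sum((a - mean) * (b - mean) for a, b in zip(window[1:], window))
    return numer / denom


def rolling_stats(returns, window=252, periods_per_year=252):
    """Return a list of (lag-1 autocorrelation, annualized volatility) pairs."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        vol = statistics.pstdev(chunk) * math.sqrt(periods_per_year)
        out.append((lag1_autocorr(chunk), vol))
    return out


# Synthetic alternating returns, for illustration only.
returns = [0.01, -0.01] * 10
for acf1, vol in rolling_stats(returns, window=4)[:3]:
    print(round(acf1, 2), round(vol, 4))
```

Plotting the two columns against each other, or against time, is then a matter of feeding them to whichever charting tool is at hand.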

Conclusion

  • The noise-to-signal ratio is extremely high in quantitative analysis. Clean and consistent patterns are usually very subtle and can quickly vanish.
  • Based on our research, a substantial amount of Apple stock returns is made in the after hours.

Originally posted here

Understanding of Artificial Neural Networks

Introduction

Artificial neural networks are based on a collection of connected nodes and are designed to identify patterns. They are part of deep learning, in which computer systems learn to recognize patterns and perform tasks by analyzing training examples. For example, an object recognition system can be fed thousands of labeled images of houses, cars, traffic signals, animals, etc., and learn to recognize visual patterns in the images that consistently correlate with the defined labels. These networks are inspired by the biological neural networks of our brain. A neural network is modeled loosely on the human brain and can consist of millions of simple processing nodes, called perceptrons, which are densely interconnected. An individual node may be connected to several nodes in the layer beneath it, from which it receives data, and to several nodes in the layer above it, to which it sends data. Each node can take multiple inputs, process them, and transmit the output to the neurons in the next layer. The connections are also called edges. Nodes and edges have weights, which adjust the strength of the signal at a connection. When the network is active, the node receives a different number (signal) over each of its connections and multiplies it by the associated weight. The output (aggregate signal) of each node is calculated by a non-linear function of the sum of its weighted inputs. If the output is below a threshold value, the node does not pass data to the next layer. If the output exceeds the threshold value, the node passes the data along all of its outgoing connections to the next layer.

Initially, when a neural network is trained, its weights and thresholds are set to random values. Training data is fed to the input layer and passes through the succeeding hidden layers, where it is multiplied and transformed at each node and added together in complex ways until it reaches the output layer. There, the final predicted output is compared with the expected output and the error is calculated. During training, the thresholds and weights at each node are continuously adjusted until the training data consistently yields the expected outputs. Today, neural network algorithms are emerging as an artificial intelligence technique that can be applied to real-time problems.
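The training loop described above can be illustrated with the simplest possible case: a single perceptron trained with the classic perceptron learning rule. This is a deliberately small sketch, not how multi-layer networks are trained in practice (those typically use backpropagation with gradient descent). The learning rate, the epoch limit, the seed, and the logical AND training set are all assumptions chosen for illustration.

```python
import random


def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum plus bias is positive; the bias acts as a negative threshold."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0


def train(samples, epochs=1000, lr=0.1, seed=0):
    """Perceptron learning rule: nudge weights and bias by the prediction error."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.5, 0.5) for _ in samples[0][0]]  # random start
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            if error:
                errors += 1
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error
        if errors == 0:  # every training example now yields the expected output
            break
    return weights, bias


# Logical AND as a tiny, linearly separable training set.
and_gate = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(and_gate)
print([predict(weights, bias, x) for x, _ in and_gate])
```

The loop stops as soon as a full pass over the training data produces no errors, which mirrors the “adjust until the expected outputs are produced” description.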

Neural Network Architecture

A neural network is composed of an input layer (the leftmost layer), whose neurons are called input neurons, and an output layer (the rightmost layer), which consists of output neurons. In the figure below, the output layer consists of a single neuron. The middle layers are called hidden layers; the figure below has two of them. Neural networks consisting of multiple layers are also called multilayer perceptrons, or MLPs.

Now, let’s explore how computation happens at each node. A node is loosely patterned on a neuron of the human brain. It combines the input data with a set of weights, or coefficients, which either amplify or suppress that input depending on the task the algorithm is trying to handle. The sum of the products of inputs and weights is passed through the node’s activation function. Comparing the resulting signal with the threshold value determines whether it is passed on, and to what extent it progresses further through the neural network to affect the ultimate outcome. If the signal passes through the node, that node is said to be activated.
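The computation at a single node can be sketched in a few lines of plain Python. The sigmoid activation and the 0.5 threshold are assumptions made for illustration; the text does not commit to a particular activation function or threshold value.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, bias=0.0, threshold=0.5):
    """Return (activation, fired) for one node."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias  # weighted sum of inputs
    activation = sigmoid(z)                                 # non-linear activation
    return activation, activation > threshold               # compare to the threshold


activation, fired = node_output([1.0, 2.0], [0.5, -0.25])
print(activation, fired)
```

Here the positive weight amplifies the first input while the negative weight suppresses the second, so the two contributions cancel and the node stays inactive.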

A layer is a row of such nodes or neurons, which act like switches that turn on or off as the input is passed through the neural network. Each layer’s output is the input to the subsequent layer. Pairing the adjustable weights with the input features assigns significance to those features with regard to how the neural network classifies and clusters input. Below is the framework of artificial neural networks (ANN) –
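The idea that each layer’s output feeds the next layer can be shown with a small forward pass. This sketch represents a network as a list of layers, each layer being a list of (weights, bias) pairs, one per node. The ReLU activation and the hand-picked weights are assumptions for illustration only.

```python
def relu(z):
    """Pass positive signals through and switch negative ones off."""
    return max(0.0, z)


def forward(layers, inputs):
    """layers: list of layers; each layer is a list of (weights, bias) per node."""
    signal = inputs
    for layer in layers:
        # Each layer's output becomes the next layer's input.
        signal = [relu(sum(w * x for w, x in zip(weights, signal)) + bias)
                  for weights, bias in layer]
    return signal


# Two inputs -> hidden layer of two nodes -> one output node (made-up weights).
network = [
    [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0)],  # hidden layer
    [([1.0, 1.0], 0.0)],                      # output layer
]
print(forward(network, [2.0, 3.0]))  # [5.0]
```

With a negative first input, the first hidden node switches off and only the second input reaches the output.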

Types of Neural Networks

There are multiple types of neural networks, which use different principles to determine their own rules and learn patterns. Each has its own unique strengths.

Above figure depicts various types of neural networks.

Summary –

Neural networks represent a very powerful AI technique, as they start with a blank slate and find their way to a precise model. Neural networks are effective but complex in their approach to modeling, as they can’t make assumptions on financial dependencies between input and output. The best part of neural networks is that they are designed in a way that resembles the biological neurons in a human brain. Hence, they are designed to learn faster and identify complex patterns much more accurately in huge volumes of data, and their performance improves with more data and usage. Neural networks are therefore the fundamental framework on which critical artificial intelligence (AI) systems are built.

 

Anyone interested in emotional currency?

I’m looking to start a startup that involves turning emotions into a form of currency. It has to do with mental health, big data, and machine learning. I would like to know if anyone is interested in hearing my abstract concepts or in being part of the team.

submitted by /u/chesterbryce

Insider Secrets From a Front-Line Industrial CISO

Times of significant change often amplify challenges within organizations — especially when it comes to blending business-critical operational technology (OT) with information technology (IT).

During the third installment of Recorded Future’s executive dialogue series, Recorded Future’s chief operating officer, Stu Solomon, joined Satish Gannu, chief security officer and chief technology officer digital at ABB, to discuss the fast-converging worlds of IT and OT, along with new remote work challenges, and explore why OT and IT leaders must band together to tackle growing cybersecurity risks.

Don’t miss their insightful discussion on demand here, and read on for some of our favorite highlights from the conversation.

Setting the Record Straight on OT, IT, and IIoT

Even within the cybersecurity community, many people are fuzzy on how OT and IT intersect. Then there’s this whole other beast: the industrial internet of things (IIoT). Where do they all converge?

“OT can be broadly defined as the manufacturing and delivery of things, whether they are objects or critical services like power generation or water transportation,” explains Gannu. “Many OT systems have been around for years, even decades, and as the world evolves, they are becoming interconnected and getting IP addresses in the process.”

These OT systems are rapidly converging with IT, bringing the promise of improved efficiency and new business models enabled by the industrial internet of things (IIoT). However, the promise of greater connectivity comes with greater risk.

Comparing Apples to Oranges?

OT has evolved at a much slower rate than IT, and for good reason. Gannu says, “When you buy equipment to set up a cement plant, it runs nonstop for seven years. Some transformers are even built to last 40.” In the world of OT, availability is king, and keeping systems up and running to avoid devastating power outages or water shortages is critical. “If something is running, that means the plant is operating, and you don’t touch it,” he says. This also means that change across OT systems, processes, and operators happens gradually, in stark comparison to the dynamic, fast-paced world of IT. You can easily patch a software bug. It’s something entirely different to upgrade an OT system.

Separating IT and OT Is No Longer an Option

Due to the critical nature of OT systems, IT and OT have historically been kept apart to minimize risk. Additionally, there’s often a level of mistrust for IT on the OT side, which hinders collaboration and convergence. Gannu says this needs to change.

“The 2010 Stuxnet attack showed us that isolation doesn’t work. There are different mechanisms threat actors can apply to attack critical systems,” explains Gannu. Just because systems are separated doesn’t mean a malicious insider can’t walk in with a USB stick to launch an attack, for example.

Applying IT Learnings to Tackle OT Challenges

With 30-plus years of leadership experience in both IT and OT, Gannu has led large-scale convergence initiatives across processes, data, and physical systems.

During the discussion, he describes his approach for effectively communicating with line-of-business owners, helping them understand how to apply learnings from IT to address OT challenges. For example, anomaly detection techniques are widely used in enterprise security programs to establish a baseline of traffic and understand what’s normal, and what’s not. From micro-segmentation to behavior analytics, there are many examples like this. “Just because OT is a different environment, doesn’t mean we should forget the basic blocking and tackling that makes a good security program effective in the first place.”

This is particularly true in the wake of COVID-19, as many employees continue to work from home. There’s much to be learned from IT about rapidly scaling infrastructure and establishing new policies to enable remote access to critical systems.

The Critical Role of the CISO

More than ever, this pervasive question urgently demands an answer: Who is responsible for mitigating cybersecurity risk across OT systems, as they become increasingly connected as part of IIoT?

Organizations are increasingly focusing on security as a key business enabler. While OT and IT leaders across the business are collaboratively addressing evolving challenges and threats, CISOs are uniquely positioned to lead because they understand the context of what they’re seeing, especially when it comes to anomalies. And, Gannu notes, since most of their careers follow the IT path, “Who knows better than the CISO?”

Collaborating and Sharing Information to Evaluate Risk

It’s clear that IT and OT network environments will continue to merge, so how can organizations build the right security monitoring and detection controls to effectively manage them?

Gannu urges CISOs to focus on the fundamentals first. Discover what you’re actually responsible for — where IoT-connected devices exist across OT and IT environments. As you discover assets, you’ll also discover vulnerabilities and want to patch them. Resist this urge, Gannu says. “Before patching, do a vulnerability analysis to see how bad things are. Don’t just patch, patch, patch.” It’s important to understand the risk of the vulnerability in relation to the organization so you can prioritize efforts and keep critical systems up and running, he notes.

Gannu also encourages information sharing and industry-wide collaboration. “The best way to learn is through your peers,” he says. In security, while the impact always feels personal, the reality is, someone’s probably been there before. “Bodies such as the OT Cybersecurity Alliance are helping to build a community of knowledge, guidance, and resources to help organizations mitigate cyber risk in the digital world.”

Watch this executive dialogue now to dig further into these topics.

The post Insider Secrets From a Front-Line Industrial CISO appeared first on Recorded Future.

FER0887780

The complainant has requested information relating to a Land Stability Site Management Plan (detailed plan) submitted to Hastings Borough Council (the council) by a local caravan park (the site). Whilst the council provided some information to the complainant, both at the internal review stage, and during the course of the Commissioner’s investigation, it advised that the remainder of the information was either not held, or was exempt from disclosure under regulation 12(5)(e) and 12(5)(b) of the EIR. The Commissioner’s decision is that the council is entitled to rely on regulation 12(5)(e) in respect of all that information which has been withheld in response to the request. In addition, the Commissioner is satisfied that, on the balance of probabilities, the council was correct when it advised the complainant that it did not hold part of the information that he had requested. However, the Commissioner has found that the council has breached regulation 14(2) of the EIR by failing to issue a refusal notice to the complainant within 20 working days. It has also breached regulation 14(3) by failing to cite regulation 12(4)(a) where no recorded information was held. Furthermore, where the council did provide information in response to part of the request, it failed to do so within the prescribed time period and has therefore also breached regulation 5(2) of the EIR. The Commissioner does not require the council to take any steps as a result of this decision notice.

FER0887781

The complainant has requested information held by Hastings Borough Council (the council) relating to proposals for the erection of signage and fencing on a local caravan park site (the site). Whilst the council provided some information to the complainant, both at the internal review stage, and during the course of the Commissioner’s investigation, it advised that the remainder of the information was either not held, or was exempt from disclosure under regulation 12(5)(e) and 12(5)(b) of the EIR. The Commissioner’s decision is that the council is entitled to rely on regulation 12(5)(e) in respect of all that information which has been withheld in response to the request. In addition, the Commissioner is satisfied that, on the balance of probabilities, the council was correct when it advised the complainant that it did not hold part of the information that he had requested. However, the Commissioner has found that the council has breached regulation 14(2) of the EIR by failing to issue a refusal notice to the complainant within 20 working days. It has also breached regulation 14(3) by failing to cite regulation 12(4)(a) where no recorded information was held. Furthermore, where the council did provide information in response to part of the request, it failed to do so within the prescribed time period and has therefore also breached regulation 5(2) of the EIR. The Commissioner does not require the council to take any steps as a result of this decision notice.

Architectures Every Data Scientist And Big Data Engineer Should Know

Comprehensive and Comparative List of Feature Store Architectures for Data Scientists and Big Data Professionals

Introduction & Motivation – Why Feature Store

Feature stores have become an important part of organizations developing predictive services across industry domains. Some of the earlier challenges in deploying ML solutions at scale involved:

  • Developing and maintaining customized systems by individual teams with little or no coordination.
  • No collaborative system for sharing features among similar ML models (models from a similar domain, or models addressing the same business use cases or customer domains).
  • Increased cognitive burden without the proper scope of scalability.
  • Limited integration with big-data ecosystems.
  • Limited scope for model retraining, comparison, model governance, and traceability, limiting the agile development life-cycle.
  • Difficulty tracking and retraining models that exhibit seasonality.

To overcome the above limitations, architects, data scientists, and big data and analytics professionals have felt the need to come together under one roof, with one unified framework that facilitates easier collaboration and sharing of data, results, and reports.

Departments, teams, and organizations shared some similar notions of feature engineering:

  • Feature Engineering is expensive and amortization happens over time and across models.
  • The increase in cost is non-linear/exponential with the increase in the number of features.
  • The number of triggers/alerts caused by the addition/removal of features is high.
  • Dependencies are often not documented or tracked, which results in more implicit and explicit dependencies being added over time.

With these shared views, it became easier to come together and create a unified framework called the Feature Store. It speeds up the ML model deployment life-cycle and supports the creation of proper documentation, the required version analysis, and model performance tracking, saving time and effort.

In this blog, we highlight the features supported by different Feature Store frameworks, which were primarily developed by leading industry giants.

Advantages of Feature Store

  • Ability to re-use and discover features between teams across the organization.
  • Governance of features through capabilities like access control and versioning.
  • Ability to precompute and automatically backfill features — including online computation and offline aggregation
  • Helping to create a collaborative environment between data scientists and big data engineers
  • Save effort and cost by sharing not only features but also related artifacts, documents, marketing insights of models developed from these features.
  • Enable consistency between training and serving.

Michelangelo from Uber

Michelangelo is a framework developed by Uber that allows feature integration/joining in both offline and online pipelines. Here, Hive (offline) and Cassandra (online) act as the main storage units for raw/transformed features. It provides a horizontally scalable multi-tenant architecture for multiple models with suitable scaling and monitoring. Training jobs can be configured and managed through a web UI or through an API, via Jupyter notebook.

It further provides options to define hierarchical partitioning schema to train models per partition, that can be deployed as a single logical model. This provides easy bootstrapping and helps to overcome challenges when several models need to be trained based on the hierarchical structure of the data.

At runtime during serving, it finds the route to the best model for each node. Further, it is best known for its ability to support continuous learning, its integration with AutoML, and its support for distributed deep learning.
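The routing idea for partitioned models can be sketched as a lookup that prefers the most specific partition and falls back to a parent when no dedicated model exists. This is a hypothetical illustration in plain Python, not Michelangelo’s actual API; the function name, the tuple-based partition paths, and the fallback-to-parent rule are all assumptions.

```python
def route_to_model(models, path):
    """Return the model registered for the most specific matching partition.

    models: dict mapping a partition path (tuple) to a model object.
    path:   partition of the incoming request, most general level first.
    """
    for depth in range(len(path), -1, -1):
        key = tuple(path[:depth])
        if key in models:
            return models[key]  # fall back to a parent when no child model exists
    raise KeyError(f"no model covers partition {path!r}")


# Hypothetical hierarchy: global -> country -> city.
models = {
    (): "global-model",
    ("US",): "us-model",
    ("US", "SF"): "sf-model",
}
print(route_to_model(models, ("US", "SF")))     # sf-model
print(route_to_model(models, ("US", "NYC")))    # us-model
print(route_to_model(models, ("FR", "Paris")))  # global-model
```

To a caller, the whole dictionary behaves like a single logical model, which is the property the hierarchical partitioning description emphasizes.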

Feast Feature Store

Feast, released by Gojek in collaboration with Google Cloud, is primarily built around Google Cloud services: BigQuery (offline), Bigtable (online), and Redis (low-latency), with Apache Beam for feature engineering. It allows a clear separation between big data and model development. This online predictive service allows feature sharing among teams with strong consistency between model training and serving.

Further, Feast comes with centralized feature management, discovery, feature validation, and feature aggregation. The feature columns reside inside wide-entity tables. In addition, the composite entities separate individual features.

Wix Feature Store

Wix provides a platform for feature-sharing across different ML models for both batch and real-time datasets. It supports a pre-configured set of feature families at the site and user level for both training and serving models. The different stages of data management, model training, and deployment are marked and shown in the figure above. It further uses S3 to store real-time extracted features.

FeatureStore from Comcast

The Feature Store developed by Comcast helps data scientists reuse versioned features, upload online (real-time)/streaming data, and review feature metrics by model. The product is available as multiple pluggable feature store components. The built-in model repository contains artifacts related to data pre-processing (normalization, scaling) and displays the required mapping to the features needed to execute the model. Further, the architecture is built using Spark on Alluxio (an open-source data orchestration layer that brings data close to compute for big data and AI/ML workloads in the cloud), S3, HDFS, RDBMS, Kafka, and Kinesis. Model deployment with Kubeflow helps build resilient, highly available distributed systems with support for rate-limiting, shadow deployments, and auto-scaling.

The integration with the data lake through suitable APIs helps data scientists use SQL to create training/validation/test datasets that can be versioned and integrated into the full model pipeline. In addition, the framework supports Seldon inference graphs for A/B testing, ensembles, multi-armed bandits, and custom combinations. The end-to-end system not only provides traceability across use cases, models, features, model-to-feature mappings, versioned datasets, model training codebases, model deployment containers, and prediction/outcome sinks; it is also known for its integration with the feature store, container repository, and Git to bring together data, code, and run-time artifacts for CI/CD.

Like the other architectures, it offers continuous feature aggregation on streaming data plus on-demand features. The Online Feature Store follows this sequence before giving a prediction:

  • Payload only contains Model Name & Account Number
  • Model Metadata informs which features are needed for the model
  • Pull required features by Account Number
  • Pass a full set of assembled features for model execution
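The four-step sequence above can be sketched with in-memory dictionaries standing in for the model metadata and the online feature store. Everything named here (the model, the features, the account numbers, and the toy scoring rule) is hypothetical and only illustrates the flow, not Comcast’s implementation.

```python
# Hypothetical stand-ins for the model metadata and the online store.
MODEL_METADATA = {"churn_v1": ["tenure_months", "monthly_spend"]}
ONLINE_FEATURES = {
    "acct-001": {"tenure_months": 24, "monthly_spend": 80.0, "region": "east"},
    "acct-002": {"tenure_months": 6, "monthly_spend": 80.0, "region": "west"},
}


def churn_v1(features):
    """Toy model: flag accounts with short tenure and high spend."""
    return features["tenure_months"] < 12 and features["monthly_spend"] > 50.0


MODELS = {"churn_v1": churn_v1}


def predict(payload):
    model_name = payload["model_name"]      # 1. payload carries only model name
    account = payload["account_number"]     #    and account number
    needed = MODEL_METADATA[model_name]     # 2. metadata lists the required features
    stored = ONLINE_FEATURES[account]       # 3. pull features by account number
    assembled = {name: stored[name] for name in needed}
    return MODELS[model_name](assembled)    # 4. pass assembled features to the model


print(predict({"model_name": "churn_v1", "account_number": "acct-002"}))  # True
```

Because the caller sends only two identifiers, the feature definitions stay on the server side and remain the same for every client.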

Hopsworks Feature Store

Hopsworks Enterprise Edition is a multi-tenant architecture that integrates AWS SageMaker, Databricks, Kubernetes, and Jupyter Notebook. It also supports integration with authentication frameworks like LDAP, Kerberos, and OAuth2.

The Batch / Live Streaming functionality is facilitated by Apache Beam, Apache Flink, and Apache Spark, whereas the model governance and monitoring pipeline are built using Kafka and Spark Streaming.

The architecture is composed of several building blocks namely

  • The Feature Store API – For reading/writing to/from the feature store
  • The Feature Store Registry – User-Interface to discover features
  • Feature Metadata – Documentation, Analysis and Versioning
  • Feature Engineering Job – For computation
  • Storage Layer – For feature storage
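To show how these building blocks fit together, here is a toy feature store in plain Python with a read/write API, a searchable registry holding metadata, and a dictionary as the storage layer. It is a hypothetical sketch, not the Hopsworks API, and it leaves out the feature engineering job that would compute the values being written.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureStore:
    """Toy feature store: metadata registry plus a dict-backed storage layer."""
    registry: dict = field(default_factory=dict)  # (name, version) -> description
    storage: dict = field(default_factory=dict)   # (name, version) -> {entity: value}

    def register(self, name, version, description):
        self.registry[(name, version)] = description

    def write(self, name, version, entity_id, value):
        if (name, version) not in self.registry:
            raise KeyError(f"unregistered feature {name} v{version}")
        self.storage.setdefault((name, version), {})[entity_id] = value

    def read(self, name, version, entity_id):
        return self.storage[(name, version)][entity_id]

    def search(self, keyword):
        """Discover features whose name or description mentions the keyword."""
        return sorted(key for key, desc in self.registry.items()
                      if keyword in key[0] or keyword in desc)


store = FeatureStore()
store.register("avg_basket", 1, "average basket value over 30 days")
store.write("avg_basket", 1, "user-7", 42.5)
print(store.read("avg_basket", 1, "user-7"))  # 42.5
print(store.search("basket"))                 # [('avg_basket', 1)]
```

Keying everything by (name, version) is one simple way to keep older feature definitions readable while newer versions are introduced.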

Netflix Feature Store

The feature store developed by Netflix supports both online and offline model training and development. The online micro-services enable the framework to collect the data elements required by the feature encoders in a model. It then passes this data downstream for future use by offline predictions. The fact logging service of Netflix logs user-related, video-related, and computation-specific features in a serialized format in appropriate storage units (S3).

The unique point of this architecture is the presence of components that help to:

  • Develop/Create contexts to snapshot
  • Snapshot data of various micro-services for the selected context
  • Build APIs to serve this data for a given time coordinate in the past
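Serving data “for a given time coordinate in the past” amounts to an as-of lookup: return the most recent snapshot taken at or before the requested time. The sketch below does this with sorted timestamps and binary search. The class and method names are hypothetical, and integer timestamps are used only to keep the example short.

```python
import bisect


class SnapshotStore:
    """Keeps timestamped snapshots per context and serves as-of lookups."""

    def __init__(self):
        self._times = {}   # context -> sorted list of timestamps
        self._values = {}  # context -> snapshots aligned with _times

    def snapshot(self, context, timestamp, data):
        times = self._times.setdefault(context, [])
        values = self._values.setdefault(context, [])
        idx = bisect.bisect_right(times, timestamp)  # keep timestamps sorted
        times.insert(idx, timestamp)
        values.insert(idx, data)

    def as_of(self, context, timestamp):
        """Return the snapshot that was current at `timestamp`, or None."""
        times = self._times.get(context, [])
        idx = bisect.bisect_right(times, timestamp)
        return self._values[context][idx - 1] if idx else None


snapshots = SnapshotStore()
snapshots.snapshot("member-1", 10, {"plan": "basic"})
snapshots.snapshot("member-1", 20, {"plan": "premium"})
print(snapshots.as_of("member-1", 15))  # {'plan': 'basic'}
```

Looking up the state as of a past time, rather than the current state, is what keeps offline training data consistent with what the model would have seen at prediction time.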

As snapshotting data for all contexts (e.g., all member profiles, devices, times of day) would incur overhead and cost, Netflix relies on selecting samples of contexts to snapshot periodically (at regular intervals, daily or twice daily), through different algorithms. It achieves this through Spark, by training data on different distributions, and by using stratified samples based on properties such as viewing patterns, devices, time spent on the service, region, etc.
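Stratified sampling of contexts can be sketched as follows: group the contexts by a property, then draw a fixed number from each group so that small groups are not drowned out. The text says this is done in Spark; the version below is plain Python for readability, and the equal allocation per stratum, the seed, and the made-up contexts are all assumptions.

```python
import random
from collections import defaultdict


def stratified_sample(contexts, key, per_stratum, seed=0):
    """Pick up to `per_stratum` contexts from each stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for ctx in contexts:
        strata[key(ctx)].append(ctx)
    sample = []
    for name in sorted(strata):  # sorted for a repeatable order
        members = strata[name]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Made-up contexts: (member id, device type).
contexts = [(i, "tv" if i % 3 == 0 else "mobile") for i in range(30)]
picked = stratified_sample(contexts, key=lambda c: c[1], per_stratum=2)
print(picked)
```

Only the picked contexts would then be snapshotted, which keeps the periodic snapshot job much smaller than a full pass over every context.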

Netflix embraces a fine-grained service-oriented architecture for its cloud-based deployment model.

FBLearner from Facebook

FBLearner, designed by Facebook, is a framework for AI workflows with model management and deployment. It is mainly composed of three components: FB Learner Feature Store (runs on CPU), FB Learner Flow (runs on CPU + GPU), and FB Learner Predictor (runs on CPU). It supports building all kinds of deep learning models (Caffe2, PyTorch, TensorFlow, MXNet, CNTK), and models can be stored in the ONNX format, which standardizes portability across converters, runtimes, compilers, and visualizers on different hardware/software platforms.

The above broad categories can be seen as creating logical units from hardware to application software.

  • Frameworks (FB Learner Feature Store) needed to create, migrate and train models
  • Platforms (FB Learner Flow) for model deployment and management
  • Infrastructure (FB Learner Predictor) needed to compute workloads and store data

Facebook also uses a principle to split development and deployment (production) environments.

Pinterest Feature Store

Pinterest’s big data machine learning platform is a classic example of high speed and quality in a system that is scalable, reliable, and secure. This metadata-driven framework is built using open-source technology, with individual building blocks that aid reusability. It also provides governance: enforcement and tracking.

The uniqueness of this architecture lies in capturing relationships and interactions (clicks made by users) between pins (how objects are organized into collections).

The below figure illustrates the different components in model governance and development architecture

Zipline from Airbnb

Zipline, the predictive system created by Airbnb, relies on a scoring service based on features gathered in due time and space. The scoring log (which acts as a debug/audit log) is computed and updated daily to ensure feature consistency and a single feature definition, both when training ML models and when deploying them to production. In addition, it provides data quality monitoring and feature backfilling, and makes features searchable and shareable.

The architecture integrates with data sources (Hive tables, databases, and Jitney’s event bus) alongside Apache Spark (batch) and Flink (streaming), with Lambda as the serving point. The uniqueness of this platform lies in:

  • Reduction of custom pipeline creations
  • Reducing data leaks in custom aggregations
  • Feature distribution observability
  • Improved model iteration workflow
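The point about data leaks in custom aggregations can be made concrete with a point-in-time feature. The sketch below is a hypothetical illustration, not Zipline's API: the `orders_before` function and the event records are assumptions. The same definition is used for training and serving, and it only counts events that happened before the chosen timestamp.

```python
from datetime import datetime

def orders_before(events, user_id, as_of):
    """Count a user's events strictly before `as_of`, so later events cannot leak in."""
    return sum(1 for e in events if e["user"] == user_id and e["ts"] < as_of)

# Hypothetical event log.
events = [
    {"user": "u1", "ts": datetime(2020, 7, 1)},
    {"user": "u1", "ts": datetime(2020, 7, 5)},
    {"user": "u1", "ts": datetime(2020, 7, 9)},
    {"user": "u2", "ts": datetime(2020, 7, 2)},
]

# Training: the label was observed on 2020-07-06, so only earlier events count.
train_feature = orders_before(events, "u1", datetime(2020, 7, 6))

# Serving: the same definition, evaluated at request time.
serve_feature = orders_before(events, "u1", datetime(2020, 7, 10))
```

A hand-written aggregation that counted all three of `u1`'s events for the training row would use information from after the label was observed, which is the kind of leak a shared feature definition helps avoid.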

TFX

TensorFlow Extended (TFX), a TensorFlow-based general-purpose machine learning platform, orchestrates many components: a learner for generating models from training data, modules for analyzing and validating both data and models, and infrastructure for serving models in production. The platform is particularly known for continuously training, validating, visualizing, and deploying freshly trained models to production relatively quickly. The individual components can share utilities that allow them to communicate and share assets. Thanks to fast training and deserialization of data, teams and the community can share their data, models, tools, visualizations, optimizations, and other techniques.

The components are further known for gathering statistics over feature values: for continuous features, the statistics include quantiles, equi-width histograms, the mean, and the standard deviation, whereas for discrete features they include the top-K values by frequency. In addition, the components support the computation of model metrics on slices of data (e.g., on negative and positive examples in a binary classification problem) and cross-feature statistics such as correlation and covariance between features. These statistics give users insight into the shape of each dataset.
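To make these per-feature statistics concrete, here is a minimal sketch using only the Python standard library. It is not TFX code; the function names and the sample values are assumptions chosen for illustration.

```python
import statistics
from collections import Counter

def continuous_stats(values, bins=4):
    """Mean, standard deviation, quartiles, and an equi-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    histogram = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        histogram[idx] += 1
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "quartiles": statistics.quantiles(values, n=4),
        "histogram": histogram,
    }

def discrete_stats(values, k=2):
    """Top-k values by frequency."""
    return Counter(values).most_common(k)

# Hypothetical feature columns.
ages = [18, 22, 25, 31, 38, 44, 52, 60]
devices = ["tv", "mobile", "tv", "web", "tv", "mobile"]

age_stats = continuous_stats(ages)
top_devices = discrete_stats(devices)
```

With these inputs the histogram over four equal-width bins is `[3, 2, 1, 2]` and the two most frequent devices are `tv` (3) and `mobile` (2).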

Further, the architecture provides a configuration-free validation setup enabled for all users, multi-tenancy to serve multiple machine-learned models concurrently, and soft model isolation to increase model performance.

Apache Airflow

Apache Airflow : Source

Apache Airflow’s architecture is based on the concept of a DAG (Directed Acyclic Graph), which captures the dependencies between tasks. Its principal responsibility is to ensure that everything happens at the right time and in the right order. Each DAG defines a single logical workflow and is defined in a Python file.
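The idea of running tasks in dependency order can be sketched without Airflow itself. The example below uses `graphlib` from the Python standard library (Python 3.9 or later) rather than Airflow's own API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical workflow).
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "train": {"validate"},
    "report": {"validate"},
    "deploy": {"train", "report"},
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

Because the graph is acyclic, a valid order always exists; if a cycle were introduced, `static_order()` would raise `graphlib.CycleError`, which is why workflow engines require the graph to be a DAG.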

Further, it supports Airflow Operators, which state what steps are executed over time (e.g., download or transfer operators such as GoogleCloudStorageDownloadOperator). One such Operator is the GoogleCloudStorageObjectSensor, which pauses execution until an object appears in Google Cloud Storage.

Apache Airflow emphasizes idempotence (re-running any step produces the same end result, no matter how many times it runs), atomicity, and metadata exchange. Data exchange between the components of this distributed architecture is handled by XCom (cross-communication), which provides an exchange of small pieces of metadata. For large volumes of data, it supports shared network storage, a data lake (S3), or URI-based exchange through XCom.
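Idempotence is easiest to see in a task that overwrites the partition for its run date instead of appending to it. The following is a minimal sketch in plain Python, not an Airflow operator; the `load_partition` function and the in-memory `store` are assumptions standing in for a real task and a real table.

```python
def load_partition(store, run_date, rows):
    """Overwrite the partition for `run_date` instead of appending to it."""
    store[run_date] = list(rows)
    return store

store = {}
rows = [{"order_id": 1}, {"order_id": 2}]

load_partition(store, "2020-07-27", rows)
load_partition(store, "2020-07-27", rows)  # a retry leaves the same end result
```

Had the task appended rows instead, a retry would have duplicated them, and the end result would depend on how many times the step ran.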

A parameterized operator defines a task in a DAG, and each run of a task at a particular point in time is a TaskInstance. Further, the task instances of a single DAG execution are grouped into a DagRun.

Zomato Feature Store

Zomato’s restaurant business relies heavily on stream data processing to compute the orders running at a restaurant at any given point. The architecture uses Apache Flink, which provides job-level isolation for each ML model: the features of each ML model keep their own space for research, analysis, and logging, and do not interact with features from other ML models.

In addition to streaming and online feature extraction, the life-cycle management of ML models is provided by MLflow. The ML models are served to the external world via API Gateway by means of AWS SageMaker endpoints.

Overton from Apple

Overton automates the life cycle of model construction, deployment, and monitoring by providing a set of novel high-level, declarative abstractions. It supports multi-task learning, so that a model can predict several tasks concurrently in both real-time and backend production applications.

Further, the architecture separates model from data using two components: tasks, which capture what the model needs to accomplish, and payloads, which represent sources of data, such as tokens or entity embeddings.

The model training is governed by a schema file, which acts as a guide to compile a TensorFlow model and to describe its output for downstream use. Overton also embeds raw data into a payload, which is then used as input to a task or to another payload. The payloads are singletons (e.g., a query), sequences (e.g., a query tokenized into words or characters), or sets (e.g., a set of candidate entities).
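The split between tasks and payloads can be pictured as a small declarative structure. The sketch below is purely illustrative and is not Overton's actual schema format: the field names, the task names, and the `validate` helper are assumptions. It only shows the idea that every task reads from a declared payload and that each payload is a singleton, a sequence, or a set.

```python
VALID_KINDS = {"singleton", "sequence", "set"}

# Hypothetical schema: payloads describe data sources, tasks describe predictions.
schema = {
    "payloads": {
        "query": {"kind": "singleton"},
        "tokens": {"kind": "sequence"},
        "entities": {"kind": "set"},
    },
    "tasks": {
        "intent": {"payload": "query"},
        "pos_tag": {"payload": "tokens"},
        "entity_type": {"payload": "entities"},
    },
}

def validate(schema):
    """Return a list of problems found in the schema (empty if it is consistent)."""
    errors = []
    for name, payload in schema["payloads"].items():
        if payload["kind"] not in VALID_KINDS:
            errors.append(f"payload {name!r} has unknown kind {payload['kind']!r}")
    for name, task in schema["tasks"].items():
        if task["payload"] not in schema["payloads"]:
            errors.append(f"task {name!r} reads undeclared payload {task['payload']!r}")
    return errors

errors = validate(schema)
```

Keeping this description separate from the model code is what lets the data sources and the prediction tasks evolve without rewriting the model by hand.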

StreamSQL Feature Store

StreamSQL Feature Store is a low-latency model development framework with high-throughput serving. It allows new model features to be versioned and deployed confidently and with ease. Feature definitions ensure that features are deployed consistently across training, serving, and production.

The architecture is also known for its ability to increase model performance by integrating third-party features. It combines batch and stream processing with an immutable ledger, where each event is appended to the end of the ledger. Further, the framework allows, at any point, the addition of new data sources and transformations (files, tables, and streams, from Flink and Spark), the modification or creation of feature sets, and the analysis and discovery of features from the feature registry.
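An append-only ledger makes it possible to recompute a feature as of any earlier point by replaying events. The sketch below is a minimal in-memory illustration, not StreamSQL's implementation; the `Ledger` class, the `total_spend` feature, and the event fields are assumptions.

```python
class Ledger:
    """Append-only event log: events are added at the end and never modified."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Store a copy so that later changes by the caller do not alter history.
        self._events.append(dict(event))
        return len(self._events) - 1  # offset of the new event

    def replay(self, upto=None):
        """Return the first `upto` events (all events if `upto` is None)."""
        return tuple(self._events[:upto])

def total_spend(events):
    """Feature definition: total amount spent per user."""
    totals = {}
    for e in events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

ledger = Ledger()
ledger.append({"user": "u1", "amount": 10})
ledger.append({"user": "u2", "amount": 5})
ledger.append({"user": "u1", "amount": 7})

current = total_spend(ledger.replay())          # latest feature values
as_of_two = total_spend(ledger.replay(upto=2))  # values after the first two events
```

Because history is never rewritten, a new feature definition can be applied to the same ledger later and will produce values consistent with what serving would have seen at each offset.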

Hybrid Feature Store

The above figure illustrates a Hybrid Feature Store with a data pipeline and BI platforms (Tableau), using Apache Airflow, S3, the Hopsworks Feature Store, and data lakes from Cloudera. The platform is capable of ingesting raw, event, or SQL data as input.

Feature Store from Tecton

Tecton has come up with a unified architecture to develop, deploy, curate/govern and monitor a platform built to standardize high-quality features, labels, and data sets for ML models in production, ensuring the safe operation of models over time, with proper reproducibility, lineage, and logging.

The Tecton platform consists of:

  • Feature Pipelines for transforming your raw data into features or labels
  • Feature Store for storing historical feature and label data
  • Feature Server for serving the latest feature values in production
  • An SDK for retrieving training data and manipulating feature pipelines
  • Web UI for managing and tracking features, labels, and data sets
  • Monitoring Engine for detecting data quality or drift issues and alerting

Feature Store from Scribble Data

The Feature Store provided by Scribble Data puts a lot of stress on input data correctness and completeness (gaps, duplicates, exceptions, invalid values), as these are known to have an impact on ML models’ predictions. Hence it recommends a continuous check/early warning system to prevent poor-quality data from coming into the system. On the reactive side, the system undertakes a continuous process to improve ML operations over time.
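A continuous input check of this kind can be as simple as scanning each batch for gaps, duplicates, and invalid values before it is ingested. The sketch below is a hypothetical illustration, not Scribble Data's product; the `check_daily_rows` function, the row layout, and the rule that values must be non-negative are assumptions.

```python
from datetime import date, timedelta

def check_daily_rows(rows):
    """Report duplicate days, missing days, and invalid values in a daily feed."""
    seen, duplicates = set(), set()
    for row in rows:
        if row["day"] in seen:
            duplicates.add(row["day"])
        seen.add(row["day"])

    gaps = []
    if seen:
        day, last = min(seen), max(seen)
        while day <= last:
            if day not in seen:
                gaps.append(day)
            day += timedelta(days=1)

    # Assumed rule for this example: a value must be present and non-negative.
    invalid = [r for r in rows if r["value"] is None or r["value"] < 0]
    return {"duplicates": sorted(duplicates), "gaps": gaps, "invalid": invalid}

# Hypothetical batch with one duplicate day, one missing day, and one bad value.
rows = [
    {"day": date(2020, 8, 1), "value": 5},
    {"day": date(2020, 8, 2), "value": 3},
    {"day": date(2020, 8, 2), "value": 3},
    {"day": date(2020, 8, 4), "value": -1},
]

issues = check_daily_rows(rows)
```

A pipeline could refuse to ingest the batch, or raise an alert, whenever any of the three lists is non-empty.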

Conclusion

Here we have discussed different architectural frameworks built on Big Data tools (some of them open source), ML model training and serving tools, and an orchestration layer (such as Kubernetes). Each component is equally important, and they go hand in hand to create a real-time, end-to-end predictive system.

How to Handle Missing Data

No single “perfect” method exists for filling in missing data; treat this picture as a starting point with some suggestions rather than an absolute rule. Decide beforehand whether you care about statistical power or uncertainty; if you do, lean towards one of the more complex routes (like multiple imputation) rather than a single imputation method, even if your data is linear or follows another trend or distribution shape.
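A small example shows why single imputation matters for uncertainty. The sketch below fills missing values with the mean of the observed ones and compares the sample variance before and after; the data values are made up for illustration.

```python
import statistics

# Hypothetical column with two missing values.
observed = [4.0, 6.0, None, 5.0, None, 9.0]

known = [v for v in observed if v is not None]
mean = statistics.mean(known)

# Single (mean) imputation: every gap receives the same value.
imputed = [mean if v is None else v for v in observed]

var_known = statistics.variance(known)      # spread of the values actually observed
var_imputed = statistics.variance(imputed)  # spread after filling the gaps
```

The imputed column has a smaller variance than the observed values because every filled-in point sits exactly on the mean. That artificial shrinkage is the reason to prefer multiple imputation, which fills each gap several times with plausible values, when uncertainty or statistical power matters.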

More info:

Large Enough Sample

Shapes of Distributions

References:

Appropriately Handling Missing Values for Statistical Modelling and Prediction

Weekly Entering & Transitioning into a Business Intelligence Career Thread. Questions about getting started and/or progressing towards a future in BI goes here. Refreshes on Mondays: (August 10)

Welcome to the ‘Entering & Transitioning into a Business Intelligence career’ thread!

This thread is a sticky post meant for any questions about getting started, studying, or transitioning into the Business Intelligence field.

This includes questions around learning and transitioning such as:

  • Learning resources (e.g., books, tutorials, videos)

  • Traditional education (e.g., schools, degrees, electives)

  • Career questions (e.g., resumes, applying, career prospects)

  • Elementary questions (e.g., where to start, what next)

I ask everyone to please visit this thread often and sort by new.

submitted by /u/AutoModerator

Practicing Data Cleaning

Any resources where I can practice data cleaning? I’m a college student thinking of shifting to a BIA degree and wanna get a feel whether or not I’ll like the job.

submitted by /u/Dudeguybrochingo
