Month: March 2021

DSC Weekly Digest 29 March 2021

Data As A Galaxy

One of the more significant “quiet” trends that I’ve observed in the last few years has been the migration of data to the cloud and with it the rise of Data as a Service (DaaS). This trend has had an interesting impact, in that it has rendered moot the question of whether it is better to centralize or decentralize data.

There have always been pros and cons on both sides of this debate, and they are generally legitimate concerns. Centralization usually means greater control by an authority, but it can also force a bottleneck as everyone attempts to use the same resources. Decentralization, on the other hand, puts the data at the edges where it is most useful, but at the cost of potential pollution of namespaces, duplication and contamination. Spinning up another MySQL instance might seem like a good idea at the time, but inevitably the moment that you bring a database into existence, it takes on a life of its own.

What seems to be emerging in the last few years is the belief that an enterprise data architecture should consist of multiple concentric tiers of content: at the core, highly curated and highly indexed data representing the objects most significant to the organization; around that, increasingly looser, less curated content representing the operational lifeblood of the organization; and outward from there, data that is generally not controlled by the organization and exists primarily in a transient state.

Efficient data management means recognizing that there is both a cost and a benefit to data authority. A manufacturer’s data about its products is unique to that company, and as such, it should be seen as authoritative. This data and metadata about what the company produces has significant value both to the company itself and to the users of those products; this tier usually requires significant curatorial management, but it also represents the greatest value to the company’s customers.

Customer databases, on the other hand, may seem like they should be essential to an organization, but in practice, they usually aren’t. This is because customers, while important to a company from a revenue standpoint, are also fickle, difficult to categorize, and frequently subject to changing their minds based upon differing needs, market forces, and other factors beyond the control of any single company. This data is usually better suited for the mills of machine learning, where precision takes a back seat to gist.

Finally, on the outer edges of this galactic data, you get into the manifestation of data as social media. There is no benefit to trying to consume all of Google or even Twitter: you take on all of the headaches of being Google or Twitter without any of the benefits. This is data that is sampled, like taking soundings or wind measurements in the middle of a boat race. The individual measurements are relatively unimportant; only the broader-term implications matter.

From an organizational standpoint, it is crucial to understand the fact that the value of data differs based upon its context, authority, and connectedness. Analytics, ultimately, exists to enrich the value of the authoritative content that an organization has while determining what information has only transient relevance. A data lake or operational warehouse that contains the tailings from social media is likely a waste of time and effort unless the purpose of that data lake is to hold that data in order to glean transient trends, something that machine learning is eminently well suited for. 

This is why we run Data Science Central, and why we are expanding its focus to consider the breadth and depth of digital transformation in our society. Data Science Central is your community. It is a chance to learn from other practitioners, and a chance to communicate what you know to the data science community overall. I encourage you to submit original articles and to make your name known to the people who are going to be hiring in the coming year. As always, let us know what you think.

In media res,
Kurt Cagle
Community Editor,
Data Science Central

DSC Featured Articles

TechTarget Articles

Picture of the Week


To make sure you keep getting these emails, please add to your address book or whitelist us.

This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.

275 Grove Street, Newton, Massachusetts, 02466 US

You are receiving this email because you are a member of TechTarget. When you access content from this email, your information may be shared with the sponsors or future sponsors of that content and with our Partners, see the up-to-date Partners List below, as described in our Privacy Policy. For additional assistance, please contact:

Copyright 2021 TechTarget, Inc. All rights reserved. Designated trademarks, brands, logos and service marks are the property of their respective owners.

Privacy Policy  |  Partners List

Data Agility and ‘Popularity’ vs. Data Quality in Self-Serve BI and Analytics

One of the most valuable aspects of self-serve business intelligence is the opportunity it provides for data and analytical sharing among business users within the organization. When business users adopt true self-serve BI tools like Plug n’ Play Predictive Analysis, Smart Data Visualization, and Self-Serve Data Preparation, they can apply the domain knowledge and skill they have developed in their role to create reports, analyze data and make recommendations and decisions with confidence.


It is not uncommon for data shared or created by a particular business user to become popular among other business users because of a particular analytical approach, the clarity of the data and conclusions presented, or other unique aspects of the user’s approach to business intelligence and reporting. In fact, in some organizations, a business user can get a reputation as being ‘popular’ or dependable, and her or his business intelligence analysis and reports might be actively sought to shape opinion and make decisions. That’s right: today there is a social networking aspect even in Business Intelligence. Think of it as Social Business Intelligence or Collaborative Business Intelligence. It is a new concept, but one that is easy to understand, given the modern propensity for socializing: people want to share, discuss, and rate information, and they want to understand the context, views, and opinions of their peers and teammates.


By allowing your team members to easily gather, analyze and present data using sophisticated tools and algorithms (without the assistance of a programmer, data scientist or analyst), you can encourage and adopt a data sharing environment that will help everyone do a better job and empower them with tools they need to make the right decisions.


When considering the advantages of data popularity and sharing, one must also consider that not all popular data will be high-quality data (and vice versa). So, there is definitely a need to provide both approaches in data analysis. Create a balance between data quality and data popularity to provide your organization and business users with the best of both worlds.


You may also wish to improve the context and understanding of data among business users by leveraging the IT curation approach to data and ‘watermarking’ (labeling/tagging) selected data to indicate that this data has been certified and is dependable. Business users can then achieve a better understanding of the credibility and integrity of the integrated data they view and analyze in the business intelligence dashboard and reports.


As the organization builds a portfolio of reports and shared data it can better assess the types of data, formats, analysis and reports that are popular among its users and will provide more value to the team and the enterprise.


Encourage your team members to share their views and ratings with self-serve data preparation and BI tools, and create an environment that will support power business users. While self-serve data prep may not always produce 100% quality data, it can provide valuable insight and food for thought that may prompt further exploration and analysis by an analyst, or a full-blown Extract, Transform and Load (ETL) or Data Warehouse (DWH) inquiry and report.


There are many times when the data extracted and analyzed through self-serve data preparation is all you will need; times when the organization, user, or team needs solid information without a guarantee of 100% accuracy. In these times, the agility of self-serve data prep provides real value to the business because it allows your team to move forward, ask questions, make decisions, share information, and remain competitive without waiting for valuable skilled resources to get around to creating a report or performing a unique inquiry or search for data.


If you build a team of power business users, and transform your business user organization into Citizen Data Scientists, your ‘social network’ of data sharing and rating will evolve and provide a real benefit to the organization. Those ‘popular’, creative business users will emerge, and other users will benefit from their unique approach to data analysis and gain additional insight. This collaborative environment turns dry data analysis and tedious reporting into a dynamic tool that can be used to find the real ‘nuggets’ of information that will change your business.


When you need 100% accuracy – by all means seek out your IT staff, your data scientists and your analysts and leverage the skilled resources to get the crucial data you need. For much of your organization, your data analysis needs and your important tasks, the data and analysis gleaned from a self-serve data preparation and business intelligence solution will serve you very well, and your business users will become more valuable, knowledgeable assets to your organization.


By balancing agility and data ‘popularity’ and democratization with high quality, skilled data analysis, you can better leverage all of your resources and create an impressive, world-class business ‘social network’ to conquer the market and improve your business. To achieve a balance between data quality and data popularity, your organization may wish to create a unique index within the business intelligence analytics portal, to illustrate and balance data popularity and data quality, and thereby expand user understanding and improve and optimize analytics at every level within the enterprise.

Security Intelligence Handbook Chapter 11: Geopolitical Intelligence Identifies IT Risks Across the Globe

Editor’s Note: We’re sharing excerpts from the third edition of our popular book, “The Security Intelligence Handbook: How to Disrupt Adversaries and Reduce Risk with Security Intelligence.” Here, we’re looking at chapter 11, “Geopolitical Intelligence.” To read the entire section, download your free copy of the handbook.

Nation-state threat actors are out to cause maximum damage and disruption, which has led to more attacks targeting cities, government agencies, critical infrastructure, and large companies. Take, for example, the recent attack on more than 250 federal agencies and businesses, presumed to be at the hands of Russian operatives.

Attacks like this underscore the importance of reducing geopolitical risk. That means going beyond protecting your digital assets from domestic cyber threats to also defending against global threats and protecting your offices, manufacturing plants, warehouse facilities, and remote personnel in foreign countries, where a whole new set of cybersecurity challenges comes into play.

IT risks emerge when conflicts between countries occur. There are also risks related to national political environments and the stability of governments, and further risk when governments change environmental, health, safety, and financial regulations. Finally, activism on the part of polarized groups creates possible IT risks. Natural and man-made disasters—such as disease outbreaks, hurricanes and earthquakes, military actions, and terrorist attacks—also come into play. Keep in mind that while some of these may seem purely physical in nature, the chaos that surrounds these events invites cyber threat actors to take advantage of the digital space during times of crisis.

Because of these possibilities, 90 percent of executives from companies in the Americas say country-level and geopolitical risk have a high or very high impact. Worldwide, 70 percent of the executives say their company has an individual or function responsible for political risk management.1

Organizations solve these challenges with precision geopolitical intelligence. Consider the advantages of receiving a warning days before these types of events impact your operations, or getting alerts in real time as they occur. That knowledge enables you to prevent such events from affecting your organization, or at least puts you in a position to respond faster and mitigate their effect. Additionally, intelligence about local attitudes and long-term trends provides the insights you need to make smarter determinations about expanding operations into specific countries and cities.

To find out more, check out the impact of geopolitical risk and the benefits of geopolitical intelligence in “The Security Intelligence Handbook, Third Edition: How to Disrupt Adversaries and Reduce Risk With Security Intelligence.” In the excerpt below, which has been edited and condensed, learn how to understand the factors that cause geopolitical risk, discover all the groups that use geopolitical intelligence, and explore geofencing and geopolitical risk-event types.

  1. EY, “Geostrategy in Practice 2020,” (survey of global organizations with revenues of $250 million or more): May 2020. 

Get ‘The Security Intelligence Handbook’

This chapter is one of many in our new book that demonstrates how to disrupt adversaries and measurably reduce risk with security intelligence at the center of your security program. Additional chapters explore different use cases, including the benefits of security intelligence for SecOps, vulnerability management, security leadership, and more.

Download your copy of “The Security Intelligence Handbook” now.

The post Security Intelligence Handbook Chapter 11: Geopolitical Intelligence Identifies IT Risks Across the Globe appeared first on Recorded Future.

Weekly Entering & Transitioning into a Business Intelligence Career Thread. Questions about getting started and/or progressing towards a future in BI go here. Refreshes on Mondays: (March 29)

Welcome to the ‘Entering & Transitioning into a Business Intelligence career’ thread!

This thread is a sticky post meant for any questions about getting started, studying, or transitioning into the Business Intelligence field. You can find the archive of previous discussions here.

This includes questions around learning and transitioning such as:

  • Learning resources (e.g., books, tutorials, videos)
  • Traditional education (e.g., schools, degrees, electives)
  • Career questions (e.g., resumes, applying, career prospects)
  • Elementary questions (e.g., where to start, what next)

I ask everyone to please visit this thread often and sort by new.


Are your data documentation tools getting the job done?

Everyone has a desire to work with data and to make the right business decisions. Unfortunately, they suppress this desire because they believe they are not smart enough to use the right tools or don’t understand the data.

The delta between data-savvy and non-data-savvy employees can change with technology. It starts by having the right data documentation toolkit that’s available to everyone and always updated. Below are 8 signs that your team should reconsider your data documentation tools:


Ransomware and Extortion Evolve More Brazen Tactics

For this week’s show we welcome back Allan Liska, a member of Recorded Future’s CSIRT security team. Allan updates us on the latest trends he and his colleagues are tracking on the ransomware and online extortion fronts. We discuss the growing sophistication of the tools and tactics attackers are using, and the remarkable brazenness with which they do their business.

This podcast was produced in partnership with the CyberWire.

The post Ransomware and Extortion Evolve More Brazen Tactics appeared first on Recorded Future.

Defining and Measuring Chaos in Data Sets: Why and How, in Simple Words

There are many ways chaos is defined, with each scientific field and each expert having their own definitions. We share here a few of the most common metrics used to quantify the level of chaos in univariate time series or data sets. We also introduce a new, simple definition based on metrics that are familiar to everyone. Generally speaking, chaos represents how unpredictable a system is, be it the weather, stock prices, economic time series, medical or biological indicators, earthquakes, or anything that has some level of randomness.

In most applications, various statistical models (or data-driven, model-free techniques) are used to make predictions. Model selection and comparison can be based on testing various models, each with its own level of chaos. Sometimes, time series do not have an auto-correlation function due to the high level of variability in the observations: for instance, when the theoretical variance of the model is infinite. An example, used to model extreme events, is provided in section 2.2 of this article (see picture below). In this case, chaos is a handy metric, and it allows you to build and use models that are otherwise ignored or unknown to practitioners.

Figure 1: Time series with undefined autocorrelation; chaos is used instead to measure predictability

Below are various definitions of chaos, depending on the context in which they are used. References explaining how to compute these metrics are provided in each case.

Hurst exponent

The Hurst exponent H is used to measure the level of smoothness in a time series, and in particular, the level of long-term memory. H takes on values between 0 and 1, with H = 1/2 corresponding to Brownian motion and H = 0 corresponding to pure white noise. Higher values correspond to smoother time series, and lower values to more rugged data. Examples of time series with various values of H are found in this article (see picture below). In the same article, the relation to the detrending moving average (another metric used to measure chaos) is explained. H is also related to the fractal dimension. Applications include stock price modeling.

Figure 2: Time series with H = 1/2 (top), and H close to 1 (bottom)
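
As a rough illustration, the sketch below estimates H with the classic rescaled-range (R/S) method; the function name and doubling window sizes are illustrative choices, and R/S is only one of several estimators (the detrending moving average mentioned above is another). Note that R/S is applied to the increments of a process: white-noise increments, i.e., a Brownian path, yield H near 1/2.

```python
import numpy as np

def hurst_rs(increments, min_window=8):
    """Estimate the Hurst exponent H from a series of increments,
    using a basic rescaled-range (R/S) analysis."""
    x = np.asarray(increments, dtype=float)
    n = len(x)
    windows, rs_means = [], []
    size = min_window
    while size <= n // 2:
        rs = []
        for start in range(0, n - size + 1, size):
            chunk = x[start:start + size]
            z = np.cumsum(chunk - chunk.mean())  # cumulative deviations from the mean
            r = z.max() - z.min()                # range of the cumulative deviations
            s = chunk.std()
            if s > 0:
                rs.append(r / s)
        windows.append(size)
        rs_means.append(np.mean(rs))
        size *= 2
    # H is approximated by the slope of log(R/S) against log(window size)
    slope, _ = np.polyfit(np.log(windows), np.log(rs_means), 1)
    return slope

rng = np.random.default_rng(0)
white = rng.normal(size=4096)   # increments of a Brownian path: H near 1/2
print(round(hurst_rs(white), 2))
```

Smoother, long-memory increments push the slope toward 1, while more rugged, mean-reverting increments push it toward 0, in line with the description above.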

Lyapunov exponent

In dynamical systems, the Lyapunov exponent is used to quantify how sensitive a system is to initial conditions. Intuitively, the more sensitive to initial conditions, the more chaotic the system is. For instance, the system xn+1 = 2xn – INT(2xn), where INT represents the integer function, is very sensitive to the initial condition x0. A very small change in the value of x0 results in values of xn that are totally different, even for n as low as 45. See how to compute the Lyapunov exponent here.
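
For a one-dimensional map T, the Lyapunov exponent can be estimated as the average of log|T'(xn)| along an orbit. The sketch below (function names are illustrative) uses the logistic map x → 4x(1 − x), a standard fully chaotic example whose exact exponent is log 2 ≈ 0.693:

```python
import math

def logistic(x):
    """Fully chaotic logistic map T(x) = 4x(1 - x)."""
    return 4.0 * x * (1.0 - x)

def lyapunov_logistic(x0, n=10000, burn=100):
    """Estimate the Lyapunov exponent as the orbit average of log|T'(x)|,
    where T'(x) = 4 - 8x for the logistic map."""
    x = x0
    for _ in range(burn):          # discard the transient
        x = logistic(x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(4.0 - 8.0 * x))
        x = logistic(x)
    return total / n

print(round(lyapunov_logistic(0.2), 3))   # close to log(2), about 0.693
```

A positive exponent signals sensitivity to initial conditions; a stable periodic system would instead yield a negative value.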

Fractal dimension

A one-dimensional curve can be defined parametrically by a system of two equations. For instance, x(t) = sin(t), y(t) = cos(t) represents a circle of radius 1, centered at the origin. Typically, t is referred to as the time, and the curve itself is called an orbit. In some cases, as t increases, the orbit fills more and more space in the plane, eventually filling a dense area, to the point that it seems to be an object with a dimension strictly between 1 and 2. An example is provided in section 2 of this article, and pictured below. A formal definition of fractal dimension can be found here.

Figure 3: Example of a curve filling a dense area (fractal dimension  >  1)
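
One common way to estimate a fractal dimension numerically is box counting: cover the orbit with grids of shrinking cell size ε and fit the slope of log N(ε) against log(1/ε), where N(ε) is the number of occupied cells. The sketch below (names illustrative, not taken from the article) recovers a dimension close to 1 for the circle example above; a dense, space-filling orbit would yield a value strictly between 1 and 2.

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the box-counting dimension of a 2-D point cloud as the
    slope of log(number of occupied boxes) vs log(1 / box size)."""
    pts = np.asarray(points, dtype=float)
    counts = []
    for eps in scales:
        # Map each point to the integer index of the grid cell containing it
        occupied = set(map(tuple, np.floor(pts / eps).astype(int)))
        counts.append(len(occupied))
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

t = np.linspace(0, 2 * np.pi, 20000)
circle = np.column_stack([np.sin(t), np.cos(t)])   # the parametric circle above
print(round(box_counting_dimension(circle, [0.1, 0.05, 0.025, 0.0125]), 2))
```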

Approximate entropy

In statistics, the approximate entropy is a metric used to quantify regularity and predictability in time series fluctuations. Applications include medical data, finance, physiology, human factors engineering, and climate sciences. See the Wikipedia entry here.

It should not be confused with entropy, which measures the amount of information attached to a specific probability distribution (with the uniform distribution on [0, 1] achieving maximum entropy among all continuous distributions on [0, 1], and the normal distribution achieving maximum entropy among all continuous distributions defined on the real line, with a specific variance). Entropy is used to compare the efficiency of various encryption systems, and has been used in feature selection strategies in machine learning, see here.
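
A compact (and deliberately naive, O(n²)) implementation of approximate entropy is sketched below, using the common defaults m = 2 and tolerance r equal to 0.2 times the standard deviation; the function name is illustrative. A regular signal such as a sine wave scores lower than white noise.

```python
import numpy as np

def approx_entropy(series, m=2, r=None):
    """Approximate entropy ApEn(m, r): lower values indicate more regularity."""
    x = np.asarray(series, dtype=float)
    if r is None:
        r = 0.2 * x.std()          # common rule-of-thumb tolerance

    def phi(m):
        n = len(x) - m + 1
        templates = np.array([x[i:i + m] for i in range(n)])
        # Chebyshev distance between every pair of length-m templates
        dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        c = (dist <= r).sum(axis=1) / n    # fraction of templates within tolerance
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)

t = np.arange(400)
rng = np.random.default_rng(1)
print(round(approx_entropy(np.sin(0.5 * t)), 2))       # regular signal: low ApEn
print(round(approx_entropy(rng.normal(size=400)), 2))  # white noise: higher ApEn
```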

Independence metric 

Here I discuss some metrics that are of interest in the context of dynamical systems, offering an alternative to the Lyapunov exponent for measuring chaos. While the Lyapunov exponent deals with sensitivity to initial conditions, the classic statistics mentioned here deal with measuring predictability for a single instance (an observed time series) of a dynamical system. However, they are most useful for comparing the level of chaos between two different dynamical systems with similar properties. A dynamical system is a sequence xn+1 = T(xn), with initial condition x0. Examples are provided in my last two articles, here and here. See also here.

A natural metric to measure chaos is the maximum autocorrelation, in absolute value, between the sequence (xn) and the shifted sequences (xn+k), for k = 1, 2, and so on. Its value is maximal, equal to 1, in case of periodicity, and minimal, equal to 0, in the most chaotic cases. However, some sequences attached to dynamical systems, such as the digit sequence pictured in Figure 1 in this article, do not have theoretical autocorrelations: these autocorrelations don’t exist because the underlying expectation or variance is infinite or undefined. A possible workaround for positive sequences is to compute the autocorrelations on yn = log(xn) rather than on the xn’s.
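
This maximum-absolute-autocorrelation metric is straightforward to compute. A minimal sketch (names illustrative), comparing a fully chaotic logistic-map orbit with a periodic sequence:

```python
import numpy as np

def max_abs_autocorr(seq, max_lag=20):
    """Chaos proxy: maximum |autocorrelation| between (x_n) and its
    shifted copies (x_{n+k}), for k = 1..max_lag.

    Close to 1 for periodic sequences, close to 0 for very chaotic ones."""
    x = np.asarray(seq, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return max(abs(np.dot(x[:-k], x[k:]) / denom) for k in range(1, max_lag + 1))

# Orbit of the fully chaotic logistic map x -> 4x(1 - x)
x, orbit = 0.2, []
for _ in range(2000):
    x = 4.0 * x * (1.0 - x)
    orbit.append(x)

print(round(max_abs_autocorr(orbit), 2))              # near 0: chaotic
print(round(max_abs_autocorr([0.0, 1.0] * 1000), 2))  # near 1: periodic
```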

In addition, there may be strong non-linear dependencies, and thus high predictability, in a sequence (xn) even if its autocorrelations are zero; hence the desire to build a better metric. In my next article, I will introduce a metric measuring the level of independence, as a proxy for quantifying chaos. It will be similar in some ways to the Kolmogorov-Smirnov metric used to test independence and illustrated here, but without much theory, essentially using a machine learning approach and data-driven, model-free techniques to build confidence intervals and compare the amount of chaos in two dynamical systems: one fully chaotic, the other not fully chaotic. Some of this is discussed here.

I did not include the variance as a metric to measure chaos, as the variance can always be standardized by a change of scale, unless it is infinite.

To receive a weekly digest of our new articles, subscribe to our newsletter, here.

About the author: Vincent Granville is a data science pioneer, mathematician, book author (Wiley), patent owner, former post-doc at Cambridge University, and former VC-funded executive, with 20+ years of corporate experience including CNET, NBC, Visa, Wells Fargo, Microsoft, and eBay. Vincent is also a self-publisher, and founded or co-founded a few start-ups, including one with a successful exit (Data Science Central, acquired by TechTarget). You can access Vincent’s articles and books here.
