Month: September 2021

DSC Weekly Digest 28 September 2021

  • The growth of self-service BI is driving organizations to create data literacy programs to ensure that business users have the knowledge and skills they need to understand data, work with it to generate useful information, and communicate analytics results to others. Check out the Search Business Analytics e-handbook How to develop a data literacy program in your organization for an in-depth look at why investments in data literacy frameworks can help improve data quality and integrity.
  • Constantly changing digital workflows are accelerating the acceptance and deployment of automated document management systems (DMS). Read the Search Content Management e-handbook Automated document management system tools transform workflows to learn why DMS tools support the hybrid workforce, and get insight into the tools, features and applications to consider when choosing the right DMS for your organization.

Click to Become A Member of Data Science Central

How Climate Change and Supply Chain Issues May Derail AI Adoption

This week, the Chinese government announced that several key manufacturing hubs critical to chip production will be shut down periodically to reduce the high demand for power (driven mostly by coal production). Companies including Apple, Nvidia, and Intel have all announced that this will have a direct impact on their own supply chain availability.

This announcement came at a particularly bad time for the beleaguered computer chip industry. A fire at a major chip fabrication plant in Japan earlier this year has already impacted automobile production in the United States and elsewhere, as many of the chips produced there were specifically for the increasingly complex machine learning components going into contemporary vehicles.

Additionally, the Delta variant of the Covid-19 virus is surging as much of the Northern Hemisphere moves into the fall and winter months. This has forced many companies that were just beginning to emerge from the earlier strains of the virus to once again put plans on hold, quite probably into late spring, and it is putting further strain on chip production, especially the specialized GPUs at the core of the AI revolution.

All of this has pushed the cost of electronics and related goods up significantly as inflation, fairly dormant for the last twenty years, begins to heat up. From an enterprise standpoint this is generally bad news, though one longer-term effect is likely to be that more countries will once again invest in their own chip production facilities (as Japan, South Korea, and the United States have all recently announced they are ramping up to do). In the interim, however, the deficit is likely to hit companies that buy these specialized high-performance chips to build out machine learning and AI-based pipelines, as well as those involved in media production.

It’s also possible, though by no means certain, that this could reduce the hiring of data scientists and machine learning specialists, at least in the short term, until new capacity comes online. Whether this spills over into the broader economy remains to be seen, but the increasing stress on global supply chains, coupled with the rising environmental and energy costs of advanced computing, makes this worth keeping an eye on.

In media res,

Kurt Cagle
Community Editor,
Data Science Central

To subscribe to the DSC Newsletter, go to Data Science Central and become a member today. It’s free! 

Data Science Central Editorial Calendar

DSC is looking for editorial content specifically in these areas for September, with these topics having higher priority than other incoming articles.

  • Machine Learning and IoT
  • Data Modeling and Graphs
  • AI-Enabled Hardware (GPUs and similar tools)
  • JavaScript and AI
  • GANs and Simulations
  • ML in Weather Forecasting
  • UI, UX and AI
  • Jupyter Notebooks
  • No-Code Development
  • Metaverse
  • GNNs and LNNs

DSC Featured Articles

Picture of the Week


To make sure you keep getting these emails, please add to your browser’s address book.

This email, and all related content, is published by Data Science Central, a division of TechTarget, Inc.

275 Grove Street, Newton, Massachusetts, 02466 US

You are receiving this email because you are a member of TechTarget. When you access content from this email, your information may be shared with the sponsors or future sponsors of that content and with our Partners, see up-to-date  Partners List  below, as described in our  Privacy Policy . For additional assistance, please contact:

Copyright 2021 TechTarget, Inc. All rights reserved. Designated trademarks, brands, logos and service marks are the property of their respective owners.

Privacy Policy  |  Partners List

Having trouble scaling graphs to large sizes?

I’ve tried Amazon Neptune, Neo4j and others but when I try to scale to several TB of data the whole graph slows considerably or stops working. Are other people having this issue?

submitted by /u/gregmaul1

Real-World Applications Of Machine Learning In Healthcare

The healthcare industry has always benefited from technological advances and their offerings. From pacemakers and X-Rays to electronic CPRs and more, healthcare has been able to add value to society and its evolution immensely due to the role of technology. Taking the evolution forward at this phase of advancements is Artificial Intelligence (AI) and its allied technologies such as machine learning, deep learning, NLP, and more.

In more ways than imaginable, AI and machine learning concepts are helping doctors and surgeons save precious lives seamlessly, detect diseases and concerns even before their advent, manage patients better, engage more effectively in their recovery process, and more. Through AI-driven solutions and machine learning models, organizations around the world are able to better deliver healthcare to people.

But how exactly are these two technologies empowering hospitals and healthcare providers? What are the real-world tangible applications of use cases that make them inevitable? Well, let’s find out.

The Role Of Machine Learning In Healthcare

For the uninitiated, machine learning is a subset of AI that allows machines to autonomously learn concepts, process data, and deliver desired results. Through different techniques such as supervised and unsupervised learning, machine learning models learn to process data through conditions and clauses and arrive at outcomes. This makes them ideal for churning out prescriptive and predictive insights.

These insights immensely help on the organizational and administrative side of healthcare delivery, in areas such as patient and bed management, remote monitoring, appointment management, and duty roster creation. On a daily basis, healthcare professionals spend 25% of their time on redundant tasks such as records management and updating and claims processing, which prevents them from delivering healthcare as required.

The implementation of machine learning models could bring in automation and eliminate human intervention where it is least required. Machine learning also helps optimize patient engagement and recovery by sending timely alerts and notifications to patients about their medications, appointments, report collection, and more.

Besides these administrative benefits, there are other practical benefits of machine learning in healthcare. Let’s explore what they are.


Real-World Applications of Machine Learning

Disease Detection & Efficient Diagnosis

One of the major use cases of machine learning in healthcare lies in the early detection and efficient diagnosis of diseases. Concerns such as hereditary and genetic disorders and certain types of cancers are hard to identify in the early stages but with well-trained machine learning solutions, they can be precisely detected.

Such models undergo years of training from computer vision and other datasets. They are trained to spot even the slightest of anomalies in the human body or an organ to trigger a notification for further analysis. A good example of this use case is IBM Watson Genomic, whose genome-driven sequencing model powered by cognitive computing allows for faster and more effective ways to diagnose concerns.

Efficient Management of Health Records

Despite advancements, the maintenance of electronic health records is still a nagging concern in the healthcare sector. While record-keeping has become much easier than the systems we collectively used earlier, health data is still scattered all over the place.

This is quite ironic, because health records need to be centralized and streamlined (and, let’s not forget, interoperable). In practice, crucial details go missing from records, are locked away, or are simply wrong. Machine learning is changing this: projects from MathWorks and Google are helping to automatically update even offline records through handwriting-recognition technology. This ensures healthcare professionals across verticals have timely access to patient data to do their jobs.

Diabetes Detection

The problem with a disease like diabetes is that a lot of people have it for a prolonged period of time without experiencing any symptoms. So, when they actually experience the symptoms and effects of diabetes for the first time, it’s already quite late. However, instances like these could be prevented through machine learning models.

A system built on algorithms such as Naive Bayes, KNN, Decision Tree, and more could be used to process health data and predict the onset of diabetes through details from an individual’s age, lifestyle choices, diet, weight, and other crucial details. The same algorithms could also be used to detect liver diseases accurately.
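As a hedged sketch of how such a system might be wired together (synthetic stand-in data, invented features, and scikit-learn as an assumed library; none of this comes from the article or any clinical source):

```python
# Illustrative only: synthetic data stands in for real patient records,
# and the features below are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
n = 500

# Hypothetical features: age, BMI, weekly exercise hours.
X = np.column_stack([
    rng.integers(20, 80, n),     # age
    rng.normal(27, 5, n),        # BMI
    rng.normal(3, 2, n),         # exercise hours per week
])
# Synthetic label: older age and higher BMI raise risk; exercise lowers it.
risk = 0.03 * X[:, 0] + 0.10 * X[:, 1] - 0.30 * X[:, 2]
y = (risk > np.median(risk)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for model in (GaussianNB(), KNeighborsClassifier(), DecisionTreeClassifier()):
    name = type(model).__name__
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: {scores[name]:.2f}")
```

A real system would train on labeled clinical records and validate against held-out patient cohorts; the shape of the pipeline, not the numbers, is the point here.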

Behavioral Modification

Healthcare is beyond treating diseases and illnesses. It’s about overall wellbeing. Often, we as humans reveal more about ourselves and what we go through with our bodily gestures, postures, and overall behavior. Machine learning-driven models can now help us identify such subconscious and involuntary actions and make necessary lifestyle changes. This could be as simple as wearables that recommend you to move your body after prolonged periods of idle time or apps that ask you to correct your body postures.

Discovering New Drugs & Medications

A lot of major health ailments still don’t have a cure. While there are immediately life-threatening concerns like cancer and AIDS on one side, there are also chronic illnesses, such as autoimmune diseases and neurological disorders, that can afflict individuals for their entire lives.

Machine learning is helping organizations and drug manufacturers come up with medications for major diseases faster and more effectively. Through simulated clinical trials, sequencing, and pattern detection, companies are now able to fast-track their experimentation and observation processes. A lot of unconventional therapies and remedies are also being developed in parallel to mainstream medicine with the help of machine learning.

Wrapping Up

Machine learning is significantly reducing the time required for us humans to reach the next phase of evolution. We are now moving ahead at a pace faster than how we got here. With more use cases, experiments and applications, in the coming years we could be discussing how cancer was cured, or how a devastating pandemic was avoided thanks to a simple smartphone app. AI in healthcare is revolutionizing the medical industry.

Future Proofing Your Career

I am very fortunate that I get asked to present at several universities about what students (and all professionals) can do to “future proof” their careers. We live in a world of constant change driven by technological, economic, pandemic, environmental, political, and societal forces. We live in a world where we need to build “transformational muscle” so that we can not only survive – but actually thrive – in a world of constant disruption and transformation.

And I believe that a “future proof” foundation is built by blending the critical and empowering disciplines of data science, design thinking, and economics.  When you meld those three together, you are certainly in a great position for whatever the future is going to throw at you (Figure 1).

Figure 1: Blending Data Science, Design Thinking and Economics to Future-Proof Your Career

Discipline #1: Data Science – the Language of AI / ML

Data Science is about identifying those variables and metrics that might be better predictors of behaviors and performance.

While not everyone will be required to code a Neural Network algorithm (thank God), it is critically important that everyone learns what can be done with advanced analytic capabilities like Machine Learning, Neural Networks, Reinforcement Learning, and Artificial Intelligence. While there seem to be multiple ways to define the “Advanced Analytics topology”, I use the following 3 levels to help explain the differences to my students (Figure 2):

Level 1 quantifies cause-and-effect (strength of relationships) and goodness of fit (model accuracy) using:

  • Statistics supports hypothesis (decision) testing and provides credibility to model outcomes (confidence levels, p-values, goodness-of-fit)
  • Predictive Analytics and Data Mining uncover statistically significant patterns, trends, and relationships buried in large data sets to quantify risks and opportunities
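A minimal Level 1 sketch (invented numbers; SciPy's linregress assumed as the tool): fit a line, then read off the strength of the relationship (r²) and the credibility of the finding (p-value):

```python
# Synthetic Level 1 example: quantify a relationship's strength (r-squared)
# and its statistical credibility (p-value). All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

ad_spend = rng.uniform(1, 10, 60)                # hypothetical predictor
sales = 3.0 * ad_spend + rng.normal(0, 2, 60)    # true effect plus noise

fit = stats.linregress(ad_spend, sales)
print(f"slope={fit.slope:.2f}  r2={fit.rvalue ** 2:.3f}  p={fit.pvalue:.2g}")
```

A small p-value and high r² together say the relationship is both strong and unlikely to be a fluke, which is exactly the cause-and-effect and goodness-of-fit framing of Level 1.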

Level 2 predicts likely actions or outcomes in order to prescribe recommendations to improve human decision-making using:

  • Deep Learning (Neural Networks) recognizes “things” – images, photos, voice, audio, video, text, handwriting – out of complex data formats
  • Machine Learning identifies relationships and patterns in the data. Supervised Machine Learning identifies “known unknown” relationships and patterns from “labeled” outcomes (e.g., purchase, fraud, attrition, breakage) using algorithms such as linear regression, logistic regression, Naive Bayes, and Support Vector Machines (SVM). Unsupervised Machine Learning identifies “unknown unknown” relationships and patterns from data with no labeled outcomes using algorithms such as k-means clustering and segmentation.
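The supervised/unsupervised split above can be sketched on the same synthetic data, assuming scikit-learn: logistic regression when outcomes are labeled, k-means when they are not:

```python
# Sketch of the supervised/unsupervised split on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)

# Two well-separated synthetic groups (e.g., churners vs. non-churners).
X = np.vstack([
    rng.normal([0.0, 0.0], 1.0, (100, 2)),
    rng.normal([4.0, 4.0], 1.0, (100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

# Supervised: learn from labeled outcomes ("known unknowns").
accuracy = LogisticRegression().fit(X, y).score(X, y)

# Unsupervised: same data, labels withheld ("unknown unknowns").
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("supervised accuracy:", accuracy)
print("cluster sizes:", np.bincount(clusters))
```

The point of the contrast: the classifier needs the labels `y`, while k-means recovers the same two groups from the raw data alone.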

Level 3 seeks to continuously learn and adapt within a continuously changing environment with minimal human intervention (robots, vacuums, autonomous vehicles) using:

  • Reinforcement Learning takes actions within a controlled environment to maximize rewards while minimizing costs. Reinforcement Learning uses trial-and-error to map situations to actions to maximize rewards (think of the kids’ game Hotter/Colder).
  • Artificial Intelligence acquires knowledge about a specific environment, applies that knowledge to interact successfully with the environment, and continuously learns from those interactions so that subsequent interactions become more effective, all with minimal human intervention.
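The trial-and-error idea can be sketched as a toy epsilon-greedy bandit, a deliberately simplified stand-in for full Reinforcement Learning (the reward numbers are invented):

```python
# Toy trial-and-error loop (epsilon-greedy bandit), a stand-in for the
# Hotter/Colder idea: try actions, track rewards, exploit the best one.
import random

random.seed(0)

true_rewards = [0.2, 0.5, 0.8]   # hidden payoff probability of each action
estimates = [0.0, 0.0, 0.0]      # the agent's learned reward estimates
counts = [0, 0, 0]
epsilon = 0.1                    # fraction of steps spent exploring

for _ in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)                  # explore ("colder?")
    else:
        action = estimates.index(max(estimates))      # exploit ("hotter!")
    reward = 1 if random.random() < true_rewards[action] else 0
    counts[action] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best = estimates.index(max(estimates))
print("learned estimates:", [round(e, 2) for e in estimates])
print("best action:", best)
```

The loop never sees `true_rewards` directly; it converges on the best action purely by mapping situations to actions and maximizing reward, which is the essence of the Level 3 description above.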

Figure 2: Three Levels of Analytics Maturity

Discipline #2: Design Thinking – the Language of your Customer

“Design Thinking is a human-centered and collaborative approach to problem solving using a design mindset to solve wicked complex problems” – IDEO

Design Thinking is all about people…their points of view…and their stories. Design Thinking is about gaining an intimate understanding of your customers – what jobs they are seeking to do, and the gains (benefits) and pains (impediments) that they encounter on their personal journey (Figure 3).

Figure 3: Design Thinking: Speaking the Language of Your Customer

But the key to Design Thinking is the empowering mindset that it establishes across all the stakeholders. Design Thinking seeks to empower and democratize the ideation process by ensuring that all ideas, regardless of who originated them, are worthy of consideration. One can only have breakthrough moments if one is willing to fail and learn from those failures. It seeks to unleash the greatness that is in every one of us. Design Thinking creates a culture of rapid exploration, rapid testing, failure tolerance, and continuous learning and adapting (Figure 4).

Figure 4: Design Thinking Uses Empowerment to Democratize Ideation

Discipline #3: Economics – the Language of Business

Economics is the branch of knowledge concerned with the production, consumption, and transfer of wealth or value.

Data and analytics, in particular, possess unique economic characteristics that enable new opportunities to drive and derive new sources of customer, product, and operational value including:

Nanoeconomics is the economic theory of individual entity (human or device) predicted behavioral and performance propensities.  We can apply Nanoeconomics to transition the organization from making decisions based upon overly generalized averages, to making decisions on individual human or device entity’s predicted behavioral and performance propensities.  Organizations can leverage Nanoeconomics to transform their economic value curve – which measures the relationship between a dependent outcome and independent inputs required to achieve that outcome – to deliver more value (outputs) with less investments (inputs). See Figure 5.
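A toy illustration of the averages-versus-propensities point (all numbers invented): the fleet average hides the high-risk tail that individual propensities expose.

```python
# Invented example: decisions based on the fleet average vs. each
# device's individual predicted failure propensity.
import numpy as np

rng = np.random.default_rng(0)

# 1,000 devices, each with its own failure propensity; the average
# conceals the high-risk tail that nanoeconomics targets.
propensity = rng.beta(2, 18, size=1000)
fleet_average = propensity.mean()

# Nanoeconomic policy: intervene only where individual risk is high.
threshold = 0.20
flagged = int((propensity > threshold).sum())

print(f"fleet average propensity: {fleet_average:.3f}")
print(f"devices above {threshold:.0%} individual risk: {flagged} of 1000")
```

An average-based policy would treat every device the same; the per-entity view concentrates spend on the small flagged subset, which is the economic value curve shift the paragraph describes.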

Figure 5: The Economic Theory of Nanoeconomics

Schmarzo Economic Digital Asset Valuation Theorem. Economics is a powerful enabler, but when it comes to digital economic assets, the power is magnitudes greater because (Figure 6):

  • Data is an asset that never depletes, never wears out, and can be used across unlimited use cases at zero marginal cost.
  • Using AI, organizations can build analytic assets that appreciate, not depreciate, in value the more they are used.
  • Data Economic Multiplier Effect measures the increase in aggregated value from the application and reuse of the organization’s data and analytic assets against the organization’s use cases at zero marginal cost.
  • Marginal Propensity to Reuse (MPR) states that an increase in the reuse of a digital asset across multiple use cases drives an increase in the attributable value of that digital asset at zero marginal cost.
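A hedged back-of-the-envelope sketch of the multiplier effect (all figures invented for illustration):

```python
# Back-of-the-envelope sketch with invented numbers: a dataset is built
# once, then reused across use cases at (near) zero marginal cost.
build_cost = 100_000                                 # one-time curation cost
use_case_values = [40_000, 60_000, 75_000, 50_000]   # value per reuse
marginal_cost_per_reuse = 0                          # digital reuse is ~free

total_value = sum(use_case_values)
total_cost = build_cost + marginal_cost_per_reuse * len(use_case_values)
multiplier = total_value / build_cost

print(f"total value: {total_value}  total cost: {total_cost}")
print(f"data economic multiplier: {multiplier:.2f}x")
```

Each additional use case adds value but (unlike a physical asset) no depletion and no marginal cost, so the multiplier only grows with reuse.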

Figure 6: Schmarzo Economic Digital Asset Valuation Theorem

Future Proofing Your Career Summary

In a world of constant disruption and transformation, everyone needs to embrace a mindset of lifetime learning.  And the best way to future proof your career, no matter your profession, is to blend the disciplines of data science, design thinking, and economics.

I can dream, right?

Mitigating risk of natural disasters with data


Data science has been playing an increasingly important role in mitigating the risk of natural disasters, such as wildfires, and has enhanced our utilization of data and technology to protect our most vulnerable communities. Join OmniSci for a webinar on Wednesday (9/29), Preventing the Next Paradise Disaster with Accelerated Analytics, where we will explore how data science and analytics tools play a critical role in understanding factors contributing to wildfires, associated risks, and impacts on communities across the Western United States, Canada, and beyond.

submitted by /u/OmniSci_

The Business of Fraud: Laundering Funds in the Criminal Underground

Insikt Group

Editor’s Note: The following post is an excerpt of a full report. To read the entire analysis, click here to download the report as a PDF.

Recorded Future analyzed current data from the Recorded Future® Platform, dark web, and open-source intelligence (OSINT) sources to review money laundering services within underground sourcing and the methodology and operations used by threat actors. This report expands upon findings addressed in the first report of the Insikt Group’s Fraud Series, “The Business of Fraud: An Overview of How Cybercrime Gets Monetized.”

Executive Summary

Money laundering services within the dark web facilitate a combination of activities through which threat actors can conceal the origins of their money, transfer cryptocurrency, have funds sent to a bank account or payment cards, or exchange to physical cash via online payment solution platforms like WebMoney or PerfectMoney. Many of these services are linked to the use of cryptocurrency and rely on other mixing services to tumble funds and help threat actors remain anonymous when transferring them. Peer-to-peer (P2P) transactions are a convenient alternative to traditional financial platforms, with support for platforms such as Venmo being touted as key features within popular underground services. 

Key Judgments

  • Dark web money laundering services facilitate a multitude of combinations through which threat actors can clean their money and can transfer cryptocurrency into virtual currency, have funds sent to a bank account or payment cards, or exchange to physical fiat currency. 
  • Money laundering services referenced within underground sources over the past year have consistently relied on money mules, cash-out requests, exchangers, or mixers to succeed.
  • Despite a high volume of arrests and takedowns of money laundering services or services that support laundering activity over the past year, underground actors generally appear disinclined to cease laundering operations they likely continue to deem profitable.
  • Cybercriminals are likely to adopt new technologies such as NFTs and other laundering techniques in response to law enforcement action and growing private sector awareness of their activities.
  • Ransomware operators likely use the multitude of dark web money laundering services operated by threat actors on well-known cybercrime forums such as Verified. Bitcoin is likely to continue to be the most widely used cryptocurrency in ransomware and laundering operations. 

Editor’s Note: This post was an excerpt of a full report. To read the entire analysis, click here to download the report as a PDF.

The post The Business of Fraud: Laundering Funds in the Criminal Underground appeared first on Recorded Future.

4 Chinese APT Groups Identified Targeting Mail Server of Afghan Telecommunications Firm Roshan

Insikt Group

Insikt Group has detected separate intrusion activity targeting a mail server of Roshan, one of Afghanistan’s largest telecommunications providers, linked to 4 distinct Chinese state-sponsored threat activity groups. This includes activity we attribute to the Chinese state-sponsored groups RedFoxtrot and Calypso APT, as well as 2 additional clusters using the Winnti and PlugX backdoors that we have been unable to link to established groups at this time. Notably, data exfiltration activity for these intrusions, particularly the Calypso APT activity and the unknown threat actor using the Winnti malware, spiked throughout August and September 2021, coinciding with major geopolitical events such as the withdrawal of US troops and a resurgence in Taliban control. This focus on intelligence gathering targeting one of Afghanistan’s largest telecommunications providers is likely driven in part by the Chinese Communist Party’s (CCP) purported desire to expand influence within Afghanistan under renewed Taliban rule. A telecommunications firm offers a hugely valuable platform for strategic intelligence collection, be it monitoring of downstream targets, bulk collection of communications data, or the ability to track and monitor individual targets. Moreover, the Chinese government considers the telecommunications sector to be of strategic significance in countries participating in the Belt and Road Initiative.

Timeline of Activity

Insikt Group tracks and regularly reports on a range of Chinese state-sponsored threat activity groups, exemplified by our recent RedFoxtrot reporting in June 2021. One of the methods used to track these groups combines adversary infrastructure detection methods and Recorded Future Network Traffic Analysis (NTA) data. Through our tracking of malicious infrastructure associated with known Chinese state-sponsored actors, we identified multiple concurrent intrusions targeting Roshan over the past year linked to 4 separate activity groups:

  • The earliest identified activity targeting Roshan is linked to the suspected Chinese state-sponsored group Calypso APT, and has been ongoing from at least July 2020 to September 2021, and was first reported by Insikt Group in August last year. 
  • More recently, the same Roshan server was identified communicating with RedFoxtrot PlugX command and control infrastructure from at least March to May 2021. During this time, RedFoxtrot was also identified targeting a second Afghan telecommunications organization.
  • Two more clusters were also engaged in the targeting of the same Roshan mail server. These are referred to as the Winnti and PlugX clusters respectively and are outlined further in the sections below. Both of these clusters appear unrelated to each other or the Calypso APT and RedFoxtrot activity, but we have been unable to link them to a tracked activity group at this time.

Figure 1: Timeline of Roshan NTA data exfiltration events versus Afghanistan geopolitical reporting (Source: Recorded Future)


The targeting of the same organization by activity groups under the same state sponsorship is not unusual, particularly for Chinese adversaries. Many of these groups have separate intelligence requirements and, due to the scale of the Chinese intelligence apparatus, are often not coordinated in their targeting and collection. In this case, as visible in Figure 1, there has been an increase in data exfiltration events associated with the Calypso APT and Winnti intrusions in August and September 2021. This is indicative of both historical strategic collection targeting Afghanistan as well as a further concentration of activity in line with major geopolitical events. 

Afghanistan is strategically important to China for several reasons, particularly in the wake of the US withdrawal. For one, the PRC likely seeks to increase its influence within Afghanistan to prevent regional instability and extremism from spreading into the bordering Xinjiang Uyghur Autonomous Region of the PRC, as well as to other Central Asian countries. These issues raise national security concerns and a need to protect PRC interests in the region, including major Belt and Road Initiative (BRI) investments. The US withdrawal also presents the PRC with opportunities for major new BRI-linked and extractive industry projects within Afghanistan. 

Technical Analysis

Figure 2: Chart of infrastructure used in Roshan intrusions (Source: Recorded Future)


As shown in Figure 2, the compromised Roshan server has been identified communicating with a range of adversary C2 infrastructure, particularly associated with the PlugX malware family commonly used by Chinese state-sponsored groups. The section below contains a breakdown of the intrusion activity by group.
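This kind of beaconing detection can be sketched as a simple join of flow logs against a known-bad indicator set. The IPs below are the defanged indicators from this report's tables; the flow records and internal host are invented for illustration:

```python
# Sketch of indicator matching: join flow logs against known C2 IPs.
c2_indicators = {
    "143.110.250.149",  # randomanalyze.freetcp[.]com (RedFoxtrot PlugX)
    "149.28.139.86",    # darkpapa.chickenkiller[.]com (RedFoxtrot PlugX)
    "159.65.152.7",     # dhsg123.jkub[.]com (RedFoxtrot PlugX)
    "103.30.17.20",     # www.membrig[.]com (Calypso APT)
    "45.76.144.44",     # unknown Winnti cluster
    "45.86.162.135",    # unknown PlugX cluster
}

# (timestamp, source, destination, bytes out) -- invented flow records.
flows = [
    ("2021-08-17T03:12:00Z", "10.0.0.5", "45.76.144.44", 48_213_991),
    ("2021-08-17T03:15:00Z", "10.0.0.5", "93.184.216.34", 1_204),
    ("2021-09-12T22:41:00Z", "10.0.0.5", "103.30.17.20", 9_731_554),
]

hits = [f for f in flows if f[2] in c2_indicators]
for ts, src, dst, out_bytes in hits:
    print(f"ALERT {ts}: {src} -> {dst} ({out_bytes:,} bytes out)")
```

Large outbound byte counts toward an indicator, as in the first and third records, are the data exfiltration signal described in this report; production NTA systems do this at scale over enriched netflow rather than a static list.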


RedFoxtrot

In June 2021, Insikt Group reported on RedFoxtrot activity targeting government, defense, and telecommunications organizations across South and Central Asia since at least 2014. We linked this activity group to Unit 69010 of the People’s Liberation Army Strategic Support Force (PLASSF) Network System Department (NSD) located in Ürümqi, Xinjiang, through lax operational security employed by a suspected RedFoxtrot operator. The group uses an array of bespoke malware variants commonly associated with Chinese groups, such as IceFog, QUICKHEAL, and RoyalRoad, as well as other more widely available tools often used by China-linked threat actors, including Poison Ivy, PlugX, and PCShare. 

In follow-up analyses in July and September 2021, we identified RedFoxtrot abandoning large amounts of operational infrastructure following public disclosure and reported on several newly identified victims across government and defense sectors in India and Pakistan. RedFoxtrot activity targeting Roshan ceased before our public reporting on the group in June 2021 and was linked to the following PlugX command and control infrastructure:


C2 Domain                      Last Seen C2 IP Address   Last Seen Date of Activity
randomanalyze.freetcp[.]com    143.110.250[.]149         April 4, 2021
darkpapa.chickenkiller[.]com   149.28.139[.]86           May 5, 2021
dhsg123.jkub[.]com             159.65.152[.]7            April 21, 2021

Table 1: RedFoxtrot PlugX indicators from Roshan intrusion

Calypso APT

In March 2021, Insikt Group reported on the Calypso APT conducting a mass exploitation campaign targeting Microsoft Exchange servers using the ProxyLogon exploit chain (CVE-2021-26855, CVE-2021-27065), alongside several other Chinese state-sponsored groups. One of the PlugX C2 domains highlighted in this activity, www.membrig[.]com, remains active and is linked to ongoing intrusion activity targeting Roshan. 


C2 Domain           Last Seen C2 IP Address   Last Seen Date of Activity
www.membrig[.]com   103.30.17[.]20            September 12, 2021

Table 2: Calypso APT indicators from Roshan intrusion

Unknown Winnti Cluster

The Winnti backdoor has historically been used by several Chinese state-sponsored groups, including APT41/Barium, APT17, and most recently a group tracked by Insikt Group as TAG-22. The Winnti backdoor is commonly associated with activity linked to multiple groups of loosely connected private contractors operating on behalf of China’s Ministry of State Security (MSS). In September 2020, the US Department of Justice (DoJ) charged 5 Chinese nationals linked to APT41, which had access to Winnti malware, with conducting widespread intrusion operations targeting over 100 victims globally. 

In relation to the Roshan targeting, we identified a high level of data exfiltration activity from the targeted Roshan server and the Winnti C2 45.76.144[.]44, from at least August 17 to September 12, 2021. We have been unable to link this Winnti C2 infrastructure with a known group, but it is very likely separate from the RedFoxtrot and Calypso APT activity highlighted above. 

Unknown PlugX Cluster

Finally, the same Roshan mail server was also identified communicating with an additional PlugX C2 server from April to August 2021. This PlugX C2, 45.86.162[.]135, is linked to the Australia-based VPS hosting reseller Crowncloud. 


Several Chinese state-sponsored groups remain highly active across Central Asia, often operating in an uncoordinated manner, likely due to differing tasking and chains of command. Like other geopolitical flashpoints such as India and the South China Sea, Afghanistan is likely to remain a prime target for Chinese government intelligence collection following the US’s withdrawal and the Taliban’s takeover. Always a prime target of cyberespionage activity, telecommunications organizations are at particularly high risk within these regions due to the intelligence value of the data they hold. Additionally, the Chinese government considers it a strategic priority to influence the telecommunications sectors of countries participating in the Belt and Road Initiative, giving it increasing leverage in the debate over global internet governance.

The post 4 Chinese APT Groups Identified Targeting Mail Server of Afghan Telecommunications Firm Roshan appeared first on Recorded Future.

A window of opportunity for data democracy (Part I of III)

One of the unanticipated consequences of digitization has been data feudalism. A major reason data feudalism was such a surprise is that society just didn’t anticipate how quickly online services could scale, or how quickly power would shift as more and more users spent more and more time online. Another surprise was how quickly those services would take advantage of the narrow window of opportunity that opened during the 2000s and 2010s to control and harness user data. Governments, caught flat-footed, have yet to respond effectively to this development.

Beginning in the 2000s, the world saw tremendous growth in social networking services. With the commoditization of and improvements in distributed compute, networking and storage, each successful social network could scale out to serve hundreds of millions or even billions of users.

Owners of the controlling shares of such a burgeoning service became de facto lords and ladies of the manor. Through the 2010s to the present, the data farm surrounding each manor produced a more and more bountiful harvest with each succeeding year.

The contract users agreed to assigned the provider rights to the continual stream of data each user generated. In this way, each provider collected a tax of sorts in exchange for a service that was otherwise "free." The tax was the user's data, harvested from the provider's data farm.

Those who signed up were presented with a choice: agree to the providers’ terms, or don’t. Those who didn’t stayed disconnected from the rich online communities that emerged.

Those who signed up (as most did) became passive data serfs of a sort. Each data serf helps seed, nurture, harvest, and enrich the data from the farms surrounding these online manors. Meanwhile, each provider maintains a user's harvested data and interconnects it with others' data within the provider's own data infrastructure. The power of most networked data is therefore now in the hands of the data gentry.

Changing the data custody model

The laws enacted within the past five years to try to protect personal data–the EU's General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) being examples–are well intentioned.

But data protection in the current IT environment is a Sisyphean task. Enterprise architectures have been designed to collect and strand data in silos, trap logic in applications, and encourage the creation of more and more silos. (See Dave McComb’s book Software Wasteland for a full exploration of how current app-centric architectures fail.)

As an unknown Scot from a past century observed, possession is nine-tenths of the law. How can we get personal data away from providers if the whole ecosystem is in the habit of collecting and making use of that data?

Cloud services perpetuate, rather than alleviate, the problems with older architecture. The only relevant difference between public cloud and on-premises is rent versus buy, and it's much simpler to rent. Then the problems become ones of trust and control.

Illustration: US WPA Art Project, between 1936 and 1940.

Hostless or serverless P2P data networks and IPFS

One way to sidestep the old architecture is to move to a more suitable one that already exists. Peer-to-peer networks aren't new, but developments over the past decade have made them more compelling for business use.

That’s particularly the case when it comes to desiloing, data-centric development, and personal data protection. For example, without a central server, each individual user can be in control of their own data from the start, by default. And both data enrichment and app development can be broadly collaborative, even more broadly than Github, by default.
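The "no central server, user-verified data" property rests on content addressing, the mechanism used by IPFS: data is identified by the hash of its own bytes, so any peer can check what it receives without trusting a host. A minimal illustrative sketch in Python follows; the `put`/`get` names and the in-memory `store` dict are assumptions for demonstration, not IPFS's actual API.

```python
import hashlib

# The `store` dict stands in for a swarm of peers holding content.
store = {}

def put(data: bytes) -> str:
    """Store data under its content identifier (the hash of its bytes)."""
    cid = hashlib.sha256(data).hexdigest()
    store[cid] = data
    return cid

def get(cid: str) -> bytes:
    """Fetch data and verify it against its own address; no trusted server needed."""
    data = store[cid]
    if hashlib.sha256(data).hexdigest() != cid:
        raise ValueError("content does not match its address")
    return data

cid = put(b"alice's profile data")
assert get(cid) == b"alice's profile data"
```

Because the address is derived from the content, a provider cannot silently alter a user's data: any change produces a different identifier, which is what lets control default to the data's owner rather than the host.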

Finally, the development stack for today's P2P data networks is much simpler.

This timeline summarizes the major P2P network developments since the days of Napster and Gnutella, which were designed for music file sharing:

Source: Erik Daniel and Florian Tschorsch, "IPFS and Friends: A Qualitative Comparison of Next-Generation Peer-to-Peer Networks," 2021.

The best-known P2P networks in use today are the Bitcoin and Ethereum networks. The goal of these transactional networks has been to harness the power of a tamperproof, immutable "blockchain" ledger. The ledger, with the help of a suitably incentivizing and effective consensus algorithm, makes it possible to verify transactions without the help of a third party.

Instead, each peer node can play a role in confirming blocks of transactions. This method also enables tamperproof smart contracts: legal agreements expressed in self-executing code. Smart contracts will be indispensable in automated commerce and governance.
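What makes such a ledger tamper-evident is that each block commits to the hash of its predecessor, so altering any past transaction breaks every later link. The sketch below illustrates only that chaining; function names are invented for illustration, and the consensus algorithm (proof of work or proof of stake) that real networks layer on top is omitted.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Deterministic hash of a block's contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> None:
    """Each new block records the hash of the block before it."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev": prev, "txs": transactions})

def verify(chain: list) -> bool:
    """Any peer can re-check every link without a trusted third party."""
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )

chain: list = []
append_block(chain, [{"from": "alice", "to": "bob", "amount": 5}])
append_block(chain, [{"from": "bob", "to": "carol", "amount": 2}])
assert verify(chain)

chain[0]["txs"][0]["amount"] = 500  # tamper with history...
assert not verify(chain)            # ...and every peer can detect it
```

This is why no single node needs to be trusted: rewriting history would require recomputing every subsequent block, which the consensus algorithm makes prohibitively expensive on a real network.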

"Blockchain" (a.k.a. Distributed Ledger Technology, or DLT, as the enterprise-class version is called) has been the subject of an enormous amount of hype since its inception in 2009, primarily because of its association with (marginally) viable cryptocurrency.

But the fact is that blockchain itself or DLT, even as it has evolved over the last 12 years, perpetuates a number of problems with older information infrastructure. In essence, most established chains on their own continue to reinforce the tabular, siloed, opaque status quo we've lived with for 20+ years. It's more recently created data networks such as IPFS that could empower individual users to escape data serfdom–and these networks can link to the ledgers.

Parts II and III of this series will continue to unpack what next-generation P2P technology could mean for data democracy, as well as what's required to build a successful data democracy.
