The Trials and Tribulations of Building In-House Machine Learning Capabilities

Abby McCulloch
3 min read · Aug 19, 2021

Third-party data models are popular right now, and it's no surprise why. Using machine learning (ML), organizations can cut down on time-consuming processes and, in turn, the number of employees required to do a related task. However, the foundation organizations need before they can start building in-house ML models comes with many complexities. Before building anything, there is a short list of heavy lifts to check off. In no particular order, organizations need quality data, data scientists, firm-wide data understanding (or at least functional data subject-matter experts, or SMEs, to spread data and model awareness), and some sort of project management or IT group. Assuming a cohort of data scientists implies some kind of IT group, we will focus on data scientists as challenge number one. Grouping together data understanding and quality data, we will focus on data governance (DG) as challenge number two.

Data scientists are in high demand, yet few companies employ enough to build ML in-house. As Figure 1 shows, only 3% of companies have more than 1,000 data scientists, while 61% employ fewer than 11.

Figure 1, Source

This shortage of data scientists is not for lack of demand: between 2012 and 2017, the number of data scientist jobs on LinkedIn increased by more than 650 percent (KDnuggets). Without a strong group of data scientists, organizations may have no choice but to rely on models built by other companies, a reliance that has its own set of problems.

Challenge number two, data governance, may prove as heavy a lift as challenge number one. If an organization does not already have a DG program stood up, the process may take many years and millions of dollars in productive human hours. Because the resulting data understanding and quality are worth it in the long run, most organizations choose to embark on this journey regardless of the time and effort it takes. However, while waiting for the fruits of data governance to become available, it is not uncommon for institutions to either neglect their in-house data assets or simply wait for a more complete governance program. In the meantime, they rely on other organizations’ proprietary data models to advance their insights.

Third-party data models have the benefit of allowing organizations to keep pace with competitors, regardless of the quality of their own data. However, this reliance comes at a cost, both monetary and ethical. There is little to no way of knowing the quality of the data used to train third-party models. Even the firms building the models often use data collected by other firms multiple degrees removed. Such uncertainty, at every stage from data collection to cleansing to use, can be dangerous. Consider the COMPAS algorithm, which judges use to assign a risk score indicating the likelihood that a defendant will commit a future offense. It assigned disproportionately worse scores to African Americans than to their Caucasian counterparts who were equally likely to re-offend, resulting in African Americans receiving longer detention periods while awaiting trial. COMPAS was trained on the data available in arrest records, defendant demographics, and other variables, but lacked proper vigilance over common data bias pitfalls and proxies (Brookings). This story is not an uncommon one. From Amazon Rekognition to Apple Card to Clearview AI, these problems plague even those companies with seemingly endless resources.
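One way to make this kind of disparity concrete is to compare false positive rates across groups: among people who did not re-offend, what share was still flagged as high risk? The snippet below is a minimal sketch of that check, not the method used in any published COMPAS analysis. The function name, the group labels "A" and "B", and the toy records are all made up for illustration.

```python
from collections import defaultdict


def false_positive_rate_by_group(records):
    """Return the false positive rate per group.

    records: iterable of (group, flagged_high_risk, reoffended) tuples.
    A false positive is a person flagged as high risk who did not re-offend.
    """
    false_positives = defaultdict(int)
    negatives = defaultdict(int)
    for group, flagged, reoffended in records:
        if not reoffended:
            negatives[group] += 1
            if flagged:
                false_positives[group] += 1
    # Only groups with at least one non-re-offender appear in the result.
    return {g: false_positives[g] / negatives[g] for g in negatives}


# Hypothetical toy data, invented purely for illustration.
toy_records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", True, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", True, True),
]

rates = false_positive_rate_by_group(toy_records)
for group, rate in sorted(rates.items()):
    print(f"group {group}: false positive rate = {rate:.2f}")
```

In this made-up example, group A's non-re-offenders are flagged twice as often as group B's. An organization buying a third-party model could run a similar check on its own labeled outcomes, assuming it has them, before relying on the scores.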

So where does all of this leave you? Well, in a tough spot. Organizations that do not embrace their big data may be missing out on big opportunities, but doing so is difficult and costly. Meanwhile, those firms interested in and able to build out their ML capabilities must be patient and careful not to over-rely on third-party data models, or they will face the consequences. I’ll further discuss the solution to this problem in later posts.