5 Ways Data Scientists Can Advance Their Careers

Data scientists and ML engineers so often spend the majority of their time cleaning up data messes they didn’t create. Kyle Kirwan, CEO and co-founder of Bigeye, says it’s difficult to get out of defense mode, but data people can empower themselves and their teams.

Last Updated: September 6, 2022

Data and machine learning people join companies with the promise of cutting-edge ML models and technology. But often, they spend 80% of their time cleaning data or dealing with data riddled with missing values and outliers, a frequently changing schema, and massive load times. The gap between expectation and reality can be wide.

Although data scientists might initially be excited to tackle insights and advanced models, that enthusiasm quickly deflates amidst daily schema changes, tables that stop updating, and other surprises that silently break models and dashboards. 

While “data science” applies to a range of roles, from product analytics to putting statistical models in production, one thing is usually true: data scientists and ML engineers often sit at the tail end of the data pipeline. They’re data consumers, pulling it from data warehouses or S3 or other centralized sources. They analyze data to help make business decisions or use it as training inputs for machine learning models. 

In other words, they’re impacted by data quality issues but aren’t often empowered to travel up the pipeline and fix them at the source. So they write a ton of defensive data preprocessing into their work or move on to a new project.

If this scenario sounds familiar, you don’t have to give up or complain that the data engineering upstream is forever broken. Make like a scientist and get experimental. You’re the last step in the pipeline, the one putting models into production, which means you’re responsible for the outcome. While this might sound terrifying or unfair, it’s also a brilliant opportunity to shine and make a big difference in your team’s business impact.

Here are five ways data scientists and ML engineers can get out of defense mode and ensure that, even if they didn’t create the data quality issues, they can prevent them from impacting the teams that rely on data.

1. Increase trust through better data quality monitoring

Business executives hesitate to make decisions based on data alone. A KPMG report showed that 60% of companies don’t feel very confident in their data, and 49% of leadership teams didn’t fully support the internal data and analytics strategy.

Good data scientists and ML engineers can help by increasing data accuracy, then getting that data into dashboards that help key decision-makers. In doing so, they’ll have a direct positive impact. But manually checking data for quality issues is error-prone and a huge drag on your velocity.

Using data quality testing (e.g. with dbt tests) and data observability helps to ensure you find out about quality issues before your stakeholders do, winning their trust in you (and the data) over time.
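To make the idea of automated checks concrete, here is a minimal sketch in plain Python. It is not how dbt or any observability tool works internally; the column names (`user_id`, `loaded_at`), the 1% and 24-hour thresholds, and the sample rows are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Fraction of rows where `column` is missing (None or absent)."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def is_fresh(rows, timestamp_column, max_age, now):
    """True if the newest timestamp is no older than `max_age`."""
    timestamps = [
        row[timestamp_column]
        for row in rows
        if row.get(timestamp_column) is not None
    ]
    if not timestamps:
        return False
    return now - max(timestamps) <= max_age


def run_checks(rows, now):
    """Return a list of readable failures; an empty list means all checks passed."""
    failures = []
    # Assumed thresholds: at most 1% missing IDs, data at most 24 hours old.
    if null_rate(rows, "user_id") > 0.01:
        failures.append("user_id null rate above 1%")
    if not is_fresh(rows, "loaded_at", timedelta(hours=24), now):
        failures.append("table not updated in the last 24 hours")
    return failures


# Hypothetical sample rows standing in for a warehouse table.
now = datetime(2022, 9, 6, 12, 0, tzinfo=timezone.utc)
rows = [
    {"user_id": 1, "loaded_at": now - timedelta(hours=30)},
    {"user_id": None, "loaded_at": now - timedelta(hours=26)},
]
failures = run_checks(rows, now)
print(failures)
```

Run on a schedule, a check like this flags a stale or half-empty table before it reaches a dashboard, which is the point of testing and observability: you hear about the problem first.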

2. Make SLAs to prevent confusion and blaming

Data quality problems can easily lead to an annoying blame game between data science, data engineering, and software engineering. Who broke the data? And who knew? And who is going to fix it? 

But when bad data goes into the world, it’s everyone’s fault. Your stakeholders want the data to work so that the business can move forward with an accurate picture.  

Good data scientists and ML engineers build accountability for all data pipeline steps with Service Level Agreements. SLAs define data quality in quantifiable terms and assign responders who should spring into action to fix problems. SLAs help avoid the blame game entirely.
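One way to picture an SLA in quantifiable terms is as a small record: a table, a metric, a limit, and a named responder. The sketch below assumes that shape; the `DataSLA` class, the metric names, the thresholds, and the team names are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSLA:
    table: str        # which dataset the agreement covers
    metric: str       # what is measured, e.g. "hours_since_update"
    threshold: float  # maximum acceptable value for the metric
    responder: str    # who is expected to act on a breach

    def is_breached(self, observed: float) -> bool:
        return observed > self.threshold


def breaches(slas, observed):
    """Return (sla, value) pairs for every SLA whose observed value is over its limit.

    `observed` maps (table, metric) to the latest measured value.
    """
    result = []
    for sla in slas:
        value = observed.get((sla.table, sla.metric))
        if value is not None and sla.is_breached(value):
            result.append((sla, value))
    return result


# Hypothetical agreements and measurements.
slas = [
    DataSLA("orders", "hours_since_update", 6.0, "data-eng-oncall"),
    DataSLA("orders", "null_rate", 0.01, "checkout-team"),
]
observed = {
    ("orders", "hours_since_update"): 9.5,
    ("orders", "null_rate"): 0.002,
}
for sla, value in breaches(slas, observed):
    print(f"{sla.table}.{sla.metric} = {value} (limit {sla.threshold}); notify {sla.responder}")
```

Because each agreement names its responder up front, a breach goes straight to the person expected to fix it rather than starting a debate about who broke the data.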

3. Faster analysis through experiments

Trust is so fragile, and it erodes quickly when your stakeholders catch mistakes and start blaming. But what about when they don’t catch quality issues? Then the model is poor, or bad decisions are made. In either case, the business suffers. 

For example, what if you have a single entity logged as both “Dallas-Fort Worth” and “DFW” in a database? When you test a new feature, everyone in “Dallas-Fort Worth” is shown variation A and everyone in “DFW” is shown variation B. No one catches the discrepancy. You can’t draw conclusions about users in the Dallas-Fort Worth area: the groups haven’t been properly randomized, so your test has been thrown off.
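A rough sketch of how that discrepancy could be avoided: normalize region labels to one canonical name, and assign variants from a hash of the user ID so that the spelling of the region cannot influence the split. The alias table, the function names, and the experiment name are all assumptions made for this example.

```python
import hashlib

# Hypothetical alias table: every known spelling maps to one canonical name.
REGION_ALIASES = {
    "dfw": "Dallas-Fort Worth",
    "dallas-fort worth": "Dallas-Fort Worth",
    "dallas fort-worth": "Dallas-Fort Worth",
}


def canonical_region(label: str) -> str:
    """Collapse known spellings of a region into a single name."""
    cleaned = label.strip()
    return REGION_ALIASES.get(cleaned.lower(), cleaned)


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically assign A or B from a hash of the user and experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def variants_by_region(users, experiment):
    """Map each canonical region to the set of variants its users received."""
    seen = {}
    for user_id, raw_region in users:
        region = canonical_region(raw_region)
        seen.setdefault(region, set()).add(assign_variant(user_id, experiment))
    return seen


# Hypothetical users, half logged as "DFW" and half as "Dallas-Fort Worth".
users = [(f"user-{i}", "DFW" if i % 2 else "Dallas-Fort Worth") for i in range(100)]
report = variants_by_region(users, "new-feature-test")
print(report)
```

A region whose set contains only one variant would be a red flag worth investigating before anyone reads the test results.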

Clear the path for better experimentation and analysis through a foundation of higher quality data. By using your expertise to boost quality, your data will become more reliable, and your business teams can run meaningful tests. The team can focus on what to test next instead of doubting the results of the tests.

4. Become the point-person for data quality

Confidence in the data starts with you; if you don’t have a handle on high-quality and reliable data, you’ll carry that burden into your interactions with the product and your colleagues. 

So stake your claim as the point-person for data quality and data ownership. You can have input into defining quality and delegating responsibility for fixing different issues. Remove friction between data science and engineering. 

If you can lead the charge to define and boost data quality, you’ll impact almost every other team within your organization. Your teammates will appreciate the work you do to reduce org-wide headaches.

5. Minimize data waste

Incomplete or unreliable data can add up to terabytes of waste. That data lives in your warehouse, getting included in queries that incur compute costs. Low-quality data can be a major drag on your infrastructure bill as it is scanned and filtered out time and again.

Identifying problem data is one way to immediately create value for your organization, especially for pipelines that see heavy traffic for product analytics and machine learning. Re-collect, reprocess, or impute and clean existing values to reduce storage and compute costs.

Keep track of the tables and data you clean up, and the number of queries run on those tables. It’s essential to tell your team how many queries are no longer running on junk data and how many gigs of storage have been freed up for better things.
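The bookkeeping can be as simple as a running log that you roll up into a one-line report. The sketch below assumes you record bytes freed and weekly query counts per table yourself; the `CleanupRecord` class, the table names, and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CleanupRecord:
    table: str
    bytes_freed: int       # storage reclaimed by the cleanup
    queries_per_week: int  # queries that used to read the junk data


def summarize(records):
    """Build a short report suitable for a team update."""
    total_gb = sum(r.bytes_freed for r in records) / 1e9
    total_queries = sum(r.queries_per_week for r in records)
    return (
        f"Cleaned {len(records)} tables: {total_gb:.1f} GB freed, "
        f"{total_queries} weekly queries no longer reading junk data."
    )


# Hypothetical cleanup log.
records = [
    CleanupRecord("events_raw", 120_000_000_000, 340),
    CleanupRecord("sessions_legacy", 30_500_000_000, 75),
]
print(summarize(records))
```

A summary like this turns quiet cleanup work into numbers your team and your stakeholders can see.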

All data professionals, seasoned veterans and newcomers alike, can be indispensable parts of the organization. You add value by taking ownership of more reliable data. Although tools, algorithms, and analytics techniques are growing more sophisticated, the input data often is not; it’s always unique and business-specific. Even the most sophisticated tools and models can’t run well on erroneous data. Through the five steps above, data science can be a boon to your entire organization. Everyone wins when you improve the data your teams depend upon.

Which techniques can help data scientists and ML engineers streamline the data management process? Tell us on Facebook, Twitter, and LinkedIn. We’d love to know!

Kyle Kirwan
Bigeye is the data observability platform that helps data teams keep their pipelines fresh and high quality. Data teams at companies like Instacart, Zoom, and Udacity use Bigeye to automate their data monitoring, detect issues proactively, and keep data reliable for the data scientists, executives, and customers who depend on it.