How are Data Engineering & Data Science related (if at all)? Which has good scope in the future?
If you landed here from LinkedIn, then you probably have the right context! Let's continue on the topic directly:
As per my understanding and experience in the #datafield so far, data, like software, has a lifecycle! 🚲
A few stages:
i) Inception (generation of data by disparate systems, e.g., transactional systems, sensors, etc.)
ii) Collection (the generated data is collected through different methods and stored in one place)
iii) Cleaning (the collected data generally isn’t ready for processing, so some preparation is required)
iv) Processing (the data is processed using business logic and/or logical transformations so that it yields information)
v) Presenting (the well-processed data is then presented using dashboarding tools/techniques, e.g., Tableau, Power BI)
vi) Intelligence (machine learning models are built on the data to identify patterns and predict future patterns using mathematical/statistical methods)
Normally, the early stages such as iii) & iv) are owned by Data Engineers, the final one by Data Scientists, and the second-to-last by Data Analysts, BUT over time these responsibilities have blurred, and there may not be a distinct separation between them.
As per a widely cited claim, “on average, a Data Scientist spends a good amount of her/his time just cleaning the data!” 😮 (up to 80%, going by the thread linked below)
(https://www.reddit.com/r/datascience/comments/bupmyf/data_scientists_spend_up_to_80_of_time_on_data/)
This gives an idea of how important pre-processing is for deriving any value out of data!
Hence, it would be safe to say that data engineering is the backbone of data science: it provides a base that is ready for the fancy ‘AI/ML modelling and predicting’.
As a data engineer, I have to understand the business context of the data I interact with while doing cleaning/transformations.
A few of the things I deal with are:
- removing duplicate values/NULL values,
- preparing fields based on biz logic & requirements,
- validating whether the data is suitable for further stages,
- considering how data handling at each stage will affect my first-hand customers (i.e., downstream applications which consume data from my pipelines)
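To make the list above a bit more concrete, here is a minimal sketch in plain Python (standard library only). The order records, the field names (`order_id`, `qty`, `unit_price`), and the rules (quantity must be positive, `total = qty * unit_price`) are all invented for illustration; a real pipeline would have its own schema and business logic.

```python
# Hypothetical raw records, as they might arrive from a source system.
raw_orders = [
    {"order_id": 1, "qty": 2, "unit_price": 10.0},
    {"order_id": 1, "qty": 2, "unit_price": 10.0},    # duplicate
    {"order_id": 2, "qty": None, "unit_price": 5.0},  # NULL value
    {"order_id": 3, "qty": 1, "unit_price": 7.5},
    {"order_id": 4, "qty": -3, "unit_price": 4.0},    # fails validation
]


def clean_orders(records):
    """Drop rows with NULLs, deduplicate, validate, and derive a field."""
    seen = set()
    cleaned, rejected = [], []
    for row in records:
        # 1) Set aside rows containing NULL (None) values.
        if any(value is None for value in row.values()):
            rejected.append(row)
            continue
        # 2) Skip duplicates, keyed on the (assumed) business key.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        # 3) Validate: an assumed rule that quantity must be positive.
        if row["qty"] <= 0:
            rejected.append(row)
            continue
        # 4) Prepare a derived field from (assumed) business logic.
        cleaned.append({**row, "total": row["qty"] * row["unit_price"]})
    return cleaned, rejected


cleaned, rejected = clean_orders(raw_orders)
```

Returning the rejected rows alongside the cleaned ones, rather than silently dropping them, is one way to keep downstream consumers informed about what was filtered out.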
So, in a nutshell, a data engineer must be aware of the business context of the data, the different mechanisms to handle/process data, and the pipelining of end-to-end data flow (technical field knowledge).
A Data Scientist, on the other hand, must be well versed in Maths & Statistics, as they play a major role in building ML models on the data provided by the data engineer.
A few major activities of a Data Scientist, in my view, may include:
- Exploratory Data Analysis,
- Hypothesis testing,
- Model building and running,
- Getting feedback and working on it
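As a toy illustration of the first three activities, here is a sketch using only Python's standard library. The numbers are invented, and the choice of Welch's t-statistic and a straight-line fit is just my assumption for the example; real work would typically lean on dedicated libraries and far more data.

```python
import math
import statistics

# Invented sample data: a metric measured for two hypothetical groups.
group_a = [12.0, 15.0, 11.0, 14.0, 13.0]
group_b = [16.0, 18.0, 17.0, 15.0, 19.0]


def summarize(values):
    """Exploratory Data Analysis: basic descriptive statistics."""
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def welch_t(a, b):
    """Hypothesis testing: Welch's t-statistic for a difference in means."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def fit_line(xs, ys):
    """Model building: least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


summary_a = summarize(group_a)
t_stat = welch_t(group_a, group_b)
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

A t-statistic with a large magnitude hints that the two group means differ; turning it into a p-value needs the t-distribution, which is left out here to keep the sketch short.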
Here, again, the business context is of utmost importance. Along with that, in-depth knowledge of mathematical/statistical methods for model building and training is important.
So, if you’re still in love with those integration techniques from your elementary math days, you belong here: use them to derive value and make predictions out of the data.
As mentioned earlier, this distinction of responsibility isn’t always clearly visible. You may find a data engineer working on a dashboard/report for a user using his SQL skills! 😉
OR you may find a data scientist doing data cleaning tasks! (This is quite common!) 😅
An important point to consider here is that cloud knowledge is now A MUST! (The same is the case in the software world as well, I guess!) Pick up any cloud tech and learn about it…this will benefit you a lot in the long run! 😃
Which one should one go for and which has good scope?
As the data volume grows even more (we already have BIG DATA!), handling it will become more and more tricky, and data engineers skilled in this task WILL BE NEEDED.
Because, as they say: “Your machine learning model is only as good as your data!”
Having said that, a skilled data scientist will tease out finer details from the data using her knowledge of math, and produce more powerful insights and predictions.
Ultimately, it all boils down to your area of interest & skill set.
If you’re someone who loves to play around with data (dirty data!), likes coming up with a plan to make it ready for processing, and has a programming background, you can go for the data engineer role.
OR
If you’re someone who loves stats/math and would like to use that to predict the future, go for the data scientist role.
Thanks for reading, and do share your thoughts! :)