What Makes a Great Data Scientist: Roles, Skills, and AI‑Powered Tools
The current training session focuses on the foundational aspects of data science with an emphasis on the R programming language. While R will be the primary tool for today’s exercises, future sessions will introduce Python and other languages to broaden the skill set required for different job duties.
Defining a Data Scientist
Job postings from organizations such as the Department of the Treasury, Google, and OpenAI illustrate the varied ways a data scientist is described. A common misconception is that the role is filled with glamorous model‑building and endless coding. In reality, day‑to‑day work often involves data cleaning, exploratory analysis, feature engineering, visualization, collaboration, documentation, and continuous learning. Data science sits at the intersection of statistics and computer science: a statistician’s goal is to prove, whereas a data scientist’s goal is to improve.
Programming Languages: R vs. Python
The choice between R and Python should be guided by the nature of the tasks rather than by any inherent superiority of one language.
- R is preferred for statistics‑heavy roles because of its extensive ecosystem of statistical packages (e.g., mlr, caret).
- Python shines in engineering contexts, especially when building APIs, data pipelines, or integrating with production systems.
Thus, the decision hinges on whether the job leans more toward statistical analysis or software engineering.
The Art of Problem Solving
Coding proficiency is attainable, but it alone does not define a data scientist. Problem solving is language‑agnostic: it requires breaking large, vague challenges into smaller, manageable pieces and visualizing the solution in three dimensions. From an entrepreneurial perspective, the problem must address a real pain point and possess a clear path to profitability. As one speaker put it, “Problem solving is agnostic of the language.”
Qualities of an Effective Data Scientist
- Imagination – Drives innovation and enables the creation of novel solutions.
- Paranoia – In this context, being constantly vigilant about potential mistakes and rigorously testing code.
- Obsession – A deep, sustained focus on the problem ensures thorough exploration and high‑quality outcomes.
- Customer Orientation – Listening to user feedback and aligning solutions with real needs.
- Time Orientation – Balancing speed to production with performance efficiency once the solution is live.
- Collaboration – Working seamlessly with cross‑functional teams; data scientists rarely work in isolation.
- Documentation – Writing clear, maintainable code that others can understand and extend.
Characteristics of a Suboptimal Data Scientist
- “All API, no KPI” – building technology without business context.
- Optimizing tasks that should never be optimized, such as automating inherently human decisions.
- Skipping quality assurance and testing, leading to fragile code.
- Poor documentation that hampers teamwork.
- Trying to be a hero and refusing to ask for help.
- Needing a “babysitter” for tasks that call for basic independence.
- Getting stuck on unsolvable “monkey and banana” problems.
- Favoring easy shortcuts over optimal solutions.
- Declaring tasks impossible, reflecting a lack of imagination.
- Chasing the newest tools instead of solid principles.
- Maintaining a poor attitude that damages collaboration.
Data Scientist as an Entrepreneur/Innovator
Beyond analysis, a data scientist can act as an innovator, creating seamless user experiences and driving change through creativity. The pursuit of “seamlessness” means delivering intuitive solutions that hide underlying complexity from the end user.
Building Data Science Solutions
A practical workflow can be distilled into three steps:
- Get it to work – develop a functional prototype.
- Integrate it – embed the prototype into existing systems or workflows.
- Scale it – ensure the solution can handle larger data volumes and broader usage.
R has known scalability limitations, so breaking complex problems into smaller, reusable components is essential for growth.
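The “break it into reusable components” advice can be sketched in R as a pipeline of small, single‑purpose functions. All function and column names below are illustrative, not from the talk:

```r
clean_data <- function(df) {
  # Keep only complete rows; a real cleaner would do far more.
  df[stats::complete.cases(df), , drop = FALSE]
}

summarise_column <- function(df, col) {
  # Return the mean and standard deviation of one numeric column.
  c(mean = mean(df[[col]]), sd = stats::sd(df[[col]]))
}

run_pipeline <- function(df, col) {
  # Compose the pieces; each can be swapped or scaled independently.
  summarise_column(clean_data(df), col)
}

df <- data.frame(x = c(1, 2, NA, 4))
run_pipeline(df, "x")  # summarises c(1, 2, 4) after the NA row is dropped
```

Because each step is a separate function, any one of them can be tested, replaced, or scaled without touching the others.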
The Role of Code
Code serves as a tool for automation, prediction, and analysis. For R‑focused data scientists, the minimum requirements include:
- Deep understanding of data structures.
- Intuitive grasp of logical flow.
- Ability to write clean, reusable functions.
- Familiarity with machine‑learning packages such as mlr and caret.
- Strong statistical foundation.
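A minimal base‑R sketch of the first three requirements; the final commented line shows what a caret model fit looks like, and runs only if the caret package is installed:

```r
# Core data structures: a named numeric vector inside a list.
scores <- c(maths = 70, stats = 85)
record <- list(id = 1L, scores = scores)

# A clean, reusable function with an explicit input contract.
rescale <- function(x, lo = 0, hi = 1) {
  stopifnot(is.numeric(x), hi > lo)
  lo + (hi - lo) * (x - min(x)) / (max(x) - min(x))
}

rescale(record$scores)  # maths -> 0, stats -> 1

# With caret installed, fitting a model is a one-liner, e.g.:
# fit <- caret::train(Species ~ ., data = iris, method = "rpart")
```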
AI in Data Science
AI can be organized into a task hierarchy:
- Tier 1 – Correct work: AI reliably extracts structured data from unstructured sources.
- Tier 2 – Good enough work: AI provides acceptable summaries or drafts.
- Tier 3 – Tasks AI shouldn’t do: High‑risk activities involving sensitive data or critical decisions.
A risk‑based framework helps decide when AI is appropriate. When used wisely, AI acts as “glue,” stitching together disparate steps into a seamless process.
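One way to make that risk‑based framework concrete is a small triage helper. The inputs and exact tier labels below are illustrative assumptions, not part of the talk:

```r
# Illustrative triage: route a task to a tier based on its risk profile.
ai_tier <- function(sensitive_data, critical_decision, needs_exact_output) {
  if (sensitive_data || critical_decision) {
    "Tier 3: keep a human in charge"
  } else if (needs_exact_output) {
    "Tier 1: AI extraction with verification"
  } else {
    "Tier 2: an AI draft is good enough"
  }
}

ai_tier(sensitive_data = FALSE, critical_decision = FALSE,
        needs_exact_output = TRUE)
```

The point is the ordering: risk is checked first, so no combination of other flags can route sensitive work to AI.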
Kai Squares Platform Demonstration
Kai Squares exemplifies how data‑science principles translate into a real‑world research platform. Key features demonstrated include:
- Survey Design – AI suggests aims, keywords, and even generates questions.
- Data Import – Importing questions from Word documents highlights challenges of structuring unstructured text.
- AI Builder – Generates new survey items and label suggestions (e.g., educational background).
- Translation – Automated multilingual support for surveys.
- Deployment – Launching surveys, collecting responses, and handling multiple submissions for outbreak investigations.
- Automated Reporting – Generates methodology sections, analyses, and visualizations without manual effort.
- Data Cleaning – Built‑in tools streamline preprocessing.
- Study Management – Survey bank, sample‑size calculators, sampling tools, cross‑sectional and longitudinal study designs, and offline mode for low‑connectivity environments.
- Real‑Time Tracking – Dashboards update instantly, and data access policies govern sharing.
- Scalability – The platform can handle large datasets while maintaining performance.
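As an illustration of the sample‑size calculation such study‑management tools expose, base R’s power.prop.test (in the stats package) sizes a two‑group comparison of proportions. The specific prevalences below are an invented example:

```r
# Sample size per group to detect a rise from 10% to 15% prevalence
# at a 5% significance level with 80% power.
res <- stats::power.prop.test(p1 = 0.10, p2 = 0.15,
                              sig.level = 0.05, power = 0.80)
ceiling(res$n)  # participants needed in each group
```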
Through these capabilities, Kai Squares demonstrates how AI can be responsibly integrated to enhance productivity without compromising data integrity.
Takeaways
- A data scientist blends statistics and computer science, focusing on improving outcomes rather than merely proving hypotheses.
- Choosing R or Python depends on whether the job emphasizes statistical analysis or engineering tasks, not on any inherent superiority.
- Problem solving, not coding alone, is the core competency of a data scientist and applies across any programming language.
- Key personal traits—imagination, paranoia, obsession, customer focus, time awareness, collaboration, and documentation—separate effective data scientists from suboptimal ones.
- AI should be used as a risk‑aware glue for correct or good‑enough tasks, as illustrated by the Kai Squares platform’s automated survey and reporting features.
Frequently Asked Questions
Who is Chisquares on YouTube?
Chisquares is a YouTube channel that publishes videos on a range of topics.