It's not always so easy to collect good data. In this video we'll look at a few common issues with data collection and talk about how we could handle them.
We've got our data and we're ready to
start uncovering some hidden truths.
0:00
But first,
before we start analyzing anything,
0:04
it's usually a good idea to think
about where the data comes from.
0:07
The data we receive isn't always
guaranteed to be 100% accurate.
0:11
When analyzing the accuracy of your data,
you need to think about a few things,
0:16
including the source of the information,
the methods for
0:20
collecting the data and
the way the data is measured.
0:23
Now, for this example,
our data comes from a pretty good source.
0:26
The Boston Athletic Association
keeps detailed records, and
0:30
they let you search through the results.
0:33
However, they don't just let
you download all the results.
0:36
That would be way too easy.
0:39
Instead, we rely on other data
scientists to write code to
0:41
repeatedly search through the results and
collect them all into a CSV file.
0:45
This is the first bit of messiness
that we need to be aware of.
0:50
Our data doesn't come
directly from the source.
0:53
We're relying on somebody else to collect
the data without making any mistakes.
0:56
So if we see anything strange,
1:01
like somebody running the whole marathon
in less than an hour, [SOUND] we'd want to
1:02
double check that with the official
results before continuing our analysis.
1:07
Or, more likely, find a new data source.
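As a rough illustration of that kind of sanity check, here is a minimal Python sketch that flags finish times too fast to be believable. The column names (`name`, `official_time`), the sample rows, and the two-hour cutoff are all assumptions made for this example, not details of the actual results file.

```python
import csv
import io

# Hypothetical sample; real column names in the results file may differ.
SAMPLE = """name,official_time
Runner A,2:09:37
Runner B,0:58:12
Runner C,4:31:05
"""

def to_seconds(hms):
    """Convert an H:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def flag_suspicious(rows, min_seconds=2 * 3600):
    """Return rows whose finish time is faster than min_seconds (assumed cutoff)."""
    return [row for row in rows if to_seconds(row["official_time"]) < min_seconds]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
suspicious = flag_suspicious(rows)
for row in suspicious:
    print(row["name"], row["official_time"])
```

Any row this prints would be a candidate for checking against the official results before going further.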
1:11
Another example of messy data
would be survey results.
1:14
Unlike the Boston Marathon, where the
runners' times are automatically recorded,
1:18
with surveys,
1:22
we have to deal with the possibility
that respondents are lying to us.
1:23
Or, more likely, they're being
influenced by a cognitive bias.
1:27
One such cognitive bias is
the social desirability bias.
1:31
When answering a survey,
respondents have a tendency to
1:35
respond in a way that will
make them look good to others.
1:38
For example,
1:41
when the dentist asks how frequently
you floss, do you tell the truth?
1:42
Turns out, one in four of you don't.
1:47
Which leads to this awesome
headline from NPR:
1:49
Are you flossing, or
just lying about flossing?
1:52
But the social desirability bias doesn't
only mean over-reporting good behaviors.
1:56
It can also mean
under-reporting bad behaviors.
2:00
If a survey asks how frequently you use
recreational drugs or how many sexual
2:04
partners you've had, it's pretty unlikely
that everyone's going to tell the truth.
2:08
In fact, getting accurate data
about recreational drug use is so
2:13
hard that scientists have turned
to testing wastewater to try to
2:17
figure out which substances
are being used in a community.
2:21
Luckily, we don't have to go quite that
far to make sure we have usable data.
2:25
Though just because our data
is automatically recorded
2:30
doesn't mean that it's accurate.
2:33
Take, for example, the step counter
on your phone or smartwatch.
2:35
I don't know about you, but
mine's not particularly accurate.
2:38
Sometimes I get in the car to go to work,
and
2:42
by the time I've arrived,
I've added another hundred steps.
2:44
Now for something like step counting,
2:48
it's probably not a big
deal to have a few extra.
2:50
But what if we had a step
competition with people using
2:53
all kinds of different devices
to record their steps?
2:56
All of a sudden, those extra
steps would matter a whole lot.
2:59
What if somebody else had a much
more accurate step counter?
3:04
And on that same drive,
3:07
instead of recording an extra hundred
steps, it was perfectly accurate.
3:08
It wouldn't really be fair to
compare our steps directly.
3:13
So instead, before any analysis, we would
want to correct for any extra steps.
3:16
This process is known as cleaning or
preparing your data.
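To make the step-counter correction concrete, here is a small Python sketch. It assumes we somehow know how many spurious steps each person's device adds. The names, totals, and offsets are hypothetical, and in practice estimating those offsets is the hard part.

```python
# Hypothetical daily step totals and an assumed per-device count of
# spurious steps (for example, steps logged while driving).
recorded_steps = {"alice": 8100, "bob": 8050}
spurious_steps = {"alice": 100, "bob": 0}

def corrected_steps(recorded, spurious):
    """Subtract each person's known spurious steps, never going below zero."""
    return {
        person: max(0, steps - spurious.get(person, 0))
        for person, steps in recorded.items()
    }

cleaned = corrected_steps(recorded_steps, spurious_steps)
winner = max(cleaned, key=cleaned.get)
print(cleaned, winner)
```

With these made-up numbers, the raw totals and the corrected totals pick different winners, which is why the correction has to happen before any comparison.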
3:22
Before doing an analysis, you want to
make sure that your data is valid.
3:25
This can be as simple as combining
several misspellings of a name
3:29
into one category or
3:32
as difficult as trying to figure out which
responses are genuine in an online survey.
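For the simple end of that range, here is one way combining misspellings could look in Python. The lookup table and the sample values are invented for illustration. A real data set would need its own list of known variants.

```python
from collections import Counter

# Hypothetical mapping from known misspellings to one canonical spelling.
CANONICAL = {
    "kenya": "Kenya",
    "kenia": "Kenya",
    "keyna": "Kenya",
}

def normalize(value):
    """Trim whitespace, then map known variants to the canonical name."""
    cleaned = value.strip()
    return CANONICAL.get(cleaned.lower(), cleaned)

raw = ["Kenya", "kenia ", "Keyna", "Ethiopia"]
counts = Counter(normalize(v) for v in raw)
print(counts)
```

Values that are not in the table pass through unchanged, so nothing gets silently merged unless it was explicitly listed.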
3:34
Lucky for us,
our data is already pretty clean.
3:40
But we do have a few unused columns.
3:43
The first column seems to be
an unnecessary line number.
3:46
Column J is just completely empty, and
3:49
the Projected Time column is empty for
all runners.
3:52
We can safely delete each of these columns
by right-clicking on the column header and
3:55
selecting Delete column.
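If you were working in code rather than a spreadsheet, the same cleanup could be sketched roughly like this in Python. The sample rows and the header names are assumptions for the example and only mimic the columns described above.

```python
import csv
import io

# Hypothetical sample: the first column is an unnamed line number and
# "Projected Time" is blank for every runner.
SAMPLE = """,Name,Projected Time,Official Time
0,Runner A,,2:09:37
1,Runner B,,2:10:28
"""

def empty_columns(rows):
    """Return the set of columns whose value is blank in every row."""
    if not rows:
        return set()
    return {col for col in rows[0] if all(not row[col].strip() for row in rows)}

def drop_columns(rows, unwanted):
    """Return copies of the rows without the unwanted columns."""
    return [{k: v for k, v in row.items() if k not in unwanted} for row in rows]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
unwanted = empty_columns(rows) | {""}  # "" is the unnamed line-number column
cleaned = drop_columns(rows, unwanted)
print(cleaned[0])
```

Detecting empty columns automatically is a design choice here. Dropping columns by hand, as in the spreadsheet, is just as valid for a file this small.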
4:00
Data cleaning and preparation is the real
unsung hero of data analytics.
4:02
It takes a lot of work to turn
raw information into good,
4:07
valid data that we can analyze.
4:10
And I can't stress enough
how important it is to
4:12
understand where your data comes from.
4:15
That's enough about
dealing with messy data.
4:18
Coming up,
we'll finally start doing some analysis.
4:20
And by the end of this course,
4:23
you'll be ready to draw insights
from all the data all around you.
4:24