Learn the different types of bad data and what they mean.
Terms
- Duplicates - repeated data
- Missing data - data labeled as unknown, NaN, or empty
- Formatting - misspellings, extra whitespace, differences after combining multiple datasets
- Type - data that is a different type than expected
- Nonsensical - data that does not make sense
- Saturated - data that is at the extremes of the measurement
- Confidential - personally identifiable information
- Individual Error - errors that affect a single value
- Systematic Error - errors that affect all or large portions of the data
### Further PII Resources:
- DOL PII
- EU GDPR
Hello again, let's take a look at this
dirty version of the Pokemon dataset.
0:00
You might be able to pick out
a few of the issues already, but
0:05
let's give them a name.
0:09
These are in no particular order.
0:10
First, duplicates.
0:13
There are two Blastoise
entries in the data set.
0:16
Duplicates can bias your results and
0:20
can also make the dataset take
up more space than it needs to.
0:23
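As a rough sketch of how duplicates like the two Blastoise entries could be found, the snippet below counts names in a small made-up sample. The rows and the `name` key are assumptions for illustration, not the actual course file.

```python
from collections import Counter

# Hypothetical sample rows; the real dataset's columns may differ.
rows = [
    {"name": "Wartortle", "type": "Water"},
    {"name": "Blastoise", "type": "Water"},
    {"name": "Blastoise", "type": "Water"},
]

# Count how often each name appears and keep those seen more than once.
counts = Counter(row["name"] for row in rows)
duplicates = [name for name, n in counts.items() if n > 1]

# Keep only the first occurrence of each name.
seen = set()
deduped = []
for row in rows:
    if row["name"] not in seen:
        seen.add(row["name"])
        deduped.append(row)

print(duplicates)    # ['Blastoise']
print(len(deduped))  # 2
```

Matching on the name alone is a simplification. In practice you might compare whole rows before deciding that two entries are true duplicates.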
Next is missing data on line 44.
0:28
Golbat has some values that are missing.
0:32
Instead of values there are empty cells or
even NaN,
0:37
or not a number,
like the one down here on line 150.
0:42
This is an instance where you
will need to make a decision.
0:48
Fortunately, Pokemon is pretty popular and
0:53
there's reference material that you can
consult to find the correct values.
0:55
The next might be a bit harder.
1:00
If a name was missing or unknown,
it might be more difficult
1:02
to find the matching Pokemon name using
the data from the rest of the columns.
1:08
Instead, you might need to exclude
this row in some cases because
1:14
of the missing information.
1:19
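One possible way to flag the missing values described above is sketched here. The sample rows, the column names, and the `is_missing` helper are all hypothetical. The sketch treats `None`, empty strings, the text "unknown", and NaN as missing.

```python
import math

# Hypothetical rows; None, "", and NaN all stand in for missing values.
rows = [
    {"name": "Zubat", "height": 0.8, "weight": 7.5},
    {"name": "Golbat", "height": None, "weight": ""},
    {"name": "", "height": 1.2, "weight": float("nan")},
]

def is_missing(value):
    """Return True for None, blank strings, 'nan'/'unknown' text, or NaN."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in ("", "nan", "unknown")
    return isinstance(value, float) and math.isnan(value)

# Report which columns are missing in each row.
missing_by_row = {
    index: [column for column, value in row.items() if is_missing(value)]
    for index, row in enumerate(rows)
}

# Rows without a name are hard to look up, so they are candidates for exclusion.
usable = [row for row in rows if not is_missing(row["name"])]

print(missing_by_row)
print(len(usable))
```

Whether to fill in a value from reference material or exclude the row is still a decision you make case by case. The code only surfaces where that decision is needed.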
Following that, formatting.
1:23
There are quite a few
examples in this dataset.
1:25
This includes misspellings,
excess whitespace, and
1:28
differences in how something was formatted
when two different data sets are combined.
1:31
In the first row here, ice is misspelled.
1:37
On line 57,
the weaknesses are separated by
1:40
a dash instead of a comma
like the rest of the rows.
1:45
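A minimal sketch of cleaning up the separator and whitespace problems mentioned above follows. The raw value and the `normalize_list` helper are made up for illustration. It also assumes no individual value contains a dash of its own.

```python
import re

# Hypothetical raw value using a dash separator and stray whitespace.
raw_weaknesses = "  Electric - Grass "

def normalize_list(text):
    """Split on commas or dashes, trim whitespace, and rejoin with ', '."""
    parts = re.split(r"[,\-]", text)
    return ", ".join(part.strip() for part in parts if part.strip())

print(normalize_list(raw_weaknesses))    # Electric, Grass
print(normalize_list("Fire,  Flying"))   # Fire, Flying
```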
After that, type in row 38.
1:51
The height is written out as a string
1:59
instead of numerically in rows 31,
2:04
26, and 10.
2:11
These all include the notation for
inches or pounds,
2:19
which would make these
entries non-numerical.
2:23
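To illustrate the type problem, here is a small sketch that pulls the number out of values that carry unit notation. The example strings and the `to_number` helper are assumptions. The sketch simply discards the unit, so any unit conversion would be a separate step.

```python
import re

def to_number(value):
    """Extract the leading number from values such as '13.2 lbs' or '28"'."""
    if isinstance(value, (int, float)):
        return float(value)
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", str(value))
    return float(match.group(1)) if match else None

print(to_number("13.2 lbs"))  # 13.2
print(to_number('28"'))       # 28.0
print(to_number("tall"))      # None
```

Returning `None` for text with no number keeps the value visibly missing instead of silently guessing.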
The last example in this
data set is nonsensical.
2:28
This includes data that
doesn't make sense.
2:33
On line 68, there's
2:36
a negative value for weight,
but weight cannot be negative.
2:41
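A simple range check can catch nonsensical values like a negative weight. The rows below are invented for illustration, and the rule that weight must be greater than zero is an assumption about this dataset.

```python
# Hypothetical rows; the second weight is deliberately impossible.
rows = [
    {"name": "Pidgey", "weight": 4.0},
    {"name": "Rattata", "weight": -7.7},
]

# Flag any row whose weight is zero or negative.
invalid = [row["name"] for row in rows if row["weight"] <= 0]

print(invalid)  # ['Rattata']
```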
The last two types of errors you may
encounter are called saturated and
2:47
confidential.
2:52
Saturated data are where the values are at
the extreme limits of the measurement.
2:54
For example, let's say there's
one of those "Your Speed (blank)"
3:00
signs that light up with your
car's speed as you go by.
3:05
But this sign only shows speeds
between 20 and 40 miles per hour.
3:09
If a car drove by faster
than 40 miles per hour,
3:16
the sign just flashes "slow down" instead
of showing the car's actual speed.
3:20
As a driver, you may not be receiving
any new reliable information
3:26
from the speed sign as you slow down or
speed up.
3:31
Similarly, when performing or
sharing your analysis,
3:35
saturated data can introduce
instability in your data sets.
3:38
In our Pokemon data set,
saturated data could be something
3:43
like measuring the Pokemon's height,
but with only a ruler, so
3:48
anything larger than 12 inches would just
have 12 inches listed as their height.
3:53
So all of these would be 12, 12, 12, etc.
3:59
You can imagine how this would skew
your analysis and your graphs.
4:06
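One hedged way to spot saturation is to check how many values sit exactly at the instrument's limit. The heights and the 12-inch limit below follow the ruler example, but the numbers themselves are made up.

```python
# Hypothetical heights in inches, measured with a 12-inch ruler.
heights = [5, 12, 9, 12, 12, 12, 7, 12]
RULER_LIMIT = 12  # assumed upper limit of the instrument

# Count the values piled up at the limit and their share of the data.
at_limit = sum(1 for h in heights if h >= RULER_LIMIT)
share_at_limit = at_limit / len(heights)

print(at_limit, "of", len(heights), "heights sit at the limit")
```

A large share of values at one extreme is only a hint. You still need to know how the data was collected to confirm that the instrument was the cause.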
Lastly, confidential data. Here, Pokemon
have addresses and credit card numbers.
4:13
I know it's silly,
just go with me on this.
4:19
Credit card information should never be
included as it is highly confidential
4:22
information.
4:26
You may also need to remove other
personally identifiable information or
4:28
PII from your table depending on who will
be viewing your analysis or data set.
4:34
PII may be necessary for a company's
internal use but you have an obligation
4:41
to protect this information if sharing
a data set or analysis publicly.
4:47
I'll post some resources in the teacher's
notes to dive into PII even more.
4:52
Sharing personally identifiable
information publicly and
4:58
without consent can have
legal ramifications.
5:02
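As a sketch of stripping confidential fields before sharing, the snippet below drops a set of columns from each row. The column names in `PII_COLUMNS` and the sample values are hypothetical. Which fields count as PII depends on your data and the rules that apply to it.

```python
# Hypothetical column names; adjust to whatever PII your table holds.
PII_COLUMNS = {"address", "credit_card"}

rows = [
    {"name": "Pikachu", "type": "Electric",
     "address": "1 Example Road", "credit_card": "0000-0000-0000-0000"},
]

# Build a copy of each row without the confidential columns.
public_rows = [
    {key: value for key, value in row.items() if key not in PII_COLUMNS}
    for row in rows
]

print(public_rows)  # [{'name': 'Pikachu', 'type': 'Electric'}]
```

Building new rows instead of editing in place leaves the original data untouched for internal use.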
Another aspect to consider
when reviewing your data set's
5:07
errors is whether they are individual errors or
systematic errors.
5:12
A single misspelling would
be an individual error,
5:18
because it affects a single row or value.
5:22
A systematic error is one that affects
a large portion or even all of a data set.
5:25
For instance, if the ruler used to measure
the height was only 11 inches long instead
5:32
of 12, all of these heights would then be
incorrect, causing a systematic error.
5:39
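Because a systematic error affects every value in the same way, it can sometimes be corrected in one step once the cause is known. The sketch below rests on one possible reading of the ruler example: the ruler was treated as 12 inches long but was really 11, so every recorded height is too large by the same factor. The recorded values are invented.

```python
# Assumption: each recorded height overstates the true height by 12/11.
recorded = [12.0, 24.0, 36.0]
SCALE = 11 / 12

# Apply the same correction to every value.
corrected = [round(h * SCALE, 2) for h in recorded]

print(corrected)  # [11.0, 22.0, 33.0]
```

An individual error, by contrast, has to be found and fixed one value at a time.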
This is why it's important to
understand your data set and
5:47
how the data was collected
as much as possible.
5:50