Dig into the books dataset to determine the most popular book.
What is the most popular book of the 1960s?
Use pd.Timestamp to compare dates. For example:
books['publication_date'] > pd.Timestamp(1960,1,1)
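As a rough sketch of how that hint could be extended to cover a whole decade: the rows below are made up, and the month/day/year date format is an assumption about how the dates are stored.

```python
import pandas as pd

# Hypothetical rows; the real dataset's values and date format may differ.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "publication_date": ["5/1/1959", "7/11/1960", "12/31/1969", "1/1/1970"],
})

# Comparing against pd.Timestamp needs a datetime column, so parse the strings first.
books["publication_date"] = pd.to_datetime(
    books["publication_date"], format="%m/%d/%Y"
)

start = pd.Timestamp(1960, 1, 1)
end = pd.Timestamp(1970, 1, 1)

# Combine both conditions with & and wrap each one in parentheses.
sixties = books[
    (books["publication_date"] >= start) & (books["publication_date"] < end)
]
print(sixties)
```

The result holds only the rows dated from 1960 through 1969, which can then be sorted the same way as in the video.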
We're going to get into
analyzing book popularity.
0:00
I'm gonna start off with
a couple of questions I have and
0:03
I'll add these to my notebook
as Markdown cells.
0:06
First, what is the most popular book?
0:13
And then second, are books
0:20
with fewer pages rated higher than
0:26
those with large page counts?
0:31
Now that we've got our initial questions,
0:38
let's start digging into our
data to find the answers.
0:40
If I scroll up a bit, I can see that we
0:43
have average ratings for our books.
0:47
This shows us how users on Goodreads have
rated each book on a scale of one for
0:54
the worst and five for the best.
0:58
It should be pretty easy to see what
the max value currently is in the rating
1:01
column.
1:06
So let's add a cell here.
1:07
From books,
we're gonna get the average rating, and
1:12
we're gonna get the max value, and
we get back a five.
1:19
So the highest rating in
the database is a five.
1:26
Out of curiosity, let's do a quick min.
1:30
And unsurprisingly it's a zero.
1:34
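A minimal sketch of those two checks, assuming the rating column is named `average_rating`. The rows here are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [5.0, 4.57, 0.0],
})

# Largest and smallest values in the rating column.
highest = books["average_rating"].max()
lowest = books["average_rating"].min()
print(highest, lowest)
```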
So we need to see what the five
star rated book or books are.
1:37
So let's change this up.
1:44
Let's do books.loc, L-O-C,
1:46
where the book's
1:51
average rating is equal to 5.0.
1:56
Looks like we get a few books
with five star ratings.
2:03
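The lookup described above might look roughly like this. The column names `average_rating` and `ratings_count` are assumptions, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows; two of them have a perfect score.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [5.0, 4.57, 5.0],
    "ratings_count": [0, 2_000_000, 3],
})

# .loc with a boolean condition keeps only the rows where it is True.
five_star = books.loc[books["average_rating"] == 5.0]
print(five_star)
```

In this toy data the five-star rows have almost no ratings behind them.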
You may think this question is complete,
but
2:10
on further look at the data I see that
there's also a ratings count column.
2:13
This says how many people
have rated a book.
2:20
This is important to add to our
analysis because what if one person
2:23
rates a book as a 5, but
5,000 people rated another book and
2:28
it's at a 4.7,
which is actually more popular?
2:33
The first book here has zero ratings,
so is it really a popular book?
2:38
I think we should go primarily by
the number of reviews to show how many people
2:44
at least read a book, and then use
the ratings as a secondary ranking.
2:49
In my head this makes more sense.
2:54
If a book is popular,
it's probably going to have many reviews.
2:56
And then if it's a good book,
it should have a high rating.
3:00
Let's make a note here, so
we don't lose our thoughts.
3:04
Great, now let's fix our code.
3:28
Let's sort to see the books
with a high number of ratings.
3:30
I think this is a good place to start.
3:34
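One way the sort could be written, assuming the count column is called `ratings_count`. The data is invented for the sketch.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 3.9],
    "ratings_count": [0, 2_000_000, 25_000, 40_000],
})

# Most-rated books first; sort_values returns a new DataFrame.
by_count = books.sort_values(by="ratings_count", ascending=False)
print(by_count)
```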
Let's look at our data now.
3:49
It looks like we have some
Harry Potter books and
3:51
looks like some His Dark Materials,
and quite a few others.
3:54
I think we can agree that these are names
you may recognize more compared to our
4:02
previous results.
4:06
So I think we're getting somewhere.
4:08
I think there's another
layer needed here though.
4:10
Our top book has a rating of 4.57.
4:13
But there may be others that
are rated higher than the one that
4:16
we've currently found like
this one that's rated 4.78.
4:20
I think we need to specify
a rating condition, as in
4:25
a rating should be above 4.0,
or maybe even 4.5.
4:29
Let's try both to see what we get.
4:35
I'm gonna save this as a variable.
4:37
And then popularity, where popularity's
4:43
average rating is greater than 4.0.
4:53
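Here is a sketch of saving the sorted result and filtering it, with `popularity` as the variable name used in the narration. The column names and rows are assumptions.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 3.9],
    "ratings_count": [0, 2_000_000, 25_000, 40_000],
})

# Save the sorted data, then keep only rows rated above 4.0.
popularity = books.sort_values(by="ratings_count", ascending=False)
highly_rated = popularity[popularity["average_rating"] > 4.0]
print(highly_rated)
```

The filter keeps the existing order, so a higher-rated book can still sit below a more-rated one.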
So our results look very similar,
5:02
we still have a book lower
down that's rated higher.
5:05
So let's also sort our values by the
average rating just to make sure we end up
5:10
with the highest rated books at the top.
5:15
I'm gonna set our same variable
5:18
equal to our new filter.
5:23
And then I'm gonna do popularity.sort_values
5:29
by average rating,
5:36
ascending equals False.
5:44
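Put together, the reassignment and second sort might look like the sketch below. The toy rows all have plenty of ratings; on real data, rows with very few ratings could rise to the top unless they are filtered out as well.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [4.57, 4.82, 4.2, 3.5],
    "ratings_count": [2_000_000, 32_000, 50_000, 90_000],
})

popularity = books.sort_values(by="ratings_count", ascending=False)

# Overwrite the variable with the filtered rows so later steps build on it.
popularity = popularity[popularity["average_rating"] > 4.0]

# Highest-rated books first.
ranked = popularity.sort_values(by="average_rating", ascending=False)
print(ranked)
```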
Now with all that put together,
I see The Complete Calvin and
5:49
Hobbes by Bill Watterson, which has
an average rating of 4.82 and
5:54
over 32,000 ratings.
5:58
I think we can call that a popular book.
6:01
And just to make it super clear
I'm gonna add a slice here
6:04
at the end just to get one result.
6:08
There we go.
6:13
That way, we just don't have
a whole bunch of rows there.
6:14
We only need the first one.
6:16
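The slice mentioned here can be sketched as follows, again with made-up rows. `[:1]` keeps the first row of the sorted result, and `.head(1)` would do the same.

```python
import pandas as pd

# Hypothetical rows standing in for the already filtered data.
popularity = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [4.57, 4.82, 4.2],
    "ratings_count": [2_000_000, 32_000, 50_000],
})

# Sort, then slice off everything after the first row.
top_book = popularity.sort_values(by="average_rating", ascending=False)[:1]
print(top_book)
```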
On to the next question.
6:17
Are books with fewer pages rated higher
than those with large page counts?
6:21
This one is a comparison to see if there's
a correlation between the number of pages
6:26
and a book's rating.
6:30
We can filter again to see
the books with low page numbers.
6:32
So let's do books where the book's
6:36
num pages is, let's say, less than 300.
6:41
We also need to organize the books by
rating count to make sure we're getting
6:50
books that have a good amount of
ratings to support their score.
6:55
So I'm gonna set this as few pages.
6:59
And let's do few pages where
7:05
the few pages ratings count
7:10
is, same as we've been
doing before, greater than 1,000, okay?
7:16
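The two filters described so far could be written like this. The names `few_pages`, `num_pages` and `ratings_count` are assumptions based on the narration, and the rows are invented.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "num_pages": [176, 250, 120, 870],
    "average_rating": [4.76, 4.1, 4.9, 4.82],
    "ratings_count": [23_000, 5_000, 12, 32_000],
})

# Keep short books, then drop the ones with too few ratings to trust.
few_pages = books[books["num_pages"] < 300]
few_pages = few_pages[few_pages["ratings_count"] > 1000]
print(few_pages)
```

Book C has the best score of the short books in this toy data, but with only 12 ratings it is filtered out.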
And then lastly, let's make sure
we sort by the average ratings so
7:24
we can see the best one of the bunch.
7:28
So I'm gonna set this equal
to the variable again.
7:32
So it now contains both of our filters and
7:35
then we can do few pages.sort_values
7:42
by the average rating.
7:47
We want ascending equals false.
7:52
And it looks like we got It's
a Magical World which is Calvin and
7:58
Hobbes number 11.
8:03
Rated a 4.76.
8:04
It has 176 pages and 23,000 ratings.
8:06
Same thing here I'm gonna add a slice so
8:08
we just get the first one.
8:14
Just to clear up our notebook a bit.
8:19
Cool, now we need to do the opposite.
8:22
We can do this by
modifying the first line.
8:24
So, I'm just gonna copy all of this,
I'm gonna paste it and then,
8:27
just to be clear with our variable names,
I'm gonna switch this to be many,
8:33
and I think that's all with that, cool.
8:52
And if I run it... oops, we got the same
one. I forgot to change this.
8:54
[LAUGH] This will probably be helpful.
9:01
So we had less than 300 for few pages.
9:03
Let's do greater than 300 for many pages,
and we run it and, Calvin and Hobbes again.
9:06
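The video copies and pastes the cell, which is how the forgotten condition slipped in. One alternative, not shown in the video, is a small helper that takes the page condition as an argument. `top_rated` is a hypothetical name, and the rows and column names are assumptions.

```python
import pandas as pd

def top_rated(books, page_mask, min_ratings=1000):
    """Return the single highest-rated row among the books selected by page_mask."""
    subset = books[page_mask]
    subset = subset[subset["ratings_count"] > min_ratings]
    return subset.sort_values(by="average_rating", ascending=False)[:1]

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "num_pages": [176, 250, 120, 870, 500],
    "average_rating": [4.76, 4.1, 4.9, 4.82, 4.0],
    "ratings_count": [23_000, 5_000, 12, 32_000, 9_000],
})

few = top_rated(books, books["num_pages"] < 300)
many = top_rated(books, books["num_pages"] > 300)

# Gap between the top book in each group, rounded to two decimals.
difference = round(many["average_rating"].iloc[0] - few["average_rating"].iloc[0], 2)
print(few, many, difference)
```

Note that with `< 300` and `> 300`, a book of exactly 300 pages lands in neither group.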
It's our same popular book
that we got previously.
9:12
With an average rating of 4.82,
so between our two books,
9:16
we don't have much of a difference
in the overall rating.
9:20
Between 4.76 and 4.82,
it's, what, 0.06 between the two.
9:24
That's not a lot.
9:29
Let's add a note in here.
9:33
There isn't a large difference
between the book ratings.
9:37
Only 0.06 between the top book
9:49
in each category.
9:56
Now while we don't really see a difference
between these two numbers, a chart might
9:59
better show if there is a correlation and
may just give us a better visualization.
10:03
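Short of a chart, one quick numeric check (not part of the video) is the correlation coefficient between the two columns. This sketch uses invented rows and assumed column names, so the number it prints says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the books dataset.
books = pd.DataFrame({
    "num_pages": [176, 250, 120, 870, 500],
    "average_rating": [4.76, 4.1, 4.9, 4.82, 4.0],
})

# Pearson correlation: near 0 means little linear relationship,
# near 1 or -1 means a strong one.
r = books["num_pages"].corr(books["average_rating"])
print(r)
```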
We won't get into charting
in this workshop, but
10:08
it's a good thing to note in your
analysis for future improvements.
10:11
As a challenge in
the teacher's notes below,
10:22
see if you can find the most
popular book of the 1960s.
10:27
There are some hints in the teacher's
notes to help you out.
10:35
Nice work, Pythonistas,
you've done a ton of code so far.
10:38
Keep it up.
10:42