Analyzing Books with Pandas

Popularity

10:44 with Megan Amendola

Dig into the books dataset to determine the most popular book.

We're going to get into analyzing book popularity. 0:00

I'm gonna start off with a couple of questions I have and 0:03

I'll add these to my notebook as markdown cells. 0:06

First, what is the most popular book? 0:13

And then second, are books 0:20

with fewer pages rated higher than 0:26

those with large page counts? 0:31

Now that we've got our initial questions, 0:38

let's start digging into our data to find the answers. 0:40

If I scroll up a bit, I can see that we 0:43

have average ratings for our books. 0:47

This shows us how users on Goodreads have rated each book on a scale of one for 0:54

the worst and five for the best. 0:58

This one seems pretty easy: we can see what the max value currently is in the rating 1:01

column. 1:06

So let's add a cell here. 1:07

Books, and we're gonna get the average rating, and 1:12

we're gonna get the max value and we get back a five. 1:19

So the highest rating in the database is a five. 1:26

Out of curiosity, let's do a quick min. 1:30

And unsurprisingly it's a zero. 1:34
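
The typed code isn't read out in full, so here is a minimal sketch of the max and min check. The column name `average_rating` and the tiny stand-in DataFrame are assumptions for illustration, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the books data; titles and values are made up.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [5.0, 4.57, 0.0],  # assumed column name
})

highest = books["average_rating"].max()  # largest value in the column
lowest = books["average_rating"].min()   # smallest value in the column
print(highest, lowest)
```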

So we need to see what the five star rated book or books are. 1:37

So let's change this up. 1:44

Let's do books.loc, L-O-C, 1:46

where we're looking for rows where the books' 1:51

average rating is equal to 5.0. 1:56

Looks like we get a few books with five star ratings. 2:03
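
A sketch of what that `.loc` filter might look like. The column names and sample rows are assumptions; only the `books.loc` call and the 5.0 comparison come from the narration.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is much larger.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "average_rating": [5.0, 4.57, 5.0],  # assumed column name
    "ratings_count": [0, 2000, 1],       # assumed column name
})

# The boolean mask keeps only rows whose rating is exactly 5.0.
five_star = books.loc[books["average_rating"] == 5.0]
print(five_star)
```

Note that both matching rows in this sample have almost no ratings, which is the weakness the next part of the video picks up on.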

You may think this question is complete, but 2:10

on further look at the data I see that there's also a ratings count column. 2:13

This says how many people have rated a book. 2:20

This is important to add to our analysis because what if one person 2:23

rates a book as a 5, but 5,000 people rated another book and 2:28

it's at a 4.7, which is actually more popular? 2:33

The first book here has zero ratings, so is it really a popular book? 2:38

I think we should go solely based on the number of reviews to show how many people 2:44

at least read a book, and then use the ratings as a secondary ranking. 2:49

In my head this makes more sense. 2:54

If a book is popular, it's probably going to have many reviews. 2:56

And then if it's a good book, it should have a high rating. 3:00

Let's make a note here, so we don't lose our thoughts. 3:04

Great, now let's fix our code. 3:28

Let's sort to see the books with a high number of ratings. 3:30

I think this is a good place to start. 3:34
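
The code for this step is typed without being read aloud, so the following is only a guess at its shape. It sorts by `ratings_count` (an assumed column name) and also applies a floor of 1,000 ratings, which is inferred from a later remark ("the same as we've been doing before, greater than 1,000") rather than stated here.

```python
import pandas as pd

# Hypothetical sample rows; values are made up.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D"],
    "average_rating": [5.0, 4.57, 4.78, 3.9],
    "ratings_count": [0, 2000000, 25000, 1500],
})

# Assumption: keep books with more than 1,000 ratings...
popular = books[books["ratings_count"] > 1000]
# ...then put the most-rated books first.
popular = popular.sort_values(by="ratings_count", ascending=False)
print(popular)
```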

Let's look at our data now. 3:49

It looks like we have some Harry Potter books and 3:51

looks like some His Dark Materials, and quite a few others. 3:54

I think we can agree that these are names you may recognize more compared to our 4:02

previous results. 4:06

So I think we're getting somewhere. 4:08

I think there's another layer needed here though. 4:10

Our top book has a rating of 4.57. 4:13

But there may be others that are rated higher than the one that 4:16

we've currently found like this one that's rated 4.78. 4:20

I think we need to specify a rating conditional as in 4:25

a rating should be above a 4.0, or maybe even a 4.5. 4:29

Let's try both to see what we get. 4:35

I'm gonna save this as a variable. 4:37

And then, where popularity, 4:43

average rating, is greater than a 4.0. 4:53

So our results look very similar, 5:02

we still have a book lower down that's rated higher. 5:05
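
A sketch of trying both rating cutoffs on the saved variable. The name `popularity` comes from the narration; the way it is built here (ratings floor plus sort) and the column names are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; values are made up.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "average_rating": [5.0, 4.57, 4.78, 3.9, 4.2],
    "ratings_count": [0, 2000000, 25000, 1500, 50000],
})

# Assumed earlier step: widely rated books, most-rated first.
popularity = books[books["ratings_count"] > 1000].sort_values(
    by="ratings_count", ascending=False
)

# Try both cutoffs mentioned in the video.
above_4 = popularity[popularity["average_rating"] > 4.0]
above_4_5 = popularity[popularity["average_rating"] > 4.5]
print(above_4)
print(above_4_5)
```

Because the rows are still ordered by number of ratings, a higher-rated book can sit below a lower-rated one, which is what the sort in the next step addresses.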

So let's also sort our values by the average rating just to make sure we end up 5:10

with the highest rated books at the top. 5:15

I'm gonna set our same variable 5:18

equal to our new filter. 5:23

And then I'm gonna do popularity.sort_values 5:29

by average rating, 5:36

ascending equals False. 5:44

Now with all that put together I have The Complete Calvin and 5:49

Hobbes by Bill Watterson has an average rating of 4.82 and 5:54

has over 32,000 ratings. 5:58

I think we can call that a popular book. 6:01

And just to make it super clear I'm gonna add a slice here 6:04

at the end just to get one result. 6:08

There we go. 6:13

That way, we just don't have a whole bunch of rows there. 6:14

We only need the first one. 6:16
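
Putting the pieces together, the whole popularity cell might look roughly like this. The `sort_values` call with `ascending=False` and the final slice follow the narration; the ratings floor, the column names, and the sample data are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; values are made up.
books = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "average_rating": [5.0, 4.57, 4.78, 3.9, 4.2],
    "ratings_count": [0, 2000000, 25000, 1500, 50000],
})

# Assumed floor so that barely rated books cannot win.
popularity = books[books["ratings_count"] > 1000]
# Reassign the same variable so it carries the rating filter too.
popularity = popularity[popularity["average_rating"] > 4.0]
# Highest average rating first, then slice off the first row only.
top_book = popularity.sort_values(by="average_rating", ascending=False)[:1]
print(top_book)
```

Without the floor, the five-star book with zero ratings would land on top again, so some ratings threshold appears to be needed to get the result described in the video.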

On to the next question. 6:17

Are books with fewer pages rated higher than those with large page counts? 6:21

This one is a comparison to see if there's a correlation between the number of pages 6:26

and a book's rating. 6:30

We can filter again to see the books with low page numbers. 6:32

So let's do books where the books 6:36

num pages is, let's say, less than 300. 6:41

We also need to organize the books by rating count to make sure we're getting 6:50

books that have a good amount of ratings to support their score. 6:55

So I'm gonna set this as few pages. 6:59

And let's do few pages where 7:05

few pages, ratings count. 7:10

And we'll do the same as we've been doing before, greater than 1,000, okay? 7:16

And then lastly, let's make sure we sort by the average ratings so 7:24

we can see the best one of the bunch. 7:28

So I'm gonna set this equal to the variable again. 7:32

So it now contains both of our filters and 7:35

then we can do few_pages.sort_values 7:42

by the average rating. 7:47

We want ascending equals false. 7:52

And it looks like we got It's a Magical World which is Calvin and 7:58

Hobbes number 11. 8:03

Rated a 4.76. 8:04

It has 176 pages and 23,000 ratings. 8:06

Same thing here I'm gonna add a slice so 8:08

we just get the first one. 8:14
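
A sketch of the few-pages cell as described: filter on page count, filter on ratings count, sort by rating, and slice. The variable name `few_pages` is written with an underscore by assumption, and the column names and sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows; values are made up.
books = pd.DataFrame({
    "title": ["Short A", "Short B", "Short C", "Long A"],
    "average_rating": [4.76, 5.0, 4.1, 4.82],
    "ratings_count": [23000, 3, 5000, 32000],
    "num_pages": [176, 120, 250, 400],
})

# Books under 300 pages.
few_pages = books[books["num_pages"] < 300]
# Keep only those with enough ratings to support their score.
few_pages = few_pages[few_pages["ratings_count"] > 1000]
# Best rated first, sliced down to a single row.
top_short = few_pages.sort_values(by="average_rating", ascending=False)[:1]
print(top_short)
```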

Just to clear up our notebook a bit. 8:19

Cool, now we need to do the opposite. 8:22

We can do this by modifying the first line. 8:24

So, I'm just gonna copy all of this, I'm gonna paste it and then, 8:27

just to be clear with our variable names, I'm gonna switch this to be many. 8:33

And I think that's all with that, cool. 8:52

And if I run it, oops, we got the same one, I forgot to change this. 8:54

[LAUGH] This will probably be helpful. 9:01

So we had less than 300 for few pages. 9:03

Let's do greater than 300 for many pages, and we run it, and it's Calvin and Hobbes again. 9:06

It's our same popular book that we got previously. 9:12

With an average rating of 4.82, so between our two books, 9:16

we don't have much of a difference in the overall rating. 9:20

Between 4.76 and 4.82, it's, what, 0.06 between the two. 9:24

That's not a lot. 9:29

Let's add a note in here. 9:33

There isn't a large difference between the book ratings. 9:37

Only 0.06 between the top book 9:49

in each category. 9:56
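
The comparison can be sketched in one place as follows. The helper `top_rated` is hypothetical (the video copies and edits the cell instead), the column names are assumed, and the sample rows only borrow the two ratings quoted in the video.

```python
import pandas as pd

# Hypothetical sample rows; only 4.76 and 4.82 echo the video.
books = pd.DataFrame({
    "title": ["Short A", "Short B", "Long A", "Long B"],
    "average_rating": [4.76, 4.1, 4.82, 4.3],
    "ratings_count": [23000, 5000, 32000, 9000],
    "num_pages": [176, 250, 400, 650],
})


def top_rated(subset, min_ratings=1000):
    """Hypothetical helper: best-rated row among books with enough ratings."""
    enough = subset[subset["ratings_count"] > min_ratings]
    return enough.sort_values(by="average_rating", ascending=False)[:1]


few_pages = top_rated(books[books["num_pages"] < 300])
many_pages = top_rated(books[books["num_pages"] > 300])

# Gap between the top book in each category, rounded to two decimals.
difference = round(
    many_pages["average_rating"].iloc[0] - few_pages["average_rating"].iloc[0], 2
)
print(difference)
```

One detail worth noticing: with `< 300` and `> 300`, a book of exactly 300 pages falls into neither group.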

Now while we don't really see a difference between these two numbers, a chart might 9:59

better show if there is a correlation and may just give us a better visualization. 10:03

We won't get into charting in this workshop, but 10:08

it's a good thing to note in your analysis for future improvements. 10:11
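
Short of a chart, a single correlation coefficient can give a rough numeric check. This goes beyond what the video does; it is a sketch using pandas' `Series.corr` (Pearson by default) on made-up rows with assumed column names.

```python
import pandas as pd

# Hypothetical sample rows; values are made up.
books = pd.DataFrame({
    "num_pages": [176, 120, 250, 400, 650],
    "average_rating": [4.76, 3.9, 4.1, 4.82, 4.2],
})

# Pearson correlation between page count and rating, from -1 to 1.
# Values near 0 suggest little linear relationship.
r = books["num_pages"].corr(books["average_rating"])
print(r)
```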

As a challenge, in the teacher's notes below, 10:22

see if you can find the most popular book of the 1960s. 10:27

There are some hints in the teacher's notes to help you out. 10:35

Nice work, Pythonistas, you've written a ton of code so far. 10:38

Keep it up. 10:42