1 00:00:00,000 --> 00:00:09,320 [MUSIC] 2 00:00:09,320 --> 00:00:11,260 Hello, I'm Craig, and I'm a developer. 3 00:00:11,260 --> 00:00:12,053 In this course, 4 00:00:12,053 --> 00:00:15,404 we're going to be exploring the wonderful data library, pandas. 5 00:00:15,404 --> 00:00:19,013 Now, pandas is a portmanteau, or a combination of two words, 6 00:00:19,013 --> 00:00:21,337 in this case, the words panel and data. 7 00:00:21,337 --> 00:00:23,685 Panel data is data that is multidimensional, 8 00:00:23,685 --> 00:00:25,556 involving measurements over time. 9 00:00:25,556 --> 00:00:29,567 Pandas are also adorable creatures, and I hope that you're here for the former, 10 00:00:29,567 --> 00:00:33,260 but I totally understand that I might have clickbaited you into the latter. 11 00:00:33,260 --> 00:00:37,487 pandas provides fast, flexible, and expressive data structures that have been 12 00:00:37,487 --> 00:00:39,753 designed to make working with relational or 13 00:00:39,753 --> 00:00:42,342 labeled data not only easy, but also intuitive. 14 00:00:42,342 --> 00:00:46,186 It's the fundamental high-level building block for doing practical and 15 00:00:46,186 --> 00:00:48,148 real-world data analysis in Python. 16 00:00:48,148 --> 00:00:51,267 Before we get cooking, let's make sure that we're on the same page. 17 00:00:51,267 --> 00:00:54,321 There are definitely some prerequisites for this course, 18 00:00:54,321 --> 00:00:56,956 so please double check that you're all caught up. 19 00:00:56,956 --> 00:00:59,321 The most important of the prerequisites is NumPy. 20 00:00:59,321 --> 00:01:03,887 I'd like to make sure that you've had a nice introduction to the NumPy library. 21 00:01:03,887 --> 00:01:05,797 pandas relies heavily on NumPy, and 22 00:01:05,797 --> 00:01:10,305 I'm going to assume that you have a basic understanding of its overarching concepts. 
23 00:01:10,305 --> 00:01:13,047 Now, don't worry if it's been a while since you've used it, 24 00:01:13,047 --> 00:01:15,920 we'll retouch upon the concepts that you need here in just a bit. 25 00:01:15,920 --> 00:01:19,468 Don't forget to check the teacher's notes that are attached to each video. 26 00:01:19,468 --> 00:01:21,346 I'll try and remind you to look in there, but 27 00:01:21,346 --> 00:01:23,677 please do get in the habit of checking that section out. 28 00:01:23,677 --> 00:01:27,130 Lots of great information is tucked away in there waiting for you to dig into it. 29 00:01:27,130 --> 00:01:30,091 In this course, I'm gonna try a new approach. 30 00:01:30,091 --> 00:01:34,043 In an effort to give you more practice of how data professionals interact, 31 00:01:34,043 --> 00:01:37,510 I'm going to rely more heavily than usual on Jupyter notebooks. 32 00:01:37,510 --> 00:01:39,326 As you are most likely already aware, 33 00:01:39,326 --> 00:01:42,407 Jupyter notebooks are a great place to capture your learnings. 34 00:01:42,407 --> 00:01:45,212 They're also intended to be used for teaching. 35 00:01:45,212 --> 00:01:48,768 I've gone ahead and built up some interactive content that will assist you 36 00:01:48,768 --> 00:01:50,411 in exploring the pandas library. 37 00:01:50,411 --> 00:01:51,427 In the Treehouse app, 38 00:01:51,427 --> 00:01:54,486 you'll encounter these notebooks as textual instruction steps. 39 00:01:54,486 --> 00:01:57,913 I've included information in the teacher's notes about how to get a hold of 40 00:01:57,913 --> 00:02:00,818 the notebooks so that you can run them and follow along locally. 41 00:02:00,818 --> 00:02:02,850 I'd love for you, as a lifelong learner, 42 00:02:02,850 --> 00:02:06,129 to get in the habit of exploring every notebook that you come across. 43 00:02:06,129 --> 00:02:08,540 Use it to poke around as you learn a new library, 44 00:02:08,540 --> 00:02:11,148 much like you might expect to use the Python shell. 
45 00:02:11,148 --> 00:02:15,386 Explore the API and practice different approaches, and most importantly, 46 00:02:15,386 --> 00:02:16,570 keep your own notes. 47 00:02:16,570 --> 00:02:19,884 A common data science workflow involves multiple stages. 48 00:02:19,884 --> 00:02:22,322 First, you clean the data, and then you analyze and model it. 49 00:02:22,322 --> 00:02:26,520 And finally, you organize the results of the analysis into either a graph or 50 00:02:26,520 --> 00:02:27,052 a table. 51 00:02:27,052 --> 00:02:30,194 Great news, pandas can do all that, the entire workflow. 52 00:02:30,194 --> 00:02:34,165 Even better news, it's really a pleasure to use. 53 00:02:34,165 --> 00:02:38,527 Since you already have a fundamental understanding of the numerical library, 54 00:02:38,527 --> 00:02:41,467 NumPy, pandas is going to feel very familiar to you. 55 00:02:41,467 --> 00:02:45,455 In fact, pandas sits directly on top of NumPy like a little hat. 56 00:02:45,455 --> 00:02:46,703 I don't know about you, but 57 00:02:46,703 --> 00:02:49,875 one of the things that I have trouble with in NumPy is that when I have an array, 58 00:02:49,875 --> 00:02:53,183 I never know just which value is which. 59 00:02:53,183 --> 00:02:58,194 Like for instance, in this array here, I don't really know who got the high score. 60 00:02:58,194 --> 00:03:02,091 I have to remember that Robbie is the first one here at index zero. 61 00:03:02,091 --> 00:03:03,416 But I just have to know that. 62 00:03:03,416 --> 00:03:07,023 pandas gives you a new ability, you can label each value. 63 00:03:07,023 --> 00:03:09,484 It's like a dictionary, a key and a value. 64 00:03:09,484 --> 00:03:11,786 And that works great for a single dimension. 65 00:03:11,786 --> 00:03:16,238 This example is the series of high scores for a single game, Donkey Kong, 66 00:03:16,238 --> 00:03:18,227 labeled by players' initials. 67 00:03:18,227 --> 00:03:21,757 But as you know, we often want to have multidimensional data. 
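The labeling idea from the Donkey Kong example can be sketched in a few lines. This is only an illustration: the initials and score values below are made up, since the transcript does not include the numbers shown on screen.

```python
import numpy as np
import pandas as pd

# Hypothetical Donkey Kong scores; the values and initials are illustrative only.
scores = np.array([874300, 1050200, 960400])

# With a bare NumPy array, you have to remember who sits at each position.
first_score = scores[0]

# A pandas Series attaches a label to each value, much like a dictionary key.
donkey_kong = pd.Series(scores, index=["RB", "CD", "AL"], name="Donkey Kong")

# Look up a score by label instead of by position.
robbie_score = donkey_kong["RB"]

# idxmax returns the label of the largest value, i.e. who got the high score.
top_player = donkey_kong.idxmax()
print(top_player, donkey_kong[top_player])
```

The point of the sketch is that the label travels with the value, so nobody has to remember that Robbie lives at index zero.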
68 00:03:21,757 --> 00:03:25,023 We could track more games by adding a new game dimension, 69 00:03:25,023 --> 00:03:27,076 like we could add Pac-Man scores. 70 00:03:27,076 --> 00:03:29,693 But now we have to remember two indexes, and 71 00:03:29,693 --> 00:03:34,164 I have to remember that index zero is Donkey Kong and index one is Pac-Man. 72 00:03:34,164 --> 00:03:37,641 Now, again, pandas does a great job with labeling. 73 00:03:37,641 --> 00:03:41,629 You can also label each of these columns, so you end up with tabular data. 74 00:03:41,629 --> 00:03:46,010 The two-dimensional data structure here is known as a data frame. 75 00:03:46,010 --> 00:03:49,698 This is a data frame of high scores on multiple games indexed by players' 76 00:03:49,698 --> 00:03:50,307 initials. 77 00:03:50,307 --> 00:03:54,555 And that ought to feel pretty familiar, assuming you've used tabular or 78 00:03:54,555 --> 00:03:58,392 table-based data before, like a spreadsheet or a database table, 79 00:03:58,392 --> 00:04:00,325 anything with rows and columns. 80 00:04:00,325 --> 00:04:03,276 With pandas, you can put any sort of data in there too. 81 00:04:03,276 --> 00:04:05,869 It doesn't have the same restrictions that NumPy did. 82 00:04:05,869 --> 00:04:08,835 pandas also lets you relate datasets by label. 83 00:04:08,835 --> 00:04:09,838 So you can merge and 84 00:04:09,838 --> 00:04:13,611 join together related information in a very straightforward manner. 85 00:04:13,611 --> 00:04:18,167 pandas is a full-featured library, and we simply won't be able to get to all of its 86 00:04:18,167 --> 00:04:20,715 amazing powers in this introductory course. 87 00:04:20,715 --> 00:04:24,476 I do hope to give you a firm foundation and guide you to where you can learn and 88 00:04:24,476 --> 00:04:25,324 practice more. 
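As a rough sketch of the data frame described above, here is a small table of high scores with labeled rows and columns, plus a second table joined to it by label. All scores, initials, and names other than Robbie are invented for illustration, and `join` is just one of several ways pandas offers to relate datasets.

```python
import pandas as pd

# Hypothetical high scores: players' initials label the rows, games label the columns.
high_scores = pd.DataFrame(
    {
        "Donkey Kong": [874300, 1050200, 960400],
        "Pac-Man": [3333360, 2900100, 3100500],
    },
    index=["RB", "CD", "AL"],
)

# Select by row label and column label instead of remembering two positions.
cd_pacman = high_scores.loc["CD", "Pac-Man"]

# Columns can hold different types; this one stores strings (names are made up).
players = pd.DataFrame(
    {"name": ["Robbie", "Craig", "Alex"]},
    index=["RB", "CD", "AL"],
)

# Relate the two datasets through their shared index labels.
combined = high_scores.join(players)
print(combined)
```

Here `join` lines rows up by their index labels, so the result has integer score columns sitting next to a string column in the same table.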
89 00:04:25,324 --> 00:04:29,088 For this course, I'm gonna ask that you imagine that there is a new company 90 00:04:29,088 --> 00:04:32,931 in town jumping in on that social banking app craze, like Cash App or Venmo. 91 00:04:32,931 --> 00:04:34,778 They call themselves Cash Box. 92 00:04:34,778 --> 00:04:39,138 Basically the way that their app works is that a user signs up, chooses a username, 93 00:04:39,138 --> 00:04:43,079 and then they can send money to other users of the system by their username. 94 00:04:43,079 --> 00:04:46,157 Now, a common use case for their app is when it's lunchtime and 95 00:04:46,157 --> 00:04:47,794 people don't have cash on them. 96 00:04:47,794 --> 00:04:51,225 Their users can just send money through Cash Box to the person picking up 97 00:04:51,225 --> 00:04:51,748 the bill. 98 00:04:51,748 --> 00:04:55,284 Now, each user on Cash Box keeps a balance of their funds, and 99 00:04:55,284 --> 00:04:57,394 the app tracks their transactions. 100 00:04:57,394 --> 00:05:03,188 Good news, Cash Box is hiring and they are looking for a junior data scientist. 101 00:05:03,188 --> 00:05:05,064 They've sent out a hiring challenge and 102 00:05:05,064 --> 00:05:07,170 access to a sample of some of their datasets. 103 00:05:07,170 --> 00:05:09,888 So what do you say we explore their datasets and 104 00:05:09,888 --> 00:05:12,127 pick up some job skills along the way? 105 00:05:12,127 --> 00:05:14,090 Let's get ready to rock the Cash Box. 106 00:05:14,090 --> 00:05:18,737 [LAUGH] Good thing we aren't applying to be part of the marketing department.