Python for Data Science: 8 Concepts You May Have Forgotten

One-Line List Comprehension

Writing out a for loop every time you need to define some sort of list is tedious, luckily Python has a built-in way to address this problem in just one line of code. The syntax can be a little hard to wrap your head around but once you get familiar with this technique you’ll use it fairly often.

See the example above and below for how you would normally go about list comprehension with a for loop vs. creating your list with in one simple line with no loops necessary.

x = [1,2,3,4]
out = []
for item in x:
    out.append(item**2)
print(out)

[1, 4, 9, 16]

# vs.

x = [1,2,3,4]
out = [item**2 for item in x]
print(out)

[1, 4, 9, 16]

Lambda Functions

Ever get tired of creating function after function for limited use cases? Lambda functions to the rescue! Lambda functions are used for creating small, one-time and anonymous function objects in Python. Basically, they let you create a function, without creating a function.

The basic syntax of lambda functions is:

lambda arguments: expression

Note that lambda functions can do everything that regular functions can do, as long as there’s just one expression. Check out the simple example below and the upcoming video to get a better feel for the power of lambda functions:

double = lambda x: x * 2
print(double(5))

Map and Filter

Once you have a grasp on lambda functions, learning to pair them with the map and filter functions can be a powerful tool.

Specifically, map takes in a list and transforms it into a new list by performing some sort of operation on each element. In this example, it goes through each element and maps the result of itself times 2 to a new list. Note that the list function simply converts the output to list type.

# Map
seq = [1, 2, 3, 4, 5]
result = list(map(lambda var: var*2, seq))
print(result)

[2, 4, 6, 8, 10]

The filter function takes in a list and a rule, much like map, however it returns a subset of the original list by comparing each element against the boolean filtering rule.

# Filter
seq = [1, 2, 3, 4, 5]
result = list(filter(lambda x: x > 2, seq))
print(result)

[3, 4, 5]

Arange and LinspaceFor creating quick and easy Numpy arrays, look no further than the arange and linspace functions. Each one has their specific purpose, but the appeal here (instead of using range), is that they output NumPy arrays, which are typically easier to work with for data science.Arange returns evenly spaced values within a given interval. Along with a starting and stopping point, you can also define a step size or data type if necessary. Note that the stopping point is a ‘cut-off’ value, so it will not be included in the array output.

# np.arange(start, stop, step)
np.arange(3, 7, 2)

array([3, 5])

Linspace is very similar, but with a slight twist. Linspace returns evenly spaced numbers over a specified interval. So given a starting and stopping point, as well as a number of values, linspace will evenly space them out for you in a NumPy array. This is especially helpful for data visualizations and declaring axes when plotting.

# np.linspace(start, stop, num)
np.linspace(2.0, 3.0, num=5)

array([ 2.0,  2.25,  2.5,  2.75, 3.0])

What Axis Really Means

You may have ran into this when dropping a column in Pandas or summing values in NumPy matrix. If not, then you surely will at some point. Let’s use the example of dropping a column for now:

df.drop('Row A', axis=0)
df.drop('Column A', axis=1)

I don’t know how many times I wrote this line of code before I actually knew why I was declaring axis what I was. As you can probably deduce from above, set axis to 1 if you want to deal with columns and set it to 0 if you want rows. But why is this? My favorite reasoning, or atleast how I remember this:

df.shape

(# of Rows, # of Columns)

Calling the shape attribute from a Pandas dataframe gives us back a tuple with the first value representing the number of rows and the second value representing the number of columns. If you think about how this is indexed in Python, rows are at 0 and columns are at 1, much like how we declare our axis value. Crazy, right?

Concat, Merge, and Join

If you’re familiar with SQL, then these concepts will probably come a lot easier for you. Anyhow, these functions are essentially just ways to combine dataframes in specific ways. It can be difficult to keep track of which is best to use at which time, so let’s review it.

Concat allows the user to append one or more dataframes to each other either below or next to it (depending on how you define the axis).

Merge combines multiple dataframes on specific, common columns that serve as the primary key.

Join, much like merge, combines two dataframes. However, it joins them based on their indices, rather than some specified column.

Check out the excellent Pandas documentation for specific syntax and more concrete examples, as well as some special cases that you may run into.

Pandas Apply

Think of apply as a map function, but made for Pandas DataFrames or more specifically, for Series. If you’re not as familiar, Series are pretty similar to NumPy arrays for the most part.

Apply sends a function to every element along a column or row depending on what you specify. You might imagine how useful this can be, especially for formatting and manipulating values across a whole DataFrame column, without having to loop at all.

Pivot Tables

Last but certainly not least is pivot tables. If you’re familiar with Microsoft Excel, then you’ve probably heard of pivot tables in some respect. The Pandas built-in pivot_table function creates a spreadsheet-style pivot table as a DataFrame. Note that the levels in the pivot table are stored in MultiIndex objects on the index and columns of the resulting DataFrame.

Wrapping up

That’s it for now. I hope a couple of these overviews have effectively jogged your memory regarding important yet somewhat tricky methods, functions, and concepts you frequently encounter when using Python for data science. Personally, I know that even the act of writing these out and trying to explain them in simple terms has helped me out a ton.

If you’re interested in receiving my weekly rundown of interesting articles and resources focused on data science, machine learning, and artificial intelligence, then subscribe to Self Driven Data Science using the form below!

TechiCrunch

Search This Blog