Background
If you have worked with data analysis or data sciences roles you would have worked with the Pandas/numpy libraries in Python which comes in handy. In this post, we will see how to get started on working with the pandas library. This post assumes you have a basic understanding of Python
Installing pandas library
You can install the pandas library using pip
Setting -> Project ->Python interpreter and click on "+" icon to search & add the pandas package.
Once you have installed pandas library you can import it as
- import pandas as pd
and start using it
Data structures supported in Pandas
Pandas library supports 2 main data structures
- Series: One dimensional array with an axis.
- DataFrame: Two dimensional data structure with labelled rows and columns
Series
Let's try to see how Series works. You can simply create a series from a python array
import pandas as pd test_series = pd.Series([11, 12, 13]) print(test_series)
Output is:
0 11
1 12
2 13
dtype: int64
As you can see from the output we have integer index (0,1,2) and 1-dimensional data with values [11,12,13]. You can also have a string index for this one-dimensional array, You can use this index to access the data from Series.
import pandas as pd test_series = pd.Series([11, 12, 13], index=["RowA", "RowB", "RowC"]) print(test_series) print(test_series["RowA"])
Output is:
RowA 11
RowB 12
RowC 13
dtype: int64
11
DataFrame
To create a data frame you can simply pass a dictionary where the key of the dictionary forms the columns and the actual values of that keys from the data.
import pandas as pd df = pd.DataFrame({ "ColA": [11, 12, 13], "Col B": [21, 22, 23], "Col C": [31, 32, 33], }, index=["Row A", "Row B", "Row C"]) print(df)
Output is:
ColA Col B Col C
Row A 11 21 31
Row B 12 22 32
Row C 13 23 33
As with series, the passing index is optional, if you do not pass default behavior is to use integer indexes (0,1,2... etc.). Similarly if you do not assign explicit column names then integer columns are used.
DataFrame support various methods
- df.head(): Gives first 5 rows.
- df.size: Gives number of cells in data frame (no of rows * no of columns). For above example output will be 9.
- df.shape: Gives dimension of data frame. For above example output will be (3,3)
- len(df): Give number of rows in data frame. For above example output will be 3.
For checking the data type and converting the column type we can use below methods:
- df.dtypes : Give data types of columns present in the data frame
- df.astype: Mehtod to convert data type of a column
Consider the following example:
import pandas as pd df = pd.DataFrame({ "ColA": [11, 12, 13], "Col B": [21, 22, 23], "Col C": [31, 32, 33], }, index=["Row A", "Row B", "Row C"]) print(df.dtypes) df = df.astype({"ColA": float}) print(df.dtypes)
Output:
ColA int64
Col B int64
Col C int64
dtype: object
ColA float64
Col B int64
Col C int64
dtype: object
Once you have a dateframe in place you can reference the individual columns and perform analysis on it. A single column referenced from data frame can perform below operations:
- df.col.nunique(): Returns number of unique elements
- df.col.uniqie(): Return actual unique elements
- df.col.mean(): Retuns mean of column values
- df.col.median(): Returns median of column values
- df.col.value_counts(): Return unique values and their counts
Consider below example:
import pandas as pd df = pd.DataFrame({ "Col A": [11, 12, 13], "Col B": [21, 22, 23], "Col C": [31, 32, 33], }, index=["Row A", "Row B", "Row C"]) print(df["Col A"].nunique()) print(df["Col A"].unique()) print(df["Col A"].mean()) print(df["Col A"].median()) print(df["Col A"].value_counts())
Output is:
3
[11 12 13]
12.0
12.0
Col A
11 1
12 1
13 1
Name: count, dtype: int64
Note that column of data frame is actually a series
df = pd.DataFrame({"A":[1,2,3], "C": [1,2,3]}) print(type(df)) print(type(df.A))
Output:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
No comments:
Post a Comment