TL;DR
Python dominates data science in 2026. The core stack — NumPy for arrays, Pandas for tabular data, Matplotlib/Seaborn for visualization, and Scikit-learn for machine learning — runs on every major cloud and integrates with Jupyter notebooks for interactive analysis. For production, wrap your model in FastAPI and track experiments with MLflow.
Key Takeaways
- NumPy ndarray is the foundation of all scientific computing in Python — master broadcasting and vectorization to avoid slow Python loops.
- Pandas DataFrame is the standard for tabular data; learn groupby, merge, and pivot_table for real-world analysis.
- Always split train/test before any preprocessing to prevent data leakage; use Scikit-learn Pipelines to enforce this.
- Matplotlib is the backbone of Python visualization; Seaborn adds statistical plots; Plotly adds interactivity.
- A production ML pipeline goes: EDA → feature engineering → model training → evaluation → FastAPI serving → MLflow tracking.
- Python beats R, Julia, and MATLAB for ecosystem breadth, deployment, and community support in 2026.
- Polars is a faster Pandas alternative for large datasets; RAPIDS provides GPU acceleration for NumPy/Pandas workflows.
- Jupyter notebooks excel at exploration and sharing; VS Code combines notebook interactivity with full IDE power.
1. Python Data Science Stack Overview
Python became the dominant language for data science because it combines a gentle learning curve with an unmatched library ecosystem. Unlike specialized tools such as R or MATLAB, Python is a general-purpose language — the same code that cleans your dataset can serve a REST API, scrape websites, or orchestrate cloud jobs. This versatility means data science skills transfer directly to production engineering.
The modern Python data science stack layers cleanly on top of each other:
| Library | Purpose | Key Objects | Install |
|---|---|---|---|
| NumPy | N-dimensional array operations, linear algebra | ndarray | pip install numpy |
| Pandas | Tabular data manipulation and analysis | Series, DataFrame | pip install pandas |
| Matplotlib | Static 2D/3D plotting (the base layer) | Figure, Axes | pip install matplotlib |
| Seaborn | Statistical data visualization on top of Matplotlib | FacetGrid, PairGrid | pip install seaborn |
| Plotly | Interactive charts for browser/notebooks | Figure, go.Scatter | pip install plotly |
| Scikit-learn | ML algorithms, preprocessing, pipelines | Pipeline, BaseEstimator | pip install scikit-learn |
| Jupyter | Interactive notebook environment | Notebook, JupyterLab | pip install jupyterlab |
| SciPy | Scientific algorithms (stats, optimization, signal) | sparse, linalg | pip install scipy |
Install the full stack in one shot:
# Create a dedicated environment (recommended)
conda create -n ds python=3.12
conda activate ds
# Install via conda-forge (binary wheels, no compilation)
conda install -c conda-forge numpy pandas matplotlib seaborn plotly scikit-learn jupyterlab scipy
# Or install via pip
pip install numpy pandas matplotlib seaborn plotly scikit-learn jupyterlab scipy
# Verify versions
python -c "import numpy as np; import pandas as pd; print(np.__version__, pd.__version__)"
The Anaconda distribution bundles most of these libraries and is recommended for beginners. For reproducible team environments, use conda env export > environment.yml or pin versions in requirements.txt.
2. NumPy Arrays — The Foundation of Scientific Python
NumPy's ndarray is a fixed-type, contiguous-memory array that enables vectorized operations 10–100x faster than Python lists. Every major data science library — Pandas, Scikit-learn, TensorFlow, PyTorch — stores data as NumPy arrays or compatible memory buffers.
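The speed claim is easy to check yourself. The sketch below times a pure-Python loop against the equivalent vectorized call on the same data; exact numbers vary by machine, but the two versions provably compute the same result:

```python
import timeit
import numpy as np

data = list(range(100_000))
arr = np.array(data)

def loop_square(xs):
    # Pure Python: one interpreter dispatch per element
    return [x * x for x in xs]

def vector_square(a):
    # NumPy: a single C-level loop over the contiguous buffer
    return a * a

# Both versions compute the same values
assert loop_square(data) == vector_square(arr).tolist()

loop_t = timeit.timeit(lambda: loop_square(data), number=20)
vec_t = timeit.timeit(lambda: vector_square(arr), number=20)
print(f"loop {loop_t:.3f}s  vectorized {vec_t:.3f}s  ~{loop_t / vec_t:.0f}x faster")
```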
Creating Arrays
import numpy as np
# From Python list
a = np.array([1, 2, 3, 4, 5]) # dtype=int64
b = np.array([1.0, 2.0, 3.0]) # dtype=float64
# Built-in constructors
zeros = np.zeros((3, 4)) # 3x4 of 0.0
ones = np.ones((2, 3), dtype=np.int32) # 2x3 of 1 (int)
eye = np.eye(4) # 4x4 identity matrix
range_arr = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
linear = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Random arrays (fixed seed for reproducibility)
rng = np.random.default_rng(seed=42) # preferred new API
rand_float = rng.random((3, 3)) # uniform [0, 1)
rand_normal = rng.normal(loc=0, scale=1, size=(100, 5))
rand_int = rng.integers(0, 100, size=10)
# Shape and dtype
print(rand_normal.shape) # (100, 5)
print(rand_normal.dtype) # float64
print(rand_normal.ndim) # 2
print(rand_normal.size) # 500 (total elements)
Vectorized Operations and Broadcasting
Vectorized operations apply element-wise without explicit Python loops. Broadcasting automatically extends arrays of different shapes so they can operate together:
a = np.array([1, 2, 3, 4, 5])
# Element-wise arithmetic (no loop needed)
print(a * 2) # [2, 4, 6, 8, 10]
print(a ** 2) # [1, 4, 9, 16, 25]
print(np.sqrt(a)) # [1.0, 1.414, 1.732, 2.0, 2.236]
# Broadcasting: (5,) and (5,1) produce (5,5)
col = a.reshape(-1, 1) # shape (5,1)
result = a + col # shape (5,5) — outer sum
print(result)
# [[2 3 4 5 6]
# [3 4 5 6 7]
# [4 5 6 7 8]
# [5 6 7 8 9]
# [6 7 8 9 10]]
# Aggregation along axes
matrix = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(matrix.sum(axis=0)) # [12, 15, 18] — column sums
print(matrix.sum(axis=1)) # [6, 15, 24] — row sums
print(matrix.mean()) # 5.0 — global mean
# Boolean indexing
data = rng.normal(0, 1, 1000)
positives = data[data > 0] # all positive values
outliers = data[np.abs(data) > 2] # values beyond 2 sigma
print(f"Outlier count: {len(outliers)}")
Reshape and Advanced Indexing
arr = np.arange(24)
# Reshape without copying data
matrix = arr.reshape(4, 6) # shape (4, 6)
cube = arr.reshape(2, 3, 4) # shape (2, 3, 4)
flat = cube.ravel() # back to 1D (view when possible)
# Slicing: [rows, cols]
print(matrix[1:3, 2:5]) # rows 1-2, cols 2-4
print(matrix[:, -1]) # last column
print(cube[0, :, :]) # first "layer" of 3D array
# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 4])
print(matrix[rows, cols]) # elements (0,1) and (2,4)
# np.where — vectorized if-else
result = np.where(matrix > 12, matrix, -1) # keep if >12, else -1
# Stacking and splitting
a = np.ones((3, 4))
b = np.zeros((3, 4))
v_stack = np.vstack([a, b]) # shape (6, 4)
h_stack = np.hstack([a, b]) # shape (3, 8)
parts = np.hsplit(h_stack, 2) # split into two (3, 4) arrays
3. Pandas DataFrames — Tabular Data Mastery
Pandas introduces two core data structures: Series (a labeled 1D array) and DataFrame (a labeled 2D table). DataFrames are the workhorse of data science — reading CSVs, filtering records, joining tables, computing aggregates, and preparing features for machine learning all happen through the Pandas API.
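Before reading files, it helps to see both structures built directly from Python objects. A minimal sketch (column names and values are made up for illustration):

```python
import pandas as pd

# A Series is a labeled 1D array
s = pd.Series([25, 32, 41], index=["alice", "bob", "carol"], name="age")
print(s["bob"])  # 32 — lookup by label, not position

# A DataFrame is a collection of aligned Series (one per column)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 32, 41],
    "city": ["NYC", "LA", "NYC"],
})
print(df.shape)          # (3, 3)
print(df["age"].mean())  # column-wise aggregate
```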
Loading and Inspecting Data
import pandas as pd
# Reading data
df = pd.read_csv("data.csv") # CSV
df = pd.read_csv("data.csv", parse_dates=["date"]) # parse dates
df = pd.read_parquet("data.parquet") # Parquet (faster)
df = pd.read_excel("report.xlsx", sheet_name="Q1") # Excel
df = pd.read_json("records.json", orient="records") # JSON
# Quick inspection
print(df.shape) # (rows, columns)
print(df.dtypes) # column data types
print(df.head(5)) # first 5 rows
print(df.tail(3)) # last 3 rows
print(df.info()) # non-null counts + dtypes
print(df.describe()) # count, mean, std, quartiles
print(df.isnull().sum()) # missing values per column
print(df.nunique()) # unique values per column
# Selecting columns
age_series = df["age"] # Series
subset = df[["name", "age", "city"]] # DataFrame with 3 columns
# Row selection
row0 = df.iloc[0] # first row by position
row_a = df.loc[df.index[0]] # first row by label
filtered = df[df["age"] > 30] # boolean filter
Filtering, Sorting, and Transforming
# Filtering with multiple conditions
high_earners = df[(df["salary"] > 80000) & (df["city"] == "NYC")]
young_or_senior = df[(df["age"] < 25) | (df["age"] > 60)]
not_na = df[df["email"].notna()]
# query() string syntax (more readable for complex filters)
target_city = "NYC"
result = df.query("salary > 80000 and city == @target_city")  # @ references local variables
# Sorting
df_sorted = df.sort_values("salary", ascending=False)
df_multi = df.sort_values(["city", "salary"], ascending=[True, False])
# Adding / modifying columns
df["annual_bonus"] = df["salary"] * 0.10
df["full_name"] = df["first_name"] + " " + df["last_name"]
df["age_group"] = pd.cut(df["age"], bins=[0,25,40,60,100],
labels=["young","mid","senior","elder"])
# apply() for custom logic
df["score_grade"] = df["score"].apply(lambda x: "A" if x >= 90 else ("B" if x >= 80 else "C"))
# assign() for chaining (returns new DataFrame)
df2 = (
df
.assign(salary_k=lambda d: d["salary"] / 1000)
.assign(tax=lambda d: d["salary_k"] * 0.3)
.query("salary_k > 50")
.sort_values("tax", ascending=False)
.reset_index(drop=True)
)
GroupBy and Aggregation
# Basic groupby
grouped = df.groupby("department")["salary"]
print(grouped.mean()) # average salary per department
print(grouped.agg(["mean","median","std","count"]))
# Multiple columns
df.groupby(["department","level"])["salary"].mean().unstack()
# Custom aggregation
agg_result = df.groupby("city").agg(
avg_salary = ("salary", "mean"),
median_age = ("age", "median"),
total_people = ("name", "count"),
min_exp = ("experience", "min"),
)
print(agg_result)
# pivot_table — spreadsheet-style summaries
pivot = df.pivot_table(
values="revenue",
index="product",
columns="quarter",
aggfunc="sum",
fill_value=0
)
# Merge / join DataFrames
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
joined = orders.merge(customers, on="customer_id", how="left")
combined = pd.concat([df_2023, df_2024], ignore_index=True)
# Wide → Long and Long → Wide
long = df.melt(id_vars=["id","name"], var_name="metric", value_name="value")
wide = long.pivot(index="id", columns="metric", values="value")
4. Data Visualization — Matplotlib, Seaborn, and Plotly
Effective visualization communicates insights that tables cannot. Python offers three complementary visualization layers: Matplotlib for fine-grained control, Seaborn for statistical plots with beautiful defaults, and Plotly for interactive charts that work in notebooks and the browser.
Matplotlib Fundamentals
import matplotlib.pyplot as plt
import numpy as np
# The object-oriented API (preferred for complex plots)
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
# Line plot
x = np.linspace(0, 10, 200)
axes[0, 0].plot(x, np.sin(x), color="#1e40af", linewidth=2, label="sin(x)")
axes[0, 0].plot(x, np.cos(x), color="#dc2626", linestyle="--", label="cos(x)")
axes[0, 0].set_title("Trigonometric Functions")
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)
# Histogram
data = np.random.normal(50, 15, 1000)
axes[0, 1].hist(data, bins=30, color="#6366f1", edgecolor="white", alpha=0.8)
axes[0, 1].set_title("Distribution")
axes[0, 1].set_xlabel("Value")
axes[0, 1].set_ylabel("Frequency")
# Scatter plot
axes[1, 0].scatter(df["experience"], df["salary"],
c=df["age"], cmap="viridis", alpha=0.6, s=40)
axes[1, 0].set_title("Experience vs Salary")
# Bar chart
categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 67]
bars = axes[1, 1].bar(categories, values, color=["#0ea5e9","#10b981","#f59e0b","#ef4444"])
axes[1, 1].bar_label(bars, fmt="%.0f")
axes[1, 1].set_title("Category Comparison")
plt.tight_layout()
plt.savefig("overview.png", dpi=150, bbox_inches="tight")
plt.show()
Seaborn Statistical Plots
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid", palette="husl")
# Load sample dataset
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")
# Distribution plot with KDE
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
sns.histplot(tips["total_bill"], kde=True, ax=axes[0], color="#6366f1")
axes[0].set_title("Total Bill Distribution")
# Box plot — compare distributions by category
sns.boxplot(x="day", y="total_bill", hue="sex", data=tips, ax=axes[1])
axes[1].set_title("Bill by Day and Gender")
# Violin plot — shows distribution shape
sns.violinplot(x="day", y="tip", data=tips, ax=axes[2])
axes[2].set_title("Tip Distribution by Day")
plt.tight_layout(); plt.show()
# Heatmap — correlation matrix
corr = iris.drop("species", axis=1).corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
vmin=-1, vmax=1, center=0)
plt.title("Feature Correlations")
plt.show()
# Pairplot — all feature combinations
sns.pairplot(iris, hue="species", diag_kind="kde")
plt.show()
Plotly Interactive Charts
import plotly.express as px
import plotly.graph_objects as go
# Plotly Express — high-level API
fig = px.scatter(
df, x="experience", y="salary",
color="department", size="age",
hover_data=["name", "city"],
title="Salary vs Experience",
template="plotly_white"
)
fig.update_traces(marker=dict(opacity=0.7))
fig.show() # opens in browser or renders in notebook
# Interactive histogram
fig2 = px.histogram(df, x="salary", nbins=40, color="department",
barmode="overlay", opacity=0.7)
fig2.show()
# Choropleth map (using Plotly's built-in gapminder sample dataset)
gapminder = px.data.gapminder()
fig3 = px.choropleth(
gapminder, locations="iso_alpha", color="gdpPercap",
hover_name="country", animation_frame="year",
color_continuous_scale="Viridis"
)
fig3.show()
# Low-level go API for custom charts (assumes a `prices` DataFrame with OHLC columns)
fig4 = go.Figure()
fig4.add_trace(go.Candlestick(
x=prices["date"], open=prices["open"],
high=prices["high"], low=prices["low"],
close=prices["close"], name="OHLC"
))
fig4.update_layout(title="Stock Price", xaxis_rangeslider_visible=False)
fig4.show()
# Export to static image or HTML
fig.write_html("chart.html") # shareable HTML with JS
fig.write_image("chart.png", scale=2) # needs kaleido package
5. Data Cleaning — Handling NaN, Duplicates, and Type Conversion
Data scientists spend 60–80% of their time cleaning data. Messy real-world datasets contain missing values, duplicate rows, inconsistent formats, erroneous entries, and wrong data types. Systematic cleaning is essential before any analysis or modeling.
import pandas as pd
import numpy as np
df = pd.read_csv("raw_data.csv")
# ── 1. Inspect Missing Values ───────────────────────────────────
print(df.isnull().sum()) # count per column
print(df.isnull().mean() * 100) # percentage missing
# Visualize with heatmap
import seaborn as sns
sns.heatmap(df.isnull(), cbar=False, yticklabels=False)
# ── 2. Handle Missing Values ─────────────────────────────────────
df = df.dropna(subset=["critical_column"])                 # drop rows where key column is NA
df["salary"] = df["salary"].fillna(df["salary"].median())  # fill with median
df["category"] = df["category"].fillna("Unknown")          # fill categorical with placeholder
df["measurement"] = df["measurement"].ffill()              # forward-fill time series (fillna(method=) is deprecated)
# Multiple strategies at once
df = df.fillna({
"age": df["age"].median(),
"city": "Unknown",
"salary": df.groupby("department")["salary"].transform("mean"),
})
# ── 3. Remove Duplicates ─────────────────────────────────────────
print(df.duplicated().sum()) # count duplicate rows
df = df.drop_duplicates() # drop exact duplicates
df = df.drop_duplicates(subset=["email"]) # deduplicate by email
df = df.drop_duplicates(subset=["email"], keep="last") # keep most recent
# ── 4. Fix Data Types ───────────────────────────────────────────
df["hire_date"] = pd.to_datetime(df["hire_date"], format="%Y-%m-%d")
df["salary"] = pd.to_numeric(df["salary"], errors="coerce") # NaN for bad values
df["employee_id"] = df["employee_id"].astype(str)
df["score"] = df["score"].astype("float32") # save memory
df["dept_code"] = df["dept_code"].astype("category") # memory efficient
# ── 5. String Normalization ──────────────────────────────────────
df["city"] = df["city"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
df["email"] = df["email"].str.lower().str.strip()
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True) # digits only
# Standardize categorical values
city_map = {"new york": "New York", "ny": "New York", "nyc": "New York"}
df["city"] = df["city"].map(city_map).fillna(df["city"])
# ── 6. Outlier Detection and Treatment ──────────────────────────
Q1, Q3 = df["salary"].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df_clean = df[df["salary"].between(lower, upper)] # remove outliers
df["salary"] = df["salary"].clip(lower, upper) # or clip them
# Z-score method
from scipy import stats
subset = df[["salary", "age"]].dropna()
z_scores = np.abs(stats.zscore(subset))
df_no_outliers = df.loc[subset.index[(z_scores < 3).all(axis=1)]]  # keep rows within 3 sigma
6. Machine Learning with Scikit-learn
Scikit-learn provides a consistent fit/transform/predict API across dozens of algorithms. Its Pipeline class chains preprocessing and modeling into a single object, preventing data leakage and enabling clean cross-validation.
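That uniformity means estimators are interchangeable without touching the surrounding code. A minimal sketch on synthetic data (the target rule is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic, linearly separable target

# Transformers share fit()/transform()
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Estimators share fit()/predict()/score(), so any model drops into the same loop
for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=3)):
    model.fit(X_scaled, y)
    print(type(model).__name__, round(model.score(X_scaled, y), 3))
```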
Train-Test Split and Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
# Load data
df = pd.read_csv("titanic.csv")
X = df.drop("survived", axis=1)
y = df["survived"]
# Train/test split — ALWAYS before any fitting
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Identify column types
numeric_cols = X.select_dtypes(include=["int64","float64"]).columns.tolist()
categoric_cols = X.select_dtypes(include=["object","category"]).columns.tolist()
# Build preprocessing pipelines per column type
numeric_pipe = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
])
categoric_pipe = Pipeline([
("impute", SimpleImputer(strategy="constant", fill_value="missing")),
("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
("num", numeric_pipe, numeric_cols),
("cat", categoric_pipe, categoric_cols),
])
print(f"Training on {X_train.shape[0]} samples")
print(f"Testing on {X_test.shape[0]} samples")
Training and Evaluating Models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import (
classification_report, confusion_matrix,
roc_auc_score, roc_curve, ConfusionMatrixDisplay
)
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
# Full pipeline: preprocessing + model
pipeline = Pipeline([
("prep", preprocessor),
("model", RandomForestClassifier(n_estimators=200, random_state=42))
])
# Train
pipeline.fit(X_train, y_train)
# Evaluate
y_pred = pipeline.predict(X_test)
y_proba = pipeline.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
# Confusion matrix
fig, ax = plt.subplots(figsize=(6,5))
ConfusionMatrixDisplay.from_estimator(pipeline, X_test, y_test, ax=ax)
plt.title("Confusion Matrix")
plt.tight_layout()
# Cross-validation (more robust estimate)
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"CV ROC-AUC: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
# Hyperparameter tuning with GridSearchCV
param_grid = {
"model__n_estimators": [100, 200, 300],
"model__max_depth": [None, 10, 20],
"model__min_samples_split": [2, 5],
}
grid_search = GridSearchCV(
pipeline, param_grid,
cv=5, scoring="roc_auc", n_jobs=-1, verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Save model
import joblib
joblib.dump(grid_search.best_estimator_, "model.pkl")
loaded = joblib.load("model.pkl")
7. Jupyter Notebooks and VS Code for Data Science
The notebook environment revolutionized data science by interleaving code, narrative text, math equations, and rich output (plots, tables) in a single document. In 2026, JupyterLab and VS Code's Jupyter extension are the two dominant environments.
Essential Jupyter Magic Commands
# ── Line magics (prefix single line) ───────────────────────────
%time my_function() # time a single statement
%timeit my_function() # average over multiple runs
%run analysis.py # run external Python script in kernel
%matplotlib inline # render Matplotlib plots inline
%matplotlib widget # interactive plots with pan/zoom
%load data_loader.py # paste contents of file into cell
%who # list all variables in namespace
%whos # detailed variable info (type, size)
%history # show command history
%lsmagic # list all available magic commands
# ── Cell magics (prefix entire cell) ──────────────────────────
%%time # time entire cell
%%timeit # benchmark entire cell
%%bash # run cell as shell script
%%writefile config.py # write cell contents to file
%%capture output # capture stdout/stderr
%%html # render cell as HTML
# ── Debugging ──────────────────────────────────────────────────
%debug # drop into debugger after exception
%pdb on # auto-enter debugger on any exception
# ── Profile performance ────────────────────────────────────────
%prun my_slow_function() # function-level profiling via cProfile
%lprun -f my_function my_function() # line-by-line profiling (needs line_profiler)
Notebook Best Practices
# ── 1. Restart and Run All before sharing ──────────────────────
# Kernel → Restart & Run All ensures reproducibility
# Hidden state from deleted cells causes silent bugs
# ── 2. Use Papermill for parameterized notebooks ────────────────
# pip install papermill
import papermill as pm
pm.execute_notebook(
"template.ipynb",
"output_2024_Q4.ipynb",
parameters={"year": 2024, "quarter": "Q4"}
)
# ── 3. Display formatting helpers ───────────────────────────────
from IPython.display import display, HTML, Markdown
display(df.head().style.highlight_max(color="lightgreen")) # styled DataFrame
display(HTML("<b>Custom HTML in notebook</b>"))
display(Markdown("## Heading in Markdown cell via code"))
# ── 4. Pandas display options ────────────────────────────────────
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.4f}".format)
pd.set_option("display.width", 120)
# ── 5. nbconvert — export to HTML / PDF / slides ─────────────────
# jupyter nbconvert --to html analysis.ipynb
# jupyter nbconvert --to slides analysis.ipynb --post serve
# ── 6. nbformat — programmatic notebook creation ─────────────────
import nbformat
nb = nbformat.v4.new_notebook()
cells = [
nbformat.v4.new_markdown_cell("# Auto-Generated Notebook"),
nbformat.v4.new_code_cell("import pandas as pd\nprint(pd.__version__)"),
]
nb.cells = cells
with open("generated.ipynb", "w") as f:
nbformat.write(nb, f)
8. Python vs R vs Julia vs MATLAB — Data Science Language Comparison
Choosing the right language depends on your team's background, production requirements, and the specific domain. Here is a definitive 2026 comparison across the most important dimensions:
| Criterion | Python | R | Julia | MATLAB |
|---|---|---|---|---|
| Learning Curve | Gentle — general-purpose syntax | Moderate — data-first idioms | Steep — type system, macros | Moderate — proprietary syntax |
| ML Ecosystem | Excellent (Sklearn, PyTorch, TF) | Good (caret, tidymodels) | Growing (Flux.jl, MLJ.jl) | Limited (Statistics Toolbox) |
| Statistical Analysis | Good (scipy, statsmodels) | Excellent — built for stats | Good (HypothesisTests.jl) | Good (Statistics Toolbox) |
| Data Visualization | Excellent (Matplotlib, Plotly) | Excellent (ggplot2, Shiny) | Good (Plots.jl, Makie.jl) | Good (built-in) |
| Raw Performance | Good with NumPy/Cython | Slow in pure-R loops (vectorize or use Rcpp) | Excellent — near C speed | Good (JIT in some ops) |
| Deep Learning | Best (PyTorch, TensorFlow, JAX) | Limited (torch via reticulate) | Growing (Flux.jl) | Limited (Deep Learning Toolbox) |
| Big Data / Cloud | Excellent (PySpark, Dask, Ray) | Limited (sparklyr) | Emerging | Poor |
| Production Deployment | Excellent (FastAPI, Flask) | Limited (Plumber, Shiny) | Growing (Genie.jl) | Poor (MCR required) |
| Cost | Free (open source) | Free (open source) | Free (open source) | Expensive ($$$) |
| Community Size | Largest globally | Large in academia/biostat | Small but growing | Shrinking |
| Job Market Demand | Highest — by far | High in academia/pharma | Niche (HPC, finance) | Declining |
| Best For | General DS, ML, production | Statistics, academia, bioinfo | Scientific computing, HPC | Engineering, signal processing |
Verdict 2026: Python is the clear winner for most data science work, especially when production deployment matters. Choose R when your team is statistician-heavy or the project is purely academic/biostatistical. Choose Julia only if you need C-level numeric performance in scientific computing. Avoid MATLAB for new projects due to cost and declining community.
9. Real-World Pipeline: EDA → Feature Engineering → Model → FastAPI Deployment
A production data science project has five stages: exploratory data analysis (EDA), feature engineering, model training, evaluation, and deployment. Here is a complete end-to-end example for a customer churn prediction model.
Stage 1: Exploratory Data Analysis
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("churn_data.csv")
# Target distribution
print(df["churned"].value_counts(normalize=True))
# churned
# 0 0.855 (85.5% retained)
# 1 0.145 (14.5% churned)
# Numeric feature summary
df.describe().T.sort_values("std", ascending=False)
# Correlation with target
numeric_df = df.select_dtypes(include="number")
corr_with_target = numeric_df.corr()["churned"].sort_values(key=abs, ascending=False)
print(corr_with_target.head(10))
# Plot distributions split by target
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes.flat, ["tenure","monthly_charges","total_charges","num_products","support_calls","age"]):
sns.histplot(df, x=col, hue="churned", kde=True, stat="density",
common_norm=False, ax=ax, palette={0:"#6366f1",1:"#ef4444"})
plt.tight_layout()
plt.savefig("eda_distributions.png", dpi=120)
Stage 2: Feature Engineering
import numpy as np
# Domain-driven features
df["charge_per_month"] = df["total_charges"] / (df["tenure"] + 1)
df["products_per_month"] = df["num_products"] / (df["tenure"] + 1)
df["support_rate"] = df["support_calls"] / (df["tenure"] + 1)
df["contract_is_monthly"] = (df["contract_type"] == "Month-to-month").astype(int)
df["tenure_bucket"] = pd.cut(df["tenure"], bins=[0,6,12,24,60,120],
labels=["0-6m","6-12m","1-2y","2-5y","5y+"])
# Log-transform skewed features
for col in ["total_charges","monthly_charges"]:
df[f"log_{col}"] = np.log1p(df[col].clip(lower=0))
# Interaction feature
df["high_spend_short_tenure"] = ((df["monthly_charges"] > 80) & (df["tenure"] < 12)).astype(int)
# One-hot encode categoricals
df = pd.get_dummies(df, columns=["contract_type","payment_method","tenure_bucket"],
drop_first=True, dtype=int)
# Define final feature set
FEATURES = [c for c in df.columns if c not in ["churned","customer_id"]]
X = df[FEATURES]
y = df["churned"]
Stage 3: Model Training with MLflow Tracking
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, f1_score
import mlflow
import mlflow.sklearn
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
mlflow.set_experiment("churn-prediction")
with mlflow.start_run(run_name="GBM-v1"):
# Log parameters
params = {"n_estimators": 300, "learning_rate": 0.05, "max_depth": 5, "subsample": 0.8}
mlflow.log_params(params)
# Build and train pipeline
pipe = Pipeline([
("scaler", StandardScaler()),
("model", GradientBoostingClassifier(**params, random_state=42))
])
pipe.fit(X_train, y_train)
# Evaluate
y_pred = pipe.predict(X_test)
y_proba = pipe.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_proba)
f1 = f1_score(y_test, y_pred)
# Log metrics
mlflow.log_metric("roc_auc", auc)
mlflow.log_metric("f1_score", f1)
# Log model artifact
mlflow.sklearn.log_model(pipe, "churn_model")
print(f"ROC-AUC: {auc:.4f} F1: {f1:.4f}")
# Feature importance
import pandas as pd
importances = pd.Series(
pipe.named_steps["model"].feature_importances_,
index=FEATURES
).sort_values(ascending=False)
print(importances.head(15))
Stage 4: FastAPI Deployment
# main.py — FastAPI serving for the churn model
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import joblib
import pandas as pd
import numpy as np
from typing import Optional
# Load model once at startup
app = FastAPI(title="Churn Prediction API", version="1.0.0")
model = joblib.load("churn_model.pkl")
# Request schema
class CustomerFeatures(BaseModel):
tenure: float
monthly_charges: float
total_charges: float
num_products: int
support_calls: int
contract_type: str
payment_method: str
age: Optional[float] = None
# Response schema
class PredictionResponse(BaseModel):
customer_id: str
churn_probability: float
churn_predicted: bool
risk_tier: str
@app.post("/predict", response_model=PredictionResponse)
def predict_churn(customer_id: str, features: CustomerFeatures):
try:
# Convert to a one-row DataFrame (model_dump() replaces .dict() in Pydantic v2)
data = pd.DataFrame([features.model_dump()])
# Feature engineering (must mirror training)
data["charge_per_month"] = data["total_charges"] / (data["tenure"] + 1)
data["support_rate"] = data["support_calls"] / (data["tenure"] + 1)
data["log_total_charges"] = np.log1p(data["total_charges"].clip(lower=0))
data["log_monthly_charges"]= np.log1p(data["monthly_charges"].clip(lower=0))
data["contract_is_monthly"]= (data["contract_type"] == "Month-to-month").astype(int)
# Predict
proba = float(model.predict_proba(data)[0, 1])
risk = "high" if proba > 0.6 else ("medium" if proba > 0.3 else "low")
return PredictionResponse(
customer_id=customer_id,
churn_probability=round(proba, 4),
churn_predicted=proba > 0.5,
risk_tier=risk
)
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@app.get("/health")
def health_check():
return {"status": "healthy", "model_loaded": model is not None}
# Run with: uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4
Dockerizing the API
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install dependencies first (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY churn_model.pkl .
COPY main.py .
# Expose and run
EXPOSE 8000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
# requirements.txt
# fastapi==0.111.0
# uvicorn[standard]==0.30.1
# scikit-learn==1.5.0
# pandas==2.2.2
# numpy==1.26.4
# joblib==1.4.2
# pydantic==2.7.1
# Build and run
# docker build -t churn-api .
# docker run -p 8000:8000 churn-api
# Test with curl
# curl -X POST "http://localhost:8000/predict?customer_id=C123" \
#   -H "Content-Type: application/json" \
#   -d '{"tenure":12,"monthly_charges":85,"total_charges":1020,
#        "num_products":2,"support_calls":5,"contract_type":"Month-to-month",
#        "payment_method":"Electronic check"}'
10. Advanced Topics — Polars, GPU Acceleration, and Distributed Computing
Polars — The Faster Pandas Alternative
import polars as pl
# Polars uses lazy evaluation — build a plan, then execute
result = (
pl.scan_csv("large_file.csv") # lazy — no data loaded yet
.filter(pl.col("salary") > 50000)
.group_by("department") # renamed from .groupby in newer Polars versions
.agg([
pl.col("salary").mean().alias("avg_salary"),
pl.col("name").count().alias("headcount"),
])
.sort("avg_salary", descending=True)
.collect() # execute the optimized query
)
print(result)
# Eager API (like Pandas)
df = pl.read_csv("data.csv")
df2 = df.with_columns([
(pl.col("salary") * 1.1).alias("new_salary"),
pl.col("name").str.to_uppercase().alias("name_upper"),
])
# Polars is much faster for large DataFrames:
# Pandas: 100M rows groupby = ~45s
# Polars: 100M rows groupby = ~3s (using all cores)
Dask — Distributed Pandas on Large Datasets
import dask.dataframe as dd
# Reads only metadata, lazy evaluation
ddf = dd.read_csv("large_folder/*.csv")
# Pandas-compatible API
result = (
    ddf[ddf["salary"] > 50000]
    .groupby("department")["salary"]
    .mean()
    .compute()  # trigger parallel execution
)
# Scale to a cluster
from dask.distributed import Client
client = Client("scheduler-address:8786")
# Now .compute() runs distributed across workers
# PySpark for truly massive data (Hadoop ecosystem)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("ChurnAnalysis").getOrCreate()
df_spark = spark.read.csv("hdfs://data/churn.csv", header=True, inferSchema=True)
result = df_spark.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
result.show()
Frequently Asked Questions
Should I use Pandas or Polars for data processing in 2026?
Polars is significantly faster than Pandas for large datasets (often 5-50x) due to its Rust foundation and lazy evaluation with query optimization. Use Pandas when ecosystem compatibility matters most. Switch to Polars when processing datasets over 1 GB or when you need better multi-core utilization. Both are production-ready.
How do I accelerate NumPy and Pandas with GPU?
NVIDIA RAPIDS provides GPU-accelerated equivalents: CuPy replaces NumPy and cuDF replaces Pandas with nearly identical APIs. Install via conda from the rapidsai channel and change your import statements. For non-NVIDIA hardware, JAX with XLA compilation provides GPU/TPU acceleration for array operations.
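A minimal sketch of the drop-in pattern. Aliasing the array module as `xp` is a common convention, not a RAPIDS requirement; this sketch falls back to NumPy when CuPy (and a GPU) is unavailable, so the same code runs on either backend.

```python
# Drop-in GPU acceleration pattern: alias the array module, then write
# backend-agnostic code against the shared NumPy-style API.
try:
    import cupy as xp  # GPU-backed ndarray, NumPy-compatible API
except ImportError:
    import numpy as xp  # CPU fallback with the same API

a = xp.arange(1_000_000, dtype=xp.float64)
total = float((2 * a).sum())  # identical call on either backend
print(total)
```

Because both libraries expose the same array interface, only the import line changes between CPU and GPU runs.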
What is MLflow and why should I use it?
MLflow is an open-source platform for managing the full ML lifecycle: experiment tracking (log parameters, metrics, artifacts), model registry (version control for trained models), and model serving. Call mlflow.sklearn.autolog() before training to automatically log all Scikit-learn experiments.
What is the best Python library for interactive data visualization?
Plotly is the most popular choice for interactive charts — it renders in Jupyter notebooks and browsers with zoom, hover tooltips, and download capabilities. For publication-quality static charts, Matplotlib and Seaborn remain standard. Altair uses a declarative grammar-of-graphics approach for quick exploratory plots.
How do I handle missing data in Pandas?
Pandas represents missing values as NaN. Key methods: df.isna() detects them, df.dropna() removes affected rows, df.fillna(value) fills with a constant, and df.interpolate() uses linear interpolation for time series. For ML, use Scikit-learn's SimpleImputer as part of a Pipeline to handle this automatically and safely during cross-validation.
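A quick illustration of those methods on a toy Series (values invented for this example):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isna().sum())            # count missing values: 2
print(s.dropna().tolist())       # drop them: [1.0, 3.0, 5.0]
print(s.fillna(0).tolist())      # constant fill: [1.0, 0.0, 3.0, 0.0, 5.0]
print(s.interpolate().tolist())  # linear fill: [1.0, 2.0, 3.0, 4.0, 5.0]
```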
What is the difference between fit(), transform(), and fit_transform() in Scikit-learn?
fit() computes statistics from training data (e.g., mean and std for StandardScaler). transform() applies the learned transformation to any data. fit_transform() does both at once. Critical rule: call fit() only on training data, then transform() on both training and test data to prevent data leakage.
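A minimal StandardScaler sketch of that rule, with toy arrays invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean/std

# The mean (2.0) comes from X_train alone — X_test never influences it.
print(scaler.mean_)
```

Calling `fit` (or `fit_transform`) on the test set would leak its statistics into the transformation and inflate evaluation scores.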
How do I deploy a Scikit-learn model to production?
Save the trained model with joblib.dump(model, "model.pkl"), then wrap it in a FastAPI endpoint that loads the model once at startup and exposes a POST route for predictions. Containerize with Docker and deploy to any cloud. MLflow Model Registry provides versioned model storage with staging/production environments.
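The save/load round trip in isolation — the model, data, and file name below are illustrative placeholders:

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

model = LogisticRegression().fit(X, y)
joblib.dump(model, "model.pkl")    # serialize the fitted estimator

loaded = joblib.load("model.pkl")  # in FastAPI, do this once at startup
print(loaded.predict([[2.5]]))     # same predictions as the original
```

Loading once at startup (e.g., in a FastAPI lifespan handler) avoids paying deserialization cost on every request.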
Is Jupyter Notebook or VS Code better for data science?
Both are excellent and many data scientists use them together. Jupyter excels at interactive exploration and sharing results with rich output. VS Code with the Jupyter extension gives the same notebook interface plus full IDE features: debugger, Git integration, IntelliSense, and refactoring tools. Use VS Code for production-quality code; use notebooks for exploration and collaboration.
Summary
Python's data science ecosystem is unmatched in breadth, depth, and production-readiness. The stack — NumPy for arrays, Pandas for tabular data, Matplotlib/Seaborn/Plotly for visualization, and Scikit-learn for machine learning — covers 95% of data science tasks. For large-scale workloads, Polars and Dask extend this stack without sacrificing the Python-native experience. Combine Jupyter notebooks for exploration with VS Code for production code, track every experiment with MLflow, and deploy with FastAPI and Docker for a fully professional data science workflow in 2026.