Tools used to build AI systems.
Written by Junkun Yuan.
Table of contents:
argparse(command-line arguments)
data transforms(transform and augment data) data loader(build and load dataset) operation(tensor operations) module(modules to build models) activation function(activation functions) optimizer(optimizers) huggingface(huggingface tools)
fsdp(fully sharded data parallel framework) torchrun(console script for Distributed Framework) deepspeed(deepspeed framework) ray(ray framework)
vscode(vscode configs) macbook-reimage(macOS setup checklist) git(git tools) docker(docker tools)
argparse
command-line arguments
Sep 29, 2025 | argparse
Parser for command-line options, arguments, and sub-commands.
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--name", type=str, default="John", help="your name")
parser.add_argument("--debug", action="store_true", help="debug mode") # "store_true" means default is False
args = parser.parse_args()
print(args.name)
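For quick experiments or unit tests, `parse_args` also accepts an explicit list of strings instead of reading `sys.argv`. A minimal sketch reusing the two arguments above (the `build_parser` helper name is only illustrative):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the same parser as above (helper name is illustrative)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", type=str, default="John", help="your name")
    parser.add_argument("--debug", action="store_true", help="debug mode")
    return parser


# Passing a list bypasses sys.argv, which is handy in notebooks and tests.
args = build_parser().parse_args(["--name", "Alice", "--debug"])
defaults = build_parser().parse_args([])
print(args.name, args.debug)          # Alice True
print(defaults.name, defaults.debug)  # John False
```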
Data Transforms
transform and augment data
Jul 01, 2024 | data transforms
| category | class / function (alphabetical) |
|---|---|
| geometry | RandomHorizontalFlip |
| resizing | Resize |
| conversion | Normalize ToTensor |
| else | Compose |
from torchvision import transforms
from torchvision.transforms import InterpolationMode # InterpolationMode.BILINEAR, NEAREST, BICUBIC, LANCZOS
RandomHorizontalFlip: horizontally flip the image randomly with a probability.
p = 0.5 # *** float. Probability to flip
trans = transforms.RandomHorizontalFlip(p)
image_trans = trans(image) # PIL Image => PIL Image, or Tensor => Tensor
Resize: resize the image to a size.
## When `size` is an int, the shorter edge of the image is resized to `size`, keeping the aspect ratio
## When `size` is a tuple (h, w), the image is resized to exactly that size, so the aspect ratio may change
size = / # *** tuple or int
## NEAREST: fastest; lowest quality, jagged
## BILINEAR: fast; low quality, blur
## (recommend) BICUBIC: slow; good quality
## (recommend) LANCZOS: slowest; best quality
interpolation = InterpolationMode.BILINEAR # *** InterpolationMode
## The shorter edge may end up smaller than `size` if the longer edge would exceed `max_size` after resizing
max_size = None # int. Maximum allowed for the longer edge, supported if `size` is int
trans = transforms.Resize(size, interpolation, max_size)
image_trans = trans(image) # PIL Image => PIL Image, or Tensor => Tensor
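The int-`size` and `max_size` rules above can be sketched in plain Python. This is a rough illustration of the described behavior, not torchvision's own code, and the helper name `resized_hw` is made up; rounding may differ from the library by a pixel.

```python
from typing import Optional, Tuple


def resized_hw(h: int, w: int, size: int, max_size: Optional[int] = None) -> Tuple[int, int]:
    """Output (height, width) when `size` is an int: scale the shorter edge to `size`."""
    short, long = (h, w) if h <= w else (w, h)
    new_short, new_long = size, int(size * long / short)
    # If the longer edge would exceed max_size, shrink both edges to fit.
    if max_size is not None and new_long > max_size:
        new_short, new_long = int(max_size * new_short / new_long), max_size
    return (new_short, new_long) if h <= w else (new_long, new_short)


print(resized_hw(480, 640, 256))                # (256, 341)
print(resized_hw(480, 640, 256, max_size=300))  # (225, 300)
```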
ToTensor: convert a PIL Image or ndarray to tensor and scale the values accordingly.
## Input: PIL Image / numpy.ndarray (np.uint8) of shape (HxWxC) in the range [0, 255]
## Output: torch.FloatTensor of shape (CxHxW) in the range [0.0, 1.0]
## Other inputs: converted to a tensor without scaling
trans = transforms.ToTensor()
image_trans = trans(image)
Compose: compose several transforms.
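transform_list = / # *** list of Transform objects
trans = transforms.Compose(transform_list) # a separate variable name avoids shadowing the `transforms` module
image_trans = trans(image) # input / output types depend on the composed transforms, e.g., PIL Image => Tensor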
Normalize: normalize a tensor image with mean and standard deviation.
mean = / # *** sequence. Means for each channel
std = / # *** sequence. Standard deviations for each channel
inplace = False # bool. If True, normalize the tensor in-place
trans = transforms.Normalize(mean, std, inplace)
image_trans = trans(image) # Tensor => Tensor
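As a worked example of the arithmetic behind `ToTensor` followed by `Normalize`, the sketch below applies both steps to single pixel values in plain Python. The helper names are illustrative, and `mean = std = 0.5` is just a common choice, not something the library prescribes.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic the scaling step of ToTensor for one uint8 value."""
    return pixel / 255.0


def normalize(value: float, mean: float, std: float) -> float:
    """Per-channel normalization: (value - mean) / std."""
    return (value - mean) / std


# With mean = std = 0.5, the range [0, 1] maps to [-1, 1].
outputs = [normalize(to_unit_range(p), mean=0.5, std=0.5) for p in (0, 255)]
print(outputs)  # [-1.0, 1.0]
```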
Data Loader
build and load dataset
Jun 30, 2024 | data loader
import torch
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
# Example: Build a dataset
class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, index):
        return self.data[index]
# Example: Build a distributed sampler
dataset = / # *** Dataset
num_replicas = world_size # *** int. Number of replicas
rank = rank # *** int. Rank of the current process
shuffle = False # *** bool. If True, have the data shuffled at every epoch
seed = 0 # *** int. Random seed used to shuffle the sampler if `shuffle` is True
drop_last = False # *** bool. If True, drop the tail of the data so it divides evenly across replicas; if False, pad with repeated samples
sampler = DistributedSampler(dataset, num_replicas, rank, shuffle, seed, drop_last)
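To make the per-rank split concrete, here is a pure-Python sketch of how indices are divided when `shuffle` is False. It is an illustration of the behavior under that assumption, not the library source, and `shard_indices` is a hypothetical helper name.

```python
import math
from typing import List


def shard_indices(n: int, num_replicas: int, rank: int, drop_last: bool = False) -> List[int]:
    """Indices one rank would see with shuffle=False (illustrative helper)."""
    indices = list(range(n))
    if drop_last:
        total = (n // num_replicas) * num_replicas
        indices = indices[:total]            # drop the tail
    else:
        total = math.ceil(n / num_replicas) * num_replicas
        while len(indices) < total:          # pad by repeating from the start
            indices += indices[: total - len(indices)]
    return indices[rank:total:num_replicas]  # interleaved split across ranks


print([shard_indices(10, 4, r) for r in range(4)])
# [[0, 4, 8], [1, 5, 9], [2, 6, 0], [3, 7, 1]]
print([shard_indices(10, 4, r, drop_last=True) for r in range(4)])
# [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank receives the same number of samples, which keeps collective operations in step across processes.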
# Example: Build a data loader
# Note: if sampler is not None, shuffle must be False, drop_last can be either True or False
# Note: if sampler is None, one can set `torch.manual_seed(SEED)` to fix the random seed
dataset = / # *** Dataset
batch_size = 1 # *** int. Number of samples per batch
shuffle = False # *** bool. If True, have the data shuffled at every epoch
sampler = None # Sampler or Iterable. Define how to draw samples
num_workers = 0 # *** int. Number of subprocesses to use for data loading
collate_fn = None # Callable. Merge a list of samples to form a batch
pin_memory = False # *** bool. If True, copy Tensors into CUDA pinned memory
drop_last = False # *** bool. If True, drop the last incomplete batch
timeout = 0 # numeric. If positive, set timeout for collecting a batch from workers
prefetch_factor = None # int. Default = None if num_workers == 0 else 2
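## Use keyword arguments: positionally, the 5th argument is `batch_sampler`, and `prefetch_factor` is keyword-only
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, sampler=sampler,
    num_workers=num_workers, collate_fn=collate_fn, pin_memory=pin_memory, drop_last=drop_last,
    timeout=timeout, prefetch_factor=prefetch_factor)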
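The effect of `batch_size` and `drop_last` on the batches a loader yields can be sketched without PyTorch. The helper below is only an illustration of the grouping rule (`make_batches` is a made-up name), assuming samples are drawn in order.

```python
from typing import List


def make_batches(indices: List[int], batch_size: int, drop_last: bool) -> List[List[int]]:
    """Group sample indices into batches, optionally dropping a short final batch."""
    batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


print(make_batches(list(range(7)), 3, drop_last=False))  # [[0, 1, 2], [3, 4, 5], [6]]
print(make_batches(list(range(7)), 3, drop_last=True))   # [[0, 1, 2], [3, 4, 5]]
```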
Operation
tensor operations
Jun 30, 2023 | operation
| category | class / function (alphabetical) |
|---|---|
| operations | basic operations einsum isclose & allclose matmul mean & var softmax |
| data generation | arange uniform & normal zeros & ones |
| size | cat chunk & split flatten permute reshape & view size & shape squeeze & unsqueeze transpose unbind |
| else | where |
import torch
basic operations: exp, sin, cos, sqrt.
y = torch.function(x) # function: exp, sin, cos, sqrt
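mean & var: mean and variance along the given dims.
dim = / # *** int or tuple of ints. Dims to reduce
keepdim = False # *** bool. If True, keep the reduced dims with size 1
mean = x.mean(dim, keepdim)
## In version>=2.0, `correction=1` is equivalent to `unbiased=True`, and `correction=0` to `unbiased=False`
correction = 1 # *** int. Difference between the sample size and the degrees of freedom
var = x.var(dim, correction=correction, keepdim=keepdim) # keywords: positionally, the 2nd argument is `unbiased`
softmax: rescale values along a dim so that they are non-negative and sum to 1.
dim = / # *** int. Dim to apply softmax
y = x.softmax(dim)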
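The `correction` argument of `var` only changes the divisor, which becomes N - correction. A small plain-Python sketch (the `variance` helper is illustrative) shows both settings on the same data:

```python
import statistics
from typing import Sequence


def variance(values: Sequence[float], correction: int = 1) -> float:
    """Variance with divisor (N - correction)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - correction)


data = [1.0, 2.0, 3.0, 4.0]
sample_var = variance(data, correction=1)      # unbiased estimate, divisor N - 1
population_var = variance(data, correction=0)  # biased estimate, divisor N
print(sample_var, population_var)
```

The two results agree with `statistics.variance` and `statistics.pvariance` from the standard library, respectively.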
matmul: matrix multiplication.
other = / # *** tensor
y = x.matmul(other)
einsum: Einstein summation convention.
equation = / # *** str. The subscript for the Einstein summation
operands = / # *** list of tensor. The tensor to be computed
## torch.einsum("ii", tensor) # trace
## torch.einsum("ii->i", tensor) # diagonal
## torch.einsum("i,j->ij", tensor1, tensor2) # outer product
## torch.einsum("bij,bjk->bik", tensor1, tensor2) # batch matrix multiplication
## torch.einsum("...ij->...ji", tensor) # batch transpose
y = torch.einsum(equation, *operands)
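To read an einsum equation, it can help to expand it into loops: indices that appear in the output are kept, and indices that appear only in the inputs are summed out. The sketch below does this for "bij,bjk->bik" with nested Python lists; it is an illustration, and `batched_matmul` is a made-up helper.

```python
from typing import List

Matrix = List[List[float]]


def batched_matmul(a: List[Matrix], b: List[Matrix]) -> List[Matrix]:
    """Plain-loop equivalent of the equation "bij,bjk->bik"."""
    out = []
    for a_mat, b_mat in zip(a, b):  # shared batch index b
        rows, inner, cols = len(a_mat), len(b_mat), len(b_mat[0])
        out.append([
            [sum(a_mat[i][j] * b_mat[j][k] for j in range(inner))  # j is summed out
             for k in range(cols)]
            for i in range(rows)
        ])
    return out


a = [[[1.0, 2.0], [3.0, 4.0]]]
b = [[[5.0, 6.0], [7.0, 8.0]]]
result = batched_matmul(a, b)
print(result)  # [[[19.0, 22.0], [43.0, 50.0]]]
```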
isclose & allclose: check whether two tensors are close.
other = / # *** tensor. The second tensor to compare
rtol = 1e-5 # float. Relative tolerance
atol = 1e-8 # float. Absolute tolerance
equal_nan = False # bool. If True, then two NaN will be considered equal
## Check if elements satisfy: |input - other| <= atol + rtol * |other|
x.isclose(other, rtol, atol, equal_nan) # return a tensor of bool
x.allclose(other, rtol, atol, equal_nan) # return True or False
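The closeness check can be tried on scalars in plain Python. This sketch (the `is_close` name is illustrative) implements the inequality above; note that the tolerance scales with `other` only, so swapping the arguments can change the result in borderline cases.

```python
def is_close(x: float, other: float, rtol: float = 1e-5, atol: float = 1e-8) -> bool:
    """Scalar version of the check: |x - other| <= atol + rtol * |other|."""
    return abs(x - other) <= atol + rtol * abs(other)


print(is_close(1.0, 1.000001))  # True: the difference is within rtol
print(is_close(1.0, 1.001))     # False
print(is_close(0.0, 1e-7))      # False: near zero, atol dominates the tolerance
```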
## --------------------------------------------------------------------------------
zeros & ones: create a tensor filled with zeros or ones.
size = / # *** sequence of int. The shape of output
y = torch.zeros(size)
y = torch.ones(size)
uniform & normal: create a tensor filled with random samples from a uniform or normal distribution.
size = / # *** sequence of int. The shape of output
generator = None # torch.Generator. A pseudorandom number generator for sampling
requires_grad = False # bool. If True, autograd records operations on the returned tensor
dtype = None # torch.dtype. The desired data type
device = None # torch.device. The desired device
## All arguments after `size` are keyword-only
y = torch.rand(size, generator=generator, requires_grad=requires_grad, dtype=dtype, device=device) # uniform distribution U[0, 1)
y = torch.randn(size, generator=generator, requires_grad=requires_grad, dtype=dtype, device=device) # standard normal distribution N(0, 1)
arange: a sequence in order.
start = 0 # *** number. The starting value
end = / # *** number. The ending value (exclusive)
step = 1 # *** number. The gap between adjacent points
arange = torch.arange(start, end, step)
size & shape: get tensor size.
dim = None # int. Dim to retrieve the size
size = x.size(dim) # => torch.Size or int
size = x.shape # => torch.Size
reshape & view: reshape a tensor with the given shape.
shape = / # sequence of int. The new shape. A single dim could be -1
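y = x.reshape(shape) # recommended: returns a view when possible and copies the data otherwise
y = x.view(shape) # never copies; raises an error if the memory layout is incompatible with `shape`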
flatten: flatten along the given dimensions.
start_dim = 0 # *** int. The first dimension to flatten
end_dim = -1 # *** int. The last dimension to flatten
y = x.flatten(start_dim, end_dim)
transpose: swap two dimensions.
dim0 = / # *** int. The first dim to be transposed
dim1 = / # *** int. The second dim to be transposed
y = x.transpose(dim0, dim1)
permute: permute dimensions of a tensor.
dims = / # *** sequence of int. The desired ordering of dims
y = x.permute(dims)
squeeze & unsqueeze: remove and insert singleton dimensions.
dim = None # *** int or tuple of ints. If given, only the dim will be squeezed
y = x.squeeze(dim)
dim = / # *** int. The index at which to insert the singleton dim
y = x.unsqueeze(dim) # Equal to y = x[:, :, None, :] when dim = 2
cat: concatenate tensors along a dimension.
tensors = / # *** tuple of tensors. Tensors with the same shape except in the cat dim
dim = 0 # *** int. The concatenation dim
y = torch.cat(tensors, dim)
unbind: remove a dimension by splitting it.
dim = 0 # *** int. Dim to remove
y = x.unbind(dim)
chunk & split: split a tensor with chunk numbers or split sizes.
chunks = / # *** int
dim = 0 # *** int
## If the given dim is divisible by chunks, all returned chunks will be the same size
## If the given dim is not divisible by chunks, the last one will not be the same size
## If such division is not possible, it returns fewer than the specified number of chunks
y = x.chunk(chunks, dim)
split_size_or_sections = / # *** int or list of ints
dim = 0 # *** int. Dim along which to split the tensor
## If split_size_or_sections is an int, split into chunks of that size (the last one may be smaller)
## If split_size_or_sections is a list, split into len(split_size_or_sections) chunks with the listed sizes
y = x.split(split_size_or_sections, dim)
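The size rules for `chunk` and `split` are easy to check with a little arithmetic. The sketch below computes the piece sizes only, in plain Python; the helper names are illustrative and the dim length is assumed to be positive.

```python
import math
from typing import List


def chunk_sizes(n: int, chunks: int) -> List[int]:
    """Sizes produced when a dim of length n is chunked (assumes n > 0)."""
    size = math.ceil(n / chunks)
    return [min(size, n - start) for start in range(0, n, size)]


def split_sizes(n: int, split_size: int) -> List[int]:
    """Sizes produced when a dim of length n is split by an int split size."""
    return [min(split_size, n - start) for start in range(0, n, split_size)]


print(chunk_sizes(12, 6))  # [2, 2, 2, 2, 2, 2]
print(chunk_sizes(11, 6))  # [2, 2, 2, 2, 2, 1]
print(chunk_sizes(13, 6))  # [3, 3, 3, 3, 1], only 5 chunks
print(split_sizes(5, 2))   # [2, 2, 1]
```

The third case shows why `chunk` can return fewer pieces than requested: the chunk size is rounded up first, and the dim runs out early.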
where: select elements from a tensor.
condition = / # *** BoolTensor. Where True, yield input, otherwise yield other
input = / # *** tensor or scalar
other = / # *** tensor or scalar
y = torch.where(condition, input, other)
Module
modules to build models
Jun 29, 2024 | module
| category | tool (alphabetical) |
|---|---|
| parameter | Parameter & Buffer |
| convolution | Conv2d Conv3d |
| other module | Linear |
| else | Dropout |
import torch
from torch import nn
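Parameter & Buffer: learnable parameters and non-learnable state tensors of a module.
data = / # *** tensor. Parameter tensor
requires_grad = True # bool. If True, the parameter requires gradient
gamma = nn.Parameter(data, requires_grad)
persistent = True # whether the buffer is part of the module's state_dict
self.register_buffer("gamma", data, persistent) # usually used in __init__; takes the buffer name first and returns None
gamma = nn.parameter.Buffer(data, persistent=persistent) # not usually used; `persistent` is keyword-only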
Linear: affine linear transformation.
in_features = / # *** int. Input features
out_features = / # *** int. Output features
bias = True # *** bool. If True, learn an additive bias
device = None # torch.device or int
dtype = None # torch.dtype
linear = nn.Linear(in_features, out_features, bias, device, dtype)
y = linear(x) # [..., H_in] => [..., H_out]
Conv2d: 2D convolution.
in_channels = / # *** int. Number of channels in the input
out_channels = / # *** int. Number of channels in the output
kernel_size = / # *** int, tuple. Size of convolving kernel
stride = 1 # *** int, tuple. Stride of convolution
padding = 0 # int, tuple, str. Padding added to all four sides of the input
dilation = 1 # int, tuple. Spacing between kernel elements
groups = 1 # int. Number of blocked connections from input channels to output
bias = True # bool. If True, add a learnable bias to the output
padding_mode = "zeros" # str. "zeros", "reflect", "replicate", or "circular"
device = None # torch.device or int
dtype = None # torch.dtype
## Weight. Shape: [out_channels, in_channels/groups, k_size[0], k_size[1]]
## Bias. Shape: [out_channels,]
conv2d = nn.Conv2d(in_channels, out_channels, kernel_size, stride,
padding, dilation, groups, bias, padding_mode, device, dtype)
## [.] denotes the floor operation
## H_out = [(H_in + 2*padding[0] - dilation[0]*(kernel[0]-1)-1) / stride[0] + 1]
## W_out = [(W_in + 2*padding[1] - dilation[1]*(kernel[1]-1)-1) / stride[1] + 1]
y = conv2d(x) # [B, C_in, H_in, W_in] => [B, C_out, H_out, W_out]
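The output-size formula above applies to each spatial dim independently, so it can be checked with a one-line function. A minimal sketch in plain Python (the `conv_out_size` name is illustrative):

```python
import math


def conv_out_size(size_in: int, kernel: int, stride: int = 1, padding: int = 0, dilation: int = 1) -> int:
    """Output length along one spatial dim, following the formula above."""
    return math.floor((size_in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)


print(conv_out_size(224, kernel=3, stride=1, padding=1))  # 224, size preserved
print(conv_out_size(224, kernel=7, stride=2, padding=3))  # 112
print(conv_out_size(32, kernel=3, dilation=2))            # 28
```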
Conv3d: 3D convolution.
in_channels = / # *** int. Number of channels in the input
out_channels = / # *** int. Number of channels in the output
kernel_size = / # *** int, tuple. Size of convolving kernel
stride = 1 # *** int, tuple. Stride of convolution
padding = 0 # int, tuple, str. Padding added to all six sides of the input
dilation = 1 # int, tuple. Spacing between kernel elements
groups = 1 # int. Number of blocked connections from input channels to output
bias = True # bool. If True, add a learnable bias to the output
padding_mode = "zeros" # str. "zeros", "reflect", "replicate", or "circular"
device = None # torch.device or int
dtype = None # torch.dtype
## Weight. Shape: [out_channels, in_channels/groups, k_size[0], k_size[1], k_size[2]]
## Bias. Shape: [out_channels,]
conv3d = nn.Conv3d(in_channels, out_channels, kernel_size, stride,
padding, dilation, groups, bias, padding_mode, device, dtype)
## D_out = [(D_in + 2*padding[0] - dilation[0]*(kernel[0]-1)-1) / stride[0] + 1]
## H_out = [(H_in + 2*padding[1] - dilation[1]*(kernel[1]-1)-1) / stride[1] + 1]
## W_out = [(W_in + 2*padding[2] - dilation[2]*(kernel[2]-1)-1) / stride[2] + 1]
y = conv3d(x) # [B, C_in, D_in, H_in, W_in] => [B, C_out, D_out, H_out, W_out]
Dropout: randomly zero elements of the input during training.
p = 0.5 # *** float. Probability of an element to be zeroed
inplace = False # bool
dropout = nn.Dropout(p, inplace)
y = dropout(x)
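During training, `nn.Dropout` also rescales the surviving elements by 1 / (1 - p) so that the expected value is unchanged, and it acts as the identity in evaluation mode. A plain-Python sketch of this "inverted dropout" rule follows; the function and its `seed` argument are illustrative, and p < 1 is assumed.

```python
import random
from typing import List


def dropout(x: List[float], p: float = 0.5, training: bool = True, seed: int = 0) -> List[float]:
    """Inverted dropout: zero each element with probability p, scale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(x)  # identity in evaluation mode
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in x]


x = [1.0] * 8
y_train = dropout(x, p=0.5, training=True)
y_eval = dropout(x, p=0.5, training=False)
print(y_train)  # each entry is either 0.0 or 2.0
print(y_eval)   # unchanged
```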
Activation Function
activation functions
Jun 28, 2024 | activation function
| tool (alphabetical) | popular applications |
|---|---|
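| GELU | / |
| SiLU/swish | / |
from torch import nn
GELU: Gaussian Error Linear Units function, \(\mathrm{GELU}(x)=x\cdot\Phi(x)\), where \(\Phi\) is the cumulative distribution function of the standard Gaussian distribution.
gelu = nn.GELU() # the class name is all uppercase; `nn.GeLU` does not exist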
y = gelu(x)
SiLU/swish: Sigmoid Linear Unit function, \(\mathrm{SiLU}(x)=x*\sigma(x)\), where \(\sigma\) is logistic function.
inplace = False
silu = nn.SiLU(inplace)
y = silu(x)
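Both formulas can be evaluated with the standard library, which is a handy way to sanity-check values. A minimal sketch for scalars, using the exact (erf-based) form of GELU:

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard Gaussian CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def silu(x: float) -> float:
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for value in (-1.0, 0.0, 1.0):
    print(f"x={value:+.1f}  gelu={gelu(value):.4f}  silu={silu(value):.4f}")
```

Both functions pass through zero and approach the identity for large positive inputs, while small negative inputs are damped rather than cut off as in ReLU.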
Optimizer
optimizers
Jun 28, 2024 | optimizer
It includes tools for building optimization algorithms: AdamW.
from torch.optim import AdamW
## --------------------------------------------------------------------------------
## AdamW
## --------------------------------------------------------------------------------
params = / # *** iterable. Parameters / named_parameters / parameter groups to optimize
lr = 0.001 # *** float, Tensor. Learning rate
betas = (0.9, 0.999) # tuple. For computing running averages of gradients & squares
weight_decay = 0.01 # float. Weight decay coefficient
# ...
adam_optim = AdamW(params, lr=lr, betas=betas, weight_decay=weight_decay) # keywords: positionally, the 4th argument is `eps`
## --------------------------------------------------------------------------------
huggingface
huggingface tools
May 30, 2023 | huggingface
It includes tools from huggingface: snapshot_download
from huggingface_hub import snapshot_download
## --------------------------------------------------------------------------------
## Download checkpoints from huggingface
## --------------------------------------------------------------------------------
repo_id = / # *** str. A user name and a repo name, e.g., "Qwen/Qwen-VL-Chat"
repo_type = None # *** str. "dataset", "space", or "model"
local_dir = None # *** str or Path. If provided, directory to place the downloaded files
token = None # str, bool. User token
max_workers = 8 # int. Number of concurrent threads to download files
# ...
snapshot_download(repo_id, repo_type=repo_type, local_dir=local_dir, token=token, max_workers=max_workers) # all arguments after `repo_id` are keyword-only
FSDP
fully sharded data parallel framework
Dec 03, 2025 | fsdp
Wrap model with FSDP to enable parallelism.
import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import BackwardPrefetch, ShardingStrategy # enums; members are accessed as attributes
# ShardingStrategy.FULL_SHARD: shard parameters, gradients, and optimizer states
# ShardingStrategy.SHARD_GRAD_OP: shard gradients and optimizer states
# ShardingStrategy.NO_SHARD: no sharding, like DDP
# ShardingStrategy.HYBRID_SHARD: apply FULL_SHARD within a node and replicate across nodes
# ShardingStrategy._HYBRID_SHARD_ZERO2: apply SHARD_GRAD_OP within a node and replicate across nodes
sharding_strategy = None # *** ShardingStrategy. The sharding strategy to use (FULL_SHARD if None)
# The auto wrap policy to use. If None, `module` is wrapped as a single FSDP unit without wrapping its submodules
auto_wrap_policy = None # *** ModuleWrapPolicy, CustomPolicy, or callable. User can specify the classes to wrap
# My example:
def custom_auto_wrap_policy(module, recurse, nonwrapped_numel, min_num_params: int = int(1e8)) -> bool:
    if recurse:
        return True # keep traversing into child modules
    return nonwrapped_numel >= min_num_params # wrap once enough unwrapped parameters accumulate
my_auto_wrap_policy = functools.partial(custom_auto_wrap_policy, min_num_params=int(1e5))
# BackwardPrefetch.BACKWARD_PRE: prefetch the next set of parameters before the current set's gradient computation
# BackwardPrefetch.BACKWARD_POST: prefetch the next set of parameters after the current set's gradient computation
backward_prefetch = BackwardPrefetch.BACKWARD_PRE # BACKWARD_PRE, BACKWARD_POST or None
module = / # *** nn.Module. The module to be wrapped
process_group = None # ProcessGroup. The process group to work on (use the default if None)
cpu_offload = None # *** CPUOffload. With CPUOffload(offload_params=True), offload parameters and gradients to CPU
mixed_precision = None # *** MixedPrecision. The mixed precision to use
ignored_modules = None # Module. Modules to ignore. To be deprecated, use `ignored_states` instead
param_init_fn = None # Callable[[nn.Module], None]. How to initialize modules that are on the meta device
device_id = None # *** int or torch.device. The device to use
sync_module_states = False # bool. If True, synchronize module states across processes
forward_prefetch = False # bool. If True, prefetch the next forward before current forward
limit_all_gathers = True # bool. If True, synchronize the CPU thread to cap the GPU memory used by in-flight all-gathers
use_orig_params = False # bool. If True, expose the original para instead of the sharded para
ignored_states = None # *** Parameter. States to ignore
device_mesh = None # *** DeviceMesh. The device mesh to use
# Shard module parameters across data parallel workers
sharded_model = FSDP(
module, process_group, sharding_strategy, cpu_offload, auto_wrap_policy, backward_prefetch,
mixed_precision, ignored_modules, param_init_fn, device_id, sync_module_states,
forward_prefetch, limit_all_gathers, use_orig_params, ignored_states, device_mesh
)
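The size-based policy above can be exercised on its own to see which calls lead to wrapping. In the sketch below the policy is called by hand with a placeholder module; in real use FSDP supplies the arguments, passing `recurse=True` while it is still walking down the module tree and `recurse=False` when it asks whether to wrap.

```python
import functools


def custom_auto_wrap_policy(module, recurse, nonwrapped_numel, min_num_params: int = int(1e8)) -> bool:
    if recurse:
        return True  # always keep descending into children
    return nonwrapped_numel >= min_num_params  # wrap only sufficiently large subtrees


policy = functools.partial(custom_auto_wrap_policy, min_num_params=int(1e5))

# `module` is unused by this policy, so None stands in for a real nn.Module here.
print(policy(None, recurse=True, nonwrapped_numel=10))        # True
print(policy(None, recurse=False, nonwrapped_numel=50_000))   # False
print(policy(None, recurse=False, nonwrapped_numel=200_000))  # True
```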
torchrun
console script for Distributed Framework
Sep 29, 2025 | torchrun
Launch distributed training jobs with torchrun.
# Method 1: Use rdzv_endpoint (recommended)
torchrun \
    --nnodes ${NNODES} \
    --nproc_per_node ${NPROC_PER_NODE} \
    --node_rank ${NODE_RANK} \
    --rdzv_backend c10d \
    --rdzv_endpoint ${MASTER_ADDR}:${MASTER_PORT} \
    --rdzv_id ${RDZV_ID} \
    train.py
# Method 2: Use master_addr and master_port
torchrun \
    --nnodes ${NNODES} \
    --nproc_per_node ${NPROC_PER_NODE} \
    --node_rank ${NODE_RANK} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    train.py
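Inside `train.py`, each worker can find out who it is from environment variables that torchrun exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). A minimal sketch of reading them; the `read_dist_env` helper and its single-process fallback values are assumptions for illustration.

```python
import os
from typing import Mapping


def read_dist_env(env: Mapping[str, str] = os.environ) -> dict:
    """Read the rank information that torchrun exports to each worker process."""
    return {
        "rank": int(env.get("RANK", 0)),              # global rank of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # rank within the current node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Simulated environment: second process on the second node of a 2-node x 4-GPU job.
fake_env = {"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"}
info = read_dist_env(fake_env)
print(info)  # {'rank': 5, 'local_rank': 1, 'world_size': 8}
```

`local_rank` is typically the value used to pick the GPU for the process, while `rank` and `world_size` are what samplers and collectives need.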
barrier: block until all processes in the group reach this call.
from torch.distributed import barrier
group = None # ProcessGroup. The process group to work on
async_op = False # bool. If True, the barrier is asynchronous
device_ids = None # list[int]. List of device/GPU ids to use for the barrier
barrier(group, async_op, device_ids)
DeepSpeed
deepspeed framework
Aug 21, 2025 | deepspeed
DeepSpeed is an open-source deep learning optimization library developed by Microsoft Research, designed to simplify and accelerate the training and deployment of large-scale deep learning models. Its ZeRO optimizer partitions training states across data-parallel workers in three stages:
| ZeRO stage | partition | memory saving | complexity |
|---|---|---|---|
| stage 1 | optimizer states | ~40% - 60% (for Adam) | low |
| stage 2 | optimizer states & gradients | additional ~15% - 25% | medium |
| stage 3 | optimizer states & gradients & model parameters | up to 80% - 90% | high |
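The stage is selected in the DeepSpeed JSON config under `zero_optimization.stage`. A minimal sketch of such a config built as a Python dict follows; the keys come from the DeepSpeed config schema, while the concrete values are placeholders chosen for illustration.

```python
import json

# Keys follow the DeepSpeed JSON config schema; the values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # 1, 2, or 3, matching the table above
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

config_text = json.dumps(ds_config, indent=2)
print(config_text)  # can be saved to a JSON file and passed to DeepSpeed
```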
Ray
ray framework
Oct 01, 2025 | ray
Ray is a distributed computing framework, developed by UC Berkeley, allowing you to scale machine learning and data processing workflows across multiple machines and GPUs. Ray employs a dynamic task graph computation model. Some important concepts:
VSCode
vscode configs
Oct 15, 2025 | vscode
Extensions (Remote SSH, Dev Containers), SSH keys, and LaTeX (LaTeX Workshop + TeX Live).
"latex-workshop.latex.recipes": [
{
"name": "XeLaTeX",
"tools": [
"xelatexmk"
]
},
{
"name": "PdfLaTeX",
"tools": [
"pdflatexmk"
]
}
],
"latex-workshop.latex.tools": [
{
"args": [
"-synctex=1",
"-pdfxe",
"-interaction=nonstopmode",
"-file-line-error",
"-outdir=%OUTDIR%",
"%DOC%"
],
"command": "latexmk",
"env": {},
"name": "xelatexmk"
},
{
"args": [
"-synctex=1",
"-pdf",
"-interaction=nonstopmode",
"-file-line-error",
"-outdir=%OUTDIR%",
"%DOC%"
],
"command": "latexmk",
"env": {},
"name": "pdflatexmk"
}
],
MacBook Reimage
macOS setup checklist
Apr 02, 2026 | macbook-reimage
Checklist after reimaging a MacBook: System Settings, Logi Options+, and daily apps.
Git
git tools
May 14, 2026 | git
Git is a distributed version control system that allows you to track changes in your code and collaborate with others.
| category | tool (alphabetical) |
|---|---|
| Setup and configure | config ssh keys |
| Get and create projects | clone |
| Branching and workspace | worktree |
clone: clone a repository into a new directory.
git clone git@github.com:[user name]/[repo name].git
config: get and set repository or global options.
git config --list # list all config
git config user.name [your name] && git config user.email [your email] # repo config; cd to repo and execute
git config --global user.name [your name] && git config --global user.email [your email] # global config
ssh keys: generate a new SSH key to use for authentication.
## 1. Check if SSH key exists: *.pub
ls -al ~/.ssh
## 2. If not, generate a new SSH key
ssh-keygen -t rsa -b 4096 -C [your GitHub email]
## 3. Copy SSH key
cat ~/.ssh/[your key name].pub
## 4. Open GitHub -> Settings -> SSH and GPG keys -> New SSH key -> paste the SSH key
## 5. Test
ssh -T git@github.com # to see if it prints "Hi *! You've successfully ..."
worktree: check out multiple branches into separate directories from the same repo. All worktrees share the same .git object store, so adding one costs almost no disk space and avoids the stash + checkout dance when switching branches.
## Typical uses:
## 1. Handle a hotfix without disturbing in-progress work
## 2. Run builds/tests on multiple branches in parallel
## 3. Review a PR branch alongside your own
## 4. Let an AI agent edit code in an isolated workspace
## Note:
## 1. A branch can only be checked out in one worktree at a time
## 2. The main repo directory itself is a worktree
## 3. Deleting a worktree folder with `rm -rf` leaves dangling metadata that `git worktree prune` must clean up
## Add a worktree for a branch
git worktree add [worktree path] [remote/local branch name]
## Remove a worktree
git worktree remove [worktree path]
## Clean up records of worktrees whose directories were deleted
git worktree prune
## List all worktrees
git worktree list
Docker
docker tools
Sep 23, 2025 | docker
Docker is a containerization tool that allows you to package your application with all its dependencies into a container.
| category | tool (alphabetical) |
|---|---|
| docker | start & restart & stop |
| image | ls & pull & rm |
| container | ps & run & enter & stop |
ps & run & enter & stop: list containers, run a container, enter a container, stop a container.
docker ps -a # list all containers
## --------------------------------------------------------------------------------
## Recommended params: -i: interactive mode; -d: detached mode; -t: allocate a pseudo-TTY
## If a volume mapping is needed: add "-v [local path, e.g., /home/user]:/root"
## If GPUs are needed: add "--gpus all", "--ipc host"
## To keep a container running without executing any command, append "tail -f /dev/null"
docker run [params, e.g., -dit] --name [container name] [image name] # run a container
## --------------------------------------------------------------------------------
docker exec -it [container name or container id] /bin/bash # enter a container
docker stop [container name or container id] # stop a container
docker restart [container name or container id] # restart a container
docker rm [container name or container id] # remove a container
ls & pull & rm: list, pull, and remove images.
docker images # or docker image ls; list all images
docker pull [image name] # pull an image from a registry
docker rmi [image name or image id] # remove an image
start & restart & stop: start, restart, stop docker.
sudo service docker start # start docker
sudo service docker restart # restart docker
sudo service docker stop # stop docker
Last updated on May 18, 2026 at 10:47 (UTC-7).