Frances Hu's Blog

Born to be wild!



Dynamic Programming

Published on 2019-04-18

Introduction

For career-planning reasons I set out to teach myself Go this year, and over the past two months I worked through about 100 algorithm problems in Golang on and off. However :( work got busy over the last month and the effort stalled; looking at https://github.com/xiaozhazi/leetcode_go today, it has been almost a month since the last commit - _ -#.

So starting today I am forcing myself to record my solutions and the Go pitfalls I step on, which also makes it easy to review at any time (really it is self-supervision to keep the laziness at bay :)

LeetCode Problem Series

Minimum Path Sum(64)

64 Minimum Path Sum

Question

Given an m x n grid filled with non-negative numbers, find a path from top left to bottom right which minimizes the sum of all numbers along its path.

Note: You can only move either down or right at any point in time.

Example

Input:
[
  [1,3,1],
  [1,5,1],
  [4,2,1]
]
Output: 7
Explanation: Because the path 1→3→1→1→1 minimizes the sum. 

Analysis

Walking from (0,0) to (m-1,n-1), each stage offers only two decisions, move right or move down, and each stage corresponds to a set of states. Define the state dp[i][j] as the minimum path sum from the start to cell (i,j). This makes it a multi-stage optimization problem that fits the dynamic programming model.

dp[i][j] = min(dp[i-1][j],dp[i][j-1]) + grid[i][j]
Straight to the code:

func minPathSum(grid [][]int) int {
    if len(grid) == 0 || len(grid[0]) == 0 {
        return 0
    }
    m, n := len(grid), len(grid[0])
    dp := make([][]int, m)
    for i := 0; i < m; i++ {
        dp[i] = make([]int, n)
    }
    dp[0][0] = grid[0][0]
    for i := 1; i < m; i++ {
        dp[i][0] = dp[i-1][0] + grid[i][0]
    }
    for i := 1; i < n; i++ {
        dp[0][i] = dp[0][i-1] + grid[0][i]
    }
    for i := 1; i < m; i++ {
        for j := 1; j < n; j++ {
            dp[i][j] = min(dp[i-1][j], dp[i][j-1]) + grid[i][j]
        }
    }
    return dp[m-1][n-1]
}

// min is a small helper; Go (before 1.21) has no built-in integer min.
func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

After submitting: Runtime: 12 ms, faster than 28.57% of Go online submissions.
Memory Usage: 4.4 MB, less than 13.64% of Go online submissions for Minimum Path Sum.
That is too slow, so let's optimize.

In fact we only need a one-dimensional array of length n to store the state set; every state transition can operate on this single array. On to the code:

// Requires `import "math"`; min is the same helper as in the 2D version above.
func minPathSum(grid [][]int) int {
    if len(grid) == 0 || len(grid[0]) == 0 {
        return 0
    }
    m, n := len(grid), len(grid[0])
    dp := make([]int, n)
    for j := range dp {
        dp[j] = math.MaxInt32
    }
    dp[0] = 0
    for i := 0; i < m; i++ {
        for j := 0; j < n; j++ {
            if j == 0 {
                dp[j] += grid[i][j]
            } else {
                dp[j] = min(dp[j], dp[j-1]) + grid[i][j]
            }
        }
    }
    return dp[n-1]
}

After resubmitting: Runtime: 8 ms, faster than 100.00% of Go online submissions for Minimum Path Sum.
Memory Usage: 3.9 MB, less than 59.09% of Go online submissions for Minimum Path Sum.
: )

Unique Paths II(63)

63 Unique Paths II

Question

A robot is located at the top-left corner of an m x n grid (marked 'Start' in the diagram below).

The robot can only move either down or right at any point in time. The robot is trying to reach the bottom-right corner of the grid (marked ‘Finish’ in the diagram below).

Now consider if some obstacles are added to the grids. How many unique paths would there be?

Example

Input:
[
  [0,0,0],
  [0,1,0],
  [0,0,0]
]
Output: 2
Explanation:
There is one obstacle in the middle of the 3x3 grid above.
There are two ways to reach the bottom-right corner:
1. Right -> Right -> Down -> Down
2. Down -> Down -> Right -> Right

Analysis

My first instinct was to solve this with backtracking. From each cell, first try moving right; if that hits an obstacle or there is no cell to the right, the path is blocked, so backtrack to the cell and try moving down; if that also hits an obstacle or there is no cell below, then no path through this cell can reach the finish. Proceeding this way, count is incremented every time the recursion reaches the finish.

Pay special attention to the case where the start or the finish cell itself is 1 (an obstacle).

The core code is as follows:

// count is a package-level counter, incremented each time a path reaches the finish.
var count int

func DFS(obstacleGrid [][]int, i, j int) {
    m, n := len(obstacleGrid), len(obstacleGrid[0])
    if i == m-1 && j == n-1 && obstacleGrid[i][j] != 1 {
        count++
        return
    }
    if i < m-1 && obstacleGrid[i+1][j] == 0 {
        DFS(obstacleGrid, i+1, j)
    }
    if j < n-1 && obstacleGrid[i][j+1] == 0 {
        DFS(obstacleGrid, i, j+1)
    }
}

This recursive approach is essentially brute force: it enumerates every possible path with no repetition and no omission. Unsurprisingly, the submission got TLE : (

Analyzing further: for any non-obstacle cell (i,j), the number of paths from the start to that cell equals the sum of the number of paths to the cell above it (i-1,j) and to the cell on its left (i,j-1), i.e.

dp[i][j] = dp[i-1][j] + dp[i][j-1] if obstacleGrid[i][j] == 0

So we can switch to a DP solution. As before, to save space we store the states in a one-dimensional array, and the transition becomes
dp[j] += dp[j-1] if obstacleGrid[i][j] == 0

The full code is as follows:

func uniquePathsWithObstacles(obstacleGrid [][]int) int {
    if len(obstacleGrid) == 0 || len(obstacleGrid[0]) == 0 {
        return 0
    }
    m, n, count := len(obstacleGrid), len(obstacleGrid[0]), 0
    if obstacleGrid[0][0] == 1 || obstacleGrid[m-1][n-1] == 1 {
        return count
    }
    dp := make([]int, n)
    dp[0] = 1
    for i := 0; i < m; i++ {
        for j := 0; j < n; j++ {
            if obstacleGrid[i][j] == 1 {
                dp[j] = 0
            } else if j != 0 {
                dp[j] += dp[j-1]
            }
        }
    }
    return dp[n-1]
}

Runtime: 0 ms, faster than 100.00% of Go online submissions for Unique Paths II.
Memory Usage: 2.6 MB, less than 66.67% of Go online submissions for Unique Paths II.

Back to Blogging

Published on 2019-04-17

Hello, My Personal Blog

After nearly two years of work, I want to share what I have learned in my work and life.

Blogs can also help me to improve my ability to make things clearer and record my growth.

Let’s start!!!

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 11

Published on 2016-12-31

11、Application Example: Photo OCR

11.1、Photo OCR

11.1.1、problem description and pipeline

photo OCR pipeline

  1. Text detection
  2. Character segmentation
  3. Character classification

11.1.2、sliding window

Run a sliding-window detector across the whole image, window by window.

11.1.3、Getting lots of data: Artificial data synthesis

Synthesizing data by introducing distortions

  1. Distortion introduced should be representative of the type of noise/distortions in the test set.
  2. Usually does not help to add purely random/meaningless noise to your data.

11.1.4、What part of the pipeline to work on next

The point of breaking the system into a pipeline of stages is that we can easily work out which stage is the performance bottleneck and therefore deserves the most additional effort.

11_1

11_2

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 10

Published on 2016-12-23

10、Large Scale Machine Learning

10.1、Gradient Descent with Large Datasets

10.1.1、Learning with large dataset

It’s not who has the best algorithm that wins.
It’s who has the most data.

10.1.2、Stochastic gradient descent

Unlike batch gradient descent, it does not have to run through the whole dataset on every iteration.

When the dataset is large, randomly shuffle it first, then take the first example, perform a gradient descent update using that example alone, and repeat the process with each subsequent example in turn (see the sketch below).
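A minimal Go sketch of one stochastic-gradient-descent pass, assuming a linear-regression hypothesis h(x) = θ·x; the function name sgdEpoch and its arguments are illustrative, not from the course:

package notes

import "math/rand"

// sgdEpoch performs one pass of stochastic gradient descent for linear
// regression. It shuffles the examples, then updates theta after looking
// at each single example.
func sgdEpoch(X [][]float64, y []float64, theta []float64, alpha float64) {
    idx := rand.Perm(len(X)) // randomly shuffle the dataset
    for _, i := range idx {
        // Prediction error for the single example i.
        pred := 0.0
        for j := range theta {
            pred += theta[j] * X[i][j]
        }
        err := pred - y[i]
        // Gradient step using only this one example.
        for j := range theta {
            theta[j] -= alpha * err * X[i][j]
        }
    }
}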

10.1.3、Mini-batch gradient descent

A combination of batch and stochastic gradient descent: each update uses one small group (mini-batch) of examples, then the next group is used, and so on.

10.1.4、Stochastic gradient descent convergence

10_1

Learning rate α is typically held constant. It can be slowly decreased over time if we want θ to converge.

10.2、Online Learning

The methods above can be applied to real-time online learning, where the dataset has no fixed size and arrives as a data stream.

10.3、Map-reduce and data parallelism

Gradient descent contains a summation step over the training examples; map-reduce and data parallelism can split this summation across machines or cores to shorten the processing time, as sketched below.
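A rough Go sketch of that idea using goroutines; parallelSum and the worker layout are illustrative only, not the course's implementation:

package notes

import "sync"

// parallelSum splits the summation over values across `workers` goroutines
// and combines the partial sums: each "map" worker sums its slice, and the
// final addition is the "reduce" step.
func parallelSum(values []float64, workers int) float64 {
    partial := make([]float64, workers)
    var wg sync.WaitGroup
    chunk := (len(values) + workers - 1) / workers
    for w := 0; w < workers; w++ {
        lo, hi := w*chunk, w*chunk+chunk
        if lo > len(values) {
            lo = len(values)
        }
        if hi > len(values) {
            hi = len(values)
        }
        wg.Add(1)
        go func(w, lo, hi int) {
            defer wg.Done()
            for _, v := range values[lo:hi] {
                partial[w] += v
            }
        }(w, lo, hi)
    }
    wg.Wait()
    total := 0.0
    for _, p := range partial {
        total += p
    }
    return total
}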

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 9

Published on 2016-12-16

9、Anomaly detection

9.1、Density Estimation

9.1.1、Problem motivation

Density estimation: decide whether a test example is anomalous.

Anomaly detection example:

  1. Fraud detection

    xi = features of user i's activities.
    Model p(x) from data.
    Identify unusual users by checking which have p(x)<ε
    
  2. Manufacturing

  3. Monitoring computers in a data center.

    xi = features of machine i.
    memory use,number of disk access/sec,cpu load...
    

9.1.2、Gaussian distribution

9_1

9.1.3、Anomaly detection algorithm

  1. Choose features xi that might be indicative of anomalous examples.
  2. Fit parameters μ1, …, μn, σ1², …, σn²

     μj = (1/m) Σi xj(i)
     σj² = (1/m) Σi (xj(i) - μj)²

  3. Given a new example x, compute p(x); flag it as an anomaly if p(x) < ε (see the sketch below).
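A small Go sketch of these steps, assuming the per-feature Gaussian model above; fitGaussians and density are illustrative names, not from the course:

package notes

import "math"

// fitGaussians estimates mu_j and sigma2_j for every feature j from the
// (unlabeled) training set X, following steps 1-2 above.
func fitGaussians(X [][]float64) (mu, sigma2 []float64) {
    m, n := len(X), len(X[0])
    mu, sigma2 = make([]float64, n), make([]float64, n)
    for _, x := range X {
        for j, v := range x {
            mu[j] += v / float64(m)
        }
    }
    for _, x := range X {
        for j, v := range x {
            d := v - mu[j]
            sigma2[j] += d * d / float64(m)
        }
    }
    return mu, sigma2
}

// density computes p(x) as the product of per-feature Gaussian densities;
// x is flagged as an anomaly when density(x) < epsilon.
func density(x, mu, sigma2 []float64) float64 {
    p := 1.0
    for j := range x {
        d := x[j] - mu[j]
        p *= math.Exp(-d*d/(2*sigma2[j])) / math.Sqrt(2*math.Pi*sigma2[j])
    }
    return p
}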

9_2

9.1.4、Developing and evaluating an anomaly detection system

  1. The importance of real-number evaluation

For example, given 10000 good engines and 20 flawed engines, we can split the data as follows:

Training set: 6000 good engines
CV: 2000 good engines, 10 anomalous
Test: 2000 good engines, 10 anomalous

  2. Algorithm evaluation

The F1-score can be used to evaluate the algorithm, and the cross-validation set can also be used to choose the parameter ε.

9.1.5、Anomaly detection VS supervised learning

Anomaly detection:

  1. Very small number of positive example.
  2. Large number of negative example.
  3. Many different types of anomalies; it is hard to learn what anomalies look like from the few positive examples.
  4. Future anomalies may look nothing like the anomalous examples seen so far.

Supervised Learning:

  1. Large number of positive and negative examples.
  2. Enough positive examples for the algorithm to learn what they look like, and future positive examples are likely to be similar to those in the training set.

9.1.6、Multivariate Gaussian Distribution

The multivariate Gaussian is tuned through the mean vector μ and the covariance matrix Σ.

9_3

9.2 Recommender Systems

9.2.1 Content-Based Recommendations

Problem formulation:

  1. r(i,j)=1 if user j has rated movie i
  2. y(i,j)=rating by user j on movie i
  3. θ(j)=parameter vector for user j
  4. x(i)=feature vector for movie i
  5. For user j,movie i, predicted rating θ(j)T(x(i))
  6. m(j)=no. of movies rated by user j

9_4

9_5

9.2.2 Collaborative Filtering

9_6

9_7

9.2.3 Implementation Detail

Mean normalization: compute each movie's average rating and subtract it from that movie's ratings before learning (a sketch follows).
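A rough Go sketch of this mean-normalization trick together with the θᵀx prediction from 9.2.1; the names meanNormalize and predictRating and the movies × users matrix layout are my own assumptions, not from the course:

package notes

// meanNormalize subtracts each movie's average (over the ratings it actually
// has, rated[i][j] == true) from its ratings; adding the mean back at
// prediction time gives unrated users the movie's average by default.
func meanNormalize(Y [][]float64, rated [][]bool) (norm [][]float64, mu []float64) {
    mu = make([]float64, len(Y))
    norm = make([][]float64, len(Y))
    for i, row := range Y {
        count := 0
        for j, r := range row {
            if rated[i][j] {
                mu[i] += r
                count++
            }
        }
        if count > 0 {
            mu[i] /= float64(count)
        }
        norm[i] = make([]float64, len(row))
        for j, r := range row {
            if rated[i][j] {
                norm[i][j] = r - mu[i]
            }
        }
    }
    return norm, mu
}

// predictRating is the prediction theta_j · x_i plus the movie mean that was
// subtracted during normalization.
func predictRating(theta, x []float64, mui float64) float64 {
    p := mui
    for k := range theta {
        p += theta[k] * x[k]
    }
    return p
}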

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 8

Published on 2016-12-08

8、Unsupervised Learning

8.1 Clustering

8.1.1 Unsupervised Learning: Introduction

Applications of clustering:

  1. Market segmentation
  2. Social network analysis
  3. Organize computing clusters
  4. Astronomical data analysis

In unsupervised learning the data has no labels.

8.1.2 K-Means Algorithm

Input:

K (number of clusters)
Training set {x1,x2,...xm}

Algorithm:

Randomly initialize K cluster centroids u1, u2, …, uK.

Then repeat:

  1. Cluster assignment step:

    for i = 1 to m
      c(i) := index (from 1 to K) of the cluster centroid closest to xi

  2. Move centroid step (a Go sketch of one full iteration follows this list):

    for k = 1 to K
      uk := average (mean) of the points assigned to cluster k
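A small Go sketch of one iteration of the two steps above (assignment, then move-centroid); kmeansIterate and its slice-based representation are illustrative choices, not the course's code:

package notes

import "math"

// kmeansIterate runs one round of K-means: assign each point to its closest
// centroid, then move each centroid to the mean of its assigned points.
func kmeansIterate(X [][]float64, centroids [][]float64) []int {
    n := len(X[0])
    assign := make([]int, len(X))
    sums := make([][]float64, len(centroids))
    counts := make([]int, len(centroids))
    for k := range sums {
        sums[k] = make([]float64, n)
    }
    // Cluster assignment step: c(i) := index of the closest centroid.
    for i, x := range X {
        best, bestDist := 0, math.Inf(1)
        for k, c := range centroids {
            d := 0.0
            for j := range x {
                d += (x[j] - c[j]) * (x[j] - c[j])
            }
            if d < bestDist {
                best, bestDist = k, d
            }
        }
        assign[i] = best
        counts[best]++
        for j := range x {
            sums[best][j] += x[j]
        }
    }
    // Move-centroid step: u_k := mean of the points assigned to cluster k.
    for k := range centroids {
        if counts[k] == 0 {
            continue // empty cluster: keep its old centroid (or re-initialize it)
        }
        for j := range centroids[k] {
            centroids[k][j] = sums[k][j] / float64(counts[k])
        }
    }
    return assign
}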
    

8.1.3 Optimization objective

Minimize the distortion:
8_1

So the K-means algorithm can also be written as:

  1. Randomly initialize K cluster centroids u1, u2, …, uK
  2. Repeat:

    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid closest to xi
    for k = 1 to K
        uk := average (mean) of points assigned to cluster k  [minimize J]

8.1.4 Random initialization

Random initialization:

First make sure K < m;

then randomly pick K training examples;

and set u1, u2, …, uK equal to these K examples.

for i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c1, c2, ..., cm, u1, ..., uK.
    Compute the cost function (distortion).
}

Pick the clustering with the lowest cost.

8.1.5 Choosing the number of clusters

So far the best method is still to choose it by hand.

Elbow method: not every case produces a clear elbow in the plot; sometimes the cost just decreases smoothly without an obvious bend.

sometimes, you’re running K-means to get clusters to use for some later/downstream purpose.

Evaluate K-means based on a metric for how well it performs for that later purpose.

8.2 Dimensionality Reduction

This introduces a second unsupervised learning technique: dimensionality reduction.

Its two main applications are:

  1. Data Compression
  2. Data Visualization

8.2.1 Data Compression

Reduce data from 2D to 1D.

Reduce data from 3D to 2D (project the 3D points onto a plane).

8.2.2 Data Visualization

Data visualization helps people inspect the data and understand the relationships within it.

Reduce strongly correlated features: data that is originally N-dimensional is reduced to K dimensions,

where K <= N, and usually K = 2 or 3 so that the result can be plotted and inspected visually.

8.3 Principal Component Analysis

8.3.1 Principal Component Analysis Problem Formulation

Reduce from 2-dimension to 1-dimension:

Find a direction (a vector v1) onto which to project the data so as to minimize the projection error.

Reduce from n-dimension to k-dimension:

Find k vectors v1, v2, …, vk onto which to project the data, so as to minimize the projection error.

PCA is not linear regression: linear regression measures vertical distances to the fitted line, whereas PCA measures orthogonal (perpendicular) projection distances.

8_2

8.3.2 Principal Component Analysis algorithm

Data Preprocessing:

Training set: x1,x2,…xm

Preprocessing(feature scaling/mean normalization):

μj = (1/m) Σi xj(i)
xj := xj - μj

If different features on different scales, scale features to have comparable range of values.

Principal Component Analysis (PCA) algorithm

Reduce data from n-dimensions to k-dimensions

compute the “covariance matrix”: Sigma = (1/m) Σi x(i) x(i)T (summing over the m training examples)

compute “eigenvectors” of matrix sigma: [U,S,V] = svd(sigma)

Ureduce = U(:,1:k)

z = Ureduce’ * x
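A plain-Go sketch of the preprocessing and covariance step only; the SVD/eigenvector step itself would normally be delegated to a linear-algebra library and is omitted here, and covarianceMatrix is an illustrative name:

package notes

// covarianceMatrix mean-normalizes the data and computes
// Sigma = (1/m) * sum_i x_i x_i^T.
func covarianceMatrix(X [][]float64) [][]float64 {
    m, n := len(X), len(X[0])
    // Mean normalization: x_j := x_j - mu_j.
    mu := make([]float64, n)
    for _, x := range X {
        for j, v := range x {
            mu[j] += v / float64(m)
        }
    }
    sigma := make([][]float64, n)
    for j := range sigma {
        sigma[j] = make([]float64, n)
    }
    for _, x := range X {
        for j := 0; j < n; j++ {
            for k := 0; k < n; k++ {
                sigma[j][k] += (x[j] - mu[j]) * (x[k] - mu[k]) / float64(m)
            }
        }
    }
    return sigma
}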

8.4 Applying PCA

8.4.1 Reconstruction from Compressed Representation

Xapprox = Ureduce*z

8.4.2 Choosing the number of principal components

choosing k

(1)Average squared projection error: 1/m(||x1-xapprox1||2 +…)

(2)Total variation in the data : 1/m(||x1||2 + … +||xm||2)

Typically, choose k to be smallest value so that (1)/(2)<=0.01

“99% of variance is retained”

Advice for applying PCA

Supervised learning speedup

Compression:

  1. Reduce memory/disk needed to store data
  2. speed up learning algorithm

Visualization

Bad use of PCA: To prevent overfitting

fewer features, less likely to overfit.

This might work ok, but isn’t good way to address overfitting. Use regularization instead.

Before implementing PCA, first try running whatever you want to do with the original/raw data xi. Only if that doesn’t do what you want, then implement PCA and consider using zi.

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 7

Published on 2016-12-04

7、Support Vector Machines

7.1 Optimization objective

7_1
7_2

7.2 Large Margin Intuition

7_3

The larger C is, the smaller the margin between the two classes becomes.

7.3 The Mathematics behind large margin classification

First the meaning of ||u|| is explained; an inner product can be computed from projections and vector lengths.

7_4
This property can be used when working out the SVM decision boundary.

7_5

7.4 Kernels

7.4.1 Kernels I

Gaussian Kernel.

7_6
7_7

7.5 Kernels II

SVM parameters:
Large C: lower bias, higher variance.

Small C: higher bias, lower variance.

Large σ²: features f vary more smoothly. Higher bias, lower variance.

Small σ²: features f vary less smoothly. Lower bias, higher variance.
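For concreteness, a small Go version of the Gaussian kernel similarity f(x, l) = exp(-‖x−l‖² / (2σ²)); gaussianKernel is an illustrative name, not taken from any SVM package:

package notes

import "math"

// gaussianKernel computes the similarity between an example x and a landmark
// l; a larger sigma2 makes the similarity vary more smoothly (higher bias),
// a smaller sigma2 makes it more peaked (higher variance).
func gaussianKernel(x, l []float64, sigma2 float64) float64 {
    dist2 := 0.0
    for j := range x {
        d := x[j] - l[j]
        dist2 += d * d
    }
    return math.Exp(-dist2 / (2 * sigma2))
}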

7.6 Using an SVM

Use an SVM software package (liblinear, libsvm, …) to solve for the parameters θ.

Need to specify:

Choice of parameter C.
Choice of Kernel (similarity function):
    No Kernel ('linear kernel')
    Gaussian Kernel (need to choose σ²)

Note:

Do perform feature scaling before using the Gaussian kernel.

Other Choice of kernel:

Not all similarity functions similarity(x,l) make valid kernels.
(They need to satisfy Mercer's Theorem so that the SVM package's optimizations run correctly and do not diverge.)

Multi-class classification:

Train K SVMs, one per class (one-vs-rest), i.e. K of them rather than K-1.

Logistic regression vs SVM:

n= number of features, m =number of training examples

if n is large (relative to m):
    use logistic regression, or SVM without a kernel.
if n is small and m is intermediate:
    use SVM with a Gaussian Kernel.
if n is small and m is large:
    create/add more features,
    then use logistic regression or SVM without a kernel.

A neural network is likely to work well in most of these settings, but may be slower to train.

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 6

Published on 2016-11-27

6、Advice for Applying Machine Learning

6.1、Evaluating a learning Algorithm

6.1.1、Deciding What to Try Next

Which of the following statements about diagnostics are true? (BCD)

A. It’s hard to tell what will work to improve a learning algorithm, so the best approach is to go with gut feeling and just see what works.

B. Diagnostics can give guidance as to what might be more fruitful things to try to improve a learning algorithm.

C. Diagnostics can be time-consuming to implement and try, but they can still be a very good use of your time.

D. A diagnostic can sometimes rule out certain courses of action (changes to your learning algorithm) as being unlikely to improve its performance significantly.

6.1.2、Evaluating a Hypothesis

How to evaluate a hypothesis; this is the basis for the later discussion of how to avoid overfitting and underfitting.

Suppose an implementation of linear regression (without regularization) is badly overfitting the training set. In this case, we would expect:

The training error to be low, and the test error to be high.

Split the data into a training set and a test set, typically in a 70:30 ratio, selected at random. If the data is not already in random order, it is best to shuffle it before taking the 70% (a sketch follows).
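A tiny Go sketch of such a random split; trainTestSplit is an illustrative helper, not from the course:

package notes

import "math/rand"

// trainTestSplit shuffles the example indices and returns a random split
// (e.g. trainRatio = 0.7 for 70/30), so the test set is not biased by any
// ordering in the original data.
func trainTestSplit(m int, trainRatio float64) (train, test []int) {
    idx := rand.Perm(m)
    cut := int(float64(m) * trainRatio)
    return idx[:cut], idx[cut:]
}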

Misclassification error(0/1 misclassification error)

6.1.3、Model Selection and Training/Validation/Test Sets

For a given dataset, how do we decide which polynomial degree fits best? How do we choose the regularization parameter λ of the learning algorithm?

This is the model selection problem (generalization error).

Split the dataset into three parts: training set (60%), cross-validation set (20%), and test set (20%).

Training error

cross validation error

test error

Use the cross-validation set to select the model: evaluate how each hypothesis performs on it and pick the one with the smallest cross-validation error.

Consider the model selection procedure where we choose the degree of polynomial using a cross validation set. For the final model, we might generally expect Jcv to be lower than Jtest because:

An extra parameter(d,the degree of the polynomial) has been fit to the cross validation set.

6.2、Bias vs. Variance

6.2.1、Diagnosing Bias vs. Variance

High bias corresponds to underfitting.

High variance corresponds to overfitting.

6_1

6.2.2、Regularization and Bias/Variance

6_2

6.2.3、Learning Curves

Learning curves help diagnose whether a hypothesis suffers from a bias problem or a variance problem.

High Bias
If a learning algorithm is suffering from high bias, getting more training data will not help much.

6_3
So it is very useful to know whether your hypothesis suffers from high bias or high variance.

High variance
If a learning algorithm is suffering from high variance, getting more training data is likely to help.

6_4

6.2.4、Deciding What to Do Next Revisited

6_5

6.3、Machine Learning system design

6.3.1、Prioritizing what to work on

6.3.2、Error Analysis

Recommended approach:

1. Implement a basic algorithm as quickly as possible and evaluate it on the cross-validation set;

2. Plot learning curves to see whether more data or more features would improve accuracy;

3. Error analysis: manually examine the examples the algorithm misclassified and look for systematic trends.

Error analysis alone may not tell you whether a change is likely to improve performance; the only way to find out is to try it and see whether it works.

6.3.3 Error Metrics for Skewed Classes

With skewed classes we cannot judge an algorithm's performance by the error rate alone; we also need precision and recall.
6_6

6.3.4 Trading off precision and recall

Computing F1 lets us automatically pick the algorithm with the better precision/recall trade-off (a small helper follows).

F1 = 2*(P*R)/(P+R);
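A small Go helper computing precision, recall and F1 from confusion-matrix counts; f1Score is an illustrative name:

package notes

// f1Score computes precision, recall and F1 from the counts of true
// positives, false positives and false negatives on the cross-validation set.
func f1Score(tp, fp, fn int) (precision, recall, f1 float64) {
    if tp+fp > 0 {
        precision = float64(tp) / float64(tp+fp)
    }
    if tp+fn > 0 {
        recall = float64(tp) / float64(tp+fn)
    }
    if precision+recall > 0 {
        f1 = 2 * precision * recall / (precision + recall)
    }
    return
}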

6.3.5 Data for Machine Learning

Banko and Brill made the following point in 2001:

It's not who has the best algorithm that wins, 
It's who has the most data.

This view is not always correct:

When the features carry too little information to predict from (for example, predicting a house price from its size alone), more data will not improve the algorithm;

when the model has many parameters (for example, logistic or linear regression with many features, or a neural network with many hidden units), a larger training set helps reduce overfitting.

Intro to AI: Notes on Andrew Ng's "Machine Learning" Course, Week 5

Published on 2016-11-24

5、Neural Networks:Learning

5.1、Cost Function and Backpropagation

5_1

5.2、Backprogation algorithm

5_2

5.3、Gradient checking

5_3

Important:
Be sure to disable your gradient checking code before training your classifier; otherwise the code will be very slow.

5.4 Random initialization

Initialize each Theta to a random value in [-EPSILON, EPSILON] (a sketch follows).
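A small Go sketch of this random initialization; randomInit and the matrix shape are illustrative assumptions:

package notes

import "math/rand"

// randomInit fills a rows x cols weight matrix with values drawn uniformly
// from [-epsilon, epsilon], breaking the symmetry that all-zero weights cause.
func randomInit(rows, cols int, epsilon float64) [][]float64 {
    w := make([][]float64, rows)
    for i := range w {
        w[i] = make([]float64, cols)
        for j := range w[i] {
            w[i][j] = (rand.Float64()*2 - 1) * epsilon
        }
    }
    return w
}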

5.5 Put it together

5_4

5_5

Index of Notes for My Thesis

Published on 2016-11-24

Compression

The king of speed: the lz4 compression algorithm

Faster data transfer with tar + lz4 + ssh

Docker source code analysis: Docker daemon startup and teardown

Setting up a Sublime Text + LaTeX Chinese environment on macOS

Cgroup: introduction, example applications, and how it works

Linux resource management: an introduction to cgroups

CRIU support in Docker C/R

An analysis of cgroup design principles

In Linux, the data structure that manages a process is task_struct:

#ifdef CONFIG_CGROUPS
/*Control Group info protected by css_set_lock*/
struct css_set *cgroups;
/*cg_list protected by css_set_lock and tsk->alloc_lock*/
struct list_head cg_list;
#endif

Here *cgroups points to a css_set structure, which stores the cgroup information associated with the process. cg_list is an embedded list_head that links all processes attached to the same css_set into a single list.

struct css_set{
  atomic_t refcount;
  struct hlist_node hlist;
  struct list_head tasks;
  struct list_head cg_links;
  struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
  struct rcu_head rcu_head;
};

refcount is the reference count of this css_set; a css_set can be shared by several processes as long as those processes have identical cgroup information.

hlist is an embedded hlist_node used to organize all css_set objects into a hash table, so the kernel can locate a css_set quickly.

tasks heads the list of all processes attached to this css_set.

cg_links heads a list made up of struct cg_cgroup_link entries.

subsys is an array of pointers to cgroup_subsys_state; each cgroup_subsys_state holds the information relating the process to one particular subsystem. Through this array a process can reach its cgroup control information.

struct cgroup_subsys_state{
  struct cgroup *cgroup;
  atomic_t refcnt;
  unsigned long flags;
  struct css_id *id;
};

*cgroup points to a cgroup structure, that is, the cgroup the process belongs to.

task_struct-->css_set-->cgroup_subsys_state-->cgroup

This chain connects a process to its cgroup.

struct cgroup{
  unsigned long flags;
  atomic_t count;
  struct list_head sibling;
  struct list_head children;
  struct cgroup *parent;
  struct dentry *dentry;
  struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
  struct cgroupfs_root *root;
  struct cgroup *top_cgroup;
  struct list_head css_sets;
  struct list_head release_list;
  struct list_head pidlists;
  struct mutex pidlist_mutex;
  struct rcu_head rcu_head;
  struct list_head event_list;
  spinlock_t event_list_lock;
};

The embedded sibling, children and parent fields link the cgroups of one hierarchy into a cgroup tree.

subsys is an array of pointers to cgroup_subsys_state; these point to this cgroup's per-subsystem information, analogous to the array in css_set.

root points to a cgroupfs_root structure, the structure representing the hierarchy this cgroup belongs to.

top_cgroup points to the root cgroup of that hierarchy, i.e. the cgroup that was created automatically when the hierarchy was created.

css_sets heads a list made up of struct cg_cgroup_link entries.

struct cg_cgroup_link{
  struct list_head cgrp_link_list;
  struct cgroup *cgrp;
  struct list_head cg_link_list;
  struct css_set *cg;
};

cgrp_link_list links the entry into the cgroup->css_sets list, and cgrp points to the cgroup this cg_cgroup_link belongs to.

cg_link_list links the entry into the list headed by css_set->cg_links, and cg points to the css_set this cg_cgroup_link belongs to.

cgroup and css_set have a many-to-many relationship, so an intermediate structure is needed to tie them together; that is exactly what cg_cgroup_link is for. cgrp and cg form the composite key of this structure, while cgrp_link_list and cg_link_list link it into the cgroup and the css_set respectively, so that both sides can be traversed and queried.

Why is the relationship between cgroup and css_set many-to-many?

A process corresponds to exactly one css_set, and one css_set stores a group of processes' per-subsystem information, but that information does not necessarily come from a single cgroup, because a process can belong to several cgroups at the same time as long as those cgroups are in different hierarchies.

For example: create a hierarchy A with the cpu and memory subsystems attached, and let process a belong to A's root cgroup; create a hierarchy B with the ns and blkio subsystems attached, and let process a also belong to B's root cgroup. Then process a's cpu and memory information comes from A's root cgroup, while its ns and blkio information comes from B's root cgroup, so the cgroup_subsys_state entries stored in one css_set can come from several cgroups. On the other hand, a cgroup also stores a set of cgroup_subsys_state entries, obtained from the subsystems attached to its hierarchy. A cgroup can contain many processes, and their css_sets are not necessarily identical, because some of them may also have joined other cgroups; yet every process in the cgroup is governed by the cgroup_subsys_state associated with that cgroup, so one cgroup can likewise correspond to several css_sets.

From the analysis above, going from a task to its cgroups is easy, but going from a cgroup to all of its tasks requires this intermediate structure. Every process points to a css_set, and all processes associated with that css_set are linked into the css_set->tasks list; a cgroup, in turn, uses the intermediate cg_cgroup_link structure to find every css_set associated with it, and through them all of the processes associated with the cgroup.
cgroups_1

The figure above shows the overall relationship between processes and cgroups. Each P at the bottom represents a process. Every process descriptor contains a pointer to an auxiliary data structure, css_set (cgroups subsystem set). Processes that point to a given css_set are added to that css_set's process list. A process belongs to exactly one css_set, a css_set can contain multiple processes, and all processes belonging to the same css_set are subject to the resource limits associated with that css_set.

The "M×N Linkage" in the figure indicates that a css_set can be associated with cgroup nodes in a many-to-many fashion through the auxiliary structure. However, the cgroups implementation does not allow a css_set to be associated with more than one node in the same cgroup hierarchy, because cgroups does not allow more than one limit configuration for the same resource.

When a css_set is associated with nodes from several cgroup hierarchies, the processes under that css_set are being controlled for several kinds of resources. When a cgroup node is associated with several css_sets, the process lists under all of those css_sets are subject to the same limits on the same resource.
