
[TPAMI 2022] HGNN+: General Hypergraph Neural Networks


Paper link: HGNN+: General Hypergraph Neural Networks | IEEE Journals & Magazine | IEEE Xplore

The English was typed entirely by hand, summarizing and paraphrasing the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, corrections in the comments are welcome! This post is closer to reading notes, so take it with a grain of salt.

Contents

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Graph Neural Networks

2.3.2. Hypergraph Learning

2.4. Preliminaries of Hypergraphs

2.5. The Framework of Hypergraph Neural Network HGNN+

2.5.1. Hypergraph Modeling

2.5.2. Hypergraph Convolution

2.6. Discussions

2.6.1. Hypergraph vs. Graph

2.6.2. HGNN/HGNN+ vs. GNN

2.6.3. HGNN vs. HGNN+

2.7. Experiments and Discussions

2.7.1. Experimental Settings

2.7.2. Vertex Classification on the Data With Graph Structure

2.7.3. Vertex Classification on the Data Without Graph Structure

2.7.4. Vertex Classification on the Data With Hypergraph Structure

2.7.5. Visualization

2.8. THU-DeepHypergraph: An Open Toolbox of the HGNN+ Framework

2.9. Conclusion

3. Reference

1. Takeaways

(1)The classics are still worth a read

2. Section-by-Section Reading

2.1. Abstract

        ①Limitation of GNNs: feature extraction on multi-modal/multi-type data

        ②Limitation of the existing HGNN: weights are shared across modalities/types

        ③So, they propose HGNN+ and the THU-DeepHypergraph toolbox

2.2. Introduction

        ①High-order relationships:

        ②Framework of HGNN+:

        ③The conception of graph and hypergraph:

        ④Hyperedge groups: pairwise edges, attributes, k-hop neighbors, and feature-space neighbors

        ⑤Advantages of HGNN+: not only adaptive aggregation of multiple hyperedge groups, but also extension to directed hypergraphs

2.3. Related Work

2.3.1. Graph Neural Networks

        ①Lists spectral-based and spatial-based GNNs

2.3.2. Hypergraph Learning

        ①Lists the development of hypergraph learning and hypergraph neural networks

2.4. Preliminaries of Hypergraphs

        ①Notations:

For a hypergraph \mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{W}), the |\mathcal{V}|\times|\mathcal{E}| incidence matrix \mathbf{H} can be written as:

\mathbf{H}(v,e)=\left\{ \begin{array} {lr}1, & \mathrm{if~}v\in e \\ 0, & \mathrm{otherwise} \end{array}\right.

and the initial vertex features are \mathbf{X}^0=\{x_1^0,x_2^0,\cdots,x_N^0\},x_i^0\in\mathbb{R}^{C_0}, where C_0 denotes the feature dimension
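A minimal NumPy sketch of this construction (the function name build_incidence is mine, not from the paper's toolbox):

import numpy as np

def build_incidence(num_vertices, hyperedges):
    # H(v, e) = 1 if vertex v belongs to hyperedge e, 0 otherwise
    H = np.zeros((num_vertices, len(hyperedges)))
    for e, verts in enumerate(hyperedges):
        for v in verts:
            H[v, e] = 1.0
    return H

# toy hypergraph: 5 vertices, 2 hyperedges
H = build_incidence(5, [{0, 1, 2}, {2, 3, 4}])  # shape (5, 2)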

        ②Optimization goal of the node classification task:

\arg\min_f\left\{\mathcal{R}_{emp}(f)+\Omega(f)\right\}

where \Omega(f) is a regularizer on the hypergraph, \mathcal{R}_{emp}(f) denotes the supervised empirical loss, and f(\cdot) is the classification function

        ③Regularizer \Omega(f) can be represented by:

\Omega(f)=\frac{1}{2}\sum_{e\in\mathcal{E}}\sum_{\{u,v\}\in\mathcal{V}}\frac{w(e)\mathbf{H}(u,e)\mathbf{H}(v,e)}{\delta(e)}\left(\frac{f(u)}{\sqrt{d(u)}}-\frac{f(v)}{\sqrt{d(v)}}\right)^{2}

2.5. The Framework of Hypergraph Neural Network HGNN+

2.5.1. Hypergraph Modeling

(1)Hyperedge Group Generation

        ①When the data correlation comes with a graph structure:

(a top) pairwise edge

Similar to traditional graph:

\mathcal{E}_{\mathrm{pair}}=\{\{v_i,v_j\}\mid(v_i,v_j)\in\mathcal{E}_s\}

(a bottom) k-hop neighbor

The k-hop neighbor is:

N_{\mathrm{hop}_k}(v)=\{u\mid\mathbf{A}_{uv}^k\neq0,\ u\in\mathcal{V}_s\}, where k\in[2,n_v]. The hyperedge group can be represented by:

\mathcal{E}_{\mathrm{hop}_k}=\{N_{\mathrm{hop}_k}(v)\mid v\in\mathcal{V}\}
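A small sketch of this group generation, assuming a dense 0/1 adjacency matrix A (the helper name khop_hyperedges is hypothetical):

import numpy as np

def khop_hyperedges(A, k):
    # one hyperedge per vertex v: N_hop_k(v) = {u | (A^k)[u, v] != 0}
    Ak = np.linalg.matrix_power(A, k)
    return [set(np.flatnonzero(Ak[:, v])) for v in range(A.shape[0])]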

        ②When the data correlation is without graph structure:

(b top) attributes

Hyperedges are constructed from vertices sharing the same attributes (geo-location, time, and other specific information). (For example, if every vertex is a paper, survey papers share one hyperedge, research papers share another, workshop papers yet another, and so on.) The construction of the hyperedge group:

\mathcal{E}_\mathrm{attribute}=\{N_\mathrm{att}(a)\mid a\in\mathcal{A}\}

(b bottom) feature

k-nearest neighbors in the feature space (or all neighbors within distance d):

\begin{cases} \mathcal{E}_{\mathrm{feature}}^{\mathrm{KNN}_k}=\{N_{\mathrm{KNN}_k}(v)\mid v\in\mathcal{V}\} \\ \mathcal{E}_{\mathrm{feature}}^{\mathrm{distance}_d}=\{N_{\mathrm{dis}_d}(v)\mid v\in\mathcal{V}\} \end{cases}
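A brute-force sketch of the KNN variant (O(n^2) pairwise distances; the helper name is mine):

import numpy as np

def knn_hyperedges(X, k):
    # one hyperedge per vertex: the vertex plus its k nearest feature-space neighbors
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise Euclidean distances
    return [set(np.argsort(dist[v])[:k + 1]) for v in range(X.shape[0])]  # v itself sorts first (distance 0)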

(2)Combination of Hyperedge Groups

        ①Coequal fusion of each hyperedge group:

\mathbf{H}=\mathbf{H}_1\|\mathbf{H}_2\|\cdots\|\mathbf{H}_K

        ②Adaptive fusion:

\left\{ \begin{array} {l}\mathbf{w}_{k}=\operatorname{copy}(\operatorname{sigmoid}(w_{k}),M_{k}) \\ \mathbf{W}=\operatorname{diag}(\mathbf{w}_{1}^{1},\cdots,\mathbf{w}_{1}^{M_{1}},\cdots,\mathbf{w}_{K}^{1},\cdots,\mathbf{w}_{K}^{M_{K}}) \\ \mathbf{H}=\mathbf{H}_{1}\|\mathbf{H}_{2}\|\cdots\|\mathbf{H}_{K} \end{array}\right.

where w_{k}\in\mathbb{R} denotes a trainable parameter shared by all hyperedges inside hyperedge group k, \operatorname{copy}(a,b) returns a vector of size b filled with the value a, and \mathbf{W}\in\mathbb{R}^{M\times M} denotes the diagonal hyperedge weight matrix
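A PyTorch sketch of the adaptive fusion above (the class name AdaptiveFusion is mine; the paper's toolbox may implement it differently):

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, num_groups):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_groups))  # one trainable scalar per hyperedge group

    def forward(self, H_groups):  # list of |V| x M_k incidence matrices
        H = torch.cat(H_groups, dim=1)  # H = H_1 || H_2 || ... || H_K
        w = torch.cat([torch.sigmoid(wk).repeat(Hk.shape[1])  # copy(sigmoid(w_k), M_k)
                       for wk, Hk in zip(self.w, H_groups)])
        return H, torch.diag(w)  # W: M x M diagonal hyperedge weight matrix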

2.5.2. Hypergraph Convolution

(1)Spectral Convolution on Hypergraph

        ①The spectral convolution of signal \mathbf{x} and filter \mathbf{g} can be denoted as:

\mathbf{g}\star\mathbf{x}=\mathbf{\Phi}((\mathbf{\Phi}^\top\mathbf{g})\odot(\mathbf{\Phi}^\top\mathbf{x}))=\mathbf{\Phi}g(\mathbf{\Lambda})\mathbf{\Phi}^\top\mathbf{x}

where \odot denotes the element-wise Hadamard product, g(\boldsymbol{\Lambda})=\mathrm{diag}(g(\lambda_1),\ldots,g(\lambda_n))

        ②Further simplify it with a Chebyshev polynomial expansion:

\mathbf{g}\star\mathbf{x}\approx\sum_{k=0}^K\theta_kT_k(\tilde{\mathbf{\Delta}})\mathbf{x}

        ③Limit the Chebyshev expansion to K=1:

\mathbf{g}\star\mathbf{x}\approx\theta_0\mathbf{x}-\theta_1\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}\mathbf{x}

        ④Reduce the parameters to avoid overfitting and simplify the convolution:

\begin{cases} \theta_1=-\frac{1}{2}\theta \\ \theta_0=\frac{1}{2}\theta\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2} \end{cases}

\begin{gathered} \mathbf{g}\star\mathbf{x}\approx\frac{1}{2}\theta\mathbf{D}_v^{-1/2}\mathbf{H}(\mathbf{W}+\mathbf{I})\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}\mathbf{x} \\ \approx\theta\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}\mathbf{x} \end{gathered}
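As a sanity check, the final filter maps to a few lines of PyTorch (a sketch assuming ReLU for the nonlinearity; hgnn_conv is my own name):

import torch

def hgnn_conv(X, H, w, Theta):
    # sigma(D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2} X Theta)
    W = torch.diag(w)
    Dv_inv_sqrt = torch.diag((H @ w) ** -0.5)  # d(v) = sum_e w(e) H(v, e)
    De_inv = torch.diag(H.sum(dim=0) ** -1.0)  # delta(e) = sum_v H(v, e)
    return torch.relu(Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X @ Theta)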

(2)General Spatial Convolution on Hypergraph

        ①Inter-Neighbor Relation:

N=\{(v,e)\mid\mathbf{H}(v,e)=1,\ v\in\mathcal{V}\ \mathrm{and}\ e\in\mathcal{E}\}

and the vertex inter-neighbor set of hyperedge is defined as:

\mathcal{N}_v(e)=\{v\mid vNe,\ v\in\mathcal{V}\ \mathrm{and}\ e\in\mathcal{E}\}

the hyperedge inter-neighbor set of vertex is defined as:

\mathcal{N}_e(v)=\{e\mid vNe,\ v\in\mathcal{V}\ \mathrm{and}\ e\in\mathcal{E}\}

        ②Spatial hypergraph convolution (messages go from vertices to hyperedges and back):

\left\{ \begin{array}{l} m_\beta^t=\sum_{\alpha\in\mathcal{N}_v(\beta)}M_v^t(x_\alpha^t) \\ y_\beta^t=U_e^t(w_\beta,m_\beta^t) \\ m_\alpha^{t+1}=\sum_{\beta\in\mathcal{N}_e(\alpha)}M_e^t(x_\alpha^t,y_\beta^t) \\ x_\alpha^{t+1}=U_v^t(x_\alpha^t,m_\alpha^{t+1}) \end{array} \right.

where m_\beta^t is the message of hyperedge \beta\in\mathcal{E}, y_\beta^t is the feature of hyperedge \beta, an element of the hyperedge feature set Y^t=\{y_1^t,y_2^t,\cdots,y_M^t\},y_i^t\in\mathbb{R}^{C_t}, in layer t, and M_v^t(\cdot),U_e^t(\cdot),M_e^t(\cdot),U_v^t(\cdot) are the vertex message function, hyperedge update function, hyperedge message function, and vertex update function of the t-th layer

(3)HGNN+ Convolution Layer Configurations

        ①Functions in HGNN+:

\begin{cases} M_v^t(x_\alpha^t)=\frac{x_\alpha^t}{|\mathcal{N}_v(\beta)|} \\ U_e^t(w_\beta,m_\beta^t)=w_\beta\cdot m_\beta^t \\ M_e^t(x_\alpha^t,y_\beta^t)=\frac{y_\beta^t}{|\mathcal{N}_e(\alpha)|} \\ U_v^t(x_\alpha^t,m_\alpha^{t+1})=\sigma(m_\alpha^{t+1}\cdot\Theta^t) \end{cases}

        ②Hypergraph convolution in matrix form:

\mathbf{X}^{t+1}=\sigma(\mathbf{D}_v^{-1}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{X}^t\mathbf{\Theta}^t)
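A PyTorch sketch of this layer that makes the two-stage message passing explicit (the class name HGNNPlusConv and ReLU for \sigma are my assumptions):

import torch
import torch.nn as nn

class HGNNPlusConv(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # Theta^t

    def forward(self, X, H, w):  # X: |V| x C, H: |V| x |E|, w: |E| hyperedge weights
        De_inv = torch.diag(1.0 / H.sum(dim=0))  # 1/delta(e)
        Dv_inv = torch.diag(1.0 / (H @ w))       # 1/d(v)
        Y = De_inv @ H.T @ X                     # stage 1: vertex -> hyperedge messages
        Z = Dv_inv @ H @ torch.diag(w) @ Y       # stage 2: hyperedge -> vertex messages
        return torch.relu(self.theta(Z))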

2.6. Discussions

2.6.1. Hypergraph vs. Graph

        ①"from the random walks’ aspect, a hypergraph with edge-independent vertex weights is equivalent to a weighted graph, and a hypergraph with edge-dependent vertex weights cannot be reduced to a weighted graph."

        ②For hypergraphs whose vertex weights do not depend on the edges (edge-independent), \mathbf{H}\in\{0,1\}^{|\mathcal{V}|\times|\mathcal{E}|} and \mathcal{G}_{\mathrm{in}}=\{\mathcal{V},\mathcal{E},\mathbf{W}\}.

        ③For edge-dependent hypergraphs, a vertex-weight matrix \mathbf{R}\in\mathbb{R}^{|\mathcal{V}|\times|\mathcal{E}|} is added and \mathcal{G}_{\mathrm{de}}=\{\mathcal{V},\mathcal{E},\mathbf{W},\gamma\}. The vertex degree d(v) and hyperedge degree \delta(e) can be represented by:

\begin{cases} d(v)=\sum_{\beta\in\mathcal{N}_e(v)}w(\beta) \\ \delta(e)=\sum_{\alpha\in\mathcal{N}_v(e)}\gamma_e(\alpha) \end{cases}

        ④The paper then uses random walks and Markov chains to prove ①; I am no math expert, so I will not transcribe the proof here

2.6.2. HGNN/HGNN+ vs. GNN

        ①For a simple hypergraph (i.e., a traditional graph, where every hyperedge connects exactly two vertices):

\begin{cases} \mathbf{H}\mathbf{H}^\top=\mathbf{A}+\mathbf{D} \\ \mathbf{D}_e^{-1}=\frac{1}{2}\mathbf{I} \\ \mathbf{W}=\mathbf{I} \end{cases}

the HGNN convolution becomes:

\begin{aligned} \mathbf{X}^{t+1} & =\sigma(\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}\mathbf{X}^t\boldsymbol{\Theta}^t) \\ & =\sigma\left(\mathbf{D}_v^{-1/2}\mathbf{H}\left(\frac{1}{2}\mathbf{I}\right)\mathbf{H}^\top\mathbf{D}_v^{-1/2}\mathbf{X}^t\boldsymbol{\Theta}^t\right) \\ & =\sigma\left(\frac{1}{2}\mathbf{D}^{-1/2}(\mathbf{A}+\mathbf{D})\mathbf{D}^{-1/2}\mathbf{X}^t\boldsymbol{\Theta}^t\right) \\ & =\sigma\left(\frac{1}{2}(\mathbf{I}+\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2})\mathbf{X}^t\boldsymbol{\Theta}^t\right) \\ & =\sigma(\mathbf{D}^{-1/2}\hat{\mathbf{A}}\mathbf{D}^{-1/2}\mathbf{X}^t\hat{\boldsymbol{\Theta}}^t) \end{aligned}

        ②Uses rooted subtrees to draw the analogy between a 2-uniform hypergraph and a graph:

2.6.3. HGNN vs. HGNN+

        ①Difference between the two convolutions:

\begin{cases} \mathbf{X}^{t+1}\xleftarrow{\mathrm{HGNNConv}}\sigma(\underline{\mathbf{D}_v^{-1/2}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top\mathbf{D}_v^{-1/2}}\,\mathbf{X}^t\mathbf{\Theta}^t) \\ \mathbf{X}^{t+1}\xleftarrow{\mathrm{HGNNConv}^+}\sigma(\underline{\mathbf{D}_v^{-1}\mathbf{H}\mathbf{W}\mathbf{D}_e^{-1}\mathbf{H}^\top}\,\mathbf{X}^t\mathbf{\Theta}^t) \end{cases}

The upper one is symmetric message passing (on an undirected hypergraph), while the lower one is asymmetric message passing (also on an undirected hypergraph)

        ②Definition of a directed hypergraph:

\hat{\mathbf{H}}(v,e)=\begin{cases} -1 & \mathrm{if\ }v\in T(e) \\ 1 & \mathrm{if\ }v\in S(e) \\ 0 & \mathrm{otherwise} \end{cases}

where T(e) and S(e) are the sets of target and source vertices of hyperedge e

        ③Message passing on the directed hypergraph:

\mathbf{X}^{t+1}=\sigma(\mathbf{D}_t^{-1}\hat{\mathbf{H}}_t\mathbf{W}\mathbf{D}_s^{-1}\hat{\mathbf{H}}_s^\top\mathbf{X}^t\mathbf{\Theta}^t)

where

\begin{cases} \mathbf{D}_s=\operatorname{diag}(\operatorname{col\_sum}(\hat{\mathbf{H}}_s)) \\ \mathbf{D}_t=\operatorname{diag}(\operatorname{row\_sum}(\hat{\mathbf{H}}_t)) \end{cases}
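A NumPy sketch of the directed layer under my reading of the formula (the helper name is hypothetical; it assumes every hyperedge has at least one source and every vertex at least one incoming hyperedge, and ReLU for sigma):

import numpy as np

def directed_hconv(X, H_hat, w, Theta):
    # H_hat(v, e) = 1 if v in S(e), -1 if v in T(e), per the definition above
    Hs = (H_hat == 1).astype(float)         # source incidence H_hat_s
    Ht = (H_hat == -1).astype(float)        # target incidence H_hat_t
    Ds_inv = np.diag(1.0 / Hs.sum(axis=0))  # average over each edge's source vertices
    Dt_inv = np.diag(1.0 / Ht.sum(axis=1))  # average over each vertex's incoming edges
    return np.maximum(Dt_inv @ Ht @ np.diag(w) @ Ds_inv @ Hs.T @ X @ Theta, 0.0)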

2.7. Experiments and Discussions

2.7.1. Experimental Settings

        ①Learning rate: 0.01 for the 3 citation-network datasets and 0.001 for the 2 social-media-network datasets

        ②Dropout: 0.5

        ③Optimizer: Adam with 0.0005 weight decay

        ④Loss: CE:

\mathcal{L}=-\sum_{v\in V}\sum_{c=1}^CY_{vc}\log(O_{vc})

2.7.2. Vertex Classification on the Data With Graph Structure

        ①Statistics of the datasets:

        ②Performance on the 5 datasets with 5 training samples per category:

        ③Performance on the 5 datasets with 10 training samples per category:

        ④Comparison of the effectiveness of different hyperedge groups with HGNN+:

        ⑤Comparison of HGNN and HGNN+:

        ⑥Convolution strategy comparison:

2.7.3. Vertex Classification on the Data Without Graph Structure

        ①3D object datasets: ModelNet40 and NTU

        ②Performance:

2.7.4. Vertex Classification on the Data With Hypergraph Structure

        ①Datasets: Cooking-200 and MovieLens2k-v2

        ②Performance:

2.7.5. Visualization

        ①t-SNE visualization on Cooking-200:

2.8. THU-DeepHypergraph: An Open Toolbox of the HGNN+ Framework

        ~

2.9. Conclusion

        ~

3. Reference

@ARTICLE{9795251,
  author={Gao, Yue and Feng, Yifan and Ji, Shuyi and Ji, Rongrong},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={HGNN+: General Hypergraph Neural Networks}, 
  year={2023},
  volume={45},
  number={3},
  pages={3181-3199},
  keywords={Correlation;Convolution;Data models;Task analysis;Social networking (online);Mathematical models;Representation learning;Hypergraph;classification;hypergraph convolution;representation learning},
  doi={10.1109/TPAMI.2022.3182052}}
