chore: stop tracking reference directory; keep local files via .gitignore

2025-08-29 23:26:05 +08:00
parent 3cafde7171
commit e72092a8a1
41 changed files with 2 additions and 153240 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,4 @@
 reference/
 .venv/
 reference/
--- a/reference/.DS_Store
+++ b/reference/.DS_Store
--- a/reference/Geo-Layout-Transformer.md
+++ b/reference/Geo-Layout-Transformer.md
@@ -1,316 +0,0 @@
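 # Geo-Layout Transformer Technology Roadmap: A Unified, Self-Supervised Foundation Model for Physical Design Analysis
 ## Abstract
 This report lays out a comprehensive technology roadmap for next-generation physical design analysis tools in electronic design automation (EDA). As semiconductor process nodes shrink into the nanometer regime, traditional back-end verification tools built on heuristic rules are struggling to cope with growing design complexity, the dominance of interconnect parasitics, and severe process variability. Longer design cycles and bottlenecks in power, performance, and area (PPA) optimization are pushing the industry toward a fundamental paradigm shift.
 We propose the Geo-Layout Transformer, a novel, unified hybrid architecture that combines graph neural networks (GNNs) with Transformers and aims to reshape back-end analysis by learning deep, context-aware representations of physical layouts. Its core strategy is to pre-train on massive amounts of unlabeled GDSII layout data under a self-supervised learning (SSL) paradigm, producing a reusable "physical design foundation model". The approach is meant to move EDA tooling from a collection of isolated, task-specific solutions toward a centralized "layout understanding engine" that transfers across tasks.
 The potential of the Geo-Layout Transformer is to be validated in three key back-end applications:
 1. **Predictive hotspot detection:** By capturing long-range physical effects and global layout context, the model is expected to outperform traditional pattern-matching and convolutional neural network (CNN) methods, raising detection accuracy while substantially lowering the false-alarm rate.
 2. **High-speed connectivity verification:** Connectivity problems (opens and shorts) are recast as link prediction and anomaly detection on a graph. The model's grasp of global topology is expected to make verification faster and more precise than traditional geometric design rule checking (DRC).
 3. **Structural layout matching and reuse:** By learning a structural similarity metric over layouts, the model supports efficient retrieval of IP blocks, detection of design plagiarism, and faster analog layout migration, improving design reuse.
 The report covers the theoretical basis of the Geo-Layout Transformer, its hybrid model architecture, a feasibility analysis for the applications above, and a phased implementation roadmap. The roadmap runs from data curation and foundation model development through task-specific fine-tuning to deployment at scale, and it identifies likely technical challenges together with mitigation strategies. We believe that investing in the Geo-Layout Transformer would give EDA vendors and semiconductor design companies a durable technical advantage and data moat, and would help move physical design automation toward a data-driven, deep-learning-based approach.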
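 ## 1. A Paradigm Shift in Physical Design Analysis: From Heuristics to Learned Representations
 ### 1.1. The Scaling Wall: Limits of Traditional EDA in the Nanometer Era
 As process nodes shrink at an unprecedented pace, back-end design of very-large-scale integration (VLSI) circuits is running into a "scaling wall" built from physical laws and manufacturing cost [1]. Smaller transistors bring exponential growth in design complexity: hundreds of millions of devices are integrated on a single chip, which puts traditional EDA methodologies under heavy pressure [4]. In the deep-submicron era, performance is no longer determined by the transistors alone. Interconnect parasitics (resistance and capacitance) have become the dominant factor and strongly affect timing, power, and signal integrity [3]. At the same time, severe process variability narrows the design window sharply, which makes yield and reliability hard to guarantee.
 Against this background, the limits of traditional EDA tools are increasingly visible. Most of them rely on hand-crafted heuristic rules and algorithms that oversimplify complex physical interactions. To reach design closure, for example, engineers usually run several place-and-route iterations to optimize wirelength, timing, and congestion [5]. The process depends heavily on engineering experience, is time-consuming and computationally inefficient, and often ends with suboptimal PPA [4].
 Physical verification is where the problem is most concentrated. Take lithography hotspot detection: to ensure manufacturability, every layout pattern that is sensitive to process variation (a hotspot) must be found before tape-out. The most accurate method is full-chip lithography simulation, but its cost is very high. A single full run can take days or even weeks, which is unacceptable in a modern agile design flow [7]. This bottleneck forces a painful trade-off between accuracy and speed and slows down innovation.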
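 ### 1.2. The Rise of Machine Learning in Physical Design Automation
 To cope with the complexity of modern designs, integrating machine learning (ML) into the EDA flow has become a natural next step [1]. ML models, deep learning models in particular, are good at learning complex nonlinear relationships from large datasets, which makes them well suited to optimization and prediction problems that traditional algorithms handle poorly [12]. In recent years, ML-based methods have shown the potential to surpass state-of-the-art (SOTA) traditional methods on several EDA tasks.
 Concrete examples include:
 * **Placement guidance:** Frameworks such as PL-GNN use GNNs for unsupervised node representation learning on netlists and tell commercial placers which instances should be placed together, in order to optimize wirelength and timing [5].
 * **Congestion prediction:** Models such as CongestionNet use a GNN to predict routing congestion at the logic synthesis stage from the synthesized netlist alone, so that back-end difficulties can be avoided early [13].
 * **Graph partitioning:** GNNs have also been applied to circuit partitioning, learning to split a large graph into balanced subsets while minimizing cut edges, which is essential for multilevel placement and routing [14].
 These successes have led to a general end-to-end pipeline for applying GNNs to integrated circuit (IC) design. The pipeline gives a structured methodology with four stages: input circuit representation, circuit-to-graph conversion, GNN layer construction, and downstream task processing [11]. This framework provides a formal basis for systematically developing more advanced and more unified layout analysis models such as the Geo-Layout Transformer proposed here.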
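 ### 1.3. The Key Shift in Layout Representation: From Pixels to Graphs
 In early work on applying ML to layout analysis, the most intuitive approach was to treat layout clips as images and apply CNNs, which had been very successful in computer vision [8]. This image-based approach turns problems such as hotspot detection into image classification. It has had some success, but it has fundamental flaws. First, a CNN requires fixed-size input, which is a serious restriction for layout patterns of varying size and shape; cropping or padding is usually needed and may discard key information [8]. Second, layouts are sparse, with most of the area empty, so a dense pixel grid is computationally wasteful. Most importantly, the CNN architecture carries a Euclidean inductive bias (it assumes data lives on a regular grid), so it cannot directly capture the non-Euclidean, relational structure of a circuit, such as physical adjacency and electrical connection between components [14].
 To overcome these limits, the field has come to see that the natural representation of circuits and layouts is a graph, in which physical components (polygons, vias) are nodes and their physical or electrical relationships are edges [8]. GNNs are designed for exactly this kind of irregular, graph-structured data, which makes them fundamentally better suited to layout analysis than CNNs [14]. The representation captures the underlying topology and connectivity of the design, which is essential for accurate physical design analysis.
 The move from CNNs to GNNs is a conceptual shift: from viewing a layout as a static "picture" to treating it as a "system of relations". A CNN has to infer geometric relations implicitly and inefficiently from pixel patterns, whereas a GNN receives them explicitly through its edge definitions [20]. Aligning the data structure with the model architecture brings more efficient learning, better generalization, and more semantically meaningful representations. This change of viewpoint is the foundation on which the Geo-Layout Transformer is built.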
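 **Table 1: Comparison of layout representation modalities**
 | Representation | Core concept | Strengths | Weaknesses | Main EDA applications |
 | --- | --- | --- | --- | --- |
 | **Image-based (CNN)** | Layout is a pixel grid | Reuses mature computer vision architectures | Fixed input size; inefficient on sparse data; ignores explicit connectivity; not natively invariant to rotation or scaling | Early hotspot detection |
 | **Graph-based (GNN/Transformer)** | Layout is a graph of nodes (shapes) and edges (relations) | Handles irregular geometry natively; captures topology and connectivity; sparse and scalable; permutation and rotation equivariance by design | Data preparation (graph construction) is more complex | All proposed tasks (hotspots, connectivity, matching) and beyond |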
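 ## 2. Foundations: GNNs and Transformers for VLSI Data
 ### 2.1. Graph Neural Networks: Encoding Local Structure and Connectivity
 The core operating principle of a GNN is the message passing paradigm [14]. A GNN builds the representation of a node by recursively aggregating feature information from its local neighborhood [8]. In each round, a node collects information from its direct neighbors and combines it with its own state to produce an updated state. By stacking K layers, each node becomes aware of information within its K-hop neighborhood. This mechanism matches the physical reality of a VLSI layout well: it learns how a layout element is influenced by its immediate geometric and electrical surroundings.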
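The K-hop behavior described above can be illustrated with a small sketch. It is plain Python with a fixed mean aggregator and a fixed 50/50 update; both are assumptions made for illustration, since a trained GNN would use learnable weights and a nonlinearity. The function names are hypothetical.

```python
def message_passing_layer(features, neighbors):
    """One round of mean-aggregation message passing.

    features:  dict mapping node -> list of floats
    neighbors: dict mapping node -> list of neighbor nodes
    """
    updated = {}
    for node, own in features.items():
        nbrs = neighbors.get(node, [])
        if nbrs:
            # Average each feature dimension over the direct neighbors.
            agg = [sum(features[n][i] for n in nbrs) / len(nbrs)
                   for i in range(len(own))]
        else:
            agg = [0.0] * len(own)
        # Combine the node's own state with the aggregated message.
        updated[node] = [0.5 * o + 0.5 * a for o, a in zip(own, agg)]
    return updated


def run_layers(features, neighbors, k):
    """Stack k message passing rounds."""
    for _ in range(k):
        features = message_passing_layer(features, neighbors)
    return features


# Three shapes in a chain: a - b - c. Only "a" carries a signal initially.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
x0 = {"a": [1.0], "b": [0.0], "c": [0.0]}
one_hop = run_layers(x0, chain, 1)
two_hop = run_layers(x0, chain, 2)
```

After one round, node `c` has seen nothing of `a`; only after the second round does the signal arrive, which is the K-hop receptive field in miniature.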
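 Several GNN architectures have been applied successfully in EDA and show strong local structure encoding:
 * **GraphSAGE:** Known for its inductive learning ability, it can handle nodes not seen during training. In placement, GraphSAGE has been used for unsupervised node representation learning to capture the logical affinity of a netlist and guide commercial placers [5].
 * **Graph Attention Network (GAT):** GAT adds an attention mechanism that lets the model assign different weights to different neighbors during aggregation. This is especially useful in complex physical settings. In clock network timing analysis, for example, several drivers contribute differently to the delay at a sink, and GAT can learn these differing importances [18].
 * **Relational Graph Convolutional Network (R-GCN):** Real VLSI layouts are heterogeneous, with several node types (metal polygons, vias, cells) and several edge types (adjacency, connectivity). R-GCN handles such heterogeneous graphs by using a separate learnable transformation matrix per relation type, which matters for modeling real layouts accurately [8].
 Although GNNs are good at encoding local information, they have inherent challenges, and these are the main motivation for bringing in a Transformer:
 * **Over-smoothing:** This is one of the most critical limits of GNNs. In a deep GNN, as the number of message passing layers grows, the feature representations of all nodes tend to converge to the same value and nodes become hard to tell apart [14]. This makes it difficult for a GNN to capture long-range dependencies between nodes.
 * **Scalability and performance:** During neighbor aggregation, irregular memory access makes a GNN memory-bandwidth bound on large, chip-scale graphs. This is an engineering problem that a high-performance model has to solve [10].
 * **Generalization to unseen graphs:** A core difficulty in EDA is ensuring that a model trained on particular circuits generalizes to entirely new designs never seen in training [13].
 ### 2.2. The Transformer Architecture: Capturing Global Context and Long-Range Dependencies
 The core of the Transformer is self-attention, a mechanism that works by computing pairwise interactions between all elements of a set [22]. Unlike local message passing in a GNN, self-attention lets the model relate any two input elements within a single layer, no matter how far apart they are in the sequence. This lets a Transformer model long-range dependencies efficiently and directly addresses the limited receptive field and over-smoothing of GNNs [23].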
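The pairwise interaction at the heart of self-attention can be written out in a few lines. The sketch below is single-head scaled dot-product attention in plain Python, with identity query, key, and value projections; that simplification is an assumption made to keep the example short, since a real layer learns those projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V (illustrative only)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Score the query against every token, near or far.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted sum of all token vectors.
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(d)])
    return outputs, weights


# Tokens 0 and 2 are identical but sit at opposite ends of the sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended, attn = self_attention(tokens)
```

Token 0 attends to token 2 exactly as strongly as to itself, regardless of how far apart they sit. This shows both the strength (distance costs nothing) and the weakness (position is invisible) that motivates the positional encodings discussed next.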
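 Applying a Transformer to two-dimensional geometric data such as a VLSI layout raises a key problem, however. Standard self-attention is permutation-equivariant: it treats its input as an unordered set, so once layout elements are tokenized, all of the crucial spatial position information is lost [24]. The solution is to inject position information explicitly, that is, **2D positional encoding**.
 Choosing a positional encoding for geometric data such as VLSI layouts is not a minor implementation detail. It is a central feature engineering decision that determines how well the model understands geometry, because each scheme injects strong priors about the nature of space and distance.
 * **Absolute positional encoding (APE):** Assigns a unique vector to the (x, y) coordinate of each element, using fixed sine/cosine functions or learnable embeddings [24]. APE gives each element a sense of position in a global coordinate system, which matters for effects that depend on global chip position (for example, near the IO region versus the core) [26].
 * **Relative positional encoding (RPE):** Encodes the relative distance and direction between pairs of elements directly into the attention computation [27]. This works well for tasks dominated by local geometric rules (for example, spacing rules in hotspot detection, or device matching in analog circuits) [26].
 * **Advanced schemes:** More elaborate encodings have appeared recently, such as rotary position embedding (RoPE), noted for its rotation properties [26], and semantic-aware position encoding (SaPE), which considers feature similarity as well as geometric distance [28].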
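To make the difference between the absolute and relative schemes concrete, here is a minimal sketch. It assumes the fixed sine/cosine variant of APE with the x and y encodings concatenated, and it represents the RPE input simply as a displacement vector. The function names, the base of 10000, and the dimension split are illustrative choices, not taken from the cited works.

```python
import math


def sincos_1d(value, dim):
    """Fixed sinusoidal encoding of one coordinate into `dim` numbers (dim even)."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        enc.extend([math.sin(value * freq), math.cos(value * freq)])
    return enc


def absolute_encoding_2d(x, y, dim=8):
    """APE: concatenate the encodings of x and y, half the dimensions each."""
    return sincos_1d(x, dim // 2) + sincos_1d(y, dim // 2)


def relative_offset(p, q):
    """RPE input: displacement from p to q, unchanged if both points shift."""
    return (q[0] - p[0], q[1] - p[1])


origin = absolute_encoding_2d(0.0, 0.0)
shifted = absolute_encoding_2d(10.0, 10.0)
near_core = relative_offset((1, 2), (4, 6))
near_io = relative_offset((11, 12), (14, 16))
```

The absolute encoding changes when a pattern moves across the chip, while the relative offset between its two shapes does not. A hybrid scheme would give the attention mechanism access to both signals.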
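 GNNs and Transformers are not competing architectures for layout analysis; they are complementary. A GNN can be seen as a powerful "spatial convolver" that learns local, translation-invariant physical rules through a shared message passing function, which suits local geometric features such as DRC violations or simple hotspot patterns [8]. Complex problems such as IR drop or critical-path timing violations, however, can arise from interactions between physically distant components. A GNN would need an impractically deep network to propagate such long-range influence, which leads to over-smoothing [18]. By contrast, Transformer self-attention can connect these distant components in a single step and model the global field effects inherent in VLSI design [23].
 The best architecture is therefore hierarchical: a GNN first produces rich, locally aware feature embeddings, which are then passed to a Transformer that reasons about their global interdependencies. We expect this combination to be more efficient and more effective than either paradigm alone. On this basis, we propose that the Geo-Layout Transformer use a **hybrid positional encoding scheme** that combines absolute and relative encodings. This would let the attention mechanism learn, depending on task and context, which spatial frame of reference matters most.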
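 ## 3. Architectural Blueprint of the Geo-Layout Transformer
 ### 3.1. Core Idea: A Hybrid Model for Hierarchical Feature Extraction
 The core design idea is a multi-stage hybrid architecture that processes layout data hierarchically. The pipeline is meant to mirror how a design expert analyzes a layout: from the geometric attributes of individual shapes, to the patterns formed by local groups of shapes, to global, system-level interactions. The architecture is explicitly a fusion of GNN and Transformer and directly reflects the complementarity principle above: the GNN learns local features and the Transformer handles global context and reasoning [23].
 To justify this choice, the table below compares the trade-offs of the candidate architectures.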
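 **Table 2: Architectural trade-offs: GNN vs. Transformer vs. hybrid model**
 | Architecture | Local context | Global context | Computational cost | Main inductive bias | Suitability for VLSI layout |
 | --- | --- | --- | --- | --- | --- |
 | **Pure GNN** | Excellent (message passing) | Poor (limited by over-smoothing) | Efficient (linear in the number of edges) | Strong locality and relational bias | Good for local patterns, unsuited to chip-level effects |
 | **Pure Transformer** | Weak (no built-in locality) | Excellent (self-attention) | Poor (quadratic in the number of nodes) | Weak; permutation-equivariant | Impractical on raw polygons; ignores local geometric rules |
 | **Geo-Layout Transformer (hybrid)** | Excellent (GNN encoder) | Excellent (Transformer backbone) | Manageable (quadratic in the number of supernodes aggregated by the GNN) | Combines local relational bias with global attention | Best fit; builds a hierarchical representation from both |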
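 ### 3.2. Stage 1: The GDSII-to-Graph Conversion Pipeline
 This is the first step in giving structure to the raw geometry, and it is the foundation of the whole model.
 * **Parsing:** Build a robust data ingestion pipeline that parses GDSII or OASIS files with a high-performance open-source library such as gdstk. gdstk is chosen for its C++ backend and its Boolean operations, which matter when handling complex layout geometry [31]. Libraries such as python-gdsii also offer a flexible Python interface [33].
 * **Heterogeneous graph representation:** To capture layout information fully, we propose a rich heterogeneous graph schema with several node and edge types:
  + **Node types:** Polygon, Via, CellInstance, Port. The distinction lets the model recognize different physical entities [8].
  + **Edge types:** Adjacency (physical proximity on the same layer), Connectivity (polygons connected through a via), Containment (polygons inside a cell), NetMembership (all shapes belonging to the same logical net). Together these capture relations between layout elements along several dimensions.
 * **Rich feature engineering:** Define a comprehensive feature set for nodes and edges:
  + **Geometric features:** Normalized bounding box coordinates, area, aspect ratio, shape complexity (for example, vertex count) [8].
  + **Layer features:** A learnable embedding vector for each metal, via, and device layer.
  + **Electrical features (optional, from the netlist):** Precomputed parasitics, cell type from the standard cell library, net fanout, and so on [18].
  + **Hierarchy features:** An embedding of the parent cell or module in the design hierarchy, because instances that share hierarchy tend to be more tightly connected and have more impact on placement quality [5].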
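The graph schema above can be sketched on a toy example. The code below assumes that shapes have already been extracted from the GDSII file as axis-aligned bounding boxes (parsing itself is out of scope here), and it covers only two of the four edge types. The `Shape` class, the `max_spacing` threshold, and the rule that a via connects to any polygon it overlaps are simplifications introduced for illustration; a real flow would also check which layer pair a via layer joins.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    name: str
    kind: str    # "Polygon" or "Via"
    layer: str
    bbox: tuple  # (x0, y0, x1, y1)


def geometric_features(shape):
    """Area and aspect ratio from the bounding box."""
    x0, y0, x1, y1 = shape.bbox
    w, h = x1 - x0, y1 - y0
    return {"area": w * h, "aspect_ratio": w / h}


def gap(a, b):
    """Distance between two boxes (0 if they touch or overlap)."""
    dx = max(a.bbox[0] - b.bbox[2], b.bbox[0] - a.bbox[2], 0)
    dy = max(a.bbox[1] - b.bbox[3], b.bbox[1] - a.bbox[3], 0)
    return (dx * dx + dy * dy) ** 0.5


def build_edges(shapes, max_spacing):
    """Emit typed edges: same-layer Adjacency and via Connectivity."""
    edges = []
    for i, a in enumerate(shapes):
        for b in shapes[i + 1:]:
            if a.kind == "Polygon" and b.kind == "Polygon":
                if a.layer == b.layer and gap(a, b) <= max_spacing:
                    edges.append((a.name, b.name, "Adjacency"))
            elif a.kind != b.kind and gap(a, b) == 0:
                # Exactly one of the two is a via and it overlaps the polygon.
                edges.append((a.name, b.name, "Connectivity"))
    return edges


shapes = [
    Shape("m1_a", "Polygon", "M1", (0, 0, 10, 2)),
    Shape("m1_b", "Polygon", "M1", (0, 3, 10, 5)),
    Shape("via", "Via", "V1", (8, 0, 9, 1)),
    Shape("m2", "Polygon", "M2", (8, 0, 9, 20)),
]
edges = build_edges(shapes, max_spacing=1.5)
feats = geometric_features(shapes[0])
```

The pairwise loop is quadratic and only suitable for a small clip; a production pipeline would use a spatial index to find nearby shapes.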
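 ### 3.3. Stage 2: GNN Encoder for Local Neighborhood Aggregation
 This stage acts as a learnable feature engineering module that replaces the hand-designed feature extractors of traditional methods. We propose an encoder built from **multiple layers of a relational graph attention network (R-GAT)**. This combines the attention mechanism of GAT (weighting neighbors by importance) with the ability of R-GCN to handle multiple edge types, which suits the heterogeneous graph defined above. The output of this stage is a set of rich node embeddings, for example 512-dimensional. Each vector condenses the context of its layout element and its K-hop neighborhood, and these vectors become the input tokens of the Transformer in the next stage.
 ### 3.4. Stage 3: Transformer Backbone for Global Layout Understanding
 This is the core reasoning engine of the model. It processes the sequence of context-aware node embeddings from the GNN encoder.
 * **Positional encoding integration:** Before the first Transformer layer, each node embedding is summed with its hybrid 2D positional encoding vector (combining absolute and relative components).
 * **Architecture:** A standard Transformer encoder made of stacked multi-head self-attention (MHSA) and feed-forward layers. MHSA lets every layout element interact with every other element and so captures long-range physical effects such as across-wafer variation, long-path timing, and power network IR drop, which are invisible to purely local models. The approach is directly inspired by successful layout analysis Transformers such as LUM and FAM [7].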
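 ### 3.5. Stage 4: Task-Specific Heads for Downstream Applications
 The globally aware node embeddings from the Transformer backbone are fed into simple, lightweight neural network heads that make the actual predictions. This modular design lets the same core model serve several applications by swapping or adding heads.
 * **Connectivity head:** A simple binary classifier (for example, a multilayer perceptron, MLP) that takes the embeddings of two nodes and predicts the probability that they are connected (link prediction).
 * **Matching head:** A graph pooling layer (for example, the GlobalMaxPool used in [8]) that aggregates all node embeddings in a layout window into a single graph-level vector. That vector is then used in a similarity learning framework based on triplet loss, similar to LayoutGMN [35].
 * **Hotspot head:** A simple node-level classifier (MLP) that predicts the probability that a node (representing a polygon) belongs to a hotspot region.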
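 ### 3.6. Training Strategy: Building a "Foundation Model" with Self-Supervised Learning
 In EDA, obtaining large, high-quality labeled datasets is a major bottleneck because labeling is expensive and design data is confidential intellectual property (IP) [9]. To get around this, we propose a two-stage training paradigm aimed at a reusable "physical design foundation model".
 * Stage 1: Self-supervised pre-training:
  This is the core of the strategy. We use large amounts of unlabeled GDSII data to pre-train the full GNN-Transformer backbone. The proposed pretext task is masked layout modeling, inspired by masked autoencoders in computer vision and by similar work on self-supervised learning for analog layout [36]. Concretely, we randomly mask a portion of the layout elements (for example, by zeroing their features or replacing them with a special mask token) and train the model to predict the original features of the masked elements (such as geometry and layer) from the surrounding context. This forces the model to learn the basic "grammar" and regularities of physical design without any manual labels.
 * Stage 2: Supervised fine-tuning:
  After pre-training, the backbone has a strong, general understanding of layouts. We can then fine-tune it with a much smaller labeled dataset for each task, for example a few thousand known hotspot samples for the hotspot head. This transfer learning approach greatly reduces the data and training time needed to build a high-performing model for a new task or a new process node [36].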
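The masking step of the pretext task can be sketched as follows. The mask token (`None`), the mean-squared-error objective, and the function names are assumptions made for this illustration; the model that would produce the predictions is left out.

```python
import random

MASK = None  # hypothetical mask token standing in for hidden features


def mask_nodes(features, ratio, seed=0):
    """Hide the features of a random subset of nodes.

    Returns the corrupted feature dict and the reconstruction targets.
    """
    rng = random.Random(seed)
    names = sorted(features)
    k = max(1, round(ratio * len(names)))
    masked = set(rng.sample(names, k))
    corrupted = {n: (MASK if n in masked else f) for n, f in features.items()}
    targets = {n: features[n] for n in masked}
    return corrupted, targets


def reconstruction_loss(predictions, targets):
    """Mean squared error, computed over the masked nodes only."""
    total, count = 0.0, 0
    for name, target in targets.items():
        for p, t in zip(predictions[name], target):
            total += (p - t) ** 2
            count += 1
    return total / count


# Eight toy nodes with [width, height] features.
layout = {f"poly{i}": [float(i), 1.0] for i in range(8)}
corrupted, targets = mask_nodes(layout, ratio=0.25)
perfect = reconstruction_loss(layout, targets)
```

Scoring only the masked nodes is what makes the task non-trivial: the model cannot copy its input and must infer the hidden shape from its neighbors.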
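 This layered design gives a strong and interpretable data processing pipeline. Stage 1 turns raw geometry into a graph. Stage 2 uses a GNN to learn local physical rules and can be viewed as a "semantic compressor" that represents a complex local cluster of polygons as a single rich feature vector. The Transformer in stage 3 then operates on these higher-level and far less numerous semantic tokens, which makes global attention computationally feasible. It no longer compares raw shapes; it compares whole neighborhood contexts. This hierarchy mirrors how a human expert reads a layout and is also the key to the efficiency of the model.
 From a business perspective, self-supervised pre-training is the most important element of the roadmap. Most academic work is limited to supervised learning on public benchmarks [8], which may not reflect the complexity of advanced industrial designs. An EDA vendor or a large semiconductor company, by contrast, holds decades of proprietary, unlabeled GDSII data amounting to multiple petabytes. The proposed SSL strategy unlocks the value of this dormant asset and allows a foundation model to learn real-world layout patterns across many process nodes. This creates a strong competitive advantage, or "data moat", because competitors and startups would find it very hard to assemble training data of the same scale and diversity.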
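 ## 4. Feasibility Analysis and Application Deep Dives
 The unified representation of the Geo-Layout Transformer lets it be applied flexibly to several key back-end analysis tasks. By designing a task-specific head for each and fine-tuning, the model can address problems that appear unrelated.
 ### 4.1. Application 1: High-Accuracy Connectivity Verification
 * **Problem definition:** Traditional connectivity verification relies on DRC tools that check for opens and shorts through geometric operations. We recast it as a learning task on a graph:
  + **Link prediction:** Verify net integrity by predicting whether a connectivity edge exists between adjacent polygons and vias. A missing predicted edge may indicate an open [40].
  + **Node anomaly detection:** Identify shorts by detecting unexpected links between nodes that belong to different nets. This turns a geometric problem into a graph topology problem that maps directly onto predicting manufacturing defects such as opens and shorts [7].
 * **Methodology:** Use the connectivity head of the fine-tuned Geo-Layout Transformer for prediction. The global context from the Transformer backbone is essential here: it allows long nets that run across the chip to be traced accurately and helps identify complex shorts caused by interactions between distant routes.
 * **Expected performance:** The method is expected to be significantly faster than traditional geometric DRC tools and circuit simulators [18]. A learned model can capture subtle physical interactions that purely rule-based systems often miss (for example, capacitive coupling), which should improve accuracy [21].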
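Once the connectivity head has produced a set of predicted edges, turning them into open and short reports is a plain graph problem. The sketch below assumes the edge probabilities have already been thresholded into a list of edges and that each node carries its intended net name; it then uses a union-find structure to group connected shapes. The data layout and function names are hypothetical.

```python
def find(parent, x):
    """Find the component root of x, with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def check_connectivity(net_of, predicted_edges):
    """Report opens and shorts implied by a set of predicted edges.

    net_of:          dict mapping node -> intended net name
    predicted_edges: iterable of (node, node) pairs judged connected
    """
    parent = {n: n for n in net_of}
    for a, b in predicted_edges:
        parent[find(parent, a)] = find(parent, b)

    comps_per_net, nets_per_comp = {}, {}
    for node, net in net_of.items():
        root = find(parent, node)
        comps_per_net.setdefault(net, set()).add(root)
        nets_per_comp.setdefault(root, set()).add(net)

    # Open: one net is split across several components.
    opens = sorted(net for net, comps in comps_per_net.items() if len(comps) > 1)
    # Short: one component contains shapes from several nets.
    shorts = sorted(tuple(sorted(nets)) for nets in nets_per_comp.values()
                    if len(nets) > 1)
    return opens, shorts


net_of = {"p1": "VDD", "p2": "VDD", "p3": "CLK", "p4": "CLK", "p5": "GND"}
predicted = [("p1", "p2"), ("p4", "p5")]
opens, shorts = check_connectivity(net_of, predicted)
```

Here `CLK` is reported open because `p3` is isolated from `p4`, and `CLK`/`GND` are reported shorted because `p4` and `p5` land in the same component.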
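 ### 4.2. Application 2: Structural Layout Matching and Reuse
 * **Problem definition:** This application is a graph similarity learning task. Given a query layout (for example, an analog IP block), the goal is to retrieve structurally similar layout blocks from a large database.
 * **Methodology:**
  + We borrow directly from the architecture and training method of the LayoutGMN model [35].
  + The matching head of the fine-tuned model produces a single embedding vector for any given layout window.
  + Similarity between layouts can be computed efficiently as the cosine distance between these embeddings in a low-dimensional space.
  + Training uses a triplet loss with Intersection-over-Union (IoU) as a weak supervision signal to generate samples (high-IoU pairs as positives, low-IoU pairs as negatives). This avoids expensive manual labeling of "similar" layouts [35].
 * **Expected performance:** The structural understanding learned through graph matching is expected to outperform simple pixel-based (IoU) or hand-crafted feature comparisons. This would support IP block identification, design plagiarism detection, and faster process migration of analog layouts.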
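The training recipe in the methodology above combines three small pieces: an IoU score, a threshold rule that turns the score into a weak label, and a triplet loss over embeddings. The sketch below shows each one. For brevity it computes IoU between two single rectangles rather than between whole rasterized layouts, and the thresholds (0.7 and 0.3) and margin (0.2) are placeholder values chosen for illustration, not values from [35].

```python
def rect_iou(a, b):
    """Intersection-over-Union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def weak_label(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Turn an IoU score into a weak label; ambiguous pairs are skipped."""
    if iou >= pos_thresh:
        return "positive"
    if iou <= neg_thresh:
        return "negative"
    return None


def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer than the negative by `margin`."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


iou = rect_iou((0, 0, 2, 2), (1, 0, 3, 2))
good = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
bad = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

A well-separated triplet contributes no loss, while a triplet whose negative sits closer than its positive is penalized by the distance gap plus the margin.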
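 ### 4.3. Application 3: Predictive Hotspot Detection
 * **Problem definition:** Hotspot detection is a node classification task on the layout graph. Each node (a polygon or a critical region) is classified as "hotspot" or "non-hotspot".
 * **Methodology:**
  + Use the hotspot head of the fine-tuned Geo-Layout Transformer for classification.
  + Train and validate on recognized public benchmarks (ICCAD 2012 and the more challenging ICCAD 2019/2020) to allow direct, quantitative comparison with SOTA methods [8].
 * **Expected performance and advantages:**
  + **Better context awareness:** The global receptive field of the Transformer is its key advantage. It can model long-range physical phenomena such as optical proximity effects, etch loading effects, and layout density variation, which influence hotspot formation but are invisible to local pattern matchers and to pure CNN or GNN models [7].
  + **Lower false-alarm rate:** By understanding wider layout context, the model can better separate patterns that are geometrically similar but where one is benign and the other harmful, reducing the high false-alarm rates that affect current methods [8].
  + **Stronger generalization:** SSL pre-training gives the model strong priors about valid layout patterns, so it should detect novel, previously unseen hotspot types more effectively than a model trained only on a fixed library of known hotspots [48].
 The greatest value of the Geo-Layout Transformer is that it provides a **single, unified representation** for these three seemingly separate applications. In current EDA flows, DRC/LVS (connectivity), IP management (matching), and DFM (hotspots) are handled by different, highly specialized tools and teams. The Geo-Layout Transformer rests on the view that the core intellectual challenge in all three, a deep understanding of the geometric and electrical semantics of a layout, is fundamentally the same. Once a strong foundation model solves this representation learning problem, building each application tool reduces to fine-tuning a specific head. This points to a strategic shift in EDA R&D, from building isolated point solutions to creating a core "layout understanding engine" that is reused across the back-end flow.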
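 ## 5. Proposed Implementation Roadmap
 ### 5.1. Phase 1: Data Curation and Foundation Model Development
 * **Task 1: Build a scalable GDSII-to-graph pipeline.**
  + Evaluate and select a high-performance library such as gdstk (C++/Python), which is favored for its speed and advanced geometric operations [31].
  + Develop a parallel data processing pipeline that converts terabytes of GDSII data into the proposed heterogeneous graph format, optimized for storage and loading in ML frameworks such as PyTorch Geometric.
 * **Task 2: Curate and process datasets.**
  + Systematically download, parse, and prepare public benchmark datasets for fine-tuning and evaluation, including ICCAD 2012/2019/2020 for hotspot detection [39] and related circuit datasets from resources such as the GNN4IC hub [11].
  + Start a large-scale internal data curation effort covering diverse, proprietary, unlabeled GDSII designs across several process nodes. This data is the fuel for self-supervised pre-training.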
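 **Table 3: Public datasets available for model training and benchmarking**
 | Dataset | Main task | Description and key features | Data format | Source/reference |
 | --- | --- | --- | --- | --- |
 | **ICCAD 2012 Contest** | Hotspot detection | Widely used benchmark, though its patterns are considered relatively simple | Layout clips | [8] |
 | **ICCAD 2019/2020** | Hotspot detection | More challenging; includes modern via-layer hotspots and better reflects current DFM problems | Layout clips | [39] |
 | **RPLAN / Rico (UI)** | Layout matching | Floorplan and user interface datasets for training structural similarity models | JSON/images | [46] |
 | **CircuitNet** | Timing, routability, IR drop | Large-scale dataset with netlists and post-route data; usable for pre-training on related physical design tasks | Bookshelf, SPEF | [51] |
 | **GNN4IC Hub Benchmarks** | Various (security, reliability, EDA) | Curated collection of benchmarks for IC-related GNN tasks | Various | [11] |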
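 * **Task 3: Develop and train the self-supervised foundation model.**
  + Implement the proposed hybrid GNN-Transformer backbone.
  + Implement the masked layout modeling self-supervised task [36].
  + Secure and configure the high-performance computing (HPC) infrastructure needed for training at this scale (for example, a cluster of A100/H100 GPUs).
 ### 5.2. Phase 2: Fine-Tuning and Validation on Target Applications
 * **Task 1: Develop and fine-tune task-specific heads.**
  + Implement lightweight prediction heads for connectivity, matching, and hotspot detection.
  + Run systematic fine-tuning experiments on labeled public and proprietary datasets.
 * **Task 2: Rigorous benchmarking and ablation studies.**
  + For each application, compare the fine-tuned model directly with published SOTA results (for example, against [8] for hotspot detection and against [35] for matching).
  + Run thorough ablation studies to validate the key architectural decisions empirically (for example, the effect of the GNN encoder, the contribution of each positional encoding type, the value of pre-training).
 * **Task 3: Develop model interpretability tools.**
  + Implement visualization of Transformer attention maps so that designers can see which parts of the layout the model attended to for a given prediction. This matters for debugging and for building user trust [15].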
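 ### 5.3. Phase 3: Scaling, Optimization, and Integration
 * **Task 1: Address full-chip scalability.**
  + Investigate and implement graph partitioning and sampling techniques (for example, Cluster-GCN, GraphSAINT) so that the model can handle full-chip layouts that exceed the memory of a single GPU [10].
  + Investigate model optimization techniques such as quantization and knowledge distillation to produce smaller, faster models for interactive use.
 * **Task 2: Develop an API for EDA tool integration.**
  + Design and build a robust, versioned API that lets existing EDA tools (layout editors, verification platforms) call the Geo-Layout Transformer for on-demand analysis.
 * **Task 3: Pilot deployment and continual learning.**
  + Start a pilot with selected design teams and integrate the model into their workflow.
  + Set up a feedback loop that collects wrong predictions and hard cases for ongoing fine-tuning and improvement.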
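 ### 5.4. Identified Challenges and Mitigation Strategies
 * **Data imbalance:** Critical events (hotspots, DRC violations) are inherently rare in the data.
  + **Mitigation:** Use suitable loss functions (such as focal loss) and data sampling strategies (oversampling rare events), and frame the problem as anomaly detection [9].
 * **Compute cost:** Training a large foundation model consumes substantial resources.
  + **Mitigation:** Use sparse attention in the Transformer, efficient graph data structures, and dedicated hardware accelerators. SSL pre-training is a one-time cost that is amortized over several downstream tasks [2].
 * **Model interpretability (the "black box" problem):** Designers are reluctant to trust predictions that come without a reasonable explanation.
  + **Mitigation:** Prioritize interpretability tools such as attention visualization and feature attribution, so that predictions come with actionable insight [15].
 * **IP and data privacy:** Design data is highly confidential.
  + **Mitigation:** The SSL foundation model approach is the main mitigation, because it lets an organization train on its own private data. For collaboration across organizations, federated learning is a possible future direction [16].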
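As an illustration of the data imbalance mitigation listed above, the binary focal loss can be written as $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, where $p_t$ is the predicted probability of the true class. The sketch below evaluates it for a single prediction; the default values $\gamma = 2$ and $\alpha = 0.25$ are commonly used settings and are assumptions here, not values prescribed by this roadmap.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class (hotspot), 0 < p < 1
    y: true label, 1 for hotspot and 0 for non-hotspot
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # confident and correct
hard = focal_loss(0.1, 1)   # confident and wrong
plain = focal_loss(0.9, 1, gamma=0.0, alpha=0.5)  # reduces to 0.5 * cross-entropy
```

With $\gamma = 0$ the modulating factor disappears and the loss is a scaled cross-entropy. With $\gamma = 2$ the easy example is down-weighted by a factor of $(1 - 0.9)^2 = 0.01$, so the many easy non-hotspots contribute little and the rare hard cases dominate the gradient.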
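 ## 6. Conclusion and Outlook
 The Geo-Layout Transformer is a strategic technology direction for the EDA industry. It unifies several separate back-end analysis tasks through one general, deep-learned representation. The roadmap in this report argues for its technical feasibility and for its potential return on investment through shorter design cycles and higher chip quality.
 Looking ahead, a successful Geo-Layout Transformer would open up further possibilities for physical design automation:
 * **Extension to more tasks:** Extend the unified model to other key back-end analyses such as routability prediction, IR drop analysis, and detailed timing prediction.
 * **From analysis to synthesis:** Use the learned representation inside a generative framework (such as diffusion models or GANs) to generate optimized, correct-by-construction layout patterns automatically, moving from verifying designs to generating them.
 * **Multimodal EDA:** The long-term vision is a model that integrates layout graphs with other design modalities, such as logic netlist graphs and textual design specifications. This would give a cross-domain understanding of the whole chip design process and support a more automated and efficient design flow [53].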
 #### Works Cited
 1. Feature Learning and Optimization in VLSI CAD - CSE, CUHK, <http://www.cse.cuhk.edu.hk/~byu/papers/PHD-thesis-2021-Hao-Geng.pdf>
 2. Integrating Deep Learning into VLSI Technology: Challenges and Opportunities, <https://www.researchgate.net/publication/385798085_Integrating_Deep_Learning_into_VLSI_Technology_Challenges_and_Opportunities>
 3. AI and machine learning-driven optimization for physical design in advanced node semiconductors, <https://wjarr.com/sites/default/files/WJARR-2022-0415.pdf>
 4. Machine Learning in Physical Verification, Mask Synthesis, and Physical Design - Yibo Lin, <https://yibolin.com/publications/papers/ML4CAD_Springer2018_Pan.pdf>
 5. VLSI Placement Optimization using Graph Neural Networks - ML For Systems, <https://mlforsystems.org/assets/papers/neurips2020/vlsi_placement_lu_2020.pdf>
 6. Cross-Stage Machine Learning (ML) Integration for Adaptive Power, Performance and Area (PPA) Optimization in Nanochips - International Journal of Communication Networks and Information Security (IJCNIS), <https://www.ijcnis.org/index.php/ijcnis/article/view/8511/2549>
 7. Learning-Driven Physical Verification - CUHK CSE, <http://www.cse.cuhk.edu.hk/~byu/papers/PHD-thesis-2024-Binwu-Zhu.pdf>
 8. Efficient Hotspot Detection via Graph Neural Network - CUHK CSE, <https://www.cse.cuhk.edu.hk/~byu/papers/C134-DATE2022-GNN-HSD.pdf>
 9. Application of Deep Learning in Back-End Simulation: Challenges and Opportunities, <https://www.ssslab.cn/assets/papers/2022-chen-backend.pdf>
 10. Accelerating GNN Training through Locality-aware Dropout and Merge - arXiv, <https://arxiv.org/html/2506.21414v1>
 11. Graph Neural Networks: A Powerful and Versatile Tool for ... - arXiv, <https://arxiv.org/pdf/2211.16495>
 12. Seminar Series 2022/2023 - CUHK CSE, <https://www.cse.cuhk.edu.hk/events/2022-2023/>
 13. Generalizable Cross-Graph Embedding for GNN-based Congestion Prediction - arXiv, <http://arxiv.org/pdf/2111.05941>
 14. VLSI Hypergraph Partitioning with Deep Learning - arXiv, <https://arxiv.org/html/2409.01387v1>
 15. Interpretable CNN-Based Lithographic Hotspot Detection Through Error Marker Learning - hkust (gz), <https://personal.hkust-gz.edu.cn/yuzhema/papers/J25-TCAD2025-INT-HSD.pdf>
 16. The composition of ICCAD 2012 benchmark suite. - ResearchGate, <https://www.researchgate.net/figure/The-composition-of-ICCAD-2012-benchmark-suite_tbl1_358756986>
 17. Full article: Advances in spatiotemporal graph neural network prediction research, <https://www.tandfonline.com/doi/full/10.1080/17538947.2023.2220610>
 18. GATMesh: Clock Mesh Timing Analysis using Graph Neural ... - arXiv, <https://arxiv.org/html/2507.05681>
 19. Recent Research Progress of Graph Neural Networks in Computer Vision - MDPI, <https://www.mdpi.com/2079-9292/14/9/1742>
 20. Graph Neural Network and Some of GNN Applications: Everything You Need to Know, <https://neptune.ai/blog/graph-neural-network-and-some-of-gnn-applications>
 21. ParaGraph: Layout Parasitics and Device Parameter Prediction using Graph Neural Networks - Research at NVIDIA, <https://research.nvidia.com/sites/default/files/pubs/2020-07_ParaGraph%3A-Layout-Parasitics/057_4_Paragraph.pdf>
 22. Positional Embeddings in Transformer Models: Evolution from Text to Vision Domains | ICLR Blogposts 2025 - Cloudfront.net, <https://d2jud02ci9yv69.cloudfront.net/2025-04-28-positional-embedding-19/blog/positional-embedding/>
 23. A Survey of Graph Transformers: Architectures, Theories and Applications - arXiv, <https://arxiv.org/pdf/2502.16533>
 24. Exploring Spatial-Based Position Encoding for Image Captioning - MDPI, <https://www.mdpi.com/2227-7390/11/21/4550>
 25. A Gentle Introduction to Positional Encoding in Transformer Models, Part 1 - MachineLearningMastery.com, <https://machinelearningmastery.com/a-gentle-introduction-to-positional-encoding-in-transformer-models-part-1/>
 26. s-chh/2D-Positional-Encoding-Vision-Transformer - GitHub, <https://github.com/s-chh/2D-Positional-Encoding-Vision-Transformer>
 27. SpatialFormer: Towards Generalizable Vision Transformers with Explicit Spatial Understanding, <https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/02019.pdf>
 28. A 2D Semantic-Aware Position Encoding for Vision Transformers - arXiv, <https://arxiv.org/html/2505.09466v1>
 29. Hybrid GNN and Transformer Models for Cross-Domain Entity Resolution in Cloud-Native Applications - ResearchGate, <https://www.researchgate.net/publication/394486311_Hybrid_GNN_and_Transformer_Models_for_Cross-Domain_Entity_Resolution_in_Cloud-Native_Applications>
 30. The architecture of GNN Transformers. They can be seen as a combination... - ResearchGate, <https://www.researchgate.net/figure/The-architecture-of-GNN-Transformers-They-can-be-seen-as-a-combination-of-Graph_fig18_373262042>
 31. Gdstk (GDSII Tool Kit) is a C++/Python library for creation and manipulation of GDSII and OASIS files. - GitHub, <https://github.com/heitzmann/gdstk>
 32. purdue-onchip/gds2Para: GDSII File Parsing, IC Layout Analysis, and Parameter Extraction - GitHub, <https://github.com/purdue-onchip/gds2Para>
 33. Welcome to python-gdsii's documentation! - Pythonhosted.org, <https://pythonhosted.org/python-gdsii/>
 34. python-gdsii - PyPI, <https://pypi.org/project/python-gdsii/>
 35. LayoutGMN: Neural Graph Matching for ... - CVF Open Access, <https://openaccess.thecvf.com/content/CVPR2021/papers/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper.pdf>
 36. [2503.22143] A Self-Supervised Learning of a Foundation Model for Analog Layout Design Automation - arXiv, <https://arxiv.org/abs/2503.22143>
 37. [2301.08243] Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture - arXiv, <https://arxiv.org/abs/2301.08243>
 38. [2210.10807] Self-Supervised Representation Learning for CAD - arXiv, <https://arxiv.org/abs/2210.10807>
 39. Hotspot Detection via Attention-based Deep Layout Metric Learning - CUHK CSE, <https://www.cse.cuhk.edu.hk/~byu/papers/C106-ICCAD2020-Metric-HSD.pdf>
 40. HashGNN - Neo4j Graph Data Science, <https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/hashgnn/>
 41. Efficient Hotspot Detection via Graph Neural Network | Request PDF - ResearchGate, <https://www.researchgate.net/publication/360732290_Efficient_Hotspot_Detection_via_Graph_Neural_Network>
 42. PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids - arXiv, <https://arxiv.org/html/2503.22721v1>
 43. PowerGNN: A Topology-Aware Graph Neural Network for Electricity Grids - arXiv, <https://arxiv.org/pdf/2503.22721>
 44. LayoutGMN: Neural Graph Matching for Structural Layout Similarity | Request PDF, <https://www.researchgate.net/publication/346973286_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity>
 45. Neural Graph Matching for Pre-training Graph Neural Networks - Binbin Hu, <https://librahu.github.io/data/GMPT_SDM22.pdf>
 46. agp-ka32/LayoutGMN-pytorch: Pytorch implementation of ... - GitHub, <https://github.com/agp-ka32/LayoutGMN-pytorch>
 47. Autoencoder-Based Data Sampling for Machine Learning-Based Lithography Hotspot Detection, <https://www1.aucegypt.edu/faculty/kseddik/ewExternalFiles/Tarek_MLCAD_22_AESamplingMLHotSpotDet.pdf>
 48. Efficient Layout Hotspot Detection via Neural Architecture Search - CUHK CSE, <https://www.cse.cuhk.edu.hk/~byu/papers/J66-TODAES2022-NAS-HSD.pdf>
 49. Lithography Hotspot Detection Method Based on Transfer Learning Using Pre-Trained Deep Convolutional Neural Network - MDPI, <https://www.mdpi.com/2076-3417/12/4/2192>
 50. DfX-NYUAD/GNN4IC: Must-read papers on Graph Neural ... - GitHub, <https://github.com/DfX-NYUAD/GNN4IC>
 51. CIRCUITNET 2.0: AN ADVANCED DATASET FOR PROMOTING MACHINE LEARNING INNOVATIONS IN REALISTIC CHIP DESIGN ENVIRONMENT, <https://proceedings.iclr.cc/paper_files/paper/2024/file/464917b6103e074e1f9df7a2bf3bf6ba-Paper-Conference.pdf>
 52. GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation - arXiv, <https://arxiv.org/html/2507.07414v1>
 53. The Dawn of AI-Native EDA: Promises and Challenges of Large Circuit Models - arXiv, <https://arxiv.org/html/2403.07257v1>
 54. (PDF) Large circuit models: opportunities and challenges - ResearchGate, <https://www.researchgate.net/publication/384432502_Large_circuit_models_opportunities_and_challenges>
--- a/reference/LayoutGMN.md
+++ b/reference/LayoutGMN.md
@@ -1,267 +0,0 @@
 # LayoutGMN: Neural Graph Matching for Structural Layout Similarity
 Akshay Gadi Patil 1 Manyi Li1† Matthew Fisher2 Manolis Savva1 Hao Zhang1
 1Simon Fraser University 2Adobe Research
 # Abstract
 We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, using an attention-based GMN designed under a triplet network setting. To train our network, we utilize weak labels obtained by pixel-wise Intersection-over-Union (IoUs) to define the triplet loss. Importantly, LayoutGMN is built with a structural bias which can effectively compensate for the lack of structure awareness in IoUs. We demonstrate this on two prominent forms of layouts, viz., floorplans and UI designs, via retrieval experiments on large-scale datasets. In particular, retrieval results by our network better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-the-art method based on graph neural networks and image convolution. In addition, LayoutGMN is the first deep model to offer both metric learning of structural layout similarity and structural matching between layout elements.
 # 1. Introduction
 Two-dimensional layouts are ubiquitous visual abstractions in graphic and architectural designs. They typically represent blueprints or conceptual sketches for such data as floorplans, documents, scene arrangements, and UI designs. Recent advances in pattern analysis and synthesis have propelled the development of generative models for layouts [11, 25, 47, 15, 26] and led to a steady accumulation of relevant datasets [48, 42, 10, 46]. Despite these developments, however, there have been few attempts at employing a deeply learned metric to reason about layout data, e.g., for retrieval, data embedding, and evaluation. For example, current evaluation protocols for layout generation still rely heavily on segmentation metrics such as intersection-over-union (IoU) [15, 30] and human judgement [15, 26].
 The ability to compare data effectively and efficiently is arguably the most foundational task in data analysis. The key challenge in comparing layouts is that it is not purely a task of visual comparison — it depends critically on inference and reasoning about structures, which are expressed by the semantics and organizational arrangements of the elements or subdivisions which compose a layout. Hence, none of the well-established image-space metrics, whether model-driven, perceptual, or deeply learned, are best suited to measure structural layout similarity. Frequently applied similarity measures for image segmentation such as IoUs and F1 scores all perform pixel-level matching “in place” — they are not structural and can be sensitive to element misalignments which are structure-preserving.
 ![](images/516817b84bdaf3db241d1a3b87d316578c8f2d9adb29bb8a247a3e00042ba1d0.jpg)  
 Figure 1. LayoutGMN learns a structural layout similarity metric between floorplans and other 2D layouts, through attention-based neural graph matching. The learned attention weights (numbers shown in the boxes) can be used to match the structural elements.
 In this work, we develop a deep neural network to predict structural similarity between two 2D layouts, e.g., floorplans or UI designs. We take a predominantly structural view of layouts for both data representation and layout comparison. Specifically, we represent each layout using a directed, fully connected graph over its semantic elements. Our network learns structural layout similarity via neural graph matching, where an attention-based graph matching network [27] is designed under a triplet network setting. The network, coined LayoutGMN, takes as input a triplet of layout graphs, composed together by one pair of anchor-positive and one pair of anchor-negative graphs, and performs intra-graph message passing and cross-graph information communication per pair, to learn a graph embedding for layout similarity prediction. In addition to returning a metric, the attention weights learned by our network can also be used to match the layout elements; see Figure 1.
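The graph representation mentioned above, a directed and fully connected graph over the semantic elements of a layout, can be sketched as follows. The edge feature used here (the center-to-center offset between two bounding boxes) is an assumption for illustration only; the actual node and edge features of the paper are defined in its Section 3.1, which is not reproduced in this excerpt.

```python
from itertools import permutations


def center(box):
    """Center point of a box given as (x0, y0, x1, y1)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def layout_graph(elements):
    """Build a directed, fully connected graph over layout elements.

    elements: dict mapping element name -> bounding box (x0, y0, x1, y1)
    Returns the node list and a dict of directed edges with offset features.
    """
    nodes = list(elements)
    edges = {}
    for src, dst in permutations(nodes, 2):
        (sx, sy), (dx, dy) = center(elements[src]), center(elements[dst])
        # Directed feature: where dst sits relative to src.
        edges[(src, dst)] = (dx - sx, dy - sy)
    return nodes, edges


rooms = {
    "bedroom": (0, 0, 4, 4),
    "kitchen": (4, 0, 8, 4),
    "bath": (0, 4, 4, 6),
}
nodes, edges = layout_graph(rooms)
```

A layout with $n$ elements yields $n(n-1)$ directed edges, and the two directions of a pair carry different features, which is why the graph is directed rather than undirected.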
 ![](images/76179359f537652a648a8d2094196e528e584399d6cb01cf8f854181aa609e51.jpg)  
 Figure 2. Structure matching in LayoutGMN “neutralizes” IoU feedback. In each example (left: floorplan; right: UI design), a training sample $N$ labeled as “Negative” by IoU is more structurally similar to the anchor ($A$) than $P$, a “Positive” sample. With structure matching, our network predicts a smaller $A$-to-$N$ distance than $A$-to-$P$ distance in each case, which contradicts IoU.
 To train our triplet network, it is natural to consider human labeling of positive and negative samples. However, it is well-known that subjective judgements by humans over structured data such as layouts are often unreliable, especially with non-experts [45, 2]. When domain experts are employed, the task becomes time-consuming and expensive [45, 2, 14, 9, 20, 41], where discrepancies among even these experts still remain [14]. In our work, we avoid this issue by resorting to weakly supervised training of LayoutGMN, which obtains positive and negative labels from the training data through thresholding using layout IoUs [30].
 The motivations behind our network training using IoUs are three-fold, despite the IoU’s shortcomings for structural matching. First, as one of the most widely-used layout similarity measures [30, 15], IoU does have its merits. Second, IoUs are objective and much easier to obtain than expert annotations. Finally and most importantly, our network has a built-in inductive bias to enforce structural correspondence, via inter-graph information exchange, when learning the graph embeddings. The inductive bias results from an attention-based graph matching mechanism, which learns structural matching between two graphs at the node level (Eq 3, 6). Such a structural bias can effectively compensate for the lack of structure awareness in the IoU-based triplet loss during training. In Figure 2, we illustrate the effect of this structural bias on the metric learned by our network. Observe that the last two layouts are more similar structurally than the first two. This is agreed with by our metric LayoutGMN, but not by IoU feedback.
 We evaluate our network on retrieval tasks over large datasets of floorplans and UI designs, via Precision@$k$ scores, and investigate the stability of the proposed metric by checking retrieval consistency between a query and its top-1 result, over many such pairs; see Sec. 5.2. Overall, retrieval results by LayoutGMN better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-the-art method based on graph neural networks [30]. Finally, we show a label transfer application for floorplans enabled by the structure matching learned by our network (Sec 5.5).
 # 2. Related Work
 Layout analysis. Early works [18, 3] on document analysis involved primitive heuristics to analyse document structures. Organizing a large collection of such structures into meaningful clusters requires a distance measure between layouts, which typically involved content-based heuristics [34] for documents and constrained graph matching algorithm for floorplans [40]. An improved distance measure relied on rich layout representation obtained using autoencoders [7, 29], operating on an entire UI layout. Although such models capture rich raster properties of layout images, layout structures are not modeled, leading to noisy recommendations in contextual search over layout datasets.
 Layout generation. Early works on synthesizing 2D layouts relied on exemplars [16, 23, 37] and rule-based heuristics [33, 38], and were unable to capture complex element distributions. The advent of deep learning led to generative models of layouts of floorplans [42, 15, 5, 32], documents [25, 11, 47], and UIs [7, 6]. Perceptual studies aside, evaluation of generated layouts, in terms of diversity and generalization, has mostly revolved around IoUs of the constituent semantic entities [25, 11, 15]. While IoU provides a visual similarity measure, it is expensive to compute over a large number of semantic entities, and is sensitive to element positions within a layout. Developing a tool for structural comparison would perhaps complement visual features in contextual similarity search. In particular, a learning-based method that compares layouts structurally can prove useful in tasks such as layout correspondence, component labeling and layout retargeting. We present a Layout Graph Matching Network, called LayoutGMN, for learning to compare two graphical layouts in a structured manner.
 Structural similarity in 3D. Fisher et al. [8] develop Graph Kernels for characterizing structural relationships in 3D indoor scenes. Indoor scenes are represented as graphs, and the Graph Kernel compares substructures in the graphs to capture similarity between the corresponding scenes. A challenging problem of organizing a heterogeneous collection of such 3D indoor scenes was accomplished in [43] by focusing on a subscene, and using it as a reference point for distance measures between two scenes. Shape Edit Distance, SHED, [22] is another fine-grained sub-structure similarity measure for comparing two 3D shapes. These works provide valuable cues on developing an effective structural metric for layout similarity. Graph Neural Networks (GNN) [28, 21, 4, 36] model node dependencies in a graph via message passing, and are the perfect tool for learning on structured data. GNNs provide coarse-level graph embeddings, which, although useful for many tasks [39, 1, 17, 19], can lose useful structural information in contextual search, if each graph is processed in isolation. We make use of Graph Matching Network [27] to retain structural correspondence between layout elements.
 ![](images/f0a4eb226a10834e1fc610ecbc06337c5ffae80644cf03814bb2d4bf0775005e.jpg)  
 Figure 3. Given an input floorplan image with room segmentations in (a), we abstract each room into a bounding box and obtain layout features from the constituent semantic elements, as shown in (b). These features form the initial node and edge features (Section 3.1) of the corresponding layout graph shown in (c).
 GNNs for structural layout similarity. To the best of our knowledge, the recent work by Manandhar et al. [30] is the first to leverage GNNs to learn structural similarity of 2D graphical layouts, focusing on UI layouts with rectangular boundaries. They employ a GCN-CNN architecture on a graph of UI layout images, also under an IoU-trained triplet network [13], but obtain the graph embeddings for the anchor, positive, and negative graphs independently.
 In contrast, LayoutGMN learns the graph embeddings in a dependent manner. Through cross-graph information exchange, the embeddings are learned in the context of the anchor-positive (respectively, the anchor-negative) pair. This is a critical distinction to GCN-CNN [30], while both train their triplet networks using IoUs. However, since IoU does not involve structure matching, it is not a reliable measure of structural similarity, leading to labels which are considered “structurally incorrect”; see Figure 2.
 In addition, our network does not perform any convolutional processing over layout images; it only involves eight MLPs, placing more emphasis on learning finer-scale structural variations for graph embedding, and less on image-space features. We clearly observe that the cross-graph communication module in our GMNs does help in learning finer graph embeddings than the GCN-CNN framework [30]. Finally, another advantage of moving away from any reliance on image alignment is that similarity predictions by our network are more robust against highly varied, non-rectangular layout boundaries, e.g., for floorplans.
 # 3. Method
 The Graph Matching Network (GMN) [27] consumes a pair of graphs, processes the graph interactions via an attention-based cross-graph communication mechanism, and produces graph embeddings for the two input graphs, as shown in Fig 4. Our LayoutGMN plugs the Graph Matching Network into a Triplet backbone architecture for learning a (pseudo) metric space for similarity on 2D layouts such as floorplans, UIs, and documents.
 ![](images/939bcda0c0c4de7dc9855979ac03e34cc2fece15e7d532d2941505334eb83594.jpg)  
 Figure 4. LayoutGMN takes two layout graphs as input, performs intra-graph message passing (Eq. 2), along with cross-graph information exchange (Eq. 3) via an attention mechanism (Eq. 5, also visualized in Figure 1) to update node features, from which final graph embeddings are obtained (Eq. 7).
 # 3.1. Layout Graphs
 Given a layout image of height $H$ and width $W$ with semantic annotations, we abstract each element into a bounding box, which form the nodes of the resulting layout graph. Specifically, for a layout image $I _ { 1 }$ , its layout graph $G _ { l }$ is given by $G _ { l } ~ = ~ ( V , E )$ , where the node set $V =$ $\{ v _ { 1 } , v _ { 2 } , . . . , v _ { n } \}$ represents the semantic elements in the layout, and $E = \left\{ e _ { 1 2 } , . . . , e _ { i j } , . . , e _ { n \left( n - 1 \right) } \right\}$ , the edge set, represents the set of edges connecting the constituent elements. Our layout graphs are directed and fully-connected.
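As a minimal sketch of the graph structure described above, the directed, fully-connected edge set over $n$ elements is simply every ordered pair of distinct node indices, giving $n(n-1)$ edges. The function name `fully_connected_edges` is a hypothetical helper introduced here for illustration only.

```python
from itertools import permutations


def fully_connected_edges(num_nodes):
    """Return every ordered pair (i, j) with i != j.

    Hypothetical helper: a directed, fully-connected graph over
    num_nodes elements has num_nodes * (num_nodes - 1) edges.
    """
    return list(permutations(range(num_nodes), 2))


edges = fully_connected_edges(4)
print(len(edges))  # 12 directed edges for 4 nodes
```

Both directions of each pair are kept because the edge features defined later in this section are asymmetric (they are normalized by the area of the source box).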
 Initial Node Features. There exist a variety of visual and content-based features that could be incorporated as the initial node features, e.g., the text data/font size/font type of a UI element or the image features of a room in a floorplan. For structured learning tasks such as ours, we ignore such content-based features and only focus on the box abstractions. Specifically, similar to [11, 12], the initial node features contain semantic and geometric information of the layout elements. As shown in Fig 3, for a layout element $k$ centered at $( x _ { k } , y _ { k } )$ , with dimensions $( w _ { k } , h _ { k } )$ , its geometric information is:
 $$
 g _ { k } = \left[ { \frac { x _ { k } } { W } } , { \frac { y _ { k } } { H } } , { \frac { w _ { k } } { W } } , { \frac { h _ { k } } { H } } , { \frac { w _ { k } h _ { k } } { \sqrt { W H } } } \right] .
 $$
 Instead of one-hot encoding of the semantics, we use a learnable embedding layer to embed a semantic type into a 128-D code, $s _ { k }$ . A two-layer MLP embeds the $5 \times 1$ geometric vector $g _ { k }$ into a 128-D code, and is concatenated with the 128-D semantic embedding $s _ { k }$ to form the initial node features $U = \{ { \pmb u } _ { 1 } , { \pmb u } _ { 2 } , . . . , { \pmb u } _ { n } \}$ .
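The geometric vector $g_k$ can be computed directly from a box and the layout size. The sketch below follows the formula exactly as printed above (including its fifth term, $w_k h_k / \sqrt{WH}$); the function name and argument order are assumptions made for illustration, and the learnable embedding layer and MLP are not modeled.

```python
import math


def geometric_features(x, y, w, h, W, H):
    """Return the 5-D geometric vector g_k for one layout element.

    (x, y) is the box center, (w, h) its size, and (W, H) the
    layout width and height. Hypothetical helper, mirroring the
    formula for g_k as written in the text.
    """
    return [
        x / W,
        y / H,
        w / W,
        h / H,
        (w * h) / math.sqrt(W * H),
    ]


# A 20x40 box centered at (50, 100) in a 200x200 layout.
print(geometric_features(50, 100, 20, 40, 200, 200))
```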
 Initial Edge Features. In visual reasoning and relationship detection tasks, edge features in a graph are designed to capture relative difference of the abstracted semantic entities (represented as nodes) [12, 44]. Thus, for an edge $e _ { i j }$ , we capture the spatial relationship (see Fig 3) between the semantic entities by an $8 \times 1$ vector:
 $$
 e _ { i j } = \left[ \frac { \Delta x _ { i j } } { \sqrt { A _ { i } } } , \frac { \Delta y _ { i j } } { \sqrt { A _ { i } } } , \sqrt { \frac { A _ { j } } { A _ { i } } } , U _ { i j } , \frac { w _ { i } } { h _ { i } } , \frac { w _ { j } } { h _ { j } } , \frac { \sqrt { \Delta x ^ { 2 } + \Delta y ^ { 2 } } } { \sqrt { W ^ { 2 } + H ^ { 2 } } } , \theta \right] ,
 $$
 where $A _ { i }$ is the area of the element box $i$ ; $U _ { i j } = \frac { B _ { i } \cap B _ { j } } { B _ { i } \cup B _ { j } }$ is the IoU of the bounding boxes of the layout elements $i , j$ ; $\theta = \mathrm{atan2} ( \Delta y _ { i j } , \Delta x _ { i j } )$ is the relative angle between the two components, $\theta \in [ - \pi , \pi ]$ ; $\Delta x _ { i j } = x _ { j } - x _ { i }$ and $\Delta y _ { i j } = y _ { j } - y _ { i }$ . This edge vector accounts for the translation between the two layout elements, in addition to encoding their box IoUs, individual aspect ratios and relative orientation.
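The eight edge terms can be sketched as follows, assuming each box is given as a `(cx, cy, w, h)` tuple with a center and a size. The helper names `box_iou` and `edge_features` are hypothetical, and the IoU is computed on box areas.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def edge_features(box_i, box_j, W, H):
    """Return the 8-D edge vector e_ij for the directed edge i -> j."""
    xi, yi, wi, hi = box_i
    xj, yj, wj, hj = box_j
    dx, dy = xj - xi, yj - yi
    area_i, area_j = wi * hi, wj * hj
    return [
        dx / math.sqrt(area_i),                  # normalized x offset
        dy / math.sqrt(area_i),                  # normalized y offset
        math.sqrt(area_j / area_i),              # relative scale
        box_iou(box_i, box_j),                   # U_ij
        wi / hi,                                 # aspect ratio of i
        wj / hj,                                 # aspect ratio of j
        math.hypot(dx, dy) / math.hypot(W, H),   # distance over layout diagonal
        math.atan2(dy, dx),                      # relative angle in [-pi, pi]
    ]


e = edge_features((10, 10, 4, 4), (13, 14, 2, 8), W=30, H=40)
print(e)
```

Because the first three terms are normalized by $A_i$ only, `edge_features(a, b, W, H)` and `edge_features(b, a, W, H)` generally differ, which is consistent with the layout graphs being directed.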
 # 3.2. Graph Matching Network
 The graph matching module employed in LayoutGMN is made up of three parts: (1) node and edge encoders, (2) message propagation layers and (3) an aggregator.
 Node and Edge Encoders. We use two MLPs to embed the initial node and edge features and compute their corresponding code vectors:
 $$
 \begin{array} { r } { { h _ { i } } ^ { ( 0 ) } = M L P _ { n o d e } ( \pmb { u _ { i } } ) , \forall i \in U } \\ { r _ { i j } = M L P _ { e d g e } ( \pmb { e _ { i j } } ) , \forall ( i , j ) \in E } \end{array}
 $$
 The above MLPs map the initial node and edge features to their 128-D code vectors.
 Message Propagation Layers. The graph matching framework hinges on coherent information exchange between graphs to compare two layouts in a structural manner. The propagation layers update the node features by aggregating messages along the edges within a graph, in addition to relying on a graph matching vector that measures how similar a node in one layout graph is to one or more nodes in the other. Specifically, given two node embeddings ${ h _ { i } ^ { ( 0 ) } }$ and $h _ { p } ^ { ( 0 ) }$ from two different layout graphs, the node updates for the node $i$ are given by:
 $$
 \begin{array} { c } { m _ { j \to i } = f _ { i n t r a } ( h _ { i } ^ { ( t ) } , h _ { j } ^ { ( t ) } , r _ { i j } ) , \forall ( i , j ) \in E _ { 1 } } \\ { \mu _ { p \to i } = f _ { c r o s s } ( h _ { i } ^ { ( t ) } , h _ { p } ^ { ( t ) } ) , \forall i \in V _ { 1 } , p \in V _ { 2 } } \\ { \displaystyle h _ { i } ^ { ( t + 1 ) } = f _ { u p d a t e } ( h _ { i } ^ { ( t ) } , \sum _ { j } m _ { j \to i } , \sum _ { p } \mu _ { p \to i } ) } \end{array}
 $$
 where $f _ { i n t r a }$ is an MLP on the initial node embedding code that aggregates information from other nodes within the same graph, $f _ { c r o s s }$ is a function that communicates cross-graph information, and $f _ { u p d a t e }$ is an MLP used to update the node features in the graph, whose input is the concatenation of the current node features, the aggregated information from within, and across the graphs. $f _ { c r o s s }$ is designed as an Attention-based module:
 $$
 a _ { p \to i } = \frac { \exp ( s _ { h } ( \pmb { h } _ { i } ^ { ( t ) } , \pmb { h } _ { p } ^ { ( t ) } ) ) } { \sum _ { p } \exp ( s _ { h } ( \pmb { h } _ { i } ^ { ( t ) } , \pmb { h } _ { p } ^ { ( t ) } ) ) }
 $$
 $$
 \pmb { \mu } _ { p \to i } = a _ { p \to i } ( \pmb { h } _ { i } ^ { ( t ) } - \pmb { h } _ { p } ^ { ( t ) } )
 $$
 where $a _ { p \to i }$ is the attention value (scalar) between node $p$ in the second graph and node $i$ in the first, and such attention weights are calculated for every pair of nodes across the two graphs; $s _ { h }$ is implemented as the dot product of the embedded code vectors. The interaction of all the nodes $p \in V _ { 2 }$ with the node $i$ in $V _ { 1 }$ is then given by:
 $$
 \sum _ { p } \pmb { \mu } _ { p \to i } = \sum _ { p } a _ { p \to i } ( \pmb { h } _ { i } ^ { ( t ) } - \pmb { h } _ { p } ^ { ( t ) } ) = \pmb { h } _ { i } ^ { ( t ) } - \sum _ { p } a _ { p \to i } \pmb { h } _ { p } ^ { ( t ) }
 $$
 Intuitively, $\textstyle \sum _ { p } \pmb { \mu } _ { p \to i }$ measures the (dis)similarity between $\pmb { h } _ { i } ^ { ( t ) }$ and its nearest neighbor in the other graph. The pairwise attention computation results in stronger structural bonds between the two graphs, but requires additional computation. We use five rounds of message propagation, updating the representation of each node in every round.
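The cross-graph matching vector for a single node can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the trained module: node embeddings are plain lists of floats, $s_h$ is the dot product as stated above, and the name `cross_graph_message` is hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_graph_message(h_i, other_nodes):
    """Return (attention weights, summed cross-graph message) for node i.

    The weights are a softmax over dot-product scores between h_i and
    every node embedding of the other graph. Since the weights sum to
    one, the summed message equals h_i minus the attention-weighted
    average of the other graph's node embeddings.
    """
    scores = [dot(h_i, h_p) for h_p in other_nodes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    attn = [e / norm for e in exps]
    attended = [
        sum(a * h_p[d] for a, h_p in zip(attn, other_nodes))
        for d in range(len(h_i))
    ]
    message = [hi - at for hi, at in zip(h_i, attended)]
    return attn, message


attn, msg = cross_graph_message([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(attn, msg)
```

When the other graph contains a node identical to $\pmb{h}_i^{(t)}$ that receives nearly all of the attention, the message approaches the zero vector, which matches the reading of the sum as a (dis)similarity to the nearest neighbor.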
 Aggregator. A 1024-D graph-level representation, $h _ { G }$ , is obtained via a feature aggregator MLP, $f _ { G }$ , that takes as input, the set of node representations $\{ h _ { i } ^ { ( T ) } \}$ , as given below:
 $$
 h _ { G } = M L P _ { G } \left( \sum _ { i \in V } \sigma ( M L P _ { g a t e } ( \pmb { h } _ { i } ^ { ( T ) } ) ) \odot M L P ( \pmb { h } _ { i } ^ { ( T ) } ) \right)
 $$
 Graph-level embeddings for the two layout graphs are computed in the same way:
 $$
 \begin{array} { r } { \pmb { h } _ { G _ { 1 } } = f _ { G } ( \{ \pmb { h } _ { i } ^ { ( T ) } \} _ { i \in V _ { 1 } } ) } \\ { \pmb { h } _ { G _ { 2 } } = f _ { G } ( \{ \pmb { h } _ { p } ^ { ( T ) } \} _ { p \in V _ { 2 } } ) } \end{array}
 $$
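The gated sum inside the aggregator can be illustrated without any learned weights. In the sketch below, `gate_fn` and `feat_fn` are hypothetical stand-ins for $MLP_{gate}$ and the inner $MLP$, passed in as plain callables; the outer $MLP_G$ is omitted, so the function returns only the pooled vector that $MLP_G$ would consume.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_sum(node_states, gate_fn, feat_fn):
    """Sum over nodes of sigmoid(gate_fn(h)) * feat_fn(h), element-wise.

    gate_fn and feat_fn are placeholder callables that must return
    vectors of equal length; they are not the trained MLPs.
    """
    dim = len(feat_fn(node_states[0]))
    pooled = [0.0] * dim
    for h in node_states:
        gate = [sigmoid(g) for g in gate_fn(h)]
        feat = feat_fn(h)
        for d in range(dim):
            pooled[d] += gate[d] * feat[d]
    return pooled


# With a zero gate pre-activation every gate equals 0.5.
pooled = gated_sum(
    [[2.0, 4.0], [6.0, 8.0]],
    gate_fn=lambda h: [0.0] * len(h),
    feat_fn=lambda h: h,
)
print(pooled)  # [4.0, 6.0]
```

The gate lets each node decide, per feature dimension, how much it contributes to the graph-level representation, which a plain sum or mean would not allow.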
 # 3.3. Training
 To learn a layout similarity metric, we borrow the Triplet training framework [13]. Specifically, given two pairs of layout graphs, i.e., anchor-positive and anchor-negative, each pair is passed through the same GMN module to get the graph embeddings in the context of the other graph, as shown in Fig 5. A margin loss based on the $L _ { 2 }$ distance between the graph embeddings, as given in equation 8, is used to backpropagate the gradients through GMN.
 $$
 L _ { t r i } ( a , p , n ) = \max \left( 0 , \gamma + \left\| h _ { G _ { a } } - h _ { G _ { p } } \right\| _ { 2 } - \left\| h ^ { \prime } _ { G _ { a } } - h _ { G _ { n } } \right\| _ { 2 } \right)
 $$
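The margin loss above can be sketched as follows. Note that the anchor contributes two different embeddings, one computed in the context of the positive and one in the context of the negative. The function names are hypothetical; the default margin of 5 is the value quoted in the caption of Figure 5.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(h_a_pos, h_p, h_a_neg, h_n, margin=5.0):
    """Margin loss on L2 distances of the two paired embeddings.

    h_a_pos: anchor embedding from the (anchor, positive) pair
    h_a_neg: anchor embedding from the (anchor, negative) pair
    """
    d_pos = l2_distance(h_a_pos, h_p)
    d_neg = l2_distance(h_a_neg, h_n)
    return max(0.0, margin + d_pos - d_neg)


# Negative is far enough away: the loss vanishes.
print(triplet_margin_loss([0, 0], [3, 4], [0, 0], [6, 8]))  # 0.0
# Negative is too close: the loss is positive.
print(triplet_margin_loss([0, 0], [3, 4], [0, 0], [0, 1]))  # 9.0
```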
 # 4. Datasets
 We use two kinds of layout datasets in our experiments: (1) UI layouts from the RICO dataset [7], and (2) floorplans from the RPLAN dataset [42]. After some data filtering, the sizes of the two datasets are 66,261 and 77,669, respectively.
 ![](images/1e1f54d6b4c7441623fd6af31c439e83cd8f899efc5f9d2f7465ab923b69b261.jpg)  
 Figure 5. Given a triplet of graphs $G _ { a }$ , $G _ { p }$ and $G _ { n }$ corresponding to the anchor, positive and negative examples respectively, the anchor graph paired with each of the other two graphs is passed through a Graph Matching Network (Fig 4) to get two 1024-D embeddings. Note that the anchor graph has different contextual embeddings $\pmb { h } _ { G _ { a } }$ and $\pmb { h } ^ { \prime } _ { G _ { a } }$ . LayoutGMN is trained using the margin loss (margin $= 5$ ) on the $L _ { 2 }$ distances of the two paired embeddings.
 In the absence of a ground truth label set and the need for obtaining the triplets in a consistent manner, we resort to using IoU values of two layouts, represented as multi-channel images, to ascertain their closeness. Given an anchor layout, the IoU threshold for classifying another layout as positive, chosen from observations, is 0.6 for both UIs and floorplans. Negative examples are those whose threshold value is at least 0.1 lower than that of the positive ones, which avoids some incorrect "negatives" during training. The train-test sizes for the aforementioned datasets are, respectively, 7,700-1,588 and 25,000-7,204. In the filtered floorplan training dataset [42], the number of distinct semantic categories (rooms) across the dataset is nine and the maximum number of rooms per floorplan is eight. Similarly, for the filtered UI layout dataset [7], the number of distinct semantic categories is twenty-five and the number of elements per UI layout across the dataset is at most one hundred.
 # 5. Results and Evaluation
 We evaluate LayoutGMN by comparing its retrieval results to those of several baselines, evaluated using human judgements. Similarity prediction by our network is efficient, taking 33 milliseconds per layout pair on a CPU. With our learning framework, we can efficiently retrieve multiple, sorted results by batching the database samples.
 # 5.1. Baselines
 Graph Kernel (GK) [8]. GK is one of the earliest structural similarity metrics, initially developed to compare indoor 3D scenes. We adapt it to 2D layouts of floorplans and UI designs. We input the same layout graphs to GK to get retrievals from the two databases, and use the best setting based on the trade-off between result quality and computation cost.
 <table><tr><td>Method</td><td>k=1 (↑)</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>Graph Kernel [8]</td><td>33.33</td><td>15.83</td><td>11.46</td></tr><tr><td>U-Net_Triplet [35]</td><td>27.08</td><td>10.83</td><td>7.92</td></tr><tr><td>IoU Metric</td><td>43.75</td><td>22.92</td><td>14.38</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>39.6</td><td>17.1</td><td>13.33</td></tr><tr><td>LayoutGMN</td><td>47.91</td><td>22.92</td><td>15.83</td></tr><tr><td>Graph Kernel [8]</td><td>27.27</td><td>15.15</td><td>12.42</td></tr><tr><td>U-Net_Triplet [35]</td><td>28.28</td><td>18.18</td><td>15.05</td></tr><tr><td>IoU Metric</td><td>33.84</td><td>24.04</td><td>17.48</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>37.37</td><td>22.02</td><td>17.02</td></tr><tr><td>LayoutGMN</td><td>38.38</td><td>25.35</td><td>21.21</td></tr></table>
 Table 1. Precision scores for the top-k retrieved results obtained using different methods, on a set of randomly chosen UI and floorplan queries. The first set of five comparisons is for UI layouts, followed by floorplans.
 U-Net [35]. U-Net is one of the best segmentation networks; we use it in a triplet network setting to auto-encode layout images. The input to the network is a multi-channel image with semantic segmentations. The network is trained on the same set of triplets as LayoutGMN until convergence.
 IoU Metric. Given two multi-channel layout images, we compute the IoU between them, and use this score to sort the examples in the datasets and rank the retrievals for a given query.
 GCN-CNN [30]. The state-of-the-art network for structural similarity on UI layouts is a hybrid network comprised of an attention-based GCN, similar to the gating mechanism in [28], coupled with a CNN. In this original GCN-CNN, the training triplets are randomly sampled every epoch, leading to better training due to diverse training data. In our work, for a fair comparison over all the aforementioned networks, we sample a fixed set of triplets in every epoch of training. The GCN-CNN network is trained on the two datasets of our interest, using the same training data as ours.
 Qualitative retrieval results for GCN-CNN, IoU metric and LayoutGMN for a given query are shown in Figure 6.
 # 5.2. Evaluation Metrics
 Precision@$k$ scores. To validate the correctness of LayoutGMN as a tool for measuring layout similarity, we start by evaluating layout retrieval from a large database. A standard evaluation protocol for the relevance of ranked lists is the Precision@$k$ score [31], or $P@k$ for short. Given a query $q _ { i }$ from the query set $Q = \{ q _ { 1 } , q _ { 2 } , q _ { 3 } , \ldots , q _ { n } \}$ , we measure the relevance of the ranked list $L ( q _ { i } ) = [ L _ { i 1 } , L _ { i 2 } , \ldots , L _ { i k } , \ldots ]$ using the precision score,
 $$
 P @ k ( Q , L ) = \frac { 1 } { k | Q | } \sum _ { q _ { i } \in Q } \sum _ { j = 1 } ^ { k } r e l ( L _ { i j } , q _ { i } ) ,
 $$
 ![](images/817e17e26c81262c41e6cfdecb5f3145cb19873bc1193aab7bf50bb54c10308a.jpg)  
 Figure 6. Top-5 retrieved results for an input query based on IoU metric, GCN-CNN Triplet [30] and LayoutGMN. We observe that the ranked results returned by LayoutGMN are closer to the input query than the other two methods, although it was trained on triplets computed using the IoU metric. Attention weights for understanding structural correspondence in LayoutGMN are shown in Figure 1 and also provided in the supplementary material. UI and floorplan IDs from the RICO dataset [7] and RPLAN dataset [42], respectively, are indicated on top of each result. More results can be found in the supplementary material.
 where $r e l ( L _ { i j } , q _ { i } )$ is a binary indicator of the relevance of the returned element $L _ { i j }$ for query $q _ { i }$ . In our evaluation, due to the lack of a labeled and exhaustive recommendation set for any query over the layout datasets employed, such a binary indicator is determined by human subjects.
 Table 1 shows the $P@k$ scores for the different networks described in Section 5.1, employed for the layout retrieval task. To get the precision scores, similar to [30], we conducted a crowd-sourced annotation study via Amazon Mechanical Turk (AMT) on the top-10 retrievals per query, for $N$ ($N = 50$ for UIs and 100 for floorplans) randomly chosen queries outside the training set. Ten turkers were asked to indicate the structural relevance of each of the top-10 results per query, without any specific instructions on what a structural comparison means. A result was considered relevant if at least 6 turkers agreed. For details on the AMT study, please see the supplementary material.
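The $P@k$ computation, combined with the majority rule used for the relevance indicator, can be sketched as follows. The function names and the data layout (one list of binary relevance flags per query, in ranked order) are assumptions made for illustration; the values in the example are made up and are not taken from the AMT study.

```python
def relevance_from_votes(votes, min_agree=6):
    """Binary relevance of one retrieved result.

    votes holds one 0/1 judgement per annotator; the result counts as
    relevant if at least min_agree annotators marked it relevant.
    """
    return 1 if sum(votes) >= min_agree else 0


def precision_at_k(relevance_lists, k):
    """P@k over a query set.

    relevance_lists[i][j] is the binary relevance of the j-th ranked
    result for query i.
    """
    hits = sum(sum(rel[:k]) for rel in relevance_lists)
    return hits / (k * len(relevance_lists))


# Two illustrative queries with five ranked results each.
rels = [[1, 1, 0, 0, 0], [0, 0, 0, 1, 0]]
print(precision_at_k(rels, 1))  # 0.5
print(precision_at_k(rels, 5))  # 0.3
```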
 We observe that LayoutGMN better matches humans' notion of structural similarity. GCN-CNN [30] performs better than the IoU metric on floorplan data $( + 3 . 5 \% )$ on the top-1 retrievals and is comparable to the IoU metric on top-5 and top-10 results. On UI layouts, the IoU metric is judged better by turkers than [30]. U-Net fails to retrieve structurally similar results as it overfits on the small amount of training data, and relies more on image pixels due to its convolutional structure. LayoutGMN matches or outperforms the other methods for all $k$ on both datasets: it leads by at least $1 \%$ everywhere except at $k = 5$ on UIs, where it ties with the IoU metric (Table 1). The precision scores on floorplans (bottom set) are lower than on UI layouts, perhaps because floorplans are easier to compare owing to a smaller set of semantic elements than UIs, and turkers tend to focus more on the size and boundary of the floorplans in addition to the structural arrangements. We believe that when many semantic elements are present in the layouts and are scattered (as in UIs), the users tend to look at the overall structure instead of trying to match every single element owing to a reduced attention span, which likely explains the higher scores for UIs.
 Overlap@$k$ score. We propose another measure to quantify the stability of retrieved results: the Overlap@$k$ score, or $Ov@k$ for short. The intuition behind $Ov@k$ is to quantify the consistency of retrievals for any similarity metric, by checking the number of similarly retrieved results for a query and its top-1 result. The higher this score, the better the retrieval consistency, and thus the higher the retrieval stability. Specifically, if $Q _ { 1 }$ is a set of queries and $Q _ { 1 } ^ { t o p 1 }$ the set of top-1 retrieved results for every query in $Q _ { 1 }$ , then
 $$
 O v @ k ( Q _ { 1 } , Q _ { 1 } ^ { t o p 1 } ) = \frac { 1 } { k | Q _ { 1 } | } \sum _ { \underset { q _ { p } = t o p 1 ( q _ { m } ) } { q _ { m } \in Q _ { 1 } } } \sum _ { j = 1 } ^ { k } ( L _ { m j } \wedge L _ { p j } ) ,
 $$
 where $L _ { i j }$ is the $j ^ { t h }$ ranked result for the query $q _ { i }$ , and $\wedge$ is the logical AND. Thus, $( L _ { m j } \land L _ { p j } )$ is 1 if the $j ^ { t h }$ results for query $q _ { m } \in Q _ { 1 }$ and for query $q _ { p } = \mathrm { t o p } 1 ( q _ { m } ) \in Q _ { 1 } ^ { t o p 1 }$ are the same. $Ov@k$ measures the ability of the layout similarity metric to replicate the distance field implied by a query by its top-ranked retrieved result. The score makes sense only when the ranked results returned by a layout similarity tool are deemed reasonable, as assessed by the $P@k$ scores.
 Table 2. Overlap scores for checking the consistency of retrievals for a query and its top-1 retrieved result, over 50 such pairs. The first set of three rows are for UI layouts, followed by floorplans.   
 <table><tr><td>Method</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>IoU Metric</td><td>50.6</td><td>49.4</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>46.8</td><td>45.6</td></tr><tr><td>LayoutGMN</td><td>49.8</td><td>49.8</td></tr><tr><td>IoU Metric</td><td>30.42</td><td>30.8</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>43.2</td><td>46.8</td></tr><tr><td>LayoutGMN</td><td>47.6</td><td>50.8</td></tr></table>
 Table 2 shows the $Ov@k$ scores with $k = 5 , 10$ for IoU, GCN-CNN [30], and LayoutGMN on 50 such pairs. On UIs (first three rows), the IoU metric has a slightly higher $Ov@5$ score $( + 0 . 6 \% )$ than LayoutGMN. Also, it shares the largest $P@5$ score with LayoutGMN, indicating that the IoU metric has slightly better retrieval stability for the top-5 results. However, in the case of $Ov@10$, LayoutGMN has a higher score $( + 0 . 4 \% )$ than the IoU metric and also has a higher $P@10$ score than the other two methods, indicating that when top-10 retrievals are considered, LayoutGMN has slightly better consistency on the retrievals.
 As for floorplans (last three rows), Table 1 already shows that LayoutGMN has the best $P@k$ scores. This, coupled with higher $Ov@k$ scores, indicates that on floorplans, LayoutGMN has better retrieval stability. In the supplementary material, we show qualitative results on the stability of retrievals for the three methods.
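The $Ov@k$ score defined earlier in this section can be sketched as below. The sketch follows the formula literally, comparing the two ranked lists position by position, so a shared result at a different rank does not count. The function name and the input layout (a list of `(ranked list of the query, ranked list of its top-1 result)` pairs) are assumptions made for illustration, and the example IDs are made up.

```python
def overlap_at_k(ranked_pairs, k):
    """Ov@k over a set of (query, top-1 result) pairs.

    Each entry of ranked_pairs is (L_m, L_p): the ranked retrievals
    for a query q_m and for its top-1 result q_p. A position j
    contributes 1 when both lists hold the same item at rank j.
    """
    matches = 0
    for ranked_m, ranked_p in ranked_pairs:
        matches += sum(
            1 for a, b in zip(ranked_m[:k], ranked_p[:k]) if a == b
        )
    return matches / (k * len(ranked_pairs))


pairs = [
    (["a", "b", "c", "d"], ["a", "x", "c", "y"]),
    (["p", "q", "r", "s"], ["q", "p", "r", "s"]),
]
print(overlap_at_k(pairs, 2))  # 0.25
print(overlap_at_k(pairs, 4))  # 0.5
```

A set-based variant (counting items shared by the two top-$k$ lists regardless of rank) would be more lenient; it is not what the formula as written expresses.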
 Classification accuracy. We also measure the classification accuracy of test triplets as a sanity check. However, such a measure alone is not sufficient to establish the correctness of a similarity metric employed in information retrieval tasks [31]. We present it alongside $P@k$ and $Ov@k$ scores for a broader, informed evaluation, in Table 3. Since user annotations are expensive and time consuming (hence the motivation to use the IoU metric to get weak training labels), we only get user annotations on 452 triplets for both UIs and floorplans, and the last column of Table 3 reflects the accuracy on such triplets. LayoutGMN outperforms all the baselines by at least $1 . 3 2 \%$ , on triplets obtained using both the IoU metric and user annotations.
 Table 3. Classification accuracy on test triplets obtained using IoU metric (IoU-based) and annotated by users (User-based). The first set of comparisons is for UI layouts, followed by floorplans.   
 <table><tr><td rowspan="2">Method</td><td colspan="2">Test Accuracy on Triplets</td></tr><tr><td>IoU-based (↑)</td><td>User-based (↑)</td></tr><tr><td>Graph Kernel [8]</td><td>90.09</td><td>90.73</td></tr><tr><td>U-Net_Triplet [35]</td><td>96.67</td><td>93.38</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>96.45</td><td>94.48</td></tr><tr><td>LayoutGMN</td><td>98.96</td><td>95.80</td></tr><tr><td>Graph Kernel [8]</td><td>92.07</td><td>95.60</td></tr><tr><td>U-Net_Triplet [35]</td><td>93.01</td><td>91.00</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>92.50</td><td>91.8</td></tr><tr><td>LayoutGMN</td><td>97.54</td><td>97.60</td></tr></table>
 ![](images/d21068a819bf1cb5f15f4b3be9c971729ba516dfc823ec24a76c51e0bfdf0b9a.jpg)  
 Figure 7. Retrieval results for the bottom-left query in Fig 6, when adjacency graphs are used. We observe, on most of the queries, that the performance of LayoutGMN improves, but degrades in the case of GCN-CNN [30] on floorplan data.
 # 5.3. Fully-connected vs. Adjacency Graphs
 Following [30], we employed fully connected graphs for our experiments until now and observed that such graphs are a good design for training graph neural networks for learning structural similarity. We also performed experiments using adjacency graphs on GCN-CNN [30] and LayoutGMN, and observed that, for floorplans (where the graph node count is small), the quality of retrievals improved in the case of LayoutGMN, but degraded for GCN-CNN. This is mainly because GCN-CNN obtains independent graph embeddings for each input graph and when the graphs are built only on adjacency connections, some amount of global structural prior is lost. On the other hand, GMNs obtain better contextual embeddings by now matching the sparsely connected adjacency graphs, as a result of narrower search space; for a qualitative result using adjacency graphs, see Figure 7. However, for UIs (where the graph node count is large), the elements are scattered all over the layout, and no one heuristic is able to capture adjacency relations perfectly. The quality of retrievals for both the networks degraded when using adjacency graphs on UIs. More results can be found in the supplementary material.
 <table><tr><td>Structure encoding with</td><td>k=1 (↑)</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>No edges</td><td>30</td><td>16.39</td><td>11.3</td></tr><tr><td>No box positions</td><td>15</td><td>7.2</td><td>5.4</td></tr><tr><td>No node semantics</td><td>24</td><td>11.2</td><td>8.4</td></tr></table>
 Table 4. Precision@$k$ scores for ablation studies on structural encoding of floorplan graphs. The setup for crowd-sourced relevance judgements via AMT is the same as in Table 1, on the same set of 100 randomly chosen queries.
 # 5.4. Ablation Studies on Structural Representation
 To evaluate how the node and edge features in our layout representation contribute to network performance, we conduct an ablation study by gradually removing these features. Our design of the initial representation of the layout graphs (Sec 3.1) is well studied in prior works on layout generation [11, 26], visual reasoning, and relationship detection tasks [12, 44, 30]. As such, we focus on analyzing LayoutGMN's behavior when strong structural priors, viz., the edges, box positions, and element semantics, are ablated.
 Graph edges. Removing graph edges results in loss of structural information, with only the attention-weighted node update (Eq. 4) taking place. When the number of graph nodes is small, e.g., for floorplans, edge removal does not lead to random retrievals, but the retrieved results are poorer compared to when edges are present; see Table 4.
 Effect of box positions. The nodes of the layout graphs encode both the absolute box positions and the element semantics. When the position encoding information is withdrawn, arguably, the most important cue is lost. The resulting retrievals from such a poorly trained model, as seen in the second row of Table 4, are noisy as semantics alone do not provide enough structural priors.
 Effect of node semantics. Next, when the box positions are preserved but the element semantics are not encoded, we observe that the network slowly begins to understand element comparison guided by the position info, but falls short of understanding the overall structure information, see Table 4. LayoutGMN takes into account all the above information returning structurally sound results (Table 1), even relative to the IoU metric.
 # 5.5. Attention-based Layout Label Transfer
 We present layout label transfer, via attention-based structural element matching, as a natural application of LayoutGMN. Given a source layout image $I _ { 1 }$ with known labels, the goal is to transfer the labels to a target layout $I _ { 2 }$ .
 ![](images/ed308e04292b05893b2144d0c5147d0b580f1e468750bac4cdac2e7eddcc3460.jpg)  
 Figure 8. Element-level label transfer results from a source image $I _ { 1 }$ to a target image $I _ { 2 }$ , using a pretrained LayoutGMN vs. maximum pixel-overlap matching. LayoutGMN predicts correct labels via attention-based element matching.
 A straightforward approach to establishing element correspondence is via maximum area/pixel-overlap matching for every element in $I _ { 2 }$ with respect to all the elements in $I _ { 1 }$ . However, this scheme is highly sensitive to element positions within the two layouts. Moreover, raster alignment (via translations) of layouts is non-trivial to formulate when the two layout images have different boundaries and structures. LayoutGMN, on the other hand, is robust to such boundary variations, and can be directly used to obtain element-level correspondences using the built-in attention mechanism that provides an attention score for every element-level match. Specifically, we use a pretrained LayoutGMN which is fed with two layout graphs, where the semantic encoding of all nodes is set to a vector of ones.
 As shown in Figure 8, the pretrained LayoutGMN is able to find the correct labels despite masking the semantic information at the input. Note that when semantic information is masked at the input, such a transfer cannot be applied to any two layouts. It is limited by a weak/floating alignment of $I _ { 1 }$ and $I _ { 2 }$ , as seen in Figure 8.
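One plausible way to turn attention scores into transferred labels is to assign each target element the label of the source element it attends to most strongly. The sketch below assumes this argmax rule and an attention matrix indexed as `attention[t][s]` (target element `t`, source element `s`); both are assumptions made for illustration and are not confirmed details of the pipeline, and the room labels and scores are made up.

```python
def transfer_labels(attention, source_labels):
    """Assign each target element the label of its best-matching source.

    attention[t][s] is the attention score between target element t
    and source element s (hypothetical layout of the scores).
    """
    transferred = []
    for row in attention:
        best_source = max(range(len(row)), key=lambda s: row[s])
        transferred.append(source_labels[best_source])
    return transferred


source_labels = ["bedroom", "kitchen", "bathroom"]
attention = [
    [0.1, 0.7, 0.2],  # target element 0 matches source element 1
    [0.6, 0.3, 0.1],  # target element 1 matches source element 0
]
print(transfer_labels(attention, source_labels))  # ['kitchen', 'bedroom']
```

Unlike maximum pixel-overlap matching, this rule never compares raster positions directly, so it does not require the two layouts to be aligned by translation first.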
 # 6. Conclusion, limitation, and future work
 We present the first deep neural network to offer both metric learning of structural layout similarity and structural matching between layout elements. Extensive experiments demonstrate that our metric best matches human judgement of structural similarity for both floorplans and UI designs, compared to all well-known baselines.
 The main limitation of our current learning framework is the requirement for strong supervision, which justifies, in part, the use of the less-than-ideal IoU metric for network training. An interesting future direction is to combine few-shot or active learning with our GMN-based triplet network, e.g., by finding ways to obtain small sets of training triplets that are both informative and diverse [24]. Another limitation of our current network is that it does not learn hierarchical graph representations or structural matching, which would have been desirable when handling large graphs.
 Acknowledgements. We thank the anonymous reviewers for their valuable comments, and the AMT workers for offering their feedback. This work was supported, in part, by an NSERC grant (611370) and an Adobe gift.
 # References
 [1] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4561–4569, 2019. 3   
 [2] Thorsten Brants. Inter-annotator agreement for a german newspaper corpus. In International Conference on Knowledge Engineering and Knowledge Management, 2000. 2   
 [3] Thomas M Breuel. High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology, pages 209–218, 2003. 2   
 [4] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. 2   
 [5] Qi Chen, Qi Wu, Rui Tang, Yuhan Wang, Shuai Wang, and Mingkui Tan. Intelligent home 3d: Automatic 3d-house design from linguistic descriptions only. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12625–12634, 2020. 2   
 [6] Niraj Ramesh Dayama, Kashyap Todi, Taru Saarelainen, and Antti Oulasvirta. GRIDS: Interactive layout design with integer programming. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2020. 2   
 [7] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017. 2, 4, 5, 6   
 [8] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing structural relationships in scenes using graph kernels. In ACM SIGGRAPH 2011 papers, pages 1–12. 2011. 2, 5, 7   
 [9] Karen Fort, Maud Ehrmann, and Adeline Nazarenko. Towards a methodology for named entities annotation. 2009. 2   
 [10] Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics, 2020. 1   
 [11] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. READ: Recursive autoencoders for document layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 544–545, 2020. 1, 2, 3, 8   
 [12] Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, and Hanqing Lu. Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, pages 765–773, 2019. 3, 8   
 [13] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015. 3, 4   
 [14] George Hripcsak and Adam Wilcox. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. Journal of the American Medical Informatics Association, 9(1):1–15, 2002. 2   
 [15] Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver van Kaick, Hao Zhang, and Hui Huang. Graph2Plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG), 2020. 1, 2   
 [16] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In Proceedings of the 9th ACM symposium on Document engineering, pages 99–108, 2009. 2   
 [17] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1219–1228, 2018. 3   
 [18] Rangachar Kasturi. Document image analysis, volume 39. 2   
 [19] Nagma Khan, Ushasi Chaudhuri, Biplab Banerjee, and Subhasis Chaudhuri. Graph convolutional network for multi-label vhr remote sensing scene recognition. Neurocomputing, 357:36–46, 2019. 3   
 [20] Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. Corpus annotation for mining biomedical events from literature. BMC bioinformatics, 9(1):10, 2008. 2   
 [21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. 2017. 2   
 [22] Yanir Kleiman, Oliver van Kaick, Olga Sorkine-Hornung, and Daniel Cohen-Or. SHED: shape edit distance for fine-grained shape similarity. ACM Transactions on Graphics (TOG), 34(6):1–11, 2015. 2   
 [23] Ranjitha Kumar, Jerry O Talton, Salman Ahmad, and Scott R Klemmer. Bricolage: example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2197–2206, 2011. 2   
 [24] Priyadarshini Kumari, Ritesh Goru, Siddhartha Chaudhuri, and Subhasis Chaudhuri. Batch decorrelation for active metric learning. In IJCAI-PRICAI, 2020. 8   
 [25] Jianan Li, Tingfa Xu, Jianming Zhang, Aaron Hertzmann, and Jimei Yang. LayoutGAN: Generating graphic layouts with wireframe discriminator. In International Conference on Learning Representations, 2019. 1, 2   
 [26] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri, Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen, Daniel Cohen-Or, and Hao Zhang. GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2):1–16, 2019. 1, 8   
 [27] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In ICML, 2019. 1, 3   
 [28] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. 2016. 2, 5   
 [29] Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pages 569–579, 2018. 2   
 [30] Dipu Manandhar, Dan Ruta, and John Collomosse. Learning structural similarity of user interface layouts using graph networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. 1, 2, 3, 5, 6, 7, 8   
 [31] Christopher D Manning, Hinrich Schütze, and Prabhakar Raghavan. Chapter 8: Evaluation in information retrieval in “Introduction to Information Retrieval”. pages 151–175. Cambridge university press, 2008. 5, 7   
 [32] Nelson Nauata, Kai-Hung Chang, Chin-Yi Cheng, Greg Mori, and Yasutaka Furukawa. House-gan: Relational generative adversarial networks for graph-constrained house layout generation. Eur. Conf. Comput. Vis., 2020. 2   
 [33] Peter O’Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE transactions on visualization and computer graphics, 20(8):1200–1213, 2014. 2   
 [34] Daniel Ritchie, Ankita Arvind Kejriwal, and Scott R Klemmer. d. tour: Style-based exploration of design example galleries. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 165–174, 2011. 2   
 [35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 5, 7   
 [36] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018. 2   
 [37] Amanda Swearngin, Mira Dontcheva, Wilmot Li, Joel Brandt, Morgan Dixon, and Andrew J Ko. Rewire: Interface design assistance from examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2018. 2   
 [38] Sou Tabata, Hiroki Yoshihara, Haruka Maeda, and Kei Yokoyama. Automatic layout generation for graphical design magazines. In ACM SIGGRAPH 2019 Posters, pages 1–2. 2019. 2   
 [39] Subarna Tripathi, Sharath Nittur Sridhar, Sairam Sundaresan, and Hanlin Tang. Compact scene graphs for layout composition and patch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 3   
 [40] Raoul Wessel, Ina Blümel, and Reinhard Klein. The room connectivity graph: Shape retrieval in the architectural domain. 2008. 2   
 [41] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC bioinformatics, 7(1):1–10, 2006. 2   
 [42] Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, YuHao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 1, 2, 5, 6   
 [43] Kai Xu, Rui Ma, Hao Zhang, Chenyang Zhu, Ariel Shamir, Daniel Cohen-Or, and Hui Huang. Organizing heterogeneous scene collections through contextual focal points. ACM Transactions on Graphics (TOG), 33(4):1–12, 2014. 2   
 [44] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European conference on computer vision (ECCV), pages 684–699, 2018. 3, 8   
 [45] Ziqi Zhang, Sam Chapman, and Fabio Ciravegna. A methodology towards effective and efficient manual document annotation: addressing annotator discrepancy and annotation quality. In International Conference on Knowledge Engineering and Knowledge Management, pages 301–315. Springer, 2010. 2   
 [46] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Eur. Conf. Comput. Vis., 2020. 1   
 [47] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019. 1, 2   
 [48] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, 2019. 1
--- a/reference/LayoutGMN/.DS_Store
+++ b/reference/LayoutGMN/.DS_Store
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/.DS_Store
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/.DS_Store
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper.md
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper.md
@@ -1,267 +0,0 @@
 # LayoutGMN: Neural Graph Matching for Structural Layout Similarity
 Akshay Gadi Patil1 Manyi Li1† Matthew Fisher2 Manolis Savva1 Hao Zhang1
 1Simon Fraser University 2Adobe Research
 # Abstract
 We present a deep neural network to predict structural similarity between 2D layouts by leveraging Graph Matching Networks (GMN). Our network, coined LayoutGMN, learns the layout metric via neural graph matching, using an attention-based GMN designed under a triplet network setting. To train our network, we utilize weak labels obtained by pixel-wise Intersection-over-Union (IoUs) to define the triplet loss. Importantly, LayoutGMN is built with a structural bias which can effectively compensate for the lack of structure awareness in IoUs. We demonstrate this on two prominent forms of layouts, viz., floorplans and UI designs, via retrieval experiments on large-scale datasets. In particular, retrieval results by our network better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-the-art method based on graph neural networks and image convolution. In addition, LayoutGMN is the first deep model to offer both metric learning of structural layout similarity and structural matching between layout elements.
 # 1. Introduction
 Two-dimensional layouts are ubiquitous visual abstractions in graphic and architectural designs. They typically represent blueprints or conceptual sketches for such data as floorplans, documents, scene arrangements, and UI designs. Recent advances in pattern analysis and synthesis have propelled the development of generative models for layouts [11, 25, 47, 15, 26] and led to a steady accumulation of relevant datasets [48, 42, 10, 46]. Despite these developments, however, there have been few attempts at employing a deeply learned metric to reason about layout data, e.g., for retrieval, data embedding, and evaluation. For example, current evaluation protocols for layout generation still rely heavily on segmentation metrics such as intersection-over-union (IoU) [15, 30] and human judgement [15, 26].
 The ability to compare data effectively and efficiently is arguably the most foundational task in data analysis. The key challenge in comparing layouts is that it is not purely a task of visual comparison — it depends critically on inference and reasoning about structures, which are expressed by the semantics and organizational arrangements of the elements or subdivisions which compose a layout. Hence, none of the well-established image-space metrics, whether model-driven, perceptual, or deeply learned, are best suited to measure structural layout similarity. Frequently applied similarity measures for image segmentation such as IoUs and F1 scores all perform pixel-level matching “in place” — they are not structural and can be sensitive to element misalignments which are structure-preserving.
 ![](images/516817b84bdaf3db241d1a3b87d316578c8f2d9adb29bb8a247a3e00042ba1d0.jpg)  
 Figure 1. LayoutGMN learns a structural layout similarity metric between floorplans and other 2D layouts, through attention-based neural graph matching. The learned attention weights (numbers shown in the boxes) can be used to match the structural elements.
 In this work, we develop a deep neural network to predict structural similarity between two 2D layouts, e.g., floorplans or UI designs. We take a predominantly structural view of layouts for both data representation and layout comparison. Specifically, we represent each layout using a directed, fully connected graph over its semantic elements. Our network learns structural layout similarity via neural graph matching, where an attention-based graph matching network [27] is designed under a triplet network setting. The network, coined LayoutGMN, takes as input a triplet of layout graphs, composed of one pair of anchor-positive and one pair of anchor-negative graphs, and performs intra-graph message passing and cross-graph information communication per pair, to learn a graph embedding for layout similarity prediction. In addition to returning a metric, the attention weights learned by our network can also be used to match the layout elements; see Figure 1.
 ![](images/76179359f537652a648a8d2094196e528e584399d6cb01cf8f854181aa609e51.jpg)  
 Figure 2. Structure matching in LayoutGMN “neutralizes” IoU feedback. In each example (left: floorplan; right: UI design), a training sample $N$ labeled as “Negative” by IoU is more structurally similar to the anchor $( A )$ than $P$ , a “Positive” sample. With structure matching, our network predicts a smaller $A$ -to- $N$ distance than $A$ -to- $P$ distance in each case, which contradicts IoU.
 To train our triplet network, it is natural to consider human labeling of positive and negative samples. However, it is well-known that subjective judgements by humans over structured data such as layouts are often unreliable, especially with non-experts [45, 2]. When domain experts are employed, the task becomes time-consuming and expensive [45, 2, 14, 9, 20, 41], where discrepancies among even these experts still remain [14]. In our work, we avoid this issue by resorting to weakly supervised training of LayoutGMN, which obtains positive and negative labels from the training data through thresholding using layout IoUs [30].
 The motivations behind our network training using IoUs are three-fold, despite the IoU’s shortcomings for structural matching. First, as one of the most widely-used layout similarity measures [30, 15], IoU does have its merits. Second, IoUs are objective and much easier to obtain than expert annotations. Finally and most importantly, our network has a built-in inductive bias to enforce structural correspondence, via inter-graph information exchange, when learning the graph embeddings. The inductive bias results from an attention-based graph matching mechanism, which learns structural matching between two graphs at the node level (Eq 3, 6). Such a structural bias can effectively compensate for the lack of structure awareness in the IoU-based triplet loss during training. In Figure 2, we illustrate the effect of this structural bias on the metric learned by our network. Observe that the last two layouts are more similar structurally than the first two. This is agreed with by our metric LayoutGMN, but not by IoU feedback.
 We evaluate our network on retrieval tasks over large datasets of floorplans and UI designs, via Precision $@ k$ scores, and investigate the stability of the proposed metric by checking retrieval consistency between a query and its top-1 result, over many such pairs; see Sec. 5.2. Overall, retrieval results by LayoutGMN better match human judgement of structural layout similarity compared to both IoUs and other baselines including a state-of-the-art method based on graph neural networks [30]. Finally, we show a label transfer application for floorplans enabled by the structure matching learned by our network (Sec 5.5).
 # 2. Related Work
 Layout analysis. Early works [18, 3] on document analysis involved primitive heuristics to analyse document structures. Organizing a large collection of such structures into meaningful clusters requires a distance measure between layouts, which typically involved content-based heuristics [34] for documents and constrained graph matching algorithm for floorplans [40]. An improved distance measure relied on rich layout representation obtained using autoencoders [7, 29], operating on an entire UI layout. Although such models capture rich raster properties of layout images, layout structures are not modeled, leading to noisy recommendations in contextual search over layout datasets.
 Layout generation. Early works on synthesizing 2D layouts relied on exemplars [16, 23, 37] and rule-based heuristics [33, 38], and were unable to capture complex element distributions. The advent of deep learning led to generative models of layouts of floorplans [42, 15, 5, 32], documents [25, 11, 47], and UIs [7, 6]. Perceptual studies aside, evaluation of generated layouts, in terms of diversity and generalization, has mostly revolved around IoUs of the constituent semantic entities [25, 11, 15]. While IoU provides a visual similarity measure, it is expensive to compute over a large number of semantic entities, and is sensitive to element positions within a layout. Developing a tool for structural comparison would perhaps complement visual features in contextual similarity search. In particular, a learning-based method that compares layouts structurally can prove useful in tasks such as layout correspondence, component labeling and layout retargeting. We present a Layout Graph Matching Network, called LayoutGMN, for learning to compare two graphical layouts in a structured manner.
 Structural similarity in 3D. Fisher et al. [8] develop Graph Kernels for characterizing structural relationships in 3D indoor scenes. Indoor scenes are represented as graphs, and the Graph Kernel compares substructures in the graphs to capture similarity between the corresponding scenes. A challenging problem of organizing a heterogeneous collection of such 3D indoor scenes was accomplished in [43] by focusing on a subscene, and using it as a reference point for distance measures between two scenes. Shape Edit Distance, SHED, [22] is another fine-grained sub-structure similarity measure for comparing two 3D shapes. These works provide valuable cues on developing an effective structural metric for layout similarity. Graph Neural Networks (GNN) [28, 21, 4, 36] model node dependencies in a graph via message passing, and are the perfect tool for learning on structured data. GNNs provide coarse-level graph embeddings, which, although useful for many tasks [39, 1, 17, 19], can lose useful structural information in contextual search, if each graph is processed in isolation. We make use of Graph Matching Network [27] to retain structural correspondence between layout elements.
 ![](images/f0a4eb226a10834e1fc610ecbc06337c5ffae80644cf03814bb2d4bf0775005e.jpg)  
 Figure 3. Given an input floorplan image with room segmentations in (a), we abstract each room into a bounding box and obtain layout features from the constituent semantic elements, as shown in (b). These features form the initial node and edge features (Section 3.1) of the corresponding layout graph shown in (c).
 GNNs for structural layout similarity. To the best of our knowledge, the recent work by Manandhar et al. [30] is the first to leverage GNNs to learn structural similarity of 2D graphical layouts, focusing on UI layouts with rectangular boundaries. They employ a GCN-CNN architecture on a graph of UI layout images, also under an IoU-trained triplet network [13], but obtain the graph embeddings for the anchor, positive, and negative graphs independently.
 In contrast, LayoutGMN learns the graph embeddings in a dependent manner. Through cross-graph information exchange, the embeddings are learned in the context of the anchor-positive (respectively, the anchor-negative) pair. This is a critical distinction to GCN-CNN [30], while both train their triplet networks using IoUs. However, since IoU does not involve structure matching, it is not a reliable measure of structural similarity, leading to labels which are considered “structurally incorrect”; see Figure 2.
 In addition, our network does not perform any convolutional processing over layout images; it only involves eight MLPs, placing more emphasis on learning finer-scale structural variations for graph embedding, and less on image-space features. We clearly observe that the cross-graph communication module in our GMNs does help in learning finer graph embeddings than the GCN-CNN framework [30]. Finally, another advantage of moving away from any reliance on image alignment is that similarity predictions by our network are more robust against highly varied, non-rectangular layout boundaries, e.g., for floorplans.
 # 3. Method
 The Graph Matching Network (GMN) [27] consumes a pair of graphs, processes the graph interactions via an attention-based cross-graph communication mechanism, and produces graph embeddings for the two input graphs, as shown in Fig 4. Our LayoutGMN plugs the Graph Matching Network into a Triplet backbone architecture for learning a (pseudo) metric space for similarity on 2D layouts such as floorplans, UIs and documents.
 ![](images/939bcda0c0c4de7dc9855979ac03e34cc2fece15e7d532d2941505334eb83594.jpg)  
 Figure 4. LayoutGMN takes two layout graphs as input, performs intra-graph message passing (Eq. 2), along with cross-graph information exchange (Eq. 3) via an attention mechanism (Eq. 5, also visualized in Figure 1) to update node features, from which final graph embeddings are obtained (Eq. 7).
 # 3.1. Layout Graphs
 Given a layout image of height $H$ and width $W$ with semantic annotations, we abstract each element into a bounding box, which form the nodes of the resulting layout graph. Specifically, for a layout image $I _ { 1 }$ , its layout graph $G _ { l }$ is given by $G _ { l } ~ = ~ ( V , E )$ , where the node set $V =$ $\{ v _ { 1 } , v _ { 2 } , . . . , v _ { n } \}$ represents the semantic elements in the layout, and $E = \left\{ e _ { 1 2 } , . . . , e _ { i j } , . . , e _ { n \left( n - 1 \right) } \right\}$ , the edge set, represents the set of edges connecting the constituent elements. Our layout graphs are directed and fully-connected.
 Initial Node Features. There exist a variety of visual and content-based features that could be incorporated as the initial node features, e.g., the text data/font size/font type of a UI element or the image features of a room in a floorplan. For structured learning tasks such as ours, we ignore such content-based features and only focus on the box abstractions. Specifically, similar to [11, 12], the initial node features contain semantic and geometric information of the layout elements. As shown in Fig 3, for a layout element $k$ centered at $( x _ { k } , y _ { k } )$ , with dimensions $( w _ { k } , h _ { k } )$ , its geometric information is:
 $$
 g _ { k } = \left[ { \frac { x _ { k } } { W } } , { \frac { y _ { k } } { H } } , { \frac { w _ { k } } { W } } , { \frac { h _ { k } } { H } } , { \frac { w _ { k } h _ { k } } { \sqrt { W H } } } \right] .
 $$
 Instead of one-hot encoding of the semantics, we use a learnable embedding layer to embed a semantic type into a 128-D code, $s _ { k }$ . A two-layer MLP embeds the $5 \times 1$ geometric vector $g _ { k }$ into a 128-D code, and is concatenated with the 128-D semantic embedding $s _ { k }$ to form the initial node features $U = \{ { \pmb u } _ { 1 } , { \pmb u } _ { 2 } , . . . , { \pmb u } _ { n } \}$ .
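The geometric vector $g_k$ above can be sketched in a few lines. This is a minimal illustration that follows the formula as printed; the function name `geometric_features` and the sample box are hypothetical, and the learned 128-D embeddings are not reproduced here.

```python
import math


def geometric_features(x, y, w, h, W, H):
    """Return the 5-D geometry vector g_k for a box centred at (x, y).

    w, h are the box dimensions; W, H are the layout width and height.
    The last term follows the formula as printed: w*h / sqrt(W*H).
    """
    return [x / W, y / H, w / W, h / H, (w * h) / math.sqrt(W * H)]


# Hypothetical example: a 32x16 box centred at (64, 32) in a 128x64 layout.
g = geometric_features(x=64, y=32, w=32, h=16, W=128, H=64)
print(g)
```

In the network described above, this vector would then be passed through the two-layer MLP and concatenated with the semantic embedding.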
 Initial Edge Features. In visual reasoning and relationship detection tasks, edge features in a graph are designed to capture relative difference of the abstracted semantic entities (represented as nodes) [12, 44]. Thus, for an edge $e _ { i j }$ , we capture the spatial relationship (see Fig 3) between the semantic entities by an $8 \times 1$ vector:
 $$
 e _ { i j } = \left[ \frac { \Delta x _ { i j } } { \sqrt { A _ { i } } } , \frac { \Delta y _ { i j } } { \sqrt { A _ { i } } } , \sqrt { \frac { A _ { j } } { A _ { i } } } , U _ { i j } , \frac { w _ { i } } { h _ { i } } , \frac { w _ { j } } { h _ { j } } , \frac { \sqrt { \Delta x ^ { 2 } + \Delta y ^ { 2 } } } { \sqrt { W ^ { 2 } + H ^ { 2 } } } , \theta \right] ,
 $$
 where $A _ { i }$ is the area of the element box $i$ ; $U _ { i j } = \frac { B _ { i } \cap B _ { j } } { B _ { i } \cup B _ { j } }$ is the IoU of the bounding boxes of the layout elements $i , j$ ; $\theta = \operatorname{atan2} ( \Delta y , \Delta x )$ is the relative angle between the two components, $\theta \in [ - \pi , \pi ]$ ; $\Delta x _ { i j } = x _ { j } - x _ { i }$ and $\Delta y _ { i j } = y _ { j } - y _ { i }$ . This edge vector accounts for the translation between the two layout elements, in addition to encoding their box IoUs, individual aspect ratios and relative orientation.
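As a rough illustration of the edge vector, the sketch below computes the eight entries for two boxes given as (centre x, centre y, width, height). The helper names `box_iou` and `edge_features` are hypothetical, and the angle is assumed to be the two-argument arctangent of $(\Delta y, \Delta x)$, which is consistent with the stated range $[-\pi, \pi]$.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (cx, cy, w, h)."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))  # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))  # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union


def edge_features(bi, bj, W, H):
    """8-D edge vector e_ij for the directed edge from box i to box j."""
    xi, yi, wi, hi = bi
    xj, yj, wj, hj = bj
    dx, dy = xj - xi, yj - yi
    Ai, Aj = wi * hi, wj * hj
    return [
        dx / math.sqrt(Ai),                      # normalised x offset
        dy / math.sqrt(Ai),                      # normalised y offset
        math.sqrt(Aj / Ai),                      # relative scale
        box_iou(bi, bj),                         # U_ij
        wi / hi,                                 # aspect ratio of i
        wj / hj,                                 # aspect ratio of j
        math.hypot(dx, dy) / math.hypot(W, H),   # distance over layout diagonal
        math.atan2(dy, dx),                      # relative angle (assumed two-argument form)
    ]


# Hypothetical example: two 4x4 boxes, offset by 2 along x, in an 8x6 layout.
e = edge_features((2, 2, 4, 4), (4, 2, 4, 4), W=8, H=6)
print(e)
```

Because the offsets are normalised by $\sqrt{A_i}$, the vector is directed: `edge_features(bi, bj, ...)` and `edge_features(bj, bi, ...)` generally differ, which matches the directed, fully connected graph described in Section 3.1.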
 # 3.2. Graph Matching Network
 The graph matching module employed in LayoutGMN is made up of three parts: (1) node and edge encoders, (2) message propagation layers and (3) an aggregator.
 Node and Edge Encoders. We use two MLPs to embed the initial node and edge features and compute their corresponding code vectors:
 $$
 \begin{array} { r } { { h _ { i } } ^ { ( 0 ) } = M L P _ { n o d e } ( \pmb { u _ { i } } ) , \forall i \in U } \\ { r _ { i j } = M L P _ { e d g e } ( \pmb { e _ { i j } } ) , \forall ( i , j ) \in E } \end{array}
 $$
 The above MLPs map the initial node and edge features to their 128-D code vectors.
 Message Propagation Layers. The graph matching framework hinges on coherent information exchange between graphs to compare two layouts in a structural manner. The propagation layers update the node features by aggregating messages along the edges within a graph, in addition to relying on a graph matching vector that measures how similar a node in one layout graph is to one or more nodes in the other. Specifically, given two node embeddings ${ h _ { i } ^ { ( 0 ) } }$ and $h _ { p } ^ { ( 0 ) }$ from two different layout graphs, the node updates for the node $i$ are given by:
 $$
 \begin{array} { c } { m _ { j \to i } = f _ { intra } ( h _ { i } ^ { ( t ) } , h _ { j } ^ { ( t ) } , r _ { i j } ) , \quad \forall ( i , j ) \in E _ { 1 } } \\ { \mu _ { p \to i } = f _ { cross } ( h _ { i } ^ { ( t ) } , h _ { p } ^ { ( t ) } ) , \quad \forall i \in V _ { 1 } , p \in V _ { 2 } } \\ { h _ { i } ^ { ( t + 1 ) } = f _ { update } \left( h _ { i } ^ { ( t ) } , \sum _ { j } m _ { j \to i } , \sum _ { p } \mu _ { p \to i } \right) } \end{array}
 $$
 where $f _ { i n t r a }$ is an MLP on the initial node embedding code that aggregates information from other nodes within the same graph, $f _ { c r o s s }$ is a function that communicates cross-graph information, and $f _ { u p d a t e }$ is an MLP used to update the node features in the graph, whose input is the concatenation of the current node features, the aggregated information from within, and across the graphs. $f _ { c r o s s }$ is designed as an Attention-based module:
 $$
 a _ { p \to i } = \frac { \exp \left( s _ { h } ( \pmb { h } _ { i } ^ { ( t ) } , \pmb { h } _ { p } ^ { ( t ) } ) \right) } { \sum _ { p } \exp \left( s _ { h } ( \pmb { h } _ { i } ^ { ( t ) } , \pmb { h } _ { p } ^ { ( t ) } ) \right) }
 $$
 $$
 \pmb { \mu } _ { p \to i } = a _ { p \to i } ( \pmb { h } _ { i } ^ { ( t ) } - \pmb { h } _ { p } ^ { ( t ) } )
 $$
 where $a _ { p \to i }$ is the attention value (scalar) between node $p$ in the second graph and node $i$ in the first, and such attention weights are calculated for every pair of nodes across the two graphs; $s _ { h }$ is implemented as the dot product of the embedded code vectors. The interaction of all the nodes $p \in V _ { 2 }$ with the node $i$ in $V _ { 1 }$ is then given by:
 $$
 \sum _ { p } \pmb { \mu } _ { p \to i } = \sum _ { p } a _ { p \to i } ( \pmb { h } _ { i } ^ { ( t ) } - \pmb { h } _ { p } ^ { ( t ) } ) = \pmb { h } _ { i } ^ { ( t ) } - \sum _ { p } a _ { p \to i } \pmb { h } _ { p } ^ { ( t ) }
 $$
 Intuitively, $\textstyle \sum _ { p } \pmb { \mu } _ { p \to i }$ measures the (dis)similarity between $\pmb { h } _ { i } ^ { ( t ) }$ and its nearest neighbor in the other graph. The pairwise attention computation results in stronger structural bonds between the two graphs, but requires additional computation. We use five rounds of message propagation, after which the representation of each node is updated accordingly.
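The cross-graph term can be sketched directly from the attention equations above. This is a minimal, dependency-free illustration assuming node embeddings are plain lists of floats and that $s_h$ is the dot product, as stated in the text; the function names `dot` and `cross_graph_messages` are hypothetical, and the max-subtraction inside the softmax is a standard numerical-stability step rather than something the text specifies.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_graph_messages(H1, H2):
    """For every node i in graph 1, return (sum_p mu_{p->i}, attention weights).

    sum_p mu_{p->i} = h_i - sum_p a_{p->i} * h_p, where a_{p->i} is a softmax
    over dot-product scores between h_i and every node h_p of graph 2.
    """
    results = []
    for h_i in H1:
        scores = [dot(h_i, h_p) for h_p in H2]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Attention-weighted average of the other graph's node embeddings.
        attended = [sum(a * h_p[d] for a, h_p in zip(attn, H2)) for d in range(len(h_i))]
        message = [hi - ad for hi, ad in zip(h_i, attended)]
        results.append((message, attn))
    return results


# Hypothetical toy embeddings: one node in graph 1, two identical nodes in graph 2.
msgs = cross_graph_messages([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
message, attn = msgs[0]
print(message, attn)
```

When a node has an exact counterpart in the other graph that dominates the attention, the message approaches the zero vector, which is the sense in which the sum measures (dis)similarity. The attention weights returned here are the quantities that could be inspected for element-level matching.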
 Aggregator. A 1024-D graph-level representation, $h _ { G }$ , is obtained via a feature aggregator MLP, $f _ { G }$ , that takes as input, the set of node representations $\{ h _ { i } ^ { ( T ) } \}$ , as given below:
 $$
 h _ { G } = M L P _ { G } \left( \sum _ { i \in V } \sigma ( M L P _ { g a t e } ( \pmb { h } _ { i } ^ { ( T ) } ) ) \odot M L P ( \pmb { h } _ { i } ^ { ( T ) } ) \right)
 $$
 Graph-level embeddings for the two layout graphs are computed in the same way:
 $$
 \begin{array} { r } { \pmb { h } _ { G _ { 1 } } = f _ { G } ( \{ \pmb { h } _ { i } ^ { ( T ) } \} _ { i \in V _ { 1 } } ) } \\ { \pmb { h } _ { G _ { 2 } } = f _ { G } ( \{ \pmb { h } _ { p } ^ { ( T ) } \} _ { p \in V _ { 2 } } ) } \end{array}
 $$
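The gated sum in the aggregator equation above can be illustrated as follows. This is only a sketch: each MLP is replaced by a single hypothetical linear map (`W_gate`, `W_val`, `W_out`), the dimensions are tiny rather than 128-D and 1024-D, and the names `linear` and `aggregate` are assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, x):
    """Matrix-vector product; stands in for an MLP in this sketch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def aggregate(node_feats, W_gate, W_val, W_out):
    """Gated sum over nodes followed by an output map.

    pooled = sum_i sigmoid(W_gate h_i) * (W_val h_i)   (elementwise product)
    h_G    = W_out pooled
    """
    pooled = [0.0] * len(W_val)
    for h in node_feats:
        gate = [sigmoid(g) for g in linear(W_gate, h)]
        value = linear(W_val, h)
        pooled = [p + g * v for p, g, v in zip(pooled, gate, value)]
    return linear(W_out, pooled)


# Hypothetical toy weights: zero gate weights give a constant gate of 0.5.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
nodes = [[1.0, 2.0], [3.0, 4.0]]
h_G = aggregate(nodes, W_gate=zeros, W_val=identity, W_out=identity)
print(h_G)
```

Because the pooling is a sum over nodes, the result does not depend on node ordering, which is a useful property for a graph-level embedding.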
 # 3.3. Training
 To learn a layout similarity metric, we borrow the Triplet training framework [13]. Specifically, given two pairs of layout graphs, i.e., anchor-positive and anchor-negative, each pair is passed through the same GMN module to get the graph embeddings in the context of the other graph, as shown in Fig 5. A margin loss based on the $L _ { 2 }$ distance between the graph embeddings, as given in equation 8, is used to backpropagate the gradients through GMN.
 $$
 L _ { tri } ( a , p , n ) = \max \left( 0 , \; \gamma + \left\| \pmb { h } _ { G _ { a } } - \pmb { h } _ { G _ { p } } \right\| _ { 2 } - \left\| \pmb { h } ^ { \prime } _ { G _ { a } } - \pmb { h } _ { G _ { n } } \right\| _ { 2 } \right)
 $$
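A minimal sketch of the margin loss above, assuming embeddings are plain lists of floats. The function names are hypothetical; the default margin of 5 is the value given in the Figure 5 caption. Note that the anchor contributes two different embeddings, one computed in the context of the positive graph and one in the context of the negative graph.

```python
import math


def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(h_a, h_p, h_a_prime, h_n, margin=5.0):
    """Margin loss on L2 distances of the two paired embeddings.

    h_a       : anchor embedding computed jointly with the positive graph
    h_a_prime : anchor embedding computed jointly with the negative graph
    """
    return max(0.0, margin + l2_distance(h_a, h_p) - l2_distance(h_a_prime, h_n))


# Hypothetical toy embeddings.
easy = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [6.0, 8.0])  # negative is far
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0])  # negative coincides with anchor
print(easy, hard)
```

The loss is zero once the anchor-to-negative distance exceeds the anchor-to-positive distance by at least the margin, so only triplets that violate this ordering contribute gradients.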
 # 4. Datasets
 We use two kinds of layout datasets in our experiments: (1) UI layouts from the RICO dataset [7], and (2) floorplans from the RPLAN dataset [42]. After some data filtering, the sizes of the two datasets are 66,261 and 77,669, respectively.
 ![](images/1e1f54d6b4c7441623fd6af31c439e83cd8f899efc5f9d2f7465ab923b69b261.jpg)  
 Figure 5. Given a triplet of graphs $G_a$, $G_p$ and $G_n$ corresponding to the anchor, positive and negative examples respectively, the anchor graph paired with each of the other two graphs is passed through a Graph Matching Network (Fig 4) to get two 1024-D embeddings. Note that the anchor graph has different contextual embeddings $\pmb{h}_{G_a}$ and $\pmb{h}'_{G_a}$. LayoutGMN is trained using the margin loss (margin $= 5$) on the $L_2$ distances of the two paired embeddings.
 In the absence of a ground truth label set and the need for obtaining the triplets in a consistent manner, we resort to using IoU values of two layouts, represented as multichannel images, to ascertain their closeness. Given an anchor layout, the threshold on IoU values to classify another layout as positive, from observations, is 0.6 for both UIs and floorplans. Negative examples are those that have a threshold value at least 0.1 lower than the positive ones, which avoids some incorrect "negatives" during training. The train-test sizes for the aforementioned datasets are, respectively, 7,700-1,588 and 25,000-7,204. In the filtered floorplan training dataset [42], the number of distinct semantic categories (rooms) across the dataset is nine and the maximum number of rooms per floorplan is eight. Similarly, for the filtered UI layout dataset [7], the number of distinct semantic categories is twenty-five and the number of elements per UI layout across the dataset is at most one hundred.
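The IoU-based weak labelling can be sketched as follows. Two points are assumptions rather than details given in the text: the IoU is pooled over all channels (total intersection over total union), and the negative rule is read as "the negative's IoU with the anchor is at least 0.1 below the positive's". Both function names are hypothetical.

```python
def multichannel_iou(a, b):
    """IoU of two binary multi-channel images given as nested lists [channel][row][col].

    Assumption: intersection and union are pooled over all channels.
    """
    inter = 0
    union = 0
    for ch_a, ch_b in zip(a, b):
        for row_a, row_b in zip(ch_a, ch_b):
            for x, y in zip(row_a, row_b):
                inter += 1 if (x and y) else 0
                union += 1 if (x or y) else 0
    return inter / union if union else 0.0


def is_valid_triplet(iou_pos, iou_neg, pos_threshold=0.6, gap=0.1):
    """True if the positive passes the 0.6 threshold and the negative trails it by >= gap."""
    return iou_pos >= pos_threshold and iou_neg <= iou_pos - gap
```

Under this reading, a candidate pair with IoUs of 0.9 and 0.2 forms a usable triplet, while 0.7 and 0.65 does not, because the negative is too close to the positive.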
 # 5. Results and Evaluation
 We evaluate LayoutGMN by comparing its retrieval results to those of several baselines, evaluated using human judgements. Similarity prediction by our network is efficient: taking 33 milliseconds per layout pair on a CPU. With our learning framework, we can efficiently retrieve multiple, sorted results by batching the database samples.
 # 5.1. Baselines
 Graph Kernel (GK) [8]. GK is one of the earliest structural similarity metrics, initially developed to compare indoor 3D scenes. We adapt it to 2D layouts of floorplans and UI designs. We input the same layout graphs to GK to get retrievals from the two databases, and use the best setting based on the trade-off between result quality and computation cost.
 <table><tr><td>Method</td><td>k=1 (↑)</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>Graph Kernel [8]</td><td>33.33</td><td>15.83</td><td>11.46</td></tr><tr><td>U-Net_Triplet [35]</td><td>27.08</td><td>10.83</td><td>7.92</td></tr><tr><td>IoU Metric</td><td>43.75</td><td>22.92</td><td>14.38</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>39.6</td><td>17.1</td><td>13.33</td></tr><tr><td>LayoutGMN</td><td>47.91</td><td>22.92</td><td>15.83</td></tr><tr><td>Graph Kernel [8]</td><td>27.27</td><td>15.15</td><td>12.42</td></tr><tr><td>U-Net_Triplet [35]</td><td>28.28</td><td>18.18</td><td>15.05</td></tr><tr><td>IoU Metric</td><td>33.84</td><td>24.04</td><td>17.48</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>37.37</td><td>22.02</td><td>17.02</td></tr><tr><td>LayoutGMN</td><td>38.38</td><td>25.35</td><td>21.21</td></tr></table>
 Table 1. Precision scores for the top-k retrieved results obtained using different methods, on a set of randomly chosen UI and floorplan queries. The first set of five comparisons is for UI layouts, followed by floorplans.
 U-Net [35]. U-Net is one of the best segmentation networks; we use it in a triplet network setting to auto-encode layout images. The input to the network is a multi-channel image with semantic segmentations. The network is trained on the same set of triplets as LayoutGMN until convergence.
 IoU Metric. Given two multi-channel images, we use the IoU values between two layout images to get their IoU score, and use this score to sort the examples in the datasets to rank the retrievals for a given query.
 GCN-CNN [30]. The state-of-the-art network for structural similarity on UI layouts is a hybrid network comprised of an attention-based GCN, similar to the gating mechanism in [28], coupled with a CNN. In this original GCN-CNN, the training triplets are randomly sampled every epoch, leading to better training due to diverse training data. In our work, for a fair comparison over all the aforementioned networks, we sample a fixed set of triplets in every epoch of training. The GCN-CNN network is trained on the two datasets of our interest, using the same training data as ours.
 Qualitative retrieval results for GCN-CNN, IoU metric and LayoutGMN for a given query are shown in Figure 6.
 # 5.2. Evaluation Metrics
 Precision@k scores. To validate the correctness of LayoutGMN as a tool for measuring layout similarity, we start by evaluating layout retrieval from a large database. A standard evaluation protocol for the relevance of ranked lists is the Precision@k score [31], or $P@k$ for short. Given a query $q_i$ from the query set $Q = \{ q_1, q_2, q_3, \ldots, q_n \}$, we measure the relevance of the ranked lists $L(q_i) = [L_{i1}, L_{i2}, \ldots, L_{ik}, \ldots]$ using the precision score,
 $$
 P@k(Q, L) = \frac{1}{k |Q|} \sum_{q_i \in Q} \sum_{j=1}^{k} \mathrm{rel}(L_{ij}, q_i),
 $$
 ![](images/817e17e26c81262c41e6cfdecb5f3145cb19873bc1193aab7bf50bb54c10308a.jpg)  
 Figure 6. Top-5 retrieved results for an input query based on IoU metric, GCN-CNN Triplet [30] and LayoutGMN. We observe that the ranked results returned by LayoutGMN are closer to the input query than the other two methods, although it was trained on triplets computed using the IoU metric. Attention weights for understanding structural correspondence in LayoutGMN are shown in Figure 1 and also provided in the supplementary material. UI and floorplan IDs from the RICO dataset [7] and RPLAN dataset [42], respectively, are indicated on top of each result. More results can be found in the supplementary material.
 where $\mathrm{rel}(L_{ij}, q_i)$ is a binary indicator of the relevance of the returned element $L_{ij}$ for query $q_i$. In our evaluation, due to the lack of a labeled and exhaustive recommendation set for any query over the layout datasets employed, such a binary indicator is determined by human subjects.
 Table 1 shows the $P@k$ scores for the different networks described in Section 5.1 on the layout retrieval task. To get the precision scores, similar to [30], we conducted a crowd-sourced annotation study via Amazon Mechanical Turk (AMT) on the top-10 retrievals per query, for $N$ ($N = 50$ for UIs and 100 for floorplans) randomly chosen queries outside the training set. Ten turkers were asked to indicate the structural relevance of each of the top-10 results per query, without any specific instructions on what a structural comparison means. A result was considered relevant if at least 6 turkers agreed. For details on the AMT study, please see the supplementary material.
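The precision computation can be sketched as below, assuming each query comes with a ranked list of binary relevance flags of length at least $k$. The vote-aggregation helper mirrors the "at least 6 of 10 turkers" rule; both function names are illustrative.

```python
def relevance_from_votes(votes, min_agree=6):
    """Binary relevance: 1 if at least min_agree annotators marked the result relevant."""
    return 1 if sum(votes) >= min_agree else 0


def precision_at_k(rel_lists, k):
    """P@k over a query set.

    rel_lists[i][j] is the 0/1 relevance of the j-th ranked result for query i;
    every list is assumed to contain at least k entries.
    """
    total_relevant = sum(sum(rel[:k]) for rel in rel_lists)
    return total_relevant / (k * len(rel_lists))
```

For example, with two queries whose top-3 relevance flags are `[1, 0, 1]` and `[0, 0, 1]`, three of the six retrieved results are relevant, so $P@3$ is 0.5.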
 We observe that LayoutGMN better matches humans' notion of structural similarity. [30] performs better than the IoU metric on floorplan data $(+3.5\%)$ on the top-1 retrievals and is comparable to the IoU metric on top-5 and top-10 results. On UI layouts, the IoU metric is judged better by turkers than [30]. U-Net fails to retrieve structurally similar results as it overfits on the small amount of training data, and relies more on image pixels due to its convolutional structure. LayoutGMN matches or outperforms the other methods for all $k$ on both datasets: it ties the IoU metric at $k = 5$ on UIs and otherwise leads by at least $1\%$. The precision scores on floorplans (bottom set) are lower than on UI layouts, perhaps because floorplans are easier to compare owing to their smaller set of semantic elements, so turkers tend to focus more on the size and boundary of the floorplans in addition to the structural arrangements. We believe that when many semantic elements are present in the layouts and are scattered (as in UIs), users tend to look at the overall structure instead of trying to match every single element, owing to reduced attention span, which likely explains the higher scores for UIs.
 Overlap@k score. We propose another measure to quantify the stability of retrieved results: the Overlap@k score, or $Ov@k$ for short. The intuition behind $Ov@k$ is to quantify the consistency of retrievals for any similarity metric, by checking the number of similarly retrieved results for a query and its top-1 result. The higher this score, the better the retrieval consistency, and thus, the higher the retrieval stability. Specifically, if $Q_1$ is a set of queries and $Q_1^{\mathrm{top1}}$ the set of top-1 retrieved results for every query in $Q_1$, then
 Table 2. Overlap scores for checking the consistency of retrievals for a query and its top-1 retrieved result, over 50 such pairs. The first set of three rows are for UI layouts, followed by floorplans.   
 <table><tr><td>Method</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>IoU Metric</td><td>50.6</td><td>49.4</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>46.8</td><td>45.6</td></tr><tr><td>LayoutGMN</td><td>49.8</td><td>49.8</td></tr><tr><td>IoU Metric</td><td>30.42</td><td>30.8</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>43.2</td><td>46.8</td></tr><tr><td>LayoutGMN</td><td>47.6</td><td>50.8</td></tr></table>
 $$
 Ov@k(Q_1, Q_1^{\mathrm{top1}}) = \frac{1}{k |Q_1|} \sum_{\substack{q_m \in Q_1 \\ q_p = \mathrm{top1}(q_m)}} \sum_{j=1}^{k} (L_{mj} \wedge L_{pj}),
 $$
 where $L_{ij}$ is the $j^{th}$ ranked result for the query $q_i$, and $\wedge$ is the logical AND. Thus, $(L_{mj} \wedge L_{pj})$ is 1 if the $j^{th}$ results for query $q_m \in Q_1$ and query $q_p = \mathrm{top1}(q_m) \in Q_1^{\mathrm{top1}}$ are the same. $Ov@k$ measures the ability of the layout similarity metric to replicate the distance field implied by a query by its top-ranked retrieved result. The score makes sense only when the ranked results returned by a layout similarity tool are deemed reasonable, as assessed by the $P@k$ scores.
 Table 2 shows the $Ov@k$ scores with $k = 5, 10$ for IoU, GCN-CNN [30], and LayoutGMN on 50 such pairs. On UIs (first three rows), the IoU metric has a slightly higher $Ov@5$ score $(+0.6\%)$ than LayoutGMN. Also, it shares the largest $P@5$ score with LayoutGMN, indicating that the IoU metric has slightly better retrieval stability for the top-5 results. However, in the case of $Ov@10$, LayoutGMN has a higher score $(+0.4\%)$ than the IoU metric and also has a higher $P@10$ score than the other two methods, indicating that when top-10 retrievals are considered, LayoutGMN has slightly better consistency on the retrievals.
 As for floorplans (last three rows), Table 1 already shows that LayoutGMN has the best $P@k$ scores. This, coupled with higher $Ov@k$ scores, indicates that on floorplans, LayoutGMN has better retrieval stability. In the supplementary material, we show qualitative results on the stability of retrievals for the three methods.
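A minimal sketch of the overlap score, following the definition literally: the ranked list of a query and that of its top-1 result are compared position by position. The input layout (a list of pairs of ranked ID lists) and the function name are assumptions made for illustration.

```python
def overlap_at_k(ranked_pairs, k):
    """Ov@k over pairs (list_m, list_p).

    list_m is the ranked result list of a query q_m and list_p the ranked list
    of its top-1 result q_p; each list is assumed to hold at least k IDs.
    """
    matches = 0
    for list_m, list_p in ranked_pairs:
        # Count positions j < k where both queries return the same item.
        matches += sum(1 for j in range(k) if list_m[j] == list_p[j])
    return matches / (k * len(ranked_pairs))
```

For instance, the pair `(["a", "b", "c"], ["a", "x", "c"])` agrees at two of three positions, giving a score of 2/3 at $k = 3$.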
 Classification accuracy. We also measure the classification accuracy of test triplets as a sanity check. However, such a measure alone is not sufficient to establish the correctness of a similarity metric employed in information retrieval tasks [31]. We present it alongside the $P@k$ and $Ov@k$ scores for a broader, informed evaluation, in Table 3. Since user annotations are expensive and time consuming (hence the motivation to use the IoU metric to get weak training labels), we only get user annotations on 452 triplets for both UIs and floorplans, and the last column of Table 3 reflects the accuracy on such triplets. LayoutGMN outperforms all the baselines by at least $1.32\%$, on triplets obtained using both the IoU metric and user annotations.
 Table 3. Classification accuracy on test triplets obtained using IoU metric (IoU-based) and annotated by users (User-based). The first set of comparisons is for UI layouts, followed by floorplans.   
 <table><tr><td rowspan="2">Method</td><td colspan="2">Test Accuracy on Triplets</td></tr><tr><td>IoU-based (↑)</td><td>User-based (↑)</td></tr><tr><td>Graph Kernel [8]</td><td>90.09</td><td>90.73</td></tr><tr><td>U-Net_Triplet [35]</td><td>96.67</td><td>93.38</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>96.45</td><td>94.48</td></tr><tr><td>LayoutGMN</td><td>98.96</td><td>95.80</td></tr><tr><td>Graph Kernel [8]</td><td>92.07</td><td>95.60</td></tr><tr><td>U-Net_Triplet [35]</td><td>93.01</td><td>91.00</td></tr><tr><td>GCN-CNN_Triplet [30]</td><td>92.50</td><td>91.8</td></tr><tr><td>LayoutGMN</td><td>97.54</td><td>97.60</td></tr></table>
 ![](images/d21068a819bf1cb5f15f4b3be9c971729ba516dfc823ec24a76c51e0bfdf0b9a.jpg)  
 Figure 7. Retrieval results for the bottom-left query in Fig 6, when adjacency graphs are used. We observe, on most of the queries, that the performance of LayoutGMN improves, but degrades in the case of GCN-CNN [30] on floorplan data.
 # 5.3. Fully-connected vs. Adjacency Graphs
 Following [30], we employed fully connected graphs for our experiments until now and observed that such graphs are a good design for training graph neural networks for learning structural similarity. We also performed experiments using adjacency graphs on GCN-CNN [30] and LayoutGMN, and observed that, for floorplans (where the graph node count is small), the quality of retrievals improved in the case of LayoutGMN, but degraded for GCN-CNN. This is mainly because GCN-CNN obtains independent graph embeddings for each input graph and when the graphs are built only on adjacency connections, some amount of global structural prior is lost. On the other hand, GMNs obtain better contextual embeddings by now matching the sparsely connected adjacency graphs, as a result of narrower search space; for a qualitative result using adjacency graphs, see Figure 7. However, for UIs (where the graph node count is large), the elements are scattered all over the layout, and no one heuristic is able to capture adjacency relations perfectly. The quality of retrievals for both the networks degraded when using adjacency graphs on UIs. More results can be found in the supplementary material.
 <table><tr><td>Structure encoding with</td><td>k=1 (↑)</td><td>k=5 (↑)</td><td>k=10 (↑)</td></tr><tr><td>No edges</td><td>30</td><td>16.39</td><td>11.3</td></tr><tr><td>No box positions</td><td>15</td><td>7.2</td><td>5.4</td></tr><tr><td>No node semantics</td><td>24</td><td>11.2</td><td>8.4</td></tr></table>
 Table 4. Precision@k scores for ablation studies on structural encoding of floorplan graphs. The setup for crowd-sourced relevance judgements via AMT is the same as in Table 1, on the same set of 100 randomly chosen queries.
 # 5.4. Ablation Studies on Structural Representation
 To evaluate how the node and edge features in our layout representation contribute to network performance, we conduct an ablation study by gradually removing these features. Our design of the initial representation of the layout graphs (Sec 3.1) is well studied in prior works on layout generation [11, 26], visual reasoning, and relationship detection tasks [12, 44, 30]. As such, we focus on analyzing LayoutGMN's behavior when strong structural priors, viz., the edges, box positions, and element semantics, are ablated.
 Graph edges. Removing graph edges results in loss of structural information, with only the attention-weighted node update (Eq. 4) taking place. When the number of graph nodes is small, e.g., for floorplans, edge removal does not lead to random retrievals, but the retrieved results are poorer compared to when edges are present; see Table 4.
 Effect of box positions. The nodes of the layout graphs encode both the absolute box positions and the element semantics. When the position encoding information is withdrawn, arguably, the most important cue is lost. The resulting retrievals from such a poorly trained model, as seen in the second row of Table 4, are noisy as semantics alone do not provide enough structural priors.
 Effect of node semantics. Next, when the box positions are preserved but the element semantics are not encoded, we observe that the network slowly begins to understand element comparison guided by the position info, but falls short of understanding the overall structure information, see Table 4. LayoutGMN takes into account all the above information returning structurally sound results (Table 1), even relative to the IoU metric.
 # 5.5. Attention-based Layout Label Transfer
 We present layout label transfer, via attention-based structural element matching, as a natural application of LayoutGMN. Given a source layout image $I _ { 1 }$ with known labels, the goal is to transfer the labels to a target layout $I _ { 2 }$ .
 ![](images/ed308e04292b05893b2144d0c5147d0b580f1e468750bac4cdac2e7eddcc3460.jpg)  
 Figure 8. Element-level label transfer results from a source image $I _ { 1 }$ to a target image $I _ { 2 }$ , using a pretrained LayoutGMN vs. maximum pixel-overlap matching. LayoutGMN predicts correct labels via attention-based element matching.
 A straightforward approach to establishing element correspondence is via maximum area/pixel-overlap matching for every element in $I_2$ with respect to all the elements in $I_1$. However, this scheme is highly sensitive to element positions within the two layouts. Moreover, raster alignment (via translations) of layouts is non-trivial to formulate when the two layout images have different boundaries and structures. LayoutGMN, on the other hand, is robust to such boundary variations, and can be directly used to obtain element-level correspondences using the built-in attention mechanism that provides an attention score for every element-level match. Specifically, we use a pretrained LayoutGMN which is fed with two layout graphs, where the semantic encoding of all nodes is set to a vector of ones.
 As shown in Figure 8, the pretrained LayoutGMN is able to find the correct labels despite masking the semantic information at the input. Note that when semantic information is masked at the input, such a transfer cannot be applied to any two layouts. It is limited by a weak/floating alignment of $I_1$ and $I_2$, as seen in Figure 8.
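One way to read attention-based label transfer is as an argmax over attention scores. The sketch below assumes an attention matrix is already available from a pretrained model, with one row per target element and one column per source element; picking the highest-scoring source element is an illustrative decision rule, not one stated in the text, and the function name is hypothetical.

```python
def transfer_labels(attention, source_labels):
    """Assign each target element the label of its most-attended source element.

    attention[i][p] is the attention score between target element i (in I2)
    and source element p (in I1).
    """
    transferred = []
    for row in attention:
        best = max(range(len(row)), key=lambda p: row[p])  # index of strongest match
        transferred.append(source_labels[best])
    return transferred
```

With scores `[[0.1, 0.9], [0.8, 0.2]]` and source labels `["kitchen", "bedroom"]`, the two target elements receive `"bedroom"` and `"kitchen"`, respectively.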
 # 6. Conclusion, limitation, and future work
 We present the first deep neural network to offer both metric learning of structural layout similarity and structural matching between layout elements. Extensive experiments demonstrate that our metric best matches human judgement of structural similarity for both floorplans and UI designs, compared to all well-known baselines.
 The main limitation of our current learning framework is the requirement for strong supervision, which justifies, in part, the use of the less-than-ideal IoU metric for network training. An interesting future direction is to combine few-shot or active learning with our GMN-based triplet network, e.g., by finding ways to obtain small sets of training triplets that are both informative and diverse [24]. Another limitation of our current network is that it does not learn hierarchical graph representations or structural matching, which would have been desirable when handling large graphs.
 Acknowledgements. We thank the anonymous reviewers for their valuable comments, and the AMT workers for offering their feedback. This work was supported, in part, by an NSERC grant (611370) and an Adobe gift.
 # References
 [1] Oron Ashual and Lior Wolf. Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4561–4569, 2019. 3   
 [2] Thorsten Brants. Inter-annotator agreement for a german newspaper corpus. In International Conference on Knowledge Engineering and Knowledge Management, 2000. 2   
 [3] Thomas M Breuel. High performance document layout analysis. In Proceedings of the Symposium on Document Image Understanding Technology, pages 209–218, 2003. 2   
 [4] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017. 2   
 [5] Qi Chen, Qi Wu, Rui Tang, Yuhan Wang, Shuai Wang, and Mingkui Tan. Intelligent home 3d: Automatic 3d-house design from linguistic descriptions only. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12625–12634, 2020. 2   
 [6] Niraj Ramesh Dayama, Kashyap Todi, Taru Saarelainen, and Antti Oulasvirta. GRIDS: Interactive layout design with integer programming. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–13, 2020. 2   
 [7] Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pages 845–854, 2017. 2, 4, 5, 6   
 [8] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Characterizing structural relationships in scenes using graph kernels. In ACM SIGGRAPH 2011 papers, pages 1–12. 2011. 2, 5, 7   
 [9] Karen Fort, Maud Ehrmann, and Adeline Nazarenko. Towards a methodology for named entities annotation. 2009. 2   
 [10] Huan Fu, Bowen Cai, Lin Gao, Lingxiao Zhang, Rongfei Jia, Binqiang Zhao, and Hao Zhang. 3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics, 2020. 1   
 [11] Akshay Gadi Patil, Omri Ben-Eliezer, Or Perel, and Hadar Averbuch-Elor. READ: Recursive autoencoders for document layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 544–545, 2020. 1, 2, 3, 8   
 [12] Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, and Hanqing Lu. Aligning linguistic words and visual semantic units for image captioning. In Proceedings of the 27th ACM International Conference on Multimedia, pages 765–773, 2019. 3, 8   
 [13] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015. 3, 4   
 [14] George Hripcsak and Adam Wilcox. Reference standards, judges, and comparison subjects: roles for experts in evaluating system performance. Journal of the American Medical Informatics Association, 9(1):1–15, 2002. 2   
 [15] Ruizhen Hu, Zeyu Huang, Yuhan Tang, Oliver van Kaick, Hao Zhang, and Hui Huang. Graph2Plan: Learning floorplan generation from layout graphs. ACM Transactions on Graphics (TOG), 2020. 1, 2   
 [16] Nathan Hurst, Wilmot Li, and Kim Marriott. Review of automatic document formatting. In Proceedings of the 9th ACM symposium on Document engineering, pages 99–108, 2009. 2   
 [17] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1219–1228, 2018. 3   
 [18] Rangachar Kasturi. Document image analysis, volume 39. 2   
 [19] Nagma Khan, Ushasi Chaudhuri, Biplab Banerjee, and Subhasis Chaudhuri. Graph convolutional network for multilabel vhr remote sensing scene recognition. Neurocomputing, 357:36–46, 2019. 3   
 [20] Jin-Dong Kim, Tomoko Ohta, and Jun’ichi Tsujii. Corpus annotation for mining biomedical events from literature. BMC bioinformatics, 9(1):10, 2008. 2   
 [21] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. 2017. 2   
 [22] Yanir Kleiman, Oliver van Kaick, Olga Sorkine-Hornung, and Daniel Cohen-Or. SHED: shape edit distance for fine-grained shape similarity. ACM Transactions on Graphics (TOG), 34(6):1–11, 2015. 2   
 [23] Ranjitha Kumar, Jerry O Talton, Salman Ahmad, and Scott R Klemmer. Bricolage: example-based retargeting for web design. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2197–2206, 2011. 2   
 [24] Priyadarshini Kumari, Ritesh Goru, Siddhartha Chaudhuri, and Subhasis Chaudhuri. Batch decorrelation for active metric learning. In IJCAI-PRICAI, 2020. 8   
 [25] Jianan Li, Tingfa Xu, Jianming Zhang, Aaron Hertzmann, and Jimei Yang. LayoutGAN: Generating graphic layouts with wireframe discriminator. In International Conference on Learning Representations, 2019. 1, 2   
 [26] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri, Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen, Daniel Cohen-Or, and Hao Zhang. GRAINS: Generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG), 38(2):1–16, 2019. 1, 8   
 [27] Yujia Li, Chenjie Gu, Thomas Dullien, Oriol Vinyals, and Pushmeet Kohli. Graph matching networks for learning the similarity of graph structured objects. In ICML, 2019. 1, 3   
 [28] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. 2016. 2, 5   
 [29] Thomas F Liu, Mark Craft, Jason Situ, Ersin Yumer, Radomir Mech, and Ranjitha Kumar. Learning design semantics for mobile apps. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pages 569–579, 2018. 2   
 [30] Dipu Manandhar, Dan Ruta, and John Collomosse. Learning structural similarity of user interface layouts using graph networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2020. 1, 2, 3, 5, 6, 7, 8   
 [31] Christopher D Manning, Hinrich Schütze, and Prabhakar Raghavan. Chapter 8: Evaluation in information retrieval in “Introduction to Information Retrieval”. pages 151–175. Cambridge university press, 2008. 5, 7   
 [32] Nelson Nauata, Kai-Hung Chang, Chin-Yi Cheng, Greg Mori, and Yasutaka Furukawa. House-gan: Relational generative adversarial networks for graph-constrained house layout generation. Eur. Conf. Comput. Vis., 2020. 2   
 [33] Peter O’Donovan, Aseem Agarwala, and Aaron Hertzmann. Learning layouts for single-page graphic designs. IEEE transactions on visualization and computer graphics, 20(8):1200–1213, 2014. 2   
 [34] Daniel Ritchie, Ankita Arvind Kejriwal, and Scott R Klemmer. d. tour: Style-based exploration of design example galleries. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 165–174, 2011. 2   
 [35] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015. 5, 7   
 [36] Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pages 593–607. Springer, 2018. 2   
 [37] Amanda Swearngin, Mira Dontcheva, Wilmot Li, Joel Brandt, Morgan Dixon, and Andrew J Ko. Rewire: Interface design assistance from examples. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pages 1–12, 2018. 2   
 [38] Sou Tabata, Hiroki Yoshihara, Haruka Maeda, and Kei Yokoyama. Automatic layout generation for graphical design magazines. In ACM SIGGRAPH 2019 Posters, pages 1–2. 2019. 2   
 [39] Subarna Tripathi, Sharath Nittur Sridhar, Sairam Sundaresan, and Hanlin Tang. Compact scene graphs for layout composition and patch retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019. 3   
 [40] Raoul Wessel, Ina Blümel, and Reinhard Klein. The room connectivity graph: Shape retrieval in the architectural domain. 2008. 2   
 [41] W John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC bioinformatics, 7(1):1–10, 2006. 2   
 [42] Wenming Wu, Xiao-Ming Fu, Rui Tang, Yuhan Wang, Yu-Hao Qi, and Ligang Liu. Data-driven interior plan generation for residential buildings. ACM Transactions on Graphics (TOG), 38(6):1–12, 2019. 1, 2, 5, 6   
 [43] Kai Xu, Rui Ma, Hao Zhang, Chenyang Zhu, Ariel Shamir, Daniel Cohen-Or, and Hui Huang. Organizing heterogeneous scene collections through contextual focal points. ACM Transactions on Graphics (TOG), 33(4):1–12, 2014. 2   
 [44] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In Proceedings of the European conference on computer vision (ECCV), pages 684–699, 2018. 3, 8   
 [45] Ziqi Zhang, Sam Chapman, and Fabio Ciravegna. A methodology towards effective and efficient manual document annotation: addressing annotator discrepancy and annotation quality. In International Conference on Knowledge Engineering and Knowledge Management, pages 301–315. Springer, 2010. 2   
 [46] Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, and Zihan Zhou. Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling. In Eur. Conf. Comput. Vis., 2020. 1   
 [47] Xinru Zheng, Xiaotian Qiao, Ying Cao, and Rynson WH Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019. 1, 2   
 [48] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE, 2019. 1
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_content_list.json
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_content_list.json
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_layout.pdf
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_layout.pdf
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_middle.json
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_middle.json
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_model.json
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_model.json
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_origin.pdf
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_origin.pdf
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_span.pdf
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper_span.pdf
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/10feb976b30a50b3bf9498ec94785e62b94d2096b0b00037ecb547ef258cdaff.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/10feb976b30a50b3bf9498ec94785e62b94d2096b0b00037ecb547ef258cdaff.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/1e1f54d6b4c7441623fd6af31c439e83cd8f899efc5f9d2f7465ab923b69b261.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/1e1f54d6b4c7441623fd6af31c439e83cd8f899efc5f9d2f7465ab923b69b261.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/2f667b39a7233bf24a6506b694fc4bf4eb9bd23803bde0a0d4e1f2ff6463b3c0.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/2f667b39a7233bf24a6506b694fc4bf4eb9bd23803bde0a0d4e1f2ff6463b3c0.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/411eea64f4e568016807787a0af3f7b46defabc3457da38da3b7a1ef6bd1b54b.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/411eea64f4e568016807787a0af3f7b46defabc3457da38da3b7a1ef6bd1b54b.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/4c6bdd27721e5b02267b9edbef1694ea83392e585a4f5ef8c2b10194e2df1499.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/4c6bdd27721e5b02267b9edbef1694ea83392e585a4f5ef8c2b10194e2df1499.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/516817b84bdaf3db241d1a3b87d316578c8f2d9adb29bb8a247a3e00042ba1d0.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/516817b84bdaf3db241d1a3b87d316578c8f2d9adb29bb8a247a3e00042ba1d0.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/59dd33c8a21d88fc50b1df4467be74ee3f29b6f6c64f9d70b6281e57a8abe758.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/59dd33c8a21d88fc50b1df4467be74ee3f29b6f6c64f9d70b6281e57a8abe758.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/76179359f537652a648a8d2094196e528e584399d6cb01cf8f854181aa609e51.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/76179359f537652a648a8d2094196e528e584399d6cb01cf8f854181aa609e51.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/817e17e26c81262c41e6cfdecb5f3145cb19873bc1193aab7bf50bb54c10308a.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/817e17e26c81262c41e6cfdecb5f3145cb19873bc1193aab7bf50bb54c10308a.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/82dde9ffc461703a8ff5e225a05bcb0ad64d1f549bc247921030e9314aaf9122.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/82dde9ffc461703a8ff5e225a05bcb0ad64d1f549bc247921030e9314aaf9122.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/87b947faa6d2b8bcb8b0379632e969c4ca927ce4c48b34ef6864ef146e70723b.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/87b947faa6d2b8bcb8b0379632e969c4ca927ce4c48b34ef6864ef146e70723b.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/8f6e01e62970eb20310114fd8dda2f3e2764438b6978d68e67448446402af404.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/8f6e01e62970eb20310114fd8dda2f3e2764438b6978d68e67448446402af404.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/939bcda0c0c4de7dc9855979ac03e34cc2fece15e7d532d2941505334eb83594.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/939bcda0c0c4de7dc9855979ac03e34cc2fece15e7d532d2941505334eb83594.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/a088045c41ff7c6ab0685a32660ff5a77ef9e48d97a751e3bcb1ba9d4388203d.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/a088045c41ff7c6ab0685a32660ff5a77ef9e48d97a751e3bcb1ba9d4388203d.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/c3a7552c1806cf424e1f85d301f328c30dbb89ea44f29e4b5e5e75c3b504328f.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/c3a7552c1806cf424e1f85d301f328c30dbb89ea44f29e4b5e5e75c3b504328f.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/c98ae761342b0ee9f72f191d39f126a99c0a57864cd22dc4e625d2649a4df09e.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/c98ae761342b0ee9f72f191d39f126a99c0a57864cd22dc4e625d2649a4df09e.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d21068a819bf1cb5f15f4b3be9c971729ba516dfc823ec24a76c51e0bfdf0b9a.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d21068a819bf1cb5f15f4b3be9c971729ba516dfc823ec24a76c51e0bfdf0b9a.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d3d7e23e91a10c7264a6b7ba97a79a1125925a47fd46b386252856201f198a34.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d3d7e23e91a10c7264a6b7ba97a79a1125925a47fd46b386252856201f198a34.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d59914d8e0f16bd4dff520e0e6726be5185fc2f68bc0692a634dd432805eefa1.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/d59914d8e0f16bd4dff520e0e6726be5185fc2f68bc0692a634dd432805eefa1.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/db449fea3e3d3d80d9beaa007a4b1e918e94a547e1f4366136f50cbd95079528.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/db449fea3e3d3d80d9beaa007a4b1e918e94a547e1f4366136f50cbd95079528.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/dbf36a9bc74a952d669f6c8dbcf2bf61923c7463d458d9df26ace1fd070e44a6.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/dbf36a9bc74a952d669f6c8dbcf2bf61923c7463d458d9df26ace1fd070e44a6.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/ed308e04292b05893b2144d0c5147d0b580f1e468750bac4cdac2e7eddcc3460.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/ed308e04292b05893b2144d0c5147d0b580f1e468750bac4cdac2e7eddcc3460.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/f0a4eb226a10834e1fc610ecbc06337c5ffae80644cf03814bb2d4bf0775005e.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/f0a4eb226a10834e1fc610ecbc06337c5ffae80644cf03814bb2d4bf0775005e.jpg
--- a/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/f10481981e35d39196c2e87361807eefc1bb42a4a053a94383ee39daf78b1368.jpg
+++ b/reference/LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images/f10481981e35d39196c2e87361807eefc1bb42a4a053a94383ee39daf78b1368.jpg
--- a/reference/LayoutGMN_zh.md
+++ b/reference/LayoutGMN_zh.md
@@ -1,114 +0,0 @@
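 # LayoutGMN: Neural Graph Matching for Structural Layout Similarity
 Akshay Gadi Patil1  Manyi Li1†  Matthew Fisher2  Manolis Savva1  Hao Zhang1  
 1Simon Fraser University  2Adobe Research
 # Abstract
 We present a deep neural network that predicts the structural similarity between 2D layouts using Graph Matching Networks (GMN). The network, called LayoutGMN, learns a layout metric through neural graph matching with an attention-based GMN in a triplet learning framework. To train the network, we define the triplet loss using weak labels based on pixel-wise Intersection-over-Union (IoU). Importantly, LayoutGMN has a structural inductive bias that effectively compensates for IoU's lack of structure awareness. We validate the method through retrieval experiments on large-scale datasets of two representative layout types: floorplans and UI designs. Compared with IoU and with baselines that include state-of-the-art methods based on graph neural networks and image convolution, our network matches human judgment of structural similarity more closely. In addition, LayoutGMN is the first deep model to provide both a structural similarity metric and element-level structural matching.
 # 1. Introduction
 2D layouts are ubiquitous in graphic and architectural design, often serving as blueprints or conceptual sketches, for example floorplans, document layouts, and UI designs. Existing evaluation protocols still rely heavily on human evaluation and on pixel-level metrics such as IoU, and pixel-level metrics cannot model structure. We propose LayoutGMN, which represents and compares layouts from a structural perspective by abstracting each layout as a directed, fully connected graph over its semantic elements. The network performs neural graph matching in a triplet setting and learns graph-level embeddings for similarity prediction through intra-graph message passing and cross-graph information exchange. Beyond the similarity metric, the attention weights learned by the network can also be used to match layout elements.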
 ![](images/516817b84bdaf3db241d1a3b87d316578c8f2d9adb29bb8a247a3e00042ba1d0.jpg)  
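 Figure 1. LayoutGMN learns structural layout similarity through attention-based neural graph matching. The learned attention weights (numbers) can be used for element-level structural matching.
 # 2. Related Work
 Layout analysis and generation, structural similarity metrics, and the use of GNNs for structural modeling have all been explored. However, pixel-space metrics (such as IoU and F1) are not structural and are sensitive to position. Existing GNN- and CNN-based methods learn structural similarity on UI layouts, but they typically compute graph embeddings independently and lack cross-graph structural alignment. LayoutGMN learns graph embeddings in the context of a pair through cross-graph attention, which improves structural alignment.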
 ![](images/f0a4eb226a10834e1fc610ecbc06337c5ffae80644cf03814bb2d4bf0775005e.jpg)  
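 Figure 3. The semantic rooms of an input floorplan are abstracted as bounding boxes, from which node and edge features are built to obtain the corresponding layout graph.
 # 3. Method
 A GMN takes a pair of graphs and produces embeddings for both through attention-based cross-graph communication. LayoutGMN plugs the GMN into a triplet backbone to learn a structural similarity metric for 2D layouts (floorplans, UIs, documents, etc.).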
 ![](images/939bcda0c0c4de7dc9855979ac03e34cc2fece15e7d532d2941505334eb83594.jpg)  
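 Figure 4. LayoutGMN takes two layout graphs as input, updates node features through intra-graph message passing and cross-graph attention-based information exchange, and aggregates them into graph-level embeddings.
 ## 3.1 Layout Graph Representation
 Given a layout image of height \(H\) and width \(W\) with semantic annotations, we abstract each element as a bounding-box node, forming a directed, fully connected graph \(G_l=(V,E)\).
 Initial node features: content features are ignored, and only semantic and geometric information is used. Similar to [11,12], the semantics are encoded with a learnable embedding (128-D), and the geometry vector \(g_k=[x_k/W, y_k/H, w_k/W, h_k/H, w_k h_k/\sqrt{WH}]\) is embedded into 128-D by a two-layer MLP and concatenated with the semantic embedding to form the initial node feature.
 Initial edge features: an 8-D vector encodes the relative spatial relationship between elements, including IoU, aspect ratio, and relative orientation; see the formula definitions in the original paper.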
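The geometry vector \(g_k\) defined above can be computed directly from a bounding box. The following is a minimal sketch, not the authors' code: the function name `geometry_vector` and the sample box are hypothetical, and the text does not say whether \((x_k, y_k)\) is the box center or a corner, so the coordinates are passed through unchanged.

```python
import math

def geometry_vector(x, y, w, h, W, H):
    """Return g_k = [x/W, y/H, w/W, h/H, w*h/sqrt(W*H)] for one bounding box.

    (x, y) is the box position as given by the annotation; W and H are the
    width and height of the layout image.
    """
    return [x / W, y / H, w / W, h / H, (w * h) / math.sqrt(W * H)]

# Hypothetical box inside a 200 x 100 layout.
g = geometry_vector(x=20.0, y=10.0, w=40.0, h=30.0, W=200.0, H=100.0)
print(g)
```

In the method described above, this 5-D vector would then pass through a two-layer MLP before being concatenated with the semantic embedding.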
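 ## 3.2 Graph Matching Network
 The module consists of (1) node/edge encoders, (2) message propagation layers, and (3) an aggregator. Nodes and edges are encoded into 128-D by MLPs.
 Message propagation: while aggregating messages from neighbors within a graph, cross-graph attention computes correspondences between nodes: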
 \[ a_{pi} = \frac{\exp(s_h(h_i^{(t)}, h_p^{(t)}))}{\sum_{p'} \exp(s_h(h_i^{(t)}, h_{p'}^{(t)}))} \]
 \[ \mu_{pi} = a_{pi}\,(h_i^{(t)} - h_p^{(t)}), \quad h_i^{(t+1)} = f_{update}\big(h_i^{(t)}, \sum_j m_{ji}, \sum_p \mu_{pi}\big). \]
 Node representations are updated over several rounds of propagation.
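The cross-graph term \(\sum_p \mu_{pi}\) in the update above can be illustrated in a few lines of plain Python. This is a sketch under stated assumptions: the similarity \(s_h\) is taken to be a dot product (the text does not fix it), and the names `dot` and `cross_graph_messages` are hypothetical. The intra-graph messages \(m_{ji}\) and the learned \(f_{update}\) are omitted.

```python
import math

def dot(u, v):
    """Dot-product similarity, used here as an assumed stand-in for s_h."""
    return sum(a * b for a, b in zip(u, v))

def cross_graph_messages(nodes_a, nodes_b, sim=dot):
    """For each node i of graph A, return sum_p mu_{pi} over nodes p of graph B."""
    out = []
    for h_i in nodes_a:
        scores = [sim(h_i, h_p) for h_p in nodes_b]
        m = max(scores)  # subtract the max so the softmax is numerically stable
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        attn = [e / z for e in exps]  # a_{pi}, sums to 1 over p
        # mu_{pi} = a_{pi} * (h_i - h_p), summed over p, one entry per dimension
        msg = [sum(a * (h_i[d] - h_p[d]) for a, h_p in zip(attn, nodes_b))
               for d in range(len(h_i))]
        out.append(msg)
    return out

# Two tiny hypothetical graphs with 2-D node states.
graph_a = [[1.0, 0.0], [0.0, 1.0]]
graph_b = [[1.0, 0.0], [0.5, 0.5]]
print(cross_graph_messages(graph_a, graph_b))
```

Because the attention weights sum to one, the message equals \(h_i\) minus the attention-weighted average of the other graph's nodes, so it vanishes when node \(i\) has an exact match.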
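 Aggregator: a gated, weighted feature-aggregation MLP produces a 1024-D graph-level representation \(h_G\). A graph-level embedding is computed for each of the two graphs.
 ## 3.3 Training
 In the triplet framework, the anchor-positive and anchor-negative pairs are each passed through the GMN to obtain context-dependent graph embeddings, and the network is trained with a margin loss based on \(L_2\) distance.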
 ![](images/1e1f54d6b4c7441623fd6af31c439e83cd8f899efc5f9d2f7465ab923b69b261.jpg)  
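 Figure 5. The anchor graph is paired with the positive and negative graphs separately and passed through the GMN to obtain two pairs of 1024-D embeddings, which are trained with a margin loss.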
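A margin loss over the two embedding pairs can be sketched as follows. This is an illustration, not the paper's exact formulation: the margin value, the use of the plain (unsquared) \(L_2\) distance, and the function names are assumptions. Note that the anchor has a separate embedding in each pair, because GMN embeddings depend on the graph they are matched against.

```python
import math

def l2(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor_pos, positive, anchor_neg, negative, margin=1.0):
    """Hinge loss that pushes the anchor-positive distance below the
    anchor-negative distance by at least `margin` (assumed value).

    anchor_pos / anchor_neg are the anchor's embeddings from the
    (anchor, positive) and (anchor, negative) GMN passes respectively.
    """
    d_pos = l2(anchor_pos, positive)
    d_neg = l2(anchor_neg, negative)
    return max(0.0, d_pos - d_neg + margin)

# Hypothetical 2-D embeddings; the real ones are 1024-D.
print(triplet_margin_loss([0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))
print(triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]))
```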
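 # 4. Datasets
 The experiments use two large-scale datasets, RICO UI layouts and RPLAN floorplans, which are filtered before evaluation and analysis.
 # 5. Results and Evaluation
 We evaluate the method with retrieval precision based on human annotations (Precision @k) and a consistency measure, Overlap @k, and provide qualitative visual results. Overall, the retrieval results of LayoutGMN on both datasets agree better with human perception and outperform IoU and the other baselines; the method is also computationally efficient enough for practical use.
 ## 5.1 Baselines
 The baselines include a graph kernel, U-Net (triplet autoencoder), the IoU metric, and GCN-CNN (attention-based GCN + CNN). To keep the comparison fair, all methods use the same data and a fixed triplet sampling. The cross-graph communication in LayoutGMN yields finer-grained structural embeddings and is more robust than frameworks that encode each graph independently.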
 ![](images/817e17e26c81262c41e6cfdecb5f3145cb19873bc1193aab7bf50bb54c10308a.jpg)  
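 Figure 6. Top-5 retrieval results of the IoU metric, GCN-CNN, and LayoutGMN for the same query. The results of LayoutGMN are closer to the structure of the input query.
 ## 5.2 Evaluation Metrics
 - Precision @k: measures the relevance of the top-k retrieved results, with relevance determined by crowdsourced annotation.
 - Overlap @k: measures how much the retrieval list of a query overlaps with that of its top-1 result, reflecting the stability and consistency of retrieval.
 On the UI and floorplan data, LayoutGMN achieves the best or competitive Precision and Overlap scores overall, which indicates that it better fits the practical need for structural similarity.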
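Both metrics reduce to simple set operations on ranked lists. The sketch below follows one plausible reading of the descriptions above: Overlap @k is taken as the fraction of shared items between the top-k list of a query and the top-k list of its top-1 result. The paper's exact normalization may differ, and all names and sample IDs are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that annotators marked relevant."""
    top = retrieved[:k]
    return sum(1 for item in top if item in relevant) / k

def overlap_at_k(results_query, results_top1, k):
    """Fraction of items shared by the top-k list of the query and the
    top-k list obtained when its top-1 result is used as the query."""
    shared = set(results_query[:k]) & set(results_top1[:k])
    return len(shared) / k

# Hypothetical layout IDs.
print(precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4))
print(overlap_at_k(["a", "b", "c"], ["b", "c", "d"], k=3))
```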
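 ## 5.3 Fully Connected vs. Adjacency Graphs
 Following [30], we use fully connected graphs by default and verify that this is a reasonable design for learning structural similarity. For floorplans, which have few nodes, using adjacency graphs improves the retrieval quality of the GMN but degrades methods that encode graphs independently (such as GCN-CNN); for UIs, which have more nodes and scattered elements, adjacency graphs cannot capture adjacency reliably, and overall performance drops.
 ## 5.4 Ablation on Structural Representation
 We progressively remove edge, position, or semantic information and analyze the effect on performance:
 - Removing edges: structural information is lost and updates rely on attention alone, so retrieval quality drops.
 - Removing position: the most important structural cue is missing, and retrieval noise increases markedly.
 - Removing semantics: position alone partly recovers the structure but is still not enough for the best results.
 The complete node and edge representation gives the most reliable structural retrieval.
 ## 5.5 Attention-Based Label Transfer
 We demonstrate element-level label transfer as a natural application: given a source layout and a target layout, the cross-graph attention of a pretrained LayoutGMN serves as the element-matching signal, which enables label transfer without relying on pixel alignment. Compared with simple matching based on maximum pixel overlap, LayoutGMN is more robust to boundary differences and structural changes.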
 ![](images/ed308e04292b05893b2144d0c5147d0b580f1e468750bac4cdac2e7eddcc3460.jpg)  
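 Figure 8. Compared with maximum-pixel-overlap matching, LayoutGMN uses attention to perform element-level label transfer more accurately.
 # 6. Conclusion, Limitations, and Future Work
 We presented LayoutGMN, the first deep model to provide both a structural similarity metric and element-level structural matching. Extensive experiments on two types of layout data show that its metric matches human judgment of structural similarity more closely than known methods. The main limitations are its reliance on strong supervision (hence the IoU weak labels, which reduce the cost) and the fact that it does not learn hierarchical graph representations or hierarchical structural matching. Future work could use few-shot or active learning to construct informative and diverse triplets, or introduce hierarchical graph modeling to handle large graphs.
 # Acknowledgments
 We thank the reviewers for their valuable suggestions and the AMT crowd workers for their help. This work was supported in part by NSERC (611370) and sponsored by Adobe.
 # References
 For ease of reference, the original numbering and entries are retained; see the reference list at the end of the English version.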
--- a/reference/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper.pdf
+++ b/reference/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper.pdf
--- a/reference/images
+++ b/reference/images
@@ -1 +0,0 @@
 ./LayoutGMN/Patil_LayoutGMN_Neural_Graph_Matching_for_Structural_Layout_Similarity_CVPR_2021_paper/auto/images
--- a/reference/技术路线图：基于几何图的版图Transformer
+++ b/reference/技术路线图：基于几何图的版图Transformer
@@ -1,141 +0,0 @@
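 ### **Technical Roadmap: Geometry/Graph-Based Layout Transformer (Geo-Layout Transformer)**
 This roadmap is divided into five main phases:
 1. **Environment Setup and Tool Selection (Foundation)**
 2. **Data Preprocessing and Representation (Data Preparation & Representation)**
 3. **Model Architecture Design (Model Architecture)**
 4. **Training and Evaluation (Training & Evaluation)**
 5. **Iteration and Optimization (Iteration & Advanced Topics)**
 #### **Phase 1: Environment Setup and Tool Selection**
 This is the foundation for all later work; choosing the right tools saves a great deal of effort.
 - **Programming language**: **Python** is the de facto standard.
 - **GDS/OASIS parsing libraries**:
  - **KLayout (`klayout.db`)**: strongly recommended. It is not only a viewer; it also provides a powerful and efficient Python API for reading, writing, and processing layouts, including complex geometric operations (Boolean operations, sizing, etc.). Its region query capability is essential for extracting the data inside a patch.
  - **gdspy**: another popular and more lightweight choice, suited to creating and lightly processing GDS files, but likely less efficient than KLayout on large files and complex queries.
 - **Machine learning / deep learning frameworks**:
  - **PyTorch**: the mainstream choice, with an active community and a rich ecosystem.
  - **PyTorch Geometric (PyG)** or **Deep Graph Library (DGL)**: both are graph neural network libraries that work with PyTorch (PyG is built on it; DGL supports it as a backend), and they will be the core tools for implementing the "patch encoder". PyG is very widely used in academia.
 - **Data processing and scientific computing**:
  - **NumPy**: efficient numerical computation.
  - **Pandas**: managing and analyzing metadata.
  - **Shapely**: also useful if you need to manipulate geometric objects (polygons).
 **Action plan**:
 1. Install a Python environment (Conda is recommended for environment isolation).
 2. Install KLayout and learn its Python API (`import klayout.db as kdb`).
 3. Install PyTorch and PyG.
 ------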
 ------
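 #### **Phase 2: Data Preprocessing and Representation**
 This is the **most critical and most labor-intensive** part of the whole project. How you represent the data sets the upper bound on what the model can learn.
 1. **Define the "patch"**:
    - In the GDS/OASIS coordinate space, define a sliding window (or grid). The window size is an important hyperparameter, for example `10µm x 10µm`. Consider factors such as standard-cell height and metal line width to choose a meaningful size.
 2. **Data extraction**:
    - Write a script that traverses the whole layout (or a region of interest).
    - For each patch, use KLayout's region queries (for example `Region.interacting()` or `Cell.begin_shapes_rec_touching()`) or similar functionality to efficiently extract all geometric shapes (polygons, rectangles) that lie fully or partly inside the patch window.
    - **Core output**: for each patch, you obtain a list of geometric objects. Each object carries the information `{coordinates, layer, datatype, texttype}`.
 3. **Graph construction**:
    - **This is the core of Approach 2**. Each patch's list of geometric objects must be converted into a graph `G = (V, E)`.
    - **Define the nodes (V)**:
      - The most direct approach: each geometric shape (polygon) is a node.
      - The initial node feature vector can include:
        - **Geometric features**: centroid coordinates (x, y), width, height, area, shape compactness, etc.
        - **Layer information**: one-hot encoding of the GDS layer (e.g. M1, VIA1, M2).
        - **Other attributes**: if available, such as text labels.
    - **Define the edges (E)**:
      - The edge definition determines which spatial relationships the model can learn. Several strategies are worth trying:
        - **Proximity**: connect two shapes with an edge if their distance is below a threshold. Alternatively, a k-nearest-neighbor (KNN) graph can be used.
        - **Overlap/contact**: connect two shapes (for example a via and a metal shape) if they overlap or touch.
        - **Same-layer relations**: connect neighboring shapes within the same layer.
        - **Cross-layer relations**: between adjacent layers (such as M1 and VIA1), connect shapes that overlap spatially.
      - **Edge features**: edges can be featureless, or can carry information such as distance and overlap area.
 4. **Dataset generation**:
    - Process all GDS files and convert each patch into a graph data object (a `Data` object in PyG).
    - Associate a **label** with each patch graph. The label depends on the specific task, for example:
      - **DRC hotspot prediction**: `1` (DRC violation), `0` (no DRC violation).
      - **Manufacturability prediction**: `1` (hotspot), `0` (non-hotspot).
    - Save all processed graph data objects to files (e.g. in `.pt` format) for efficient loading later.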
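Steps 1 to 3 above can be prototyped without any layout library before wiring in KLayout and PyG. The sketch below is a simplified illustration under several assumptions: shapes are axis-aligned rectangles given as `(x0, y0, x1, y1, layer)` tuples, the grid is non-overlapping, and only the overlap/contact edge rule is implemented, regardless of layer. All function names and the sample shapes are hypothetical.

```python
import math

def patch_windows(width, height, size):
    """Yield (x0, y0, x1, y1) windows of a non-overlapping grid, row by row."""
    for iy in range(math.ceil(height / size)):
        for ix in range(math.ceil(width / size)):
            yield (ix * size, iy * size, (ix + 1) * size, (iy + 1) * size)

def touches(a, b):
    """True if two axis-aligned boxes overlap or share a boundary."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def node_features(rect, layers):
    """Centroid, width, height, area, then a one-hot layer encoding."""
    x0, y0, x1, y1, layer = rect
    w, h = x1 - x0, y1 - y0
    one_hot = [1.0 if layer == name else 0.0 for name in layers]
    return [(x0 + x1) / 2, (y0 + y1) / 2, w, h, w * h] + one_hot

def build_patch_graph(rects, window, layers):
    """Return (node feature list, directed edge list) for shapes touching the window."""
    inside = [r for r in rects if touches(r[:4], window)]
    nodes = [node_features(r, layers) for r in inside]
    edges = [(i, j)
             for i in range(len(inside))
             for j in range(len(inside))
             if i != j and touches(inside[i][:4], inside[j][:4])]
    return nodes, edges

# Hypothetical 20 x 20 layout cut into 10 x 10 patches.
layers = ["M1", "VIA1", "M2"]
rects = [(0, 0, 8, 1, "M1"), (7, 0, 8, 1, "VIA1"),
         (7, 0, 8, 6, "M2"), (15, 15, 18, 16, "M1")]
windows = list(patch_windows(20, 20, 10))
nodes, edges = build_patch_graph(rects, windows[0], layers)
print(len(windows), len(nodes), len(edges))
```

The node and edge lists map naturally onto the `x` and `edge_index` fields of a PyG `Data` object. A shape that straddles a window boundary is assigned to every window it touches, which is one simple way to keep cross-patch geometry visible to the model.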
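 **Action plan**:
 1. Decide on your target task and label source (for example, run DRC with a commercial EDA tool and export the results as labels).
 2. Write the data extraction and patch partitioning scripts with KLayout.
 3. Design and implement the algorithm that converts a list of geometric objects into a PyG graph object.
 4. Process your dataset to produce training, validation, and test sets containing thousands of graph samples or more.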
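 #### **Phase 3: Model Architecture Design**
 The model has two main parts: the **patch encoder** and the **global Transformer**.
 1. **Patch encoder**:
    - **Goal**: encode each patch graph `G` into a fixed-length vector `h_patch`.
    - **Architecture**: a **graph neural network (GNN)**. Common choices are:
      - **GCN (Graph Convolutional Network)**: classic and simple.
      - **GraphSAGE**: learns node representations by aggregating neighbor information and generalizes better to unseen graphs.
      - **GAT (Graph Attention Network)**: introduces an attention mechanism that assigns different weights to different neighbors, giving it more expressive power.
    - **Implementation**: the GNN runs several rounds of message passing and updates over the nodes of the patch graph to obtain the final node embeddings. A **global readout function**, such as `global_mean_pool` or `global_add_pool`, then aggregates all node embeddings into the embedding vector `h_patch` of the whole graph.
 2. **Global Transformer**:
    - **Input**: a sequence of all patch embeddings: `[h_patch_1, h_patch_2, ..., h_patch_N]`.
    - **Positional embedding**: **essential**. Because the Transformer itself is not order-aware, you must tell the model the original spatial position of each patch, for example with a 2D absolute or relative positional encoding added to the `h_patch` vectors.
    - **Architecture**:
      - A standard **Transformer encoder**, made of stacked layers of multi-head self-attention and feed-forward networks.
      - Self-attention lets the model learn global dependencies between patches. For example, the model can learn how a long metal wire spans several patches, or the repeating pattern of a standard-cell array.
    - **Classification head**:
      - Attach one or more fully connected layers to the Transformer output sequence.
      - You can classify from the output of a special `[CLS]` token, or average-pool the outputs of all patches before classifying.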
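Two pieces of the architecture above are easy to get wrong and worth prototyping in isolation: the readout that turns node embeddings into `h_patch`, and the 2D positional encoding added before the Transformer. The sketch below uses a fixed sinusoidal encoding of the patch's grid row and column, which is only one of several possible absolute encodings; the roadmap does not prescribe it. All function names are hypothetical, and the learned GNN and Transformer layers are omitted.

```python
import math

def mean_readout(node_embeddings):
    """Average node embeddings into one patch vector (a mean-pool readout)."""
    n = len(node_embeddings)
    dim = len(node_embeddings[0])
    return [sum(h[d] for h in node_embeddings) / n for d in range(dim)]

def sincos_1d(pos, dim):
    """Sinusoidal encoding of a scalar position into `dim` values (dim even)."""
    out = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        out.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return out

def positional_encoding_2d(row, col, dim):
    """Half of the channels encode the row, the other half the column."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    return sincos_1d(row, dim // 2) + sincos_1d(col, dim // 2)

def add_position(h_patch, row, col):
    """Add the 2D encoding of the patch's grid position to its embedding."""
    pe = positional_encoding_2d(row, col, len(h_patch))
    return [h + p for h, p in zip(h_patch, pe)]

# Hypothetical patch with two 8-D node embeddings, located at grid cell (2, 3).
h_patch = mean_readout([[0.0] * 8, [2.0] * 8])
print(add_position(h_patch, row=2, col=3))
```

The resulting vectors, one per patch, form the input sequence for `nn.TransformerEncoder`. A learned embedding table indexed by row and column would be an equally valid alternative to the fixed sinusoids.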
 **Action plan**:
 1. Build a GNN model with PyG as the patch encoder.
 2. Build the global Transformer with PyTorch's built-in `nn.TransformerEncoder` module.
 3. Chain the two together to form the complete Geo-Layout Transformer model.
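 #### **Phase 4: Training and Evaluation**
 This is the phase where you validate the idea.
 1. **Loss function**:
    - For binary classification tasks (such as DRC hotspot prediction), use **binary cross-entropy loss**.
    - If the classes are imbalanced (for example, DRC hotspots are very rare), consider **weighted cross-entropy** or **focal loss**.
 2. **Optimizer**:
    - **Adam** or **AdamW** is a common, robust choice.
 3. **Training procedure**:
    - Write a standard training loop: forward pass -> compute loss -> backward pass -> update weights.
    - Use the validation set to monitor model performance, guard against overfitting, and tune hyperparameters (such as the learning rate, the number of GNN layers, and the number of Transformer heads).
 4. **Evaluation metrics**:
    - **Accuracy**: useful when the classes are balanced.
    - **Precision**, **recall**, **F1-score**: more important when the classes are imbalanced.
    - **AUC-ROC (Area Under the ROC Curve)**: a common measure of the model's overall classification ability.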
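The imbalance-aware loss and metrics named above are short enough to write out by hand, which helps when checking a framework implementation. This is a per-sample sketch of binary focal loss and of precision, recall, and F1 from hard predictions. The default `gamma` and `alpha` are commonly used values rather than recommendations from this roadmap, and the function names are hypothetical.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one sample; p is the predicted probability of class 1.

    With gamma = 0 this reduces to alpha-weighted binary cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# A confidently correct hotspot contributes far less loss than a missed one.
print(binary_focal_loss(0.9, 1), binary_focal_loss(0.1, 1))
print(precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
```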
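 **Action plan**:
 1. Write the training script, implementing data loading, model training, and validation.
 2. Run experiments and tune hyperparameters to find the best model.
 3. Evaluate the final model on a held-out test set and analyze the results.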
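 #### **Phase 5: Iteration and Optimization**
 Once the base model works, there are several directions to explore in depth.
 - **Multi-scale patching**: use patches of different sizes together so that the model captures features at different scales.
 - **Hierarchical representation**: if the GDS file has a hierarchy (cells, instances), design a model that exploits it instead of flattening everything.
 - **Self-supervised learning**: layout data is abundant but labels are scarce. Self-supervised tasks (such as predicting masked patches or predicting the relative position of patches) can be used to pretrain the model before fine-tuning on downstream tasks. This may improve performance substantially.
 - **Model interpretability**: use methods such as attention visualization to analyze which regions and geometric features the model attends to when making decisions; this is valuable for understanding model behavior and feeding insights back into the design flow.
 Please research this idea in more depth and find relevant literature to expand it and support its feasibility. Note the following: this is a pure back-end tool for chip design and manufacturing and must not touch the front end, that is, it must not use the netlist; a GNN must encode the geometric shapes of the patches cut from the layout, and that encoding is the input to the Transformer, so this is not a standalone Transformer model; the goal of the model is to understand the layout, enabling a range of functions such as layout connectivity verification, layout matching, and hotspot search.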