fix: earlier development work on sample/plate
313
docs/architecture/03-genotyping-data-flow.md
Normal file
@@ -0,0 +1,313 @@
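# Genotyping Module Data Flow and Table Relationships

This document describes the data entry order of the Genotyping module, the relationships among its core tables, and the mapping between Java entity names and the actual database table names.
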
## Summary

The main data lines of the Genotyping module are:

```text
Core/Pheno upstream data -> sample / plate
ReferenceSet -> Reference -> ReferenceBases
ReferenceSet + Study -> VariantSet -> Variant
Sample -> CallSet
CallSet + Variant -> Call
GenomeMap -> LinkageGroup -> MarkerPosition -> Variant
```

An order closer to how the data is actually entered is:

```text
1. Core/Phenotyping upstream data first: crop, program, trial, study, observation_unit
2. Enter Plate and Sample
3. Enter ReferenceSet, Reference, ReferenceBases
4. Enter VariantSet
5. Enter Variant
6. Enter CallSet
7. Enter Call, i.e. the genotype results in the allele_call table
8. Enter GenomeMap, LinkageGroup, MarkerPosition
```

The Genotyping-related initialization scripts run in this order:

```text
R__init_data_21_samples.sql
R__init_data_22_references.sql
R__init_data_23_variant_set_1.sql
R__init_data_24_genome_maps.sql
src/main/resources/db/sql/variant_set_4/variant_set_4.sql
src/main/resources/db/sql/variant_set_4/variant_set_4_alleles.sql
```

## Entities and Actual Table Names

| Business concept | Java entity | Database table | Description |
| --- | --- | --- | --- |
| Call | `CallEntity` | `allele_call` | Genotype result of a single sample at a given variant |
| CallSet | `CallSetEntity` | `callset` | A set of calls for one sample, usually the collection of genotype calls for that sample |
| Sample | `SampleEntity` | `sample` | Submitted/sequenced sample |
| Plate | `PlateEntity` | `plate` | Sample plate holding multiple samples |
| MarkerPosition | `MarkerPositionEntity` | `marker_position` | Position of a variant on a linkage group |
| Variant | `VariantEntity` | `variant` | Variant site, e.g. SNP/Indel |
| ReferenceSet | `ReferenceSetEntity` | `reference_set` | Reference genome set |
| GenomeMap | `GenomeMapEntity` | `genome_map` | Genetic map |
| VariantSet | `VariantSetEntity` | `variantset` | A set of variants |
| Reference | `ReferenceEntity` | `reference` | Reference sequence, e.g. chromosome/contig |
| ReferenceBases | `ReferenceBasesPageEntity` | `reference_bases` | Paged sequence of a reference |
| LinkageGroup | `LinkageGroupEntity` | `linkageGroup` | Linkage group within a map; note the camelCase table name `linkageGroup` |

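
Several of these table names are not what a mechanical entity-to-table naming rule would produce. The following Python sketch makes that visible. It copies the entity/table pairs from the table above and compares each real table name with a naive guess. The helpers `naive_table_name` and `surprising_tables` are hypothetical illustrations, not part of the project, and the "strip `Entity`, then snake_case" rule is only an assumed convention.

```python
import re

# Entity -> table pairs copied from the mapping table in this document.
ENTITY_TABLE = {
    "CallEntity": "allele_call",
    "CallSetEntity": "callset",
    "SampleEntity": "sample",
    "PlateEntity": "plate",
    "MarkerPositionEntity": "marker_position",
    "VariantEntity": "variant",
    "ReferenceSetEntity": "reference_set",
    "GenomeMapEntity": "genome_map",
    "VariantSetEntity": "variantset",
    "ReferenceEntity": "reference",
    "ReferenceBasesPageEntity": "reference_bases",
    "LinkageGroupEntity": "linkageGroup",
}


def naive_table_name(entity_name):
    """Guess a table name: drop the 'Entity' suffix, then snake_case the rest.

    This rule is an assumption made for illustration only.
    """
    stem = entity_name.removesuffix("Entity")  # requires Python 3.9+
    return re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()


def surprising_tables(entity_table):
    """Return {entity: (naive_guess, real_table)} where the two differ."""
    return {
        entity: (naive_table_name(entity), table)
        for entity, table in entity_table.items()
        if naive_table_name(entity) != table
    }


if __name__ == "__main__":
    for entity, (guess, real) in surprising_tables(ENTITY_TABLE).items():
        print(f"{entity}: guessed {guess!r}, actual {real!r}")
```

Under this assumed rule, the entities flagged are exactly the ones that need to be looked up rather than guessed when writing SQL by hand.
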
## Core Tables

| Table | Purpose | Main upstream dependencies | Main downstream tables |
| --- | --- | --- | --- |
| `plate` | Sample plate | `program`, `trial`, `study`; optionally a vendor submission | `sample` |
| `sample` | Sample | `plate`, `observation_unit`, `program`, `trial`, `study` | `callset` |
| `reference_set` | Reference genome set | Optionally `germplasm` | `reference`, `variantset`, `variant` |
| `reference` | Reference sequence | `reference_set` | `reference_bases` |
| `reference_bases` | Reference sequence fragments/pages | `reference` | None |
| `variantset` | Variant set | `reference_set`, `study` | `variant`, `callset_variant_sets`, `variantset_analysis`, `variantset_format` |
| `variant` | Variant site | `reference_set`, `variantset` | `allele_call`, `marker_position` |
| `callset` | Set of calls for a sample | `sample` | `allele_call`, `callset_variant_sets` |
| `allele_call` | Genotype call result | `callset`, `variant` | None |
| `genome_map` | Genetic map | `crop`; may also link to `study` | `linkageGroup` |
| `linkageGroup` | Linkage group | `genome_map` | `marker_position` |
| `marker_position` | Position of a marker/variant on a map | `linkageGroup`, `variant` | None |

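
The upstream column of this table is effectively a dependency graph, so a valid load order can be derived from it rather than memorized. The following is a minimal Python sketch using the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The dictionary transcribes only the non-optional upstream dependencies listed above; the optional links (`germplasm`, vendor submission, `genome_map` to `study`) are left out as a simplifying assumption. `DOCUMENTED_ORDER`, `load_order`, and `is_valid_order` are hypothetical names introduced for illustration.

```python
from graphlib import TopologicalSorter

# table -> set of tables that must be loaded before it (optional links omitted).
DEPENDS_ON = {
    "plate": {"program", "trial", "study"},
    "sample": {"plate", "observation_unit", "program", "trial", "study"},
    "reference": {"reference_set"},
    "reference_bases": {"reference"},
    "variantset": {"reference_set", "study"},
    "variant": {"reference_set", "variantset"},
    "callset": {"sample"},
    "allele_call": {"callset", "variant"},
    "genome_map": {"crop"},
    "linkageGroup": {"genome_map"},
    "marker_position": {"linkageGroup", "variant"},
}

# The entry order described in this document, flattened to table names.
DOCUMENTED_ORDER = [
    "crop", "program", "trial", "study", "observation_unit",
    "plate", "sample",
    "reference_set", "reference", "reference_bases",
    "variantset", "variant",
    "callset", "allele_call",
    "genome_map", "linkageGroup", "marker_position",
]


def load_order(depends_on):
    """Return one valid order in which every table follows its upstreams."""
    return list(TopologicalSorter(depends_on).static_order())


def is_valid_order(order, depends_on):
    """Check that each table appears after everything it depends on."""
    position = {table: index for index, table in enumerate(order)}
    return all(
        position[parent] < position[table]
        for table, parents in depends_on.items()
        for parent in parents
    )


if __name__ == "__main__":
    print(load_order(DEPENDS_ON))
    print(is_valid_order(DOCUMENTED_ORDER, DEPENDS_ON))
```

A topological sort can return any order that respects the dependencies, so its output need not match the documented sequence line by line. The useful check is `is_valid_order`, which can be run against any proposed script order.
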
## Recommended Entry Order

### 1. Prepare Core/Phenotyping upstream data

Genotyping data usually hangs off Core and Phenotyping data.

Required or commonly used upstream tables include:

```text
crop
program
trial
study
observation_unit
```

`sample` can link to `observation_unit`, and it also carries redundant links to `program`, `trial`, and `study` to support querying and filtering.

### 2. Enter Plate

Enter `plate` first; it represents a sample plate.

`plate` can link to:

```text
program
trial
study
plate_submission
```

Samples that are not plated can be entered directly into `sample`; the current model nonetheless supports attaching a sample to a plate.

### 3. Enter Sample

Enter `sample`, the sample entry point of the genotyping flow.

Main relationships:

```text
sample -> plate
sample -> observation_unit
sample -> program / trial / study
sample -> germplasm_taxon
```

### 4. Enter ReferenceSet and Reference

Enter `reference_set`, which represents a reference genome set.

Then enter `reference`, which represents a concrete reference sequence such as a chromosome or contig.

If the actual sequence fragments need to be stored, also enter:

```text
reference_bases
```

### 5. Enter VariantSet

Enter `variantset`, which groups a batch of variants into a set.

Main relationships:

```text
variantset -> reference_set
variantset -> study
```

Auxiliary tables include:

```text
variantset_analysis
variantset_format
variantset_additional_info
variantset_external_references
```

### 6. Enter Variant

Enter `variant`, which represents a concrete variant site.

Main relationships:

```text
variant -> reference_set
variant -> variantset
```

Auxiliary tables include:

```text
variant_entity_alternate_bases
variant_entity_ciend
variant_entity_cipos
variant_entity_filters_failed
```

### 7. Enter CallSet

Enter `callset`, which represents the set of genotype calls for one sample.

Main relationships:

```text
callset -> sample
callset_variant_sets -> variantset
```

`callset_variant_sets` is the many-to-many join table between `callset` and `variantset`.

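
To illustrate how the join table is traversed, here is a small Python sketch using the standard-library `sqlite3` module with an in-memory database. The join columns `call_sets_id` and `variant_sets_id` follow the ER diagram in this document. The `id` primary-key columns, the sample rows, and the helper names are assumptions made for the example, and the real schema has many more columns.

```python
import sqlite3


def open_callset_demo():
    """Create a stripped-down in-memory copy of the three tables involved."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE callset (id TEXT PRIMARY KEY);
        CREATE TABLE variantset (id TEXT PRIMARY KEY);
        CREATE TABLE callset_variant_sets (
            call_sets_id    TEXT NOT NULL REFERENCES callset(id),
            variant_sets_id TEXT NOT NULL REFERENCES variantset(id)
        );
        INSERT INTO callset VALUES ('cs1'), ('cs2');
        INSERT INTO variantset VALUES ('vs1'), ('vs2');
        INSERT INTO callset_variant_sets VALUES
            ('cs1', 'vs1'), ('cs1', 'vs2'), ('cs2', 'vs1');
        """
    )
    return conn


def variantsets_for_callset(conn, callset_id):
    """Return the ids of all variant sets linked to one call set."""
    rows = conn.execute(
        """
        SELECT vs.id
        FROM variantset AS vs
        JOIN callset_variant_sets AS link ON link.variant_sets_id = vs.id
        WHERE link.call_sets_id = ?
        ORDER BY vs.id
        """,
        (callset_id,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    demo = open_callset_demo()
    print(variantsets_for_callset(demo, "cs1"))
```

The same join read in the other direction (filtering on `variant_sets_id`) lists the call sets that belong to one variant set.
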
### 8. Enter Call

Enter `allele_call`, which is the Call in business terms.

It is the final genotype call result. Its core relationships are:

```text
allele_call -> callset
allele_call -> variant
```

In other words, one call records the genotype, read depth, likelihood, and similar results of a given sample/callset at a given variant.

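
Because every call points at both a call set and a variant, both parents have to exist before the call row can be written. The Python sketch below demonstrates this with `sqlite3` and foreign-key enforcement switched on. The foreign-key columns `call_set_id` and `variant_id` follow this document's ER diagram. The `id`, `genotype`, and `read_depth` columns and the helper names are hypothetical and stand in for whatever the real schema defines.

```python
import sqlite3

# Minimal stand-in schema; only the two foreign keys come from the document.
SCHEMA = """
CREATE TABLE callset (id TEXT PRIMARY KEY);
CREATE TABLE variant (id TEXT PRIMARY KEY);
CREATE TABLE allele_call (
    id          TEXT PRIMARY KEY,
    call_set_id TEXT NOT NULL REFERENCES callset(id),
    variant_id  TEXT NOT NULL REFERENCES variant(id),
    genotype    TEXT,      -- hypothetical column
    read_depth  INTEGER    -- hypothetical column
);
"""


def open_call_demo():
    """Open an in-memory database that enforces foreign keys."""
    conn = sqlite3.connect(":memory:")
    # SQLite only checks foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert_call(conn, call_id, call_set_id, variant_id, genotype):
    """Insert one call; return False if a parent row is missing."""
    try:
        conn.execute(
            "INSERT INTO allele_call (id, call_set_id, variant_id, genotype) "
            "VALUES (?, ?, ?, ?)",
            (call_id, call_set_id, variant_id, genotype),
        )
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    demo = open_call_demo()
    print(try_insert_call(demo, "c1", "cs1", "v1", "0/1"))  # parents missing
    demo.execute("INSERT INTO callset VALUES ('cs1')")
    demo.execute("INSERT INTO variant VALUES ('v1')")
    print(try_insert_call(demo, "c1", "cs1", "v1", "0/1"))  # parents present
```

This is why `allele_call` comes after both `callset` and `variant` in the entry order.
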
### 9. Enter GenomeMap and MarkerPosition

If genetic map positions are needed, enter:

```text
genome_map -> linkageGroup -> marker_position -> variant
```

`marker_position` is what places a variant at a specific position on a linkage group.

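
As a rough illustration of that chain, the Python sketch below resolves a variant to its map, linkage group, and position with a two-step join in `sqlite3`. The columns `genome_map_id`, `linkage_group_id`, and `variant_id` follow this document's ER diagram. The `id` keys, the `pos` column, the demo rows, and the helper names are assumptions for the example only.

```python
import sqlite3


def open_map_demo():
    """Create a minimal in-memory version of the map-side tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE genome_map (id TEXT PRIMARY KEY);
        CREATE TABLE "linkageGroup" (
            id            TEXT PRIMARY KEY,
            genome_map_id TEXT NOT NULL REFERENCES genome_map(id)
        );
        CREATE TABLE variant (id TEXT PRIMARY KEY);
        CREATE TABLE marker_position (
            id               TEXT PRIMARY KEY,
            linkage_group_id TEXT NOT NULL REFERENCES "linkageGroup"(id),
            variant_id       TEXT NOT NULL REFERENCES variant(id),
            pos              REAL   -- hypothetical position column
        );
        INSERT INTO genome_map VALUES ('map1');
        INSERT INTO "linkageGroup" VALUES ('lg1', 'map1');
        INSERT INTO variant VALUES ('v1'), ('v2');
        INSERT INTO marker_position VALUES ('mp1', 'lg1', 'v1', 12.5);
        """
    )
    return conn


def locate_variant(conn, variant_id):
    """Return (map id, linkage group id, position) rows for one variant."""
    return conn.execute(
        """
        SELECT gm.id, lg.id, mp.pos
        FROM marker_position AS mp
        JOIN "linkageGroup" AS lg ON lg.id = mp.linkage_group_id
        JOIN genome_map AS gm ON gm.id = lg.genome_map_id
        WHERE mp.variant_id = ?
        ORDER BY gm.id, lg.id
        """,
        (variant_id,),
    ).fetchall()


if __name__ == "__main__":
    demo = open_map_demo()
    print(locate_variant(demo, "v1"))
    print(locate_variant(demo, "v2"))  # a variant with no map position
```

A variant without any `marker_position` row simply yields no result, which matches the document's framing of map positioning as an optional step.
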
## Genotyping Data Flow Diagram

```mermaid
flowchart TD
    C["Core: crop"] --> GM["genome_map (genetic map)"]
    C --> P["Core: program"]
    P --> T["Core: trial"]
    T --> ST["Core: study"]
    ST --> PL["plate (sample plate)"]
    ST --> VS["variantset (variant set)"]
    ST --> SM["sample"]

    OU["Pheno: observation_unit"] --> SM
    PL --> SM

    GER["Germplasm (optional)"] --> RS["reference_set (reference set)"]
    RS --> R["reference (reference sequence)"]
    R --> RB["reference_bases (reference sequence pages)"]

    RS --> VS
    VS --> V["variant (variant site)"]
    RS --> V

    SM --> CS["callset (sample call set)"]
    CS --> CSV["callset_variant_sets"]
    VS --> CSV

    CS --> CALL["allele_call / Call (genotype result)"]
    V --> CALL

    GM --> LG["linkageGroup (linkage group)"]
    LG --> MP["marker_position (map position)"]
    V --> MP

    VS --> VSA["variantset_analysis"]
    VS --> VSF["variantset_format"]
```

## Genotyping ER Diagram

```mermaid
erDiagram
    program ||--o{ plate : "program_id"
    trial ||--o{ plate : "trial_id"
    study ||--o{ plate : "study_id"

    plate ||--o{ sample : "plate_id"
    observation_unit ||--o{ sample : "observation_unit_id"
    program ||--o{ sample : "program_id"
    trial ||--o{ sample : "trial_id"
    study ||--o{ sample : "study_id"

    germplasm ||--o{ reference_set : "source_germplasm_id"
    reference_set ||--o{ reference : "reference_set_id"
    reference ||--o{ reference_bases : "reference_id"

    reference_set ||--o{ variantset : "reference_set_id"
    study ||--o{ variantset : "study_id"
    variantset ||--o{ variant : "variant_set_id"
    reference_set ||--o{ variant : "reference_set_id"

    sample ||--o{ callset : "sample_id"
    callset ||--o{ callset_variant_sets : "call_sets_id"
    variantset ||--o{ callset_variant_sets : "variant_sets_id"

    callset ||--o{ allele_call : "call_set_id"
    variant ||--o{ allele_call : "variant_id"

    crop ||--o{ genome_map : "crop_id"
    genome_map ||--o{ linkageGroup : "genome_map_id"
    linkageGroup ||--o{ marker_position : "linkage_group_id"
    variant ||--o{ marker_position : "variant_id"

    variantset ||--o{ variantset_analysis : "variant_set_id"
    variantset ||--o{ variantset_format : "variant_set_id"
```

## API to Table Mapping

| API | Main table | Description |
| --- | --- | --- |
| `/brapi/v2/samples` | `sample` | Query, create, and update samples |
| `/brapi/v2/plates` | `plate` | Query, create, and update sample plates |
| `/brapi/v2/callsets` | `callset` | Sample call sets |
| `/brapi/v2/calls` | `allele_call` | Genotype call results |
| `/brapi/v2/variants` | `variant` | Variant sites |
| `/brapi/v2/variantsets` | `variantset` | Variant sets |
| `/brapi/v2/referencesets` | `reference_set` | Reference genome sets |
| `/brapi/v2/references` | `reference` | Reference sequences |
| `/brapi/v2/maps` | `genome_map` | Genetic maps |
| `/brapi/v2/markerpositions` | `marker_position` | Positions of variants/markers on a map |

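## Key Points

1. The table for `CallEntity` is `allele_call`, not `call`.
2. `CallSetEntity` maps to `callset`, not `call_set`.
3. `VariantSetEntity` maps to `variantset`, not `variant_set`.
4. `LinkageGroupEntity` maps to the table `linkageGroup`; pay particular attention to case wherever the schema references it through foreign keys.
5. `sample` is the sample entry point of the genotyping flow and links upward to `plate`, `observation_unit`, `study`, `trial`, and `program`.
6. `variant` defines a site, while `allele_call` holds a sample's result at that site; do not treat the two as the same layer of data.
7. `reference_set`, `reference`, and `reference_bases` form the reference genome side; `variantset`, `variant`, `callset`, and `allele_call` form the variant and result side.
8. `genome_map`, `linkageGroup`, and `marker_position` form the genetic map positioning side; `marker_position` connects to variant sites through `variant_id`.
9. As in the previous two documents, `*_additional_info` and `*_external_references` are generic extension relations that hold supplementary business fields and external references.
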
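
The case-sensitivity warning about `linkageGroup` matters most when SQL is written by hand. This document does not say which database engine is used, so the following Python sketch only assumes an engine that, like PostgreSQL, folds unquoted identifiers to lowercase and therefore needs mixed-case names double-quoted. `quote_identifier` is a hypothetical helper for illustration, and it deliberately ignores reserved words.

```python
def quote_identifier(name):
    """Double-quote a table name unless it is plain lowercase snake_case.

    Assumes standard SQL double-quote escaping; reserved words are not handled.
    """
    is_plain = (
        name.islower()
        and name.replace("_", "").isalnum()
        and not name[0].isdigit()
    )
    if is_plain:
        return name
    # Embedded double quotes are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


if __name__ == "__main__":
    for table in ("allele_call", "variantset", "linkageGroup"):
        print(f"SELECT COUNT(*) FROM {quote_identifier(table)};")
```

Under that assumption, only `linkageGroup` among the tables listed in this document would need quoting; every other table name is already lowercase.
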