Algorithms for Computational Genetic Epidemiology

Jingwu He

Algorithms for Computational Genetic Epidemiology - Doctoral Dissertation

This doctoral dissertation explores computational algorithms for genetic epidemiology. It proposes methods for speeding up phasing, tools for SNP tagging, and methods for predicting genetic disease susceptibility.

University

Georgia State University

Major

Genetic Epidemiology

Author

Jingwu He

Type

Dissertation

Year of publication

2006

Pages

147


ABSTRACT

INDEX WORDS

DEDICATION

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

Road Map and Contributions

1. BIOLOGY BACKGROUND: SNPS, HAPLOTYPES, GENOTYPES, AND NOTATIONS

2. HAPLOTYPE INFERENCE PROBLEM

2.1. Population Haplotype Inference Problem

2.1.1. Previous Work and Problem Formulation

2.1.2. Linear Dependence of Sites, Haplotypes and Genotypes

2.1.3. Implementation of Linear Reduction Based on Matrix Multiplication

2.1.4. Fixing Caveats in Linear Reduction Approach

2.2. Phasing and Missing data recovery in Family Trios

2.2.1. Previous Work and Problem Formulation

2.2.2. Pure-Parsimony Trio Phasing

2.2.3. Integer Linear Program for Trio Phasing

2.2.4. Greedy Method for Trio Phasing

3. INFORMATIVE SNP SELECTION

3.1. Linear Algebraic Method

3.1.1. Linear Algebraic Tagging

3.1.2. Linear Algebraic Tagging with Prescribed Number of Tags

3.1.3. Tag SNP Selection and SNP Prediction Problems

3.1.4. Multiple Linear Regression SNP Prediction Method

3.1.4.1. Introduction to Multiple Linear Regression

3.1.4.2. The MLR SNP Prediction Algorithm

3.1.4.3. Running Time of MLR SNP prediction and Tag Selection

3.1.5. MLR-tagging Software

3.1.6. Support Vector Machine SNP Prediction Method

3.1.6.1. SVM Haplotype Tagging

3.1.6.2. SVM-tagging Software

3.1.7. Application of Tagging to Disease Association Search

3.1.7.1. Multi-SNP to Disease Association

3.1.7.2. Searching Methods for Disease Association

4. DISEASE SUSCEPTIBILITY PREDICTION

4.1. Measures of Prediction Quality and Cross-validation Methods

4.2. Reduction to Set Covering Problem

4.3. Set Covering Greedy Algorithm

4.4. Prediction Algorithms for Disease Susceptibility

4.4.1. Graph-based Prediction Methods

5. CONCLUSION AND FUTURE WORK

5.1. Unbiased Estimates of MLR Tagging

5.2. Protein substrate prediction

5.3. Simulation of behavior of bacterial cells under specific growth conditions

I. Computational Algorithms for Genetic Epidemiology

Genetic epidemiology studies the relationship between genetic variants and disease. Computational algorithms play a key role in analyzing large-scale genetic data. The research focuses on three main problems: inferring haplotypes from genotypes, selecting representative SNPs (tag SNPs), and predicting disease susceptibility.

Physical methods for separating haplotypes from a genotype are expensive. Computational methods offer a cost-effective alternative. However, their high error rates still degrade the accuracy of association analysis. The single nucleotide polymorphism (SNP) is the most common form of genetic variation.

High-throughput genotyping technologies produce enormous amounts of data. Selecting informative SNPs is important for compacting that data. Tag SNPs represent other SNPs through linkage disequilibrium, which significantly reduces genotyping cost.

The dissertation applies linear algebra, graph theory, linear programming, and greedy methods. Its contributions include speeding up phasing tools, developing state-of-the-art tagging tools, and a graph-based method for predicting disease susceptibility.

1.1. Biological Background: SNPs and Haplotypes

A SNP is a single-nucleotide variation in a DNA sequence. Each SNP site can have two or more different alleles, although most SNPs are biallelic. Allele frequency reflects how often each variant occurs in a population.

A haplotype is the combination of alleles on a single chromosome. A genotype is the pair of haplotypes from two homologous chromosomes, observed without knowing which allele lies on which chromosome. Phasing is the process of determining haplotypes from genotypes. Hardy-Weinberg equilibrium describes allele and genotype frequencies that remain stable across generations in an idealized population.

1.2. Challenges in Genetic Data Analysis

Genotype data is inherently ambiguous. The two haplotypes cannot be distinguished directly from a genotype. Genotype imputation infers missing genotypes from reference data.
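The ambiguity of a genotype can be made concrete with a small sketch. It assumes a common encoding that is not spelled out in the text above: haplotypes are tuples of 0/1 alleles, and a genotype site is 0 or 1 when homozygous and 2 when heterozygous. The function names are illustrative only.

```python
from itertools import product

def genotype_of(h1, h2):
    """Combine two 0/1 haplotypes: equal alleles keep their value, different alleles become 2."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def explanations(genotype):
    """Enumerate all unordered haplotype pairs that produce the given genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    # Each heterozygous site can be resolved in two ways
    for bits in product((0, 1), repeat=len(het)):
        h1 = list(genotype)
        h2 = list(genotype)
        for i, b in zip(het, bits):
            h1[i] = b
            h2[i] = 1 - b
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

g = (0, 2, 1, 2, 2)          # three heterozygous sites
print(len(explanations(g)))  # prints 4, i.e. 2**(3 - 1) unordered pairs
```

A genotype with k heterozygous sites has 2^(k-1) candidate explanations for k >= 1, which is why phasing needs extra information such as population patterns or family data.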

Population stratification confounds genome-wide association studies (GWAS). Different subpopulations have their own genetic structure, and statistical methods must adjust for this stratification. Linkage disequilibrium measures the non-random association between SNPs.
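As an illustration of how linkage disequilibrium is quantified, the sketch below computes the standard coefficient D and the squared correlation r² for two biallelic sites from phased haplotypes. The 0/1 allele encoding and the function name `ld_stats` are assumptions made for this example.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r2) for sites i and j of a list of 0/1 haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at site j
    p_ab = sum(1 for h in haplotypes if h[i] == 1 and h[j] == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2

perfect = [(0, 0), (0, 0), (1, 1), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_stats(perfect, 0, 1))      # (0.25, 1.0)
print(ld_stats(independent, 0, 1))  # (0.0, 0.0)
```

An r² of 1 means one site fully determines the other, which is the situation tagging exploits.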

1.3. Main Research Objectives

The research aims to improve the accuracy of haplotype inference, develop more effective tag SNP selection methods, and build tools for predicting susceptibility to complex diseases.

Accurate haplotype estimation is essential for genetic association analysis. Tagging reduces the number of SNPs that need to be genotyped. QTL (quantitative trait loci) analysis identifies genomic regions that influence quantitative traits. Computational methods must balance accuracy against efficiency.

II. The Problem of Inferring Haplotypes from Genotypes

Haplotype inference is an important step in genetic association analysis. A genotype carries no information about the phase of its alleles. The two haplotypes on homologous chromosomes together form the observed genotype.

Physical methods such as molecular cloning are expensive and time-consuming. Computational algorithms provide an alternative. They rely on the parsimony principle or on statistical models.

Population haplotype inference differs from inference on family data. Population data contains only genotypes of unrelated individuals. Trio data comprises two parents and a child and therefore supplies Mendelian constraints.

Common methods include the EM algorithm, perfect phylogeny, and maximum parsimony. Linear dependence among sites helps reduce the complexity of the problem. Matrix multiplication speeds up the computation.

2.1. Formulating the Population Phasing Problem

Given a set of genotypes, find a set of haplotypes that explains them. Each genotype must be formed by a pair of haplotypes from the set. The parsimony principle seeks the smallest such set.

Pure parsimony haplotyping is NP-hard. Heuristic methods provide approximate solutions. Clark's algorithm uses a greedy strategy. PHASE applies a Bayesian model, and fastPHASE a related haplotype-cluster model. Linkage disequilibrium supplies information for statistical inference.
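The greedy idea behind Clark's algorithm can be sketched as follows. This is a simplified illustration, not the dissertation's code or a faithful reimplementation of any published tool: it assumes the 0/1/2 genotype encoding (2 = heterozygous), seeds the haplotype pool from unambiguous genotypes, and then repeatedly explains an unresolved genotype with a known haplotype plus its complement. The result depends on processing order.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to form `genotype`, or None if incompatible."""
    other = []
    for g, h in zip(genotype, hap):
        if g == 2:
            other.append(1 - h)
        elif g == h:
            other.append(h)
        else:
            return None
    return tuple(other)

def clark_phase(genotypes):
    """Greedy phasing sketch; returns ({index: (h1, h2)}, set_of_haplotypes)."""
    known, phased = set(), {}
    # Seed: genotypes with at most one heterozygous site have a unique explanation
    for idx, g in enumerate(genotypes):
        if sum(1 for x in g if x == 2) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = complement(g, h1)
            known.update((h1, h2))
            phased[idx] = (h1, h2)
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in phased:
                continue
            for h in sorted(known):          # sorted() copies, so adding to `known` is safe
                mate = complement(g, h)
                if mate is not None:
                    phased[idx] = (h, mate)
                    known.add(mate)
                    progress = True
                    break
    return phased, known

phased, haps = clark_phase([(0, 0, 0), (2, 2, 0), (0, 2, 2)])
print(phased[1])  # ((0, 0, 0), (1, 1, 0))
```

Genotypes that no known haplotype can explain stay unresolved in this sketch, which is one of the well-known weaknesses of the greedy rule.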

2.2. Linear Dependence of Sites and Haplotypes

Some sites can be predicted from other sites. Linear dependence reduces the dimension of the problem, and removing dependent sites speeds up the computation.

Matrix multiplication identifies the linearly independent sites. The method is based on linear algebra over a finite field. Gaussian elimination finds a basis of the vector space. The time complexity is O(m²n) for m sites and n genotypes.
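A minimal sketch of finding linearly independent sites by Gaussian elimination over GF(2) is given below. It works on a 0/1 haplotype matrix (rows are haplotypes, columns are sites) and packs each column into an integer so that addition over GF(2) becomes XOR. The function name and the choice to operate on haplotypes rather than genotypes are assumptions for illustration; the dissertation's linear reduction is not reproduced here.

```python
def independent_sites(haplotypes):
    """Return indices of sites (columns) that are linearly independent over GF(2)."""
    n_sites = len(haplotypes[0])
    basis = {}    # leading-bit position -> reduced column vector
    chosen = []
    for j in range(n_sites):
        # Pack column j into an integer, one bit per haplotype
        v = 0
        for i, h in enumerate(haplotypes):
            v |= (h[j] & 1) << i
        # Reduce against the current basis; a nonzero remainder is a new basis vector
        while v:
            lead = v.bit_length() - 1
            if lead not in basis:
                basis[lead] = v
                chosen.append(j)
                break
            v ^= basis[lead]
    return chosen

haps = [(1, 0, 1, 0),
        (0, 1, 1, 0),
        (1, 1, 0, 0)]
print(independent_sites(haps))  # [0, 1]: site 2 is the XOR of sites 0 and 1, site 3 is all zeros
```

Only the chosen sites need to be passed to a phasing tool; the dependent sites can be restored afterwards from the linear relations.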

2.3. Trio Phasing and Missing Data Recovery

Trio data provides strong Mendelian constraints. The child receives one allele from each parent at every site. This constraint significantly reduces ambiguity.
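The Mendelian constraint can be applied site by site, as in the sketch below. It assumes the 0/1/2 genotype encoding (2 = heterozygous) and returns the child's paternal and maternal alleles when they are determined. The function names are hypothetical, and missing data is not handled.

```python
def phase_child_site(father, mother, child):
    """Return (paternal_allele, maternal_allele) for the child at one site.

    Returns None when the site stays ambiguous (all three heterozygous) and
    raises ValueError on a Mendelian inconsistency.
    """
    def can_transmit(parent, allele):
        return parent == 2 or parent == allele

    if child in (0, 1):
        if not (can_transmit(father, child) and can_transmit(mother, child)):
            raise ValueError("Mendelian inconsistency")
        return (child, child)
    # Heterozygous child: try both assignments of alleles to parents
    options = [(p, 1 - p) for p in (0, 1)
               if can_transmit(father, p) and can_transmit(mother, 1 - p)]
    if not options:
        raise ValueError("Mendelian inconsistency")
    return options[0] if len(options) == 1 else None

def phase_child(father, mother, child):
    """Apply the per-site rule along whole genotypes."""
    return [phase_child_site(f, m, c) for f, m, c in zip(father, mother, child)]

print(phase_child((0, 2, 2), (1, 0, 2), (2, 2, 2)))  # [(0, 1), (1, 0), None]
```

Only sites where father, mother, and child are all heterozygous remain unresolved, which is far fewer than in unrelated-population data.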

Pure-parsimony trio phasing seeks the minimum number of haplotypes. Integer linear programming (ILP) models the problem exactly, while the greedy method provides a faster solution. Missing data is common in real datasets. Genotype imputation recovers missing genotypes based on linkage disequilibrium.

III. Informative SNP Selection (Tag SNP Selection)

Tag SNP selection reduces the number of SNPs that must be genotyped. SNPs in high linkage disequilibrium are strongly correlated, so a small set of tag SNPs can predict the others.

The linear algebraic method exploits linear dependence. Multiple linear regression (MLR) predicts SNPs from tag SNPs. A support vector machine (SVM) provides nonlinear prediction.

Tagging reduces genotyping cost in GWAS. A genome-wide association study scans the entire genome, where millions of SNPs would otherwise need to be genotyped. Tag SNPs reduce that number to tens of thousands.

The accuracy of SNP prediction affects the results of genetic association analysis. Haplotype estimation improves tagging accuracy. Methods must balance the number of tag SNPs against accuracy.

3.1. Linear Algebraic Method for Tagging

Linear algebraic tagging identifies linearly independent SNPs. The remaining SNPs are expressed as linear combinations of them. Gaussian elimination finds a minimal basis.

The method operates over the finite field GF(2), with each allele encoded as 0 or 1. The matrix rank determines the minimum number of tag SNPs. Tagging with a prescribed number of tags allows the number of tag SNPs to be adjusted.

3.2. SNP Prediction by Linear Regression

Multiple linear regression (MLR) models linear relationships. Each SNP is predicted from a linear combination of tag SNPs. Least squares estimation finds the optimal regression coefficients.

MLR SNP prediction has complexity O(nt²) for n SNPs and t tag SNPs. The method is faster than nonlinear algorithms. The MLR-tagging software implements the algorithm efficiently. Cross-validation assesses prediction accuracy.
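The least squares step can be sketched in plain Python by solving the normal equations. This is an illustrative sketch, not the MLR-tagging software: the function names, the intercept term, and the rule of rounding the fitted value at 0.5 to obtain a 0/1 allele are all assumptions made for the example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: tag SNPs are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_mlr(tags, target):
    """Least squares coefficients (intercept first) for target ~ tags."""
    x = [[1.0] + [float(v) for v in row] for row in tags]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, target)) for i in range(k)]
    return solve(xtx, xty)

def predict_snp(coef, tag_values):
    """Round the fitted value to the nearest allele (assumed decision rule)."""
    score = coef[0] + sum(c * v for c, v in zip(coef[1:], tag_values))
    return 1 if score >= 0.5 else 0

# Training haplotypes restricted to two tag SNPs; the target SNP copies tag 0
tags = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 0, 1, 1]
coef = fit_mlr(tags, target)
print(predict_snp(coef, (1, 0)), predict_snp(coef, (0, 1)))  # 1 0
```

Collinear tag SNPs make the normal equations singular, so a practical tool would need to drop redundant tags or regularize before solving.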

3.3. SNP Prediction by Support Vector Machine

An SVM finds the optimal separating hyperplane. The kernel trick enables nonlinear classification. SVM haplotype tagging handles more complex data.

The SVM-tagging software provides a user-friendly interface. The method is appropriate when relationships are nonlinear. Allele frequency affects prediction performance: rare alleles are harder to predict than common ones.

IV. Application of Tagging to Disease Association Analysis

Genetic association analysis searches for SNPs associated with disease. A genome-wide association study (GWAS) scans the entire genome. Tag SNPs reduce cost while maintaining coverage.

Multi-SNP association considers several SNPs simultaneously. Interactions between SNPs affect disease risk, and single-SNP analysis ignores these interactions.

Population stratification causes false positives because it creates differences in allele frequency between groups. Principal component analysis adjusts for population structure.

QTL analysis identifies genomic regions that influence quantitative traits. Linkage disequilibrium mapping exploits LD patterns. Haplotype-based association can be more powerful than single-SNP association. Fine mapping narrows the region containing the causal variant.

4.1. Multi-SNP Association Methods

Multi-SNP analysis considers combinations of several SNPs. Epistasis is the interaction between different genes. Logistic regression models disease risk.

Random forests handle complex interactions, and neural networks learn nonlinear patterns. These methods need large samples to avoid overfitting. A Hardy-Weinberg equilibrium test checks data quality.
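The Hardy-Weinberg quality check mentioned above is commonly done with a chi-square goodness-of-fit test on genotype counts. The sketch below shows that standard textbook form; the function name and the example counts are made up for illustration.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with Hardy-Weinberg expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)    # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi_square(25, 50, 25))  # 0.0, counts match p^2 : 2pq : q^2 exactly
print(hwe_chi_square(50, 0, 50))   # 100.0, complete lack of heterozygotes
```

The statistic is compared against a chi-square distribution with one degree of freedom (critical value 3.841 at the 0.05 level); SNPs that fail are often flagged as possible genotyping errors.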

4.2. Search Strategies for Disease Association

A two-stage design reduces genotyping cost. Stage one scans with tag SNPs. Stage two genotypes the regions of interest in detail.

Bonferroni correction adjusts for multiple testing. The false discovery rate (FDR) controls the proportion of false discoveries. Permutation tests assess statistical significance. Replication studies confirm results in an independent population.
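The two multiple-testing corrections can be compared on a handful of p-values. The sketch below implements the Bonferroni rule and the Benjamini-Hochberg step-up procedure for FDR control; the function names and the sample p-values are illustrative assumptions.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test when its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= alpha * k / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

p = [0.001, 0.012, 0.03, 0.2]
print(bonferroni(p))          # [True, True, False, False]
print(benjamini_hochberg(p))  # [True, True, True, False]
```

Bonferroni controls the chance of any false positive and is therefore stricter; FDR control tolerates a bounded fraction of false discoveries and usually retains more candidate SNPs.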

4.3. Handling Population Stratification

Genomic control estimates an inflation factor. Principal component analysis detects population structure. Structured association tests adjust for stratification.
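The inflation factor used by genomic control is usually estimated as the median of the observed one-degree-of-freedom chi-square statistics divided by the theoretical median of that distribution (about 0.4549). The sketch below follows that convention and, as a further assumption, never lets the factor drop below 1; the function and constant names are hypothetical.

```python
from statistics import median

CHI2_1DF_MEDIAN = 0.4549  # approximate median of the chi-square distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) for a list of 1-df chi-square statistics."""
    lam = max(1.0, median(chi2_stats) / CHI2_1DF_MEDIAN)  # floor at 1: never inflate results
    return lam, [s / lam for s in chi2_stats]

lam, corrected = genomic_control([0.2, 0.9098, 3.0])
print(round(lam, 6))  # 2.0, so every statistic is halved
```

Dividing all statistics by the same factor assumes that stratification inflates every SNP equally, which is the main simplification of this approach.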

Family-based association avoids population confounding. The transmission disequilibrium test (TDT) uses trio data. Allele frequencies differ between ethnic groups. Admixture mapping detects ancestry-specific regions.
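The TDT statistic has a simple closed form: with b transmissions and c non-transmissions of an allele from heterozygous parents to affected children, the statistic is (b - c)² / (b + c). The sketch below assumes a hypothetical input format of (parent genotype, transmitted allele) pairs with the 0/1/2 genotype encoding.

```python
def count_transmissions(parent_child_pairs):
    """Count transmissions of allele 1 (b) and allele 0 (c) from heterozygous parents.

    Each item is (parent_genotype, transmitted_allele); only parents with
    genotype 2 (heterozygous) are informative.
    """
    b = sum(1 for g, a in parent_child_pairs if g == 2 and a == 1)
    c = sum(1 for g, a in parent_child_pairs if g == 2 and a == 0)
    return b, c

def tdt_statistic(transmitted, untransmitted):
    """McNemar-type statistic (b - c)^2 / (b + c), chi-square with 1 df under the null."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total

pairs = [(2, 1)] * 60 + [(2, 0)] * 40 + [(1, 1)] * 30  # homozygous parents are ignored
b, c = count_transmissions(pairs)
print(tdt_statistic(b, c))  # 4.0
```

Because each heterozygous parent serves as its own control, the test is not distorted by allele frequency differences between subpopulations.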

V. Predicting Susceptibility to Complex Diseases

Complex diseases are caused by multiple genes together with environmental factors. Predicting disease susceptibility from genotypes is a major challenge, and traditional statistical methods have limited effectiveness.

Graph-based methods model gene interactions. Each genotype is a vertex in a graph, edges connect similar genotypes, and classification relies on the graph structure.

The set covering problem seeks a set of representative genotypes. A greedy algorithm provides an approximate solution in polynomial time.

Cross-validation evaluates prediction performance. Leave-one-out cross-validation suits small samples. K-fold cross-validation balances bias and variance. Sensitivity and specificity measure classification quality.

5.1. Measures of Prediction Quality

Sensitivity is the proportion of diseased individuals correctly identified. Specificity is the proportion of healthy individuals correctly classified. The ROC curve visualizes the trade-off between the two.

The area under the curve (AUC) summarizes performance. Positive predictive value depends on prevalence. Negative predictive value is important for screening. Overall accuracy can be misleading on imbalanced data.
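These measures all follow from the four cells of a classification contingency table. The sketch below computes them for a made-up imbalanced example with 20 cases and 200 controls; the function name and the counts are illustrative only.

```python
def prediction_quality(tp, fp, tn, fn):
    """Standard quality measures from true/false positive and negative counts."""
    return {
        "sensitivity": tp / (tp + fn),   # diseased individuals correctly identified
        "specificity": tn / (tn + fp),   # healthy individuals correctly classified
        "ppv": tp / (tp + fp),           # predicted cases that are real cases
        "npv": tn / (tn + fn),           # predicted controls that are real controls
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# 20 cases (5 detected) and 200 controls (190 correctly cleared)
q = prediction_quality(tp=5, fp=10, tn=190, fn=15)
print(round(q["accuracy"], 3), q["sensitivity"])  # 0.886 0.25
```

The example shows why accuracy alone misleads on imbalanced data: almost 89% of predictions are correct even though three quarters of the cases are missed.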

5.2. Reduction to the Set Covering Problem

Set covering seeks a collection of subsets that covers all elements. Each disease genotype must be covered by a representative genotype. The problem is NP-hard.

The greedy algorithm repeatedly picks the genotype that covers the most uncovered elements. Its approximation ratio is logarithmic. The method is fast and effective in practice. Linear programming relaxation gives a lower bound.
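The greedy rule for set covering can be sketched generically. In the sketch below the universe stands for the items to be covered and each named subset for what one candidate representative covers; the names and the toy data are assumptions, and the mapping from genotypes to subsets used in the dissertation is not reproduced.

```python
def greedy_set_cover(universe, subsets):
    """Repeatedly pick the subset that covers the most still-uncovered elements.

    `subsets` maps a name to a set of elements; returns the chosen names in order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gain = subsets[best] & uncovered
        if not gain:
            raise ValueError("universe cannot be covered by the given subsets")
        chosen.append(best)
        uncovered -= gain
    return chosen

subsets = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 6}}
print(greedy_set_cover(range(1, 7), subsets))  # ['A', 'C']
```

Each iteration scans all subsets once, so the running time is polynomial, and the number of chosen subsets is within a logarithmic factor of the optimum.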

5.3. Graph-Based Methods

Graph-based prediction builds a similarity graph. Hamming distance measures the difference between genotypes. K-nearest neighbors classifies by neighborhood.
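A nearest-neighbor classifier on Hamming distance is the simplest instance of this idea. The sketch below is a generic illustration under the 0/1/2 genotype encoding, with hypothetical function names and toy data; it is not the specific prediction method evaluated in the dissertation.

```python
from collections import Counter

def hamming(g1, g2):
    """Number of sites at which two genotypes differ."""
    return sum(a != b for a, b in zip(g1, g2))

def knn_predict(train, query, k=3):
    """Majority label among the k training genotypes closest to `query`.

    `train` is a list of (genotype, label) pairs.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 1, 2, 0), "case"), ((0, 1, 2, 1), "case"),
         ((2, 2, 0, 0), "control"), ((2, 0, 0, 0), "control"),
         ((2, 2, 0, 1), "control")]
print(knn_predict(train, (0, 1, 2, 2)))  # case
print(knn_predict(train, (2, 2, 0, 0)))  # control
```

Treating every site equally is a simplification; weighting sites or edges, for example by linkage disequilibrium, is one way to refine the similarity measure.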

Spectral clustering detects communities. Graph kernels measure structural similarity. Random walks propagate labels along edges. Linkage disequilibrium provides edge weights.

VI. Conclusion and Future Research Directions

The dissertation develops algorithms for computational genetic epidemiology. Its main contributions are speeding up phasing, improving tagging, and predicting disease susceptibility.

The linear reduction method speeds up popular phasing tools without compromising the quality of their results. MLR-tagging and SVM-tagging provide accurate SNP prediction.

Graph-based methods are effective for complex diseases, and they model gene interactions naturally. The set covering greedy algorithm is fast and accurate.

Next-generation sequencing technologies produce enormous amounts of data. Whole genome sequencing is replacing genotyping arrays. Rare variant association requires new methods. Deep learning is promising for the analysis of complex data.

6.1. Unbiased Estimates for MLR Tagging

Current MLR tagging can be biased on small samples. Bootstrap resampling estimates the sampling distribution. Bias correction improves accuracy.

Nested cross-validation avoids overfitting. Regularization such as ridge regression stabilizes the estimates. Allele frequency weighting balances rare and common variants. Haplotype-based MLR exploits linkage disequilibrium.

6.2. Protein Substrate Prediction

Kinase phosphorylation sites are predicted from sequence. Sequence motifs characterize substrates. Machine learning classifies functional sites.

Structural information improves prediction. Protein-protein interaction networks provide context. Evolutionary conservation points to important sites. Experimental validation confirms computational predictions.

6.3. Simulating the Behavior of Bacterial Cells

Metabolic network modeling simulates growth. Flux balance analysis predicts phenotypes. Constraint-based methods do not require kinetic parameters.

Gene regulatory networks control gene expression. Boolean networks model control logic. Stochastic simulation handles biological noise. Single-cell sequencing provides validation data.

24/03/2026


Excerpt from the dissertation


ALGORITHMS FOR COMPUTATIONAL GENETIC EPIDEMIOLOGY

by Jingwu He

Under the Direction of Alex Zelikovsky

ABSTRACT

The most intriguing problems in genetic epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that the physical methods for separating two haplotypes from an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to deteriorate association accuracy.

Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) accurately representing the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies drastically increase the length of available SNP sequences. This elevates the importance of informative SNP selection for compaction of huge genetic data in order to make fine genotype analysis feasible.

Finally, even if complete and accurate data is available, it is unclear if common statistical methods can determine the susceptibility of complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.

INDEX WORDS: Tagging, Phasing, Haplotype, Genotype, SNP, Disease association, Susceptibility prediction

A Dissertation Submitted in Partial Fulfillment of Requirements for the Degree of Doctor of Philosophy in the College of Arts and Sciences, Georgia State University, 2006

UMI Number: 3243235

Copyright by Jingwu He 2006

Major Professor: Alex Zelikovsky

Committee: Yi Pan, Anu Bourgeois, Ion Mandoiu

Electronic Version Approved: Office of Graduate Studies, College of Arts and Sciences, Georgia State University, December 2006

DEDICATION

To my dear daughter, Jennifer, my wife, Jun and my parents

ACKNOWLEDGMENTS

First, I would like to thank my advisor, Dr. Alexander Zelikovsky for advising and guide for my Ph. Secondly, I want to thank my dissertation committee members, Dr. Yi Pan, Dr. Anu Bourgeois and Dr. I also appreciate support and assistance from our research group: Dumitru Brinza, Kelly Westbrooks, Weidong Mao and Nisar Hundewale. Finally, I want to thank my family and friends for their support and beliefs.

LIST OF TABLES

3.1 The comparison of the running times of DPPH and Linearly Reduced DPPH. Each value is averaged over 100 datasets. E and D is the CPU time for encoding and decoding and RD is DPPH runtime for the reduced instance.

.2 The comparison of the running times of PHASE and Linearly Reduced PHASE. Each value is averaged over 25 datasets.

.3 The comparison of the quality of haplotyping of Linearly Reduced PHASE (LRP) and PHASE (P) vs the original haplotypes (O). Here the difference in haplotype data sets, Hapset1/Hapset2 is the arithmetic mean of numbers of false-positive and false-negative haplotypes over the number of haplotypes Hapset2 times 100%. Each value is averaged over 25 datasets.

.4 The comparison of the quality of haplotyping of Linearly Reduced PHASE (LRP) and PHASE (P) vs the original haplotypes (O). Here the difference in haplotype data sets, Hapset1/Hapset2 is the arithmetic mean of numbers of false-positive and false-negative haplotypes over the number of haplotypes Hapset2 times 100%. Each value is averaged over feasible graphs among 25 datasets.

.5 The comparison of the running times of HAPLOTYPER and Linearly Reduced HAPLOTYPER. Each value is averaged over 25 datasets.

.6 The comparison of the quality of haplotyping of Linearly Reduced HAPLOTYPER (LRH) and HAPLOTYPER (H) vs the original haplotypes (O). Here the difference in haplotype data sets, Hapset1/Hapset2 is the arithmetic mean of numbers of false-positive and false-negative haplotypes over the number of haplotypes Hapset2 times 100%. Each value is averaged over feasible graphs among 25 datasets.

.7 The comparison of the quality of haplotyping of Linearly Reduced HAPLOTYPER (LRH) and HAPLOTYPER (H) vs the original haplotypes (O). Here the difference in haplotype data sets, Hapset1/Hapset2 is the arithmetic mean of numbers of false-positive and false-negative haplotypes over the number of haplotypes Hapset2 times 100%. Each value is averaged over feasible graphs among 25 datasets.

.8 The comparison of the running times on real data.

.9 The comparison of Linearly Reduced HAPLOTYPER (LRH), HAPLOTYPER (H), Linearly Reduced PHASE (LRP), PHASE (P), and original haplotypes (O) on biological data.

.10 The results for three phasing methods on the real data sets [26, 32, 54] and simulated data set. Error% is the percent sites where (best choice of) paternal and maternal haplotypes disagree with the offspring genotype. D % is the Hamming distance between the phased haplotypes and the closest feasible haplotypes.

.11 The comparison of the running times, number of variables, number of constraints of three linear programs. Each value is averaged over all blocks. All phasing block sizes are uniform.

.12 The results for five phasing methods on the real data sets of Daly et al. [26], Gabriel et al. [32] and on simulated data. The second column corresponds to the ratio of erased data. The C corresponds to the logical error of child. The P corresponds to the logical error of parents. The T corresponds to the total logical error.

.13 The results for five phasing methods on the simulated data sets. The column E represents the percent of erased data. The C corresponds to the true error of child. The P corresponds to the true error of parents. The T corresponds to the true total error.

.14 The results for missing data recovery on the real and simulated data sets with five methods. The second column corresponds to the ratio of erased data. The C* corresponds to the error of child. The P* corresponds to the error of parents. The T* corresponds to the total error.

.1 The quality of SNP prediction from the given number of tags (5% to 15% of the total number of SNPs). The prediction quality is measured by the prediction accuracy and the average and minimum R². Total number of SNPs in each dataset is in the parenthesis.

.2 Number of tags used by MLR-tagging, STAMPA and LR to achieve 80% and 90% prediction accuracy in leave-one-out tests.

.3 The comparison of MLR’s and STAMPA’s prediction accuracy and running time by using the number of tags (2, 5, 10, 15, 20, 25) on region ENr123 (A) and ENm010 (B) from 2 population: Han Chinese (HCB) and Japanese (JRT). Total number of SNPs in each dataset is in the parenthesis.

.4 The quality of MLR/STA on Daly et al. [26] data with two different tagging objectives over different number of tag SNPs.

.5 The number of tag SNPs for statistical covering of all SNPs required by three methods: MLR/STA with prediction objective, MLR/STA with statistical covering objective, and IdSelect [16].

.6 Leave-one-out tests are performed on 3 real haplotype datasets. The minimum number of tag SNPs needed to reach from 80% to 99% prediction accuracy is listed. The bold numbers indicate cases when the SVM/STA needs fewer tags than the MLR method of He et al. [45] for reaching same prediction accuracy.

.7 The comparison of our proposed SVM/STA method and the MLR method of He et al. [45] over different number of tag SNPs.

.8 Comparison of four methods for searching disease-associated multi-SNPs combinations.

.1 Classification contingency table

.2 The comparison of the prediction rates of 6 prediction methods for Crohn’s Disease (Daly et al.) [26] and autoimmune disorder (Ueda et al. Genotype data are phased by 4 methods. GERBIL [37] and PHASE [87] are statistical tools for haplotype reconstruction. For Crohn’s Disease, GERBIL feasible and PHASE feasible find the respective closest feasible haplotypes of the trio data.

.3 The comparison of the prediction rates of two prediction methods (Second Neighbor and Haplotype Weighting) on Daly et al. [26] phased by GERBIL [37] and GERBIL Feasible. We report bootstrapping rates, i.e., the 5th worst rate out of 100 runs (95% confidence) and different bootstrapping rates, averaged over 100 random choices of 20 case and 200 control genotypes.

LIST OF FIGURES

1.1 DNA, gene, chromosome, genome

.1 An example of Haplotype Inference Problem

.2 2SNP Phasing Algorithm

.3 A graph representation of Haplotype Inference Problem

.4 The Decoding Algorithm.

.5 (a) The reduced haplotype graph with 3 vertices.

.6 Resolve child’s haplotypes

.1 Problem formulation of Informative SNP Selection

.2 Simulated data with 25000 sites and haplotype population 1000. The total number of errors in % to the total number of SNPs depending on the size of the sample population for the three algorithms LR, RLR, RLRP and 3RLRP.

.3 The dataset of 158 haplotypes with 103 SNPs from [26]. The total number of errors in % to the total number of SNPs depending on the size of the sample population for the three algorithms LR, RLR, RLRP and 3RLRP.

.4 The dataset of 158 haplotypes with 103 SNPs from [26]. The total number of errors in % to the total number of SNPs depending on the number of the tags for algorithms RLRP and 3RLRP.

.5 Simulated data with 25000 sites and different sizes of haplotype population. The total number of errors in % to the total number of SNPs depending on the size of the sample population for the different population sizes (p = 300, 500, 1000, 2000).

.6 The x-axis shows the number of zeros in each column of R of the haplotype matrix and the y-axis shows reconstruction error rate for each column in the sample using the RLRP method.

.7 (A) The total number of errors as a percentage of the total number of SNPs depending on the size of the sample population for the three algorithms LRP, RLRP, and SLT on Chromosome 5q31. (B) The total number of errors as a percentage of total number of SNPs depending on the size of the sample population and the percentage of missing data for the SLT method on Chromosome 5q31.

.8 The x-axis shows the number of tag SNPs, and the y-axis shows the fraction of SNPs correctly imputed in a leave-one-out experiment.


Related keywords

Genetic epidemiology algorithms, Haplotype and SNP analysis, Haplotype phase estimation, Informative SNP selection, Disease susceptibility prediction, SNP-disease association

Research topics

Computational methods for genetic epidemiology, Large-scale genomic data analysis, Discovery of gene-disease associations, Prediction of genetic disease risk
