Human Datasets








Catalog of protein-coding and/or ncRNA-coding human genes.


Description.


This dataset covers human genes that are protein-coding and/or ncRNA-coding, subject to the following restrictions:


They must be listed by HGNC (version GRCh38.p7). Basic HGNC information about the genes was retrieved from the HGNC website as a tab-delimited CSV file, which serves as the first input file.


Using the official HGNC gene symbol, the corresponding FASTA and GenBank files must both be available from the NCBI website.


The GenBank file content must pass validation:

At least one Ensembl protein or ncRNA entry must be present (referenced with the HGNC ID).

For protein-coding genes, the reading frame, the presence of an ATG start codon, the expected stop codon, etc. must be validated.
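The protein-coding checks named above (reading frame, ATG, expected stop codon) can be sketched as follows. This is a minimal illustration in Python, not the proprietary Perl code used to build the dataset; the function name validate_cds is hypothetical, the standard genetic code is assumed for the stop codons, and the premature-stop check is an assumption about what "etc." might include.

```python
# Stop codons of the standard genetic code (assumption: mitochondrial
# genes, which use a different code, are not handled here).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def validate_cds(cds: str) -> list[str]:
    """Return the problems found in a coding sequence; an empty list means valid.

    Hypothetical helper: the checks mirror the ones named in the text
    (frame, ATG, expected stop codon).
    """
    seq = cds.upper()
    problems = []

    # Frame check: a complete coding sequence is a whole number of codons.
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")

    # Start codon check.
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    # Expected stop codon at the very end.
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")

    # Assumed extra check: no stop codon before the last codon.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")

    return problems
```

A gene would be kept only when the returned list is empty; otherwise the listed problems could feed a rejection reason.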


Some numbers.


The initial list of genes (source: HGNC) contained 39942 genes.


Of these, 22024 were validated (protein-coding: 19052; ncRNA-coding: 2972) and 17918 were rejected.


Products corresponding to these genes: 46618, built from 216999 coding parts of exons (Codex).


Gene synonyms for these validated genes: 111699.


Listed below are the main outputs generated by our proprietary Perl code, which handles the whole process, including liftOver (to capture hg19 positions) and blastn (used to identify sequence homologies that could presumably bias NGS data):


A folder containing all validated GenBank and FASTA files


and several tab-delimited CSV files:


Genes table

Codex table (Codex stands for coding part of exons, as used in the final product)

Extended Codex table (Codex plus 10 nucleotides before and after)

Products table

Gene synonyms table

BED files (hg38 and hg19) for Genes, Codex and Extended Codex

Rejected genes table (states why each gene was rejected)
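As an illustration of how an Extended Codex interval and its BED line relate to a Codex interval, here is a minimal Python sketch. It is not the package's Perl code. It assumes that Codex positions are stored as 1-based inclusive coordinates (as in GenBank features) and relies on the BED convention of 0-based, half-open intervals; the function names and the example gene name are hypothetical.

```python
FLANK = 10  # nucleotides added on each side, as stated for the Extended Codex table


def extend_codex(start: int, end: int, flank: int = FLANK) -> tuple[int, int]:
    """Extend a 1-based inclusive interval by `flank` on both sides.

    The start is clamped at 1 so the interval never runs off the
    beginning of the chromosome (clamping at the chromosome end would
    need the chromosome length, which this sketch does not have).
    """
    return max(1, start - flank), end + flank


def to_bed_line(chrom: str, start: int, end: int, name: str) -> str:
    """Format a 1-based inclusive interval as a 4-column BED line.

    BED uses a 0-based start and an exclusive end, so only the start
    is shifted by one.
    """
    return f"{chrom}\t{start - 1}\t{end}\t{name}"


# Example with made-up coordinates and a made-up name.
ext_start, ext_end = extend_codex(100, 200)
print(to_bed_line("chr1", ext_start, ext_end, "GENE1_codex1_ext"))
```

The same to_bed_line helper would serve for the Genes and Codex BED files; only the intervals passed in differ.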


Remark: our package creates additional files (e.g. RUN.LOG, raw BLAST output, etc.) that are not described here.


Usages.


This dataset is the primary data source for our Human Genetic Diseases software package (and is part of it). To download this dataset click here.
