gimi9 Pandas Profiling

Dataset statistics

Number of variables	3
Number of observations	10000
Missing cells	2734
Missing cells (%)	9.1%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	312.5 KiB
Average record size in memory	32.0 B
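
The derived figures in this table follow from the raw counts. The sketch below recomputes them, assuming that "Missing cells (%)" is taken over all cells (variables × observations) and that 1 KiB is 1024 bytes. The variable names are illustrative and do not come from the report.

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 3
n_observations = 10_000
missing_cells = 2_734
total_size_kib = 312.5

# Share of missing cells over every cell in the table.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells

# Average bytes per record, assuming 1 KiB = 1024 bytes.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells: {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Both values round to the ones reported above (9.1% and 32.0 B).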

Variable types

Text	3

Dataset

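Description	This dataset covers DNA barcodes, a form of biological genetic information: their definition, research trends abroad and in Korea, and why DNA barcoding is needed.
Author	National Institute of Biological Resources, Ministry of Environment (환경부 국립생물자원관)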
URL	https://www.data.go.kr/data/15067608/fileData.do

Alerts

`국명` has 2734 (27.3%) missing values	Missing
`유전정보아이디` has unique values	Unique

Reproduction

Analysis started	2023-12-12 09:19:47.544366
Analysis finished	2023-12-12 09:19:48.382278
Duration	0.84 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json

유전정보아이디
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	10
Median length	10
Mean length	10
Min length	10

Characters and Unicode

Total characters	100000
Distinct characters	13
Distinct categories	2
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
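
As a minimal sketch of what "character properties" means here, Python's standard `unicodedata` module can report the general category of each code point. The helper name `category_counts` is an assumption made for illustration. It is not part of ydata-profiling, which may compute these tables differently.

```python
import unicodedata
from collections import Counter


def category_counts(text):
    """Count characters by Unicode general category code (e.g. 'Lu', 'Nd')."""
    return Counter(unicodedata.category(ch) for ch in text)


sample_id = "WBN0403419"  # first sample row of this column
counts = category_counts(sample_id)

# 'Lu' is Uppercase Letter, 'Nd' is Decimal Number.
for code, n in sorted(counts.items()):
    print(code, n)
```

Each ID in this column has three uppercase letters and seven digits, which matches the 30% / 70% split in the category table below.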

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	WBN0403419
2nd row	WBN0362518
3rd row	WBN0369629
4th row	WBN0339461
5th row	WBN0364430

Value	Count	Frequency (%)
wbn0403419	1	< 0.1%
wbn0378672	1	< 0.1%
wbn0355176	1	< 0.1%
wbn0351627	1	< 0.1%
wbn0377113	1	< 0.1%
wbn0388020	1	< 0.1%
wbn0386675	1	< 0.1%
wbn0338612	1	< 0.1%
wbn0401216	1	< 0.1%
wbn0369089	1	< 0.1%
Other values (9990)	9990	99.9%

Most occurring characters

Value	Count	Frequency (%)
0	14637	14.6%
3	13740	13.7%
W	10000	10.0%
B	10000	10.0%
N	10000	10.0%
4	5864	5.9%
6	5588	5.6%
9	5557	5.6%
7	5556	5.6%
8	5447	5.4%
Other values (3)	13611	13.6%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	70000	70.0%
Uppercase Letter	30000	30.0%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	14637	20.9%
3	13740	19.6%
4	5864	8.4%
6	5588	8.0%
9	5557	7.9%
7	5556	7.9%
8	5447	7.8%
5	5403	7.7%
2	4157	5.9%
1	4051	5.8%

Uppercase Letter

Value	Count	Frequency (%)
W	10000	33.3%
B	10000	33.3%
N	10000	33.3%

Most occurring scripts

Value	Count	Frequency (%)
Common	70000	70.0%
Latin	30000	30.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	14637	20.9%
3	13740	19.6%
4	5864	8.4%
6	5588	8.0%
9	5557	7.9%
7	5556	7.9%
8	5447	7.8%
5	5403	7.7%
2	4157	5.9%
1	4051	5.8%

Latin

Value	Count	Frequency (%)
W	10000	33.3%
B	10000	33.3%
N	10000	33.3%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	100000	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	14637	14.6%
3	13740	13.7%
W	10000	10.0%
B	10000	10.0%
N	10000	10.0%
4	5864	5.9%
6	5588	5.6%
9	5557	5.6%
7	5556	5.6%
8	5447	5.4%
Other values (3)	13611	13.6%

학명
Text

Distinct	4776
Distinct (%)	47.8%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	125
Median length	75
Mean length	32.2949
Min length	5

Characters and Unicode

Total characters	322949
Distinct characters	76
Distinct categories	10
Distinct scripts	2
Distinct blocks	3

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	3072
Unique (%)	30.7%
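
"Distinct" (4776) and "Unique" (3072) measure different things. The sketch below assumes that "Distinct" is the number of different values and "Unique" is the number of values that occur exactly once. The list is made up for illustration and is not the real column.

```python
from collections import Counter

# Illustrative values only; one name is repeated on purpose.
values = [
    "Impatiens L.",
    "Impatiens L.",
    "Agelena limbata Thorell, 1897",
    "Forsythia ovata Nakai",
]

counts = Counter(values)

n_distinct = len(counts)                              # different values
n_unique = sum(1 for c in counts.values() if c == 1)  # values seen exactly once

print(n_distinct, n_unique)
```

Under that reading, 4776 − 3072 = 1704 scientific names appear more than once in this column.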

Sample

1st row	Micropsalliota pleurocystidiata Heinem. & Little Flower 1983
2nd row	Agelena limbata Thorell, 1897
3rd row	Impatiens L.
4th row	Chrysosplenium japonicum (Maxim.) Makino
5th row	Chlorostoma lischkei Tapparone Canefri, 1874

Value	Count	Frequency (%)
	1126	2.6%
l	1068	2.5%
et	540	1.3%
al	537	1.3%
ex	450	1.1%
nakai	362	0.8%
japonica	314	0.7%
a	301	0.7%
var	271	0.6%
h	270	0.6%
Other values (8284)	37618	87.8%

Most occurring characters

Value	Count	Frequency (%)
	32857	10.2%
a	28913	9.0%
i	22305	6.9%
e	18941	5.9%
s	15777	4.9%
o	15594	4.8%
r	15195	4.7%
n	14672	4.5%
l	13148	4.1%
u	12720	3.9%
Other values (66)	132827	41.1%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	223530	69.2%
Space Separator	32857	10.2%
Uppercase Letter	26693	8.3%
Decimal Number	18996	5.9%
Other Punctuation	13649	4.2%
Open Punctuation	3483	1.1%
Close Punctuation	3483	1.1%
Dash Punctuation	221	0.1%
Math Symbol	21	< 0.1%
Final Punctuation	16	< 0.1%

Most frequent character per category

Uppercase Letter

Value	Count	Frequency (%)
L	2551	9.6%
C	2516	9.4%
S	2445	9.2%
M	2110	7.9%
A	1766	6.6%
P	1724	6.5%
H	1503	5.6%
B	1340	5.0%
T	1305	4.9%
K	1191	4.5%
Other values (17)	8242	30.9%

Lowercase Letter

Value	Count	Frequency (%)
a	28913	12.9%
i	22305	10.0%
e	18941	8.5%
s	15777	7.1%
o	15594	7.0%
r	15195	6.8%
n	14672	6.6%
l	13148	5.9%
u	12720	5.7%
t	10211	4.6%
Other values (16)	56054	25.1%

Decimal Number

Value	Count	Frequency (%)
1	4912	25.9%
8	3178	16.7%
9	2392	12.6%
0	1850	9.7%
7	1558	8.2%
2	1543	8.1%
5	991	5.2%
6	903	4.8%
3	860	4.5%
4	809	4.3%

Other Punctuation

Value	Count	Frequency (%)
.	8249	60.4%
,	3798	27.8%
&	1124	8.2%
?	468	3.4%
'	10	0.1%

Open Punctuation

Value	Count	Frequency (%)
(	3479	99.9%
[	4	0.1%

Close Punctuation

Value	Count	Frequency (%)
)	3479	99.9%
]	4	0.1%

Space Separator

Value	Count	Frequency (%)
	32857	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	221	100.0%

Math Symbol

Value	Count	Frequency (%)
×	21	100.0%

Final Punctuation

Value	Count	Frequency (%)
’	16	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	250223	77.5%
Common	72726	22.5%

Most frequent character per script

Latin

Value	Count	Frequency (%)
a	28913	11.6%
i	22305	8.9%
e	18941	7.6%
s	15777	6.3%
o	15594	6.2%
r	15195	6.1%
n	14672	5.9%
l	13148	5.3%
u	12720	5.1%
t	10211	4.1%
Other values (43)	82747	33.1%

Common

Value	Count	Frequency (%)
	32857	45.2%
.	8249	11.3%
1	4912	6.8%
,	3798	5.2%
(	3479	4.8%
)	3479	4.8%
8	3178	4.4%
9	2392	3.3%
0	1850	2.5%
7	1558	2.1%
Other values (13)	6974	9.6%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	322910	> 99.9%
None	23	< 0.1%
Punctuation	16	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
	32857	10.2%
a	28913	9.0%
i	22305	6.9%
e	18941	5.9%
s	15777	4.9%
o	15594	4.8%
r	15195	4.7%
n	14672	4.5%
l	13148	4.1%
u	12720	3.9%
Other values (63)	132788	41.1%

None

Value	Count	Frequency (%)
×	21	91.3%
Ø	2	8.7%

Punctuation

Value	Count	Frequency (%)
’	16	100.0%

국명
Text

MISSING

Distinct	3092
Distinct (%)	42.6%
Missing	2734
Missing (%)	27.3%
Memory size	156.2 KiB

Length

Max length	13
Median length	11
Mean length	4.6526287
Min length	1

Characters and Unicode

Total characters	33806
Distinct characters	696
Distinct categories	4
Distinct scripts	3
Distinct blocks	2

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
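
For this column almost every character is a Hangul syllable, which Unicode places in the general category "Other Letter" (`Lo`). The sketch below checks that for one sample value using the standard library. The range U+AC00..U+D7A3 is the Hangul Syllables block. Treating it as the report's "Hangul" block is an assumption, and the helper name `describe` is illustrative.

```python
import unicodedata

# Hangul Syllables block: U+AC00..U+D7A3 (inclusive).
HANGUL_SYLLABLES = range(0xAC00, 0xD7A4)


def describe(ch):
    """Return (general category, is the character in the Hangul Syllables block?)."""
    return unicodedata.category(ch), ord(ch) in HANGUL_SYLLABLES


name = "들풀거미"  # first sample row of this column
info = [describe(ch) for ch in name]
slash = describe("/")  # the single punctuation character found in the column

print(info)
print(slash)
```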

Unique

Unique	1861
Unique (%)	25.6%
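
The percentages and the mean length for this column appear to be computed over the non-missing values only, not over all 10000 rows. The sketch below checks that reading against the figures above. The variable names are illustrative.

```python
# Figures copied from the tables for this column.
n_observations = 10_000
n_missing = 2_734
n_distinct = 3_092
n_unique = 1_861
total_characters = 33_806

# Rows that actually carry a value.
n_present = n_observations - n_missing

distinct_pct = 100 * n_distinct / n_present
unique_pct = 100 * n_unique / n_present
mean_length = total_characters / n_present

print(n_present)
print(f"{distinct_pct:.1f}% distinct, {unique_pct:.1f}% unique")
print(f"mean length {mean_length:.7f}")
```

With 7266 non-missing rows as the denominator, the results agree with the reported 42.6%, 25.6%, and mean length of 4.6526287.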

Sample

1st row	들풀거미
2nd row	물봉선속
3rd row	산괭이눈
4th row	밤고둥
5th row	세포큰조롱

Value	Count	Frequency (%)
밤고둥	159	2.2%
구멍밤고둥	105	1.4%
낫균속	84	1.2%
극동갯강구	80	1.1%
고랑딱개비	79	1.1%
홍합	79	1.1%
쇠살모사	76	1.0%
가는몸참집게	62	0.9%
갯장대	42	0.6%
덧나무	42	0.6%
Other values (3082)	6458	88.9%

Most occurring characters

Value	Count	Frequency (%)
리	1118	3.3%
나	943	2.8%
무	788	2.3%
이	769	2.3%
속	734	2.2%
고	678	2.0%
개	632	1.9%
미	470	1.4%
구	464	1.4%
사	460	1.4%
Other values (686)	26750	79.1%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	33802	> 99.9%
Uppercase Letter	2	< 0.1%
Other Punctuation	1	< 0.1%
Lowercase Letter	1	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
리	1118	3.3%
나	943	2.8%
무	788	2.3%
이	769	2.3%
속	734	2.2%
고	678	2.0%
개	632	1.9%
미	470	1.4%
구	464	1.4%
사	460	1.4%
Other values (682)	26746	79.1%

Uppercase Letter

Value	Count	Frequency (%)
U	1	50.0%
K	1	50.0%

Other Punctuation

Value	Count	Frequency (%)
/	1	100.0%

Lowercase Letter

Value	Count	Frequency (%)
a	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	33802	> 99.9%
Latin	3	< 0.1%
Common	1	< 0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
리	1118	3.3%
나	943	2.8%
무	788	2.3%
이	769	2.3%
속	734	2.2%
고	678	2.0%
개	632	1.9%
미	470	1.4%
구	464	1.4%
사	460	1.4%
Other values (682)	26746	79.1%

Latin

Value	Count	Frequency (%)
a	1	33.3%
U	1	33.3%
K	1	33.3%

Common

Value	Count	Frequency (%)
/	1	100.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	33802	> 99.9%
ASCII	4	< 0.1%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
리	1118	3.3%
나	943	2.8%
무	788	2.3%
이	769	2.3%
속	734	2.2%
고	678	2.0%
개	632	1.9%
미	470	1.4%
구	464	1.4%
사	460	1.4%
Other values (682)	26746	79.1%

ASCII

Value	Count	Frequency (%)
/	1	25.0%
a	1	25.0%
U	1	25.0%
K	1	25.0%

Count: a simple visualization of nullity by column.

Matrix: the nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
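
To make the idea of a nullity matrix concrete, here is a minimal sketch that builds one by hand for four rows taken from the sample table in this report. It marks each cell as present or missing and counts present cells per column. It is not how ydata-profiling renders its chart, and `None` stands in for `<NA>`.

```python
columns = ["유전정보아이디", "학명", "국명"]

# Four rows from the report's sample table; None marks a missing cell.
rows = [
    ("WBN0403419", "Micropsalliota pleurocystidiata Heinem. & Little Flower 1983", None),
    ("WBN0362518", "Agelena limbata Thorell, 1897", "들풀거미"),
    ("WBN0369629", "Impatiens L.", "물봉선속"),
    ("WBN0378668", "Eriocaulon truncatum Buch.-Ham. ex Mart.", None),
]

# Nullity matrix: True where a cell is present, False where it is missing.
matrix = [[cell is not None for cell in row] for row in rows]

# Per-column count of present cells (the "Count" view).
present_per_column = {
    col: sum(row[i] for row in matrix) for i, col in enumerate(columns)
}

# Text rendering of the matrix: '#' = present, '.' = missing.
for row in matrix:
    print("".join("#" if present else "." for present in row))
print(present_per_column)
```

In the full dataset only `국명` has gaps, so the first two columns would be solid and the third would show breaks in 27.3% of rows.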

First rows

	유전정보아이디	학명	국명
62991	WBN0403419	Micropsalliota pleurocystidiata Heinem. & Little Flower 1983	<NA>
17404	WBN0362518	Agelena limbata Thorell, 1897	들풀거미
26325	WBN0369629	Impatiens L.	물봉선속
833	WBN0339461	Chrysosplenium japonicum (Maxim.) Makino	산괭이눈
24921	WBN0364430	Chlorostoma lischkei Tapparone Canefri, 1874	밤고둥
1889	WBN0338910	Cynanchum volubile (Maxim.) Hemsl.	세포큰조롱
22828	WBN0348845	Maianthemum japonicum (A. Gray) La Frankie	풀솜대
16881	WBN0360619	Leibnitzia anandria (L.) Turcz.	솜나물
36011	WBN0378668	Eriocaulon truncatum Buch.-Ham. ex Mart.	<NA>
18724	WBN0347991	Galium kinuta Nakai & H. Hara	민둥갈퀴

Last rows

	유전정보아이디	학명	국명
63243	WBN0402663	Rikiosatoa grisea (Butler, 1878)	두줄가지나방
10576	WBN0350608	Paraburkholderia caledonica Coenye et al. 2001	<NA>
47431	WBN0392150	Arabis gemmifera (Matsum.) Makino	산장대
41121	WBN0381411	Peromyia Kieffer, 1894	어리애혹파리속
33542	WBN0375844	Spermacoce remota Lam.	<NA>
28465	WBN0366217	Gloydius ussuriensis (Emelianov, 1929)	쇠살모사
7834	WBN0345319	Sphingobium algicola Lee Y and Jeon CO. 2017	<NA>
12356	WBN0355370	Taraxacum formosanum Kitam.	영도민들레
48436	WBN0390003	Orthocladius ulaanbaatus Sasa and Suzuki, 1997	울란바트깃깔따구
24055	WBN0343215	Forsythia ovata Nakai	만리화
