gimi9 Pandas Profiling

Dataset statistics

Number of variables	4
Number of observations	10000
Missing cells	2686
Missing cells (%)	6.7%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	390.6 KiB
Average record size in memory	40.0 B
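
The two derived figures above follow from the raw counts. A minimal sketch of the arithmetic, assuming the missing-cell percentage is taken over all cells (rows times variables) and that 1 KiB is 1024 bytes; the variable names are illustrative only:

```python
# Raw figures copied from the dataset statistics above.
n_variables = 4
n_observations = 10_000
missing_cells = 2_686
total_size_kib = 390.6

# Share of missing cells over all cells (rows x variables).
missing_pct = missing_cells / (n_variables * n_observations) * 100

# Average record size: total bytes divided by the number of rows.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells: {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```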

Variable types

Text	4

Dataset

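Description	Provides the status of genetic information on native wild species related to DNA barcode sequences produced by the National Institute of Biological Resources (genetic information management number, scientific name, Korean name, etc.)
Author	National Institute of Biological Resources, Ministry of Environment (환경부 국립생물자원관)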
URL	https://www.data.go.kr/data/3070009/fileData.do

Alerts

`국명` has 2686 (26.9%) missing values	Missing
`유전정보아이디` has unique values	Unique

Reproduction

Analysis started	2023-12-12 16:32:35.548703
Analysis finished	2023-12-12 16:32:36.263375
Duration	0.71 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
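
A report like this one can in principle be regenerated with ydata-profiling. The sketch below is an assumption about how that might look, not the configuration actually used here: the CSV path, the report title, and the choice to read every column as text are all hypothetical, and files downloaded from data.go.kr may need an explicit encoding.

```python
def build_report(csv_path, output_path="report.html"):
    """Profile a CSV file and write an HTML report (illustrative sketch)."""
    # Imported lazily so the function can be defined without the packages installed.
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Assumption: the file parses with pandas defaults. If decoding fails,
    # an explicit encoding (for example encoding="cp949") may be required.
    df = pd.read_csv(csv_path, dtype=str)

    # The title is a placeholder, not taken from the original configuration.
    report = ProfileReport(df, title="DNA barcode profiling")
    report.to_file(output_path)
    return output_path
```

Calling `build_report("dna_barcode.csv")` with a hypothetical local file name would write `report.html` next to the script.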

유전정보아이디 (genetic information ID)
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	10
Median length	10
Mean length	10
Min length	10
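
The length statistics are simple aggregates over the character length of each value. A minimal sketch using only the five sample identifiers shown in this report (the full column has 10000 values, so this only illustrates the calculation):

```python
from statistics import mean, median

# Five sample values of the identifier column, as listed in the report.
sample_ids = ["WBN0368067", "WBN0342171", "WBN0377694", "WBN0340032", "WBN0383118"]

lengths = [len(value) for value in sample_ids]
stats = {
    "max": max(lengths),
    "median": median(lengths),
    "mean": mean(lengths),
    "min": min(lengths),
}
total_characters = sum(lengths)

print(stats, total_characters)
```

Because every identifier has exactly 10 characters, all four statistics coincide, and the total character count is 10 times the number of values.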

Characters and Unicode

Total characters	100000
Distinct characters	13
Distinct categories	2
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
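
As an illustration of the idea, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. It covers categories only (the standard library does not expose scripts or blocks), and it runs on the five sample identifiers rather than the full column:

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count the Unicode general category of every character in the values."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


sample_ids = ["WBN0368067", "WBN0342171", "WBN0377694", "WBN0340032", "WBN0383118"]
counts = category_counts(sample_ids)
total = sum(counts.values())

# "Nd" is Decimal Number and "Lu" is Uppercase Letter.
for category, n in counts.most_common():
    print(category, n, f"{n / total:.1%}")
```

Each identifier has three uppercase letters and seven digits, which matches the 30% / 70% split reported for the whole column.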

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	WBN0368067
2nd row	WBN0342171
3rd row	WBN0377694
4th row	WBN0340032
5th row	WBN0383118

Value	Count	Frequency (%)
wbn0368067	1	< 0.1%
wbn0382855	1	< 0.1%
wbn0361566	1	< 0.1%
wbn0397397	1	< 0.1%
wbn0346392	1	< 0.1%
wbn0379514	1	< 0.1%
wbn0345017	1	< 0.1%
wbn0368320	1	< 0.1%
wbn0400002	1	< 0.1%
wbn0358100	1	< 0.1%
Other values (9990)	9990	99.9%

Most occurring characters

Value	Count	Frequency (%)
0	14709	14.7%
3	13904	13.9%
W	10000	10.0%
B	10000	10.0%
N	10000	10.0%
4	5799	5.8%
9	5695	5.7%
8	5527	5.5%
6	5493	5.5%
7	5486	5.5%
Other values (3)	13387	13.4%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	70000	70.0%
Uppercase Letter	30000	30.0%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	14709	21.0%
3	13904	19.9%
4	5799	8.3%
9	5695	8.1%
8	5527	7.9%
6	5493	7.8%
7	5486	7.8%
5	5381	7.7%
1	4007	5.7%
2	3999	5.7%

Uppercase Letter

Value	Count	Frequency (%)
W	10000	33.3%
B	10000	33.3%
N	10000	33.3%

Most occurring scripts

Value	Count	Frequency (%)
Common	70000	70.0%
Latin	30000	30.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	14709	21.0%
3	13904	19.9%
4	5799	8.3%
9	5695	8.1%
8	5527	7.9%
6	5493	7.8%
7	5486	7.8%
5	5381	7.7%
1	4007	5.7%
2	3999	5.7%

Latin

Value	Count	Frequency (%)
W	10000	33.3%
B	10000	33.3%
N	10000	33.3%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	100000	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	14709	14.7%
3	13904	13.9%
W	10000	10.0%
B	10000	10.0%
N	10000	10.0%
4	5799	5.8%
9	5695	5.7%
8	5527	5.5%
6	5493	5.5%
7	5486	5.5%
Other values (3)	13387	13.4%

학명 (scientific name)
Text

Distinct	4794
Distinct (%)	47.9%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	117
Median length	75
Mean length	32.1174
Min length	5

Characters and Unicode

Total characters	321174
Distinct characters	77
Distinct categories	10
Distinct scripts	2
Distinct blocks	3


Unique

Unique	3104
Unique (%)	31.0%

Sample

1st row	Asplenium incisum Thunb.
2nd row	Anthus gustavi Swinhoe, 1863
3rd row	Glomerella Spauld. & H. Schrenk 1903
4th row	Petrolisthes coccineus (Owen, 1839)
5th row	Pagurus maculosus Komai & Imafuku, 1996

Value	Count	Frequency (%)
l	1053	2.5%
(empty)	1038	2.4%
et	506	1.2%
al	504	1.2%
ex	444	1.0%
nakai	375	0.9%
japonica	311	0.7%
a	297	0.7%
var	294	0.7%
h	286	0.7%
Other values (8363)	37354	88.0%
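
The lowercase tokens and the blank entry in the table above are consistent with a word count that lowercases each value, splits on whitespace, and strips surrounding punctuation. The sketch below is an assumption about that procedure, not the profiler's confirmed implementation, and it uses only the five sample rows:

```python
import string
from collections import Counter


def word_counts(values):
    """Lowercase, split on whitespace, strip surrounding punctuation, count."""
    strip_chars = string.punctuation + string.whitespace
    counts = Counter()
    for value in values:
        for token in value.lower().split():
            counts[token.strip(strip_chars)] += 1
    return counts


names = [
    "Asplenium incisum Thunb.",
    "Anthus gustavi Swinhoe, 1863",
    "Glomerella Spauld. & H. Schrenk 1903",
    "Petrolisthes coccineus (Owen, 1839)",
    "Pagurus maculosus Komai & Imafuku, 1996",
]
counts = word_counts(names)
print(counts.most_common(5))
```

Under this assumption a token made only of punctuation, such as `&`, collapses to an empty string, which would explain a blank value in the frequency table.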

Most occurring characters

Value	Count	Frequency (%)
(space)	32462	10.1%
a	28650	8.9%
i	22195	6.9%
e	18804	5.9%
s	15899	5.0%
o	15720	4.9%
r	15295	4.8%
n	14436	4.5%
l	13077	4.1%
u	12660	3.9%
Other values (67)	131976	41.1%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	222556	69.3%
Space Separator	32462	10.1%
Uppercase Letter	26618	8.3%
Decimal Number	18956	5.9%
Other Punctuation	13550	4.2%
Open Punctuation	3406	1.1%
Close Punctuation	3406	1.1%
Dash Punctuation	183	0.1%
Math Symbol	20	< 0.1%
Final Punctuation	17	< 0.1%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
a	28650	12.9%
i	22195	10.0%
e	18804	8.4%
s	15899	7.1%
o	15720	7.1%
r	15295	6.9%
n	14436	6.5%
l	13077	5.9%
u	12660	5.7%
t	10225	4.6%
Other values (17)	55595	25.0%

Uppercase Letter

Value	Count	Frequency (%)
C	2655	10.0%
L	2518	9.5%
S	2253	8.5%
M	2003	7.5%
P	1751	6.6%
A	1741	6.5%
H	1578	5.9%
T	1360	5.1%
B	1336	5.0%
K	1186	4.5%
Other values (17)	8237	30.9%

Decimal Number

Value	Count	Frequency (%)
1	4936	26.0%
8	3280	17.3%
9	2275	12.0%
0	1779	9.4%
7	1581	8.3%
2	1488	7.8%
5	1011	5.3%
6	981	5.2%
3	848	4.5%
4	777	4.1%

Other Punctuation

Value	Count	Frequency (%)
.	8328	61.5%
,	3759	27.7%
&	1035	7.6%
?	421	3.1%
'	7	0.1%

Open Punctuation

Value	Count	Frequency (%)
(	3403	99.9%
[	3	0.1%

Close Punctuation

Value	Count	Frequency (%)
)	3403	99.9%
]	3	0.1%

Space Separator

Value	Count	Frequency (%)
(space)	32462	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	183	100.0%

Math Symbol

Value	Count	Frequency (%)
×	20	100.0%

Final Punctuation

Value	Count	Frequency (%)
’	17	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	249174	77.6%
Common	72000	22.4%

Most frequent character per script

Latin

Value	Count	Frequency (%)
a	28650	11.5%
i	22195	8.9%
e	18804	7.5%
s	15899	6.4%
o	15720	6.3%
r	15295	6.1%
n	14436	5.8%
l	13077	5.2%
u	12660	5.1%
t	10225	4.1%
Other values (44)	82213	33.0%

Common

Value	Count	Frequency (%)
(space)	32462	45.1%
.	8328	11.6%
1	4936	6.9%
,	3759	5.2%
(	3403	4.7%
)	3403	4.7%
8	3280	4.6%
9	2275	3.2%
0	1779	2.5%
7	1581	2.2%
Other values (13)	6794	9.4%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	321134	> 99.9%
None	23	< 0.1%
Punctuation	17	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
(space)	32462	10.1%
a	28650	8.9%
i	22195	6.9%
e	18804	5.9%
s	15899	5.0%
o	15720	4.9%
r	15295	4.8%
n	14436	4.5%
l	13077	4.1%
u	12660	3.9%
Other values (63)	131936	41.1%

None

Value	Count	Frequency (%)
×	20	87.0%
Ø	2	8.7%
ø	1	4.3%

Punctuation

Value	Count	Frequency (%)
’	17	100.0%

국명 (Korean name)
Text

MISSING

Distinct	3057
Distinct (%)	41.8%
Missing	2686
Missing (%)	26.9%
Memory size	156.2 KiB
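
The percentages for this variable use two different denominators. The figures are consistent with Missing (%) being taken over all rows and Distinct (%) and Unique (%) over non-missing rows only; this is inferred from the numbers, not stated in the report. The Unique count used below is the one listed further down for this variable.

```python
# Counts reported for this variable.
n_rows = 10_000
n_missing = 2_686
n_distinct = 3_057
n_unique = 1_812

# Rows that actually carry a value.
n_present = n_rows - n_missing

missing_pct = n_missing / n_rows * 100        # over all rows
distinct_pct = n_distinct / n_present * 100   # over non-missing rows
unique_pct = n_unique / n_present * 100       # over non-missing rows

print(f"{missing_pct:.1f}% missing, {distinct_pct:.1f}% distinct, {unique_pct:.1f}% unique")
```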

Length

Max length	13
Median length	11
Mean length	4.6462948
Min length	1

Characters and Unicode

Total characters	33983
Distinct characters	688
Distinct categories	2
Distinct scripts	2
Distinct blocks	2


Unique

Unique	1812
Unique (%)	24.8%

Sample

1st row	꼬리고사리
2nd row	흰등밭종다리
3rd row	작은뿔껍질균속
4th row	검붉은게붙이
5th row	가는몸참집게

Value	Count	Frequency (%)
밤고둥	180	2.5%
낫균속	103	1.4%
구멍밤고둥	102	1.4%
홍합	94	1.3%
가는몸참집게	76	1.0%
고랑딱개비	76	1.0%
극동갯강구	74	1.0%
쇠살모사	63	0.9%
덧나무	52	0.7%
무당거미	50	0.7%
Other values (3047)	6444	88.1%

Most occurring characters

Value	Count	Frequency (%)
리	1111	3.3%
나	1044	3.1%
무	844	2.5%
속	811	2.4%
이	791	2.3%
고	723	2.1%
개	629	1.9%
미	478	1.4%
구	454	1.3%
사	454	1.3%
Other values (678)	26644	78.4%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	33979	> 99.9%
Other Punctuation	4	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
리	1111	3.3%
나	1044	3.1%
무	844	2.5%
속	811	2.4%
이	791	2.3%
고	723	2.1%
개	629	1.9%
미	478	1.4%
구	454	1.3%
사	454	1.3%
Other values (677)	26640	78.4%

Other Punctuation

Value	Count	Frequency (%)
/	4	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	33979	> 99.9%
Common	4	< 0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
리	1111	3.3%
나	1044	3.1%
무	844	2.5%
속	811	2.4%
이	791	2.3%
고	723	2.1%
개	629	1.9%
미	478	1.4%
구	454	1.3%
사	454	1.3%
Other values (677)	26640	78.4%

Common

Value	Count	Frequency (%)
/	4	100.0%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	33979	> 99.9%
ASCII	4	< 0.1%

Most frequent character per block

Hangul

Value	Count	Frequency (%)
리	1111	3.3%
나	1044	3.1%
무	844	2.5%
속	811	2.4%
이	791	2.3%
고	723	2.1%
개	629	1.9%
미	478	1.4%
구	454	1.3%
사	454	1.3%
Other values (677)	26640	78.4%

ASCII

Value	Count	Frequency (%)
/	4	100.0%

마커명 (marker name)
Text

Distinct	55
Distinct (%)	0.5%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	14
Median length	3
Mean length	3.9099
Min length	3

Characters and Unicode

Total characters	39099
Distinct characters	56
Distinct categories	6
Distinct scripts	3
Distinct blocks	2


Unique

Unique	4
Unique (%)	< 0.1%
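
Distinct and Unique measure different things: Distinct counts how many different values occur, while Unique counts the values that occur exactly once. A minimal sketch on ten marker names taken from the sample rows of this report (not the full column):

```python
from collections import Counter

# Marker names from ten sample rows of the report.
markers = ["rbcL", "Cytb", "CHS-1", "COI", "COI",
           "rbcL", "rbcL", "rbcL", "trnH-psbA", "rbcL"]

counts = Counter(markers)

# Distinct: number of different values.
n_distinct = len(counts)

# Unique: number of values that appear exactly once.
n_unique = sum(1 for n in counts.values() if n == 1)

print(n_distinct, n_unique)
```

In this small sample `rbcL` and `COI` repeat, so they count toward Distinct but not toward Unique.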

Sample

1st row	rbcL
2nd row	Cytb
3rd row	CHS-1
4th row	COI
5th row	COI

Value	Count	Frequency (%)
rbcl	1926	18.9%
coi	1835	18.0%
its	1636	16.0%
16s	1396	13.7%
matk	1130	11.1%
trnh-psba	422	4.1%
cytb	260	2.5%
rrna	213	2.1%
trnl-f	191	1.9%
lsu	135	1.3%
Other values (46)	1073	10.5%

Most occurring characters

Value	Count	Frequency (%)
S	3573	9.1%
I	3543	9.1%
r	2839	7.3%
b	2744	7.0%
L	2272	5.8%
C	2237	5.7%
t	2128	5.4%
c	1982	5.1%
O	1840	4.7%
1	1719	4.4%
Other values (46)	14222	36.4%

Most occurring categories

Value	Count	Frequency (%)
Uppercase Letter	19330	49.4%
Lowercase Letter	14849	38.0%
Decimal Number	3765	9.6%
Dash Punctuation	922	2.4%
Space Separator	217	0.6%
Other Punctuation	16	< 0.1%

Most frequent character per category

Uppercase Letter

Value	Count	Frequency (%)
S	3573	18.5%
I	3543	18.3%
L	2272	11.8%
C	2237	11.6%
O	1840	9.5%
T	1688	8.7%
K	1202	6.2%
A	684	3.5%
H	588	3.0%
R	328	1.7%
Other values (14)	1375	7.1%

Lowercase Letter

Value	Count	Frequency (%)
r	2839	19.1%
b	2744	18.5%
t	2128	14.3%
c	1982	13.3%
a	1202	8.1%
m	1152	7.8%
n	758	5.1%
p	681	4.6%
s	573	3.9%
y	280	1.9%
Other values (11)	510	3.4%

Decimal Number

Value	Count	Frequency (%)
1	1719	45.7%
6	1410	37.5%
2	361	9.6%
8	213	5.7%
3	29	0.8%
4	20	0.5%
5	8	0.2%
9	5	0.1%

Dash Punctuation

Value	Count	Frequency (%)
-	922	100.0%

Space Separator

Value	Count	Frequency (%)
(space)	217	100.0%

Other Punctuation

Value	Count	Frequency (%)
/	16	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	34140	87.3%
Common	4920	12.6%
Greek	39	0.1%

Most frequent character per script

Latin

Value	Count	Frequency (%)
S	3573	10.5%
I	3543	10.4%
r	2839	8.3%
b	2744	8.0%
L	2272	6.7%
C	2237	6.6%
t	2128	6.2%
c	1982	5.8%
O	1840	5.4%
T	1688	4.9%
Other values (34)	9294	27.2%

Common

Value	Count	Frequency (%)
1	1719	34.9%
6	1410	28.7%
-	922	18.7%
2	361	7.3%
(space)	217	4.4%
8	213	4.3%
3	29	0.6%
4	20	0.4%
/	16	0.3%
5	8	0.2%

Greek

Value	Count	Frequency (%)
α	39	100.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	39060	99.9%
None	39	0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
S	3573	9.1%
I	3543	9.1%
r	2839	7.3%
b	2744	7.0%
L	2272	5.8%
C	2237	5.7%
t	2128	5.4%
c	1982	5.1%
O	1840	4.7%
1	1719	4.4%
Other values (45)	14183	36.3%

None

Value	Count	Frequency (%)
α	39	100.0%

Missing values

Count: a simple visualization of nullity by column.

Matrix: the nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
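
The per-column count behind such a chart is the number of non-missing values in each column. A minimal sketch using three rows taken from the sample rows listed in this report, with `None` standing in for `<NA>` (an assumption about how missing values are represented):

```python
COLUMNS = ["유전정보아이디", "학명", "국명", "마커명"]

# Three sample rows; None marks a missing value.
rows = [
    ("WBN0389533", "Modiolicola bifida Tanaka, 1961", "진주담치속살이", "COI"),
    ("WBN0373770", "Gasteracantha kuhli C. L. Koch, 1837", "가시거미", "16S"),
    ("WBN0376532", "Dissotis rotundifolia (Sm.) Triana", None, "ITS"),
]

# Count non-missing values per column.
present = {
    column: sum(row[i] is not None for row in rows)
    for i, column in enumerate(COLUMNS)
}

for column, n in present.items():
    print(column, n, f"{n / len(rows):.0%}")
```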

Sample

First rows

	유전정보아이디	학명	국명	마커명
30127	WBN0368067	Asplenium incisum Thunb.	꼬리고사리	rbcL
2111	WBN0342171	Anthus gustavi Swinhoe, 1863	흰등밭종다리	Cytb
39921	WBN0377694	Glomerella Spauld. & H. Schrenk 1903	작은뿔껍질균속	CHS-1
2272	WBN0340032	Petrolisthes coccineus (Owen, 1839)	검붉은게붙이	COI
57483	WBN0383118	Pagurus maculosus Komai & Imafuku, 1996	가는몸참집게	COI
53234	WBN0387239	Polygonatum Mill.	둥굴레속	rbcL
467	WBN0337853	Petunia × hybrida (Hook.) Vilm.	페튜니아	rbcL
7881	WBN0356460	Potamogeton fryeri A. Benn.	선가래	rbcL
45620	WBN0401025	Cardamine leucantha (Tausch) O. E. Schulz	미나리냉이	trnH-psbA
47831	WBN0395041	Clematis ochotensis (Pall.) Poir.	자주종덩굴	rbcL

Last rows

	유전정보아이디	학명	국명	마커명
63650	WBN0403163	Cylindromyia brassicaria (Fabricius, 1775)	표주박기생파리	COI
47852	WBN0395062	Coriandrum sativum L.	고수	rbcL
19620	WBN0347780	Solanum lycopersicum L.	토마토	trnH-psbA
10779	WBN0348976	Asparagus cochinchinensis (Lour.) Merr.	천문동	ITS
29079	WBN0365442	Mytilus unguiculatus Valenciennes, 1858	홍합	16S
9701	WBN0354896	Aster meyendorffii (Regel & Maack) Voss	개쑥부쟁이	matK
17755	WBN0361442	Sagina L.	개미자리속	trnL-F
58179	WBN0389533	Modiolicola bifida Tanaka, 1961	진주담치속살이	COI
38161	WBN0373770	Gasteracantha kuhli C. L. Koch, 1837	가시거미	16S
41303	WBN0376532	Dissotis rotundifolia (Sm.) Triana	<NA>	ITS
