gimi9 Pandas Profiling

Dataset statistics

Number of variables	5
Number of observations	10000
Missing cells	6
Missing cells (%)	< 0.1%
Duplicate rows	12
Duplicate rows (%)	0.1%
Total size in memory	488.3 KiB
Average record size in memory	50.0 B
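
The percentages and the average record size above follow from the raw counts. A minimal sketch of that arithmetic, assuming the missing-cell percentage is taken over every cell (rows x columns) and that 1 KiB is 1024 bytes; the variable names are illustrative and not taken from the profiler:

```python
# Values copied from the "Dataset statistics" table.
n_variables = 5
n_observations = 10_000
missing_cells = 6
duplicate_rows = 12
total_size_kib = 488.3

# Assumption: missing-cell percentage is computed over all cells (rows x columns).
missing_pct = 100 * missing_cells / (n_variables * n_observations)
duplicate_pct = 100 * duplicate_rows / n_observations
# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"missing cells: {missing_pct:.3f}%")
print(f"duplicate rows: {duplicate_pct:.2f}%")
print(f"average record size: {avg_record_bytes:.1f} B")
```

Under these assumptions the results are consistent with the "< 0.1%", "0.1%" and "50.0 B" entries in the table.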

Variable types

Categorical	3
Text	2

Dataset

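Description	Provides information on the receipt and processing of civil-petition filings for the procedures established under the LMO Act, including import declarations and export notifications of LMOs for testing and research and the registration of research facilities, together with information on LMO safety management grades.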
Author	한국생명공학연구원
URL	https://www.data.go.kr/data/15040518/fileData.do

Alerts

Dataset has 12 (0.1%) duplicate rows	Duplicates
`위험군` is highly overall correlated with `등급`	High correlation
`등급` is highly overall correlated with `위험군`	High correlation
`분류` is highly imbalanced (50.5%)	Imbalance
`위험군` is highly imbalanced (64.4%)	Imbalance
`등급` is highly imbalanced (62.6%)	Imbalance
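
The percentages in the Imbalance alerts are consistent with a normalized-entropy score, 1 - H / log2(k), where H is the Shannon entropy of the class frequencies and k is the number of distinct values. That formula is an assumption inferred from the numbers in this report, not something the report states. A minimal sketch using the counts listed for each variable further down:

```python
import math

def imbalance_score(counts):
    """Return 1 - H/log2(k): H is the Shannon entropy (in bits) of the class
    frequencies, k the number of classes. 0 means perfectly balanced; values
    close to 1 mean a single class dominates."""
    total = sum(counts)
    k = len(counts)
    if k < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1 - entropy / math.log2(k)

# Counts copied from the value tables of each categorical variable.
scores = {
    "분류": imbalance_score([7771, 884, 840, 384, 121]),
    "위험군": imbalance_score([7771, 2142, 65, 18, 4]),
    "등급": imbalance_score([7771, 2074, 72, 65, 18]),
}
for name, score in scores.items():
    print(f"{name}: {score:.1%}")
```

All three variables share the same dominant value (7771 of 10000 rows), which is why each score is well above zero even though the remaining classes differ.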

Reproduction

Analysis started	2023-12-12 19:44:39.332896
Analysis finished	2023-12-12 19:44:40.138878
Duration	0.81 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
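
The reported duration can be checked against the two timestamps. A small sketch using only the Python standard library; the variable names are illustrative:

```python
from datetime import datetime

# Timestamps copied from the "Reproduction" table.
started = datetime.fromisoformat("2023-12-12 19:44:39.332896")
finished = datetime.fromisoformat("2023-12-12 19:44:40.138878")

# Subtracting two datetimes yields a timedelta; convert it to seconds.
duration_seconds = (finished - started).total_seconds()
print(f"{duration_seconds:.2f} seconds")
```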

분류
Categorical

IMBALANCE

Distinct	5
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

동식물	7771
세균	884
진균	840
바이러스	384
기생충	121

Length

Max length	4
Median length	3
Mean length	2.866
Min length	2

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	동식물
2nd row	진균
3rd row	동식물
4th row	동식물
5th row	동식물

Common Values

Value	Count	Frequency (%)
동식물	7771	77.7%
세균	884	8.8%
진균	840	8.4%
바이러스	384	3.8%
기생충	121	1.2%

위험군
Categorical

HIGH CORRELATION, IMBALANCE

Distinct	5
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

<NA>	7771
2	2142
3	65
4	18
1	4

Length

Max length	4
Median length	4
Mean length	3.3313
Min length	1
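
The mean length of 3.3313 only adds up if `<NA>` is counted as a literal four-character string rather than as a missing value, which would also explain why Missing is 0 for this column. A small check of that reading, using only the counts listed above (this is an interpretation of the numbers, not a statement from the report):

```python
# Value counts copied from the 위험군 table above.
value_counts = {"<NA>": 7771, "2": 2142, "3": 65, "4": 18, "1": 4}

total = sum(value_counts.values())
# Weight each value's string length by how often it occurs.
mean_length = sum(len(value) * n for value, n in value_counts.items()) / total
# Share of rows that hold the placeholder string instead of a real level.
placeholder_share = value_counts["<NA>"] / total

print(mean_length)
print(f"{placeholder_share:.1%}")
```

If the placeholder is in fact a string, it would need to be converted to a real missing value before profiling for the Missing counts to reflect it.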

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	<NA>
2nd row	2
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
<NA>	7771	77.7%
2	2142	21.4%
3	65	0.7%
4	18	0.2%
1	4	< 0.1%

구분
Text

Distinct	4795
Distinct (%)	48.0%
Missing	6
Missing (%)	0.1%
Memory size	156.2 KiB

Length

Max length	20
Median length	17
Mean length	9.484991
Min length	2

Characters and Unicode

Total characters	94793
Distinct characters	54
Distinct categories	4
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
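
As an illustration of the general-category property mentioned above, the following sketch counts characters per category with Python's standard `unicodedata` module. The `CATEGORY_NAMES` mapping and the `category_counts` helper are hypothetical names introduced here, and the sample strings are taken from the first rows shown in this report; this is not the profiler's own implementation.

```python
import unicodedata
from collections import Counter

# Readable names for the general-category codes that appear in this report.
CATEGORY_NAMES = {
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Lo": "Other Letter",
    "Nd": "Decimal Number",
    "Po": "Other Punctuation",
    "Ps": "Open Punctuation",
    "Pe": "Close Punctuation",
    "Pd": "Dash Punctuation",
    "Zs": "Space Separator",
}

def category_counts(values):
    """Count characters per Unicode general category across all strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            code = unicodedata.category(ch)
            # Fall back to the raw two-letter code for unmapped categories.
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts

sample = ["Aconitum", "Lyophyllum", "A.kirinense", "L.shimeji"]
for name, count in category_counts(sample).most_common():
    print(name, count)
```

Script and block properties are not exposed by `unicodedata`, so the script and block tables would need a different data source.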

Unique

Unique	3238
Unique (%)	32.4%

Sample

1st row	Aconitum
2nd row	Lyophyllum
3rd row	Eriophyes
4th row	Rhomphocallus
5th row	Arctium

Value	Count	Frequency (%)
allium	85	0.9%
aconitum	76	0.8%
amanita	68	0.7%
acer	66	0.7%
agrilus	56	0.6%
achnanthes	52	0.5%
mycoplasma	44	0.4%
alternaria	43	0.4%
acremonium	41	0.4%
acleris	35	0.4%
Other values (4786)	9431	94.3%

Most occurring characters

Value	Count	Frequency (%)
a	10232	10.8%
i	8105	8.6%
o	7588	8.0%
e	6600	7.0%
r	6311	6.7%
s	5984	6.3%
l	5244	5.5%
c	4630	4.9%
t	4493	4.7%
n	4480	4.7%
Other values (44)	31126	32.8%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	84783	89.4%
Uppercase Letter	9993	10.5%
Space Separator	16	< 0.1%
Other Punctuation	1	< 0.1%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
a	10232	12.1%
i	8105	9.6%
o	7588	8.9%
e	6600	7.8%
r	6311	7.4%
s	5984	7.1%
l	5244	6.2%
c	4630	5.5%
t	4493	5.3%
n	4480	5.3%
Other values (16)	21116	24.9%

Uppercase Letter

Value	Count	Frequency (%)
A	3475	34.8%
P	1081	10.8%
C	924	9.2%
S	594	5.9%
M	496	5.0%
L	388	3.9%
E	341	3.4%
T	334	3.3%
H	323	3.2%
B	307	3.1%
Other values (16)	1730	17.3%

Space Separator

Value	Count	Frequency (%)
(space)	16	100.0%

Other Punctuation

Value	Count	Frequency (%)
&	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	94776	> 99.9%
Common	17	< 0.1%

Most frequent character per script

Latin

Value	Count	Frequency (%)
a	10232	10.8%
i	8105	8.6%
o	7588	8.0%
e	6600	7.0%
r	6311	6.7%
s	5984	6.3%
l	5244	5.5%
c	4630	4.9%
t	4493	4.7%
n	4480	4.7%
Other values (42)	31109	32.8%

Common

Value	Count	Frequency (%)
(space)	16	94.1%
&	1	5.9%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	94793	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
a	10232	10.8%
i	8105	8.6%
o	7588	8.0%
e	6600	7.0%
r	6311	6.7%
s	5984	6.3%
l	5244	5.5%
c	4630	4.9%
t	4493	4.7%
n	4480	4.7%
Other values (44)	31126	32.8%

생물체
Text

Distinct	9176
Distinct (%)	91.8%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	197
Median length	69
Mean length	11.6146
Min length	4

Characters and Unicode

Total characters	116146
Distinct characters	249
Distinct categories	9
Distinct scripts	3
Distinct blocks	3

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	8635
Unique (%)	86.4%

Sample

1st row	A.kirinense
2nd row	L.shimeji
3rd row	E.mali
4th row	R.coreanus
5th row	A.minus

Value	Count	Frequency (%)
virus	283	2.5%
a.japonica	30	0.3%
c	25	0.2%
b	23	0.2%
a.koreana	18	0.2%
mosaic	16	0.1%
bovine	16	0.1%
herpesvirus	15	0.1%
a	15	0.1%
disease	15	0.1%
Other values (9399)	10716	95.9%

Most occurring characters

Value	Count	Frequency (%)
a	11510	9.9%
i	11094	9.6%
.	9651	8.3%
s	8664	7.5%
e	7143	6.2%
n	6465	5.6%
r	6149	5.3%
u	5628	4.8%
o	5546	4.8%
l	4923	4.2%
Other values (239)	39373	33.9%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	93838	80.8%
Uppercase Letter	10136	8.7%
Other Punctuation	9677	8.3%
Space Separator	1426	1.2%
Other Letter	736	0.6%
Close Punctuation	120	0.1%
Open Punctuation	119	0.1%
Decimal Number	50	< 0.1%
Dash Punctuation	44	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
충	70	9.5%
구	19	2.6%
지	18	2.4%
리	17	2.3%
스	17	2.3%
제	17	2.3%
모	15	2.0%
이	15	2.0%
선	15	2.0%
편	14	1.9%
Other values (169)	519	70.5%

Lowercase Letter

Value	Count	Frequency (%)
a	11510	12.3%
i	11094	11.8%
s	8664	9.2%
e	7143	7.6%
n	6465	6.9%
r	6149	6.6%
u	5628	6.0%
o	5546	5.9%
l	4923	5.2%
t	4653	5.0%
Other values (16)	22063	23.5%

Uppercase Letter

Value	Count	Frequency (%)
A	3489	34.4%
P	1034	10.2%
C	950	9.4%
S	630	6.2%
M	523	5.2%
L	403	4.0%
E	362	3.6%
T	336	3.3%
H	324	3.2%
B	313	3.1%
Other values (16)	1772	17.5%

Decimal Number

Value	Count	Frequency (%)
1	16	32.0%
2	13	26.0%
3	8	16.0%
4	7	14.0%
7	2	4.0%
5	1	2.0%
8	1	2.0%
0	1	2.0%
9	1	2.0%

Other Punctuation

Value	Count	Frequency (%)
.	9651	99.7%
,	20	0.2%
'	3	< 0.1%
※	2	< 0.1%
/	1	< 0.1%

Space Separator

Value	Count	Frequency (%)
(space)	1426	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	120	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	119	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	44	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	103974	89.5%
Common	11436	9.8%
Hangul	736	0.6%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
충	70	9.5%
구	19	2.6%
지	18	2.4%
리	17	2.3%
스	17	2.3%
제	17	2.3%
모	15	2.0%
이	15	2.0%
선	15	2.0%
편	14	1.9%
Other values (169)	519	70.5%

Latin

Value	Count	Frequency (%)
a	11510	11.1%
i	11094	10.7%
s	8664	8.3%
e	7143	6.9%
n	6465	6.2%
r	6149	5.9%
u	5628	5.4%
o	5546	5.3%
l	4923	4.7%
t	4653	4.5%
Other values (42)	32199	31.0%

Common

Value	Count	Frequency (%)
.	9651	84.4%
(space)	1426	12.5%
)	120	1.0%
(	119	1.0%
-	44	0.4%
,	20	0.2%
1	16	0.1%
2	13	0.1%
3	8	0.1%
4	7	0.1%
Other values (8)	12	0.1%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	115408	99.4%
Hangul	736	0.6%
Punctuation	2	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
a	11510	10.0%
i	11094	9.6%
.	9651	8.4%
s	8664	7.5%
e	7143	6.2%
n	6465	5.6%
r	6149	5.3%
u	5628	4.9%
o	5546	4.8%
l	4923	4.3%
Other values (59)	38635	33.5%

Hangul

Value	Count	Frequency (%)
충	70	9.5%
구	19	2.6%
지	18	2.4%
리	17	2.3%
스	17	2.3%
제	17	2.3%
모	15	2.0%
이	15	2.0%
선	15	2.0%
편	14	1.9%
Other values (169)	519	70.5%

Punctuation

Value	Count	Frequency (%)
※	2	100.0%

등급
Categorical

HIGH CORRELATION, IMBALANCE

Distinct	5
Distinct (%)	0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

<NA>	7771
2	2074
1	72
3	65
4	18

Length

Max length	4
Median length	4
Mean length	3.3313
Min length	1

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	<NA>
2nd row	2
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
<NA>	7771	77.7%
2	2074	20.7%
1	72	0.7%
3	65	0.7%
4	18	0.2%

Correlations

	분류	위험군	등급
분류	1.000	0.368	0.805
위험군	0.368	1.000	0.984
등급	0.805	0.984	1.000

	분류	등급	위험군
분류	1.000	0.446	0.151
등급	0.446	1.000	0.827
위험군	0.151	0.827	1.000

	분류	위험군	등급
분류	1.000	0.151	0.446
위험군	0.151	1.000	0.827
등급	0.446	0.827	1.000
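
The report does not name the measures behind these matrices. For pairs of categorical variables, one commonly used association measure is Cramér's V. The sketch below is a plain, uncorrected version offered as an assumption for illustration, so it is not expected to reproduce the figures above; the `cramers_v` helper and both toy contingency tables are made up.

```python
import math

def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k <= 0:
        return 0.0
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))

perfect = [[10, 0], [0, 10]]      # each row level maps to one column level
independent = [[5, 5], [5, 5]]    # rows carry no information about columns
print(cramers_v(perfect), cramers_v(independent))
```

A value near 1 means one variable almost determines the other, and a value near 0 means the two are close to independent.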

Missing values

A simple visualization of nullity by column.

Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

First rows

	분류	위험군	구분	생물체	등급
3979	동식물	<NA>	Aconitum	A.kirinense	<NA>
2824	진균	2	Lyophyllum	L.shimeji	2
8898	동식물	<NA>	Eriophyes	E.mali	<NA>
12699	동식물	<NA>	Rhomphocallus	R.coreanus	<NA>
6660	동식물	<NA>	Arctium	A.minus	<NA>
13080	동식물	<NA>	Semisulcospira	S.libertina	<NA>
6041	동식물	<NA>	Alosterna	A.perpera	<NA>
7278	동식물	<NA>	Calopogonium	C.mucunoides	<NA>
6011	동식물	<NA>	Alonella	A.exigua	<NA>
7608	동식물	<NA>	Chitalpa	C.tashkinensis	<NA>

Last rows

	분류	위험군	구분	생물체	등급
1068	세균	2	Actinokineospora	A.inagensis	2
12329	동식물	<NA>	Prunus	P.ishidoyana	<NA>
13560	동식물	<NA>	Tegecoelotes	T.secundus	<NA>
13122	동식물	<NA>	Shiragaia	S.taeguensis	<NA>
13734	동식물	<NA>	Todarodes	T.pacificus	<NA>
10829	동식물	<NA>	Molophilus	M.avidus	<NA>
11837	동식물	<NA>	Philodromus	P.auricomus	<NA>
1330	세균	2	Achnanthes	A.rupestoides	2
12990	동식물	<NA>	Scirpus	S.juncoides	<NA>
8123	동식물	<NA>	Cryptoblabes	C.adoceta	<NA>

Duplicate rows

Most frequently occurring

	분류	위험군	구분	생물체	등급	# duplicates
0	동식물	<NA>	Alopecurus	A.aequalis	<NA>	2
1	동식물	<NA>	Chrysso	C.lativentris	<NA>	2
2	동식물	<NA>	Clubiona	C.papillata	<NA>	2
3	동식물	<NA>	Hydrolithon	H.sargassi	<NA>	2
4	동식물	<NA>	Melanoplus	M.differentialis	<NA>	2
5	동식물	<NA>	Oncometopia	O.nigricans	<NA>	2
6	동식물	<NA>	Prosopis	P.juliflora	<NA>	2
7	동식물	<NA>	Sarcocheilichthys	S.variegatus	<NA>	2
8	동식물	<NA>	Spergularia	S.marina	<NA>	2
9	동식물	<NA>	Vespa	V.velutina	<NA>	2
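
A table like the one above can be derived by counting identical rows. A minimal sketch with the standard library, where `duplicate_rows` is a hypothetical helper and the input is a toy list built from rows shown in this report, one of them repeated on purpose:

```python
from collections import Counter

def duplicate_rows(rows):
    """Return {row: count} for every row that occurs more than once."""
    counts = Counter(tuple(row) for row in rows)
    return {row: n for row, n in counts.items() if n > 1}

# Toy input: columns are 분류, 위험군, 구분, 생물체, 등급.
rows = [
    ("동식물", "<NA>", "Vespa", "V.velutina", "<NA>"),
    ("진균", "2", "Lyophyllum", "L.shimeji", "2"),
    ("동식물", "<NA>", "Vespa", "V.velutina", "<NA>"),
]
for row, n in duplicate_rows(rows).items():
    print(row, n)
```

Rows are compared on all five columns, so two records count as duplicates only when every field matches.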