gimi9 Pandas Profiling

Dataset statistics

Number of variables	4
Number of observations	7578
Missing cells	3033
Missing cells (%)	10.0%
Duplicate rows	34
Duplicate rows (%)	0.4%
Total size in memory	244.3 KiB
Average record size in memory	33.0 B
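
The derived figures in this table can be checked from the raw counts. The following is a minimal sketch, assuming (the report does not state this) that the missing-cell percentage is taken over all rows × variables, that the duplicate percentage is taken over all rows, and that 1 KiB is 1024 bytes. The variable names are illustrative; the per-variable missing counts are the ones listed in the variable sections of this report.

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 4
n_observations = 7578
missing_cells = 3033
duplicate_rows = 34
total_memory_kib = 244.3

# Per-variable "Missing" counts, as reported in each variable section.
missing_per_variable = {
    "유해물질확장구분": 0,
    "유해물질확장항목": 719,
    "유해물질확장 물질명": 18,
    "유해물질확장 참조명": 2296,
}

# Assumption: percentages are relative to all cells and all rows respectively.
total_cells = n_variables * n_observations
missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations

# Assumption: 1 KiB = 1024 bytes.
avg_record_bytes = total_memory_kib * 1024 / n_observations

print(f"Sum of per-variable missing counts: {sum(missing_per_variable.values())}")
print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Under these assumptions the computed values agree with the table after rounding to one decimal place.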

Variable types

Numeric	1
Text	3

Dataset

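Description	Provides details of the hazardous-substance criteria within the 'Eco-label Certification Criteria' (환경표지 인증 기준): hazardous substance extended category, item name, code name, substance name, and reference name
Author	Korea Environmental Industry & Technology Institute (한국환경산업기술원)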
URL	https://www.data.go.kr/data/15071378/fileData.do

Alerts

Dataset has 34 (0.4%) duplicate rows	Duplicates
`유해물질확장항목` has 719 (9.5%) missing values	Missing
`유해물질확장 참조명` has 2296 (30.3%) missing values	Missing

Reproduction

Analysis started	2023-12-12 07:05:41.095024
Analysis finished	2023-12-12 07:05:41.961248
Duration	0.87 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
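
A report of this kind can be regenerated with ydata-profiling. The sketch below is only an outline under stated assumptions, not the configuration actually used (that is in the `config.json` referenced above and is not reproduced here). The function name `build_report`, the `cp949` default encoding, and the report title are hypothetical choices.

```python
def build_report(csv_path: str, out_path: str, encoding: str = "cp949") -> None:
    """Profile a CSV file and write the result as an HTML report."""
    # Imported inside the function so the module loads even when the
    # third-party packages (pandas, ydata-profiling) are not installed.
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Assumption: the source file is a CSV. The cp949 encoding is a guess
    # for a Korean-language file; check the actual file before relying on it.
    df = pd.read_csv(csv_path, encoding=encoding)

    # Default settings; the original run may have used different options.
    profile = ProfileReport(df, title="Pandas Profiling")
    profile.to_file(out_path)
```

Calling `build_report("hazardous_substances.csv", "report.html")` (both file names are placeholders) would write an HTML report for that file.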

유해물질확장구분
Real number (ℝ)

Distinct	15
Distinct (%)	0.2%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	4.2257852

Minimum	1
Maximum	15
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	66.7 KiB

Quantile statistics

Minimum	1
5-th percentile	1
Q1	1
median	1
Q3	10
95-th percentile	14
Maximum	15
Range	14
Interquartile range (IQR)	9
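
The quantile statistics can be recomputed from the value counts listed in the frequency tables of this section. This is a minimal sketch that assumes linear interpolation between neighbouring ranks; the report does not say which quantile method was used, and the helper `quantile` is illustrative.

```python
import math

# Value -> count, copied from the frequency tables for 유해물질확장구분.
counts = {1: 4264, 2: 1079, 3: 12, 4: 28, 5: 4, 6: 12, 7: 7, 8: 26,
          9: 99, 10: 987, 11: 73, 12: 97, 13: 168, 14: 638, 15: 84}

# Expand the counts into a sorted list of individual observations.
values = sorted(v for v, c in counts.items() for _ in range(c))


def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring ranks."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * frac


q05, q1, median, q3, q95 = (
    quantile(values, q) for q in (0.05, 0.25, 0.5, 0.75, 0.95)
)
value_range = values[-1] - values[0]
iqr = q3 - q1

print(q05, q1, median, q3, q95, value_range, iqr)
```

Because more than half of the observations equal 1, the 5th percentile, Q1, and the median all coincide with the minimum.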

Descriptive statistics

Standard deviation	4.8497646
Coefficient of variation (CV)	1.14766
Kurtosis	-0.58936491
Mean	4.2257852
Median Absolute Deviation (MAD)	0
Skewness	1.0762272
Sum	32023
Variance	23.520217
Monotonicity	Not monotonic
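
Several of the descriptive statistics can likewise be derived from the value counts in this section's frequency tables. The sketch below assumes the sample variance (denominator n - 1), which is what reproduces the reported figure; skewness and kurtosis are left out because their exact estimator is not stated in the report.

```python
import math
import statistics

# Value -> count, copied from the frequency tables for 유해물질확장구분.
counts = {1: 4264, 2: 1079, 3: 12, 4: 28, 5: 4, 6: 12, 7: 7, 8: 26,
          9: 99, 10: 987, 11: 73, 12: 97, 13: 168, 14: 638, 15: 84}

n = sum(counts.values())
total = sum(v * c for v, c in counts.items())
mean = total / n

# Assumption: sample variance with n - 1 in the denominator.
variance = sum(c * (v - mean) ** 2 for v, c in counts.items()) / (n - 1)
std = math.sqrt(variance)
cv = std / mean  # coefficient of variation

# Median absolute deviation: median of |x - median(x)|.
values = [v for v, c in counts.items() for _ in range(c)]
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)

print(f"n={n} sum={total} mean={mean:.7f}")
print(f"variance={variance:.6f} std={std:.7f} cv={cv:.5f} mad={mad}")
```

The MAD is 0 because more than half of the observations sit exactly on the median, so the median of the absolute deviations is itself 0.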

Histogram with fixed size bins (bins=15)

Value	Count	Frequency (%)
1	4264	56.3%
2	1079	14.2%
10	987	13.0%
14	638	8.4%
13	168	2.2%
9	99	1.3%
12	97	1.3%
15	84	1.1%
11	73	1.0%
4	28	0.4%
Other values (5)	61	0.8%

Minimum 10 values

Value	Count	Frequency (%)
1	4264	56.3%
2	1079	14.2%
3	12	0.2%
4	28	0.4%
5	4	0.1%
6	12	0.2%
7	7	0.1%
8	26	0.3%
9	99	1.3%
10	987	13.0%

Maximum 10 values

Value	Count	Frequency (%)
15	84	1.1%
14	638	8.4%
13	168	2.2%
12	97	1.3%
11	73	1.0%
10	987	13.0%
9	99	1.3%
8	26	0.3%
7	7	0.1%
6	12	0.2%

유해물질확장항목
Text

MISSING

Distinct	5053
Distinct (%)	73.7%
Missing	719
Missing (%)	9.5%
Memory size	59.3 KiB

Length

Max length	453
Median length	241
Mean length	9.9867328
Min length	1

Characters and Unicode

Total characters	68499
Distinct characters	22
Distinct categories	8
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
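
As an illustration of such properties, Python's standard `unicodedata` module exposes the general category of each code point. The sketch below counts categories over the five sample values shown for this variable; the `CATEGORY_NAMES` mapping is a partial, illustrative lookup from two-letter category codes to the names used in this report, and how the report itself computes these tables is not stated.

```python
import unicodedata
from collections import Counter

# Sample values of the 유해물질확장항목 column, as listed in this report.
samples = ["-", "1313-13-9", "7722-64-7", "7785-87-7", "116633-53-5"]

# Partial mapping from general-category codes to readable names.
CATEGORY_NAMES = {
    "Nd": "Decimal Number",
    "Pd": "Dash Punctuation",
    "Zs": "Space Separator",
    "Ps": "Open Punctuation",
    "Pe": "Close Punctuation",
    "Po": "Other Punctuation",
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
}

# Count the general category of every character in every sample.
category_counts = Counter(
    unicodedata.category(ch) for text in samples for ch in text
)

for code, count in category_counts.most_common():
    print(CATEGORY_NAMES.get(code, code), count)
```

Only two categories occur in these samples, digits and hyphens, which is in line with the dominance of Decimal Number and Dash Punctuation in the category table for this variable.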

Unique

Unique	4077
Unique (%)	59.4%

Sample

1st row	-
2nd row	1313-13-9
3rd row	7722-64-7
4th row	7785-87-7
5th row	116633-53-5

Value	Count	Frequency (%)
	208	2.8%
2	74	1.0%
3	48	0.6%
4	25	0.3%
56-38-2	7	0.1%
1976-06-02	7	0.1%
5	7	0.1%
76-44-8	6	0.1%
87-86-5	6	0.1%
510-15-6	6	0.1%
Other values (5373)	7001	94.7%
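
Most sample values in this column have the shape of CAS Registry Numbers, although the report does not name the format, so that reading is an assumption. Under it, a format and check-digit test can separate well-formed identifiers from entries such as `-`, bare integers, or `1976-06-02`, which reads more like a date. The helper `is_valid_cas` below is an illustrative sketch, not part of the dataset or of ydata-profiling.

```python
import re

# CAS format: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Return True if text is a well-formed CAS number with a valid check digit."""
    match = CAS_PATTERN.match(text)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weights 1, 2, 3, ... are applied from the rightmost body digit leftwards;
    # the check digit is the weighted sum modulo 10.
    weighted_sum = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return weighted_sum % 10 == check_digit


for value in ["1313-13-9", "7722-64-7", "56-38-2", "1976-06-02", "-", "2"]:
    print(value, is_valid_cas(value))
```

A check like this only flags suspicious entries; it cannot recover what the original identifier was.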

Most occurring characters

Value	Count	Frequency (%)
-	14420	21.1%
1	7563	11.0%
0	5327	7.8%
2	5323	7.8%
6	5134	7.5%
7	5013	7.3%
3	4906	7.2%
5	4897	7.1%
4	4827	7.0%
9	4756	6.9%
Other values (12)	6333	9.2%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	52480	76.6%
Dash Punctuation	14420	21.1%
Space Separator	536	0.8%
Close Punctuation	519	0.8%
Open Punctuation	518	0.8%
Other Punctuation	16	< 0.1%
Lowercase Letter	6	< 0.1%
Uppercase Letter	4	< 0.1%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
1	7563	14.4%
0	5327	10.2%
2	5323	10.1%
6	5134	9.8%
7	5013	9.6%
3	4906	9.3%
5	4897	9.3%
4	4827	9.2%
9	4756	9.1%
8	4734	9.0%

Uppercase Letter

Value	Count	Frequency (%)
T	2	50.0%
B	1	25.0%
C	1	25.0%

Lowercase Letter

Value	Count	Frequency (%)
e	2	33.3%
m	2	33.3%
p	2	33.3%

Other Punctuation

Value	Count	Frequency (%)
.	14	87.5%
,	2	12.5%

Dash Punctuation

Value	Count	Frequency (%)
-	14420	100.0%

Space Separator

Value	Count	Frequency (%)
	536	100.0%

Close Punctuation

Value	Count	Frequency (%)
]	519	100.0%

Open Punctuation

Value	Count	Frequency (%)
[	518	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	68489	> 99.9%
Latin	10	< 0.1%

Most frequent character per script

Common

Value	Count	Frequency (%)
-	14420	21.1%
1	7563	11.0%
0	5327	7.8%
2	5323	7.8%
6	5134	7.5%
7	5013	7.3%
3	4906	7.2%
5	4897	7.2%
4	4827	7.0%
9	4756	6.9%
Other values (6)	6323	9.2%

Latin

Value	Count	Frequency (%)
T	2	20.0%
e	2	20.0%
m	2	20.0%
p	2	20.0%
B	1	10.0%
C	1	10.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	68499	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
-	14420	21.1%
1	7563	11.0%
0	5327	7.8%
2	5323	7.8%
6	5134	7.5%
7	5013	7.3%
3	4906	7.2%
5	4897	7.1%
4	4827	7.0%
9	4756	6.9%
Other values (12)	6333	9.2%

유해물질확장 물질명
Text

Distinct	7481
Distinct (%)	99.0%
Missing	18
Missing (%)	0.2%
Memory size	59.3 KiB

Length

Max length	1024
Median length	516
Mean length	87.949603
Min length	3

Characters and Unicode

Total characters	664899
Distinct characters	433
Distinct categories	16
Distinct scripts	4
Distinct blocks	7

Unique

Unique	7407
Unique (%)	98.0%

Sample

1st row	trisodium bis[(3'-nitro-5'-sulfonato(6-amino-2-[4-(2-hydroxy-1-naphtylazo)phenylsulfonylamino]pyrimidin-5-azo)benzene-2',4-diolato)]chromate (III)
2nd row	potassium tetrasodium bis[(N,N'-n)-1'-(phenylcarbamoyl)-3,5-disulfonatobenzeneazo-1'-prop-1'-ene-2,2'-diolato]chromate(III)
3rd row	manganese dioxide
4th row	potassium permanganate
5th row	manganese sulphate

Value	Count	Frequency (%)
of	3068	4.9%
the	1380	2.2%
혼합물	1149	1.8%
함유한	1144	1.8%
및	1131	1.8%
a	1087	1.7%
이를	1075	1.7%
hydrocarbons	1073	1.7%
이상	1047	1.7%
and	1009	1.6%
Other values (12419)	49379	79.0%

Most occurring characters

Value	Count	Frequency (%)
	55508	8.3%
o	43768	6.6%
e	41715	6.3%
i	34971	5.3%
a	33846	5.1%
n	33211	5.0%
t	32866	4.9%
-	30740	4.6%
l	28574	4.3%
r	26903	4.0%
Other values (423)	302797	45.5%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	444510	66.9%
Space Separator	55509	8.3%
Decimal Number	41905	6.3%
Other Letter	33257	5.0%
Dash Punctuation	30740	4.6%
Other Punctuation	19318	2.9%
Uppercase Letter	14578	2.2%
Open Punctuation	12381	1.9%
Close Punctuation	12381	1.9%
Math Symbol	221	< 0.1%
Other values (6)	99	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
이	2739	8.2%
물	1280	3.8%
합	1239	3.7%
한	1185	3.6%
유	1181	3.6%
혼	1174	3.5%
함	1163	3.5%
상	1149	3.5%
및	1144	3.4%
를	1143	3.4%
Other values (307)	19860	59.7%

Lowercase Letter

Value	Count	Frequency (%)
o	43768	9.8%
e	41715	9.4%
i	34971	7.9%
a	33846	7.6%
n	33211	7.5%
t	32866	7.4%
l	28574	6.4%
r	26903	6.1%
h	22823	5.1%
y	20622	4.6%
Other values (26)	125211	28.2%

Uppercase Letter

Value	Count	Frequency (%)
C	2281	15.6%
N	1722	11.8%
I	1465	10.0%
S	1275	8.7%
O	1186	8.1%
A	992	6.8%
H	787	5.4%
D	694	4.8%
T	655	4.5%
P	594	4.1%
Other values (16)	2927	20.1%

Other Punctuation

Value	Count	Frequency (%)
,	9783	50.6%
;	4133	21.4%
.	2370	12.3%
%	1278	6.6%
'	1068	5.5%
:	560	2.9%
/	74	0.4%
*	24	0.1%
′	13	0.1%
…	9	< 0.1%
Other values (2)	6	< 0.1%

Decimal Number

Value	Count	Frequency (%)
2	8817	21.0%
1	8236	19.7%
4	5559	13.3%
3	5333	12.7%
5	4096	9.8%
6	2727	6.5%
0	2603	6.2%
7	1792	4.3%
8	1439	3.4%
9	1303	3.1%

Math Symbol

Value	Count	Frequency (%)
=	68	30.8%
+	45	20.4%
~	40	18.1%
<	30	13.6%
≥	18	8.1%
>	9	4.1%
≤	5	2.3%
±	3	1.4%
∼	2	0.9%
～	1	0.5%

Letter Number

Value	Count	Frequency (%)
Ⅱ	5	35.7%
Ⅲ	4	28.6%
Ⅴ	3	21.4%
Ⅳ	1	7.1%
Ⅵ	1	7.1%

Close Punctuation

Value	Count	Frequency (%)
)	7472	60.4%
]	4718	38.1%
}	190	1.5%
〕	1	< 0.1%

Open Punctuation

Value	Count	Frequency (%)
(	7498	60.6%
[	4728	38.2%
{	155	1.3%

Space Separator

Value	Count	Frequency (%)
	55508	> 99.9%
	1	< 0.1%

Final Punctuation

Value	Count	Frequency (%)
’	22	95.7%
”	1	4.3%

Initial Punctuation

Value	Count	Frequency (%)
‘	5	83.3%
“	1	16.7%

Dash Punctuation

Value	Count	Frequency (%)
-	30740	100.0%

Other Symbol

Value	Count	Frequency (%)
°	52	100.0%

Format

Value	Count	Frequency (%)
	2	100.0%

Modifier Symbol

Value	Count	Frequency (%)
´	2	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	458722	69.0%
Common	172540	25.9%
Hangul	33257	5.0%
Greek	380	0.1%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
이	2739	8.2%
물	1280	3.8%
합	1239	3.7%
한	1185	3.6%
유	1181	3.6%
혼	1174	3.5%
함	1163	3.5%
상	1149	3.5%
및	1144	3.4%
를	1143	3.4%
Other values (307)	19860	59.7%

Latin

Value	Count	Frequency (%)
o	43768	9.5%
e	41715	9.1%
i	34971	7.6%
a	33846	7.4%
n	33211	7.2%
t	32866	7.2%
l	28574	6.2%
r	26903	5.9%
h	22823	5.0%
y	20622	4.5%
Other values (48)	139423	30.4%

Common

Value	Count	Frequency (%)
	55508	32.2%
-	30740	17.8%
,	9783	5.7%
2	8817	5.1%
1	8236	4.8%
(	7498	4.3%
)	7472	4.3%
4	5559	3.2%
3	5333	3.1%
[	4728	2.7%
Other values (39)	28866	16.7%

Greek

Value	Count	Frequency (%)
α	209	55.0%
β	53	13.9%
κ	41	10.8%
ω	33	8.7%
η	22	5.8%
μ	9	2.4%
λ	7	1.8%
γ	5	1.3%
ε	1	0.3%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	631109	94.9%
Hangul	33255	5.0%
None	443	0.1%
Punctuation	51	< 0.1%
Math Operators	25	< 0.1%
Number Forms	14	< 0.1%
Compat Jamo	2	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
	55508	8.8%
o	43768	6.9%
e	41715	6.6%
i	34971	5.5%
a	33846	5.4%
n	33211	5.3%
t	32866	5.2%
-	30740	4.9%
l	28574	4.5%
r	26903	4.3%
Other values (75)	269007	42.6%

Hangul

Value	Count	Frequency (%)
이	2739	8.2%
물	1280	3.8%
합	1239	3.7%
한	1185	3.6%
유	1181	3.6%
혼	1174	3.5%
함	1163	3.5%
상	1149	3.5%
및	1144	3.4%
를	1143	3.4%
Other values (306)	19858	59.7%

None

Value	Count	Frequency (%)
α	209	47.2%
β	53	12.0%
°	52	11.7%
κ	41	9.3%
ω	33	7.4%
η	22	5.0%
μ	9	2.0%
λ	7	1.6%
γ	5	1.1%
±	3	0.7%
Other values (7)	9	2.0%

Punctuation

Value	Count	Frequency (%)
’	22	43.1%
′	13	25.5%
…	9	17.6%
‘	5	9.8%
“	1	2.0%
”	1	2.0%

Math Operators

Value	Count	Frequency (%)
≥	18	72.0%
≤	5	20.0%
∼	2	8.0%

Number Forms

Value	Count	Frequency (%)
Ⅱ	5	35.7%
Ⅲ	4	28.6%
Ⅴ	3	21.4%
Ⅳ	1	7.1%
Ⅵ	1	7.1%

Compat Jamo

Value	Count	Frequency (%)
ㆍ	2	100.0%

유해물질확장 참조명
Text

MISSING

Distinct	1332
Distinct (%)	25.2%
Missing	2296
Missing (%)	30.3%
Memory size	59.3 KiB

Length

Max length	84
Median length	76
Mean length	10.652972
Min length	1

Characters and Unicode

Total characters	56269
Distinct characters	44
Distinct categories	8
Distinct scripts	2
Distinct blocks	1

Unique

Unique	953
Unique (%)	18.0%

Sample

1st row	H317H412
2nd row	H318
3rd row	H332H302
4th row	H272H361dH302H400H410
5th row	H373 **H411

Value	Count	Frequency (%)
3	501	8.3%
2b	314	5.2%
h350	290	4.8%
h400h410	212	3.5%
h413	178	2.9%
h220h350h340	148	2.4%
h411	142	2.3%
h350h340h304	140	2.3%
1	120	2.0%
h317	113	1.9%
Other values (1207)	3894	64.3%

Most occurring characters

Value	Count	Frequency (%)
H	12751	22.7%
3	10878	19.3%
1	7602	13.5%
0	7238	12.9%
4	4677	8.3%
2	3782	6.7%
5	1988	3.5%
7	1608	2.9%
*	1437	2.6%
	770	1.4%
Other values (34)	3538	6.3%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	39271	69.8%
Uppercase Letter	13378	23.8%
Other Punctuation	1462	2.6%
Lowercase Letter	1237	2.2%
Space Separator	770	1.4%
Close Punctuation	73	0.1%
Open Punctuation	73	0.1%
Dash Punctuation	5	< 0.1%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
i	134	10.8%
f	124	10.0%
e	104	8.4%
t	102	8.2%
s	99	8.0%
d	98	7.9%
r	80	6.5%
n	78	6.3%
o	73	5.9%
a	68	5.5%
Other values (12)	277	22.4%

Decimal Number

Value	Count	Frequency (%)
3	10878	27.7%
1	7602	19.4%
0	7238	18.4%
4	4677	11.9%
2	3782	9.6%
5	1988	5.1%
7	1608	4.1%
8	558	1.4%
6	541	1.4%
9	399	1.0%

Uppercase Letter

Value	Count	Frequency (%)
H	12751	95.3%
B	314	2.3%
D	160	1.2%
A	83	0.6%
F	65	0.5%
I	5	< 0.1%

Other Punctuation

Value	Count	Frequency (%)
*	1437	98.3%
,	25	1.7%

Space Separator

Value	Count	Frequency (%)
	770	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	73	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	73	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	5	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	41654	74.0%
Latin	14615	26.0%

Most frequent character per script

Latin

Value	Count	Frequency (%)
H	12751	87.2%
B	314	2.1%
D	160	1.1%
i	134	0.9%
f	124	0.8%
e	104	0.7%
t	102	0.7%
s	99	0.7%
d	98	0.7%
A	83	0.6%
Other values (18)	646	4.4%

Common

Value	Count	Frequency (%)
3	10878	26.1%
1	7602	18.3%
0	7238	17.4%
4	4677	11.2%
2	3782	9.1%
5	1988	4.8%
7	1608	3.9%
*	1437	3.4%
	770	1.8%
8	558	1.3%
Other values (6)	1116	2.7%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	56269	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
H	12751	22.7%
3	10878	19.3%
1	7602	13.5%
0	7238	12.9%
4	4677	8.3%
2	3782	6.7%
5	1988	3.5%
7	1608	2.9%
*	1437	2.6%
	770	1.4%
Other values (34)	3538	6.3%

Interactions

[Interaction plot: 유해물질확장구분 vs. 유해물질확장구분]

Missing values

Count: A simple visualization of nullity by column.

Matrix: The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

Heatmap: The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
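
Nullity correlation can be illustrated as the Pearson correlation between two missing-value indicator columns. The sketch below uses only the 20 rows shown in this report's first-rows and last-rows samples, so its result is not the value plotted in the heatmap; the helper `pearson` and the variable names are illustrative.

```python
import math

# 1 = value missing (<NA>), 0 = value present, for the 10 first rows
# followed by the 10 last rows shown in the sample tables.
item_missing = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
                0, 0, 0, 0, 1, 0, 0, 0, 0, 0]       # 유해물질확장항목
reference_missing = [0] * 10 + [1] * 10              # 유해물질확장 참조명


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


nullity_corr = pearson(item_missing, reference_missing)
print(f"{nullity_corr:.4f}")
```

A value near 1 would mean the two columns tend to be missing together, a value near -1 that one tends to be present when the other is missing, and a value near 0 that their missingness is unrelated.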

Sample

First rows

	유해물질확장구분	유해물질확장항목	유해물질확장 물질명	유해물질확장 참조명
0	1	<NA>	trisodium bis[(3'-nitro-5'-sulfonato(6-amino-2-[4-(2-hydroxy-1-naphtylazo)phenylsulfonylamino]pyrimidin-5-azo)benzene-2',4-diolato)]chromate (III)	H317H412
1	1	-	potassium tetrasodium bis[(N,N'-n)-1'-(phenylcarbamoyl)-3,5-disulfonatobenzeneazo-1'-prop-1'-ene-2,2'-diolato]chromate(III)	H318
2	1	1313-13-9	manganese dioxide	H332H302
3	1	7722-64-7	potassium permanganate	H272H361dH302H400H410
4	1	7785-87-7	manganese sulphate	H373 **H411
5	1	116633-53-5	bis(N,N',N''-trimethyl-1,4,7-triazacyclononane)-trioxo-dimanganese (IV) di(hexafluorophosphate) monohydrate	H411
6	1	<NA>	reaction mass of: tri-sodium [29H, 31H-phthalocyanine-C,C,C-trisulfonato (6-)-N29,N30,N31,N32] manganate (3-); tetrasodium [29H,31H-phthalocyanine-C,C,C,C-tetrasulfonato (6-)-N29,N30,N31,N32], manganate (3-); pentasodium [29H,31H-phthalocyanine-C,C,C,C,C-pentasulfonato (6-)-N29,N30,N31,N32] manganate (3-)	H400H410
7	1	100011-37-8	(η-cumene)-(η-cyclopentadienyl)iron(II) hexafluoroantimonate	H302H318H412
8	1	117549-13-0	(η-cumene)-(η-cyclopentadienyl)iron(II) trifluoromethane-sulfonate	H302H412
9	1	7720-78-7	iron (II) sulfate	H302H315H319

Last rows

	유해물질확장구분	유해물질확장항목	유해물질확장 물질명	유해물질확장 참조명
7568	14	510-15-6	클로로벤질레이트 [Chlorobenzilate]	<NA>
7569	14	126-72-7	트리스(2,3-디브로모 프로필)포스페이트 [Tris(2,3-dibromopropyl)phosphate]	<NA>
7570	14	52645-53-1	퍼메트린 [Permethrin]	<NA>
7571	14	814-49-3	디에틸에스테르 클로로인산 [Phosphorochloridic acid diethyl ester]	<NA>
7572	14	<NA>	중크롬산염류 [Dichromicacid,salts](본고시에서별도로규정한물질은제외)	<NA>
7573	14	7778-50-9	Potassium dichromate	<NA>
7574	14	2151163	Ammonium dichromate	<NA>
7575	14	13530-68-2	중크롬산 [Dichromic acid]	<NA>
7576	14	10588-01-9	Sodium dichromate	<NA>
7577	14	15586-38-6	Nickel dichromate	<NA>

Duplicate rows

Most frequently occurring

	유해물질확장구분	유해물질확장항목	유해물질확장 물질명	유해물질확장 참조명	# duplicates
3	10	<NA>	<NA>	<NA>	18
0	4	9016-45-9	Nonylphenol polyethylene glycol ether ; nonylphenol ethoxylate	<NA>	2
1	4	93894-07-6	카드뮴비스(오-노닐페놀레이트)	<NA>	2
2	9	<NA>	1~2까지의 화학물질 중 하나를 0.1%이상 함유한 혼합물	<NA>	2
4	14	101007-06-1	아크린아트린 [Acrinathrin]	<NA>	2
5	14	107-02-8	아크롤레인 [Acrolein]	<NA>	2
6	14	115-29-7	엔도술판 [Endosulfan]	<NA>	2
7	14	116-06-3	알디캅 [Aldicarb]	<NA>	2
8	14	13171-21-6	포스파미돈 [Phosphamidon]	<NA>	2
9	14	133-06-2	켑탄 [Captan]	<NA>	2
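
A table of this shape can be produced by counting how often each fully identical row occurs. The sketch below uses a handful of made-up rows in the layout of the table above, with `None` standing in for `<NA>`; it illustrates the counting idea only and does not reproduce the report's counts or its exact duplicate definition.

```python
from collections import Counter

# Illustrative rows: (유해물질확장구분, 유해물질확장항목, 물질명, 참조명).
rows = [
    (10, None, None, None),
    (14, "133-06-2", "켑탄 [Captan]", None),
    (10, None, None, None),
    (14, "133-06-2", "켑탄 [Captan]", None),
    (14, "7778-50-9", "Potassium dichromate", None),
    (10, None, None, None),
]

# Count occurrences of each distinct row, then keep those seen more than once.
row_counts = Counter(rows)
duplicated = {row: n for row, n in row_counts.items() if n > 1}

# Most frequent duplicates first.
for row, n in sorted(duplicated.items(), key=lambda item: -item[1]):
    print(row, n)
```

Rows are compared across all four columns, so two records count as duplicates only when every field matches, including fields that are both missing.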
