gimi9 Pandas Profiling

Dataset statistics

Number of variables	4
Number of observations	10000
Missing cells	0
Missing cells (%)	0.0%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	410.2 KiB
Average record size in memory	42.0 B
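
The average record size is the total in-memory size divided by the number of observations. A minimal check of that arithmetic, assuming the report uses binary units (1 KiB = 1024 bytes):

```python
# Figures copied from the "Dataset statistics" table above.
total_kib = 410.2
n_rows = 10_000

# Assumption: 1 KiB = 1024 bytes.
avg_bytes = total_kib * 1024 / n_rows
print(f"{avg_bytes:.1f} B")
```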

Variable types

Text	2
Categorical	1
Numeric	1

Dataset

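Description	Among the notified diseases registered for Medical Aid recipients, the diagnosis codes of rare and intractable diseases, classified by group and by serial number within each group. The columns are 상병기호 (diagnosis code), 그룹 (group), 그룹내 일련번호 (serial number within group), and 순번 (sequence number).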
URL	https://www.data.go.kr/data/15121404/fileData.do

Alerts

`상병기호` has unique values	Unique
`순번` has 2558 (25.6%) zeros	Zeros

Reproduction

Analysis started	2023-12-12 23:00:13.196514
Analysis finished	2023-12-12 23:00:14.154768
Duration	0.96 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json

상병기호
Text

UNIQUE

Distinct	10000
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	5
Median length	4
Mean length	3.8584
Min length	3

Characters and Unicode

Total characters	38584
Distinct characters	35
Distinct categories	2
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
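
As a rough illustration of how such character properties can be read, the sketch below uses Python's standard `unicodedata` module on the five sample values listed for this column. It is not the profiler's own implementation, only a small stand-in for the same idea.

```python
import unicodedata
from collections import Counter

# The five sample values listed for this column in the report.
samples = ["M872", "T71", "M354", "M813", "Z891"]

# unicodedata.category returns a two-letter general category code:
# "Lu" = Uppercase Letter, "Nd" = Decimal Number.
category_counts = Counter(
    unicodedata.category(ch) for value in samples for ch in value
)
lengths = [len(value) for value in samples]

print(category_counts)
print(min(lengths), max(lengths))
```

In the full column the Uppercase Letter count (10000) equals the number of rows, which is consistent with each code carrying exactly one leading letter.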

Unique

Unique	10000
Unique (%)	100.0%

Sample

1st row	M872
2nd row	T71
3rd row	M354
4th row	M813
5th row	Z891

Value	Count	Frequency (%)
m872	1	< 0.1%
q922	1	< 0.1%
x995	1	< 0.1%
v31	1	< 0.1%
t631	1	< 0.1%
x108	1	< 0.1%
g620	1	< 0.1%
g618	1	< 0.1%
a666	1	< 0.1%
y497	1	< 0.1%
Other values (9990)	9990	99.9%

Most occurring characters

Value	Count	Frequency (%)
0	3531	9.2%
1	3289	8.5%
2	3155	8.2%
3	3028	7.8%
8	2880	7.5%
4	2826	7.3%
9	2654	6.9%
5	2585	6.7%
6	2393	6.2%
7	2243	5.8%
Other values (25)	10000	25.9%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	28584	74.1%
Uppercase Letter	10000	25.9%

Most frequent character per category

Uppercase Letter

Value	Count	Frequency (%)
X	736	7.4%
W	655	6.6%
V	608	6.1%
Y	558	5.6%
T	538	5.4%
Q	494	4.9%
Z	492	4.9%
S	492	4.9%
M	455	4.5%
D	386	3.9%
Other values (15)	4586	45.9%

Decimal Number

Value	Count	Frequency (%)
0	3531	12.4%
1	3289	11.5%
2	3155	11.0%
3	3028	10.6%
8	2880	10.1%
4	2826	9.9%
9	2654	9.3%
5	2585	9.0%
6	2393	8.4%
7	2243	7.8%

Most occurring scripts

Value	Count	Frequency (%)
Common	28584	74.1%
Latin	10000	25.9%

Most frequent character per script

Latin

Value	Count	Frequency (%)
X	736	7.4%
W	655	6.6%
V	608	6.1%
Y	558	5.6%
T	538	5.4%
Q	494	4.9%
Z	492	4.9%
S	492	4.9%
M	455	4.5%
D	386	3.9%
Other values (15)	4586	45.9%

Common

Value	Count	Frequency (%)
0	3531	12.4%
1	3289	11.5%
2	3155	11.0%
3	3028	10.6%
8	2880	10.1%
4	2826	9.9%
9	2654	9.3%
5	2585	9.0%
6	2393	8.4%
7	2243	7.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	38584	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	3531	9.2%
1	3289	8.5%
2	3155	8.2%
3	3028	7.8%
8	2880	7.5%
4	2826	7.3%
9	2654	6.9%
5	2585	6.7%
6	2393	6.2%
7	2243	5.8%
Other values (25)	10000	25.9%

그룹
Categorical

Distinct	3
Distinct (%)	< 0.1%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

33	8382
11	916
22	702

Length

Max length	2
Median length	2
Mean length	2
Min length	2

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	33
2nd row	33
3rd row	22
4th row	33
5th row	33

Common Values

Value	Count	Frequency (%)
33	8382	83.8%
11	916	9.2%
22	702	7.0%

Length

Histogram of lengths of the category

Common Values (Plot)

그룹내 일련번호
Text

Distinct	8474
Distinct (%)	84.7%
Missing	0
Missing (%)	0.0%
Memory size	156.2 KiB

Length

Max length	34
Median length	22
Mean length	15.8445
Min length	1

Characters and Unicode

Total characters	158445
Distinct characters	715
Distinct categories	9
Distinct scripts	3
Distinct blocks	3

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
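
For mixed Korean text, the general category of individual characters explains the category names that appear in the tables of this section. The sketch below looks up a few characters from those tables with the standard `unicodedata` module. It assumes the middle dot in the data is U+00B7, which cannot be confirmed from the report alone.

```python
import unicodedata

# Characters taken from the frequency tables in this section.
# Assumption: the middle dot is U+00B7 (MIDDLE DOT).
chars = ["의", " ", "(", ")", "-", "%", "\u00b7", "+", "N", "1"]

categories = {ch: unicodedata.category(ch) for ch in chars}

for ch, code in categories.items():
    # unicodedata.name raises ValueError for unnamed code points, hence the default.
    name = unicodedata.name(ch, "<unnamed>")
    print(f"{ch!r}\t{code}\t{name}")
```

Hangul syllables have no case, so they fall under "Lo" (Other Letter) rather than an uppercase or lowercase category.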

Unique

Unique	8267
Unique (%)	82.7%

Sample

1st row	이전의 외상에 의한 골괴사증
2nd row	질식
3rd row	결합조직의기타전신성침습
4th row	수술후 흡수불량성 골다공증
5th row	손 및 손목의 후천성 부재

Value	Count	Frequency (%)
및	2640	6.5%
기타	2146	5.3%
상세불명의	1041	2.6%
또는	598	1.5%
의한	579	1.4%
명시된	504	1.2%
다친	472	1.2%
충돌로	417	1.0%
악성신생물	404	1.0%
장애	393	1.0%
Other values (6501)	31220	77.3%

Most occurring characters

Value	Count	Frequency (%)
(space)	31067	19.6%
의	5724	3.6%
기	3761	2.4%
에	3632	2.3%
성	3369	2.1%
상	2909	1.8%
및	2827	1.8%
서	2451	1.5%
타	2427	1.5%
장	2130	1.3%
Other values (705)	98148	61.9%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	125317	79.1%
Space Separator	31067	19.6%
Open Punctuation	608	0.4%
Close Punctuation	597	0.4%
Uppercase Letter	549	0.3%
Decimal Number	163	0.1%
Dash Punctuation	118	0.1%
Other Punctuation	25	< 0.1%
Math Symbol	1	< 0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
의	5724	4.6%
기	3761	3.0%
에	3632	2.9%
성	3369	2.7%
상	2909	2.3%
및	2827	2.3%
서	2451	2.0%
타	2427	1.9%
장	2130	1.7%
명	1999	1.6%
Other values (664)	94088	75.1%

Uppercase Letter

Value	Count	Frequency (%)
N	124	22.6%
S	109	19.9%
O	99	18.0%
C	37	6.7%
E	30	5.5%
B	24	4.4%
I	24	4.4%
X	15	2.7%
A	14	2.6%
V	13	2.4%
Other values (12)	60	10.9%

Decimal Number

Value	Count	Frequency (%)
1	38	23.3%
2	32	19.6%
3	21	12.9%
0	16	9.8%
4	15	9.2%
9	13	8.0%
6	11	6.7%
5	7	4.3%
7	6	3.7%
8	4	2.5%

Other Punctuation

Value	Count	Frequency (%)
%	15	60.0%
.	8	32.0%
·	1	4.0%
/	1	4.0%

Space Separator

Value	Count	Frequency (%)
(space)	31067	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	608	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	597	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	118	100.0%

Math Symbol

Value	Count	Frequency (%)
+	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	125317	79.1%
Common	32579	20.6%
Latin	549	0.3%

Most frequent character per script

Hangul

Value	Count	Frequency (%)
의	5724	4.6%
기	3761	3.0%
에	3632	2.9%
성	3369	2.7%
상	2909	2.3%
및	2827	2.3%
서	2451	2.0%
타	2427	1.9%
장	2130	1.7%
명	1999	1.6%
Other values (664)	94088	75.1%

Latin

Value	Count	Frequency (%)
N	124	22.6%
S	109	19.9%
O	99	18.0%
C	37	6.7%
E	30	5.5%
B	24	4.4%
I	24	4.4%
X	15	2.7%
A	14	2.6%
V	13	2.4%
Other values (12)	60	10.9%

Common

Value	Count	Frequency (%)
(space)	31067	95.4%
(	608	1.9%
)	597	1.8%
-	118	0.4%
1	38	0.1%
2	32	0.1%
3	21	0.1%
0	16	< 0.1%
%	15	< 0.1%
4	15	< 0.1%
Other values (9)	52	0.2%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	125317	79.1%
ASCII	33127	20.9%
None	1	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
(space)	31067	93.8%
(	608	1.8%
)	597	1.8%
N	124	0.4%
-	118	0.4%
S	109	0.3%
O	99	0.3%
1	38	0.1%
C	37	0.1%
2	32	0.1%
Other values (30)	298	0.9%

Hangul

Value	Count	Frequency (%)
의	5724	4.6%
기	3761	3.0%
에	3632	2.9%
성	3369	2.7%
상	2909	2.3%
및	2827	2.3%
서	2451	2.0%
타	2427	1.9%
장	2130	1.7%
명	1999	1.6%
Other values (664)	94088	75.1%

None

Value	Count	Frequency (%)
·	1	100.0%

순번
Real number (ℝ)

ZEROS

Distinct	255
Distinct (%)	2.5%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	122.1933

Minimum	0
Maximum	298
Zeros	2558
Zeros (%)	25.6%
Negative	0
Negative (%)	0.0%
Memory size	166.0 KiB

Quantile statistics

Minimum	0
5-th percentile	0
Q1	0
median	99
Q3	244
95-th percentile	289
Maximum	298
Range	298
Interquartile range (IQR)	244

Descriptive statistics

Standard deviation	117.41211
Coefficient of variation (CV)	0.96087188
Kurtosis	-1.6823419
Mean	122.1933
Median Absolute Deviation (MAD)	99
Skewness	0.20072777
Sum	1221933
Variance	13785.602
Monotonicity	Not monotonic
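
Several of the figures above are linked by simple identities (mean = sum / n, CV = standard deviation / mean, variance = standard deviation squared, IQR = Q3 − Q1). The sketch below re-derives them from the numbers printed in this report. It is only a consistency check on the reported values, not a recomputation from the raw data.

```python
import math

# Figures copied from the report tables for this variable.
n = 10_000
total = 1_221_933
std = 117.41211
zeros = 2558
q1, q3 = 0, 244
minimum, maximum = 0, 298

mean = total / n               # Mean = Sum / number of observations
cv = std / mean                # Coefficient of variation
variance = std ** 2            # Variance = standard deviation squared
iqr = q3 - q1                  # Interquartile range
value_range = maximum - minimum
zeros_pct = zeros / n * 100    # Share of zero values in percent

print(mean, round(cv, 6), round(variance, 2), iqr, value_range, round(zeros_pct, 1))
```

The recomputed variance differs from the reported one in the last digits because the standard deviation shown in the report is already rounded.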

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
0	2558	25.6%
7	397	4.0%
1	320	3.2%
281	320	3.2%
270	258	2.6%
199	251	2.5%
298	246	2.5%
2	236	2.4%
11	177	1.8%
96	158	1.6%
Other values (245)	5079	50.8%

Minimum 10 values

Value	Count	Frequency (%)
0	2558	25.6%
1	320	3.2%
2	236	2.4%
3	42	0.4%
4	54	0.5%
5	46	0.5%
6	78	0.8%
7	397	4.0%
8	46	0.5%
9	55	0.5%

Maximum 10 values

Value	Count	Frequency (%)
298	246	2.5%
297	90	0.9%
296	3	< 0.1%
295	5	0.1%
294	19	0.2%
293	6	0.1%
292	52	0.5%
291	1	< 0.1%
290	70	0.7%
289	34	0.3%

Interactions (scatter plot): 순번 vs. 순번

Phik (φk)
Auto

Heatmap
Table

	그룹	순번
그룹	1.000	0.542
순번	0.542	1.000

Heatmap
Table

	순번	그룹
순번	1.000	0.386
그룹	0.386	1.000

Count
Matrix

A simple visualization of nullity by column.

Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
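
To make the idea of a nullity matrix concrete, here is a minimal text-only sketch. The rows and column names are hypothetical toy data (the profiled dataset itself has no missing cells), and the rendering is a simplification, not the plot the profiler produces.

```python
# Hypothetical toy rows, not taken from the dataset.
rows = [
    {"code": "A00", "group": "11", "seq": 1},
    {"code": "B01", "group": None, "seq": 2},
    {"code": None, "group": "33", "seq": None},
]
columns = ["code", "group", "seq"]


def nullity_matrix(rows, columns):
    """Return one string per row: '#' where a value is present, '.' where it is missing."""
    return [
        "".join("#" if row.get(col) is not None else "." for col in columns)
        for row in rows
    ]


matrix = nullity_matrix(rows, columns)
print("\n".join(matrix))
```

For a dataset without missing cells, such as the one profiled here, every row would render as a solid run of `#`.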

First rows

	상병기호	그룹	그룹내 일련번호	순번
4919	M872	33	이전의 외상에 의한 골괴사증	210
14195	T71	33	질식	287
13188	M354	22	결합조직의기타전신성침습	65
13808	M813	33	수술후 흡수불량성 골다공증	208
10154	Z891	33	손 및 손목의 후천성 부재	298
5718	Q961	22	터저증후군	98
13442	Q402	33	기타 명시된 위의 선천성 기형	259
584	Y27	33	의도 미확인의 증기 및 고온물체에 접촉	0
11291	Y183	33	운동 및 경기장에서 살충제에 의한 의도 미확인의 중	0
5151	O31	33	다태 임신에 특이한 합병증	239

Last rows

	상병기호	그룹	그룹내 일련번호	순번
4060	E35	33	달리 분류된 질환에서의 내분비선 장애	111
11256	W010	33	주거지에서 미끌림 걸림 및 헛디딤에 의한 동일 면상	0
6341	Z655	33	재앙 전쟁 및 기타 적대행위에 노출	298
7516	K350	33	전신성 복막염을 동반한 급성 충수염	186
14191	T703	33	잠함병(감압병)	287
12408	I458	11	기타 명시된 전도 장애	11
1475	I23	11	급성 심근경색증에 의한 특정 현재 합병증	11
10560	C939	22	악성신생물(암)	6
13066	S620	33	손의 주상골의 골절	274
4925	M888	33	기타 뼈의 파젯병	210
