gimi9 Pandas Profiling

Dataset statistics

Number of variables	4
Number of observations	105
Missing cells	15
Missing cells (%)	3.6%
Duplicate rows	1
Duplicate rows (%)	1.0%
Total size in memory	3.8 KiB
Average record size in memory	37.3 B
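
The percentages above follow from the counts. A minimal Python sketch of that arithmetic, using only figures reported in this table (the variable names are illustrative, not taken from the report):

```python
# Figures copied from the "Dataset statistics" table.
n_variables = 4
n_observations = 105
missing_cells = 15
duplicate_rows = 1

# Every observation has one cell per variable.
total_cells = n_variables * n_observations

missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
```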

Variable types

Numeric	3
Categorical	1

Dataset

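Description	Contains test data that can be used to evaluate associations with diabetes and urological disease, based on blood tests performed on patients with hyperlipidemia. Time-point data were generated by using the specimen collection date and the reception date to calculate the period elapsed since the time of prescription. The test items include HbA1c, PSA (Prostate Specific Ag), free PSA, and other key tests that can be used to assess various side effects associated with hyperlipidemia, such as hepatotoxicity and nephrotoxicity. - HbA1c (glycated hemoglobin): the state in which some glucose is bound to hemoglobin inside red blood cells. A regular blood glucose test shows only the glucose level at the time of testing, whereas glycated hemoglobin shows the average blood glucose over three months. - PSA (Prostate Specific Antigen): a glycoprotein secreted by the prostate and found in semen and blood; a tumor marker for prostate cancer. - free PSA: active prostate-specific antigen.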
Author	The Catholic University of Korea, Eunpyeong St. Mary's Hospital (가톨릭대학교 은평성모병원)
URL	http://cmcdata.net/data/dataset/coexistence-disease-analysis-blood-test-data-dyslipidemia-eunpyeong

Alerts

Dataset has 1 (1.0%) duplicate rows	Duplicates
`일련번호` is highly overall correlated with `PSA_X_VAL`	High correlation
`A1C_VAL` is highly overall correlated with `PSA_X_VAL`	High correlation
`PSA_L_VAL` is highly overall correlated with `PSA_X_VAL`	High correlation
`PSA_X_VAL` is highly overall correlated with `일련번호` and 2 other fields	High correlation
`PSA_X_VAL` is highly imbalanced (86.7%)	Imbalance
`일련번호` has 5 (4.8%) missing values	Missing
`A1C_VAL` has 5 (4.8%) missing values	Missing
`PSA_L_VAL` has 5 (4.8%) missing values	Missing
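
The 86.7% imbalance figure for `PSA_X_VAL` is consistent with a normalised-entropy score computed from the category counts listed later in the report. The sketch below assumes that definition (one minus Shannon entropy divided by the maximum possible entropy); the report itself does not state the formula.

```python
import math

# Category counts for PSA_X_VAL as listed in the report:
# the "<NA>" category plus four values that each occur once.
counts = [101, 1, 1, 1, 1]

total = sum(counts)
probs = [c / total for c in counts]

# Shannon entropy in bits.
entropy = -sum(p * math.log2(p) for p in probs)

# Divide by the maximum entropy for this number of classes, then invert:
# 0 means perfectly balanced, 1 means a single class holds every row.
imbalance = 1 - entropy / math.log2(len(counts))

print(f"{imbalance:.1%}")
```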

Reproduction

Analysis started	2023-10-08 18:58:02.475671
Analysis finished	2023-10-08 18:58:04.353789
Duration	1.88 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
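
A report of this kind can be produced with ydata-profiling's `ProfileReport`. The sketch below is a hedged illustration only: the function name, file names, and title are hypothetical, and the configuration actually used for this report (config.json) is not reproduced here.

```python
def build_report(csv_path: str, html_path: str) -> None:
    """Profile a CSV file and write the report as an HTML page."""
    # Imported inside the function so the module still loads where
    # these third-party packages are not installed.
    import pandas as pd
    from ydata_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    profile = ProfileReport(df, title="Pandas Profiling")
    profile.to_file(html_path)


# Example call (both file names are hypothetical):
# build_report("blood_test.csv", "report.html")
```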

일련번호
Real number (ℝ)

HIGH CORRELATION MISSING

Distinct	100
Distinct (%)	100.0%
Missing	5
Missing (%)	4.8%
Infinite	0
Infinite (%)	0.0%
Mean	50.5

Minimum	1
Maximum	100
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	1.1 KiB

Quantile statistics

Minimum	1
5-th percentile	5.95
Q1	25.75
median	50.5
Q3	75.25
95-th percentile	95.05
Maximum	100
Range	99
Interquartile range (IQR)	49.5

Descriptive statistics

Standard deviation	29.011492
Coefficient of variation (CV)	0.57448499
Kurtosis	-1.2
Mean	50.5
Median Absolute Deviation (MAD)	25
Skewness	0
Sum	5050
Variance	841.66667
Monotonicity	Strictly increasing
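
These figures can be checked by hand. The sketch below assumes the 100 non-missing values are exactly the integers 1 to 100, which matches the reported minimum, maximum, sum, distinct count, and sample rows. The `quantile` helper is an illustrative linear-interpolation implementation, not code from the profiling library.

```python
import statistics

# Assumption: the non-missing values are the integers 1..100.
values = list(range(1, 101))


def quantile(sorted_vals, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


mean = statistics.mean(values)
variance = statistics.variance(values)  # sample variance, n - 1 denominator
std = statistics.stdev(values)
cv = std / mean                         # coefficient of variation
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)

q05, q1, q3, q95 = (quantile(values, q) for q in (0.05, 0.25, 0.75, 0.95))
iqr = q3 - q1

print(mean, round(variance, 5), round(std, 6), round(cv, 8), mad)
print(round(q05, 2), q1, q3, round(q95, 2), iqr)
```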

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
65	1	1.0%
75	1	1.0%
74	1	1.0%
73	1	1.0%
72	1	1.0%
71	1	1.0%
70	1	1.0%
69	1	1.0%
68	1	1.0%
67	1	1.0%
Other values (90)	90	85.7%
(Missing)	5	4.8%

Minimum 10 values

Value	Count	Frequency (%)
1	1	1.0%
2	1	1.0%
3	1	1.0%
4	1	1.0%
5	1	1.0%
6	1	1.0%
7	1	1.0%
8	1	1.0%
9	1	1.0%
10	1	1.0%

Maximum 10 values

Value	Count	Frequency (%)
100	1	1.0%
99	1	1.0%
98	1	1.0%
97	1	1.0%
96	1	1.0%
95	1	1.0%
94	1	1.0%
93	1	1.0%
92	1	1.0%
91	1	1.0%

A1C_VAL
Real number (ℝ)

HIGH CORRELATION MISSING

Distinct	42
Distinct (%)	42.0%
Missing	5
Missing (%)	4.8%
Infinite	0
Infinite (%)	0.0%
Mean	6.719

Minimum	4.3
Maximum	14.7
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	1.1 KiB

Quantile statistics

Minimum	4.3
5-th percentile	5.2
Q1	5.575
median	6
Q3	7.3
95-th percentile	10.41
Maximum	14.7
Range	10.4
Interquartile range (IQR)	1.725

Descriptive statistics

Standard deviation	1.7490947
Coefficient of variation (CV)	0.26032069
Kurtosis	4.5085791
Mean	6.719
Median Absolute Deviation (MAD)	0.6
Skewness	1.9242015
Sum	671.9
Variance	3.0593323
Monotonicity	Not monotonic
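
Several of the `A1C_VAL` figures are derived from others in the same tables. A short sketch of those relationships, using only numbers reported above (the variable names are illustrative):

```python
# Figures copied from the A1C_VAL tables.
n = 100                      # 105 observations minus 5 missing
total = 671.9                # Sum
minimum, maximum = 4.3, 14.7
q1, q3 = 5.575, 7.3
std = 1.7490947              # Standard deviation

mean = total / n             # Mean = Sum / count of non-missing values
value_range = maximum - minimum
iqr = q3 - q1                # Interquartile range
cv = std / mean              # Coefficient of variation
variance = std ** 2

print(round(mean, 3), round(value_range, 1), round(iqr, 3))
print(round(cv, 6), round(variance, 5))
```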

Histogram with fixed size bins (bins=42)

Value	Count	Frequency (%)
5.5	9	8.6%
5.8	8	7.6%
5.4	7	6.7%
5.7	6	5.7%
7.3	5	4.8%
6.3	5	4.8%
5.6	4	3.8%
6.0	4	3.8%
5.9	4	3.8%
5.3	3	2.9%
Other values (32)	45	42.9%
(Missing)	5	4.8%

Minimum 10 values

Value	Count	Frequency (%)
4.3	1	1.0%
4.8	1	1.0%
5.0	1	1.0%
5.2	3	2.9%
5.3	3	2.9%
5.4	7	6.7%
5.5	9	8.6%
5.6	4	3.8%
5.7	6	5.7%
5.8	8	7.6%

Maximum 10 values

Value	Count	Frequency (%)
14.7	1	1.0%
11.9	1	1.0%
11.7	1	1.0%
10.8	1	1.0%
10.6	1	1.0%
10.4	1	1.0%
10.0	1	1.0%
9.8	2	1.9%
9.1	1	1.0%
8.7	1	1.0%

PSA_L_VAL
Real number (ℝ)

HIGH CORRELATION MISSING

Distinct	79
Distinct (%)	79.0%
Missing	5
Missing (%)	4.8%
Infinite	0
Infinite (%)	0.0%
Mean	1.8223

Minimum	0.04
Maximum	20.72
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	1.1 KiB

Quantile statistics

Minimum	0.04
5-th percentile	0.1685
Q1	0.47
median	0.745
Q3	1.3675
95-th percentile	7.7635
Maximum	20.72
Range	20.68
Interquartile range (IQR)	0.8975

Descriptive statistics

Standard deviation	3.4427171
Coefficient of variation (CV)	1.8892153
Kurtosis	15.341024
Mean	1.8223
Median Absolute Deviation (MAD)	0.385
Skewness	3.8495146
Sum	182.23
Variance	11.852301
Monotonicity	Not monotonic

Histogram with fixed size bins (bins=50)

Value	Count	Frequency (%)
0.69	3	2.9%
0.65	3	2.9%
0.84	3	2.9%
1.19	3	2.9%
0.61	3	2.9%
0.04	2	1.9%
0.53	2	1.9%
0.47	2	1.9%
1.01	2	1.9%
0.22	2	1.9%
Other values (69)	75	71.4%
(Missing)	5	4.8%

Minimum 10 values

Value	Count	Frequency (%)
0.04	2	1.9%
0.08	1	1.0%
0.09	1	1.0%
0.14	1	1.0%
0.17	1	1.0%
0.18	1	1.0%
0.22	2	1.9%
0.26	1	1.0%
0.29	1	1.0%
0.3	1	1.0%

Maximum 10 values

Value	Count	Frequency (%)
20.72	1	1.0%
16.5	1	1.0%
16.1	1	1.0%
12.45	1	1.0%
12.2	1	1.0%
7.53	1	1.0%
4.69	1	1.0%
4.36	1	1.0%
3.99	1	1.0%
3.52	1	1.0%

PSA_X_VAL
Categorical

HIGH CORRELATION IMBALANCE

Distinct	5
Distinct (%)	4.8%
Missing	0
Missing (%)	0.0%
Memory size	972.0 B

<NA>	101
0.07	1
0.59	1
0.32	1
0.81	1

Length

Max length	4
Median length	4
Mean length	4
Min length	4

Unique

Unique	4
Unique (%)	3.8%

Sample

1st row	<NA>
2nd row	<NA>
3rd row	<NA>
4th row	<NA>
5th row	<NA>

Common Values

Value	Count	Frequency (%)
<NA>	101	96.2%
0.07	1	1.0%
0.59	1	1.0%
0.32	1	1.0%
0.81	1	1.0%


Phik (φk)

	일련번호	A1C_VAL	PSA_L_VAL	PSA_X_VAL
일련번호	1.000	0.000	0.000	1.000
A1C_VAL	0.000	1.000	0.314	1.000
PSA_L_VAL	0.000	0.314	1.000	1.000
PSA_X_VAL	1.000	1.000	1.000	1.000

Auto

	일련번호	A1C_VAL	PSA_L_VAL	PSA_X_VAL
일련번호	1.000	-0.030	-0.028	1.000
A1C_VAL	-0.030	1.000	-0.023	1.000
PSA_L_VAL	-0.028	-0.023	1.000	1.000
PSA_X_VAL	1.000	1.000	1.000	1.000

A simple visualization of nullity by column.

Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
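
Nullity correlation can be illustrated with the missing-value pattern visible in this report. The sketch below assumes the three numeric columns are missing in rows 100 to 104 only, as the "Last rows" sample and the per-column count of 5 suggest. `PSA_X_VAL` is left out because the report lists 0 missing values for it, and a constant indicator has no defined correlation. The `pearson` helper is illustrative, not the library's own code.

```python
import math

N_ROWS = 105
columns = ["일련번호", "A1C_VAL", "PSA_L_VAL"]

# Nullity indicator per column: 1 where the cell is missing, 0 otherwise.
# Assumption: only rows 100-104 are missing in these columns.
nullity = {name: [1 if i >= 100 else 0 for i in range(N_ROWS)] for name in columns}


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


missing_pct = {name: 100 * sum(mask) / N_ROWS for name, mask in nullity.items()}
corr = pearson(nullity["A1C_VAL"], nullity["PSA_L_VAL"])

print({name: round(pct, 1) for name, pct in missing_pct.items()})
print(round(corr, 6))
```

Because the three columns are missing in the same rows under this assumption, their nullity indicators are identical and the pairwise nullity correlation is 1.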

First rows

	일련번호	A1C_VAL	PSA_L_VAL	PSA_X_VAL
0	1	6.0	0.77	<NA>
1	2	5.5	0.47	<NA>
2	3	5.5	1.05	<NA>
3	4	5.5	1.44	<NA>
4	5	6.0	0.09	<NA>
5	6	6.3	4.69	<NA>
6	7	8.7	1.16	<NA>
7	8	11.7	0.42	<NA>
8	9	7.0	16.5	0.07
9	10	5.4	0.82	<NA>

Last rows

	일련번호	A1C_VAL	PSA_L_VAL	PSA_X_VAL
95	96	5.2	0.3	<NA>
96	97	10.0	20.72	<NA>
97	98	8.1	0.82	<NA>
98	99	5.4	0.58	<NA>
99	100	5.5	0.41	<NA>
100	<NA>	<NA>	<NA>	<NA>
101	<NA>	<NA>	<NA>	<NA>
102	<NA>	<NA>	<NA>	<NA>
103	<NA>	<NA>	<NA>	<NA>
104	<NA>	<NA>	<NA>	<NA>

Most frequently occurring

	일련번호	A1C_VAL	PSA_L_VAL	PSA_X_VAL	# duplicates
0	<NA>	<NA>	<NA>	<NA>	5
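
The single entry above is one distinct row pattern that occurs five times, which is one way to read the "1 duplicate row" alert alongside the count of 5. A small sketch using the ten "Last rows" shown in the sample (`None` stands in for `<NA>`; the grouping logic is illustrative, not the library's implementation):

```python
from collections import Counter

# The ten "Last rows" from the sample; None stands for <NA>.
rows = [
    (96, 5.2, 0.3, None),
    (97, 10.0, 20.72, None),
    (98, 8.1, 0.82, None),
    (99, 5.4, 0.58, None),
    (100, 5.5, 0.41, None),
    (None, None, None, None),
    (None, None, None, None),
    (None, None, None, None),
    (None, None, None, None),
    (None, None, None, None),
]

# Count how often each full row occurs, then keep those seen more than once.
counts = Counter(rows)
duplicated = {row: n for row, n in counts.items() if n > 1}

for row, n in duplicated.items():
    print(row, n)
```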
