gimi9 Pandas Profiling

Dataset statistics

Number of variables	3
Number of observations	100
Missing cells	0
Missing cells (%)	0.0%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	2.6 KiB
Average record size in memory	26.3 B

Variable types

Text	1
Categorical	2

Dataset

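Description	Demographic data for a cohort extracted from all data stored in the hospital information system: patients with ICD-10 diagnosis codes F101, F102, F103, F104, or F109, and patients with diagnosis codes K700, K701, K703, K7030, K7031, K7041, or K709. Each patient's age and sex at the time of first prescription can be used to analyze characteristics by age group and by sex. SEX: 0 = male, 1 = female.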
Author	The Catholic University of Korea, Seoul St. Mary's Hospital
URL	http://cmcdata.net/data/dataset/demographic-data-alcohol-use-disorder

Alerts

`SEX` is highly imbalanced (53.1%)	Imbalance
`RID` has unique values	Unique
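
The 53.1% figure in the imbalance alert can be reproduced from the SEX counts reported further down (90 rows coded 0, 10 rows coded 1). The sketch below assumes the score is one minus the normalized Shannon entropy of the class distribution; that definition matches the reported number, but it is an assumption here and not taken from the report. The function name `imbalance_score` is hypothetical.

```python
import math
from collections import Counter


def imbalance_score(values):
    """Return 1 - H / log2(k), where H is the Shannon entropy in bits of the
    value distribution and k is the number of distinct values.

    0.0 means perfectly balanced classes; values near 1.0 mean one class
    dominates. (Assumed definition, see the lead-in.)
    """
    counts = Counter(values)
    k = len(counts)
    if k < 2:
        return 0.0  # convention chosen for this sketch: one class -> 0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - entropy / math.log2(k)


# SEX counts taken from the report: 90 rows coded 0, 10 rows coded 1.
sex = [0] * 90 + [1] * 10
score = imbalance_score(sex)
print(f"{score:.1%}")  # 53.1%
```

A 50/50 split would score 0%, so a 90/10 split landing near the middle of the scale is expected under this definition.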

Reproduction

Analysis started	2023-10-08 18:56:20.154028
Analysis finished	2023-10-08 18:56:20.562791
Duration	0.41 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json

RID
Text

UNIQUE

Distinct	100
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	932.0 B

Length

Max length	8
Median length	8
Mean length	8
Min length	8

Characters and Unicode

Total characters	800
Distinct characters	11
Distinct categories	2
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
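
As a rough illustration of how such character properties can be read, the sketch below tallies Unicode general categories for the five RID values listed under Sample, using Python's standard `unicodedata` module. It is not the profiler's own implementation. The standard library exposes general categories (`Lu` for Uppercase Letter, `Nd` for Decimal Number) but not scripts or blocks, so only the category breakdown is shown.

```python
import unicodedata
from collections import Counter

# The five RID values shown in the report's Sample section.
sample_ids = ["R0000002", "R0000003", "R0000004", "R0000006", "R0000008"]

# Count the general category of every character across the sample.
category_counts = Counter(
    unicodedata.category(ch) for rid in sample_ids for ch in rid
)

# Distinct characters seen in the sample (the full column has 11).
distinct_chars = sorted({ch for rid in sample_ids for ch in rid})

print(category_counts)  # Counter({'Nd': 35, 'Lu': 5})
print(distinct_chars)   # ['0', '2', '3', '4', '6', '8', 'R']
```

Each 8-character ID contributes one uppercase letter and seven digits, which is the same 12.5% / 87.5% split the report gives for the full column (100 and 700 of 800 characters).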

Unique

Unique	100
Unique (%)	100.0%

Sample

1st row	R0000002
2nd row	R0000003
3rd row	R0000004
4th row	R0000006
5th row	R0000008

Value	Count	Frequency (%)
r0000002	1	1.0%
r0000109	1	1.0%
r0000133	1	1.0%
r0000129	1	1.0%
r0000128	1	1.0%
r0000125	1	1.0%
r0000122	1	1.0%
r0000118	1	1.0%
r0000116	1	1.0%
r0000114	1	1.0%
Other values (90)	90	90.0%

Most occurring characters

Value	Count	Frequency (%)
0	479	59.9%
R	100	12.5%
1	65	8.1%
6	24	3.0%
2	22	2.8%
5	21	2.6%
4	21	2.6%
3	21	2.6%
7	17	2.1%
9	16	2.0%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	700	87.5%
Uppercase Letter	100	12.5%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	479	68.4%
1	65	9.3%
6	24	3.4%
2	22	3.1%
5	21	3.0%
4	21	3.0%
3	21	3.0%
7	17	2.4%
9	16	2.3%
8	14	2.0%

Uppercase Letter

Value	Count	Frequency (%)
R	100	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	700	87.5%
Latin	100	12.5%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	479	68.4%
1	65	9.3%
6	24	3.4%
2	22	3.1%
5	21	3.0%
4	21	3.0%
3	21	3.0%
7	17	2.4%
9	16	2.3%
8	14	2.0%

Latin

Value	Count	Frequency (%)
R	100	100.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	800	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	479	59.9%
R	100	12.5%
1	65	8.1%
6	24	3.0%
2	22	2.8%
5	21	2.6%
4	21	2.6%
3	21	2.6%
7	17	2.1%
9	16	2.0%

Age_grp
Categorical

Distinct	6
Distinct (%)	6.0%
Missing	0
Missing (%)	0.0%
Memory size	932.0 B

50대	35
40대	25
60대	18
30대	13
70대	8

Length

Max length	3
Median length	3
Mean length	3
Min length	3

Unique

Unique	1
Unique (%)	1.0%

Sample

1st row	60대
2nd row	50대
3rd row	70대
4th row	30대
5th row	50대

Common Values

Value	Count	Frequency (%)
50대	35	35.0%
40대	25	25.0%
60대	18	18.0%
30대	13	13.0%
70대	8	8.0%
10대	1	1.0%

Length

Histogram of lengths of the category

Common Values (Plot)

SEX
Categorical

IMBALANCE

Distinct	2
Distinct (%)	2.0%
Missing	0
Missing (%)	0.0%
Memory size	932.0 B

0	90
1	10

Length

Max length	1
Median length	1
Mean length	1
Min length	1

Unique

Unique	0
Unique (%)	0.0%

Sample

1st row	0
2nd row	1
3rd row	0
4th row	1
5th row	0

Common Values

Value	Count	Frequency (%)
0	90	90.0%
1	10	10.0%

Length

Histogram of lengths of the category

Common Values (Plot)

Heatmap
Table

	RID	Age_grp	SEX
RID	1.000	1.000	1.000
Age_grp	1.000	1.000	0.123
SEX	1.000	0.123	1.000

Heatmap
Table

	Age_grp	SEX
Age_grp	1.000	0.084
SEX	0.084	1.000

Heatmap
Table

	Age_grp	SEX
Age_grp	1.000	0.084
SEX	0.084	1.000
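
The report does not say which association measure each matrix uses. For two categorical variables such as Age_grp and SEX, one common choice is Cramér's V, and the sketch below shows the plain (uncorrected) form of that statistic as an assumption about what such a matrix might contain. It is run on the ten "First rows" shown in this report only because the full data is not included here, so its result is not comparable to the 0.084 or 0.123 values above. The function name `cramers_v` is hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Plain Cramér's V between two equal-length categorical sequences.

    V = sqrt(chi2 / (n * (min(rows, cols) - 1))), with chi2 computed from
    observed vs. expected counts of the contingency table.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    chi2 = 0.0
    for x, rx in row_totals.items():
        for y, cy in col_totals.items():
            expected = rx * cy / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


# Age_grp and SEX from the ten "First rows" of the report.
age_grp = ["60대", "50대", "70대", "30대", "50대",
           "30대", "30대", "50대", "50대", "40대"]
sex = ["0", "1", "0", "1", "0", "0", "0", "0", "0", "1"]

v = cramers_v(age_grp, sex)
print(round(v, 3))
```

With only ten rows the statistic is dominated by sparse cells, which is one reason small-sample or bias-corrected variants exist.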

Count
Matrix

A simple visualization of nullity by column.

Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

First rows

	RID	Age_grp	SEX
0	R0000002	60대	0
1	R0000003	50대	1
2	R0000004	70대	0
3	R0000006	30대	1
4	R0000008	50대	0
5	R0000010	30대	0
6	R0000016	30대	0
7	R0000019	50대	0
8	R0000020	50대	0
9	R0000022	40대	1

Last rows

	RID	Age_grp	SEX
90	R0000163	70대	1
91	R0000164	60대	0
92	R0000166	50대	0
93	R0000171	40대	1
94	R0000172	60대	0
95	R0000173	50대	0
96	R0000175	40대	0
97	R0000176	60대	0
98	R0000178	40대	0
99	R0000181	30대	0
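
The dataset description states that SEX is coded 0 for male and 1 for female, and that the data is meant for analysis by age group and by sex. A minimal sketch of that decoding and tallying, using only the first ten rows shown above, is given below. The names `SEX_LABELS` and `first_rows` are hypothetical, and the Age_grp values are kept exactly as they appear in the data.

```python
from collections import Counter

# Coding stated in the dataset description: 0 = male, 1 = female.
SEX_LABELS = {"0": "male", "1": "female"}

# (RID, Age_grp, SEX) for the first ten rows of the report.
first_rows = [
    ("R0000002", "60대", "0"),
    ("R0000003", "50대", "1"),
    ("R0000004", "70대", "0"),
    ("R0000006", "30대", "1"),
    ("R0000008", "50대", "0"),
    ("R0000010", "30대", "0"),
    ("R0000016", "30대", "0"),
    ("R0000019", "50대", "0"),
    ("R0000020", "50대", "0"),
    ("R0000022", "40대", "1"),
]

# Tally by decoded sex and by age group.
sex_counts = Counter(SEX_LABELS[sex] for _, _, sex in first_rows)
age_counts = Counter(age for _, age, _ in first_rows)

print(sex_counts)                 # Counter({'male': 7, 'female': 3})
print(age_counts.most_common(2))  # [('50대', 4), ('30대', 3)]
```

These ten rows are only a slice; the full 100-row column splits 90 to 10 according to the SEX summary.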
