gimi9 Pandas Profiling

Dataset statistics

Number of variables	3
Number of observations	1608
Missing cells	388
Missing cells (%)	8.0%
Duplicate rows	0
Duplicate rows (%)	0.0%
Total size in memory	37.8 KiB
Average record size in memory	24.1 B
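
The derived figures above follow from the raw counts. A minimal sketch of the arithmetic, using only numbers reported in this section (the variable names are illustrative and not part of ydata-profiling):

```python
# Figures copied from the report; the names are illustrative only.
n_variables = 3
n_observations = 1608
missing_cells = 388
total_size_kib = 37.8

total_cells = n_variables * n_observations           # every cell in the table
missing_pct = 100 * missing_cells / total_cells      # share of all cells that are missing
avg_record_bytes = total_size_kib * 1024 / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

All 388 missing cells fall in a single column, which is why the per-column rate for `사이트명` (388 / 1608, about 24.1%) is three times the dataset-wide rate.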

Variable types

Text	3

Dataset

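Description	A list of related sites provided in the science learning content on the National Science Museum (국립중앙과학관) website.
Author	Ministry of Science and ICT, National Science Museum (과학기술정보통신부 국립중앙과학관)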
URL	https://www.data.go.kr/data/15067815/fileData.do

Alerts

`사이트명` has 388 (24.1%) missing values	Missing
`고유 아이디` has unique values	Unique

Reproduction

Analysis started	2023-12-12 05:44:48.339787
Analysis finished	2023-12-12 05:44:48.803881
Duration	0.46 seconds
Software version	ydata-profiling v4.5.1
Download configuration	config.json
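
A report of this kind can usually be regenerated with the `ProfileReport` class of ydata-profiling. The sketch below is not the exact configuration behind this page (that is in config.json). The function name `build_report`, the file names, the encoding default, and the choice to read every column as a string are assumptions for illustration.

```python
def build_report(csv_path, html_path="report.html", encoding="utf-8"):
    """Profile a CSV file and write an HTML report (sketch under the stated assumptions)."""
    # Imported lazily so the function can be defined even where the packages are absent.
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Assumption: dtype=str keeps ID values such as "5,028" as text rather than parsing them.
    df = pd.read_csv(csv_path, dtype=str, encoding=encoding)
    profile = ProfileReport(df, title="gimi9 Pandas Profiling")
    profile.to_file(html_path)
    return profile
```

A call might look like `build_report("related_sites.csv")`, where the file name is hypothetical. If the downloaded file is not UTF-8, a different `encoding` value would be needed.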

고유 아이디
Text

UNIQUE

Distinct	1608
Distinct (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory size	12.7 KiB

Length

Max length	5
Median length	3
Mean length	3.8152985
Min length	2
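
The length statistics are computed over the string form of each value, so a thousands separator counts as a character. A small sketch using five values taken from this variable's value table (the dictionary layout is illustrative):

```python
from statistics import mean, median

# Values copied from the report; "5,028" has length 5 because the comma counts.
values = ["435", "5,028", "5,000", "2,880", "1,575"]
lengths = [len(v) for v in values]

summary = {
    "max": max(lengths),
    "median": median(lengths),
    "mean": mean(lengths),
    "min": min(lengths),
}
print(summary)
```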

Characters and Unicode

Total characters	6135
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
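
As a rough illustration of the category counts, Python's standard `unicodedata` module exposes the general category of each code point. The sketch below tallies categories for a few values from this variable. The mapping from two-letter codes to the long names used in the report covers only the two categories that appear here, and the standard library does not expose the script or block properties, so those are not shown.

```python
import unicodedata
from collections import Counter

# Long names for the only two categories seen in this variable.
CATEGORY_NAMES = {"Nd": "Decimal Number", "Po": "Other Punctuation"}

def category_counts(values):
    """Count characters per Unicode general category across all values."""
    counts = Counter()
    for value in values:
        for ch in value:
            code = unicodedata.category(ch)
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts

counts = category_counts(["435", "5,028", "2,880"])
print(counts)
```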

Unique

Unique	1608
Unique (%)	100.0%

Sample

1st row	435
2nd row	442
3rd row	447
4th row	449
5th row	459

Value	Count	Frequency (%)
435	1	0.1%
5,028	1	0.1%
5,000	1	0.1%
2,880	1	0.1%
2,879	1	0.1%
2,878	1	0.1%
2,877	1	0.1%
2,876	1	0.1%
2,873	1	0.1%
1,575	1	0.1%
Other values (1598)	1598	99.4%

Most occurring characters

Value	Count	Frequency (%)
1	994	16.2%
,	692	11.3%
2	646	10.5%
5	596	9.7%
3	525	8.6%
4	521	8.5%
9	466	7.6%
8	448	7.3%
7	426	6.9%
6	418	6.8%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	5443	88.7%
Other Punctuation	692	11.3%
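
The 692 commas suggest that the IDs were exported with thousands separators, which may be why the column is typed as Text rather than numeric. If integer IDs are wanted, one possible cleanup is sketched below. `parse_id` is a hypothetical helper, not part of the dataset or of ydata-profiling.

```python
def parse_id(raw):
    """Convert an ID such as "5,028" to the integer 5028 (hypothetical helper)."""
    cleaned = raw.replace(",", "").strip()
    if not cleaned.isdigit():
        # Fail loudly instead of guessing at unexpected formats.
        raise ValueError(f"unexpected ID format: {raw!r}")
    return int(cleaned)

ids = [parse_id(v) for v in ["435", "5,028", "2,880"]]
print(ids)
```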

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
1	994	18.3%
2	646	11.9%
5	596	10.9%
3	525	9.6%
4	521	9.6%
9	466	8.6%
8	448	8.2%
7	426	7.8%
6	418	7.7%
0	403	7.4%

Other Punctuation

Value	Count	Frequency (%)
,	692	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	6135	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
1	994	16.2%
,	692	11.3%
2	646	10.5%
5	596	9.7%
3	525	8.6%
4	521	8.5%
9	466	7.6%
8	448	7.3%
7	426	6.9%
6	418	6.8%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	6135	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
1	994	16.2%
,	692	11.3%
2	646	10.5%
5	596	9.7%
3	525	8.6%
4	521	8.5%
9	466	7.6%
8	448	7.3%
7	426	6.9%
6	418	6.8%

고유 아이디 2
Text

Distinct	723
Distinct (%)	45.0%
Missing	0
Missing (%)	0.0%
Memory size	12.7 KiB

Length

Max length	5
Median length	3
Mean length	3.7002488
Min length	1

Characters and Unicode

Total characters	5950
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

Unique

Unique	420
Unique (%)	26.1%
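
"Distinct" and "Unique" measure different things: distinct is the number of different values, while unique is the number of values that occur exactly once. A minimal sketch of the distinction, using the ten `고유 아이디 2` values shown in the first rows of the sample (the function name is illustrative):

```python
from collections import Counter

def distinct_and_unique(values):
    """Return (number of different values, number of values seen exactly once)."""
    counts = Counter(values)
    distinct = len(counts)
    unique = sum(1 for c in counts.values() if c == 1)
    return distinct, unique

sample = ["189", "188", "186", "185", "183", "181", "181", "181", "177", "256"]
print(distinct_and_unique(sample))
```

Here `181` appears three times, so it counts toward distinct but not toward unique. The same logic explains 723 distinct versus 420 unique values over the full column.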

Sample

1st row	189
2nd row	188
3rd row	186
4th row	185
5th row	183

Value	Count	Frequency (%)
181	15	0.9%
1,217	12	0.7%
388	11	0.7%
314	10	0.6%
188	10	0.6%
313	10	0.6%
378	9	0.6%
309	9	0.6%
387	8	0.5%
1,078	8	0.5%
Other values (713)	1506	93.7%

Most occurring characters

Value	Count	Frequency (%)
1	1249	21.0%
3	760	12.8%
2	711	11.9%
0	665	11.2%
,	617	10.4%
8	362	6.1%
4	354	5.9%
9	339	5.7%
7	315	5.3%
5	298	5.0%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	5333	89.6%
Other Punctuation	617	10.4%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
1	1249	23.4%
3	760	14.3%
2	711	13.3%
0	665	12.5%
8	362	6.8%
4	354	6.6%
9	339	6.4%
7	315	5.9%
5	298	5.6%
6	280	5.3%

Other Punctuation

Value	Count	Frequency (%)
,	617	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	5950	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
1	1249	21.0%
3	760	12.8%
2	711	11.9%
0	665	11.2%
,	617	10.4%
8	362	6.1%
4	354	5.9%
9	339	5.7%
7	315	5.3%
5	298	5.0%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	5950	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
1	1249	21.0%
3	760	12.8%
2	711	11.9%
0	665	11.2%
,	617	10.4%
8	362	6.1%
4	354	5.9%
9	339	5.7%
7	315	5.3%
5	298	5.0%

사이트명
Text

MISSING

Distinct	809
Distinct (%)	66.3%
Missing	388
Missing (%)	24.1%
Memory size	12.7 KiB

Length

Max length	124
Median length	57
Mean length	14.279508
Min length	4

Characters and Unicode

Total characters	17421
Distinct characters	504
Distinct categories	9
Distinct scripts	3
Distinct blocks	3

Unique

Unique	734
Unique (%)	60.2%

Sample

1st row	위키피디아 - RCA Records
2nd row	위키피디아 - Extended play
3rd row	위키피디아 - Edison Records
4th row	위키피디아 - Edison Records
5th row	위키피디아 - Theremin

Value	Count	Frequency (%)
	411	13.3%
위키피디아	234	7.5%
두산백과	138	4.5%
한국위키피디아	80	2.6%
문화재청	42	1.4%
네이버지식백과	41	1.3%
문화콘텐츠닷컴	31	1.0%
한국민족문화대백과	30	1.0%
향토문화대전	28	0.9%
youtube	28	0.9%
Other values (1221)	2038	65.7%

Most occurring characters

Value	Count	Frequency (%)
(space)	1882	10.8%
-	874	5.0%
위	446	2.6%
키	425	2.4%
과	419	2.4%
아	390	2.2%
피	375	2.2%
디	375	2.2%
백	342	2.0%
e	340	2.0%
Other values (494)	11553	66.3%

Most occurring categories

Value	Count	Frequency (%)
Other Letter	10137	58.2%
Lowercase Letter	3105	17.8%
Space Separator	1882	10.8%
Uppercase Letter	1208	6.9%
Dash Punctuation	874	5.0%
Decimal Number	122	0.7%
Open Punctuation	35	0.2%
Close Punctuation	35	0.2%
Other Punctuation	23	0.1%

Most frequent character per category

Other Letter

Value	Count	Frequency (%)
위	446	4.4%
키	425	4.2%
과	419	4.1%
아	390	3.8%
피	375	3.7%
디	375	3.7%
백	342	3.4%
국	274	2.7%
이	268	2.6%
리	235	2.3%
Other values (421)	6588	65.0%

Uppercase Letter

Value	Count	Frequency (%)
D	149	12.3%
S	139	11.5%
L	132	10.9%
N	115	9.5%
M	86	7.1%
P	68	5.6%
C	65	5.4%
I	60	5.0%
R	53	4.4%
B	48	4.0%
Other values (16)	293	24.3%

Lowercase Letter

Value	Count	Frequency (%)
e	340	11.0%
o	300	9.7%
a	272	8.8%
n	250	8.1%
r	249	8.0%
i	213	6.9%
t	186	6.0%
c	163	5.2%
u	160	5.2%
l	138	4.4%
Other values (14)	834	26.9%

Decimal Number

Value	Count	Frequency (%)
0	41	33.6%
1	28	23.0%
3	11	9.0%
2	11	9.0%
8	9	7.4%
5	8	6.6%
6	7	5.7%
7	3	2.5%
4	3	2.5%
9	1	0.8%

Other Punctuation

Value	Count	Frequency (%)
:	7	30.4%
'	5	21.7%
,	3	13.0%
/	3	13.0%
&	1	4.3%
·	1	4.3%
？	1	4.3%
?	1	4.3%
.	1	4.3%

Space Separator

Value	Count	Frequency (%)
(space)	1882	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	874	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	35	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	35	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Hangul	10137	58.2%
Latin	4313	24.8%
Common	2971	17.1%
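
To see how a single value contributes to these script totals, the sketch below classifies each character of the first sample row. Python's standard library does not expose the Unicode Script property, so `rough_script` guesses from the character name. That is a heuristic chosen for this illustration, not the method ydata-profiling uses, and it only distinguishes the three scripts that occur in this column.

```python
import unicodedata
from collections import Counter

def rough_script(ch):
    """Heuristic script guess from the character name; not the real Script property."""
    name = unicodedata.name(ch, "")
    if name.startswith("HANGUL"):
        return "Hangul"
    if name.startswith("LATIN"):
        return "Latin"
    return "Common"  # spaces, digits, and punctuation end up here

script_counts = Counter(rough_script(ch) for ch in "위키피디아 - RCA Records")
print(script_counts)
```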

Most frequent character per script

Hangul

Value	Count	Frequency (%)
위	446	4.4%
키	425	4.2%
과	419	4.1%
아	390	3.8%
피	375	3.7%
디	375	3.7%
백	342	3.4%
국	274	2.7%
이	268	2.6%
리	235	2.3%
Other values (421)	6588	65.0%

Latin

Value	Count	Frequency (%)
e	340	7.9%
o	300	7.0%
a	272	6.3%
n	250	5.8%
r	249	5.8%
i	213	4.9%
t	186	4.3%
c	163	3.8%
u	160	3.7%
D	149	3.5%
Other values (40)	2031	47.1%

Common

Value	Count	Frequency (%)
(space)	1882	63.3%
-	874	29.4%
0	41	1.4%
(	35	1.2%
)	35	1.2%
1	28	0.9%
3	11	0.4%
2	11	0.4%
8	9	0.3%
5	8	0.3%
Other values (13)	37	1.2%

Most occurring blocks

Value	Count	Frequency (%)
Hangul	10137	58.2%
ASCII	7282	41.8%
None	2	< 0.1%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
(space)	1882	25.8%
-	874	12.0%
e	340	4.7%
o	300	4.1%
a	272	3.7%
n	250	3.4%
r	249	3.4%
i	213	2.9%
t	186	2.6%
c	163	2.2%
Other values (61)	2553	35.1%

Hangul

Value	Count	Frequency (%)
위	446	4.4%
키	425	4.2%
과	419	4.1%
아	390	3.8%
피	375	3.7%
디	375	3.7%
백	342	3.4%
국	274	2.7%
이	268	2.6%
리	235	2.3%
Other values (421)	6588	65.0%

None

Value	Count	Frequency (%)
·	1	50.0%
？	1	50.0%

Missing values

Count	A simple visualization of nullity by column.
Matrix	The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
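
The two views can be sketched in a few lines: the matrix is a grid of present/missing flags, and the count is the number of present cells per column. The rows below are three records taken from the sample tables of this report, with `<NA>` represented as `None`. The variable names are illustrative.

```python
columns = ["고유 아이디", "고유 아이디 2", "사이트명"]
rows = [
    ("435", "189", "위키피디아 - RCA Records"),
    ("2,989", "1,430", None),
    ("2,994", "1,435", None),
]

# Nullity matrix: one row per record, True where the cell is present.
matrix = [[cell is not None for cell in row] for row in rows]

# Count view: number of non-missing cells in each column.
present = {col: sum(flags[i] for flags in matrix) for i, col in enumerate(columns)}
print(present)
```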

Sample

First rows

	고유 아이디	고유 아이디 2	사이트명
0	435	189	위키피디아 - RCA Records
1	442	188	위키피디아 - Extended play
2	447	186	위키피디아 - Edison Records
3	449	185	위키피디아 - Edison Records
4	459	183	위키피디아 - Theremin
5	463	181	위키피디아 - RCA Records
6	471	181	위키피디아 - Extended play
7	473	181	한국위키피디아 - 자기 테이프
8	480	177	위키피디아 - Gramophone Company
9	501	256	두산백과 - 컴퓨터

Last rows

	고유 아이디	고유 아이디 2	사이트명
1598	2,934	1,375	<NA>
1599	2,958	1,399	<NA>
1600	2,965	1,405	<NA>
1601	2,968	1,408	<NA>
1602	2,972	1,413	<NA>
1603	2,973	1,414	<NA>
1604	2,974	1,415	<NA>
1605	2,976	1,417	<NA>
1606	2,989	1,430	<NA>
1607	2,994	1,435	<NA>
