Overview

Dataset statistics

Number of variables: 3
Number of observations: 10000
Missing cells: 1
Missing cells (%): < 0.1%
Duplicate rows: 207
Duplicate rows (%): 2.1%
Total size in memory: 312.5 KiB
Average record size in memory: 32.0 B
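
The percentages and the average record size above follow from the raw counts. A minimal sketch of that arithmetic, using only figures from this report; the variable names are illustrative, and it assumes that missing cells are measured against rows × columns and that 1 KiB is 1024 bytes:

```python
# Figures copied from the "Dataset statistics" block of this report.
n_rows = 10_000
n_cols = 3
duplicate_rows = 207
missing_cells = 1
total_memory_kib = 312.5

# Share of duplicate rows among all observations.
duplicate_pct = 100 * duplicate_rows / n_rows

# Share of missing cells among all cells (assumed: rows x columns).
missing_pct = 100 * missing_cells / (n_rows * n_cols)

# Average bytes per record (assumed: 1 KiB = 1024 bytes).
avg_record_bytes = total_memory_kib * 1024 / n_rows

print(f"Duplicate rows: {duplicate_pct:.1f}%")
print(f"Missing cells: {missing_pct:.4f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Under these assumptions the results match the reported 2.1%, "< 0.1%", and 32.0 B.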

Variable types

Text: 1
Categorical: 2

Dataset

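Description: List data for the source files held in the Basic Academic Data Center (기초학문자료센터) system of the National Research Foundation of Korea (한국연구재단). Key fields include the source file name (원문목록명) and the authoring institution (작성기관).
Author: National Research Foundation of Korea (한국연구재단)
URL: https://www.data.go.kr/data/15092444/fileData.do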

Alerts

Duplicates: Dataset has 207 (2.1%) duplicate rows
Imbalance: 확장자 is highly imbalanced (54.6%)

Reproduction

Analysis started: 2023-12-16 15:04:46.935308
Analysis finished: 2023-12-16 15:04:51.498624
Duration: 4.56 seconds
Software version: ydata-profiling v4.5.1
Download configuration: config.json

Variables

원문목록명
Text

Distinct: 9629
Distinct (%): 96.3%
Missing: 1
Missing (%): < 0.1%
Memory size: 156.2 KiB

Length

Max length: 183
Median length: 114
Mean length: 16.591759
Min length: 5

Characters and Unicode

Total characters: 165901
Distinct characters: 1137
Distinct categories: 15
Distinct scripts: 6
Distinct blocks: 9
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
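
As a rough illustration of how such a category breakdown can be computed, the sketch below uses Python's standard `unicodedata` module on four of the sample file names from this variable. The helper `category_counts` and the `CATEGORY_NAMES` mapping are hypothetical names introduced here; the report itself may rely on a different library or Unicode version, so its counts need not match this approach exactly.

```python
import unicodedata
from collections import Counter

# Long names for some general-category codes that appear in this report.
CATEGORY_NAMES = {
    "Nd": "Decimal Number",
    "Ll": "Lowercase Letter",
    "Lu": "Uppercase Letter",
    "Lo": "Other Letter",
    "Po": "Other Punctuation",
    "Pc": "Connector Punctuation",
    "Pd": "Dash Punctuation",
    "Zs": "Space Separator",
}


def category_counts(values):
    """Count Unicode general categories over every character in the given strings."""
    counts = Counter()
    for value in values:
        for ch in value:
            code = unicodedata.category(ch)
            # Fall back to the two-letter code when no long name is mapped.
            counts[CATEGORY_NAMES.get(code, code)] += 1
    return counts


# Four of the sample rows listed for this variable.
sample = ["A000131.pdf", "139-2.jpg", "2213s4_273.mp3", "AS201512.pdf"]
counts = category_counts(sample)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

Hangul syllables such as 언 fall under "Other Letter", which is why that category ranks third in this variable.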

Unique

Unique: 9358
Unique (%): 93.6%

Sample

1st row: A000131.pdf
2nd row: 139-2.jpg
3rd row: 언어적 입력의 품사가 영아의 초기 어휘발달에 미치는 영향.pdf
4th row: 2213s4_273.mp3
5th row: AS201512.pdf

Value    Count    Frequency (%)
표현된    121    0.8%
사진    117    0.7%
관한    86    0.5%
(not preserved)    75    0.5%
연구.pdf    63    0.4%
대한    57    0.4%
picture    46    0.3%
(not preserved)    41    0.3%
고문서    35    0.2%
미치는    29    0.2%
Other values (12853)    15093    95.7%

Most occurring characters

Value    Count    Frequency (%)
0    24266    14.6%
1    13213    8.0%
.    10541    6.4%
p    8580    5.2%
2    7838    4.7%
3    7631    4.6%
(space)    5780    3.5%
5    5325    3.2%
4    4553    2.7%
_    4249    2.6%
Other values (1127)    73925    44.6%

Most occurring categories

Value    Count    Frequency (%)
Decimal Number    75640    45.6%
Lowercase Letter    29727    17.9%
Other Letter    25573    15.4%
Other Punctuation    10789    6.5%
Uppercase Letter    10648    6.4%
Space Separator    5780    3.5%
Connector Punctuation    4249    2.6%
Dash Punctuation    2564    1.5%
Open Punctuation    453    0.3%
Close Punctuation    453    0.3%
Other values (5)    25    < 0.1%

Most frequent character per category

Other Letter
Value    Count    Frequency (%)
(not preserved)    780    3.1%
(not preserved)    524    2.0%
(not preserved)    451    1.8%
(not preserved)    390    1.5%
(not preserved)    387    1.5%
(not preserved)    345    1.3%
(not preserved)    343    1.3%
(not preserved)    342    1.3%
(not preserved)    326    1.3%
(not preserved)    325    1.3%
Other values (1018)    21360    83.5%

Lowercase Letter
Value    Count    Frequency (%)
p    8580    28.9%
m    3660    12.3%
d    3577    12.0%
f    3499    11.8%
g    2081    7.0%
j    1963    6.6%
s    1492    5.0%
e    597    2.0%
i    491    1.7%
n    410    1.4%
Other values (30)    3377    11.4%

Uppercase Letter
Value    Count    Frequency (%)
P    1392    13.1%
G    1310    12.3%
J    1089    10.2%
A    1049    9.9%
S    1033    9.7%
B    999    9.4%
M    736    6.9%
C    664    6.2%
D    550    5.2%
N    447    4.2%
Other values (20)    1379    13.0%

Decimal Number
Value    Count    Frequency (%)
0    24266    32.1%
1    13213    17.5%
2    7838    10.4%
3    7631    10.1%
5    5325    7.0%
4    4553    6.0%
6    3702    4.9%
7    3346    4.4%
8    2982    3.9%
9    2784    3.7%

Other Punctuation
Value    Count    Frequency (%)
.    10541    97.7%
,    201    1.9%
'    36    0.3%
#    5    < 0.1%
·    2    < 0.1%
&    2    < 0.1%
;    1    < 0.1%
!    1    < 0.1%

Open Punctuation
Value    Count    Frequency (%)
(    411    90.7%
[    31    6.8%
(not preserved)    7    1.5%
(not preserved)    2    0.4%
(not preserved)    2    0.4%

Close Punctuation
Value    Count    Frequency (%)
)    411    90.7%
]    31    6.8%
(not preserved)    7    1.5%
(not preserved)    2    0.4%
(not preserved)    2    0.4%

Dash Punctuation
Value    Count    Frequency (%)
-    2561    99.9%
(not preserved)    3    0.1%

Math Symbol
Value    Count    Frequency (%)
~    12    92.3%
+    1    7.7%

Initial Punctuation
Value    Count    Frequency (%)
(not preserved)    3    75.0%
(not preserved)    1    25.0%

Space Separator
Value    Count    Frequency (%)
(space)    5780    100.0%

Connector Punctuation
Value    Count    Frequency (%)
_    4249    100.0%

Final Punctuation
Value    Count    Frequency (%)
(not preserved)    5    100.0%

Modifier Symbol
Value    Count    Frequency (%)
`    2    100.0%

Letter Number
Value    Count    Frequency (%)
(not preserved)    1    100.0%

Most occurring scripts

Value    Count    Frequency (%)
Common    99952    60.2%
Latin    40347    24.3%
Hangul    24085    14.5%
Han    1486    0.9%
Cyrillic    29    < 0.1%
Hiragana    2    < 0.1%

Most frequent character per script

Hangul
Value    Count    Frequency (%)
(not preserved)    780    3.2%
(not preserved)    524    2.2%
(not preserved)    451    1.9%
(not preserved)    390    1.6%
(not preserved)    387    1.6%
(not preserved)    345    1.4%
(not preserved)    343    1.4%
(not preserved)    342    1.4%
(not preserved)    326    1.4%
(not preserved)    325    1.3%
Other values (680)    19872    82.5%

Han
Value    Count    Frequency (%)
(not preserved)    103    6.9%
(not preserved)    84    5.7%
(not preserved)    76    5.1%
輿    67    4.5%
(not preserved)    62    4.2%
(not preserved)    33    2.2%
(not preserved)    30    2.0%
(not preserved)    25    1.7%
(not preserved)    23    1.5%
(not preserved)    20    1.3%
Other values (327)    963    64.8%

Latin
Value    Count    Frequency (%)
p    8580    21.3%
m    3660    9.1%
d    3577    8.9%
f    3499    8.7%
g    2081    5.2%
j    1963    4.9%
s    1492    3.7%
P    1392    3.5%
G    1310    3.2%
J    1089    2.7%
Other values (43)    11704    29.0%

Common
Value    Count    Frequency (%)
0    24266    24.3%
1    13213    13.2%
.    10541    10.5%
2    7838    7.8%
3    7631    7.6%
(space)    5780    5.8%
5    5325    5.3%
4    4553    4.6%
_    4249    4.3%
6    3702    3.7%
Other values (28)    12854    12.9%

Cyrillic
Value    Count    Frequency (%)
с    5    17.2%
и    4    13.8%
п    2    6.9%
е    2    6.9%
к    2    6.9%
л    2    6.9%
я    1    3.4%
о    1    3.4%
г    1    3.4%
д    1    3.4%
Other values (8)    8    27.6%

Hiragana
Value    Count    Frequency (%)
(not preserved)    2    100.0%

Most occurring blocks

Value    Count    Frequency (%)
ASCII    140262    84.5%
Hangul    24085    14.5%
CJK    1448    0.9%
CJK Compat Ideographs    38    < 0.1%
Cyrillic    29    < 0.1%
None    27    < 0.1%
Punctuation    9    < 0.1%
Hiragana    2    < 0.1%
Number Forms    1    < 0.1%

Most frequent character per block

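ASCII
Value    Count    Frequency (%)
0    24266    17.3%
1    13213    9.4%
.    10541    7.5%
p    8580    6.1%
2    7838    5.6%
3    7631    5.4%
(space)    5780    4.1%
5    5325    3.8%
4    4553    3.2%
_    4249    3.0%
Other values (69)    48286    34.4%

Hangul
Value    Count    Frequency (%)
(not preserved)    780    3.2%
(not preserved)    524    2.2%
(not preserved)    451    1.9%
(not preserved)    390    1.6%
(not preserved)    387    1.6%
(not preserved)    345    1.4%
(not preserved)    343    1.4%
(not preserved)    342    1.4%
(not preserved)    326    1.4%
(not preserved)    325    1.3%
Other values (680)    19872    82.5%

CJK
Value    Count    Frequency (%)
(not preserved)    103    7.1%
(not preserved)    84    5.8%
(not preserved)    76    5.2%
輿    67    4.6%
(not preserved)    62    4.3%
(not preserved)    33    2.3%
(not preserved)    30    2.1%
(not preserved)    25    1.7%
(not preserved)    23    1.6%
(not preserved)    20    1.4%
Other values (313)    925    63.9%

CJK Compat Ideographs
Value    Count    Frequency (%)
(not preserved)    18    47.4%
(not preserved)    3    7.9%
(not preserved)    3    7.9%
(not preserved)    2    5.3%
(not preserved)    2    5.3%
(not preserved)    2    5.3%
(not preserved)    1    2.6%
(not preserved)    1    2.6%
(not preserved)    1    2.6%
(not preserved)    1    2.6%
Other values (4)    4    10.5%

None
Value    Count    Frequency (%)
(not preserved)    7    25.9%
(not preserved)    7    25.9%
(not preserved)    3    11.1%
(not preserved)    2    7.4%
·    2    7.4%
(not preserved)    2    7.4%
(not preserved)    2    7.4%
(not preserved)    2    7.4%

Punctuation
Value    Count    Frequency (%)
(not preserved)    5    55.6%
(not preserved)    3    33.3%
(not preserved)    1    11.1%

Cyrillic
Value    Count    Frequency (%)
с    5    17.2%
и    4    13.8%
п    2    6.9%
е    2    6.9%
к    2    6.9%
л    2    6.9%
я    1    3.4%
о    1    3.4%
г    1    3.4%
д    1    3.4%
Other values (8)    8    27.6%

Hiragana
Value    Count    Frequency (%)
(not preserved)    2    100.0%

Number Forms
Value    Count    Frequency (%)
(not preserved)    1    100.0%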

확장자
Categorical

IMBALANCE 

Distinct: 29
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 156.2 KiB

Value    Count
pdf    3437
mp3    3073
jpg    1906
JPG    1076
wmv    272
Other values (24)    236

Length

Max length: 4
Median length: 3
Mean length: 3.0049
Min length: 2

Unique

Unique: 9
Unique (%): 0.1%

Sample

1st row: pdf
2nd row: jpg
3rd row: pdf
4th row: mp3
5th row: pdf

Common Values

Value    Count    Frequency (%)
pdf    3437    34.4%
mp3    3073    30.7%
jpg    1906    19.1%
JPG    1076    10.8%
wmv    272    2.7%
TXT    43    0.4%
jpeg    40    0.4%
sav    30    0.3%
xls    27    0.3%
MP3    23    0.2%
Other values (19)    73    0.7%

Length

Histogram of lengths of the category

Value    Count    Frequency (%)
pdf    3443    34.4%
mp3    3096    31.0%
jpg    2982    29.8%
wmv    272    2.7%
txt    44    0.4%
jpeg    40    0.4%
xls    32    0.3%
sav    30    0.3%
zip    21    0.2%
htm    9    0.1%
Other values (12)    31    0.3%
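
The counts in the table directly above are consistent with folding extensions to lower case before counting: jpg 2982 equals jpg 1906 plus JPG 1076 from Common Values, and mp3 3096 equals mp3 3073 plus MP3 23. The sketch below reproduces that merge from the ten values listed under Common Values only; the dictionary name `common_values` is illustrative. Totals for pdf, txt and xls come out slightly lower than in the table, presumably because their other-case variants sit in "Other values (19)".

```python
from collections import Counter

# Top-10 extension counts as listed under "Common Values" (case-sensitive).
common_values = {
    "pdf": 3437, "mp3": 3073, "jpg": 1906, "JPG": 1076, "wmv": 272,
    "TXT": 43, "jpeg": 40, "sav": 30, "xls": 27, "MP3": 23,
}

# Merge counts that differ only by letter case.
folded = Counter()
for ext, count in common_values.items():
    folded[ext.lower()] += count

for ext, count in folded.most_common():
    print(ext, count)
```

Normalising case this way before analysis would reduce the 29 distinct extensions and change how imbalanced the variable looks, so it is worth deciding on deliberately.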

작성기관
Categorical

Distinct: 9
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 156.2 KiB

Value    Count
SungkyunkwanUniv    3295
KOSSDA    2281
KoreaUniv    1405
ChonbukUniv    1345
KongjuUniv    706
Other values (4)    968

Length

Max length: 16
Median length: 11
Mean length: 11.0771
Min length: 2

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: ChonbukUniv
2nd row: KoreaUniv
3rd row: KOSSDA
4th row: SungkyunkwanUniv
5th row: KoreaUniv

Common Values

Value    Count    Frequency (%)
SungkyunkwanUniv    3295    33.0%
KOSSDA    2281    22.8%
KoreaUniv    1405    14.1%
ChonbukUniv    1345    13.5%
KongjuUniv    706    7.1%
SeoulUniv    351    3.5%
ChonnamUniv    315    3.1%
MyongjiUniv    293    2.9%
-1    9    0.1%

Length

Histogram of lengths of the category

Common Values (Plot)

Value    Count    Frequency (%)
sungkyunkwanuniv    3295    33.0%
kossda    2281    22.8%
koreauniv    1405    14.1%
chonbukuniv    1345    13.5%
kongjuuniv    706    7.1%
seouluniv    351    3.5%
chonnamuniv    315    3.1%
myongjiuniv    293    2.9%
1    9    0.1%

Correlations

        확장자    작성기관
확장자    1.000    0.713
작성기관    0.713    1.000

        작성기관    확장자
작성기관    1.000    0.359
확장자    0.359    1.000

        확장자    작성기관
확장자    1.000    0.359
작성기관    0.359    1.000

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

원문목록명    확장자    작성기관
15303 A000131.pdf    pdf    ChonbukUniv
46395 139-2.jpg    jpg    KoreaUniv
84200 언어적 입력의 품사가 영아의 초기 어휘발달에 미치는 영향.pdf    pdf    KOSSDA
77296 2213s4_273.mp3    mp3    SungkyunkwanUniv
2601 AS201512.pdf    pdf    KoreaUniv
6176 BS20704.pdf    pdf    KOSSDA
421881000100550808_47.mp3    mp3    SungkyunkwanUniv
1739044950010.JPG    JPG    KOSSDA
594331000100570071_17.mp3    mp3    SungkyunkwanUniv
39238 m290603.jpg    jpg    KoreaUniv

원문목록명    확장자    작성기관
27278 김동규소장 연길7.jpg    jpg    ChonbukUniv
70768 한국교육의 종합이해와 미래구상III(면담조사자료집).pdf    pdf    KOSSDA
16164 고05948 copy.jpg    jpg    ChonbukUniv
635421000100570266_98.mp3    mp3    SungkyunkwanUniv
30276 박상호소장 통문1.JPG    JPG    ChonbukUniv
270061000100560084_38.mp3    mp3    SungkyunkwanUniv
16193 중간보고서.pdf    pdf    ChonbukUniv
73099 S1MM06-12.jpg    jpg    KOSSDA
171102222s4_659.mp3    mp3    SungkyunkwanUniv
1170 A200056.pdf    pdf    KoreaUniv

Duplicate rows

Most frequently occurring

원문목록명    확장자    작성기관    # duplicates
11-1.jpg    jpg    KoreaUniv    4
1416-1.jpg    jpg    KoreaUniv    4
2126-1.jpg    jpg    KoreaUniv    4
105 B000021.pdf    pdf    ChonnamUniv    4
107 B000051.pdf    pdf    ChonnamUniv    4
112 B000081.pdf    pdf    ChonnamUniv    4
131 B000381.pdf    pdf    ChonnamUniv    4
171 G000121.pdf    pdf    MyongjiUniv    4
177 G000571.pdf    pdf    MyongjiUniv    4
310-2.jpg    jpg    KoreaUniv    3
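
A table like the one above can be approximated by counting how often each complete row occurs. The sketch below does this with the standard library on a small illustrative list: the repeated rows reuse values from the table, the repetition counts are entered by hand, and the single unique row is taken from the Sample section. How the report derives its headline figure of 207 from such groups is not stated here, so the code only lists rows that occur more than once.

```python
from collections import Counter

# Illustrative rows of (원문목록명, 확장자, 작성기관); repetitions entered by hand.
rows = (
    [("B000021.pdf", "pdf", "ChonnamUniv")] * 4
    + [("G000121.pdf", "pdf", "MyongjiUniv")] * 4
    + [("A000131.pdf", "pdf", "ChonbukUniv")]
)

# Count occurrences of each distinct row, then keep those seen more than once.
occurrences = Counter(rows)
duplicates = {row: n for row, n in occurrences.items() if n > 1}

for row, n in sorted(duplicates.items(), key=lambda item: -item[1]):
    print(*row, n)
```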