
Study of Perl language implementation

These are personal notes on the implementation of the Perl language. They record what I found by reading the Perl source code. Please note that my understanding is partial, as I do not yet see the big picture of Perl's implementation. I think the notes will be helpful when writing XS.

Overview of source code

A brief description of the source code. The Perl repository contains many directories and source files, so I'll briefly explain what each of the main ones implements.

perlmain.c Perl entry point. Generated from miniperlmain.c. Calls memory allocation, lexical analysis, parsing, execution, and memory release in order.
perl.h Perl header
perl.c Perl implementation. Defines the functions that perform memory allocation, lexical analysis, parsing, execution, and memory release.
intrpvar.h Defines the Perl interpreter variables
perlvars.h Defines the Perl global variables
embedvar.h Abstraction over interpreter variables and global variables
parser.h Header for lexical analysis and parsing. Defines the parser.
toke.c Implementation of lexical analysis. The token types produced by the lexer can be seen in perly.y. The main function is yylex.
perly.y Grammar definition for yacc.
perly.c Parsing implementation. Generates a syntax tree from the tokens. The main function is yyparse.
opnames.h Opcode types
opcode.h Opcode information
op.c Implements the functions that create syntax tree nodes and the functions that optimize them.
run.c Runs the syntax tree created by parsing
scope.c Implementation of functions related to scope.
cop.h Defines macros for control flow: moving to the next statement, loop control (next, last), exception handling, etc.
regen Contains tools that automatically generate source code.
sv.c Scalar variable implementation
av.c Array implementation
hv.c Hash implementation
pad.c Implementation of the storage for lexical variables within a scope
pp.h pp_hot.c pp_sort.c pp_sys.c pp_pack.c pp_ctl.c Implementations of Perl functions and operators
handy.h Defines convenient macros
util.c Implements convenient functions
vutil.c Version parsing functions
perlio.c Perl IO implementation
regcomp.c Regular expression compilation
regexec.c Regular expression execution
time64.c Time implementation
utf8.c UTF-8 implementation

OP node type of syntax tree

Perl is a language that is compiled before it runs. After lexical analysis, the Perl source code is parsed and converted into a syntax tree.

This syntax tree is built from the "OP type". In other words, OP type data structures are linked together into a tree. For convenience, I will call each of them an OP node.

The list of OP node types tells you which OP nodes Perl has. It is defined in "opnames.h".

// opnames.h
typedef enum opcode {
OP_NULL = 0,
OP_STUB = 1,
OP_SCALAR = 2,
OP_PUSHMARK = 3,
OP_WANTARRAY = 4,
OP_CONST = 5,
OP_GVSV = 6,
OP_GV = 7,
OP_GELEM = 8,
OP_PADSV = 9,
OP_PADAV = 10,
OP_PADHV = 11,
OP_PADANY = 12,
OP_PUSHRE = 13,
OP_RV2GV = 14,
OP_RV2SV = 15,
OP_AV2ARYLEN = 16,
OP_RV2CV = 17,
OP_ANONCODE = 18,
OP_PROTOTYPE = 19,
OP_REFGEN = 20,
OP_SREFGEN = 21,
OP_REF = 22,
OP_BLESS = 23,
OP_BACKTICK = 24,
OP_GLOB = 25,
OP_READLINE = 26,
OP_RCATLINE = 27,
OP_REGCMAYBE = 28,
OP_REGCRESET = 29,
OP_REGCOMP = 30,
OP_MATCH = 31,
OP_QR = 32,
OP_SUBST = 33,
OP_SUBSTCONT = 34,
OP_TRANS = 35,
OP_TRANSR = 36,
OP_SASSIGN = 37,
OP_AASSIGN = 38,
OP_CHOP = 39,
OP_SCHOP = 40,
OP_CHOMP = 41,
OP_SCHOMP = 42,
OP_DEFINED = 43,
OP_UNDEF = 44,
OP_STUDY = 45,
OP_POS = 46,
OP_PREINC = 47,
OP_I_PREINC = 48,
OP_PREDEC = 49,
OP_I_PREDEC = 50,
OP_POSTINC = 51,
OP_I_POSTINC = 52,
OP_POSTDEC = 53,
OP_I_POSTDEC = 54,
OP_POW = 55,
OP_MULTIPLY = 56,
OP_I_MULTIPLY = 57,
OP_DIVIDE = 58,
OP_I_DIVIDE = 59,
OP_MODULO = 60,
OP_I_MODULO = 61,
OP_REPEAT = 62,
OP_ADD = 63,
OP_I_ADD = 64,
OP_SUBTRACT = 65,
OP_I_SUBTRACT = 66,
OP_CONCAT = 67,
OP_STRINGIFY = 68,
OP_LEFT_SHIFT = 69,
OP_RIGHT_SHIFT = 70,
OP_LT = 71,
OP_I_LT = 72,
OP_GT = 73,
OP_I_GT = 74,
OP_LE = 75,
OP_I_LE = 76,
OP_GE = 77,
OP_I_GE = 78,
OP_EQ = 79,
OP_I_EQ = 80,
OP_NE = 81,
OP_I_NE = 82,
OP_NCMP = 83,
OP_I_NCMP = 84,
OP_SLT = 85,
OP_SGT = 86,
OP_SLE = 87,
OP_SGE = 88,
OP_SEQ = 89,
OP_SNE = 90,
OP_SCMP = 91,
OP_BIT_AND = 92,
OP_BIT_XOR = 93,
OP_BIT_OR = 94,
OP_NBIT_AND = 95,
OP_NBIT_XOR = 96,
OP_NBIT_OR = 97,
OP_SBIT_AND = 98,
OP_SBIT_XOR = 99,
OP_SBIT_OR = 100,
OP_NEGATE = 101,
OP_I_NEGATE = 102,
OP_NOT = 103,
OP_COMPLEMENT = 104,
OP_NCOMPLEMENT = 105,
OP_SCOMPLEMENT = 106,
OP_SMARTMATCH = 107,
OP_ATAN2 = 108,
OP_SIN = 109,
OP_COS = 110,
OP_RAND = 111,
OP_SRAND = 112,
OP_EXP = 113,
OP_LOG = 114,
OP_SQRT = 115,
OP_INT = 116,
OP_HEX = 117,
OP_OCT = 118,
OP_ABS = 119,
OP_LENGTH = 120,
OP_SUBSTR = 121,
OP_VEC = 122,
OP_INDEX = 123,
OP_RINDEX = 124,
OP_SPRINTF = 125,
OP_FORMLINE = 126,
OP_ORD = 127,
OP_CHR = 128,
OP_CRYPT = 129,
OP_UCFIRST = 130,
OP_LCFIRST = 131,
OP_UC = 132,
OP_LC = 133,
OP_QUOTEMETA = 134,
OP_RV2AV = 135,
OP_AELEMFAST = 136,
OP_AELEMFAST_LEX = 137,
OP_AELEM = 138,
OP_ASLICE = 139,
OP_KVASLICE = 140,
OP_AEACH = 141,
OP_AKEYS = 142,
OP_AVALUES = 143,
OP_EACH = 144,
OP_VALUES = 145,
OP_KEYS = 146,
OP_DELETE = 147,
OP_EXISTS = 148,
OP_RV2HV = 149,
OP_HELEM = 150,
OP_HSLICE = 151,
OP_KVHSLICE = 152,
OP_MULTIDEREF = 153,
OP_UNPACK = 154,
OP_PACK = 155,
OP_SPLIT = 156,
OP_JOIN = 157,
OP_LIST = 158,
OP_LSLICE = 159,
OP_ANONLIST = 160,
OP_ANONHASH = 161,
OP_SPLICE = 162,
OP_PUSH = 163,
OP_POP = 164,
OP_SHIFT = 165,
OP_UNSHIFT = 166,
OP_SORT = 167,
OP_REVERSE = 168,
OP_GREPSTART = 169,
OP_GREPWHILE = 170,
OP_MAPSTART = 171,
OP_MAPWHILE = 172,
OP_RANGE = 173,
OP_FLIP = 174,
OP_FLOP = 175,
OP_AND = 176,
OP_OR = 177,
OP_XOR = 178,
OP_DOR = 179,
OP_COND_EXPR = 180,
OP_ANDASSIGN = 181,
OP_ORASSIGN = 182,
OP_DORASSIGN = 183,
OP_METHOD = 184,
OP_ENTERSUB = 185,
OP_LEAVESUB = 186,
OP_LEAVESUBLV = 187,
OP_CALLER = 188,
OP_WARN = 189,
OP_DIE = 190,
OP_RESET = 191,
OP_LINESEQ = 192,
OP_NEXTSTATE = 193,
OP_DBSTATE = 194,
OP_UNSTACK = 195,
OP_ENTER = 196,
OP_LEAVE = 197,
OP_SCOPE = 198,
OP_ENTERITER = 199,
OP_ITER = 200,
OP_ENTERLOOP = 201,
OP_LEAVELOOP = 202,
OP_RETURN = 203,
OP_LAST = 204,
OP_NEXT = 205,
OP_REDO = 206,
OP_DUMP = 207,
OP_GOTO = 208,
OP_EXIT = 209,
OP_METHOD_NAMED = 210,
OP_METHOD_SUPER = 211,
OP_METHOD_REDIR = 212,
OP_METHOD_REDIR_SUPER = 213,
OP_ENTERGIVEN = 214,
OP_LEAVEGIVEN = 215,
OP_ENTERWHEN = 216,
OP_LEAVEWHEN = 217,
OP_BREAK = 218,
OP_CONTINUE = 219,
OP_OPEN = 220,
OP_CLOSE = 221,
OP_PIPE_OP = 222,
OP_FILENO = 223,
OP_UMASK = 224,
OP_BINMODE = 225,
OP_TIE = 226,
OP_UNTIE = 227,
OP_TIED = 228,
OP_DBMOPEN = 229,
OP_DBMCLOSE = 230,
OP_SSELECT = 231,
OP_SELECT = 232,
OP_GETC = 233,
OP_READ = 234,
OP_ENTERWRITE = 235,
OP_LEAVEWRITE = 236,
OP_PRTF = 237,
OP_PRINT = 238,
OP_SAY = 239,
OP_SYSOPEN = 240,
OP_SYSSEEK = 241,
OP_SYSREAD = 242,
OP_SYSWRITE = 243,
OP_EOF = 244,
OP_TELL = 245,
OP_SEEK = 246,
OP_TRUNCATE = 247,
OP_FCNTL = 248,
OP_IOCTL = 249,
OP_FLOCK = 250,
OP_SEND = 251,
OP_RECV = 252,
OP_SOCKET = 253,
OP_SOCKPAIR = 254,
OP_BIND = 255,
OP_CONNECT = 256,
OP_LISTEN = 257,
OP_ACCEPT = 258,
OP_SHUTDOWN = 259,
OP_GSOCKOPT = 260,
OP_SSOCKOPT = 261,
OP_GETSOCKNAME = 262,
OP_GETPEERNAME = 263,
OP_LSTAT = 264,
OP_STAT = 265,
OP_FTRREAD = 266,
OP_FTRWRITE = 267,
OP_FTREXEC = 268,
OP_FTEREAD = 269,
OP_FTEWRITE = 270,
OP_FTEEXEC = 271,
OP_FTIS = 272,
OP_FTSIZE = 273,
OP_FTMTIME = 274,
OP_FTATIME = 275,
OP_FTCTIME = 276,
OP_FTROWNED = 277,
OP_FTEOWNED = 278,
OP_FTZERO = 279,
OP_FTSOCK = 280,
OP_FTCHR = 281,
OP_FTBLK = 282,
OP_FTFILE = 283,
OP_FTDIR = 284,
OP_FTPIPE = 285,
OP_FTSUID = 286,
OP_FTSGID = 287,
OP_FTSVTX = 288,
OP_FTLINK = 289,
OP_FTTTY = 290,
OP_FTTEXT = 291,
OP_FTBINARY = 292,
OP_CHDIR = 293,
OP_CHOWN = 294,
OP_CHROOT = 295,
OP_UNLINK = 296,
OP_CHMOD = 297,
OP_UTIME = 298,
OP_RENAME = 299,
OP_LINK = 300,
OP_SYMLINK = 301,
OP_READLINK = 302,
OP_MKDIR = 303,
OP_RMDIR = 304,
OP_OPEN_DIR = 305,
OP_READDIR = 306,
OP_TELLDIR = 307,
OP_SEEKDIR = 308,
OP_REWINDDIR = 309,
OP_CLOSEDIR = 310,
OP_FORK = 311,
OP_WAIT = 312,
OP_WAITPID = 313,
OP_SYSTEM = 314,
OP_EXEC = 315,
OP_KILL = 316,
OP_GETPPID = 317,
OP_GETPGRP = 318,
OP_SETPGRP = 319,
OP_GETPRIORITY = 320,
OP_SETPRIORITY = 321,
OP_TIME = 322,
OP_TMS = 323,
OP_LOCALTIME = 324,
OP_GMTIME = 325,
OP_ALARM = 326,
OP_SLEEP = 327,
OP_SHMGET = 328,
OP_SHMCTL = 329,
OP_SHMREAD = 330,
OP_SHMWRITE = 331,
OP_MSGGET = 332,
OP_MSGCTL = 333,
OP_MSGSND = 334,
OP_MSGRCV = 335,
OP_SEMOP = 336,
OP_SEMGET = 337,
OP_SEMCTL = 338,
OP_REQUIRE = 339,
OP_DOFILE = 340,
OP_HINTSEVAL = 341,
OP_ENTEREVAL = 342,
OP_LEAVEEVAL = 343,
OP_ENTERTRY = 344,
OP_LEAVETRY = 345,
OP_GHBYNAME = 346,
OP_GHBYADDR = 347,
OP_GHOSTENT = 348,
OP_GNBYNAME = 349,
OP_GNBYADDR = 350,
OP_GNETENT = 351,
OP_GPBYNAME = 352,
OP_GPBYNUMBER = 353,
OP_GPROTOENT = 354,
OP_GSBYNAME = 355,
OP_GSBYPORT = 356,
OP_GSERVENT = 357,
OP_SHOSTENT = 358,
OP_SNETENT = 359,
OP_SPROTOENT = 360,
OP_SSERVENT = 361,
OP_EHOSTENT = 362,
OP_ENETENT = 363,
OP_EPROTOENT = 364,
OP_ESERVENT = 365,
OP_GPWNAM = 366,
OP_GPWUID = 367,
OP_GPWENT = 368,
OP_SPWENT = 369,
OP_EPWENT = 370,
OP_GGRNAM = 371,
OP_GGRGID = 372,
OP_GGRENT = 373,
OP_SGRENT = 374,
OP_EGRENT = 375,
OP_GETLOGIN = 376,
OP_SYSCALL = 377,
OP_LOCK = 378,
OP_ONCE = 379,
OP_CUSTOM = 380,
OP_COREARGS = 381,
OP_RUNCV = 382,
OP_FC = 383,
OP_PADCV = 384,
OP_INTROCV = 385,
OP_CLONECV = 386,
OP_PADRANGE = 387,
OP_REFASSIGN = 388,
OP_LVREF = 389,
OP_LVREFSLICE = 390,
OP_LVAVREF = 391,
OP_ANONCONST = 392,
OP_max
} opcode;

Perl parser structure

Perl's parser is generated using bison. Think of bison as a tool that adds functionality to the parser generator yacc.

Perl's syntax rules are written in "perly.y", and the bison command converts this file into the following files.

perly.h

perly.tab

perly.act

These files are C language files. In other words, the syntax rules written in "perly.y" are converted into C code.

"perly.c" is the source code of Perl's parser, and "perly.act" is included inside it.

// perly.c
int
Perl_yyparse (pTHX_ int gramtype)
{
  // ...

  // Tokenize
  parser->yychar = yylex ();

  // ...

  // Parsing (the grammar actions are included here)
# include "perly.act"

  // ...
}

Before parsing can proceed, tokens have to be generated. This is done in "toke.c", by the function yylex.

// toke.c
int
Perl_yylex (pTHX)
{
  // ...
}

Parser stack frame information

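The parser keeps a stack of frames while it parses the source code. Each frame holds a semantic value, the parser state, and some bookkeeping information.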

// parser.h
typedef struct {
    YYSTYPE val;          /* semantic value */
    short state;
    I32 savestack_ix;     /* size of savestack at this state */
    CV * compcv;          /* value of PL_compcv when this value was created */
} yy_stack_frame;

// perly.h
typedef union YYSTYPE
{
/* Line 2058 of yacc.c */
    I32 ival;             /* __DEFAULT__ (marker for regen_perly.pl;
                             must always be 1st union member) */
    char * pval;
    OP * opval;
    GV * gvval;

/* Line 2058 of yacc.c */
} YYSTYPE;

aTHX_ is expanded when Perl is compiled with threads enabled

When Perl is compiled with threads enabled, aTHX_ expands to the current interpreter (my_perl) followed by a comma, so that it can be passed as the first argument of a function call.

// perl.h
# define aTHX_ aTHX,
# define aTHX my_perl

If threads are not enabled, the macros are defined as follows: they expand to nothing and do nothing.

// perl.h
# define aTHX
# define aTHX_

Common abbreviations

The abbreviations used in variable names are fairly predictable, so here are some that appear often.

  • fn: abbreviation of finish
  • ix: abbreviation of index

Safefree just frees memory

Safefree is used to free the memory of a variable, and it simply frees memory. Think of it as free. The actual processing is hidden behind macros so that it can handle various build conditions.

// Example of use
Safefree (PL_exitlist);

Implementation:

// handy.h
#define Safefree(d) safefree(MEM_LOG_FREE((Malloc_t)(d)))

// perl.h
#define safefree safesysfree

// embed.h
#define safesysfree Perl_safesysfree

// util.c
Free_t
Perl_safesysfree (Malloc_t where)
{
    // ...
    if (where) {
        Malloc_t where_intrn = where;
        PerlMem_free (where_intrn);
    }
}

// iperlsys.h
#define PerlMem_free(buf) free((buf))

SvREFCNT_dec is called when freeing the memory of an SV type variable

In perl_destruct, the memory of the variables used by Perl is released. This is done as follows.

SvREFCNT_dec (PL_ofsgv);
PL_ofsgv = NULL;

An SV carries a reference count. Each reference taken with SvREFCNT_inc increments the count by 1, so when releasing the memory, the count is decremented by 1 with SvREFCNT_dec.

When the count reaches zero, the SV's own release mechanism finally frees the memory.

After that, NULL is assigned to the variable that held the reference.

Skeleton of the processing inside perl_parse

When a Perl script is parsed, a function called perl_parse is called. Here I extract the internal skeleton of this function.

// perl.c
int
perl_parse (pTHXx_ XSINIT_t xsinit, int argc, char ** argv, char ** env)
{
  // ...
  
  parse_body (env, xsinit);
  
  // ...
}

perl_parse calls parse_body. parse_body analyzes the command line options, opens the script, initializes the lexer, and then parses the Perl script using the lexer.

// embed.h
#define parse_body(a, b) S_parse_body(aTHX_ a, b)

// perl.c
STATIC void *
S_parse_body (pTHX_ char ** env, XSINIT_t xsinit)
{
  // Command line option parsing etc. happens here

  // Open the script file
  rsfp = open_script (scriptname, dosearch, & suidscript);

  // Initialize the lexer (tokenizer)
  lex_start (linestr_sv, rsfp, lex_start_flags);

  // Lexical analysis and parsing of the Perl script
  if (yyparse (GRAMPROG) || PL_parser->error_count) {
    // ...
  }
}

During this step, the lexer breaks the Perl script down into tokens, the smallest units of parsing, and the parser consumes them.

Skeleton of the Perl interpreter

Here I extract only the skeleton of the Perl interpreter. The processing flow is as follows. To go into the details starting from this skeleton, it's a good idea to read the Perl source code itself.

// miniperlmain.c
int
main (int argc, char ** argv, char ** env)
{
  
  // Memory allocation
  my_perl = perl_alloc ();
  
  // Build, initialize, etc. necessary information
  perl_construct (my_perl);
  
  // parse Perl script
  perl_parse (my_perl, xs_init, argc, argv, (char **) NULL);
  
  // Run the parsed Perl script
  perl_run (my_perl);
  
  // Discard Perl information
  perl_destruct (my_perl);
  
  // Free memory
  perl_free (my_perl);
}

Looking only at the skeleton, it's very simple.

Difference between EXTERN.h and INTERN.h

Perl has both EXTERN.h and INTERN.h. EXTERN.h is used when an extension module refers to Perl itself. You need to include EXTERN.h to make the global variables owned by Perl visible to the extension module.

INTERN.h, on the other hand, is used when compiling Perl itself. With it, the global variables are defined and their values are set.

SVp_IOK indicates a valid private integer

SVp_IOK indicates that the SV has a valid integer value that is private (not exposed). SVf_IOK, on the other hand, indicates that it has a valid public integer.

What does it mean to be public?

SVt_PVIV is a type that can hold both a string and an integer

SVt_PV indicates that the SV holds a string, and SVt_IV indicates that the SV holds an integer. SVt_PVIV indicates that the SV can hold both a string and an integer.

new_XPVNV

new_XPVNV allocates an XPVNV body, which is assigned to the sv_any field of an SV. It represents a number.

// sv.c
#define new_XPVNV() new_body_allocated(SVt_PVNV)

// sv.c
#define new_body_allocated(sv_type) \
    (void *)((char *)S_new_body(aTHX_ sv_type) \
             - bodies_by_type[sv_type].offset)

// sv.c
STATIC void *
S_new_body (pTHX_ const svtype sv_type)
{
    void * xpv;
    new_body_inline (xpv, sv_type);
    return xpv;
}

// sv.c
#define new_body_inline(xpv, sv_type) \
    STMT_START { \
        void ** const r3wt = &PL_body_roots[sv_type]; \
        xpv = (PTR_TBL_ENT_t *)(*((void **)(r3wt)) \
          ? *((void **)(r3wt)) : Perl_more_bodies(aTHX_ sv_type, \
                                     bodies_by_type[sv_type].body_size, \
                                     bodies_by_type[sv_type].arena_size)); \
        *(r3wt) = *(void **)(xpv); \
    } STMT_END

// sv.c
/*
 PL_body_roots[] array of pointers to list of free bodies of svtype
                 arrays are indexed by the svtype needed
*/


SV type definition

SV type definition.

// perl.h
typedef struct STRUCT_SV SV;

// perl.h
/* SGI's <sys/sema.h> has struct sv */
#if defined(__sgi)
#  define STRUCT_SV perl_sv
#else
#  define STRUCT_SV sv
#endif

// sv.h
struct STRUCT_SV {      /* struct sv { */
    _SV_HEAD (void *);
    _SV_HEAD_UNION;
};

// sv.h
/* start with 2 sv-head building blocks */
#define _SV_HEAD(ptrtype) \
    ptrtype sv_any;     /* pointer to body */ \
    U32 sv_refcnt;      /* how many reference to us */ \
    U32 sv_flags        /* what we are */

// sv.h
#define _SV_HEAD_UNION \
    union { \
        char * svu_pv;          /* pointer to malloced string */ \
        IV svu_iv; \
        UV svu_uv; \
        _NV_BODYLESS_UNION \
        SV * svu_rv;            /* pointer to another SV */ \
        struct regexp * svu_rx; \
        SV ** svu_array; \
        HE ** svu_hash; \
        GP * svu_gp; \
        PerlIO * svu_fp; \
    } sv_u \
    _SV_HEAD_DEBUG

sv_any is a pointer to the structure (the body) that stores the actual value held by the SV. sv_refcnt is the reference count. sv_flags holds the type and the flags of the SV: for example, that the type is an integer, or that the SV is read-only. I'm not sure how sv_u is used.

Types of SV

An SV can be one of various types, and these types are defined as the svtype enum.

typedef enum {
    SVt_NULL,       /* 0 */
    /* BIND was here, before INVLIST replaced it. */
    SVt_IV,         /* 1 */
    SVt_NV,         /* 2 */
    /* RV was here, before it was merged with IV. */
    SVt_PV,         /* 3 */
    SVt_INVLIST,    /* 4, implemented as a PV */
    SVt_PVIV,       /* 5 */
    SVt_PVNV,       /* 6 */
    SVt_PVMG,       /* 7 */
    SVt_REGEXP,     /* 8 */
    /* PVBM was here, before BIND replaced it. */
    SVt_PVGV,       /* 9 */
    SVt_PVLV,       /* 10 */
    SVt_PVAV,       /* 11 */
    SVt_PVHV,       /* 12 */
    SVt_PVCV,       /* 13 */
    SVt_PVFM,       /* 14 */
    SVt_PVIO,       /* 15 */
    SVt_LAST        /* keep last in enum. Used to size arrays */
} svtype;

Perl can have multiple interpreters

When reading the source code, it is important to understand that Perl can have multiple interpreters, and that functions access the information of the current interpreter.

Perl - Perl interpreter 1 (current interpreter)
     - Perl interpreter 2
     - Perl interpreter 3

At first, it's a good idea to follow the source code while imagining that only one interpreter is running.

You don't have to be aware of dVAR

The declaration dVAR comes up from time to time, but I don't think you need to pay attention to it.

# ifdef PERL_GLOBAL_STRUCT
# define dVAR pVAR = (struct perl_vars *) PERL_GET_VARS ()
# else
# define dVAR dNOOP
# endif

# define pVAR struct perl_vars * my_vars PERL_UNUSED_DECL

As far as I can see in the source code, PERL_GLOBAL_STRUCT is defined only when the OS is Symbian. So on most operating systems dVAR is dNOOP, and since dNOOP does nothing, dVAR does nothing.

// symbian/PerlApp.cpp:
# define PERL_GLOBAL_STRUCT

// symbian/PerlBase.h:
# define PERL_GLOBAL_STRUCT

System call wrappers are described in iperlsys.h

In the Perl source code, malloc and similar functions are not called directly. System calls are accessed through wrapper functions or macros, and these wrappers are defined in "iperlsys.h".

iperlsys.h

perl_alloc just allocates memory and initializes with 0

perl_alloc just allocates memory and initializes with 0.

Tips for easier reading

  • Ignore macros that contain ASSERT because they just call assert
  • PERL_UNUSED_ARG only suppresses warnings about unused arguments, so ignore it
  • Many of the macros used are in perl.h, proto.h, or handy.h
  • Helper-like functions are often found in util.c
  • The code inside #ifdef is often a special customization, so skip it for the time being
  • Read the inside of #ifndef, and read the part after #else in #ifdef ~ #else ~
  • Ignore the thread-related code at first

How to compile Perl source code

The method for compiling the Perl source code is described below.

How to compile Perl itself from source code

Perl entry point is perlmain.c

Perl's entry point is in perlmain.c. However, this file is not included in the Perl source code as distributed.

The rule that generates it is in the following Makefile.

# Cross/Makefile-cross-SH
perlmain.c: miniperlmain.c config.sh $(FIRSTMAKEFILE)
	sh writemain $(DYNALOADER) $(static_ext) > perlmain.c

If you execute make, "perlmain.c" will be generated.

Also, "perlmain.c" is almost the same as "miniperlmain.c"; only the following line is added.

/* Register any extra external extensions */
EXTERN_C void boot_DynaLoader (pTHX_ CV * cv);

In other words, "perlmain.c" is "miniperlmain.c" with the dynamic loading function added. So, if you look at "miniperlmain.c", you can see almost everything.

The OP type is struct op and is defined in op.h

The OP type that stores operator information is defined in op.h.

// perl.h
typedef struct op OP;

// op.h
struct op {
    BASEOP
};

// op.h
#ifdef BASEOP_DEFINITION
#  define BASEOP BASEOP_DEFINITION
#else
#  define BASEOP \
    OP * op_next; \
    OP * _OP_SIBPARENT_FIELDNAME; \
    OP * (* op_ppaddr) (pTHX); \
    PADOFFSET op_targ; \
    PERL_BITFIELD16 op_type:9; \
    PERL_BITFIELD16 op_opt:1; \
    PERL_BITFIELD16 op_slabbed:1; \
    PERL_BITFIELD16 op_savefree:1; \
    PERL_BITFIELD16 op_static:1; \
    PERL_BITFIELD16 op_folded:1; \
    PERL_BITFIELD16 op_moresib:1; \
    PERL_BITFIELD16 op_spare:1; \
    U8 op_flags; \
    U8 op_private;
#endif

/*
 * The fields of BASEOP are:
 *   op_next      Pointer to next ppcode to execute after this one.
 *                (Top level pre-grafted op points to first op,
 *                but this is replaced when op is grafted in, when
 *                this op will point to the real next op, and the new
 *                parent takes over role of remembering starting op.)
 *   op_ppaddr    Pointer to current ppcode's function.
 *   op_type      The type of the operation.
 *   op_opt       Whether or not the op has been optimised by the
 *                peephole optimiser.
 *   op_slabbed   allocated via opslab
 *   op_static    tell op_free () to skip PerlMemShared_free (), when
 *                !op_slabbed.
 *   op_savefree  on savestack via SAVEFREEOP
 *   op_folded    Result/remainder of a constant fold operation.
 *   op_moresib   this op is not the last sibling
 *   op_spare     One spare bit
 *   op_flags     Flags common to all operations. See OPf_* below.
 *   op_private   Flags peculiar to a particular operation (BUT,
 *                by default, set to the number of children until
 *                the operation is privatized by a check routine,
 *                which may or may not check number of children).
 */

The main routine of Perl's lexer (tokenizer) is yylex

The main routine of Perl's lexer (tokenizer) is yylex.

// toke.c
int
Perl_yylex (pTHX)
{
    dVAR;
    char * s = PL_bufptr;
    char * d;
    STRLEN len;
    bool bof = FALSE;
    const bool saw_infix_sigil = cBOOL (PL_parser->saw_infix_sigil);
    U8 formbrack = 0;
    U32 fake_eof = 0;

    // ...
}

Perl scripts are decomposed into tokens by this routine.

YYSTYPE is defined in perly.h

YYSTYPE is defined in perly.h.

PL_parser is the same as yy_parser

Through the definitions in embedvar.h and intrpvar.h, the current yy_parser is accessible under the name PL_parser.

// embedvar.h
# define PL_parser (vTHX->Iparser)

// intrpvar.h
PERLVAR (I, parser, yy_parser *)    /* current parser state */

// parser.h
typedef struct yy_parser {

    /* parser state */

    struct yy_parser * old_parser;  /* previous value of PL_parser */
    YYSTYPE yylval;         /* value of lookahead symbol, set by yylex () */
    int yychar;             /* The lookahead symbol. */

    /* Number of tokens to shift before error messages enabled. */
    int yyerrstatus;

    int stack_size;
    int yylen;              /* length of active reduction */
    yy_stack_frame * stack; /* base of stack */
    yy_stack_frame * ps;    /* current stack frame */

    /* lexer state */

    I32 lex_brackets;       /* square and curly bracket count */
    I32 lex_casemods;       /* casemod count */
    char * lex_brackstack;  /* what kind of brackets to pop */
    char * lex_casestack;   /* what kind of case mods in effect */
    U8 lex_defer;           /* state after determined token */
    U8 lex_dojoin;          /* doing an array interpolation
                               1 = @{...}  2 = ->@ */
    U8 lex_expect;          /* UNUSED */
    U8 expect;              /* how to interpret ambiguous tokens */
    I32 lex_formbrack;      /* bracket count at outer format level */
    OP * lex_inpat;         /* in pattern $) and $| are special */
    OP * lex_op;            /* extra info to pass back on op */
    SV * lex_repl;          /* runtime replacement from s/// */
    U16 lex_inwhat;         /* what kind of quoting are we in */
    OPCODE last_lop_op;     /* last named list or unary operator */
    I32 lex_starts;         /* how many interps done on level */
    SV * lex_stuff;         /* runtime pattern from m// or s/// */
    I32 multi_start;        /* 1st line of multi-line string */
    I32 multi_end;          /* last line of multi-line string */
    char multi_open;        /* delimiter of said string */
    char multi_close;       /* delimiter of said string */
    bool preambled;
    bool lex_re_reparsing;  /* we're doing G_RE_REPARSING */
    I32 lex_allbrackets;    /* (), [], {}, ?: bracket count */
    SUBLEXINFO sublex_info;
    LEXSHARED * lex_shared;
    SV * linestr;           /* current chunk of src text */
    char * bufptr;          /* carries the cursor (current parsing
                               position) from one invocation of yylex
                               to the next */
    char * oldbufptr;       /* in yylex, beginning of current token */
    char * oldoldbufptr;    /* in yylex, beginning of previous token */
    char * bufend;
    char * linestart;       /* beginning of most recently read line */
    char * last_uni;        /* position of last named-unary op */
    char * last_lop;        /* position of last list operator */
    /* copline is used to pass a specific line number to newSTATEOP. It
       is a one-time line number, as newSTATEOP invalidates it (sets it to
       NOLINE) after using it. The purpose of this is to report line
       numbers in multiline constructs using the number of the first line. */
    line_t copline;
    U16 in_my;              /* we're compiling a "my"/"our" declaration */
    U8 lex_state;           /* next token is determined */
    U8 error_count;         /* how many compile errors so far, max 10 */
    HV * in_my_stash;       /* declared class of this "my" declaration */
    PerlIO * rsfp;          /* current source file pointer */
    AV * rsfp_filters;      /* holds chain of active source filters */
    U8 form_lex_state;      /* remember lex_state when parsing fmt */

    YYSTYPE nextval [5];    /* value of next token, if any */
    I32 nexttype [5];       /* type of next token */
    U32 nexttoke;

    COP * saved_curcop;     /* the previous PL_curcop */
    char tokenbuf [256];
    line_t herelines;       /* number of lines in here-doc */
    line_t preambling;      /* line # when processing $ENV{PERL5DB} */
    U8 lex_fakeeof;         /* precedence at which to fake EOF */
    U8 lex_flags;
    PERL_BITFIELD16 in_pod:1;           /* lexer is within a =pod section */
    PERL_BITFIELD16 filtered:1;         /* source filters in evalbytes */
    PERL_BITFIELD16 saw_infix_sigil:1;  /* saw & or * or % operator */
    PERL_BITFIELD16 parsed_sub:1;       /* last thing parsed was a sub */
} yy_parser;

PADLIST

PADLIST is a list of lexical variables.

// perl.h
typedef struct padlist PADLIST;

// pad.h
struct padlist {
    SSize_t xpadl_max;          /* max index for which array has space */
    union {
        PAD ** xpadlarr_alloc;  /* Pointer to beginning of array of AVs.
                                   index 0 is a padnamelist * */
        struct {
            PADNAMELIST * padnl;
            PAD * pad_1;        /* this slice of PAD * array always alloced */
            PAD * pad_2;        /* maybe unalloced */
        } * xpadlarr_dbg;       /* for use with a C debugger only */
    } xpadl_arr;
    U32 xpadl_id;               /* Semi-unique ID, shared between clones */
    U32 xpadl_outid;            /* ID of outer pad */
};

PADNAMELIST

// perl.h
typedef struct padnamelist PADNAMELIST;

// pad.h
struct padnamelist {
    SSize_t xpadnl_fill;        /* max index in use */
    PADNAME ** xpadnl_alloc;    /* pointer to beginning of array */
    SSize_t xpadnl_max;         /* max index for which array has space */
    PADOFFSET xpadnl_max_named; /* highest index with len > 0 */
    U32 xpadnl_refcnt;
};

PADNAME

// perl.h
typedef struct padname PADNAME;

// pad.h
#define _PADNAME_BASE \
    char * xpadn_pv; \
    HV * xpadn_ourstash; \
    union { \
        HV * xpadn_typestash; \
        CV * xpadn_protocv; \
    } xpadn_type_u; \
    U32 xpadn_low; \
    U32 xpadn_high; \
    U32 xpadn_refcnt; \
    int xpadn_gen; \
    U8 xpadn_len; \
    U8 xpadn_flags

struct padname {
    _PADNAME_BASE;
};

ANY

ANY is a union that can hold a value of any of several types.

// perl.h
typedef union any ANY;

# ifdef UNION_ANY_DEFINITION
UNION_ANY_DEFINITION;
# else
union any {
    void * any_ptr;
    I32 any_i32;
    U32 any_u32;
    IV any_iv;
    UV any_uv;
    long any_long;
    bool any_bool;
    void (* any_dptr) (void *);
    void (* any_dxptr) (pTHX_ void *);
};
# endif

Be aware of whether the data structure is intended for arrays or pointers

// Is it intended to be an array of integers or a pointer to a single integer?
PERLVAR (I, markstack_max, I32 *)

STMT_START and STMT_END are a mechanism to safely execute statements in macros

STMT_START and STMT_END are a mechanism to safely execute statements in macros. You can think of them as "do {" and "} while (0)".

// Wrapping the body of a macro in a do-while loop lets the macro be used safely as a single statement.

#define SWAP(x, y) \
    do { \
        tmp = x; \
        x = y; \
        y = tmp; \
    } while (0)

PoisonNew fills memory with a fixed pattern

PoisonNew fills memory with a fixed byte pattern. Think of it as memset.

// handy.h
#define PoisonNew(d, n, t) PoisonWith(d, n, t, 0xAB)
#define PoisonWith(d, n, t, b) (MEM_WRAP_CHECK_(n, t) (void)memset((char *)(d), (U8)(b), (n) * sizeof(t)))

Renew corresponds to realloc

Renew corresponds to realloc. It also checks that the size of the memory to be allocated does not overflow.

// handy.h
#define Renew(v, n, t) \
    (v = (MEM_WRAP_CHECK_(n, t) (t *)MEM_LOG_REALLOC(n, t, v, saferealloc((Malloc_t)(v), (MEM_SIZE)((n) * sizeof(t))))))
#define Renewc(v, n, t, c) \
    (v = (MEM_WRAP_CHECK_(n, t) (c *)MEM_LOG_REALLOC(n, t, v, saferealloc((Malloc_t)(v), (MEM_SIZE)((n) * sizeof(t))))))

v is a pointer, n is the number of elements, and t is a type. The allocated size is n × sizeof(t).

I don't know what STRESS_REALLOC is

I don't know what STRESS_REALLOC is. Where can it be defined?

PERL_SI is stack information

PERL_SI is stack information.

// cop.h
struct stackinfo {
    AV * si_stack;              /* stack for current runlevel */
    PERL_CONTEXT * si_cxstack;  /* context stack for runlevel */
    struct stackinfo * si_prev;
    struct stackinfo * si_next;
    I32 si_cxix;                /* current context index */
    I32 si_cxmax;               /* maximum allocated index */
    I32 si_type;                /* type of runlevel */
    I32 si_markoff;             /* offset where markstack begins for us.
                                 * currently used only with DEBUGGING,
                                 * but not #ifdef-ed for bincompat */
};

typedef struct stackinfo PERL_SI;

Variables starting with PL_ may be global variables

Variables that start with PL_ can also be global variables. PL_Yes and PL_No are defined in perl.h.

// perl.h
EXTCONST char PL_Yes []
  INIT ("1");
EXTCONST char PL_No []
  INIT ("");

Variable names starting with PL_ are interpreter variables

Variable names starting with PL_ are interpreter variables, that is, variables that hold information about the currently running Perl interpreter. They are defined in embedvar.h.

// embedvar.h
# if defined(MULTIPLICITY)
/* cases 2 and 3 above */
# if defined(PERL_IMPLICIT_CONTEXT)
# define vTHX aTHX
# else
# define vTHX PERL_GET_INTERP
# endif

# define PL_AboveLatin1 (vTHX->IAboveLatin1)
# define PL_Argv (vTHX->IArgv)
# define PL_Cmd (vTHX->ICmd)
# define PL_DBcontrol (vTHX->IDBcontrol)
# define PL_DBcv (vTHX->IDBcv)

To explore further, see what "aTHX" looks like.

// perl.h
# define aTHX my_perl

// perl.h
PerlInterpreter * my_perl;

The type is "PerlInterpreter". The members of this structure are defined in a file called "intrpvar.h".

// intrpvar.h
PERLVAR (I, Latin1, SV *)
PERLVAR (I, UpperLatin1, SV *)  /* Code points 128 - 255 */
PERLVAR (I, AboveLatin1, SV *)
PERLVAR (I, InBitmap, SV *)

The PERLVAR macro joins its first two arguments to generate a member name such as "IAboveLatin1".

UNLIKELY is a macro described in perl.h

UNLIKELY tells the compiler to optimize for the case where the condition is false. Semantically, it is just the Boolean value of the condition.

// perl.h
#define UNLIKELY(cond) EXPECT(cBOOL(cond), FALSE)

// perl.h
#ifdef HAS_BUILTIN_EXPECT
#  define EXPECT(expr, val) __builtin_expect(expr, val)
#else
#  define EXPECT(expr, val) (expr)
#endif

// handy.h
#define cBOOL(cbool) ((cbool) ? (bool)1 : (bool)0)

// handy.h
# ifndef HAS_BOOL
# ifdef bool
# undef bool
# endif
# define bool char
# define HAS_BOOL 1
# endif

If you write "__builtin_expect (A, B)", it is a hint to the compiler that the expression A is expected to have the constant value B.

The following two conditional branches behave exactly the same, but with UNLIKELY the condition is expected to be false and the code is optimized for that case.

if (condition) {
  ...
}

if (UNLIKELY (condition)) {
  ...
}

There is also LIKELY, which expects the condition to be true and is optimized for that case.

I32 and I16 are integer types

The Perl core has the types I32, I16, U32, and U16. These are integer types.

  • I32: a signed integer type of at least 32 bits
  • I16: a signed integer type of at least 16 bits
  • U32: an unsigned integer type of at least 32 bits
  • U16: an unsigned integer type of at least 16 bits

I wondered why int or long is not used directly. It is probably to abstract the integer types, because Perl runs on many operating systems. (perlguts says that these are usually exactly 32 and 16 bits, but on Cray they are both 64 bits.)

SV is a scalar type

SV is a scalar type.

SV * sv;

AV is an array type

AV is an array type.

AV * array;

GV is a glob type

GV is a glob type.

GV * gv;

HV is a hash type

HV is a hash type.

HV * hash;

CV is a subroutine type

CV is a subroutine (code) type.

CV * cv;

pv is a string

These days I'm reading the core of Perl. It's hard to read because there are many abbreviations, and searching on Google does not turn up much information, so I'll make notes here.

pv is a string. If pv is included in a function name, I think it is a function related to strings.

pvn appears in the names of functions that take a string together with its length. I wonder if the n stands for number.

In terms of performance, pvn is probably faster because you specify the length yourself.

mg is magic

mg is magic. Magic is, well, magic. Is it additional processing?

Newx allocates memory

Newx allocates memory. Internally it does quite complicated things, but in essence I think it is malloc. The name suggests a new memory area.

const means a constant

The core of Perl is written in C. So if you can't read C, you can't read Perl's core. So make a note of the C language information as well.

const means a constant. In other words, it is a read-only variable that cannot be rewritten.

const came out 3 times

While reading the core of Perl, I came across something like this:

const char * const * const search_ext

const appeared three times, so I investigated what it meant, and each const applies to a different object. The first const applies to the char data at the end of the pointer chain, the second const applies to the char * pointers that search_ext points to, and the third const applies to the pointer search_ext itself.

va_list receives variadic arguments

In C, there is a way to receive a variable number of arguments. If you write ... in the parameter list, the function can receive variable arguments.

Type foo (argument1, ...);

To use this in a function:

va_list args;
va_start (args, argument1);
while (condition) {
  Type arg = va_arg (args, Type);
}
va_end (args);

va_list is the object that holds the variable arguments. With va_start, specify the name of the last argument before the variable arguments begin. The variable arguments can then be received one after another with va_arg. Signal the end of reading with va_end.
