2012-03-31

subfiler after learning some Python

adde/subfiler

2.23, 3.15: intro:

. this subprogram inputs a file with subfile syntax:
ie, a file can represent a folder containing files
(files within files, hence the term subfiles).
. each file having the expected subfile syntax
starts with a datestamp-label,
and then a title representing the folder's pathname,
. each subfile should have a datestamp;
if its title is not already preceded by a datestamp,
the date to use is that of the previous subfile,
else that of the top-level title;
it's an error if the top-level title has no date .
3.12:
. within a subfile, the timestamps don't matter:
they are internal to the subfile,
and are not to be applied to subsequent subfiles .

3.29: version#batch mode:
. it has an input folder named subfiler/,
and if there is no report.txt file in that folder, it makes one;
if the report.txt is there, then it does the job:
for each file in subfiler besides report.txt
it makes a folder of the same name, and fills it with subfiles .
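
. a minimal sketch of that batch loop, assuming Python 3's pathlib;
write_subfiles is a hypothetical callback standing in for the real parsing,
the report's content is a placeholder,
and "same name" is assumed to mean the file's name without its .txt ending
(it also skips anything that is not .txt):

```python
from pathlib import Path

def run_batch(folder, write_subfiles):
    """Batch mode over one input folder.
    write_subfiles(src, dest) is a hypothetical callback that
    parses the file src and writes its subfiles into folder dest."""
    folder = Path(folder)
    report = folder / "report.txt"
    if not report.exists():
        # first run: only make the report, process nothing
        report.write_text("subfiler report\n")
        return []
    made = []
    for src in sorted(folder.glob("*.txt")):
        if src.name == "report.txt":
            continue
        dest = folder / src.stem  # assumed: same name minus ".txt"
        dest.mkdir(exist_ok=True)
        write_subfiles(src, dest)
        made.append(dest)
    return made
```

. the first call on a fresh folder only creates report.txt;
the second call does the actual work .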

version#interactive mode:
3.15:
. it displays what it understands to be the folder structure,
and gets the user's ok to do one of the following:
# do nothing,
# create a folder of the given structure,
# create the subfiles and do a filesystem merge .

2.23:  the displayed TOC (table of contents):
. the display shows a hierarchical relationship
by way of indentation; eg,
title
 . . . subfile#1
 . . . . . paragraph#1 1st line
 . . . . . paragraph#2 1st line
 . . . subfile#2
  ...
. it shows titles of each of the subfiles
along with first lines of any paragraphs
not assigned to separate files,
showing the relationship as a hierarchy with indentation .
3.25:
. this assures the user that the parsing is correct;
no subfiles were mistaken for paragraphs,
and no paragraphs were mistaken as subfiles .

subfile naming strategy:
3.29:
. for each file from the input folder,
it's creating a collection of files,
so, we have a choice of naming strategies .
. if we use names that start with the date,
then files within a topic can be sorted by date
-- the order in which ideas on the topic evolved .
. the file's modification date is meaningless,
showing only when the subfile was generated,
so the content creation date is noteworthy .
. more than one date? use the first date .
. if its name includes the entire pathname,
then later it can use the file's name to know
where in the obj folder system to put the file
(without having to open the file).
. perhaps the pathname should be reversed,
so that root is at the end .
if the filename exists,
add a number, looping until unique .
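
. a small sketch of that uniqueness loop;
where the number goes (before the extension, after a space)
and the name unique_name are assumptions, not settled choices:

```python
import os

def unique_name(folder, name):
    """Return name, or name with a number added,
    such that nothing of that name exists in folder yet."""
    stem, ext = os.path.splitext(name)
    candidate = name
    n = 1
    while os.path.exists(os.path.join(folder, candidate)):
        n += 1
        candidate = "%s %d%s" % (stem, n, ext)
    return candidate
```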
3.15: 3.25:
. each subfile has a title which is used for generating
a name for the file that the subfile will be stored in .
. it follows portability rules for filenames,
so it needs to strip out or replace all but ascii
alphanumerics, and ( , ; . );
some punctuations get transformed like so:
? -> " KA"
! -> " BANG "
: / \ | -> " l " -- lowercase L .
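
. a sketch of these title-to-filename rules;
it assumes that spaces are kept too,
that the parentheses above only group the list ( , ; . ),
and that runs of spaces left by the replacements get collapsed;
filename_from_title is a name made up for illustration:

```python
import string

REPLACEMENTS = {"?": " KA", "!": " BANG ",
                ":": " l ", "/": " l ", "\\": " l ", "|": " l "}
# assumed: the space is kept along with alphanumerics and , ; .
KEEP = set(string.ascii_letters + string.digits + " ,;.")

def filename_from_title(title):
    out = []
    for ch in title:
        if ch in REPLACEMENTS:
            out.append(REPLACEMENTS[ch])
        elif ch in KEEP:
            out.append(ch)
        # anything else is stripped out
    # collapse runs of spaces and trim the ends
    return " ".join("".join(out).split())
```

. eg, filename_from_title("adde/subfiler: why?")
gives "adde l subfiler l why KA" .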

3.15: superfiles:

. the top of a file should have a date and title;
these are called the file`date and file`title .

. the file`date format is year.month.day;
the year has just 2 digits (eg, 12 means 2012 );
if the file`date's day =1 then the file`date is a generic:
it's telling you only the year and the month;
subsequent timestamps are of the format:
month.day;
and, these are meant to give the actual day .
. if subsequent timestamps include the year,
it's because the file's dates are spanning multiple years .
[3.28:
. when sending the subfiles to their separate files,
this app should be prefixing the titles with full datestamps,
(ie, including the year) ].
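
. a sketch of expanding a short timestamp into a full datestamp;
it assumes 2-digit years are all 20xx,
an output format of yyyy.mm.dd,
and it handles only the plain month.day and year.month.day forms
(not ranges, multistamps, or the hour+minutes form);
full_datestamp is a hypothetical name:

```python
def full_datestamp(file_date, stamp):
    """file_date is 'yy.mm.dd' (the file`date);
    stamp is 'm.d', or 'yy.m.d' when it carries its own year.
    Returns 'yyyy.mm.dd'."""
    file_year = int(file_date.split(".")[0])
    parts = [int(p) for p in stamp.split(".")]
    if len(parts) == 2:
        # no year in the stamp: take it from the file`date
        parts = [file_year] + parts
    year, month, day = parts
    return "%04d.%02d.%02d" % (2000 + year, month, day)
```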

. the file`title shows where in the filesystem
the subfiles need to be placed,
eg, for the name: adde/subfile,
there is a folder named adde,
containing a subfolder named subfile .
. however,
there are some generic subfolders; eg,
the file`title can be named cyb/pim
(pim= personal info mgt);
but the files within are using a shortcut
that leaves out the /pim subfolder name:
cyb/fs means cyb/pim/fs .

3.15: project files:

. if the file`title has subfolders
and then the file includes any chunks (defined as
any sequence of paragraphs having no subfile title;
where title is defined as a pathname whose
root name is one of the user's root subjects)
then this should be considered a project file
meaning this whole file represents a folder
rather than representing a batch merge .

. for example, say that we have a file`title of
pol/purges/reaganomics/housing market crash/-
housing stocks vs housing bonds;
and, we find chunks in this file,
then we see that part of that path
is already in the filesystem:
pol/purges/reaganomics/housing market crash,
so we have one level of new folder to make:
(housing stocks vs housing bonds);
and, since this is a project folder,
we give it a datestamped name:
(2008.06.16 housing stocks vs housing bonds).
. then, within that new folder,
we divide the project into number-titled subfiles:

. if a section does have a title, then its new title is
its old title prefixed with the assigned serial number; eg,
01 -- a chunk's title is nothing but the assigned serial number;
02 how bonds differ from stocks
03 how bonds crashed the market
04 -- another chunk
05 web --[a generic title for copied stuff .]
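
. the numbering can be sketched like so,
assuming an untitled chunk is represented by None
and that 2-digit serial numbers are enough;
numbered_titles is a made-up name:

```python
def numbered_titles(sections):
    """sections: list of titles, with None for an untitled chunk.
    Returns the new titles, each prefixed with a serial number."""
    names = []
    for serial, title in enumerate(sections, start=1):
        number = "%02d" % serial
        names.append(number if title is None else number + " " + title)
    return names
```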

2.23, 3.12, 3.15: subfile formats:

. there are 2 main subfile formats,
that indicate whether a subfile's body
can include blank lines;
a body that does include blanklines
is said to be segmented, a seg'd subfile .

format# non-seg'd subfile:
. after the title is identified,
the next line is not a blankline,
and so the end-of-subfile (end of the title's body)
is determined by the next blankline .

format# seg'd subfile:
. if the title is preceded by a multi-blankline,
and followed by a single blankline,
then the body is segmented (it includes blanklines),
so the end-of-subfile is determined by
the next multi-blankline
(more than 1 blankline consecutively).
3.12:
. within a seg'd subfile,
it prints the first line of every segment
(the line after a single blankline).
. indentation of that printing
shows the seg belongs to the seg'd subfile's title .

being tolerant of format errors:
3.13:
. if the title is preceded by a multi-blankline,
but not followed by a blank,
then it's most likely a non-seg'd subfile;
but keep an eye out for a blankline followed by a non-title:
in that case there has been a formatting error,
so revise the context assumption
to that of being in a seg'd subfile .
[3.28:
. the app must adapt to such formatting errors
by providing the missing blankline after the title .]
3.15:
. if in a seg'd subfile,
and a seg appears to start with a subfile title,
then alert the user to a possible formatting error .
2.28:
. but continue to treat it as a seg; because,
one purpose of the seg'd-subfile format
is to indicate that a collection of small articles
should be packed into one subfile .

3.29: was-blogged format:

. a complication of finding subfiles
is the inclusion of relatively new formats that are
documenting how subfiles have been blogged .
. if the blog entry involved more than one subfile
then the usual multi-blankline system is used to indicate
how many of the following subfiles and chunks
are associated with a was-blogged header .
. a was-blogged header has this format:

title of blog -- not a subfile title
(optional list of keywords)
blog's url .

. even if there is only one subfile included,
this use of a was-blogged header may still occur
if the blog's title differed from the subfile's title .
. in these cases the was-blogged header should be
copied to each of the enclosed subfiles .
. a complication of this would be a project format
where the blog included not only subfiles
but also untitled paragraphs (chunks)
meant to explain how the subfiles were related .
. in that case the project folder format should be used
( each chunk and subfile become
files within a folder having the blog entry's name).
. the date can be listed as yymm00 or yymm99 .

3.29: targets .txt only:
. input folders may include .html as well as .txt
as a record of what has been blogged;
subfiler should be processing only .txt files .

3.25: 3.28: the parameter file:
. the parameter file should contain a list of the user's
root or top-level subjects .
. in order for a pathname to be considered a subfile title
its root name needs to be included in this parameter file .
. eg, here's part of my parameter file:
addx, adda, adde, addm,
adds, math, engl,
cyb, me
care, gear, bank, cook, med, wealth
psy, pol, relig
--. the list is comma-delimited in case
the user's root names include spaces .
. we parse by splitting on commas,
then strip any leading or trailing spaces .
. we read a line at a time,
so a newline can optionally serve as a comma;
we discard any empty strings
produced by having a comma at the end of string .
. we assume the parameter file is in the current directory .

incremental progress:

2.23:
. the next version can integrate an editor
for adjusting the mistakes that the user or it made;
the first version just displays an outline
that includes the line numbers
to help you do the file syntax reviewing,
using your usual editor to visit those line#'s .
[2.29:
. it can print every line preceded by a blank line,
print a blank line if the file has
2 or more blank lines in a row .
. or, it generates a file of numbered lines
rather than printing them;
then after you adjust any lines in your editor
and run subfiler on this file,
it copies your changes back to the original file .
it then proceeds to do the subfiler routine .]

2.23:
. a next version might move things to proper folders
according to what the subfile paths are .
. the expectation is that a reviewer has
checked the syntax and made the changes to the file path
so that in case it was written a long time ago,
the path is changed to become currently relevant .
. the first version just generates a folder of files
and leaves that in the current working directory .

structuring a list of lines into an outline tree:

2.23?:  3.27:
. a good function to start with is:
[number of blank lines]( [next line]/.string ).integer
. it returns with the number of blanklines it passed over,
and also modifies the string pointer it was passed
so [next line] contains the next non-blank line .
. if [next line]`length =0,
then eof has been reached .
3.25:
. after reading how to use python,
I'm thinking I'd rather
read the file into a list of strings,
so that line[i] is representing
the string at the ith linenumber .
. then I could make a tree whose leaf nodes contained
integers which were indices into the line[i] list .
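
. both ideas can be combined in one pass, sketched here:
with the file already in a list of strings,
it returns for every non-blank line
the number of blanklines passed over and the line's index;
nonblank_lines is a hypothetical name,
and the indices are assumed to be 0-based:

```python
def nonblank_lines(lines):
    """Return (blanks_before, index) for every non-blank line,
    where blanks_before counts the blank lines just passed over."""
    result = []
    blanks = 0
    for i, text in enumerate(lines):
        if text.strip() == "":
            blanks += 1
        else:
            result.append((blanks, i))
            blanks = 0
    return result
```

. a blanks_before of 2 or more then marks a multi-blankline,
which is what separates seg'd subfiles .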
3.27: 3.28:
. the file is represented by a list of lists,
where the inner lists represent the subfiles,
and the outer list represents the project folder
or merge command arguments .
. the structure of the inner lists should make it easy to
generate the outline being displayed:
the first column of an outline is the line#,
the 2nd col is the datestamp,
and 3rd col is the title;
hence, these will be 1st, 2nd, and 3rd items
of the inner lists .
. subsequent integers of an inner list
are showing line#'s of segments (paragraphs)
whose first lines should be indented under a title .
eg,
[ file`title
, chunk
, [title`line#, '1.2', subfile#1`title, seg#1, seg#2]
, [title`line#, '1.2', subfile#2`title, seg#1, subtitle, seg#2]
] .
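
. a sketch of generating the outline from such inner lists;
it assumes 0-based line numbers,
covers only inner lists whose trailing items are all integers,
and borrows the " . . ." indentation from the TOC display;
outline_rows is a name invented for the example:

```python
def outline_rows(lines, subfiles):
    """lines: the file as a list of strings.
    subfiles: inner lists [title_line, datestamp, title, seg_line, ...].
    Returns the outline as a list of display strings."""
    rows = []
    for entry in subfiles:
        title_line, datestamp, title = entry[:3]
        rows.append("%4d %s %s" % (title_line, datestamp, title))
        for seg_line in entry[3:]:
            # show the first line of each segment, indented under the title
            rows.append("%4d  . . . %s" % (seg_line, lines[seg_line].strip()))
    return rows
```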

title format:
2.23:
. the title may be missing a colon,
but a subfile title has one div in it,
and is preceded by at least one blank line .
3.25:
. but if the root of the title's pathname
is not among the expected roots
then the meaning depends on context:
# in a seg'd subfile:
. this is a subtitle of the current subfile .
# in a non-seg'd subfile:
. this is a new subfile,
so assume the pathname is
file`title & current-title .

3.11: 3.27: identify titles:

. a typical title has this pattern:
1 or more blanklines followed by this pattern:
qualifier.domain`domparts/title/subtitle:
eg,
adde/title:
co.adde/title:
aq.gear/obj/title:
apt`wildlife/obj/title:

. if the suspected title is missing a (:) at the end,
or the (:) was replaced by a (;)
but has otherwise got a title syntax,
then it is still surely a title .
. on the other hand,
if the title candidate starts with a ". "
then this situation is a sentence referring to a title,
not the title itself .
. if the title candidate starts with a datestamp:
1.2: -- that is month.day, or
1.2.345: -- month.day.hour+minutes,
1.2 .. 8: -- cross-day range
1.2 .. 2.1: -- cross-month range
1.2, 2.3: --comma-separated multistamp
1.2: 2.3: -- simple multistamp
then remove the datestamp and test again,
since there may be multiple datestamps .
. one date stamp may apply to successive subfiles,
so it may not be there,
and if it's not, then this app needs to add it .
. if there are multiple timestamps in subfile#i,
and then no timestamps in subfile#(i+1),
then that is an error the user must be alerted to:
"(which of multiple timestamps applies to subsequent subfiles?)".
. this app's output needs to indicate
what it thought was the right timestamp,
eg, if it had to propagate a timestamp from earlier subfiles,
it can use this syntax:
/1.2/: title: .

[3.27:
. the way to get the title is split on (/),
called the divparts list;
if no (/) exists in string,
return [this is not a title] .
. the dotparts list is split divparts#0 on (.);
domain`= dotparts#end;
if domain has a (`) in it,
then domain`= domain#(start:1st, stop:find(`));
if domain in (list of known domain names)
then this is a title,
else alert user to ambiguity:
is this a domain not yet in domain list? ]

3.22: sci: may not need to worry about unicode:
. the parts of the string I'm dealing with are just ascii,
and the rest are just being written in the same way they are read:
the extended bytes just look like ascii extended chars .

some details with some python code:

3.28:
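def dateSplit(line):
 """. the quickest way to get past all datestamps is to
 check that the line starts with a number,
 and then remains in "1234567890., :;?" .
 . it splits the given string into
 (datestamps, remainder of string ).
 """
 stripped = line.strip()
 # an empty line, or one not starting with a digit, has no datestamp
 if not stripped or stripped[0] not in "1234567890":
   return ('', line)
 # scan the stripped line, so the index matches the string tested above;
 # if no other char is found, the whole line is datestamp
 result = len(stripped)
 for i in range(len(stripped)):
   if stripped[i] not in "1234567890., :;?":
     result = i; break
 return (stripped[:result], stripped[result:])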

3.28:
. how to tell if datestamp is a single, not a range of dates?
pick off trailing {:;}, and find "..", ":", ";"
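
. a sketch of that test, applied to the datestamp part that dateSplit returns;
counting a comma as a multistamp marker is an addition to the list above
(taken from the "1.2, 2.3:" example),
and datestamp_kind with its four return values is hypothetical:

```python
def datestamp_kind(stamps):
    """stamps: the datestamp part of a line, eg '1.2 .. 2.1:'.
    Returns 'none', 'single', 'range', or 'multi'."""
    # pick off the trailing : or ; before looking inside
    core = stamps.strip().rstrip(":;").strip()
    if core == "":
        return "none"
    if ".." in core:
        return "range"
    if ":" in core or ";" in core or "," in core:
        return "multi"
    return "single"
```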

3.28: the parameter file:

def DomainsInFile():
 """. this returns the set of user domain names
 that it found in the parameter file, parameter.txt .
 """
 domains = set()
 with open('parameter.txt') as paramfile:
   for line in paramfile:
     nextset = set(map(str.strip, line.split(',')))
     # update() changes domains in place;
     # union() only returns a new set, leaving domains empty
     domains.update(nextset)
 domains.discard('') # from a trailing comma or a blank line
 #. now the domains set contains the strings
 # that define the root of a subfile title .
 return domains

3.11: 3.27..28:

def hasTitleFormat(candidate):
 """. needs global list of strings, DomainList,
 and returns truth of candidate has a format of:
 a.b.Domain`part/title/subtitle .
 . the dots  and (`) are optional, the div's are not,
 and Domain must be in DomainList .
"""
 nodiv = (candidate.find('/')== -1)
 if nodiv:
   isTitle = False
 else:
   divparts = candidate.split('/')
   dotparts = divparts[0].split('.')
   candidateDomain = dotparts[-1]
   index = candidateDomain.find('`')
   if index > -1: candidateDomain= candidateDomain[:index]
   isTitle = (candidateDomain in DomainList)
 return isTitle

# init
DomainList = DomainsInFile() # global
# say line is the current line
date, candidate = dateSplit(line)
isTitle = hasTitleFormat(candidate)
# if an internet domain, then candidate is an author identifier .