Data Structures and Algorithms

I’m a big fan of the OWASP list of vulnerabilities.  I just learned of another resource.
Read a post recently suggesting that .INI files have a place in and around a dynamic language like Python.  The “security” specter was invoked.
This question on StackOverflow showed a profound confusion on fundamentals of OO.  The example, however, was kind of funny.
I read about a worthless project that purported to detect SQL Injection Attacks.  That’s lame: it’s easier to just use bind variables, which make your application simpler and faster as well as more secure.  A reader notes that bind variables are a topic of debate.  Really?  How are bind variables debatable?
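To make the point concrete, here is a minimal sketch of bind variables in Python’s DB-API, using `sqlite3` and a hypothetical `users` table as the example; the same `?` (or `:name`) placeholder style applies to any conforming driver.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

# A classic injection payload arriving as "user input".
user_input = "alice' OR '1'='1"

# Wrong: concatenating the value into the SQL text makes it injectable.
# query = f"SELECT * FROM users WHERE name = '{user_input}'"

# Right: the ? placeholder is a bind variable.  The driver keeps the
# value out of the SQL text entirely, so injection is impossible, and
# the prepared statement can be reused with different values.
rows = conn.execute(
    "SELECT id, name FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # [] -- the malicious string matches no actual name
```

The injection attempt simply fails to match anything, with no detection logic required.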
The requirements describe a kind of “broker” application that makes heavy use of a vendor’s web services.  Sadly, the requirements also give a lopsided view that leads to heavy refactoring.  Lesson learned: don’t take the requirements literally.
REST has some advantages over SOAP.  Django totally rules.  But the Django-REST interface causes me hand-wringing as I learn more about it.
The Pythonic distinction between __repr__ (“If at all possible, this should look like a valid Python expression that could be used to recreate an object with the same value”) and __str__ (“the ‘informal’ string representation of an object... a more convenient or concise representation [than __repr__]”) is very, very cool.
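A minimal illustration of that distinction, using a made-up `Money` class as the example:

```python
class Money:
    """__repr__ yields an expression that could recreate the object;
    __str__ yields the informal, reader-friendly form."""

    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency

    def __repr__(self):
        return f"Money({self.amount!r}, {self.currency!r})"

    def __str__(self):
        return f"{self.amount:.2f} {self.currency}"

m = Money(9.99, "USD")
print(repr(m))  # Money(9.99, 'USD') -- a valid Python expression
print(str(m))   # 9.99 USD -- concise and convenient
```

Note that `eval(repr(m))` rebuilds an equivalent object, which is exactly the property the documentation asks `__repr__` to have “if at all possible”.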
People get confused by concurrency.  Folks often fetishize some feature or other.  This is about the “equal-sized partitions” fetish.  Other fetishes include locking and I/O processing.
For PyCon ’07 I presented a paper on how delightfully simple it is to use Python to conform dimensions in a data warehouse.  The algorithm boils down to the setdefault method of a dictionary.  Recently I was asked about using this for “processing gigs of incoming fact data each day”.
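The heart of the idea can be sketched in a few lines.  This is a simplified illustration, not the paper’s full algorithm: the dimension is a dict from natural key to surrogate key, and `setdefault` does the lookup-or-insert in one step.

```python
# Conform a dimension: map each incoming natural key to a single
# surrogate key, minting a new key only on first sight.
dimension = {}  # natural key -> surrogate key

def conform(natural_key):
    # setdefault returns the existing value, or inserts the default.
    # len(dimension) + 1 is evaluated before the insert, so the first
    # new key gets 1, the next gets 2, and so on.
    return dimension.setdefault(natural_key, len(dimension) + 1)

# Hypothetical incoming fact data, keyed by state.
facts = ["NY", "CA", "NY", "TX", "CA"]
rows = [(conform(state), state) for state in facts]
print(rows)  # [(1, 'NY'), (2, 'CA'), (1, 'NY'), (3, 'TX'), (2, 'CA')]
```

For “gigs of incoming fact data”, the dict lookup is O(1) per row, so the whole pass stays linear in the size of the input.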
Here are some software defects so typical that I’ve collected them into a handy short list with acronyms.  I’ve also got a specific technique for remediating those awful Everything In Main programs.
When you address a problem by creating a spreadsheet, you now have two problems.  Sigh.

I had a brain-cramping problem with XML, X12 and the need to support a variety of use cases.  Coincidentally, Ian Bicking posted something that led directly to a much more elegant solution.

The timing was an amazing piece of serendipity – or synchronicity – or luck.

A hot topic – more thoughts flow in from all sources.  Excellent points.  Thanks for thinking.

Got a bunch of physical design questions recently.  The conversation is made more complex by the way CA ERwin throws around terminology; specifically, its misuse of “physical”.

The questions were surprising to me.  They seemed to reveal a tenuous grasp on what a database really is – structured, persistent storage.  Somehow, peripheral features seemed to have grown to dominate the conversations.

Recently, I worked out the performance implications of two implementations of open-ended date ranges.  The next topic is the handling of different date resolutions.  Bottom Line: Time is Simple, but you can make it complicated.

(Revised to include another DW DateTime technique.)

What’s the “best” way to handle open-ended date ranges in SQL?  Use NULL for the end-date and horse around with IFNULL or COALESCE functions?  Or use a date in the impossibly far future?  This is sometimes called the “Domain Specific Null” problem.  I thought the answer was obvious until I ran some tests.
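The two designs can be sketched side by side.  This is a minimal illustration using `sqlite3` and an invented `rate` table, not the benchmark behind the post; the trade-off to notice is that the NULL design wraps a function around the column in every predicate, while the far-future sentinel allows a plain BETWEEN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rate (value REAL, start_date TEXT, end_date TEXT)")

# Design 1: NULL marks "no end yet"; every query needs COALESCE.
conn.execute("INSERT INTO rate VALUES (1.25, '2007-01-01', NULL)")
row = conn.execute(
    "SELECT value FROM rate "
    "WHERE start_date <= ? AND ? <= COALESCE(end_date, '9999-12-31')",
    ("2007-06-01", "2007-06-01"),
).fetchone()
print(row)  # (1.25,)

# Design 2: store the far-future sentinel directly; the predicate
# becomes a plain BETWEEN with no function wrapped around the column.
conn.execute("UPDATE rate SET end_date = '9999-12-31' WHERE end_date IS NULL")
row2 = conn.execute(
    "SELECT value FROM rate WHERE ? BETWEEN start_date AND end_date",
    ("2007-06-01",),
).fetchone()
print(row2)  # (1.25,)
```

Which one is actually faster depends on the optimizer; a function applied to a column can defeat an index on that column, which is why measuring – as the post does – beats assuming.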

I had some configuration files in .INI format and .XML format.  Both were a large pain to work with.  I rewrote them into a massive Python object creation expression and – whoops! – ran into an interesting scalability issue.

[Thanks for the feedback; I’ve revised and extended this post.]

XML config files have their place – in standards.  .INI files have their place – in legacy programs.  Here are some more Python configuration file techniques that I’ve used to parse X12N messages.  I think there are two design patterns here: Structural Declaration and Bundled Properties.
XML-based configuration files are fine – when you’re struggling with Java.  INI files are just creepy because they seem to be Yet Another Syntax.  However, Python absolutely rules as a configuration language.
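Here’s a minimal sketch of what “Python as a configuration language” can mean.  The config text, module name, and settings below are all invented for illustration: the configuration is just Python source, so the “parser” is `exec` (or a plain `import`, when the config lives in a real module).

```python
# Hypothetical contents of a settings.py config file: ordinary Python
# expressions and object constructions, no special syntax to learn.
CONFIG_SOURCE = """
class Server:
    def __init__(self, host, port):
        self.host = host
        self.port = port

servers = [
    Server("alpha.example.com", 8080),
    Server("beta.example.com", 8081),
]
retries = 3
"""

# Evaluating the config populates an ordinary namespace dict.
namespace = {}
exec(CONFIG_SOURCE, namespace)
print(namespace["retries"])          # 3
print(namespace["servers"][0].host)  # alpha.example.com
```

You get comments, expressions, conditionals and real objects for free – everything an .INI dialect or XML schema would have to reinvent.  (The obvious caveat: only do this with config files you trust, since they execute as code.)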
Not News: Formal Methods called into question.  Silly: Metaphorical alignment of formal methods with perpetual motion.
Really complex data is distorted when viewed through the Relational Database lens.  This posting is a rather peculiar example of how a relational model can be nearly impossible to implement.  It’s another Python To The Rescue story.
Got an interesting question last week, asking me to either compare and contrast Oracle’s Pipeline Table Functions and OS features, or to produce a taxonomy of mechanisms for pipeline processing.  Bonus! Information on XPL, the XML Pipeline Language.
There were many thoughtful comments to my various posts on PL/SQL vs. Java performance.  And some confusion, created – obviously – by me.
Some comments via e-mail: “How can you say Java is faster?”  “You didn’t use PL/SQL properly.”  And some thoughts on scalability and management.
Some important questions have come up. Generally, these are common, almost standard questions. I failed to address these; mea culpa.
This question came up twice in the past few days: what’s the fastest way to do RDBMS processing? I found “Java vs. PL/SQL: Where Do I Put the SQL?”, but it was misleading. Indeed, it was not really appropriate to any kind of real-world problem.
See “Spotting Accountants from Twenty Paces” and “Spreadsheet Risk”. I found EUSPRIG, which may be helpful in justifying the replacement of a spreadsheet with software that implements some auditable controls. Further, there’s Panko’s “What We Know About Spreadsheet Errors”.
Sometimes a spreadsheet is two things: some input values that are part of a larger application, plus some up-front calculations that are helpful if they are presented to the user immediately. The calculation part is easy, it’s that “Larger Application” that presents the problem.
The problem with spreadsheets is that they work and they’re effective for a large number of problems. They provide a uniquely rich kind of functionality, and some use cases are a real pain without them. It’s all good until you want the solution to scale to a second person. Then they fall apart. If spreadsheets didn’t do some things so well, we wouldn’t be in the Scale Our Spreadsheet™ (SOS) situation. What to do?
Here’s the full text of the question: “in your personal blog would you be willing to discuss ‘a common code base without a common data model’?” My first thought is, “You’re kidding, right?” My second thought is, “What is the source of this confusion?”

I think this is the objection in this comment; it’s hard to be sure: People define the business calendar, but that definition doesn’t – or shouldn’t – or can’t – include the first or last business day of the month.

Why can’t it? It’s hard to say; indeed, with the tiny scraps of information available it’s hard to imagine someone could even provide a recommendation for calculating something so poorly defined.

Not Reingold and Dershowitz – their stuff really works. No, it’s the broader class of “everybody else” – those who think they understand the calendar – whom I find lacking in humility.
This table shows the direct cost of fragmentation in size and time. Denormalized table designs have the worst fragmentation. A semi-normalized table optimizes cost of fragmentation and query performance.
The MESS, besides being complex, uses a lot of space, but performs well. A partially normalized representation seems optimal. The fully normalized version makes maintenance and enhancement easy, but does this at the cost of performance.
Fragmentation = Slow; Normalization helps locking and prevents update anomalies. While all true, these issues aren’t on the horizon of people designing or using MESS uni-tables.
Three sample table designs that reflect degrees of normalization to control storage fragmentation.
Why prevent problems? Why characterize storage defragmentation as “lots of additional processing”?

The hallmarks of this MESS design are a large number of optional columns, a large number of null attribute values, and generally sparse data. The CREEP design creates each row with an almost indefinite number of features, including events, conditions, services, processes and relationships. The attributes may not be sparse, but they grow without any practical boundary, and the naive mapping from attribute to column is often inappropriate.

The biggest consequence of a MESS + CREEP design is that we have columns which are initially null, but get filled with large text comments or dates. Before too long we have highly fragmented storage. How do we prevent storage fragmentation and the associated slow-down?
