Data Sharing and America Competes Act
by: G.E. Ozz Nixon Jr.
Published: August 2007
Copyright © 2009 by Friends of FPC
Today, August 9th 2007, President George W. Bush signed into law the "America
Creating Opportunities to Meaningfully Promote Excellence in Technology, Education,
and Science Act", requiring civilian federal agencies to provide guidelines, policies
and procedures to facilitate and optimize the open exchange of data and research
between agencies, the public and policymakers.
Those of you who are familiar with my work on data replication, store-n-forward
technologies, data dictionary compression, and narrow and wide band satellite
communications for world news broadcasts know this is an exciting step for those
of us in the field of moving, searching and reporting data of all types.
The National Institutes of Health (NIH) defines "data" as "recorded information,
regardless of the form or medium on which it may be recorded, and includes writings,
films, sound recordings, pictorials, drawings, procedural manuals, forms, diagrams,
work-flow charts, data files, data processing, statistical records, and other
research data". This, in short, is a realistic listing of the types of data that
must be searchable, and it shows that data is more than simple records in an SQL
database engine. That in turn tears down "Data Replication" as a useless
phrase used to market mere "SQL database copying" techniques.
As mentioned, I have experience in narrow and wide band communications for world
news: designing UPI (United Press International, founded in 1907) structures and
the actual encoding and decoding software for companies like Planet Connect. This
information is broadcast over satellites the world over in real time, one way, out,
which makes searching and hashing techniques a requirement for activities happening
now across the globe. I have to support the potential of 7-bit platforms and translate
all images into the LCD (lowest common denominator), pixel RGB encoding, as 7-bit,
without data loss and without introducing a noticeable latency.
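One way to picture lossless 7-bit transport is to re-pack the 8-bit RGB bytes into
7-bit values. The sketch below is only an illustration under that assumption; the
function names pack_7bit and unpack_7bit are hypothetical and this is not the
encoding used in the broadcast software described above.

```python
def pack_7bit(data: bytes) -> bytes:
    """Re-pack 8-bit bytes so that every output value fits in 7 bits."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of unread bits held in acc
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        # Left-align the leftover bits in one final 7-bit value.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)


def unpack_7bit(packed: bytes) -> bytes:
    """Inverse of pack_7bit; trailing padding bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for value in packed:
        acc = (acc << 7) | (value & 0x7F)
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)


# Two example RGB pixels flattened to raw bytes (values chosen arbitrarily).
pixels = [(255, 0, 128), (12, 34, 56)]
raw = bytes(channel for pixel in pixels for channel in pixel)
packed = pack_7bit(raw)
assert all(value < 128 for value in packed)   # safe for a 7-bit channel
assert unpack_7bit(packed) == raw             # no data loss
```

The cost of this approach is size: every 7 input bytes become 8 output values,
which is the price of staying inside 7 bits without discarding information.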
In recent years, I was involved in designing a database replication solution.
Unfortunately, it was driven by narrow-minded visionaries who did not understand
data to include everything NIH describes as "data". Sharing data
is more than posting a file of "files" on an FTP server - the data must be mined
(analyzed and organized). This is another point so many "data replication" experts
fail to understand. Likewise, the introduction of XML does not mean you fixed an
ongoing data mapping problem - it simply means you have introduced yet another
problem to be solved (YAP2BS).
Common Problems with "data"
< data source >
|
+- Type
+- LCD (lowest common denominator)
+- Age
+- Priority
+- Keywords
+- Self-Hash
+- Copy or Link
+- Ownership
+- Read-only
+- Original Source
+- Audit-trail
Before data can be analyzed and organized, it must be understood - the brief
listing above shows common problems with RAW data. As a developer you must think
outside of the database and more about the task at hand - sharing "data". So, what
Type of data is this piece of information? Image, text, PDF, spreadsheet,
written document, etc. Based upon the answer to this question you open up data
sharing to a wide range of software products and companies. For example, data
replication for hospitals includes all of the above formats along with proprietary
structures and even voice recordings. For this item/element to be of use, one
must find a library or company that provides an API for handling data of said type.
This also means there are a lot of programs that need to be written - data
replication is just starting to get traction, yet it will be years before the
data potential can be grasped by the companies that pioneered this industry, primarily
due to their size and their poor understanding that what they currently market is like
what Yahoo! was in the mid 90's in comparison to what Google is now and where it
is going.
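The Type question above can be sketched as a small dispatch table: classify the
item, then look for a handler that knows that type. This is a minimal sketch that
assumes file names are available; the names HANDLERS, handler, classify and
dispatch are hypothetical, and the handlers only return a description where a real
one would call a library or vendor API.

```python
import mimetypes
from typing import Callable, Dict

# Hypothetical registry mapping a broad data Type to a handler function.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def handler(kind: str):
    """Decorator that registers a function as the handler for one Type."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[kind] = func
        return func
    return register


@handler("image")
def handle_image(path: str) -> str:
    return f"image handler would process {path}"


@handler("text")
def handle_text(path: str) -> str:
    return f"text handler would index {path}"


@handler("pdf")
def handle_pdf(path: str) -> str:
    return f"PDF handler would extract text from {path}"


def classify(path: str) -> str:
    """Guess a broad Type from the file name using the standard MIME table."""
    mime, _encoding = mimetypes.guess_type(path)
    if mime is None:
        return "unknown"
    if mime == "application/pdf":
        return "pdf"
    return mime.split("/", 1)[0]  # e.g. image, text, audio, video


def dispatch(path: str) -> str:
    """Route an item to its handler, or report that none exists yet."""
    kind = classify(path)
    func = HANDLERS.get(kind)
    if func is None:
        return f"no handler registered for {kind!r}: {path}"
    return func(path)


for name in ("scan.png", "notes.txt", "report.pdf", "chart.hospfmt"):
    print(dispatch(name))
```

The last file name stands in for a proprietary structure: it falls through to the
"no handler" branch, which is exactly where the programs that still need to be
written would plug in.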
Another concern with the "data" is finding a lowest common denominator
and capitalizing on it. As with voice recognition, OCR and image recognition, all
the sci-fi of the 1990's to now - the LCD in human terms is text. Age and Priority
come to mind when working with law enforcement and health care data. Gathering or
generating Keywords and a Self-Hash dictionary is extremely critical, which is
another area where LCD comes into play - you document an item/element as a
RED Ford Sports car, while someone else looks for this exact item/element
by entering words from a witness such as RED Mustang. RED is the only word the two
have in common, which is a very low match once the other words are compared.
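The RED Ford Sports car versus RED Mustang mismatch can be made concrete with a
small keyword comparison. The sketch below assumes a hand-made concept dictionary
(IMPLIES, a hypothetical name with illustrative entries) that expands a specific
term into the broader keywords it implies, and it scores the match with a simple
set overlap. It is one possible approach, not a description of any existing product.

```python
# Hypothetical concept dictionary: a specific term implies broader keywords.
# A real system would need a curated, much larger vocabulary.
IMPLIES = {
    "mustang": {"ford", "sports", "car"},
    "corvette": {"chevrolet", "sports", "car"},
}


def keywords(text: str, expand: bool = True) -> set:
    """Lower-case the words and optionally add the keywords they imply."""
    words = {word.strip(".,").lower() for word in text.split()}
    result = set(words)
    if expand:
        for word in words:
            result |= IMPLIES.get(word, set())
    return result


def overlap(a: str, b: str, expand: bool = True) -> float:
    """Shared keywords divided by all keywords (Jaccard similarity)."""
    ka, kb = keywords(a, expand), keywords(b, expand)
    if not ka and not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


recorded = "RED Ford Sports car"
witness = "RED Mustang"
print(overlap(recorded, witness, expand=False))  # 0.2, only "red" is shared
print(overlap(recorded, witness, expand=True))   # 0.8 after expansion
```

Without the dictionary only one of five distinct words matches; with it, four of
five do, which is the kind of lowest common denominator the keywords need.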
Then you get into ownership, and a copy of the data versus linked or pointer data. Is
the data read-only or can it be revised, and if a revision occurs does it flow
back to the originator or only as a tag? Then for security, where is the originating
item/element and how did it make it to the current state in the search results?
Then introduce NPI (Non-Public Information), PCI and other certifications where
you must protect the data. Do you see how this can be exciting, especially in the
current economic state? Opportunities just opened because a nation's leader is starting
to grasp the concept of data sharing for technology, education and science... or as
I call it, Finances-Research-Education-and-Everything (FREE). 'Free' Data Sharing
has major hurdles, and it needs programmers who understand that "together" we can all
contribute to data sharing.
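The ownership, copy-or-link, read-only and audit-trail questions above can be
sketched as a single record. Everything here is an assumption made for
illustration: the class name SharedItem, its fields, and the choice of SHA-256 for
the self-hash are hypothetical, not a published design.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedItem:
    payload: bytes
    owner: str
    origin: str                      # where the item was first recorded
    read_only: bool = False
    link_to: Optional[str] = None    # set when this is a link, not a copy
    audit: List[str] = field(default_factory=list)

    @property
    def self_hash(self) -> str:
        """Hash of the current content, usable for lookup and verification."""
        return hashlib.sha256(self.payload).hexdigest()

    def revise(self, new_payload: bytes, who: str) -> None:
        """Change the content and record who changed it and how."""
        if self.read_only:
            self.audit.append(f"{who}: revision refused (read-only)")
            raise PermissionError("item is read-only")
        before = self.self_hash
        self.payload = new_payload
        self.audit.append(f"{who}: {before[:8]} -> {self.self_hash[:8]}")

    def copy_for(self, new_owner: str) -> "SharedItem":
        """Make an independent copy that remembers the original source."""
        self.audit.append(f"copied to {new_owner}")
        return SharedItem(
            payload=self.payload,
            owner=new_owner,
            origin=self.origin,
            audit=[f"copied from {self.owner}"],
        )


record = SharedItem(b"blood type: O+", owner="clinic", origin="clinic intake")
shared = record.copy_for("hospital")
shared.revise(b"blood type: O-", who="hospital lab")
print(record.self_hash == shared.self_hash)  # False: the copy has diverged
print(shared.audit)
```

In this sketch a revision to the copy does not flow back to the originator; whether
it should, or should only arrive as a tag, is the design question raised above.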
This year I have worked on a project code named "Live Academia", which is a powerful
search engine for academic data. The data is collected from individual
high school students, verified by their school counselors, and searchable by both
counselors and colleges. On the other side of the coin, all of the colleges share
their academic schedules, demographics, scholarship opportunities, etc., which all
of the high school students can search, while the colleges find candidates to solicit.
This project allows me to develop a web site which operates like a powerful desktop
application, with potentially millions of records, all linking to thousands of granular
pieces of information. This data was not being shared prior to this project, or not on
the scale it is now.
The largest problem with data sharing prior to this act was "data withholding". For
the past 5 years I have dealt first hand with people who will gladly search the
data in the search engine, but refuse to contribute their data. With this act promoting
a new type of thinking, "to facilitate and optimize the open exchange of data and
research between agencies", people like myself all have the opportunity to design
the future of data sharing. See my other research on this topic, where I explain
different search techniques, optimizing data for searches, techniques to present
"did you mean ...." suggestions, and how to migrate and propagate data.
G.E. Ozz Nixon Jr.